LinuxQuestions.org

maddes.b 02-22-2013 05:18 AM

sed/awk: How do I split a text stream into multiple lines with varying lengths?
 
sed/awk: How do I split a text stream (one line) into multiple lines with different/varying lengths?

I have a file that contains a single line (a text stream) made up of multiple records concatenated together.
Each record holds its full length, including the length field itself, in its first 5 chars.
I want to add a line break after each record so that every record ends up on its own line.

Example:
Code:

00009test00011stream00006X000210123456789ABCDEF00005

Wanted result:
Code:

00009test              ( 9 chars: 5+4)
00011stream            (11 chars: 5+6)
00006X                 ( 6 chars: 5+1)
000210123456789ABCDEF  (21 chars: 5+16)
00005                  ( 5 chars: 5+0)

My idea for sed was...
#1 get the first 5 chars as value
#2 backtrack to start of pattern
#3 print chars in the specified length plus newline
#4 start over at step #1, until nothing found

I have little experience with sed.
I tried to find a way to do #2 and #3 in the man pages and on the internet, but couldn't work it out.
Maybe awk would be a better option for this.

Any help is greatly appreciated.
Maddes

druuna 02-22-2013 06:27 AM

First of all: Your example is not consistent:
Code:

00009test00011stream00006X0002101234567890ABCDEF00005

000210123456789ABCDEF  (21 chars: 5+16) a missing zero

Should be: 0002201234567890ABCDEF  (22 chars: 5+17)

Have a look at this:
Code:

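#!/usr/bin/awk -f

BEGIN{
A = "00009test00011stream00006X0002201234567890ABCDEF00005"

LENGTH=length(A)

while ( LENGTH >= 5 )
  {
    B = substr(A,1,5) + 0      # record length from the 5-digit prefix, forced numeric
    if ( B < 5 ) break         # a malformed length would otherwise loop forever
    C = substr(A,1,B)          # the whole record, prefix included
    print C
    A = substr(A,B+1)          # drop the record just printed
    LENGTH=length(A)
  }
}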

Sample run:
Code:

$ ./blaat.awk
00009test
00011stream
00006X
0002201234567890ABCDEF
00005
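
The script above hard-codes the stream in BEGIN. As a rough sketch of the same loop reading the stream from standard input instead, so it can be pointed at the file: the function name split_records is made up for illustration, and the stream is assumed to sit on a single line with no embedded newline.

```shell
#!/bin/sh
# split_records: hypothetical helper name; reads a one-line stream on stdin
# and prints one length-prefixed record per line.
split_records() {
  awk '{
    s = $0
    while (length(s) >= 5) {
      n = substr(s, 1, 5) + 0      # 5-digit prefix = total record length
      if (n < 5) break             # malformed prefix: stop rather than loop forever
      print substr(s, 1, n)        # record, prefix included
      s = substr(s, n + 1)         # continue with the remainder
    }
  }'
}

printf '%s\n' '00009test00011stream00006X0002201234567890ABCDEF00005' | split_records
```

With the stream stored in a file, the call would be: split_records < file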


whizje 02-22-2013 07:03 AM

Code:

sed 's/\([0-9]*\)\([a-zA-Z]*\)/\1\2\
/g' file        #insert literal return (escaped with a backslash)
or
grep -Eo "[0-9]*[a-zA-Z]*" file

At the moment I am not able to test it

colucix 02-22-2013 07:20 AM

A pure shell solution:
Code:

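#!/bin/bash
#
line=$(<file)
while (( ${#line} > 0 ))
do
  len=${line:0:5}
  len=$((10#$len))          # force base 10; deleting every 0 would turn 00010 into 1
  (( len >= 5 )) || break   # malformed length: stop instead of looping forever
  echo "${line:0:$len}"
  line=${line:$len}
done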
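
A note on turning the zero-padded prefix into a number: removing every 0 also removes significant ones (00010 would become 1), and feeding the padded string straight into shell arithmetic risks it being read as octal. Below is a minimal sketch that strips only the leading zeros with POSIX parameter expansion; the helper name to_len is made up for illustration.

```shell
#!/bin/sh
# to_len: hypothetical helper; converts a zero-padded decimal field to a plain number.
to_len() {
  padded=$1
  zeros=${padded%%[!0]*}        # the run of leading zeros (whole string if all zeros)
  digits=${padded#"$zeros"}     # everything after that run
  printf '%s\n' "${digits:-0}"  # an all-zero field yields 0
}

to_len 00009   # 9
to_len 00010   # 10, not 1
to_len 10203   # 10203
```

In bash specifically, $((10#$len)) forces base 10 and should give the same result for purely numeric input.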


druuna 02-22-2013 08:23 AM

Quote:

Originally Posted by whizje (Post 4897415)
Code:

sed 's/\([0-9]*\)\([a-zA-Z]*\)/\1\2\
/g' file        #insert literal return (escaped with a backslash)
or
grep -Eo "[0-9]*[a-zA-Z]*" file

At the moment I am not able to test it

This works for the example given, but not if the string looked like this, for example: 00009123400011stream00064

The OP doesn't specifically mention this scenario, but this part hints that it might occur: 000210123456789ABCDEF
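
To make the difference concrete, here is a small sketch comparing the character-class match with a length-driven split. The input string is a made-up variation of the one above (shortened so that every record is complete), and both function names are hypothetical.

```shell
#!/bin/sh
# by_pattern: the character-class approach (digits followed by letters).
by_pattern() {
  grep -Eo '[0-9]*[a-zA-Z]*'
}

# by_length: trusts only the 5-digit length prefix of each record.
by_length() {
  awk '{
    s = $0
    while (length(s) >= 5) {
      n = substr(s, 1, 5) + 0
      if (n < 5) break
      print substr(s, 1, n)
      s = substr(s, n + 1)
    }
  }'
}

stream='00009123400011stream00005'   # records: 000091234, 00011stream, 00005

printf '%s\n' "$stream" | by_pattern  # digit payload runs into the next prefix
printf '%s\n' "$stream" | by_length   # one record per line
```

The pattern cannot tell where a run of payload digits ends and the next length prefix begins, which is why only the length field can be relied on here.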

@colucix: I'm too awk minded ;) I should really spend more time with bash internals.....

maddes.b 02-22-2013 08:34 AM

@colucix:
I do not have bash available in all places.

@druuna:
Thanks a lot, that got me really going in awk. Seems that awk is much closer to real programming than sed.
I adapted this to the following:
Code:

#!/usr/bin/awk -f

BEGIN {
  # split stream into awk records/lines on null-byte (if not valid in stream)
  # only needed when line feed is valid in stream
  # see http://www.gnu.org/software/gawk/manual/html_node/Records.html
  #RS = "\0"
}

{
  # define the fixed length of the first field of each data record, which holds the complete record length
  LENGTHCHARS = 5

  # assign stream to variable for easier usage, get length as additional info
  STREAM = $0
  LENGTHSTREAM = length(STREAM)

  # init processing of stream: all chars to go and start at the beginning
  LENGTHTOGO = LENGTHSTREAM
  OFFSET = 0

  # check if enough chars left for the first field (data record length)
  while ( LENGTHTOGO >= LENGTHCHARS ) {
    # get the length of the data record from the first field, add zero to make it a number
    # see http://www.gnu.org/software/gawk/manual/html_node/Conversion.html
    LENGTHRECORD = substr(STREAM, OFFSET+1, LENGTHCHARS) + 0

    # leave while-loop if ill-formatted data record length
    if ( LENGTHRECORD < LENGTHCHARS ) {
      print "ERROR: Ill-formatted record length in \"line\" " FNR " at offset " OFFSET " of " FILENAME
      SUFFIX = ""
      if ( LENGTHTOGO > 20 ) {
        LENGTHTOGO = 20
        SUFFIX = "..."
      }
      print "       " substr(STREAM, OFFSET+1, LENGTHTOGO) SUFFIX
      next  # skip to next awk record/line
    }

    # leave while-loop if data record not completely in stream
    if ( LENGTHRECORD > LENGTHTOGO ) {
      print "ERROR: Incomplete record in \"line\" " FNR " at offset " OFFSET " of " FILENAME
      print "       " substr(STREAM, OFFSET+1)
      next  # skip to next awk record/line
    }

    # extract data record from stream and print to STDOUT
    RECORD = substr(STREAM, OFFSET+1, LENGTHRECORD)
    print RECORD

    # skip to next data record in stream
    OFFSET += LENGTHRECORD
    LENGTHTOGO -= LENGTHRECORD
  }

  # check if any chars are unexpectedly left over
  if ( LENGTHTOGO ) {
    print "ERROR: Ill-formatted record at end of \"line\" " FNR " at offset " OFFSET " of " FILENAME
    print "       " substr(STREAM, OFFSET+1)
  }
}


whizje 02-22-2013 08:37 AM

If the key always starts with 4 zeros it could be easily adapted, but the OP isn't very clear about that. He presented his solution instead of a clear description of the problem.
Never mind, now I get it.
@OP: is there a max length?

whizje 02-22-2013 08:50 AM

@colucix If the stream is very long, awk would definitely be the preferred solution. @Druuna I meant. I am not quite awake.

maddes.b 02-22-2013 09:30 AM

Quote:

Originally Posted by whizje (Post 4897479)
If the key always starts with 4 zeros it could be easily adapted, but the OP isn't very clear about that. He presented his solution instead of a clear description of the problem.
Never mind, now I get it.
@OP: is there a max length?

It's not a key; it specifies the record length in decimal, including the length field itself.
So with 5 chars the maximum length is 99999: 5 chars for the length field and 99994 for record data.
Record data can contain anything (digits, chars, hex values).
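
Read the other way round, that description also says how such a stream is built. Here is a small sketch that produces a record from its data, assuming the data is plain single-byte text; the function name make_record is hypothetical.

```shell
#!/bin/sh
# make_record: hypothetical helper; prefixes data with its total length
# (5-digit zero-padded decimal, counting the 5 prefix chars themselves).
make_record() {
  data=$1
  total=$(( ${#data} + 5 ))
  if [ "$total" -gt 99999 ]; then
    echo "record too long: $total" >&2
    return 1
  fi
  printf '%05d%s' "$total" "$data"
}

# Builds a stream of four records, the last one with empty data.
make_record test; make_record stream; make_record X; make_record ''
echo
```

An empty data part still yields a valid record, 00005, which matches the last record in the original example.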

As awk code looks a lot like BASIC, it didn't need much explanation - at least for me.

maddes.b 02-22-2013 10:39 AM

For really huge files you may have to use GNU awk.

For example, the default awk on Solaris failed on a file containing a 342 KB text stream.
Fortunately GNU awk is available in /opt on all our servers.

Thanks again to all for their help
Maddes

colucix 02-22-2013 03:46 PM

Quote:

Originally Posted by maddes.b (Post 4897477)
@colucix:I do not have bash available in all places.

Here is a Bourne Shell version (just for fun)...
Code:

#!/bin/sh
read line < file
end=`expr length "$line"`
while [ "$line" ]
do
  len=`expr substr "$line" 1 5`            # 5-digit length prefix
  expr substr "$line" 1 "$len"             # print the record
  pos=`expr "$len" + 1`
  line=`expr substr "$line" "$pos" "$end"` # keep the remainder
done

and an even older Bourne Shell version (without the Berkeley extensions expr substr and expr length)...
Code:

#!/bin/sh
read line < file
while [ "$line" ]
do
  len=`expr "$line" : '\(.....\)'`
  sub=`expr "$line" : "\(.\{$len\}\)"`
  echo "$sub"
  # skip by length, so record data is never interpreted as a regex
  line=`expr "$line" : ".\{$len\}\(.*\)"`
done
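
For readers unfamiliar with it: the colon operator of expr used above matches a regular expression anchored at the start of the string. It prints the text captured by \( \) if there is a group, otherwise the number of characters matched. A short illustration on a made-up sample string:

```shell
#!/bin/sh
sample='00009test00011stream'

# With a group: prints what the group captured.
expr "$sample" : '\(.....\)'         # first five characters: 00009

# Without a group: prints the length of the match,
# which can stand in for "expr length".
expr "$sample" : '.*'                # 20

# Skip a fixed number of characters and keep the rest.
expr "$sample" : '.\{9\}\(.*\)'      # 00011stream
```

Note that expr returns a non-zero exit status when the result is empty or 0, which matters only if the script runs under set -e.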


maddes.b 02-22-2013 05:10 PM

@colucix: Thanks a lot.


All times are GMT -5.