LinuxQuestions.org
Old 02-22-2013, 04:18 AM   #1
maddes.b
LQ Newbie
 
Registered: Aug 2009
Location: Germany
Distribution: Debian, OpenWrt
Posts: 23

Rep: Reputation: 1
sed/awk: How do I split a text stream into multiple lines with varying lengths?

I have a file that contains a single line (a text stream) made up of multiple records.
Each record stores its full length, including the length field itself, in its first 5 chars.
Now I want to add a line break after each record to separate them and get multiple lines.

Example:
Code:
00009test00011stream00006X000210123456789ABCDEF00005
Wanted result:
Code:
00009test              ( 9 chars: 5+4)
00011stream            (11 chars: 5+6)
00006X                 ( 6 chars: 5+1)
000210123456789ABCDEF  (21 chars: 5+16)
00005                  ( 5 chars: 5+0)
My idea for sed was...
#1 get the first 5 chars as value
#2 backtrack to start of pattern
#3 print chars in the specified length plus newline
#4 start over at step #1, until nothing found

I have little experience with sed.
Tried to find a solution for #2 and #3 in the man pages and on the internet, but didn't get it.
Maybe awk could be a better option to achieve this.

Any help is greatly appreciated.
Maddes

Last edited by maddes.b; 02-22-2013 at 05:38 AM. Reason: Fix typo in example. Thanks to druuna.
 
Old 02-22-2013, 05:27 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
First of all: your example is not consistent. The stream is:
Code:
00009test00011stream00006X0002101234567890ABCDEF00005

000210123456789ABCDEF  (21 chars: 5+16) a zero is missing after the 9

Should be: 0002201234567890ABCDEF  (22 chars: 5+17)
Have a look at this:
Code:
#!/usr/bin/awk -f

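BEGIN{
# the stream, with the length of the fourth record corrected to 22
A = "00009test00011stream00006X0002201234567890ABCDEF00005"

LENGTH=length(A)

# each record starts with a 5-digit field holding its total length
while ( LENGTH >= 5 )
  {
    B = substr(A,1,5) + 0   # record length, forced to a number
    if ( B < 5 ) break      # malformed length field: avoid looping forever
    C = substr(A,1,B)       # the complete record, length field included
    print C
    A = substr(A,B+1)       # drop the record from the front of the stream
    LENGTH=length(A)
  }
}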
Sample run:
Code:
$ ./blaat.awk
00009test
00011stream
00006X
0002201234567890ABCDEF
00005
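
A minimal sketch of the same loop reading the stream from standard input instead of a hard-coded string. The function name `split_stream` is made up for illustration, and the check for a length below 5 is an added assumption about how malformed input should be handled.

```shell
# split_stream: hypothetical helper; expects one stream per input line on stdin
split_stream() {
  awk '{
    s = $0
    while (length(s) >= 5) {
      n = substr(s, 1, 5) + 0   # total record length, including the 5-char field
      if (n < 5) break          # malformed length field: stop instead of looping forever
      print substr(s, 1, n)     # the whole record, length field included
      s = substr(s, n + 1)      # continue with the rest of the stream
    }
  }'
}

printf '%s\n' '00009test00011stream00006X000210123456789ABCDEF00005' | split_stream
```

Called as `split_stream < file`, it prints one record per line.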
 
2 members found this post helpful.
Old 02-22-2013, 06:03 AM   #3
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 582

Rep: Reputation: 129
Code:
sed 's/\([0-9]*\)\([a-zA-Z]*\)/\1\2\
/g' file        #insert literal return
or
grep -Eo "[0-9]*[a-zA-Z]*" file
I am not able to test this at the moment.

Last edited by whizje; 02-22-2013 at 06:06 AM. Reason: typo
 
Old 02-22-2013, 06:20 AM   #4
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941
A pure shell solution:
Code:
#!/bin/bash
#
line=$(<file)
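while [[ -n $line ]]
do
  len=${line:0:5}
  [[ $len =~ ^[0-9]{5}$ ]] || break   # stop on a malformed length field
  len=$((10#$len))                    # read as base 10; ${len//0/} would turn 00010 into 1
  (( len >= 5 )) || break             # a record is at least as long as its length field
  printf '%s\n' "${line:0:len}"
  line=${line:len}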
done
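
The length field is zero-padded, and shell arithmetic treats a number with a leading zero as octal: `$((00011))` evaluates to 9, and `$((00009))` is an error. Below is a small sketch of converting the field to a plain decimal number with parameter expansion only; the name `to_decimal` is hypothetical.

```shell
# to_decimal: hypothetical helper; strips the zero padding from a length field
to_decimal() {
  zeros=${1%%[!0]*}     # the run of leading zeros (the whole field if it is all zeros)
  n=${1#"$zeros"}       # the field without that run
  printf '%s\n' "${n:-0}"
}

to_decimal 00009   # prints 9
to_decimal 00010   # prints 10: inner and trailing zeros are kept
```

In bash, `$((10#$len))` does the same job by forcing base 10.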
 
2 members found this post helpful.
Old 02-22-2013, 07:23 AM   #5
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Quote:
Originally Posted by whizje View Post
Code:
sed 's/\([0-9]*\)\([a-zA-Z]*\)/\1\2\
/g' file        #insert literal return
or
grep -Eo "[0-9]*[a-zA-Z]*" file
I am not able to test this at the moment.
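This works for the example given, but not if the string looks, for example, like this: 00009123400011stream00064
Here the data of the first record (1234) is numeric, so the regex cannot tell where it ends and the next length field (00011) begins.

The OP doesn't specifically mention this scenario, but this part hints that it might occur: 000210123456789ABCDEF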
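
The difference can be seen by running the grep variant on both streams. This is a sketch, assuming a grep that supports the `-o` option, such as GNU grep. For the second stream, a correct split would start with 000091234 and 00011stream.

```shell
ok='00009test00011stream00006X000210123456789ABCDEF00005'
bad='00009123400011stream00064'

# "digits followed by letters" happens to coincide with the record layout here
printf '%s\n' "$ok" | grep -Eo '[0-9]*[a-zA-Z]*'

# here the numeric data 1234 and the next length field 00011 merge into one run of digits
printf '%s\n' "$bad" | grep -Eo '[0-9]*[a-zA-Z]*'
```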

@colucix: I'm too awk-minded; I should really spend more time with bash internals.....
 
Old 02-22-2013, 07:34 AM   #6
maddes.b
LQ Newbie
 
Registered: Aug 2009
Location: Germany
Distribution: Debian, OpenWrt
Posts: 23

Original Poster
Rep: Reputation: 1

@colucix:
I do not have bash available in all places.

@druuna:
Thanks a lot, that got me really going in awk. Seems that awk is much closer to real programming than sed.
I adapted this to the following:
Code:
#!/usr/bin/awk -f

BEGIN {
  # split stream into awk records/lines on null-byte (if not valid in stream)
  # only needed when line feed is valid in stream
  # see http://www.gnu.org/software/gawk/manual/html_node/Records.html
  #RS = "\0"
}

{
  # define the fixed length of the first field of each data record, which holds the complete record length
  LENGTHCHARS = 5

  # assign stream to variable for easier usage, get length as additional info
  STREAM = $0
  LENGTHSTREAM = length(STREAM)

  # init processing of stream: all chars to go and start at the beginning
  LENGTHTOGO = LENGTHSTREAM
  OFFSET = 0

  # check if enough chars left for the first field (data record length)
  while ( LENGTHTOGO >= LENGTHCHARS ) {
    # get the length of the data record from the first field, add zero to make it a number
    # see http://www.gnu.org/software/gawk/manual/html_node/Conversion.html
    LENGTHRECORD = substr(STREAM, OFFSET+1, LENGTHCHARS) + 0

    # leave while-loop if ill-formatted data record length
    if ( LENGTHRECORD < LENGTHCHARS ) {
      print "ERROR: Ill-formatted record length in \"line\" " FNR " at offset " OFFSET " of " FILENAME
      SUFFIX = ""
      if ( LENGTHTOGO > 20 ) {
        LENGTHTOGO = 20
        SUFFIX = "..."
      }
      print "       " substr(STREAM, OFFSET+1, LENGTHTOGO) SUFFIX
      next  # skip to next awk record/line
    }

    # leave while-loop if data record not completely in stream
    if ( LENGTHRECORD > LENGTHTOGO ) {
      print "ERROR: Incomplete record in \"line\" " FNR " at offset " OFFSET " of " FILENAME
      print "       " substr(STREAM, OFFSET+1)
      next  # skip to next awk record/line
    }

    # extract data record from stream and print to STDOUT
    RECORD = substr(STREAM, OFFSET+1, LENGTHRECORD)
    print RECORD

    # skip to next data record in stream
    OFFSET += LENGTHRECORD
    LENGTHTOGO -= LENGTHRECORD
  }

  # check if any chars are unexpectedly left over
  if ( LENGTHTOGO ) {
    print "ERROR: Ill-formatted record at end of \"line\" " FNR " at offset " OFFSET " of " FILENAME
    print "       " substr(STREAM, OFFSET+1)
  }
}


Last edited by maddes.b; 02-22-2013 at 10:22 AM. Reason: latest version of script, use php code formatting
 
Old 02-22-2013, 07:37 AM   #7
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 582

Rep: Reputation: 129
If the key always starts with 4 zeros, it could be easily adapted, but the OP isn't very clear about that. He presented his solution instead of a clear description of the problem.
Never mind, now I get it.
@OP: is there a max length?

Last edited by whizje; 02-22-2013 at 07:43 AM.
 
Old 02-22-2013, 07:50 AM   #8
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 582

Rep: Reputation: 129
@colucix If the stream is very long, awk would definitely be the preferred solution. @druuna, I meant. I am not quite awake.

Last edited by whizje; 02-22-2013 at 08:06 AM.
 
Old 02-22-2013, 08:30 AM   #9
maddes.b
LQ Newbie
 
Registered: Aug 2009
Location: Germany
Distribution: Debian, OpenWrt
Posts: 23

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by whizje View Post
If the key always starts with 4 zeros, it could be easily adapted, but the OP isn't very clear about that. He presented his solution instead of a clear description of the problem.
Never mind, now I get it.
@OP: is there a max length?
It's not a key; it specifies the record length in decimal, including the length field itself.
So the maximum length with 5 chars is 99999: 5 chars for the length field and 99994 for record data.
Record data can contain anything (digits, chars, hex values).
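
To make the format concrete, here is a small sketch that builds such a stream from plain data strings. The function name `make_stream` is hypothetical, and it assumes single-byte characters, so that the shell's `${#data}` equals the number of chars in the record data.

```shell
# make_stream: hypothetical helper; prints each argument as one length-prefixed record
make_stream() {
  for data in "$@"; do
    # total length = 5 chars for the length field + length of the data
    printf '%05d%s' "$(( ${#data} + 5 ))" "$data"
  done
  printf '\n'
}

# an empty string yields a record that consists of the length field only (00005)
make_stream test stream X 0123456789ABCDEF ''
```

With these arguments it reproduces the example stream from the first post.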

As awk code looks like BASIC, it didn't need much explanation - at least for me.

Last edited by maddes.b; 02-22-2013 at 10:24 AM.
 
Old 02-22-2013, 09:39 AM   #10
maddes.b
LQ Newbie
 
Registered: Aug 2009
Location: Germany
Distribution: Debian, OpenWrt
Posts: 23

Original Poster
Rep: Reputation: 1
For really huge streams you may need GNU awk.

For example, the default awk of Solaris failed on a file with a 342 KB text stream.
Fortunately GNU awk is available in /opt on all our servers.

Thanks again to all for their help
Maddes
 
Old 02-22-2013, 02:46 PM   #11
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941
Quote:
Originally Posted by maddes.b View Post
@colucix:I do not have bash available in all places.
Here is a Bourne Shell version (just for fun)...
Code:
#!/bin/sh
read line < file
end=`expr length "$line"`
while [ "$line" ]
do
  len=`expr substr "$line" 1 5`
  expr substr "$line" 1 "$len"
  pos=`expr "$len" + 1`
  line=`expr substr "$line" "$pos" "$end"`
done
and an even older Bourne Shell version (without the Berkeley extensions expr substr and expr length)...
Code:
#!/bin/sh
read line < file
while [ "$line" ]
do
  len=`expr "$line" : '\(.....\)'`
  sub=`expr "$line" : "\(.\{$len\}\)"`
  echo "$sub"
  line=`expr "$line" : ".\{$len\}\(.*\)"`
done
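
A further sketch that avoids `expr` altogether and uses only POSIX parameter expansion, `printf` and `tr`, so record data containing regex characters such as `.` or `*` is treated literally. The name `split_posix` is hypothetical, and returning a non-zero status on malformed input is an assumption, not something specified in this thread.

```shell
# split_posix: hypothetical helper; splits the stream given as first argument
split_posix() {
  line=$1
  while [ "${#line}" -ge 5 ]; do
    rest=${line#?????}                  # everything after the 5-char length field
    len=${line%"$rest"}                 # the length field itself, e.g. 00011
    n=${len#"${len%%[!0]*}"}            # strip leading zeros (avoids octal arithmetic)
    case $n in
      ''|*[!0-9]*) return 1 ;;          # all zeros or not a number: malformed
    esac
    # a record must cover its own length field and fit into what is left
    [ "$n" -ge 5 ] && [ "$n" -le "${#line}" ] || return 1
    pat=$(printf "%${n}s" '' | tr ' ' '?')   # n question marks: matches any n chars
    rest=${line#$pat}
    printf '%s\n' "${line%"$rest"}"     # the complete record
    line=$rest
  done
  [ -z "$line" ]                        # fails if stray chars are left over
}

split_posix '00009test00011stream00006X000210123456789ABCDEF00005'
```

To process a file, it can be called as `split_posix "$(cat file)"`.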

Last edited by colucix; 02-22-2013 at 04:13 PM.
 
1 members found this post helpful.
Old 02-22-2013, 04:10 PM   #12
maddes.b
LQ Newbie
 
Registered: Aug 2009
Location: Germany
Distribution: Debian, OpenWrt
Posts: 23

Original Poster
Rep: Reputation: 1
@colucix: Thanks a lot.
 
  

