[SOLVED] Using sed to search and stop at a blank line

danielbmartin · 03-16-2013, 11:33 AM

Quote:

Originally Posted by grail

Code:

awk '{print > "file"++i}' RS="" infile

Remarkably concise, but I don't understand how it works. Please walk us through it.

Daniel B. Martin

grail · 03-16-2013, 12:53 PM

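RS="" - Set the record separator to the empty string, which puts awk in paragraph mode: records are separated by one or more empty lines

print > "file"++i - print the current record (i.e. everything up to the empty line) into a file called "fileN", where N is 1, 2, 3, etc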
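
For anyone who wants to watch it work, here is a minimal sketch on a throwaway sample. The scratch directory and the sample data are made up for illustration. The parentheses around the file name are my addition; some awk versions are fussy about an unparenthesized concatenation after ">".

```shell
# Sketch: split a small sample into file1, file2, ... inside a scratch directory.
workdir=$(mktemp -d)   # hypothetical scratch location
cd "$workdir" || exit 1

# Two paragraphs separated by one truly empty line.
printf 'able\nchoice1-1\n\nbaker\nchoice2-1\nchoice2-2\n' > infile

# RS="" selects paragraph mode: each block of lines between empty lines is one record.
# ++i increments the counter before it is appended to the name "file".
awk '{ print > ("file" ++i) }' RS="" infile

cat file1   # first paragraph
cat file2   # second paragraph
```

Each record is printed whole, embedded newlines included, so every output file keeps the line layout of its paragraph.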

danielbmartin · 03-16-2013, 01:47 PM

Quote:

Originally Posted by grail

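RS="" - Set the record separator to the empty string, which puts awk in paragraph mode: records are separated by one or more empty lines

print > "file"++i - print the current record (i.e. everything up to the empty line) into a file called "fileN", where N is 1, 2, 3, etc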

Thank you for this explanation. I now understand a distinction between record and line.

Now, a nitpick. Empty line could mean a null line, or it could mean a line containing only white space. When displayed on the screen both look alike. Your solution is short and sweet (I admire that) but it depends on empty line = null line.

Daniel B. Martin

grail · 03-17-2013, 10:51 AM

Quote:

Now, a nitpick. Empty line could mean a null line, or it could mean a line containing only white space. When displayed on the screen both look alike. Your solution is short and sweet (I admire that) but it depends on empty line = null line.

And I am sure by now you could easily convert this to allow for whitespace.
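
One possible way to do that conversion, offered as a sketch rather than the only answer: normalise whitespace-only lines to empty lines with sed before awk sees them. The scratch directory and sample data are invented for the example.

```shell
workdir=$(mktemp -d)   # hypothetical scratch location
cd "$workdir" || exit 1

# The separator line holds a space and a tab, so it looks blank but is not empty.
printf 'able\nchoice1-1\n \t\nbaker\nchoice2-1\n' > infile

# Without normalisation, paragraph mode is expected to see the whole file as one record.
awk -v RS="" 'END { print NR }' infile

# Turn whitespace-only lines into empty lines, then split exactly as before.
sed 's/^[[:space:]]*$//' infile | awk -v RS="" '{ print > ("file" ++i) }'
```

RS is set with -v here because awk reads from a pipe and there is no file operand for a trailing assignment to precede.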

danielbmartin · 03-17-2013, 08:31 PM

This is an interesting problem and, as a learning experience, I improved on previous solutions.

Instead of sequence numbers I used the first line in each "paragraph" as part of the output file names.

This InFile ...

Code:

able
choice1-1
choice1-2
choice1-3

baker
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

charlie
choice3-1
choice3-2

dog
choice4-1
choice4-2
choice4-3

... produces these four OutFiles ...
dbm686out.able

Code:

able
choice1-1
choice1-2
choice1-3

dbm686out.baker

Code:

baker
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

dbm686out.charlie

Code:

charlie
choice3-1
choice3-2

dbm686out.dog

Code:

dog
choice4-1
choice4-2
choice4-3

This code (using bash) does the job...

Code:

# File identification
   Path=$(readlink -f "$0" | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file includes the first line of each paragraph.
o=$(readlink -f "$0" | cut -d'.' -f1)"out"  #o = output file names
rm -f "$o".*  # Blow away any leftover output files (-f: no error if there are none).
ofid=""  # Initialize ofid, Output File IDentifier.
while read -r line  # read trims the line, so whitespace-only lines also count as empty
  do
    if [[ "$ofid" == "" ]]
      then ofid=$line
    fi
    if [[ "$line" == "" ]]
      then ofid=""
      else echo "$line" >> "$o.$ofid"
    fi
  done < "$InFile"

... and this code (based on grail's superb awk one-liner) is more concise...

Code:

# File identification
   Path=$(readlink -f "$0" | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file includes the first line of each paragraph.
o=$(readlink -f "$0" | cut -d'.' -f1)"out"  #o = output file names
awk -v o="$o" '{print > (o "." $1)}' RS="" "$InFile"

Suggestions and corrections are gratefully accepted.

Daniel B. Martin

danielbmartin · 03-17-2013, 08:37 PM

This is an interesting problem and, as a learning experience, I improved on previous solutions.

Instead of sequence numbers I used the first line in each "paragraph" as part of the output file names.

This InFile ...

Code:

able
choice1-1
choice1-2
choice1-3

baker
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

charlie
choice3-1
choice3-2

dog
choice4-1
choice4-2
choice4-3

... produces these four OutFiles ...
dbm690out.able

Code:

choice1-1
choice1-2
choice1-3

dbm690out.baker

Code:

choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

dbm690out.charlie

Code:

choice3-1
choice3-2

dbm690out.dog

Code:

choice4-1
choice4-2
choice4-3

This code (using bash) does the job...

Code:

# File identification
   Path=$(readlink -f "$0" | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file excludes the first line of each paragraph.
o=$(readlink -f "$0" | cut -d'.' -f1)"out"  #o = output file names
rm -f "$o".*  # Blow away any leftover output files (-f: no error if there are none).
ofid=""  # Initialize ofid, Output File IDentifier.
while read -r line  # read trims the line, so whitespace-only lines also count as empty
  do
    if [[ "$ofid" == "" ]];
      then ofid=$line;
    fi
    if [[ "$line" == "" ]];
      then ofid="";
    fi
    if [[ "$ofid" != "$line" ]];  # skips the first line and empty lines
      then echo "$line" >> "$o.$ofid"
    fi
  done < "$InFile"

... and this code (based on grail's superb awk one-liner) is more concise...

Code:

# File identification
   Path=$(readlink -f "$0" | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file excludes the first line of each paragraph.
o=$(readlink -f "$0" | cut -d'.' -f1)"out"  #o = output file names
awk -v o="$o" '{t=$1;$1="";sub(/^ /,"");gsub(" ","\n")} {print > (o "." t)}' RS="" "$InFile"

Suggestions and corrections are gratefully accepted.

Daniel B. Martin

grail · 03-18-2013, 08:41 AM

Might want to check the output files that are using the second awk solution. I think you will find that your data is not line for line, but now on a single line.

Example:

Instead of dbm690out.able being:

Code:

choice1-1
choice1-2
choice1-3

I believe it will look like:

Code:

choice1-1 choice1-2 choice1-3
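
The underlying reason, as far as I understand awk: assigning to any field makes awk rebuild the record by joining the fields with OFS, which is a single space by default, so the original newlines are lost. A small sketch with made-up sample data (the variable name joined is only for illustration):

```shell
# Assigning to $1 forces awk to rebuild $0, joining the fields with OFS (a space).
joined=$(printf 'able\nchoice1-1\nchoice1-2\nchoice1-3\n' |
  awk -v RS="" '{ $1 = ""; sub(/^ /, ""); print }')

# Show the rebuilt record: the newlines between the fields are gone.
printf '%s\n' "$joined"
```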

danielbmartin · 03-18-2013, 10:05 AM

Quote:

Originally Posted by grail

Might want to check the output files that are using the second awk solution. I think you will find that your data is not line for line, but now on a single line.

Recognition of a bug is the first step toward fixing the bug. The man who points out a flaw in my code is helping me. Thank you, grail.

I edited post #21 to show corrected code. It works but is unlovely. Is there a cleaner way?

Daniel B. Martin

grail · 03-18-2013, 01:19 PM

How about:

Code:

awk -v o="$o" '{t=$1;$1="";sub(/^\n/,"");print > (o "." t)}' RS="" OFS="\n" file

And just as a quickie, a ruby alternative:

Code:

ruby -ane 'BEGIN{$/=""};IO.write("name."+ $F[0],$F[1..-1]*"\n")' file
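
To see why the awk version keeps one item per line: with OFS set to a newline, blanking $1 rebuilds the record with newlines between the fields, and the sub() strips the one left at the front. A sketch, assuming single-word lines and a made-up output prefix and sample file:

```shell
workdir=$(mktemp -d)   # hypothetical scratch location
o="$workdir/out"       # hypothetical output prefix
printf 'able\nchoice1-1\nchoice1-2\n\nbaker\nchoice2-1\n' > "$workdir/inp.txt"

# t keeps the first line for the file name; blanking $1 rebuilds $0 using OFS="\n".
awk -v o="$o" '{ t = $1; $1 = ""; sub(/^\n/, ""); print > (o "." t) }' \
    RS="" OFS="\n" "$workdir/inp.txt"

cat "$o.able"    # the "able" paragraph without its first line
cat "$o.baker"   # the "baker" paragraph without its first line
```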

danielbmartin · 03-18-2013, 09:42 PM

Now, let's make the problem more challenging by permitting multi-word "choice" lines.

With this InFile ...

Code:

able
how now
brown cow

baker
now is the time
for all good men
to come to the aid
of their party

charlie
the quick brown fox
jumps over
the lazy programmer

dog
words to live by:
let sleeping dogs lie

...this bash code ...

Code:

# File identification
   Path=$(readlink -f "$0" | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file excludes the first line of each paragraph.
o=$(readlink -f "$0" | cut -d'.' -f1)"out"  #o = output file names
rm -f "$o".*  # Blow away any leftover output files (-f: no error if there are none).
ofid=""  # Initialize ofid, Output File IDentifier.
while read -r line  # read trims the line, so whitespace-only lines also count as empty
  do
    if [[ "$ofid" == "" ]];
      then ofid=$line;
    fi
    if [[ "$line" == "" ]];
      then ofid="";
    fi
    if [[ "$ofid" != "$line" ]];  # skips the first line and empty lines
      then echo "$line" >> "$o.$ofid"
    fi
  done < "$InFile"

# For debugging...
for file in "$o"*; do echo; echo "$file ..."; cat "$file"; done

... produces this result ...

Code:

/home/daniel/Desktop/LQfiles/dbm690out.able ...
how now
brown cow

/home/daniel/Desktop/LQfiles/dbm690out.baker ...
now is the time
for all good men
to come to the aid
of their party

/home/daniel/Desktop/LQfiles/dbm690out.charlie ...
the quick brown fox
jumps over
the lazy programmer

/home/daniel/Desktop/LQfiles/dbm690out.dog ...
words to live by:
let sleeping dogs lie

... but I'm unable to code an equivalent in awk. Anyone care to take a shot at it?

Daniel B. Martin

grail · 03-19-2013, 12:43 AM

My hint: have a look at the input field separator (FS).
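
For readers who want to check their own attempt, here is one way the hint could play out. It is a sketch under assumptions, not necessarily the solution grail had in mind: setting FS to a newline makes each line of a paragraph one field, so multi-word lines stay intact. The scratch directory, output prefix and shortened sample input are made up.

```shell
workdir=$(mktemp -d)   # hypothetical scratch location
o="$workdir/out"       # hypothetical output prefix
printf 'able\nhow now\nbrown cow\n\nbaker\nnow is the time\nfor all good men\n' \
  > "$workdir/inp.txt"

awk -v o="$o" '
  BEGIN { RS = ""; FS = "\n" }     # paragraph records, one field per line
  {
    out = o "." $1                 # the first line names the output file
    for (i = 2; i <= NF; i++)      # the remaining lines are written unchanged
      print $i > out
    close(out)                     # avoid running out of open files
  }
' "$workdir/inp.txt"

cat "$o.able"
cat "$o.baker"
```

One caveat: because each file is closed after its paragraph, two paragraphs with the same first line would overwrite each other. Using >> instead of > would append.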