LinuxQuestions.org

Tech109 03-14-2013 11:51 AM

Using sed to search and stop at a blank line
 
I'm using a sed statement within a bash shell script to search through a file and stop when it reaches a blank line.

The sed statement is working, but I'm having trouble understanding how. (I found it online somewhere).

Code:

"sed -n "\?$i?,\?^$\|pattern?p"
A few things to note:
1. Variable $i is coming from a while loop.
2. I'm using the "?" as a delimiter so sed doesn't choke on special characters which may be used in the strings it's searching.
3. I understand using the "," as a range, but am having trouble understanding why the "?" after "$i" would not be commented-out with a "\" and also why the "|pattern" text is there.
4. I understand using "-n" to suppress output, then using "p" to print only what is returned from sed.

If someone could help break this down for me, I would appreciate it.

Here is the original sed statement before I modified it:

Code:

sed -n "/$i/,/^$\|pattern/p"

grail 03-14-2013 12:16 PM

The final example makes sense but I am at a loss how your first would work?? My first issue would be the incorrect number of quotes and why the line would start with them?

Secondly, it is my understanding that sed only allows changing the delimiter for the substitute command, s/// ... so s??? could be used. On a quick test of a file here, it definitely does not work for me to have "?" as the delimiter for a range.

danielbmartin 03-14-2013 12:41 PM

Quote:

Originally Posted by Tech109 (Post 4911575)
I'm using a sed statement within a bash shell script to search through a file and stop when it reaches a blank line.

With this InFile ...
Code:

Line 1
Line 2
Line 3
 
Line 4
Line 5

Line 6
Line 7

... this code ...
Code:

sed '/^ *$/q' $InFile >$OutFile
... produced this OutFile ...
Code:

Line 1
Line 2
Line 3

Note this (possibly acceptable) defect: the output contains all lines up to and including the first blank line.

To eliminate that blank line ...
Code:

sed '/^ *$/q' $InFile |sed '$d' >$OutFile
Daniel B. Martin

David the H. 03-15-2013 09:32 AM

@grail. Yes, you can change the delimiter of the address regex if you prefix the first delimiting character with a backslash, as in this case (\?regex?). It's in the man page.

Now let's try breaking down the command, minus the delimiters (and assuming the first quote mark is just a typo):

Code:

sed -n
$i            #address 1
,
^$\|pattern  #address 2
p            #command

The first address is your "$i" variable, naturally.

The second address is a compound regex. "|" is the "or" (alternation) operator, enabled by prefixing it with a backslash because you're still in basic regex mode (the "\|" form is a GNU extension). If you used the "-r" option to enter extended regex mode, the backslash becomes unnecessary*.

So address 2 is either "pattern" or "^$", a blank line.

All told, it prints every line from the first line matching "$i" through either the first instance of "pattern" or the first blank line, whichever comes first.

*See the appropriate section of the grep man page for more details on basic vs. extended regex.
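
For anyone who wants to see the breakdown in action, here is a minimal sketch. The sample data and the temp file are made up for illustration; only the address syntax comes from the command above. The value of $i deliberately contains a "/" to show why switching the delimiter helps (a "?" inside $i would break this version in the same way).

```shell
#!/bin/bash
# Hypothetical sample data, laid out like the file described in this thread.
tmp=$(mktemp)
printf '%s\n' 'a/b' 'choice1' 'choice2' '' 'c/d' 'choice3' '' > "$tmp"

i='a/b'   # contains "/", so the default /regex/ delimiter would choke on it

# \?regex? : the leading backslash makes "?" the delimiter of each address.
# The range starts at the first line matching $i and ends at the next blank line.
result=$(sed -n "\?$i?,\?^\$?p" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The blank line that closes the range is printed too; it only disappears from $result because command substitution strips trailing newlines.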

Edit: @daniel, I really hate seeing multiple commands chained together when one can do the job. In this case replace "q" with "Q" and it will exit before printing the last line.
Code:

sed '/^ *$/Q'  $InFile
Unfortunately, this won't work if you need to start printing from any line other than the first though. Speaking of which, ed would do the job easily, thanks to its ability to designate relative line positions.

Code:

i=2
# ";" rather than "," makes the search for the blank line start at line $i
printf '%s\n' "$i;/^$/-1p" | ed -s infile.txt

How to use ed:
http://wiki.bash-hackers.org/howto/edit-ed
http://snap.nlc.dcccd.edu/learn/nlc/ed.html
(also read the info page)
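
If you would rather stay with sed, something along these lines should also start at an arbitrary line and stop before the next blank line. This is only a sketch: the sample file is invented, and i=2 is just the value used in the ed example.

```shell
#!/bin/bash
# Hypothetical sample file: three lines, a blank line, then one more line.
tmp=$(mktemp)
printf '%s\n' 'Line 1' 'Line 2' 'Line 3' '' 'Line 4' > "$tmp"

i=2
# For lines $i through the end of the file:
#   quit at the first blank line (with -n, q prints nothing),
#   otherwise print the line.
result=$(sed -n "${i},\$ { /^\$/q; p; }" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because of -n, plain "q" is enough here and the GNU-only "Q" is not needed.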

Tech109 03-15-2013 09:59 AM

Thanks everyone - yes, sorry, the first double-quote before sed is a typo.

Thanks to David H for breaking this down, makes more sense now.

The file I'm searching through is formatted like this:

"string1"
"choice1"
"choice2"
"choice3"

"string2"
"choice1"
"choice2"
"choice3"

So what I am doing is searching for each "stringx" and grabbing it, plus its following choices, down to the blank line, because that is where the list ends and the next string begins. Then for each string + choices found in the source file, I'm writing those to a new file. The actual source file can contain hundreds of entries like above.

danielbmartin 03-15-2013 10:04 AM

Quote:

Originally Posted by David the H. (Post 4912241)
Edit: @daniel, I really hate seeing multiple commands chained together when one can do the job. In this case replace "q" with "Q" and it will exit before printing the last line.
Code:

sed '/^ *$/Q'  $InFile

Perfect! Technical intuition suggested this could be done but I couldn't find the Q. Thank you!

Daniel B. Martin

grail 03-15-2013 10:22 AM

Thanks David ... hadn't seen that one before ... tick something new today :)

David the H. 03-15-2013 11:18 AM

Quote:

Originally Posted by Tech109 (Post 4912261)
So what I am doing is searching for each "stringx" and grabbing it, plus its following choices, down to the blank line, because that is where the list ends and the next string begins. Then for each string + choices found in the source file, I'm writing those to a new file. The actual source file can contain hundreds of entries like above.


Oh, well if that's what you want, consider using the csplit utility instead (it's part of the coreutils). It splits text into multiple files based on patterns or numbers of lines.

Code:

csplit -f "file-" -b "%03d.txt" -z infile.txt '/^$/' '{*}'
Read the info page for the full details on how it's used.

The only problem with the above is that the blank lines are still left in the new files. But a simple bit of post-processing with sed can remove those.

Code:

for fname in file*.txt; do sed -i '/^$/d' "$fname"; done

danielbmartin 03-15-2013 11:23 AM

Quote:

Originally Posted by Tech109 (Post 4912261)
So what I am doing is searching for each "stringx" and grabbing it, plus its following choices, down to the blank line, because that is where the list ends and the next string begins. Then for each string + choices found in the source file, I'm writing those to a new file. The actual source file can contain hundreds of entries like above.

With this InFile ...
Code:

string1
choice1-1
choice1-2
choice1-3

string2
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

string3
choice3-1
choice3-2

string4
choice4-1
choice4-2
choice4-3

... this code ...
Code:

#!/bin/bash
#  Daniel B. Martin  Mar13
#
#  To execute this program, launch a terminal session and enter:
#  bash /home/daniel/Desktop/LQfiles/dbm684.bin
#
#  This program inspired by:
#  http://www.linuxquestions.org/questions/programming-9/
#    using-sed-to-search-and-stop-at-a-blank-line-4175454081/

# File identification
  Path=$(cut -d'.' -f1 <<< "$0")
 InFile="${Path}inp.txt"

echo
echo "Method of LQ member danielbmartin #1"
# Blow away any leftover output files.
rm -f "$Path"out*.txt
k=1
# "read" trims leading/trailing whitespace, so a whitespace-only
# line also counts as a separator here.
while read -r line
  do
    if [[ "$line" == "" ]]; then k=$((k+1))
                            else echo "$line" >> "${Path}out$k.txt"
    fi
  done < "$InFile"


echo; echo "Normal end of job."; echo
exit

... produced four subset output files, as specified.

Daniel B. Martin

grail 03-15-2013 12:24 PM

Yeah not sure why this would have to be so difficult:
Code:

awk '{print > "file"++i}' RS="" infile
Based on Daniel's input file this yields 4 output files with the required data stored in them.

gnashley 03-15-2013 12:28 PM

Code:

while read line ; do
    case $line in
      '') exit ;;
        *) echo $line >> wherever ;;
    esac
  done < $InFile

A case statement is probably faster than using '[[' or 'test' builtins -and also faster than a pipe through sed (twice!) for small files. sed gives me headaches...

danielbmartin 03-15-2013 02:21 PM

Quote:

Originally Posted by gnashley (Post 4912405)
Code:

while read line ; do
    case $line in
      '') exit ;;
        *) echo $line >> wherever ;;
    esac
  done < $InFile


Did you test this? I don't see any code which modifies the "wherever." Without that, all output goes into the same file. I modified your code thusly ...
Code:

k=1
while read line ; do
    case $line in
      '') let k=$k+1 ;;
        *) echo $line >> $Path'out'$k'.txt' ;;
    esac
  done < $InFile

... and it works.

Daniel B. Martin

gnashley 03-16-2013 03:01 AM

I don't see why the output file name needs to be modified by a counter. I guess I'm missing something -I thought the idea was to "stop when it reaches a blank line". Either way, the case statement will be faster than [[ or test.

grail 03-16-2013 03:28 AM

@gnashley - your original idea was correct for the first post, but as of post #5 the OP has now asked that each part of the file be written to a separate file.

gnashley 03-16-2013 04:51 AM

Oops, I guess I've slept since then... No, wait, it seems to be raining in my hat!

danielbmartin 03-16-2013 11:33 AM

Quote:

Originally Posted by grail (Post 4912399)
Code:

awk '{print > "file"++i}' RS="" infile

Remarkably concise, but I don't understand how it works. Please walk us through it.

Daniel B. Martin

grail 03-16-2013 12:53 PM

RS="" - Set record separator to an empty line

print > "file"++i - print the current record (ie all up to the empty line) into a file called "fileN", where N is 1, 2, 3, etc
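
A small sketch of the same idea, for anyone who wants to try it in a scratch directory. The sample data, the temp directory and the variable names are made up; the parentheses and the close() call are additions that keep the redirection unambiguous and avoid running out of open files on large inputs.

```shell
#!/bin/bash
# Hypothetical two-paragraph input in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 'string1' 'choice1-1' '' 'string2' 'choice2-1' 'choice2-2' > "$dir/infile"

(
  cd "$dir" || exit 1
  # RS="" switches awk to paragraph mode: records are separated by one
  # or more empty lines, so each block arrives as a single record.
  awk 'BEGIN { RS = "" } { out = "file" (++i); print > out; close(out) }' infile
)

first=$(cat "$dir/file1")
second=$(cat "$dir/file2")
printf '%s\n---\n%s\n' "$first" "$second"
rm -rf "$dir"
```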

danielbmartin 03-16-2013 01:47 PM

Quote:

Originally Posted by grail (Post 4912957)
RS="" - Set record separator to an empty line

print > "file"++i - print the current record (ie all up to the empty line) into a file called "fileN", where N is 1, 2, 3, etc

Thank you for this explanation. I now understand a distinction between record and line.

Now, a nitpick. Empty line could mean a null line, or it could mean a line containing only white space. When displayed on the screen both look alike. Your solution is short and sweet (I admire that) but it depends on empty line = null line.

Daniel B. Martin

grail 03-17-2013 10:51 AM

Quote:

Now, a nitpick. Empty line could mean a null line, or it could mean a line containing only white space. When displayed on the screen both look alike. Your solution is short and sweet (I admire that) but it depends on empty line = null line.
And I am sure by now you could easily convert this to allow for whitespace :)
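
One possible way to do that conversion, offered as a sketch rather than the definitive answer: blank out whitespace-only lines with sed first, then let awk's paragraph mode do the rest. The sample input and the temp directory are invented for the example.

```shell
#!/bin/bash
# Hypothetical input whose "empty" lines contain spaces or a tab.
dir=$(mktemp -d)
printf 'a\nb\n   \nc\nd\n \t\ne\n' > "$dir/infile"

(
  cd "$dir" || exit 1
  # Turn whitespace-only lines into truly empty ones, then split
  # on empty lines exactly as in the one-liner above.
  sed 's/^[[:space:]]*$//' infile |
    awk 'BEGIN { RS = "" } { out = "file" (++i); print > out; close(out) }'
)

count=$(ls "$dir" | grep -c '^file[0-9]')
third=$(cat "$dir/file3")
echo "$count files written"
rm -rf "$dir"
```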

danielbmartin 03-17-2013 08:31 PM

This is an interesting problem and, as a learning experience, I improved on previous solutions.

Instead of sequence numbers I used the first line in each "paragraph" as part of the output file names.

This InFile ...
Code:

able
choice1-1
choice1-2
choice1-3

baker
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

charlie
choice3-1
choice3-2

dog
choice4-1
choice4-2
choice4-3

... produces these four OutFiles ...
dbm686out.able
Code:

able
choice1-1
choice1-2
choice1-3

dbm686out.baker
Code:

baker
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

dbm686out.charlie
Code:

charlie
choice3-1
choice3-2

dbm686out.dog
Code:

dog
choice4-1
choice4-2
choice4-3

This code (using bash) does the job...
Code:

# File identification
  Path=$(readlink -f $0 | cut -d'.' -f1)
 InFile=$Path"inp.txt"
 
# In this version each output file includes the first line of each paragraph.
o=$(readlink -f $0 | cut -d'.' -f1)"out"  #o = output file names
rm  $o'.'*  # Blow away any leftover output files.
ofid=""  # Initialize ofid, Output File IDentifier.
while read line
  do
    if [[ "$ofid" == "" ]]
      then ofid=$line
    fi
    if [[ "$line" == "" ]]
      then ofid=""
      else echo $line >> $o'.'$ofid
    fi
  done < $InFile

... and this code (based on grail's superb awk one-liner) is more concise...
Code:

# File identification
  Path=$(readlink -f $0 | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file includes the first line of each paragraph.
o=$(readlink -f $0 | cut -d'.' -f1)"out"  #o = output file names
awk -v o="$o" '{print > (o "." $1)}' RS="" "$InFile"

Suggestions and corrections are gratefully accepted.

Daniel B. Martin

danielbmartin 03-17-2013 08:37 PM

This is an interesting problem and, as a learning experience, I improved on previous solutions.

Instead of sequence numbers I used the first line in each "paragraph" as part of the output file names.

This InFile ...
Code:

able
choice1-1
choice1-2
choice1-3

baker
choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

charlie
choice3-1
choice3-2

dog
choice4-1
choice4-2
choice4-3

... produces these four OutFiles ...
dbm690out.able
Code:

choice1-1
choice1-2
choice1-3

dbm690out.baker
Code:

choice2-1
choice2-2
choice2-3
choice2-4
choice2-5

dbm690out.charlie
Code:

choice3-1
choice3-2

dbm690out.dog
Code:

choice4-1
choice4-2
choice4-3

This code (using bash) does the job...
Code:

# File identification
  Path=$(readlink -f $0 | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file excludes the first line of each paragraph.
o=$(readlink -f $0 | cut -d'.' -f1)"out"  #o = output file names
rm $o'.'*  # Blow away any leftover output files.
ofid=""  # Initialize ofid, Output File IDentifier.
while read line
  do
    if [[ "$ofid" == "" ]];
      then ofid=$line;
    fi
    if [[ "$line" == "" ]];
      then ofid="";
    fi
    if [[ "$ofid" != "$line" ]];
      then echo $line >> $o'.'$ofid
    fi
  done < $InFile

... and this code (based on grail's superb awk one-liner) is more concise...
Code:

# File identification
  Path=$(readlink -f $0 | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file excludes the first line of each paragraph.
o=$(readlink -f $0 | cut -d'.' -f1)"out"  #o = output file names
awk -v o="$o" '{t=$1; $1=""; sub(/^ /,""); gsub(" ","\n"); print > (o "." t)}' RS="" "$InFile"

Suggestions and corrections are gratefully accepted.

Daniel B. Martin

grail 03-18-2013 08:41 AM

Might want to check the output files that are using the second awk solution. I think you will find that your data is not line for line, but now on a single line.

Example:

Instead of dbm690out.able being:
Code:

choice1-1
choice1-2
choice1-3

I believe it will look like:
Code:

choice1-1 choice1-2 choice1-3

danielbmartin 03-18-2013 10:05 AM

Quote:

Originally Posted by grail (Post 4913858)
Might want to check the output files that are using the second awk solution. I think you will find that your data is not line for line, but now on a single line.

Recognition of a bug is the first step toward fixing the bug. The man who points out a flaw in my code is helping me. Thank you, grail.

I edited post #21 to show corrected code. It works but is unlovely. Is there a cleaner way?

Daniel B. Martin

grail 03-18-2013 01:19 PM

How about:
Code:

awk -v o="$o" '{t=$1;$1="";sub(/^\n/,"");print > (o "." t)}' RS="" OFS="\n" file
And just as a quickie, a ruby alternative:
Code:

ruby -ane 'BEGIN{$/=""};IO.write("name."+ $F[0],$F[1..-1]*"\n")' file
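
For the curious, here is a sketch of why the data ended up on one line and why OFS fixes it. Assigning to any field makes awk rebuild the record by joining the fields with OFS, which is a single space by default. The sample paragraph and the variable names are only for illustration.

```shell
#!/bin/bash
# Hypothetical one-paragraph input.
para=$(printf '%s\n' 'able' 'choice1-1' 'choice1-2')

# Default OFS (a space): clearing $1 rebuilds $0 as " choice1-1 choice1-2".
flat=$(printf '%s\n' "$para" |
  awk 'BEGIN { RS = "" } { $1 = ""; print }')

# OFS set to a newline: the rebuilt record keeps one field per line;
# sub() then drops the leading newline left behind by the emptied $1.
lines=$(printf '%s\n' "$para" |
  awk 'BEGIN { RS = ""; OFS = "\n" } { $1 = ""; sub(/^\n/, ""); print }')

printf 'flat:  [%s]\nlines:\n%s\n' "$flat" "$lines"
```

Note that this still splits on every space, so it only holds up while each choice is a single word.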

danielbmartin 03-18-2013 09:42 PM

Now, let's make the problem more challenging by permitting multi-word "choice" lines.

With this InFile ...
Code:

able
how now
brown cow

baker
now is the time
for all good men
to come to the aid
of their party

charlie
the quick brown fox
jumps over
the lazy programmer

dog
words to live by:
let sleeping dogs lie

...this bash code ...
Code:

# File identification
  Path=$(readlink -f $0 | cut -d'.' -f1)
 InFile=$Path"inp.txt"

# In this version each output file excludes the first line of each paragraph.
o=$(readlink -f $0 | cut -d'.' -f1)"out"  #o = output file names
rm $o'.'*  # Blow away any leftover output files.
ofid=""  # Initialize ofid, Output File IDentifier.
while read line
  do
    if [[ "$ofid" == "" ]];
      then ofid=$line;
    fi
    if [[ "$line" == "" ]];
      then ofid="";
    fi
    if [[ "$ofid" != "$line" ]];
      then echo $line >> $o'.'$ofid
    fi
  done < $InFile

# For debugging...
for file in $o*; do echo; echo $file "..."; cat $file; done

... produces this result ...
Code:

/home/daniel/Desktop/LQfiles/dbm690out.able ...
how now
brown cow

/home/daniel/Desktop/LQfiles/dbm690out.baker ...
now is the time
for all good men
to come to the aid
of their party

/home/daniel/Desktop/LQfiles/dbm690out.charlie ...
the quick brown fox
jumps over
the lazy programmer

/home/daniel/Desktop/LQfiles/dbm690out.dog ...
words to live by:
let sleeping dogs lie

... but I'm unable to code an equivalent in awk. Anyone care to take a shot at it?

Daniel B. Martin

grail 03-19-2013 12:43 AM

My hint will be, have a look at the input field separator (FS)
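
One way that hint might play out, as a sketch under some assumptions: with RS="" and FS set to a newline, every line of a paragraph becomes one field, so spaces inside a line are left alone. The temp directory, the shortened sample input and the output prefix "o" are stand-ins for the script's own paths.

```shell
#!/bin/bash
# Hypothetical input with multi-word "choice" lines.
dir=$(mktemp -d)
printf '%s\n' 'able' 'how now' 'brown cow' '' \
              'baker' 'now is the time' 'for all good men' > "$dir/inp.txt"
o="$dir/out"   # assumed output prefix, standing in for $o in the script

# Paragraph mode plus FS="\n": $1 is the heading line and
# $2..$NF are the remaining lines, each kept whole.
awk -v o="$o" 'BEGIN { RS = ""; FS = "\n" }
{
  out = o "." $1
  for (n = 2; n <= NF; n++) print $n > out
  close(out)
}' "$dir/inp.txt"

able=$(cat "$o.able")
baker=$(cat "$o.baker")
printf '%s\n---\n%s\n' "$able" "$baker"
rm -rf "$dir"
```

A heading that itself contains spaces would end up in the file name, so this assumes single-word headings as in the sample.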


All times are GMT -5.