search for inverse of pattern and join with the line before it

akelder · 07-14-2009, 05:03 AM

Let's say I have:

Code:

122. some text, March 2, 1996
123.  some text, April 22, 1997
124. some text, April 23,
1998
125.  some text, May 1,
1999
20555.   some text, August 3, 2007
20556. some text, July 3,
2008
20557. some text, July
4, 2009
20558. some text, August 1, 2010

And I need to turn it into:

Code:

122. some text, March 2, 1996
123.  some text, April 22, 1997
124. some text, April 23, 1998
125.  some text, May 1, 1999
20555.   some text, August 3, 2007
20556. some text, July 3, 2008
20557. some text, July 4, 2009
20558. some text, August 1, 2010

I hacked together some sed that finds lines starting with a number of 1 to 5 digits followed by a dot and one or more spaces, then replaces the newline at the end with a space, so that each matching line is joined with the line that follows:

Code:

cat file | sed -e '/\(^[0-9]\{1,5\}\.\s\+\)/N;s/\n/ /g'

And also sed that will find lines that start with "text" and join each with the preceding line:

Code:

cat file | sed ':a; $!N;s/\ntext/ text/;ta;P;D'

But I cannot figure out how to find lines not matching a pattern (inverse match, like "grep -v") and then append each to the preceding line.

Something like the following (which doesn't work):

Code:

cat file | sed -e ':a; /\(^[0-9]\{1,5\}\.\s\+\)!/N;s/\n/ /;ta;P;D'

Any way to do this with sed (or awk, or something else)?

ghostdog74 · 07-14-2009, 05:49 AM

Code:

awk '/[a-zA-Z]/' file

PMP · 07-14-2009, 06:18 AM

Try this out; it worked for me:

Code:

cat file | tr "\n" " " | sed 's/\([0-9]\+\.\)/\n\1/g'

pixellany · 07-14-2009, 06:19 AM

Quote:

Originally Posted by ghostdog74

Code:

awk '/[a-zA-Z]/' file

I don't grasp what this does or how it fits with the question....???

colucix · 07-14-2009, 06:19 AM

Code:

awk '/[[:digit:]][.]/{
  if ( string != "" )
     print string
  string=$0
}
!/[[:digit:]][.]/{
  print string, $0
  string=""
}
END { if ( string != "" )
       print string
}' testfile

ghostdog74 · 07-14-2009, 07:19 AM

Quote:

Originally Posted by pixellany

I don't grasp what this does or how it fits with the question....???

That was formulated from the output he wants, using the sample input. It does not check for the digits and such, because every line of the output the OP wants contains letters. Hence my suggestion. Of course, if the input varies more, a more thorough check like colucix's will be needed.

colucix · 07-14-2009, 07:39 AM

ghostdog, the input has lines that contain only the year, which has to be appended to the previous line. Only the output has letters in every line. In my code I just checked for a digit immediately followed by a dot, as suggested by the OP.

sundialsvcs · 07-14-2009, 08:20 AM

awk, or the Perl programming language, would be an appropriate tool for this, because the task at hand is expressed algorithmically.

(1) Initialize a line-buffer to an empty string.

(2) While not end-of-file, read another newline-delimited string and append it to the buffer.

(3) Look within the buffer for "some text, and a date." If you find that, remove it from the head of the buffer and output it. Keep the tail of the string in the buffer.

(4) Repeat step (3) until no more matches can be found.

(5) When you reach the end-of-file, don't forget what's still in the buffer (if anything). In this case I don't think you intend to do anything with it.

This algorithm suggests itself because, in the data you provide, I see that newlines can appear anywhere in a date, which is nevertheless seen as one.

The two tools that I spoke of are "power tools" for doing this kind of string-manipulation and file parsing.
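
The buffer approach in steps (1) to (5) can be sketched in awk roughly as follows. This is a simplified variant, not a definitive implementation: it assumes a record is complete as soon as the buffer ends in a comma, a space and a four-digit year, and because it checks after every input line, at most one record is emitted per line. The names `input`, `joined` and `buf` are made up for the illustration.

```shell
# Sample input taken from the question, written to a temporary file.
input=$(mktemp)
printf '%s\n' \
  '124. some text, April 23,' \
  '1998' \
  '20557. some text, July' \
  '4, 2009' \
  '20558. some text, August 1, 2010' > "$input"

joined=$(awk '
{
    # Step 2: append each new line to the buffer, separated by a space.
    buf = (buf == "") ? $0 : buf " " $0
    # Step 3 (assumption): the record is complete once the buffer ends
    # in ", " followed by a four-digit year.
    if (buf ~ /, [0-9][0-9][0-9][0-9]$/) {
        print buf
        buf = ""
    }
}
# Step 5: flush whatever is still in the buffer at end of file.
END { if (buf != "") print buf }
' "$input")
rm -f "$input"
printf '%s\n' "$joined"
```

With the sample above this prints three complete lines, one per numbered record. Unlike step (5) as described, the sketch prints any leftover buffer rather than discarding it.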

ghostdog74 · 07-14-2009, 08:32 AM

Quote:

Originally Posted by colucix

ghostdog, the input has lines that contain only the year, which has to be appended to the previous line. Only the output has letters in every line. In my code I just checked for a digit immediately followed by a dot, as suggested by the OP.

Thanks. I missed that the year has to be appended to the previous line.

Code:

awk '/[0-9][.]/ && NR>1{ print "";}{printf "%s",$0}' file

Kenhelm · 07-14-2009, 12:08 PM

This uses GNU sed.
All of the lines are appended to the pattern space one at a time, separated by \n. If the start of the most recently read line doesn't match the number pattern, the 's' command replaces the last '\n' in the pattern space with a space.
It can join a continuous run of lines which don't start with the number pattern.

Code:

echo \
'124. some text, April 23,
1998
20557. some text, July
4, 2009
some more text
20558. some text, August 1, 2010' |

sed -r ':a N; /\n[0-9]{1,5}\.\s[^\n]*$/! s/(.*)\n/\1 /; ba'

124. some text, April 23, 1998
20557. some text, July 4, 2009 some more text
20558. some text, August 1, 2010

akelder · 07-14-2009, 01:54 PM

Thanks a lot, everyone, this is some great stuff!

akelder · 07-14-2009, 02:24 PM

Quote:

Originally Posted by Kenhelm

Code:

     1   2 3   4        5        6   7 8 9   10   11    12       
sed -r ':a N; /\n[0-9]{1,5}\.\s[^\n]*$/! s/(.*)\n/\1 /; ba'

Kenhelm, this is great and works perfectly, but I don't fully understand it. Here's what I see, please correct me.. :-P

1. Tells sed to expect extended regexp
2. Creates a label to return to later
3. Appends the next line of input into the pattern space
4. Pattern starts with newline
5. Pattern to match
6. ?
7. Matches till the end of line
8. Negates previous expression
9. Substitutes
10. Anything followed by a newline?
11. Replaces with ?? <-- Edit: "\1" means first saved substring
12. Returns to label

Thanks again!
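
One way to read the command is to spread it over several lines with a comment per piece. The sketch below is the same script as Kenhelm's, only reformatted, and the comments are an interpretation rather than his own explanation. It assumes GNU sed (for `-r`, `\s` and `\n` inside a bracket expression); the variable name `joined` is made up for the illustration.

```shell
joined=$(printf '%s\n' \
  '124. some text, April 23,' \
  '1998' \
  '20558. some text, August 1, 2010' |
sed -r '
  # :a defines a label named "a" that marks the top of the loop.
  :a
  # N appends the next input line to the pattern space after a \n.
  N
  # Address: \n, then 1 to 5 digits, a literal dot (\.), one whitespace
  # character (\s), then any non-newline characters up to the end ($).
  # Because [^\n]* cannot cross a newline, only the LAST line can match.
  # The ! runs the s command only when that address does not match.
  # s/(.*)\n/\1 / : the greedy (.*) captures everything before the last \n,
  # and \1 puts it back followed by a space instead of the newline.
  /\n[0-9]{1,5}\.\s[^\n]*$/! s/(.*)\n/\1 /
  # ba jumps back to label "a" unconditionally.
  ba
')
printf '%s\n' "$joined"
```

When `N` runs out of input, GNU sed prints the pattern space and exits, which is why no explicit print command is needed. Note that the whole file accumulates in the pattern space, so this suits small to moderate inputs.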

akelder · 07-14-2009, 02:30 PM

Quote:

Originally Posted by ghostdog74

Code:

awk '/[0-9][.]/ && NR>1{ print "";}{printf "%s",$0}' file

Ghostdog, this works perfectly, but could you explain how it works?

Cheers!
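
The one-liner consists of two awk rules, and a commented version may make that easier to see. The sketch below is a variant, not the posted command: the `sep` variable and the `END` rule are additions (assumptions for the illustration) that put a space before each continuation line and terminate the last output line. The posted one-liner prints `$0` alone, so it joins lines with no separator and leaves the final line without a newline.

```shell
result=$(printf '%s\n' \
  '124. some text, April 23,' \
  '1998' \
  '20558. some text, August 1, 2010' |
awk '
# Rule 1: a line containing "digit then dot" starts a new record, so finish
# the previous output line first (NR > 1 skips this for the very first line).
/[0-9][.]/ && NR > 1 { print ""; sep = "" }
# Rule 2: runs for every line and prints it WITHOUT a trailing newline.
# Assumption: the sep variable is added here to put a space before
# continuation lines; the posted one-liner prints $0 alone.
{ printf "%s%s", sep, $0; sep = " " }
# Assumption: also added here, to terminate the last output line.
END { if (NR > 0) print "" }
')
printf '%s\n' "$result"
```

Because the pattern `/[0-9][.]/` is not anchored to the start of the line, a continuation line that happens to contain a digit followed by a dot would be treated as the start of a new record.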

akelder · 07-14-2009, 02:45 PM

Quote:

Originally Posted by colucix

Code:

awk '/[[:digit:]][.]/{
  if ( string != "" )
     print string
  string=$0
}
!/[[:digit:]][.]/{
  print string, $0
  string=""
}
END { if ( string != "" )
       print string
}' testfile

colucix, thanks much, and sorry for being dense, but how do I run this? Running it from the shell gives no error, but it doesn't work.
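
A minimal way to try the program is to put some sample lines in a file and pass that file name as the last argument, where the posted command has `testfile`. The sketch below does this in a temporary directory. One deliberate change, labeled as an assumption: `[0-9]` is used in place of `[[:digit:]]`, because some older awk builds do not support POSIX bracket classes. The names `workdir` and `output` are made up for the illustration.

```shell
# Create a scratch directory and a small input file named testfile.
workdir=$(mktemp -d)
printf '%s\n' \
  '123.  some text, April 22, 1997' \
  '124. some text, April 23,' \
  '1998' \
  '125.  some text, May 1,' \
  '1999' > "$workdir/testfile"

# The awk program is one single-quoted argument; the input file follows it.
# Assumption: [0-9] replaces [[:digit:]] from the posted version.
output=$(awk '/[0-9][.]/{
  if ( string != "" )
     print string
  string=$0
}
!/[0-9][.]/{
  print string, $0
  string=""
}
END { if ( string != "" )
       print string
}' "$workdir/testfile")
rm -rf "$workdir"
printf '%s\n' "$output"
```

The program reads whatever file is named after the closing quote, so with the data from the question the last word would be `file` rather than `testfile`.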

akelder · 07-14-2009, 02:56 PM

Quote:

Originally Posted by PMP

Try this out; it worked for me:

Code:

cat file | tr "\n" " " | sed 's/\([0-9]\+\.\)/\n\1/g'

PMP, thanks, that works great. Clever workaround to not have to bother with holding stuff in the buffer.. I see you're using that \1 at the end, too.. Gotta figure out what that means..
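
For the `\1` question, a small illustration may help: in sed, `\( ... \)` saves whatever it matches, and `\1` in the replacement inserts that saved text back. The sample string and the variable name `marked` below are made up for the illustration.

```shell
# \( ... \) captures the first run of four digits; \1 re-inserts it,
# here wrapped in square brackets so the effect is visible.
marked=$(printf '%s\n' 'July 4, 2009' |
  sed 's/\([0-9][0-9][0-9][0-9]\)/[\1]/')
printf '%s\n' "$marked"   # prints: July 4, [2009]
```

In PMP's command the same mechanism puts a newline in front of each captured "number plus dot" after `tr` has flattened the whole file onto one line.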