LinuxQuestions.org

akelder 07-14-2009 05:03 AM

search for inverse of pattern and join with the line before it
 
Let's say I have:

Code:

122. some text, March 2, 1996
123.  some text, April 22, 1997
124. some text, April 23,
1998
125.  some text, May 1,
1999
20555.  some text, August 3, 2007
20556. some text, July 3,
2008
20557. some text, July
4, 2009
20558. some text, August 1, 2010

And I need to turn it into:

Code:

122. some text, March 2, 1996
123.  some text, April 22, 1997
124. some text, April 23, 1998
125.  some text, May 1, 1999
20555.  some text, August 3, 2007
20556. some text, July 3, 2008
20557. some text, July 4, 2009
20558. some text, August 1, 2010

I hacked together some sed that finds lines starting with a number of 1 to 5 digits followed by a dot and one or more spaces, then replaces the newline at the end with a space, so each matching line is joined with the line that follows:

Code:

cat file | sed -e '/\(^[0-9]\{1,5\}\.\s\+\)/N;s/\n/ /g'
I also have some sed that finds lines starting with "text" and joins each with the preceding line:

Code:

cat file | sed ':a; $!N;s/\ntext/ text/;ta;P;D'
But I cannot figure out how to find lines not matching a pattern (inverse match, like "grep -v") and then append each to the preceding line.

Something like the following (which doesn't work):

Code:

cat file | sed -e ':a; /\(^[0-9]\{1,5\}\.\s\+\)!/N;s/\n/ /;ta;P;D'
Any way to do this with sed (or awk, or something else)?

ghostdog74 07-14-2009 05:49 AM

Code:

awk '/[a-zA-Z]/' file

PMP 07-14-2009 06:18 AM

Try this out; it worked for me.
Code:

cat file | tr "\n" " " | sed 's/\([0-9]\+\.\)/\n\1/g'

pixellany 07-14-2009 06:19 AM

Quote:

Originally Posted by ghostdog74 (Post 3607070)
Code:

awk '/[a-zA-Z]/' file

I don't grasp what this does or how it fits with the question....???

colucix 07-14-2009 06:19 AM

Code:

awk '/[[:digit:]][.]/{
  if ( string != "" )
    print string
  string=$0
}
!/[[:digit:]][.]/{
  print string, $0
  string=""
}
END { if ( string != "" )
      print string
}' testfile


ghostdog74 07-14-2009 07:19 AM

Quote:

Originally Posted by pixellany (Post 3607091)
I don't grasp what this does or how it fits with the question....???

That was formulated according to the output he wants, using the sample input. It does not check for the digits and such, because every line of the output the OP wants contains letters. Hence my suggestion. Of course, if the input varies more, a more thorough check like colucix's will be needed.

colucix 07-14-2009 07:39 AM

ghostdog, the input has lines that contain only the year, which has to be appended to the previous line. Only the output has letters in every line. In my code I just checked for the first number immediately followed by a dot, as suggested by the OP.

sundialsvcs 07-14-2009 08:20 AM

awk, or the Perl programming language, would be an appropriate tool for this, because the task at hand is expressed algorithmically.

(1) Initialize a line-buffer to an empty string.

(2) While not end-of-file, read another newline-delimited string and append it to the buffer.

(3) Look within the buffer for "some text, and a date." If you find that, remove it from the head of the buffer and output it. Keep the tail of the string in the buffer.

(4) Repeat step (3) until no more matches can be found.

(5) When you reach the end-of-file, don't forget what's still in the buffer (if anything). In this case I don't think you intend to do anything with it.

This algorithm suggests itself because, in the data you provide, I see that newlines can appear anywhere in a date, which is nevertheless seen as one.

The two tools that I spoke of are "power tools" for doing this kind of string-manipulation and file parsing.
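
A minimal awk sketch of the buffer approach described above, assuming every record ends in a date of the form "Month D, YYYY" and that "some text" never contains anything that looks like such a date. The function name join_by_buffer and the date regex are illustrative assumptions, not something taken from the posts:

```shell
# Hypothetical helper: reads the broken lines on stdin, writes joined records.
join_by_buffer() {
  awk '
    {
      # Step 2: append the new line to the buffer, separated by one space.
      buf = (buf == "" ? $0 : buf " " $0)

      # Steps 3 and 4: while the buffer contains a complete date, output
      # everything up to the end of that date and keep the tail.
      while (match(buf, /[A-Z][a-z]+ +[0-9]+, +[0-9][0-9][0-9][0-9]/)) {
        print substr(buf, 1, RSTART + RLENGTH - 1)
        buf = substr(buf, RSTART + RLENGTH)
        sub(/^ +/, "", buf)
      }
    }
    # Step 5: report whatever never completed a record, on stderr.
    END { if (buf != "") print "unparsed tail: " buf | "cat 1>&2" }
  '
}

printf '%s\n' '20556. some text, July 3,' '2008' \
              '20557. some text, July' '4, 2009' | join_by_buffer
# prints:
# 20556. some text, July 3, 2008
# 20557. some text, July 4, 2009
```

Because the match is on the date rather than on the leading number, this also copes with a newline falling anywhere inside the date, which is the case the algorithm was suggested for.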

ghostdog74 07-14-2009 08:32 AM

Quote:

Originally Posted by colucix (Post 3607155)
ghostdog, the input has lines that contain only the year, which has to be appended to the previous line. Only the output has letters in every line. In my code I just checked for the first number immediately followed by a dot, as suggested by the OP.

Thanks. I missed that the year has to be appended to the previous line.
Code:

awk '/[0-9][.]/ && NR>1{ print "";}{printf "%s",$0}' file
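
A commented variant of the one-liner above, as a sketch. As written, the one-liner glues each continuation line on without a separating space ("April 23,1998", unless the input lines carry trailing spaces) and ends the output without a final newline. The version below anchors the pattern at the start of the line, adds the space, and terminates the last line. The function name join_continuations is hypothetical:

```shell
# Hypothetical helper: reads the broken lines on stdin, writes joined lines.
join_continuations() {
  awk '
    # A line that starts with "number." begins a new record: finish the
    # previous output line first (except before the very first line), then
    # print the record without a trailing newline.
    /^[0-9]+[.]/ { if (NR > 1) print ""; printf "%s", $0; next }

    # Any other line is a continuation: stay on the same output line.
    { printf " %s", $0 }

    # Terminate the last output line.
    END { if (NR > 0) print "" }
  '
}

printf '%s\n' '124. some text, April 23,' '1998' \
              '20557. some text, July' '4, 2009' | join_continuations
# prints:
# 124. some text, April 23, 1998
# 20557. some text, July 4, 2009
```

The trick is the same as in the original: printf never emits a newline on its own, so the only line breaks in the output are the ones printed when a new numbered line shows up.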

Kenhelm 07-14-2009 12:08 PM

This uses GNU sed:
All of the lines are appended one at a time to the pattern space, separated by \n. If the start of the most recently appended line doesn't match the number pattern, the 's' command replaces the last '\n' in the pattern space with a space.
It can therefore join a continuous run of lines that don't start with the number pattern.
Code:

echo \
'124. some text, April 23,
1998
20557. some text, July
4, 2009
some more text
20558. some text, August 1, 2010' |

sed -r ':a N; /\n[0-9]{1,5}\.\s[^\n]*$/! s/(.*)\n/\1 /; ba'

124. some text, April 23, 1998
20557. some text, July 4, 2009 some more text
20558. some text, August 1, 2010


akelder 07-14-2009 01:54 PM

Thanks a lot, everyone, this is some great stuff!

akelder 07-14-2009 02:24 PM

Quote:

Originally Posted by Kenhelm (Post 3607449)
Code:

    1   2  3   4 5           6 7       8 9 10     11    12
sed -r ':a N; /\n[0-9]{1,5}\.\s[^\n]*$/! s/(.*)\n/\1 /; ba'


Kenhelm, this is great and works perfectly, but I don't fully understand it. Here's what I see, please correct me.. :-P

1. Tells sed to expect extended regexp
2. Creates a label to return to later
3. Appends the next line of input into the pattern space
4. Pattern starts with newline
5. Pattern to match
6. ?
7. Matches till the end of line
8. Negates previous expression
9. Substitutes
10. Anything followed by a newline?
11. Replaces with ?? <-- Edit: "\1" means first saved substring
12. Returns to label

Thanks again!
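
For reference, here is the same command spread over several lines with one comment per numbered piece, as a sketch of how it reads. GNU sed is assumed (for -r, \s and \n inside brackets), and the wrapper name join_with_sed is made up for illustration:

```shell
# Hypothetical wrapper around the GNU sed command quoted above.
join_with_sed() {
  sed -r '
    # (1) -r selects extended regular expressions.
    # (2) Define the label "a".
    :a
    # (3) Append the next input line to the pattern space after a newline.
    #     At end of input, GNU sed prints the pattern space and exits here.
    N
    # (4) \n            the newline in front of the line just appended
    # (5) [0-9]{1,5}\.  one to five digits and a literal dot
    # (6) \s            one whitespace character
    # (7) [^\n]*$       the rest of that line, up to the end of the pattern space
    # (8) !             run the next command only if the address does NOT match
    # (9)-(11) s/(.*)\n/\1 /  greedy (.*) stops at the last newline, which is
    #     replaced by a space; \1 puts the captured text back.
    /\n[0-9]{1,5}\.\s[^\n]*$/! s/(.*)\n/\1 /
    # (12) Branch back to label "a".
    ba
  '
}

printf '%s\n' '124. some text, April 23,' '1998' \
              '20558. some text, August 1, 2010' | join_with_sed
# prints:
# 124. some text, April 23, 1998
# 20558. some text, August 1, 2010
```

Note that the "!" goes after the closing slash of the address, not inside the regex. Also, the whole input accumulates in the pattern space before anything is printed, which is fine for small files.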

akelder 07-14-2009 02:30 PM

Quote:

Originally Posted by ghostdog74 (Post 3607220)
Code:

awk '/[0-9][.]/ && NR>1{ print "";}{printf "%s",$0}' file

Ghostdog, this works perfectly, but could you explain how it works?

Cheers!

akelder 07-14-2009 02:45 PM

Quote:

Originally Posted by colucix (Post 3607093)
Code:

awk '/[[:digit:]][.]/{
  if ( string != "" )
    print string
  string=$0
}
!/[[:digit:]][.]/{
  print string, $0
  string=""
}
END { if ( string != "" )
      print string
}' testfile


colucix, thanks much, sorry for being dense, but how do I run this? Running it from the shell runs without error, but doesn't work.

akelder 07-14-2009 02:56 PM

Quote:

Originally Posted by PMP (Post 3607090)
Try this out; it worked for me.
Code:

cat file | tr "\n" " " | sed 's/\([0-9]\+\.\)/\n\1/g'

PMP, thanks, that works great. Clever workaround to not have to bother with holding stuff in the buffer.. I see you're using that \1 at the end, too.. Gotta figure out what that means..
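
On the \1 question: in sed, \( ... \) saves whatever that part of the pattern matched, and \1 in the replacement inserts it again. A small sketch, assuming GNU sed because of \+ (the function name bracket_numbers is made up for the demo):

```shell
# Hypothetical demo: wrap every "number." in angle brackets so the
# captured text is easy to see in the output.
bracket_numbers() {
  sed 's/\([0-9]\+\.\)/<\1>/g'
}

printf '%s\n' '124. some text 125. more text' | bracket_numbers
# prints: <124.> some text <125.> more text
```

In PMP's command the replacement is \n\1, so each captured number gets a newline put in front of it. That is what splits the single long line produced by tr back into one line per record.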


All times are GMT -5.