LinuxQuestions.org


Dsw0002 04-15-2011 02:44 PM

Sed, Awk, Perl - Merge lines unless they match a certain string
 
What is the best way to merge lines, in sed, awk or perl, that occur between certain strings?
I'm new to sed scripting and I have been working on this for some time now.
I have a large file (sample below) that I need to edit.

>contig...........translated
(sequences)
(sequences)
(sequences)
>contig...........translated
(sequences)
.
.
.

What I need looks something like this:
>contig...........translated
(sequences)(sequences)(sequences)
>contig...........translated
(sequences).............

I'm working with a very large file, so simply merging all the lines and then adding a newline character before ">contig" and after "translated" won't work, at least not with sed.

Sergei Steshenko 04-15-2011 02:51 PM

It can be done in a very straightforward manner in Perl. I do not understand where the capacity problem comes from: you just write the (sequences) lines found between the '>contig...........translated' headers to your output file, stripping the trailing "\n" from each one (the 'chomp' function in Perl).
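
A minimal sketch of that line-by-line approach, written in Ruby rather than Perl (Ruby's chomp does the same job as Perl's). The method name merge_records and the header pattern /\A>contig/ are assumptions made for illustration; adjust the pattern to whatever your header lines really look like.

```ruby
require 'stringio'

# Stream the input one line at a time, so only the current line is in memory.
# Header lines are printed on their own line; the sequence lines that follow
# are printed back to back with their newlines stripped.
# `header` is an assumed pattern for the '>contig...translated' lines.
def merge_records(input, output, header: /\A>contig/)
  pending = false                     # true while a merged line still lacks its newline
  input.each_line do |line|
    line = line.chomp                 # drop the trailing "\n" (or "\r\n")
    if line.match?(header)
      output.print("\n") if pending   # finish the previous merged line
      output.print(line, "\n")
      pending = false
    else
      output.print(line)              # no newline, so the next sequence line joins on
      pending = true
    end
  end
  output.print("\n") if pending       # terminate the last merged line
end

# Small demo on made-up data.
demo = StringIO.new(">contig1 translated\nAAA\nBBB\nCCC\n>contig2 translated\nDDD\n")
merge_records(demo, $stdout)
# prints:
# >contig1 translated
# AAABBBCCC
# >contig2 translated
# DDD
```

For a real file, something like merge_records(ARGF, $stdout) in place of the demo should work, with the file name given on the command line and the output redirected to a new file.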

Nominal Animal 04-15-2011 04:23 PM

Quote:

Originally Posted by Dsw0002 (Post 4325798)
What is the best way to merge lines, in sed, awk or perl, that occur between certain strings?

I don't know about best, but one possibility is this awk script:
Code:

awk 'BEGIN      { RS="[\r\n]+" ; nl="" ; sep=" " }
    /^>contig/ { printf("%s%s%s", nl, $0, RT); nl="" ; sp="" ; next }
                { printf("%s%s", sp, $0) ; nl=RT ; sp=sep }
    END        { printf("%s", nl) }' file

which streams the input file (only one line or so in memory at any time) quite efficiently, and keeps whatever newline convention you might be using. Note that RT and a regular expression in RS are GNU awk extensions, so this needs gawk.
The sep=" " in the first line specifies the delimiter that replaces the newlines in merged lines. Use sep="" if you want to merge the lines without any intervening separator.

This awk script is a bit more complex than absolutely necessary, but I only recently found out how to retain the newline convention efficiently, and wanted to apply that ;)

If you need something even more efficient, or want to work with unlimited-length lines (not having to read even a single complete line into RAM), I'd write a small utility in C. I think it'd only take about a hundred lines of code, even if you used unistd.h low-level I/O for maximum efficiency.
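
The single-pass state machine such a utility would need can be sketched as follows. This is Ruby rather than C, so it shows the logic only, not the speed. It assumes LF line endings and that every header line starts with '>'; the name merge_stream and the chunk size are hypothetical.

```ruby
require 'stringio'

# Read fixed-size chunks and decide what to do one character at a time,
# so no complete input line ever has to be held in memory.
# Assumptions: lines end in "\n" and header lines begin with '>'.
def merge_stream(input, output, chunk_size: 65_536)
  at_line_start = true   # the next character begins a new input line
  in_header = false      # the current line started with '>'
  pending = false        # merged sequence text is still waiting for its newline
  while (chunk = input.read(chunk_size))
    chunk.each_char do |ch|
      if ch == "\n"
        output.write("\n") if in_header   # headers keep their line break
        at_line_start = true
        in_header = false
        next                              # newlines after sequence text are dropped
      end
      if at_line_start
        at_line_start = false
        in_header = (ch == ">")
        if in_header && pending
          output.write("\n")              # close the merged sequence line
          pending = false
        end
      end
      pending = true unless in_header
      output.write(ch)
    end
  end
  output.write("\n") if pending           # terminate the last merged line
end

# Small demo on made-up data, with a tiny chunk size to exercise chunk boundaries.
demo = StringIO.new(">contig1 translated\nAAA\nBBB\n>contig2 translated\nCCC\n")
merge_stream(demo, $stdout, chunk_size: 4)
# prints:
# >contig1 translated
# AAABBB
# >contig2 translated
# CCC
```

A C version would keep the same three flags and replace the chunk loop with read() and write() calls on fixed buffers.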

grail 04-16-2011 03:07 AM

Seems I am on the same wavelength as Nominal:
Code:

awk '/contig/{ret="\n";if(a)nl=ret}{printf "%s%s%s",nl,$0,ret;a=1;nl=ret=""}' file

kurumi 04-16-2011 04:12 AM

Code:

$ ruby -ne 'print /contig/? "\n"+$_: $_.chomp' file

crts 04-16-2011 05:20 AM

Code:

sed  ':a />contig/! {$bb;N;ba};:b s/\n//g;1!s/>contig/\n&/' file
If your last line is '>contig...', then the jump point ':b' in the above command is unnecessary.


All times are GMT -5.