Old 04-15-2011, 02:44 PM   #1
LQ Newbie
Registered: Apr 2011
Posts: 4

Rep: Reputation: 0
Sed, Awk, Perl - Merge lines unless they match a certain string

What is the best way to merge lines, in sed, awk or perl, that occur between certain strings?
I'm new to sed scripting and I have been working on this for some time now.
I have a large file (sample below) that I need to edit.


What I need looks something like this.

I'm working with a very large file, so simply merging all the lines and then adding a newline character before ">contig" and after "translated" won't work, at least not with sed.
Old 04-15-2011, 02:51 PM   #2
Sergei Steshenko
Senior Member
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
It can be done in a very straightforward manner in Perl. I do not understand where the capacity problem comes from: you read the input line by line and write each line straight to your output file, stripping the trailing "\n" (the 'chomp' function in Perl) from the sequence lines that sit between the '>contig...........translated' header lines.
Old 04-15-2011, 04:23 PM   #3
Nominal Animal
Senior Member
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 946
Originally Posted by Dsw0002
What is the best way to merge lines, in sed, awk or perl, that occur between certain strings?
I don't know about best, but one possibility is this awk script,
awk 'BEGIN      { RS="[\r\n]+" ; nl="" ; sep=" " }
     /^>contig/ { printf("%s%s%s", nl, $0, RT); nl="" ; sp="" ; next }
                { printf("%s%s", sp, $0) ; nl=RT ; sp=sep }
     END        { printf("%s", nl) }' file
which streams the input file (only one line or so in memory at any time) quite efficiently, and keeps whatever newline convention you might be using. Note that RT is a GNU awk extension, so this needs gawk.
The sep=" " in the first line specifies the delimiter that replaces the newlines in merged lines. Use sep="" if you want to merge the lines without any intervening separator.

This awk script is a bit more complex than absolutely necessary, but I only recently found out how to retain the newline convention efficiently, and wanted to apply that.
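If keeping the original newline convention does not matter, a simpler variant that avoids RT might look like the sketch below. It is only an illustration: the function name merge_contigs_awk and its argument order are made up here, headers are assumed to start with ">contig", and the output always uses plain "\n" line endings.

```shell
# Hypothetical helper: merge_contigs_awk SEP [file...]
# SEP is the string placed between merged lines (pass "" for none).
merge_contigs_awk() {
    sep=$1; shift
    awk -v sep="$sep" '
        /^>contig/ { if (pending) printf "\n"   # close the previous merged line
                     print; pending = 0; sp = ""; next }
                   { printf "%s%s", sp, $0; pending = 1; sp = sep }
        END        { if (pending) printf "\n" }
    ' "$@"
}
```

For example, `merge_contigs_awk " " file` joins the sequence lines with a single space, and `merge_contigs_awk "" file` joins them with nothing in between.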

If you need something even more efficient, or want to work with unlimited-length lines (not having to read even a single complete line into RAM), I'd write a small utility in C. I think it'd only take about a hundred lines of code, even if you used unistd.h low-level I/O for maximum efficiency.

Last edited by Nominal Animal; 04-15-2011 at 04:25 PM.
Old 04-16-2011, 03:07 AM   #4
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,467

Rep: Reputation: 2856
Seems I am on the same wavelength as Nominal:
awk '/contig/{ret="\n";if(a)nl=ret}{printf "%s%s%s",nl,$0,ret;a=1;nl=ret=""}' file
Old 04-16-2011, 04:12 AM   #5
Registered: Apr 2010
Posts: 228

Rep: Reputation: 45
$ ruby -ne 'print /contig/? "\n"+$_: $_.chomp' file
Old 04-16-2011, 05:20 AM   #6
Senior Member
Registered: Jan 2010
Posts: 1,608

Rep: Reputation: 449
sed  ':a />contig/! {$bb;N;ba};:b s/\n//g;1!s/>contig/\n&/' file
If your last line is '>contig...', then the jump point ':b' in the above command (along with the '$bb' that branches to it) is unnecessary.


