LinuxQuestions.org
12-07-2011, 05:39 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
Delete lines based on a regexp including a tab


Hi all,

I have a 4.0 GB text file formatted like this:
Code:
APOM|Contig4256_149_40_404	
APOM|Contig4256_149_40_404	APOM|ALV_AA_014_E12_Y01_SCF	96.15	104	4	0	4	107	45	148	5e-54	 208
APOM|Contig4256_149_40_404	APOM|ALV_LSA_076_A05_X01_SCF	28.33	120	67	4	4	106	104	221	7e-07	52.4
APOM|Contig4256_149_40_404	APOM|Contig13233_92_40_716	30.00	120	65	4	4	106	66	183	2e-06	51.2
APOM|Contig4256_149_40_404	APOM|Contig14333_304_174_746	28.07	114	65	3	4	100	44	157	4e-06	50.1
APOM|Contig4256_149_40_404	APOM|ALV_LSB_086_C12_X01_SCF	26.60	94	63	2	1	89	41	133	1e-05	47.8
APOM|Contig4256_149_40_404	APOM|ALV_LSA_044_C12_X01_SCF	30.43	92	54	4	4	86	44	134	3e-05	47.0
APOM|Contig4256_149_40_404	APOM|ALV_LSA_216_A01_G09_Y01_SCF	31.25	80	47	3	17	89	99	177	6e-05	45.8
APOM|Contig4256_149_40_404	APOM|ALV_LSA_262_B02_B07_X01_SCF	30.53	95	56	4	4	89	35	128	6e-05	45.8
APOM|Contig4254_238_764_1497
I want to delete lines that have nothing after the first tab (in this example, the first and last lines). Note that the number of characters on either side of the first tab varies. Can anyone suggest a way to do this? Awk seems like the natural place to start, but I can't figure out how to skip printing lines when field 2 is empty. Any suggestions would be greatly appreciated.

Thanks!
Kevin
 
12-07-2011, 05:56 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 946
How about
Code:
sed -ne '/\t[^\t\r ]/ p' input-file > output-file
It uses inverted logic: instead of deleting the unwanted lines, it only outputs lines in which a horizontal tab character is followed by a non-whitespace character. I include carriage return (\r) as a whitespace character, so that it will still work correctly even if you happen to have CRLF (\r\n) newlines instead of the more Unix/Linux-traditional LF (\n) only. (The \t and \r escapes are GNU sed extensions, so this assumes GNU sed.)

Note that you can use
Code:
head -n 50 input-file | sed -ne '/\t[^\t\r ]/ p'
to test the script; it only looks at the first 50 lines of your humongous input file and prints the result on screen. If that looks okay, try
Code:
head -n 2000 input-file | sed -ne '/\t[^\t\r ]/ p' | less
which does the same for the first 2000 lines, this time piping the output through a pager. Only then would I use the first command above to create the new output file.

I too work a lot with large files. This kind of testing becomes second nature once you realize how much time it saves compared to repeatedly hitting Ctrl-C to stop the output and closing the terminal window in frustration.
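
As a small self-contained illustration of the print-only-matching approach above, here is a sketch that runs the same kind of filter on a three-line made-up sample. The function name keep_hits and the sample keys are hypothetical. Literal tab and carriage-return characters are spliced in from shell variables, on the assumption that this is safer than relying on sed to interpret \t and \r inside a bracket expression.

```shell
# Literal tab and carriage return, so the pattern does not rely on
# sed interpreting \t or \r escapes.
tab=$(printf '\t')
cr=$(printf '\r')

# keep_hits (hypothetical name): print only lines in which a tab is
# followed by something other than a tab, a carriage return, or a space.
keep_hits() {
    sed -n "/${tab}[^${tab}${cr} ]/p"
}

# Made-up sample: a key with a trailing tab, a full record, a bare key.
printf 'keyA\t\nkeyA\thitB\t96.15\nkeyC\n' | keep_hits
```

Only the middle line of the sample should survive. On the real file the same function could be used as keep_hits < input-file > output-file.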
 
12-07-2011, 05:59 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,833

Rep: Reputation: 1820
Maybe sed? Something like: sed '/^[^\t]*\t$/d' infile
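
A delete-style filter of the kind suggested above can be sketched as follows. This is only an illustration under stated assumptions: the function name drop_key_only and the sample data are made up, and a literal tab is spliced in from a shell variable rather than relying on sed to understand \t.

```shell
# Literal tab character for use inside the sed pattern.
tab=$(printf '\t')

# drop_key_only (hypothetical name): delete lines that consist of
# non-tab characters followed by exactly one tab and then end of line.
drop_key_only() {
    sed "/^[^${tab}]*${tab}\$/d"
}

# Made-up sample: a key with a trailing tab, a full record, a bare key.
printf 'keyA\t\nkeyA\thitB\t96.15\nkeyC\n' | drop_key_only
```

Unlike the print-only-matching approach, this keeps a line that contains no tab at all (keyC in the sample), and it would not match a key-only line that ends in a carriage return, so it may need adjusting depending on what the real file looks like.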
 
12-07-2011, 06:07 PM   #4
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,811
Blog Entries: 1

Rep: Reputation: 1191
You could also try awk:
Code:
awk 'NF!=1{print}' file
 
12-08-2011, 10:22 AM   #5
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Thanks all! sycamorex, would I have to specify a field delimiter for that awk command, or is tab/whitespace the default?
 
12-08-2011, 10:34 AM   #6
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,811
Blog Entries: 1

Rep: Reputation: 1191
No, whitespace (any run of spaces and tabs) is the default field separator, so you don't have to specify it.
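
To make the difference between the default separator and an explicit tab separator concrete, here is a minimal sketch. The function names and sample data are hypothetical. It assumes that a first field could contain a space, which is not the case in the sample posted in this thread, so for that data both variants should behave the same.

```shell
# Default splitting: any run of spaces or tabs separates fields,
# so a line is kept whenever it has anything other than one field.
keep_multi_field() {
    awk 'NF != 1'
}

# Explicit tab separator: keep a line only if its second
# tab-separated field is non-empty.
keep_nonempty_field2() {
    awk -F'\t' '$2 != ""'
}

# Made-up sample: a key containing a space with a trailing tab,
# a full record, and a bare key.
printf 'key A\t\nkeyA\thitB\nkeyC\n' | keep_multi_field
printf 'key A\t\nkeyA\thitB\nkeyC\n' | keep_nonempty_field2
```

With default splitting, the "key A" line counts as two fields and is kept even though nothing follows its tab; with -F'\t' only the full record is printed. The tab-separated form also tests field 2 directly, which is what the original question asked for.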
 
  


All times are GMT -5.
