LinuxQuestions.org
Old 06-03-2010, 01:58 PM   #1
JoshConsulting
LQ Newbie
 
Registered: Jun 2010
Posts: 4

Rep: Reputation: 0
Quick, simple, shell command question


As a seasoned Windows veteran, posting in a Linux newbie forum just feels... Odd

Anyway, I've recently begun working with large (>50GB) text files, and Windows tools just haven't been able to cut it. Cygwin has proven amazingly useful for quickly searching (grep), sorting (sort), combining (cat), and other operations on files far too large to load into my paltry 24GB of RAM.

Unfortunately, my lack of experience with commands really hampers me; I usually have to Google around to find something similar to what I need, and tweak it to fit my needs. But I simply couldn't locate anything like my current request.

Basically, I need something that can go through a text file, remove any lines longer than a specified length (roughly 30 characters, in this case) and write the results to another text file. I don't mean truncate or blank the lines - I need them completely removed. I also need to do something similar to lines shorter than a certain length (7), but I expect it will be fairly easy to switch the two around.

Any advice? Remember, I need something that avoids loading the whole file into memory, or the program will simply crash. I have a few TB of disk space and a RAID array with ~500 MB/s read/write speeds, so parsing the file on disk won't be a big deal.

Summary: Need a shell command to delete lines longer than a specified length from a text file.

Thanks for the help!
 
Old 06-03-2010, 02:13 PM   #2
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 164
The command you're going to want here is either sed or awk...

The sed is easy and I know it off top of my head so...

Code:
sed -n '/^.\{7\}/p' | sed -n '/^.\{30\}/!p'
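Which basically says (run silently) 'if the line has 7 or more chars counting from the beginning of the line, print it' and then 'if the line has 30 or more chars counting from the beginning of the line, don't print it'. Note that this also drops lines of exactly 30 chars; use \{31\} in the second command if those should be kept.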

so... testing--
Code:
core:~$ cat t.fi
1
12
123
1234
12345
123456
1234567
12345678
123456789
1234567890
core:~$ cat t.fi | sed -n '/^.\{3\}/p' | sed -n '/^.\{8\}/!p'
123
1234
12345
123456
1234567
core:~$
See also: http://sed.sourceforge.net/sed1line.txt for a great number of spectacular sed one liners that can be modified to do a lot of common tasks.

Edit: Also-- Welcome to LQ! Don't feel bad about not knowing Linux commands as well as Windows ones; you'll learn, you're already well on the road, and you're doing the right thing to learn those commands.

Edit2: You might want to redirect the output to another file "> clean-file.txt" or what have you, or alternatively make sed do the changes in place (-i), but make sure you've got good backups!
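
A minimal sketch of the redirect described in Edit2, run against a small made-up sample rather than the real data. The file names sample.txt and clean-file.txt and the scratch directory are placeholders. It folds both filters into a single sed call and uses 31 instead of 30, on the assumption that lines of exactly 30 characters should be kept.

```shell
# Scratch directory so nothing real is overwritten (assumes mktemp is available).
workdir=$(mktemp -d)

# Hypothetical sample input: lines of 3, 7, 30 and 31 characters.
printf '%s\n' 'abc' '1234567' \
  '123456789012345678901234567890' \
  '1234567890123456789012345678901' > "$workdir/sample.txt"

# First expression: delete lines that have 31 or more characters.
# Second expression: delete lines that do NOT have at least 7 characters.
sed -e '/^.\{31\}/d' -e '/^.\{7\}/!d' "$workdir/sample.txt" > "$workdir/clean-file.txt"

cat "$workdir/clean-file.txt"
```

sed reads the input one line at a time, so the same command should behave the same way on a file too large to fit in memory; only the run time grows.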

Last edited by rweaver; 06-03-2010 at 02:28 PM.
 
Old 06-03-2010, 02:20 PM   #3
Samotnik
Member
 
Registered: Jun 2006
Location: Belarus
Distribution: Debian GNU/Linux testing/unstable
Posts: 471

Rep: Reputation: 40
awk is a pretty powerful language for text processing. It can do much more than simply remove long lines, if you learn it.
E.g. here is a short command that removes all lines longer than 30 chars from a text file.
Code:
cat <filename> | awk 'length($0) <= 30' > <newfile>
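
The opposite case from the original question (dropping lines shorter than 7 chars) only needs the comparison flipped. A small sketch, assuming temporary files created with mktemp stand in for the real input and output; the variable names input and output are illustrative only.

```shell
# Temporary stand-ins for the real input and output files.
input=$(mktemp)
output=$(mktemp)

# Made-up sample lines of 6, 7 and 8 characters.
printf '%s\n' '123456' '1234567' '12345678' > "$input"

# Keep only lines that are at least 7 characters long.
awk 'length($0) >= 7' "$input" > "$output"

cat "$output"
```

A pattern with no action block makes awk print every line for which the pattern is true, which is why no explicit print is needed.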
 
Old 06-03-2010, 02:25 PM   #4
JoshConsulting
LQ Newbie
 
Registered: Jun 2010
Posts: 4

Original Poster
Rep: Reputation: 0
Wow, thanks for the fast help rweaver and Sam.

Code:
sed -n '/^.\{7\}/p' | sed -n '/^.\{30\}/!p'
At the moment, my problem is not knowing the abbreviations and commands you used, specifically "/^.\{7\}/p". While I plan to eventually learn the nuances of Linux scripting, I'm trying to do this first and I feel better when I know what's going on behind the scene.

As such, this looks perfect:

Code:
cat <filename> | awk 'length($0) <= 30' > <newfile>
I suspected it would be that simple. I had spent a while googling awk but couldn't find documentation on controlling line length. Thanks again for the help, I'm really learning to like the command shell


I'll run it and see if it does what I need, it takes a while to parse a 60-odd GB file...
 
Old 06-03-2010, 02:28 PM   #5
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 164
and the corresponding page for awk one liners-- http://www.catonmat.net/blog/wp-cont...9/awk1line.txt

I'll add that the awk & sed book by O'Reilly is pretty fair, although I don't know if it's still in print or been updated recently.
 
Old 06-03-2010, 02:36 PM   #6
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 164
Quote:
Originally Posted by JoshConsulting
Wow, thanks for the fast help rweaver and Sam.

Code:
sed -n '/^.\{7\}/p' | sed -n '/^.\{30\}/!p'
At the moment, my problem is not knowing the abbreviations and commands you used, specifically "/^.\{7\}/p". While I plan to eventually learn the nuances of Linux scripting, I'm trying to do this first and I feel better when I know what's going on behind the scene.
The important bit here is
^ <- start of line
{7} or {30} <- number of matches required to be true
. <- match any char
p or !p <- print / don't print if true
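
These pieces can also be combined into one pattern. A sketch, assuming a POSIX-style sed: \{7,30\} is the two-number form of the repeat count (at least 7, at most 30), and a trailing $ anchors the end of the line so that longer lines fail to match. The sample lines are made up.

```shell
# ^ start of line, . any char, \{7,30\} between 7 and 30 repeats, $ end of line.
# Input lines are 6, 7 and 31 characters long.
filtered=$(printf '%s\n' '123456' '1234567' '1234567890123456789012345678901' |
  sed -n '/^.\{7,30\}$/p')

printf '%s\n' "$filtered"
```

Without the $ the pattern would only require that the line starts with 7 to 30 characters, so a 31-character line would still match.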

the \ are just escape chars for the shell.

In my opinion, 7/10 times the sed is far simpler than the awk, in this case not so much... but that may be because I'm very familiar with regular expressions which many unix apps use (grep, awk, sed, perl, etc.)

Check out the two one liner pages, just getting a basic understanding of the apps from them will massively improve your ability to manipulate data how you want.
 
Old 06-03-2010, 03:10 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,066
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by rweaver
the \ are just escape chars for the shell.
Actually ... you've used single quotes, so the
\ *aren't* for the shell. They're for sed. If you
used the -r flag w/ sed you could skip the \ ...
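
A sketch of the same test from earlier in the thread with the -r flag, assuming GNU sed (where -r selects extended regular expressions, so the braces need no backslashes). The input lines are made up.

```shell
# Keep lines with at least 3 chars, then drop lines with 8 or more chars.
result=$(printf '%s\n' '12' '123' '1234567' '12345678' |
  sed -r -n '/^.{3}/p' | sed -r -n '/^.{8}/!p')

printf '%s\n' "$result"
```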



Cheers,
Tink
 
Old 06-03-2010, 04:17 PM   #8
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 164
Doh! my bad, wasn't paying attention to my quote marks! doh.
 
Old 06-03-2010, 07:13 PM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,256

Rep: Reputation: 2686
I would just add that the use of cat with either sed or awk in this situation is not required.
Simplest awk I can think of is:
Code:
awk 'length > 6 && length < 31' file > new_file
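
Before committing to a multi-hour run on a 60 GB file, it may be worth checking the filter on a few lines of known length. A sketch, assuming both bounds must hold at the same time (hence &&); the min and max variable names and the sample lines are illustrative only.

```shell
# Made-up lines of 6, 7, 30 and 31 characters.
kept=$(printf '%s\n' '123456' '1234567' \
  '123456789012345678901234567890' \
  '1234567890123456789012345678901' |
  awk -v min=7 -v max=30 'length($0) >= min && length($0) <= max')

# Only the 7- and 30-character lines should survive.
printf '%s\n' "$kept"
```

Joining the two tests with || instead would keep every line, since any length is either above the lower bound or below the upper one.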
 
  


All times are GMT -5.