LinuxQuestions.org
12-13-2011, 10:42 PM   #1
csbushy
LQ Newbie
 
Registered: Dec 2011
Posts: 2

Rep: Reputation: Disabled
Help with awk or sed search.


I am new to scripts and Linux. I have a very large CSV file (over 4 GB) that I currently search for specific characters, writing each matching line to another file. I use grep for this and get a mix of data. I was told that awk or sed would work better, but I have never used them. Each line is supposed to contain 329 items (columns). With the restrictions I am using with grep I have cut the output almost in half, but I am still getting unwanted data. Any information on good beginner books, or any other help, would be greatly appreciated.
 
12-13-2011, 11:11 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 946
What are your criteria in selecting the records to output?

When developing such processing, I often use head -n lines inputfile, where lines is a line count, to try the processing on only the first lines of the file inputfile. (I too handle very large data files, including text data files, routinely.)

Since you are working with tabular data, I would consider using awk. Grep and sed have no real concept of fields, only lines of text; they apply regular expression patterns to the entire line. Awk, however, first splits each record (line) into fields, although you can also access the record as a whole.
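To make the field splitting concrete, here is a minimal sketch. The three-line sample and the expected count of 3 are made up for illustration (the file in the question would use 329); NF is awk's built-in count of fields in the current record.

```shell
# Hypothetical sample; the third line is deliberately one field short.
sample='id1,foo,A1
id2,bar,B2
id3,baz'

# NF is the number of fields in the current record, $1 the first field,
# and $NF the last field.
fields=$(printf '%s\n' "$sample" |
    awk -F, '{ print NR ": " NF " fields, first=" $1 ", last=" $NF }')
printf '%s\n' "$fields"

# List the record numbers whose field count differs from the expected one
# (3 for this sample; it would be 329 for the file described in the question).
bad=$(printf '%s\n' "$sample" | awk -F, -v want=3 'NF != want { print NR }')
printf 'malformed record(s): %s\n' "$bad"
```

A check like the second command is a cheap way to find out whether the unwanted grep output comes from lines that do not have the expected number of columns.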

Assuming all commas in your file are field separators (no text fields containing commas) and all newlines are record separators (no multiline text fields), then you can start with
Code:
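head -n lines big-input-file | awk '
    BEGIN {
        # Record separator: any run of CR and LF characters.
        # A regular expression in RS is an extension (GNU awk, mawk), not POSIX.
        RS="[\r\n]+"
        # Field separator: a comma.
        FS=","
    }

    # Print every record whose third field contains "A".
    ($3 ~ /A/) { print $0 }
'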
After it works correctly, skip the head part, and use
Code:
awk '
    BEGIN {
        RS="[\r\n]+"
        FS=","
    }

    ($3 ~ /A/) { print $0 }
' big-input-file > output-file
to process the entire input file, saving the results to output-file.

The BEGIN rule sets the record separator to match any newline convention (which also skips any empty lines), and the field separator to a comma. This part is run once, before any input files are processed. It is often used to construct tables and other data needed to apply the rules; you can even read and process other files here using a simple loop. (That is useful when you have the patterns saved in another file; that way you can easily use the same script in different situations.)
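A minimal sketch of reading another file in the BEGIN rule follows. The file names wanted.txt and data.csv, their contents, and the variable name list are all made up for the example, and for simplicity it looks up exact field values rather than regular expressions.

```shell
# Create throwaway sample files (hypothetical names and contents).
tmpdir=$(mktemp -d)
printf 'A1\nC3\n' > "$tmpdir/wanted.txt"
printf 'r1,x,A1\nr2,x,B2\nr3,x,C3\n' > "$tmpdir/data.csv"

selected=$(awk -v list="$tmpdir/wanted.txt" '
    BEGIN {
        FS = ","
        # Read the list file line by line before any data is processed,
        # storing each line as a key in the array "wanted".
        while ((getline line < list) > 0)
            wanted[line] = 1
        close(list)
    }

    # Print records whose third field is exactly one of the listed values.
    ($3 in wanted) { print $0 }
' "$tmpdir/data.csv")

printf '%s\n' "$selected"
rm -rf "$tmpdir"
```

Because the list is loaded once into an array, each data record costs only one lookup, no matter how many values the list contains.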

The other rule is applied to each record (line, in this case) of input, one by one. This one is just an example. The rule checks if the third field in the record contains A (here, A is a regular expression, like grep and sed patterns), and if and only if so, outputs the record. (When used this way, ^ means the start of the field, and $ the end of the field.)
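Here is a small sketch of how the anchors behave when the pattern is applied to a single field; the four sample records are invented for the example. Note that r4 has an A in its second field, so grep A would select that line, but none of these rules do.

```shell
# Hypothetical sample: record name, filler, and the field being tested.
sample='r1,x,A
r2,x,BAT
r3,x,APPLE
r4,A,zzz'

# Third field contains A anywhere.
contains=$(printf '%s\n' "$sample" | awk -F, '$3 ~ /A/   { print $1 }' | paste -sd' ' -)
# Third field starts with A.
starts=$(printf '%s\n' "$sample"   | awk -F, '$3 ~ /^A/  { print $1 }' | paste -sd' ' -)
# Third field is exactly A.
exact=$(printf '%s\n' "$sample"    | awk -F, '$3 ~ /^A$/ { print $1 }' | paste -sd' ' -)

printf 'contains: %s\nstarts:   %s\nexact:    %s\n' "$contains" "$starts" "$exact"
# contains: r1 r2 r3
# starts:   r1 r3
# exact:    r1
```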

If you supply multiple rules, they all will be applied against each record, although you can use next to tell awk to skip the rest of the rules and continue with the next record instead.
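As a rough illustration of several rules plus next, assuming a made-up input in which lines starting with # are comments:

```shell
# Hypothetical sample with one comment line.
sample='# comment line
r1,x,A
r2,x,B
r3,x,A'

out=$(printf '%s\n' "$sample" | awk -F, '
    /^#/      { next }                  # comment: skip all remaining rules
              { seen++ }                # no pattern: runs for every other record
    $3 == "A" { print "match: " $1 }    # also tested against the same record
    END       { print "records: " seen }
')

printf '%s\n' "$out"
# match: r1
# match: r3
# records: 3
```

The comment line is not counted, because next moved on to the following record before the counting rule was reached.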

All awk variants I know stream their input: they read, process, and output the data record by record. Thus, even a complex awk script will probably need very little memory. I suspect a simple grep command is much faster, though. I've never run into input size limits using awk; even several dozen gigabytes should be no problem for you. (Except that it will take a while to run, of course. If you need something faster, it is time to use good old C, in my experience.)

You can read ~ as "left side matches the regular expression pattern on the right side", and !~ as "left side does not match the regular expression on the right side". You can also use == and != to compare exact strings.
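The four operators can be compared side by side with a small sketch like this one; the sample values A, AB and B are invented, and the output lists, for each third field, which of the tests against A succeed.

```shell
# Hypothetical sample; only the third field matters here.
sample='r1,x,A
r2,x,AB
r3,x,B'

out=$(printf '%s\n' "$sample" | awk -F, '{
    printf "%s:", $3
    if ($3 ~ /A/)   printf " ~"     # regular expression matches
    if ($3 !~ /A/)  printf " !~"    # regular expression does not match
    if ($3 == "A")  printf " =="    # exact string is equal
    if ($3 != "A")  printf " !="    # exact string differs
    printf "\n"
}')

printf '%s\n' "$out"
# A: ~ ==
# AB: ~ !=
# B: !~ !=
```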

Awk can also manipulate individual fields in the input records; I guess therein lies its main power. I'd need to know exactly what you would like to do with your input to be more specific.
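Just as a generic sketch of field manipulation (the sample data and the choice of upper-casing the second field are arbitrary): assigning to a field makes awk rebuild the record using OFS, the output field separator, so OFS is set to a comma here to keep the output in CSV form.

```shell
# Hypothetical two-record sample.
sample='r1,alpha,A
r2,beta,B'

out=$(printf '%s\n' "$sample" | awk '
    BEGIN { FS = OFS = "," }        # read and write comma-separated fields
    {
        $2 = toupper($2)            # modify one field in place
        print                       # print the rebuilt record
    }
')

printf '%s\n' "$out"
# r1,ALPHA,A
# r2,BETA,B
```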

As to guides, I use GNU Awk User's Manual as my reference, but note that many features are GNU awk specific. (The manual states the differences in each case, if you read carefully). As to awk tutorials, I'd start by doing a web search on awk tutorial.

Perhaps others could point you to known good tutorials or guides?
 
2 members found this post helpful.
12-13-2011, 11:12 PM   #3
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 370
You're very "light" on the specifics of what data you're trying to match, its position in the line, and/or any relation it has to other data in the line. So, all I can offer is a general suggestion:

regular expressions

If you're not already familiar with them, there's a bit of a learning curve. Here are two links to some regular expression guides I found with a quick search:
Regular Expressions - grymoire.com
Regular Expressions - User Guide - zytrax.com

One thing to note: not all applications implement the same regular expression syntax and feature set. I think the grymoire web page touches on that a little bit.

Last edited by Dark_Helmet; 12-13-2011 at 11:15 PM.
 
1 members found this post helpful.
12-13-2011, 11:25 PM   #4
csbushy
LQ Newbie
 
Registered: Dec 2011
Posts: 2

Original Poster
Rep: Reputation: Disabled
Thanks to Nominal Animal and Dark Helmet. This gives me great information to get started with and I will look into your referenced book and links. Again thank you very much.
 
12-13-2011, 11:54 PM   #5
Telengard
Member
 
Registered: Apr 2007
Location: USA
Distribution: Kubuntu 8.04
Posts: 579
Blog Entries: 8

Rep: Reputation: 147
Quote:
Originally Posted by Nominal Animal
As to guides, I use GNU Awk User's Manual as my reference, but note that many features are GNU awk specific. (The manual states the differences in each case, if you read carefully). As to awk tutorials, I'd start by doing a web search on awk tutorial.

Perhaps others could point you to known good tutorials or guides?
Agreed. The GNU Awk User's Manual is probably the single most complete AWK reference I've found to date. It is quite well written and dense with information.

As for tutorials, I've got this in my bookmarks.

UNIX tips and tricks for a new user, Part 3: Introducing filters and regular expressions
 
12-14-2011, 04:46 AM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,576
Blog Entries: 31

Rep: Reputation: 1195
Quote:
Originally Posted by csbushy
Thanks to Nominal Animal and Dark Helmet. This gives me great information to get started with and I will look into your referenced book and links. Again thank you very much.
If you get stuck, please ask again and include some samples of the file contents.
 
  


All times are GMT -5.