Old 09-19-2012, 02:54 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 183

Rep: Reputation: Disabled
finding and filtering data between two files


Hi guys,

I have two files that I need to "find and filter" data between. These files are way too large to import into Excel, where I think I could do this with a lookup function.

anyways

The first file I'll call list.dat. It contains a single column of data that can have as many as a hundred or so records, and it'll look like this:

4327
4339
4361


The second file has 54 columns of space- or comma-separated data (I know beforehand whether it's comma- or space-separated). I'll call it big.dat. It'll look something like this, where the ellipses go all the way out to 54 columns, and it can have hundreds of thousands of records:


header1 header2 header3 header4 header5 header6 header7 ...
234910 432 0 10 1.3365 956385 -13.188 ... out to 54
1 4320
234910 432 0 10 1.3365 905931 -13.188 ... out to 54
1 4327
234911 432 0 10 1.4369 00000 -10.188 ... out to 54
1 4331
234912 432 0 10 0.3135 111375791 18.188 ... out to 54
1 4334
234913 432 0 10 1.6365 86790 -19.188 ... out to 54
1 4335
234915 432 0 10 1.3765 00000 0.188 ... out to 54
1 4339 209 0001
234919 432 0 10 2.3385 231357976 -11.188 ... out to 54
1 4349 324 0001
234920 432 0 10 1.3365 00000 -1.188 ... out to 54
1 4359 111 0001
234921 432 0 10 5.3365 00000 10.188 ... out to 54
1 4360 875 0001
234933 432 0 10 1.3065 00000 -0.198 ... out to 54
1 4361 234 0001
234978 432 0 10 3.3311 53211856 -13.188 ... out to 54
1 4441 676 0001

What I need to do is take the first value from list.dat, look in the 2nd (and only the 2nd) column of every other line of big.dat, and see if that number exists. If that number exists in the 2nd column, then write the preceding line to a file I'll call small.dat. Then go to the next number in list.dat and repeat the process. So in the example given, small.dat would look like:

234910 432 0 10 1.3365 905931 -13.188 ...
234915 432 0 10 1.3765 00000 0.188 ...
234933 432 0 10 1.3065 00000 -0.198 ...

I think the beginnings of it are something like this:

Code:
awk '$2 == NUMERICAL_STRING { print FILENAME, $0 }' big.dat
but I don't know how to skip lines, I don't know how to pass the values contained in list.dat into NUMERICAL_STRING, and there's probably something needed between FILENAME and $0.

thanks so much for whatever help you guys can give, I'll soooo much appreciate it.

Tabby
 
Old 09-19-2012, 04:54 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
What about this?
Code:
awk 'FNR == NR {_[$1]++} FNR < NR && FNR > 1 {line = $0; getline; if ( $2 in _ ) print line}' list.dat big.dat > small.dat
It assumes that the structure of big.dat is exactly as shown, with the first line (and only the first line) containing a header.
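In case the one-liner is hard to parse, here is the exact same logic broken out with comments:

Code:
awk '
    FNR == NR { _[$1]++ }            # 1st file (list.dat): remember every value
    FNR < NR && FNR > 1 {            # 2nd file (big.dat), skipping the header
        line = $0                    # stash the long data line
        getline                      # read the short line that follows it
        if ( $2 in _ ) print line    # wanted value in column 2? print the stash
    }' list.dat big.dat > small.dat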
 
Old 09-19-2012, 07:13 PM   #3
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,260

Rep: Reputation: 2328
Code:
grep -B1 <pattern> <in_file> | head -1 > out_file
For speed/efficiency on really large input files, I'd write a sliding-window program in Perl.
I'd store the patterns_to_match in a hash, and then check each line against it while reading through the input data file only once.
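If you'd rather stay with awk, a rough sketch of that same hash idea looks like this (untested; note it checks column 2 of every line, not just every other one, so make sure the long data lines can't collide with your values):

Code:
awk 'NR == FNR { want[$1]; next }    # 1st pass: hash the values from list.dat
     ($2 in want) { print prev }     # known value in column 2: print the stored line
     { prev = $0 }                   # remember this line for the next record
    ' list.dat big.dat > small.dat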
 
Old 09-19-2012, 09:45 PM   #4
atjurhs
Member
 
Registered: Aug 2012
Posts: 183

Original Poster
Rep: Reputation: Disabled
colucix, I'll give that a try tomorrow and let you know

Chris, that may work but is waaaaaay above my ability. I know just a little awk, a little sed, and a little bash shell. I don't know anything about Perl.

Tabby
 
Old 09-20-2012, 12:29 AM   #5
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,260

Rep: Reputation: 2328
Well, you may not need that level of efficiency, but Perl is really good at data wrangling (relevant to this case: hashes are a basic data type in Perl).
 
Old 09-23-2012, 04:16 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1958
The only real tricky thing about this problem is that you have to move back one line from the matched one for printing. Since most of the traditional text matching tools process files unidirectionally, this generally means that each line has to be saved temporarily in a variable or similar so that it's still available for printing when the next line is checked.

Other than that it's just a matter of looping through the values contained in the first file.

Anyway, the most efficient way is certainly to use a tool like awk or perl that can integrate both actions (see colucix's post). But even a simple shell loop could be used, with a variety of tools available for the matching, as long as you don't mind running a separate process for each search.

Here are a few variations that I quickly whipped up.

Code:
while read -r val; do

	#choose your poison (use just one of these):
	echo "/^[0-9][ ]\+$val/-1p" | ed -s mainfile.txt                              # ed: print the line before the match
	sed -rn "/^[0-9][ ]+$val/!h ; /^[0-9][ ]+$val/{g;p}" mainfile.txt             # sed: hold-buffer trick
	awk -v m="$val" '($2 != m) { ln=$0 } ; ($2 == m) { print ln }' mainfile.txt   # awk: stash the line in a variable
	grep -E -B1 "^[0-9][ ]+$val" mainfile.txt | head -n 1                         # grep: match plus previous line, keep the first

done <searchfile.txt >outfile.txt
The regex I used assumes that the line always starts with a single digit (the lone 1 in your sample), then one or more spaces, followed by the search value. You can adjust this to fit your actual format.

One thing I love about ed is that the line addressing syntax lets you easily move back one line from the match for printing/editing.

For the sed and awk versions, OTOH, we have to stash the line away on non-matches, and print it on matches. With sed we work via the hold buffer; with awk, via a simple variable.

awk is also nice in that it doesn't require a fancy regex, just the raw value passed to it from the shell via the -v option. It would also be useful for matching higher numbered fields, where the regex would get uglier.
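For instance, if your file turns out to be the comma-separated kind, the awk version should only need a field-separator switch (an untested sketch, same assumptions as above):

Code:
awk -F',' -v m="$val" '($2 != m) { ln=$0 } ; ($2 == m) { print ln }' mainfile.txt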

As first posted by chrism01, grep's -B1 option prints the matching line plus the one before it, and head subsequently filters out the unwanted line.
 
Old 09-23-2012, 04:59 AM   #7
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 671
The grep | head example won't work because head will print out only the very first line.

You could massage the first file to create patterns for grep to use with the -f option. Then pipe the output through sed or awk to pass through only the odd lines.

Code:
sed 's/.*/. &/' file >patterns                                    # turn each value into a pattern like ". 4327"
grep -B1 --no-group-separator -f patterns file2 | sed -n '1~2p'   # -B1 pairs each match with the line before it; keep the odd lines
The patterns file will prevent matches on odd-numbered lines, when one contains that number. When using sed or awk or even perl, you have to be careful your patterns don't miss lines or have false matches. If the first file may have lines like "13 1263", then you want a pattern like "[0-9]* 1263".
Also consider running the first file through "sort -u" to eliminate duplicates. A duplicate could cause the same lines to be printed more than once.
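Something like this, for example (using the same file names as above):

Code:
sort -u file > file.uniq && mv file.uniq file   # de-duplicate the values file in place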

Last edited by jschiwal; 09-23-2012 at 05:50 AM.
 
  

