I would like to set up a shell script that writes into a new file only those lines where the first, the 4th, the 5th and the 6th field are the same but the 7th field is different. Do I need to somehow set up an index? I have tried to solve this with Perl scripts, but a comparison using indices is much too slow and swallows up all my RAM. I wonder whether I can use awk for this?
Thanks for your help.
I have changed the data now to fit the requirements. If you look at lines 3 and 4 and the last two lines, they fulfil the criteria mentioned in my first post, so they should be written to a new file. How would you do this with awk? I have no problem with comparisons within the same line, but here I would have to compare variables from two lines, and my knowledge of awk is not sufficient for that. Can you give me a code example? By the way, the list is always sorted in increasing order, so comparisons would always be done on successive lines. Thanks.
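Since the list is sorted, matching lines sit next to each other, so a single awk pass that compares each line with the previous one is enough. The sketch below is untested and assumes whitespace-separated fields; note that it prints the two lines on either side of each change in field 7, so if every member of a larger group must appear, the logic would need to remember the whole group.
Code:
#!/usr/bin/awk -f
# Sketch: assumes whitespace-separated fields and that lines sharing
# fields 1, 4, 5 and 6 are adjacent (sorted input).
{
    key = $1 SUBSEP $4 SUBSEP $5 SUBSEP $6
    if (NR > 1 && key == prev_key && $7 != prev_f7) {
        if (!prev_printed) print prev_line   # avoid printing the shared line twice
        print
        prev_printed = 1
    } else {
        prev_printed = 0
    }
    prev_key = key
    prev_f7 = $7
    prev_line = $0
}
It could be run as "awk -f compare.awk infile > outfile" (compare.awk being whatever name the script is saved under).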
I think you mean that you want to print rows that share the same data in their 1st, 4th, 5th and 6th fields.
Try this; I haven't tested it and I'm a newbie, but the theory is sound (I hope):
Code:
sort -u sourcefile.txt > fileone.txt
N=0
while read data ; do
    N=$((N+1))
    echo "Writing file: file$N.$data.txt"
    grep -i "$data" sourcefile.txt > "file$N.$data.txt"   # quoted in case $data contains spaces
done < fileone.txt
It should write your output to files named "file<number>.<data-string>.txt" and echo each data string to the screen as it is written.
The original file should not be overwritten, but back it up first anyway.
Edit to add: you would still need a step that counts the number of lines in each output file and keeps only those with multiple data lines, i.e. the ones that can match your criteria.
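That counting step might look something like this (untested, and it assumes the file*.txt outputs from the loop above are the only files matching that pattern in the directory):
Code:
# Keep only the output files holding more than one line, i.e. the
# groups that could still contain the duplicates being looked for.
for f in file*.txt; do
    if [ "$(wc -l < "$f")" -le 1 ]; then
        rm -f "$f"
    fi
done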
I have a project to finish but if I get time I'll take another look at it for you.
Last edited by diondeville; 05-27-2010 at 05:53 AM.
Hello all. Thanks for your patience with my insufficient description of the data. The lines in the original file I am working with are actually not sorted, so they could be in any order; I only sorted them for the example here. The crucial thing is that the line(s) should only be printed to the outfile if the 7th field does not match. As I mentioned, indexing has not worked for me so far, since the files are very big (up to 20 GB) and I would like to process them in one go. It would be great to get a solution with the Linux shell. Thanks.
You say the file is not sorted, so the lines we need to investigate will not necessarily be consecutive. Correct?
Also, do you need to print both the original line and the one whose 7th field does not match?
Plus, is it likely that there is more than one line matching the first set of fields with the 7th not matching?
Should all of these be printed?
Since, as you have stated, the file is so large, you will need to provide a more detailed explanation of the criteria you require.
I think we all understand the matching criteria as follows:
Fields 1, 4, 5 and 6 all equal across lines but field 7 not equal.
As above, we now need more clarification on what to print and how it should be located.
I haven't tested it, and some of the scripting might be wrong (I'm sure someone will correct it for us), but the process is valid (and long-winded).
Run it from the same directory as your source file and change 'sourcefile.txt' at the top to the name of your source file. Other than coding corrections, it should work (I hope, given it's taken me an hour or so to write).
Let us know how it works.
Code:
#! /bin/bash
cp sourcefile.txt sourcecopy.txt
sed -i 's#^[ \t]*##;s#[ \t]*$##g' sourcecopy.txt # delete white space from the beginning and end of each line
sort sourcecopy.txt > tmp.txt
sed -i '/^$/d' tmp.txt # delete blank lines
cp tmp.txt sourcecopy.txt
sort -u sourcecopy.txt > fileone.txt
### We now have three files: sourcecopy.txt, fileone.txt and sourcefile.txt
N=0
while read data ; do
N=$((N+1))
##### Remove one occurrence of the data from the source file (copy).
sed -i 's/'"$data"'//1' sourcecopy.txt
##### Reformat the data
echo $data > tempfile.txt
sed -i "s/ /\n/g" tempfile.txt # give each field its own line
sed -i '/^$/d' tempfile.txt # delete blank lines, shouldn't be any but we never know...
sed -i 's#^[ \t]*##;s#[ \t]*$##g' tempfile.txt # delete white space from the beginning and end of each line (uniformity is essential)
sed -i "s/$/ /g" tempfile.txt # add a space to the end of each line
##### Split & filter the data
head -n 1 tempfile.txt > fieldone.txt # move the top line (field one) to a file of its own
sed -i '1d' tempfile.txt # delete the top line (field one) from the file
touch fieldtwo.txt # create blank file for field two
echo "[^ ]* " > fieldtwo.txt # placeholder pattern: match any value in field two
sed -i '1d' tempfile.txt # remove top line (field 2) from tempfile.txt
touch fieldthree.txt # create blank file for field three
echo "[^ ]* " > fieldthree.txt # placeholder pattern: match any value in field three
sed -i '1d' tempfile.txt # remove top line (field 3) from tempfile.txt
head -n 1 tempfile.txt > fieldfour.txt
sed -i '1d' tempfile.txt
head -n 1 tempfile.txt > fieldfive.txt
sed -i '1d' tempfile.txt
head -n 1 tempfile.txt > fieldsix.txt
sed -i '1d' tempfile.txt
touch fieldseven.txt # create blank file for field seven
echo "[^ ]*" > fieldseven.txt # placeholder pattern: match any value in field seven
sed -i '1d' tempfile.txt # remove top line (field 7) from tempfile.txt
#### Recombine the data
paste -d "" fieldone.txt fieldtwo.txt fieldthree.txt fieldfour.txt fieldfive.txt fieldsix.txt fieldseven.txt > rebuild.txt
NEW=$(< rebuild.txt) # read the rebuilt search pattern itself, not the file name
echo "Writing file: file$N.$data.txt"
#### Compare the recombined data with the sourcefile and place data with matching 1st, 4th, 5th and 6th fields into a separate file
grep -i "$NEW" sourcecopy.txt > "file$N.$data.txt"
done < fileone.txt
Well, given that the O.P. mentioned that the input files are on the order of 20 GB, I suspect that any bash/sort/etc. solution would be either too slow or require too much memory to be practical.
If I were to try to solve the problem, I think that indexing the data using, e.g., SQLite and then printing the line where the index count was greater than one would be a much more usable solution.
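In rough outline that could look like the following (a sketch only: the database name, table name, column names and the tab-separated input file data.tsv are all made up, and the import step would have to match the real field layout):
Code:
sqlite3 dups.db <<'SQL'
CREATE TABLE t(f1, f2, f3, f4, f5, f6, f7);
.mode tabs
.import data.tsv t
-- print every line whose key (fields 1, 4, 5, 6) occurs with more
-- than one distinct value in field 7
SELECT t.* FROM t
JOIN (SELECT f1, f4, f5, f6 FROM t
      GROUP BY f1, f4, f5, f6
      HAVING COUNT(DISTINCT f7) > 1) d
  ON t.f1 = d.f1 AND t.f4 = d.f4 AND t.f5 = d.f5 AND t.f6 = d.f6;
SQL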
Depending on how many lines are to be printed (and thus the data size that must be held), content-addressable arrays might be viable -- algorithmically similar to indexing with SQLite?
The problem that I see is that the duplicated key with a unique field-7 combination can occur anywhere in the 20 GB file. Here's a simple AWK program that works well for the small sample data the OP provided, but I strongly suspect that it would "choke" on a 20 GB file, even if the OP had a 100 GB swap file.
Code:
#!/bin/gawk -f
{
    # Index every line by a composite key built from fields 1, 4, 5 and 6,
    # counting occurrences of each key and of each (key, field 7) pair.
    key = $1 SUBSEP $4 SUBSEP $5 SUBSEP $6
    count[key] += 1
    count2[key, $7] += 1
    record[key, $7] = record[key, $7] SUBSEP $0
}
END {
    # For every key seen more than once, print the stored lines whose
    # (key, field 7) combination occurred exactly once.
    for (key in count) {
        if (count[key] > 1) {
            split(key, dups, SUBSEP)
            for (key2 in count2) {
                if (count2[key2] != 1) continue
                split(key2, dup2, SUBSEP)
                matched = 1
                for (i = 1; i < 5; ++i) {
                    if (dup2[i] != dups[i]) {
                        matched = 0
                        break
                    }
                }
                if (matched == 1) {
                    ndup = split(record[key, dup2[5]], line, SUBSEP)
                    for (n = 2; n <= ndup; ++n) {
                        print line[n]
                    }
                }
            }
        }
    }
}
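Assuming the program is saved as, say, finddups.awk (both file names below are placeholders), it can be run directly, or via gawk -f if /bin/gawk is not where the shebang expects it:
Code:
chmod +x finddups.awk
./finddups.awk sample_data.txt > matches.txt
# or: gawk -f finddups.awk sample_data.txt > matches.txt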
<edit>
For what it's worth, here's the output from the OP's sample data:
Sorry I am only replying now; I have been travelling and wasn't online over the weekend. The suggestions in the previous two posts are both great solutions and both work in principle. PTrenholme, I tried the script on a 4 GB file but had to stop it after an hour (I was running it on my laptop, which I had to switch off); I will try to run it again overnight on my Linux desktop. grail's solution seems to be the way to go in terms of speed; sorting the data beforehand is of course quite quick. I will let you know whether everything works on the large files as well, as soon as I've had the chance to test it.
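grail's actual script is not quoted in this excerpt, but a sort-then-stream pipeline along the following lines is one common way to handle data of this size; GNU sort merges its work through temporary files on disk, so the whole 20 GB never has to fit in RAM (the file and directory names below are placeholders):
Code:
# Sort on fields 1, 4, 5, 6 (and 7) so candidate duplicates become adjacent.
LC_ALL=C sort -k1,1 -k4,4 -k5,5 -k6,6 -k7,7 -T /path/to/big/tmp bigfile.txt > sorted.txt
# sorted.txt can then be streamed through an adjacent-line comparison,
# e.g. the short awk sketch earlier in the thread.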