I would like to set up a shell script that writes into a new file only those lines where the first, the 4th, the 5th and the 6th field are the same but the 7th field is different. Do I need to somehow set up an index? I have tried to solve this with Perl scripts, but a comparison using indices is much too slow and swallows up all my RAM. I wonder whether I can use awk for this?
Thanks for your help.
I have changed the data now to fit the requirements. If you look at lines 3 and 4 and the last two lines, they fulfil the criteria mentioned in my first post, so they should be written to a new file. How would you do this with awk? I have no problem with comparisons within the same line, but here I would have to compare variables from two lines, and my knowledge of awk is not sufficient for that. Can you give me a code example? By the way, the list is always sorted in increasing order, so comparisons would always be done on successive lines. Thanks.
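Since the list is sorted, matching lines sit next to each other, so a single awk pass that compares each line with the previous one is enough. The sketch below is untested and assumes whitespace-separated fields; note that it prints the two lines on either side of each change in field 7, so if every member of a larger group must appear, the logic would need to remember the whole group.
Code:
#!/usr/bin/awk -f
# Sketch: assumes whitespace-separated fields and that lines sharing
# fields 1, 4, 5 and 6 are adjacent (sorted input).
{
    key = $1 SUBSEP $4 SUBSEP $5 SUBSEP $6
    if (NR > 1 && key == prev_key && $7 != prev_f7) {
        if (!prev_printed) print prev_line   # avoid printing the shared line twice
        print
        prev_printed = 1
    } else {
        prev_printed = 0
    }
    prev_key = key
    prev_f7 = $7
    prev_line = $0
}
It could be run as "awk -f compare.awk infile > outfile" (compare.awk being whatever name the script is saved under).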
I think you mean that you want to print rows that share the same data in their 1st, 4th, 5th and 6th fields.
Try this; I haven't tested it and I'm a newbie, but the theory is sound (I hope):
Code:
sort -u sourcefile.txt > fileone.txt
N=0
while read data ; do
    N=$((N+1))
    echo "Writing file: file$N.$data.txt"
    grep -i "$data" sourcefile.txt > "file$N.$data.txt"   # quoted in case $data contains spaces
done < fileone.txt
It should write your output to files named "file<number>.<data-string>.txt" and echo each data string to the screen as it is written.
The original file should not be overwritten, but back it up first anyway.
Edit to add: you would still need a step that counts the number of lines in each output file and keeps only those with multiple data lines, i.e. the ones that can match your criteria.
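That counting step might look something like this (untested, and it assumes the file*.txt outputs from the loop above are the only files matching that pattern in the directory):
Code:
# Keep only the output files holding more than one line, i.e. the
# groups that could still contain the duplicates being looked for.
for f in file*.txt; do
    if [ "$(wc -l < "$f")" -le 1 ]; then
        rm -f "$f"
    fi
done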
I have a project to finish but if I get time I'll take another look at it for you.
Last edited by diondeville; 05-27-2010 at 05:53 AM.
Hello all. Thanks for your patience with my insufficient description of the data. The lines in the original file I am working with are actually not sorted, so they could be in any order; I only sorted them for the example here. The crucial thing is that the line(s) should only be printed to the outfile if the 7th field does not match. As I mentioned, indexing has not worked for me so far, since the files are very big (up to 20 GB) and I would like to process them in one go. It would be great to get a solution with the Linux shell. Thanks.
You say the file is not sorted, so the lines we need to investigate will not necessarily be consecutive. Correct?
Also, do you need to print both the original line and the one whose 7th field does not match?
Plus, is it likely that there is more than one line matching the first set of fields with the 7th not matching?
Should all of these be printed?
Since, as you have stated, the file is so large, you will need to provide a more detailed explanation of the criteria you require.
I think we all understand the matching criteria as follows:
Fields 1, 4, 5 and 6 all equal across lines but field 7 not equal.
As above, we now need more clarification on what to print and how it should be located.
I haven't tested it, and some of the scripting might be wrong (I'm sure someone will correct it for us), but the process is valid (and long-winded).
Run it from the same directory as your source file and change 'sourcefile.txt' at the top to the name of your source file. Other than coding corrections, it should work (I hope, given it's taken me an hour or so to write).
Let us know how it works.
Code:
#! /bin/bash
cp sourcefile.txt sourcecopy.txt
sed -i 's#^[ \t]*##;s#[ \t]*$##g' sourcecopy.txt # delete white space from the beginning and end of each line
sort sourcecopy.txt > tmp.txt
sed -i '/^$/d' tmp.txt # delete blank lines
cp tmp.txt sourcecopy.txt
sort -u sourcecopy.txt > fileone.txt
### We now have three files: sourcecopy.txt, fileone.txt and sourcefile.txt
N=0
while read data ; do
N=$((N+1))
##### Remove one occurrence of the data from the source file (copy).
sed -i 's/'"$data"'//1' sourcecopy.txt
##### Reformat the data
echo $data > tempfile.txt
sed -i "s/ /\n/g" tempfile.txt # give each field its own line
sed -i '/^$/d' tempfile.txt # delete blank lines, shouldn't be any but we never know...
sed -i 's#^[ \t]*##;s#[ \t]*$##g' tempfile.txt # delete white space from the beginning and end of each line (uniformity is essential)
sed -i "s/$/ /g" tempfile.txt # add a space to the end of each line
##### Split & filter the data
head -n 1 tempfile.txt > fieldone.txt # move the top line (field one) to a file of its own
sed -i '1d' tempfile.txt # delete the top line (field one) from the file
touch fieldtwo.txt # create blank file for field two
echo "[^ ]* " > fieldtwo.txt # placeholder pattern: match any value in field two
sed -i '1d' tempfile.txt # remove top line (field 2) from tempfile.txt
touch fieldthree.txt # create blank file for field three
echo "[^ ]* " > fieldthree.txt # placeholder pattern: match any value in field three
sed -i '1d' tempfile.txt # remove top line (field 3) from tempfile.txt
head -n 1 tempfile.txt > fieldfour.txt
sed -i '1d' tempfile.txt
head -n 1 tempfile.txt > fieldfive.txt
sed -i '1d' tempfile.txt
head -n 1 tempfile.txt > fieldsix.txt
sed -i '1d' tempfile.txt
touch fieldseven.txt # create blank file for field seven
echo "[^ ]*" > fieldseven.txt # placeholder pattern: match any value in field seven
sed -i '1d' tempfile.txt # remove top line (field 7) from tempfile.txt
#### Recombine the data
paste -d "" fieldone.txt fieldtwo.txt fieldthree.txt fieldfour.txt fieldfive.txt fieldsix.txt fieldseven.txt > rebuild.txt
NEW=$(< rebuild.txt) # read the rebuilt search pattern itself, not the file name
echo "Writing file: file$N.$data.txt"
#### Compare the recombined data with the sourcefile and place data with matching 1st, 4th, 5th and 6th fields into a separate file
grep -i "$NEW" sourcecopy.txt > "file$N.$data.txt"
done < fileone.txt
Well, given that the O.P. mentioned that the input files are on the order of 20 GB, I suspect that any bash/sort/etc. solution would be either too slow or require too much memory to be practical.
If I were to try to solve the problem, I think that indexing the data using, e.g., SQLite and then printing the line where the index count was greater than one would be a much more usable solution.
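In rough outline that could look like the following (a sketch only: the database name, table name, column names and the tab-separated input file data.tsv are all made up, and the import step would have to match the real field layout):
Code:
sqlite3 dups.db <<'SQL'
CREATE TABLE t(f1, f2, f3, f4, f5, f6, f7);
.mode tabs
.import data.tsv t
-- print every line whose key (fields 1, 4, 5, 6) occurs with more
-- than one distinct value in field 7
SELECT t.* FROM t
JOIN (SELECT f1, f4, f5, f6 FROM t
      GROUP BY f1, f4, f5, f6
      HAVING COUNT(DISTINCT f7) > 1) d
  ON t.f1 = d.f1 AND t.f4 = d.f4 AND t.f5 = d.f5 AND t.f6 = d.f6;
SQL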
Depending on how many lines are to be printed (and thus the data size that must be held), content-addressable arrays might be viable -- algorithmically similar to indexing with SQLite?
The problem that I see is that the duplicated key with a unique field-7 combination can occur anywhere in the 20 GB file. Here's a simple AWK program that works well for the small sample data the OP provided, but I strongly suspect that it would "choke" on a 20 GB file, even if the OP had a 100 GB swap file.
Code:
#!/bin/gawk -f
{
    # Index every line by a composite key built from fields 1, 4, 5 and 6,
    # counting occurrences of each key and of each (key, field 7) pair.
    key = $1 SUBSEP $4 SUBSEP $5 SUBSEP $6
    count[key] += 1
    count2[key, $7] += 1
    record[key, $7] = record[key, $7] SUBSEP $0
}
END {
    # For every key seen more than once, print the stored lines whose
    # (key, field 7) combination occurred exactly once.
    for (key in count) {
        if (count[key] > 1) {
            split(key, dups, SUBSEP)
            for (key2 in count2) {
                if (count2[key2] != 1) continue
                split(key2, dup2, SUBSEP)
                matched = 1
                for (i = 1; i < 5; ++i) {
                    if (dup2[i] != dups[i]) {
                        matched = 0
                        break
                    }
                }
                if (matched == 1) {
                    ndup = split(record[key, dup2[5]], line, SUBSEP)
                    for (n = 2; n <= ndup; ++n) {
                        print line[n]
                    }
                }
            }
        }
    }
}
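Assuming the program is saved as, say, finddups.awk (both file names below are placeholders), it can be run directly, or via gawk -f if /bin/gawk is not where the shebang expects it:
Code:
chmod +x finddups.awk
./finddups.awk sample_data.txt > matches.txt
# or: gawk -f finddups.awk sample_data.txt > matches.txt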
<edit>
For what it's worth, here's the output from the OP's sample data:
Sorry I am only replying now; I have been travelling and wasn't online over the weekend. The suggestions in the previous two posts are both great solutions and both work in principle. PTrenholme, I tried the script on a 4 GB file but had to stop it after an hour (I was running it on my laptop, which I had to switch off); I will try to run it again overnight on my Linux desktop. grail's solution seems to be the way to go in terms of speed; sorting the data beforehand is of course quite quick. I will let you know whether everything works on the large files as well, as soon as I've had the chance to test it.
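grail's actual script is not quoted in this excerpt, but a sort-then-stream pipeline along the following lines is one common way to handle data of this size; GNU sort merges its work through temporary files on disk, so the whole 20 GB never has to fit in RAM (the file and directory names below are placeholders):
Code:
# Sort on fields 1, 4, 5, 6 (and 7) so candidate duplicates become adjacent.
LC_ALL=C sort -k1,1 -k4,4 -k5,5 -k6,6 -k7,7 -T /path/to/big/tmp bigfile.txt > sorted.txt
# sorted.txt can then be streamed through an adjacent-line comparison,
# e.g. the short awk sketch earlier in the thread.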