indexing variables in lists using shell scripts
Hi all
I have a question concerning indexing over lists with Unix shell scripts. I have very large text files (up to 20 GB) with data like this:
Code:
8 0 0 1 14128656 1 0
43 0 0 0 2169432 1 0
47 1 0 5 105131 0 1
47 1 0 5 105131 1 0
48 0 0 0 4047238 0 1
57 0 0 4 19591591 1 2
61 0 0 0 20837007 1 1
74 1 0 1 3495911 0 3
74 1 0 5 3495911 1 3
74 1 0 5 3495911 0 0
I would like to set up a shell script that writes only those lines into a new file where the 1st, 4th, 5th and 6th fields are the same but the 7th field is different. Do I need to somehow set up an index? I have tried to solve this with Perl scripts, but a comparison using indices is much too slow and swallows up all my RAM. I wonder whether I can use awk for this? Thanks for your help. |
Well, I would say awk is the answer, except that in the data shown the fifth field is never equal to any of the others??
Code:
8 0 0 1 14128656 1 0 |
Code:
8 0 0 1 14128656 1 0
43 0 0 0 2169432 1 0
47 1 0 5 105131 1 1
47 1 0 5 105131 1 0
48 0 0 0 4047238 0 1
57 0 0 4 19591591 1 2
61 0 0 0 20837007 1 1
74 1 0 1 3495911 0 3
74 1 0 5 3495911 1 3
74 1 0 5 3495911 1 0
I have changed the data now to fit the requirements. If you look at lines 3 and 4 and the last two lines, they fulfil the criteria mentioned in my first post, so they should be written to a new file. How would you do this with awk? I don't have a problem with comparisons within the same line, but here I would have to compare variables from two lines, and for that my knowledge of awk is not sufficient. Can you give me a code example? By the way, the list is always sorted in increasing order, so comparisons would always be done on successive lines. Thanks. |
I think you mean that you want to print rows that share the same data in their 1st, 4th, 5th and 6th fields.
Try this; I haven't tested it and I'm a newbie, but the theory is sound (I hope): Code:
sort -u sourcefile.txt > fileone.txt
The original file should not be overwritten, but back it up first anyway. Edit to add: you would then need a component to count the number of lines in each group and to process those with multiple data lines to determine which ones match your criteria. I have a project to finish, but if I get time I'll take another look at it for you.
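Untested, but as a next step, sorting on just the key fields would at least bring the lines to compare next to each other. A sketch only (the output name 'sorted.txt' is invented, and whitespace-separated fields are assumed):
Code:
# Sort numerically on fields 1, 4, 5 and 6 so that lines sharing those
# key fields become adjacent; the 7th field can then be compared on
# consecutive lines.
sort -n -k1,1 -k4,4 -k5,5 -k6,6 sourcefile.txt > sorted.txt
|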
Ahhhh ... the part about the matches being on different lines is what was missing (because I was looking and saying to myself that they still can't possibly match).
So my question to you now is: what if the matching is across separated lines? E.g. Code:
74 1 0 5 3495911 1 3
The clue is to use arrays, as the index can be a string.
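For instance, something along these lines (an untested sketch; 'file' is just a placeholder) uses the four key fields, joined into one string, as the array index:
Code:
# Remember the first 7th-field value seen for each key; print any later
# line whose key matches but whose 7th field differs. Works even when
# the matching lines are far apart, at the cost of one array entry per
# distinct key. (Note: the first line of a group is not printed in this
# sketch.)
awk '{ key = $1 SUBSEP $4 SUBSEP $5 SUBSEP $6 }
     (key in first) && first[key] != $7 { print }
     !(key in first) { first[key] = $7 }' file
|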
I forgot to ask, do we print both lines?
|
Hello all. Thanks for your patience with the insufficient description of the data on my part. The lines in the original file I am working with are actually not sorted, so they could be in any order; I had only sorted them for the example here. The crucial thing to consider is that the lines should only be printed into the outfile if the 7th field does not match! As I mentioned, indexing has not worked for me so far, since the files are very big (up to 20 GB) and I would like to process them in one go. It would be great to get a solution with the Linux shell. Thanks.
|
You say the file is not sorted, so the lines we need to investigate will not necessarily be consecutive, correct?
Also, do you need to print both the original line and the one whose 7th field does not match? Plus, is it likely that more than one line matches the first set of fields with the 7th not matching? Should all of these be printed? As you have stated the file is so large, you will need to provide a more detailed explanation of the criteria you require. I think we all understand the matching criteria as follows: fields 1, 4, 5 and 6 all equal across lines, but field 7 not equal. As above, we now need more clarification on what to print and how it may be located. |
Haven't tested it, and some of the scripting might be wrong (I'm sure someone will correct it for us), but the process is valid (and long-winded) :)
Run it from the same directory where your source file is held, and change 'sourcefile.txt' in the top line to the name of your source file. Other than coding corrections, it should work (I hope, given it's taken me an hour or so to write). Let us know how it works. Code:
#! /bin/bash |
Well, given that the OP mentioned that the input files are on the order of 20 GB, I suspect that any bash/sort/etc. solution would be too slow and/or require too much memory to be practical.
If I were to try to solve the problem, I think that indexing the data using, e.g., SQLite and then printing the lines where the index count was greater than one would be a much more usable solution.
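Something like this, perhaps (an untested sketch; the database name, table name, column names and 'data.txt' are all invented for the example):
Code:
# Load the 7 whitespace-separated fields into a table, index the key
# fields, and print the lines whose key group has more than one
# distinct 7th-field value.
sqlite3 dups.db <<'EOF'
CREATE TABLE t (f1 INT, f2 INT, f3 INT, f4 INT, f5 INT, f6 INT, f7 INT);
.separator " "
.import data.txt t
CREATE INDEX t_key ON t (f1, f4, f5, f6);
SELECT t.f1, t.f2, t.f3, t.f4, t.f5, t.f6, t.f7
FROM t
JOIN (SELECT f1, f4, f5, f6
      FROM t
      GROUP BY f1, f4, f5, f6
      HAVING COUNT(DISTINCT f7) > 1) d
  ON t.f1 = d.f1 AND t.f4 = d.f4 AND t.f5 = d.f5 AND t.f6 = d.f6;
EOF
|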
Depending on how many lines are to be printed (and thus the data size that must be held), content-addressable arrays might be viable -- algorithmically similar to indexing with SQLite?
|
Code:
#!/bin/gawk -f
For what it's worth, here's the output from the OP's sample data: Code:
$ cat dup.data |
Well, if we are going to assume the matching lines follow each other and that both are to be printed, this seems a little easier:
Code:
awk 'f && ($1,$4,$5,$6) in a && a[$1,$4,$5,$6] != $7{print b"\n"$0}{a[$1,$4,$5,$6]=$7;b=$0;f=1}' file
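Spelled out for readability (same logic; 'file' is a placeholder for the data file). The comma-separated subscripts keep the key fields distinct, so that, e.g., fields "1 23" and "12 3" don't collide into the same key the way plain concatenation would:
Code:
awk '
# If a previous line stored the same key but a different 7th field,
# print the stored line and the current one.
f && ($1,$4,$5,$6) in a && a[$1,$4,$5,$6] != $7 {
    print b "\n" $0
}
# Always remember this line: its 7th field under the key, the whole
# line in b, and the flag f once at least one line has been read.
{
    a[$1,$4,$5,$6] = $7
    b = $0
    f = 1
}' file
Note that b only holds the immediately preceding line, so this relies on the matching lines being consecutive, i.e. on the file being sorted on the key fields first.
|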
Hi all,
Sorry I am only replying now; I have been travelling and wasn't online over the weekend. The suggestions in the previous two posts are both great solutions, and both work in principle. PTrenholme, I tried your script on a 4 GB file but had to stop it after an hour (I was running it on my laptop, which I had to switch off); I will try to run it again overnight on my Linux desktop. Grail's solution seems to be the way to go in terms of speed; it is of course quite quick to sort the data beforehand. I will let you know whether everything works on the large files as well, as soon as I've had the chance to test it. Thanks again! This is a great forum. C |