Keep only the first instance of rows with the same ID in Linux
Greetings!
My file has a format like this:
Code:
1 ABC
Code:
1 ABC
Thank You! |
If you show the Python code, it may be possible to optimize it.
ok |
You should post the code you have up to this point.
Understand also that members are not here to provide code examples for you, but instead to offer suggestions as to how to succeed with your attempts and improve them. Therefore you need to show your attempt and explain what else you have tried. I will move this question to the Programming forum to give it better exposure. |
Do this: [b]sort the file.[/b]
Now, you can loop through the file and select the first record each time the key-value changes from one record to the next. Sorting will cause all occurrences of the same key to be adjacent to one another. You simply need to remember what the previous key-value (if any) was, so that you can know if it changed.

Sorting and Searching are processes that have been so extensively studied – and optimized – that one Dr. Knuth wrote an entire (incomprehensible ... :rolleyes:) book about it. "Sorting hundreds of thousands of records," or even millions, today is really no big deal. If you know that n streams of data are sorted identically, you can also merge the sorted streams, dealing with only one record from each stream at a time.

Incidentally, if you remember all those science-fiction movies which had lots of spinning magnetic tapes ... that's what they were doing. Those tapes contained sorted record streams which were merged with other sorted streams. There are tape-sort algorithms, such as polyphase merge sort, which were used to put new records into key sequence for further processing. Main memory was non-existent in those days, and disk (drum ...) memory was unreliable and slow. But those "oldy moldy COBOL" ;) algorithms are just as efficient as ever they were – and are still in use today. The database-repair product that I still(!) sell uses sorting-based techniques to achieve a more-than-100x speed improvement over competing algorithms that used keyed random access.
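In Python, a minimal sketch of that one-pass-over-sorted-data idea (assuming whitespace-separated records with the key in the first field, and a filename on the command line):
Code:
import sys

prev_key = None
with open(sys.argv[1]) as f:          # file is assumed to be already sorted by key
    for line in f:
        fields = line.split()
        if not fields:
            continue                  # skip blank lines
        key = fields[0]               # first field is the key
        if key != prev_key:           # key changed: first record of a new run
            print(line, end='')
            prev_key = key
|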
I will surely remember it for any later postings. Thanks!
I was converting this file into a dictionary, so I placed a condition:
Code:
%time
Thank You!
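A rough sketch of the kind of dictionary-based condition described here (the tab-separated format with the ID in the first column, and the filename argument, are assumptions):
Code:
import sys

records = {}
with open(sys.argv[1]) as f:
    for line in f:
        key, _, value = line.rstrip('\n').partition('\t')
        if key not in records:        # the condition: keep only the first instance
            records[key] = value

for key, value in records.items():
    print(key, value, sep='\t')
|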
But you must have the memory to contain all of the unique key-values, and there must be no significant "memory pressure." If memory is in short supply, then suddenly your algorithm will run much slower because the swap device has become your "hidden disk file." "Hundreds of thousands," on today's typically RAM-endowed computers, is probably just a big yawn: the computer probably does have the memory lying around, waiting for something to do. But if you've got tens of millions, you'd probably need to be considering a different approach.

The advantage of a sort-based approach is that, once the data are sorted, the algorithm's execution time is strictly linear and almost no RAM is required. Very efficient ways to sort very large amounts of data are well at hand, even in the good ol' sort command or, IIRC, built right into Python. |
There are occasions where sorting isn't appropriate - only the data owner can ascertain that. But if it can be, the sort command is all you need. One command. Saves making all those calls to functions - that's probably where all the time is being spent.
If the data need to be processed in the order presented, I would use awk - others might prefer perl. In either case, a simple "one-liner" will suffice, and it will run much faster than that code. |
Thank you so much for detailed explanation.
Yeah, you are right. The RAM is overloaded with other processes; that's why it was taking a lot of time, but after applying "sort" and then running the code it seems to take less time. Thanks once again :) |
I'm not sure how similar Python or Perl are to C. What I'm thinking of is opening two files, reading from one, writing to the other, and doing that writing conditionally. Yes, I agree that a sort in advance is helpful, so that you do not have to work hard at remembering which records were previously encountered. As a matter of fact, one could modify one of the sort algorithms so that, instead of moving records, it deletes them once it detects that the criteria merit removal. I'm assuming that the same is also possible in Python.
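In Python terms, a minimal sketch of that two-file, conditional-write idea (unsorted input; the filenames are placeholders):
Code:
seen = set()
with open('in.file') as src, open('out.file', 'w') as dst:
    for line in src:
        fields = line.split(None, 1)          # first field is the ID
        if fields and fields[0] not in seen:  # write only the first instance
            seen.add(fields[0])
            dst.write(line)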
|
Yes, it should be sort+awk. Here is my suggestion (untested though):
Code:
sort -k1n,2 test.txt | awk 'NR==1 || stored!=$1 { stored=$1; print $0; }'
In SQL terms, that is roughly:
Code:
SELECT key,MIN(value) FROM table GROUP BY key ORDER BY key; |
Where I value awk is that it is extremely useful when you don't want to sort.
Code:
awk '!tmp[$1]++ {print}' in.file |
Python3 Program
With Python, it's not really necessary to use a dictionary at all, since it's really just acting as a "membership tester" in this scenario.
Also, be careful printing from dictionary.items() - you can't be sure what order they'll be printed in! Instead, you can use sets. Additionally, you are using split('\t') to parse each line. Why not use csv, which can use any delimiter? For example:
Code:
# checker.py
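# (the rest of this block was lost above; a minimal sketch of the
#  csv-plus-set approach just described - the tab delimiter and an
#  input filename on the command line are assumptions)
import csv
import sys

seen = set()
with open(sys.argv[1], newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        if row and row[0] not in seen:   # set as a membership tester on the ID
            seen.add(row[0])
            print('\t'.join(row))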
Code:
$ wc -l test2.txt
It's of course important to think about the problem from other angles too. After all, a blunt axe swung harder and faster is still slow. Do you know much about the files? Specifically, what is the maximum number of different "ids"? If you know you have 3 different ids and have printed out all 3, why bother continuing to iterate through the file? You could then try something like the below, which will stop iterating over the file once 3 ids have been found.
Code:
import sys
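# (the remainder of the posted program was lost; a hypothetical completion
#  that stops once MAX_IDS distinct ids have been printed - MAX_IDS and the
#  tab-delimited format are assumptions)
MAX_IDS = 3

seen = set()
with open(sys.argv[1]) as f:
    for line in f:
        key = line.split('\t', 1)[0]
        if key not in seen:
            seen.add(key)
            print(line, end='')
            if len(seen) == MAX_IDS:   # all ids seen - stop early
                break
|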
Code:
awk '!t[$1]++' in.file
Code:
awk '!($1 in t) { print; t[$1] }' in.file |