LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Keep only the first instance of rows with same ID in linux (https://www.linuxquestions.org/questions/programming-9/keep-only-the-first-instance-of-rows-with-same-id-in-linux-4175611122/)

Asoo 08-02-2017 06:56 AM

Keep only the first instance of rows with same ID in linux
 
Greetings!

My file has a format like this:

Code:

1 ABC
1 DEF
1 MNO
2 PQR
2 UVW
3 XYZ

The result file should look like this:

Code:

1 ABC
2 PQR
3 XYZ

I want to keep only the first instance of each ID. The file contains more than 100,000 lines, and my Python code is taking a lot of time.

Thank You!

AnanthaP 08-02-2017 07:16 AM

If you show the Python code, it may be possible to optimize it.

ok

rtmistler 08-02-2017 07:17 AM

You should post the code you have up to this point.

Understand also that members are not here to provide code examples for you, but to instead offer suggestions as to how to succeed with your attempts and improve them. Therefore you need to show your attempt and explain what else you have tried.

I will move this question to the Programming forum so it gets better exposure.

sundialsvcs 08-02-2017 07:33 AM

Do this: [b]sort the file.[/b]

Now, you can loop through the file and select the first record each time the key-value changes from one record to the next. Sorting will cause all occurrences of the same key to be adjacent to one another. You simply need to remember what the previous key-value (if any) was, so that you can know if it changed.
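
For instance, a minimal sketch of that scan in Python, assuming the input is tab-separated and has already been sorted on the first column (the file name is just a placeholder):

Code:

prev_key = None
with open("sorted.txt") as infile:            # placeholder name for the sorted file
    for line in infile:
        key = line.split('\t', 1)[0]
        if key != prev_key:                   # key changed: first record of a new group
            print(line, end='')
        prev_key = key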

Sorting and Searching are processes that have been so extensively studied – and optimized – that one Dr. Knuth wrote an entire (incomprehensible ... :rolleyes:) book about them. "Sorting hundreds of thousands of records," or even millions, today is really no big deal.

If you know that n streams of data are sorted identically, you can also merge the sorted streams, dealing with only one record from each stream at a time.
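
In Python that kind of merge is available out of the box via heapq.merge; a tiny sketch, assuming two files already sorted on the same tab-separated key (the file names are made up):

Code:

import heapq

# read the two key-sorted files lazily and merge them into one sorted stream,
# holding only one record from each file at a time
with open("part1.txt") as a, open("part2.txt") as b:
    for line in heapq.merge(a, b, key=lambda rec: rec.split('\t', 1)[0]):
        print(line, end='')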

Incidentally, if you remember all those science-fiction movies which had lots of spinning magnetic tapes ... that's what they were doing. Those tapes contained sorted record streams which were merged with other sorted streams. There are tape-sort algorithms, such as polyphase merge sort, which were used to put new records into key sequence for further processing.

Main-memory was non-existent in those days, and disk (drum ...) memory was unreliable and slow. But, those "oldy moldy COBOL" ;) algorithms are just as efficient as ever they were – and are still in use today.

The database-repair product that I still(!) sell uses sorting-based techniques to achieve a more-than-100x speed improvement over competing algorithms that used keyed random access.

Asoo 08-02-2017 07:36 AM

I will surely remember it for any later postings. Thanks!

I was converting this file into a dictionary, so I added a condition:

Code:

# %time is an IPython magic, not plain Python; time the script externally instead

infile = open("test.txt", "r")
outfile = open("output.txt", "w")   # outfile was missing from the original post; name is a placeholder

dictionary = {}
for line in infile:
    k, v = line.strip().split('\t')
    if k not in dictionary:          # keep only the first value seen for each ID
        dictionary[k] = v

for key, value in dictionary.items():
    outfile.write('\t%s\t%s\n' % (key, value))

infile.close()
outfile.close()

I have added the "if" condition and it is working fine for saving the first instance. Can this code be optimized further to handle more than 100,000 rows?

Thank You!

sundialsvcs 08-02-2017 07:44 AM

Quote:

Originally Posted by Asoo (Post 5743177)
I was converting this file into a dictionary, so I added a condition: [...] Can this code be optimized further to handle more than 100,000 rows?

Your algorithm appears to be solid. If you've got the memory, un-contested for any other purpose, to hold all of the (unique) records at the same time, then this algorithm should be quick. Python's implementation of hash-tables (dictionaries) is of course a good one. Your use of an "if not already in dictionary" test is correct and necessary.

But you must have the memory to contain all of the unique key-values, and there must be no significant "memory pressure." If memory is in short supply, then suddenly your algorithm will run much slower because the swap-device has become your "hidden disk file."

"Hundreds of thousands," on today's typically RAM-endowed computers, is probably just a big yawn: the computer probably does have the memory lying around, waiting for something to do. But, if you've got tens of millions, you'd probably need to be considering a different approach.

The advantage of a sort-based approach is that, once the data are sorted, the algorithm's execution time is strictly linear and almost no RAM is required. Very efficient ways to sort very large amounts of data are readily available, whether in the good ol' sort command or, IIRC, built right into Python.
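
As a rough sketch of the "built into Python" route, assuming the whole file fits in memory and is tab-separated (the file name is just an example): Python's sort is stable, so sorting on the ID alone keeps the original order within each ID group, and the first line of each group is still the first occurrence from the input; the output simply comes out ordered by ID rather than by first appearance. For files too big for RAM, the external sort command plus a scan like the one above is the low-memory route.

Code:

from itertools import groupby

with open("test.txt") as infile:
    lines = infile.readlines()

# stable sort on the ID column only
lines.sort(key=lambda rec: rec.split('\t', 1)[0])

# take the first line of every run of equal IDs
for _key, group in groupby(lines, key=lambda rec: rec.split('\t', 1)[0]):
    print(next(group), end='')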

syg00 08-02-2017 08:16 AM

There are occasions where sorting isn't appropriate - only the data owner can ascertain that. But if it can be, the sort command is all you need. One command. Saves making all those calls to functions - that's probably where all the time is being spent.

If the data need to be processed in the order presented I would use awk - others might prefer perl. In either case, a simple "one-liner" will suffice. And run much faster than that code.

Asoo 08-02-2017 08:19 AM

Thank you so much for detailed explanation.

Yeah, you are right. The RAM is overloaded with other processes, which is why it was taking a lot of time, but after applying "sort" and then running the code, it seems to take less time.

Thanks once again :)

rtmistler 08-02-2017 08:21 AM

I'm not sure if Python or Perl are similar to C. What I'm thinking is opening two files, reading from one, writing to the other, and doing that writing conditionally. Yes, I agree that a sort in advance is helpful so that you do not have to work hard at remembering which records were previously encountered. As a matter of fact, one could modify one of the sort algorithms so that, instead of moving duplicate records, it deletes them once the removal criteria are met. I'm assuming that the same is also possible in Python.
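
For what it's worth, that two-file approach is easy enough in Python; a rough sketch, with placeholder file names and a tab-separated first column assumed:

Code:

seen = set()
with open("in.txt") as infile, open("out.txt", "w") as outfile:
    for line in infile:
        key = line.split('\t', 1)[0]
        if key not in seen:          # copy only the first record seen for each ID
            seen.add(key)
            outfile.write(line)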

NevemTeve 08-02-2017 08:31 AM

Yes, it should be sort+awk. Here is my suggestion (untested though):
Code:

sort -k1n,2 test.txt | awk 'NR==1 || stored!=$1 { stored=$1; print $0; }'
Also this is something databases can do:
Code:

SELECT key,MIN(value) FROM table GROUP BY key ORDER BY key;

syg00 08-02-2017 05:15 PM

Where I value awk is that it is extremely useful when you don't want to sort.
Code:

awk '!tmp[$1]++ {print}' in.file

astrogeek 08-02-2017 06:18 PM

Quote:

Originally Posted by syg00 (Post 5743508)
Where I value awk is that it is extremely useful when you don't want to sort.
Code:

awk '!tmp[$1]++ {print}' in.file

This also better meets the OP's stated requirements:

Quote:

Originally Posted by Asoo (Post 5743145)
To keep only the first instance of each ID.

As already suggested in post #7...
Quote:

Originally Posted by syg00 (Post 5743207)
If the data need to be processed in the order presented I would use awk

When we sort we get only the first instance of each sorted ID, which is not necessarily the first occurrence of that ID in linear file order.

Sefyir 08-03-2017 01:38 PM

Python3 Program
 
With Python, it's not really necessary to use a dictionary at all, since it's really just acting as a "membership tester" in this scenario.
Also, be careful when printing from dictionary.items() - you can't be sure what order the items will come out in!
Instead, you can use a set. Additionally, you are using split('\t') to parse each line; why not use the csv module, which can handle any delimiter?
For example

Code:

# checker.py
import sys                                                                     
import csv                                                                     
                                                                               
def checker(_file):                                                           
    accounted_for = set()                                                     
    with open(_file) as f:                                                     
        lines = csv.reader(f, delimiter='\t')                                 
        for line in lines:                                                     
            k, v = line                                                       
            if k not in accounted_for:                                         
                accounted_for.add(k)                                           
                print('\t'.join(line))                                       
                                                                               
for _file in sys.argv[1:]:                                                     
    checker(_file)

For the benchmark I just duplicated the demo values many times, but this version is memory-efficient, faster, and reports each newly identified ID immediately.

Code:

$ wc -l test2.txt
63214200 test2.txt
$ du -h test2.txt
362M        test2.txt

$ time ./checker.py test2.txt
1        ABC
2        PQR
3        XYZ

real        0m16.195s
user        0m16.104s
sys        0m0.076s


$ time ./checker_op_code.py test2.txt
        1        ABC
        2        PQR
        3        XYZ


real        0m28.999s
user        0m28.912s
sys        0m0.080s

$ time awk '!tmp[$1]++ {print}' test2.txt
1        ABC
2        PQR
3        XYZ

real        0m12.929s
user        0m12.816s
sys        0m0.096s

EDIT:

It's of course important to think about the problem from other angles too. After all, a blunt axe swung harder and faster is still slow.
Do you know much about the files? Specifically, what is the maximum number of different IDs?

If you know you have 3 different IDs and have printed out all 3, why bother continuing to iterate through the file?
You could then try something like the code below, which stops iterating over the file once 3 IDs have been found.

Code:

import sys
import csv

def checker(_file, max_results=2**99):
    accounted_for = set()
    with open(_file) as f:
        lines = csv.reader(f, delimiter='\t')
        for line in lines:
            k, v = line
            if k not in accounted_for:
                accounted_for.add(k)
                yield '\t'.join((k, v))
                # stop reading the file once the expected number of IDs is found
                if len(accounted_for) >= max_results:
                    break

for _file in sys.argv[1:]:
    for result in checker(_file, max_results=3):
        print(result)


grail 08-03-2017 09:12 PM

Quote:

Originally Posted by syg00 (Post 5743508)
Where I value awk is that it is extremely useful when you don't want to sort.
Code:

awk '!tmp[$1]++ {print}' in.file

Just a reminder that the print is not required in this case :)

MadeInGermany 08-05-2017 02:44 AM

Code:

awk '!t[$1]++' in.file
is quick and dirty for
Code:

awk '!($1 in t) { print; t[$1] }' in.file
The latter also saves some memory because there is only the hashed index and no value.


All times are GMT -5.