Old 08-02-2017, 06:56 AM   #1
Asoo
LQ Newbie
 
Registered: Apr 2017
Posts: 29

Rep: Reputation: Disabled
Keep only the first instance of rows with the same ID in Linux


Greetings!

My file has a format like this:

Code:
1 ABC
1 DEF
1 MNO
2 PQR
2 UVW
3 XYZ
The result file should look like this:

Code:
1 ABC
2 PQR
3 XYZ
To keep only the first instance of each ID. The file contains more than 100,000 lines, and my Python code is taking a lot of time.

Thank You!
 
Old 08-02-2017, 07:16 AM   #2
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 843

Rep: Reputation: 201
If you show the Python code, it may be possible to optimize it.

ok
 
Old 08-02-2017, 07:17 AM   #3
rtmistler
Moderator
 
Registered: Mar 2011
Location: Sutton, MA. USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu
Posts: 5,589
Blog Entries: 12

Rep: Reputation: 1959
You should post the code you have up to this point.

Understand also that members are not here to provide code examples for you, but instead to offer suggestions on how to succeed with your attempts and improve them. Therefore you need to show your attempt and explain what else you have tried.

I will move this question to the Programming forum to give it better exposure.
 
Old 08-02-2017, 07:33 AM   #4
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 8,444
Blog Entries: 4

Rep: Reputation: 2896
Do this: [b]sort the file.[/b]

Now, you can loop through the file and select the first record each time the key-value changes from one record to the next. Sorting will cause all occurrences of the same key to be adjacent to one another. You simply need to remember what the previous key-value (if any) was, so that you can know if it changed.
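
In Python, that post-sort pass might look like the following minimal sketch (assuming tab-delimited input that has already been run through sort; the file names are placeholders):

Code:
# one linear pass over a file already sorted by its first field
prev_key = None
with open("sorted.txt") as infile, open("result.txt", "w") as outfile:
    for line in infile:
        key = line.split('\t', 1)[0]
        if key != prev_key:          # key changed: first record of a new group
            outfile.write(line)
            prev_key = key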

Sorting and Searching are processes that have been so extensively studied – and optimized – that one Dr. Knuth wrote an entire (incomprehensible ... ) book about it. "Sorting hundreds of thousands of records," or even millions, is really no big deal today.

If you know that n streams of data are sorted identically, you can also merge the sorted streams, dealing with only one record from each stream at a time.
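
In Python, such an n-way merge is available out of the box as heapq.merge; a small sketch (the part files are hypothetical, and each is assumed to be sorted already):

Code:
import heapq

# merge two sorted files into one sorted stream, buffering only
# a single line from each input at any moment
with open("part1.txt") as a, open("part2.txt") as b, \
     open("merged.txt", "w") as out:
    for line in heapq.merge(a, b):
        out.write(line)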

Incidentally, if you remember all those science-fiction movies which had lots of spinning magnetic tapes ... that's what they were doing. Those tapes contained sorted record streams which were merged with other sorted streams. There are also tape-sort algorithms, such as polyphase merge sort, which were used to put new records into key sequence for further processing.

Main memory was scarce in those days, and disk (drum ...) storage was unreliable and slow. But those "oldy moldy COBOL" algorithms are just as efficient as they ever were – and are still in use today.

The database-repair product that I still(!) sell uses sorting-based techniques to achieve a more-than-100x speed improvement over competing algorithms that used keyed random access.

Last edited by sundialsvcs; 08-02-2017 at 07:41 AM.
 
1 member found this post helpful.
Old 08-02-2017, 07:36 AM   #5
Asoo
LQ Newbie
 
Registered: Apr 2017
Posts: 29

Original Poster
Rep: Reputation: Disabled
I will surely remember it, for any later postings. Thanks!

I was converting this file into a dictionary, so I placed a condition:

Code:
# %time is an IPython magic, not plain Python; it only works in IPython/Jupyter
%time

infile = open("test.txt", "r")
outfile = open("out.txt", "w")  # outfile was never opened in the original; "out.txt" is a placeholder

dictionary = {}
for line in infile:
    k, v = line.strip().split('\t')
    if k not in dictionary:     # keep only the first instance of each ID
        dictionary[k] = v

for key, value in dictionary.items():
    outfile.write('\t%s\t%s\n' % (key, value))
I have added the "if" condition and it is working fine for keeping the first instance. Can this code be optimized further to run on more than 100,000 rows?

Thank You!
 
Old 08-02-2017, 07:44 AM   #6
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 8,444
Blog Entries: 4

Rep: Reputation: 2896
Quote:
Originally Posted by Asoo View Post
I have added the "if" condition and it is working fine for keeping the first instance. Can this code be optimized further to run on more than 100,000 rows?
Your algorithm appears to be solid. If you've got the memory, uncontested by any other purpose, to hold all of the (unique) records at the same time, then this algorithm should be quick. Python's implementation of hash tables (dictionaries) is of course a good one. Your use of an "if not already in dictionary" test is correct and necessary.

But you must have the memory to contain all of the unique key-values, and there must be no significant "memory pressure." If memory is in short supply, then suddenly your algorithm will run much slower because the swap-device has become your "hidden disk file."

"Hundreds of thousands," on today's typically RAM-endowed computers, is probably just a big yawn: the computer probably does have the memory lying around, waiting for something to do. But, if you've got tens of millions, you'd probably need to be considering a different approach.

The advantage of a sort-based approach is that, once the data are sorted, the algorithm's execution time is strictly linear and almost no RAM is required. Very efficient ways to sort very large amounts of data are well-at-hand, even in the good ol' sort command or, IIRC, built right in to Python.
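
For data that still fits in memory, Python's built-in stable sort plus itertools.groupby expresses that whole sort-then-scan idea in a few lines (a sketch; because the sort is stable, the first line kept in each group is also the first occurrence of that ID in the file):

Code:
import itertools

first_field = lambda s: s.split('\t', 1)[0]

with open("test.txt") as f:
    lines = sorted(f, key=first_field)    # stable sort by ID (string compare)

# groupby yields consecutive runs of equal keys; keep one line per run
for key, group in itertools.groupby(lines, key=first_field):
    print(next(group), end='')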

Last edited by sundialsvcs; 08-02-2017 at 07:50 AM.
 
1 member found this post helpful.
Old 08-02-2017, 08:16 AM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,795

Rep: Reputation: 2157
There are occasions where sorting isn't appropriate - only the data owner can ascertain that. But if it is appropriate, the sort command is all you need. One command. It saves making all those function calls - that's probably where all the time is being spent.

If the data need to be processed in the order presented, I would use awk - others might prefer perl. In either case, a simple "one-liner" will suffice, and run much faster than that code.
 
1 member found this post helpful.
Old 08-02-2017, 08:19 AM   #8
Asoo
LQ Newbie
 
Registered: Apr 2017
Posts: 29

Original Poster
Rep: Reputation: Disabled
Thank you so much for the detailed explanation.

Yeah, you are right. The RAM is overloaded with other processes; that's why it was taking a lot of time. After applying "sort" and then running the code, it takes less time.

Thanks once again

Last edited by Asoo; 08-02-2017 at 08:20 AM.
 
Old 08-02-2017, 08:21 AM   #9
rtmistler
Moderator
 
Registered: Mar 2011
Location: Sutton, MA. USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu
Posts: 5,589
Blog Entries: 12

Rep: Reputation: 1959
I'm not sure how similar Python or Perl is to C. What I'm thinking of is opening two files, reading from one, writing to the other, and doing that writing conditionally. Yes, I agree that a sort in advance is helpful so that you do not have to work hard at remembering which records were previously encountered. As a matter of fact, one could modify one of the sort algorithms so that, instead of moving records, it deletes them once it detects that the criteria merit removal. I'm assuming the same is possible in Python.
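
In Python, the two-file version of that idea might look like this minimal sketch (file names are placeholders; a set stands in for remembering which keys were previously encountered, so the input does not even need to be sorted):

Code:
seen = set()
with open("in.txt") as src, open("out.txt", "w") as dst:
    for line in src:
        key = line.split('\t', 1)[0]
        if key not in seen:      # write conditionally: first sighting only
            seen.add(key)
            dst.write(line)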
 
Old 08-02-2017, 08:31 AM   #10
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 3,235

Rep: Reputation: 979
Yes, it should be sort+awk. Here is my suggestion (untested though):
Code:
sort -k1n,2 test.txt | awk 'NR==1 || stored!=$1 { stored=$1; print $0; }'
Also this is something databases can do:
Code:
SELECT key,MIN(value) FROM table GROUP BY key ORDER BY key;
 
2 members found this post helpful.
Old 08-02-2017, 05:15 PM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,795

Rep: Reputation: 2157
Where I value awk is that it is extremely useful when you don't want to sort.
Code:
awk '!tmp[$1]++ {print}' in.file
 
3 members found this post helpful.
Old 08-02-2017, 06:18 PM   #12
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_10{.0|.1|.2}
Posts: 4,454
Blog Entries: 6

Rep: Reputation: 2394
Quote:
Originally Posted by syg00 View Post
Where I value awk is that it is extremely useful when you don't want to sort.
Code:
awk '!tmp[$1]++ {print}' in.file
This also better meets the OP's stated requirements:

Quote:
Originally Posted by Asoo View Post
To keep only the first instance of each ID.
As already suggested in post #7...
Quote:
Originally Posted by syg00 View Post
If the data need to be processed in the order presented I would use awk
When we sort we get only the first instance of each sorted ID, which is not necessarily the first occurrence of that ID in linear file order.

Last edited by astrogeek; 08-02-2017 at 06:21 PM. Reason: Added #7 ref, typos
 
1 member found this post helpful.
Old 08-03-2017, 01:38 PM   #13
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 517

Rep: Reputation: 236
Python3 Program

With Python, it's not really necessary to use a dictionary at all, since it's really just acting as a "membership tester" in this scenario.
Also, be careful printing from dictionary.items() - in Python versions before 3.7 you can't be sure what order the items will be printed in!
Instead, you can use a set. Additionally, you are using split('\t') to parse each line. Why not use the csv module, which can handle any delimiter?
For example

Code:
#!/usr/bin/env python3
# checker.py
import sys
import csv

def checker(_file):
    accounted_for = set()   # IDs seen so far; a set is all we need
    with open(_file) as f:
        lines = csv.reader(f, delimiter='\t')
        for line in lines:
            k, v = line
            if k not in accounted_for:
                accounted_for.add(k)
                print('\t'.join(line))

for _file in sys.argv[1:]:
    checker(_file)
I only duplicated the demo values, but this approach is memory-efficient, faster, and reports identified values immediately.

Code:
$ wc -l test2.txt
63214200 test2.txt
$ du -h test2.txt
362M	test2.txt

$ time ./checker.py test2.txt 
1	ABC
2	PQR
3	XYZ

real	0m16.195s
user	0m16.104s
sys	0m0.076s


$ time ./checker_op_code.py test2.txt 
	1	ABC
	2	PQR
	3	XYZ


real	0m28.999s
user	0m28.912s
sys	0m0.080s

$ time awk '!tmp[$1]++ {print}' test2.txt
1	ABC
2	PQR
3	XYZ

real	0m12.929s
user	0m12.816s
sys	0m0.096s
EDIT:

It's of course important to think about the problem from other angles too. After all, a blunt axe swung harder and faster is still slow.
Do you know much about the files? Specifically, what is the maximum number of different IDs?

If you know you have 3 different IDs and have printed out all 3, why bother continuing to iterate through the file?
You could then try something like the code below, which will stop iterating over the file once 3 IDs have been found.

Code:
#!/usr/bin/env python3
import sys
import csv

def checker(_file, max_results=2**99):   # default is effectively unlimited
    accounted_for = set()
    with open(_file) as f:
        lines = csv.reader(f, delimiter='\t')
        for k, v in lines:
            if k not in accounted_for:
                accounted_for.add(k)
                yield '\t'.join((k, v))
                # stop reading as soon as the expected number of IDs is found
                if len(accounted_for) >= max_results:
                    return

for _file in sys.argv[1:]:
    for result in checker(_file, max_results=3):
        print(result)

Last edited by Sefyir; 08-09-2017 at 10:07 AM.
 
Old 08-03-2017, 09:12 PM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,527

Rep: Reputation: 2896
Quote:
Originally Posted by syg00 View Post
Where I value awk is that it is extremely useful when you don't want to sort.
Code:
awk '!tmp[$1]++ {print}' in.file
Just a reminder that the print is not required in this case, since printing the record is awk's default action when a pattern evaluates true.
 
Old 08-05-2017, 02:44 AM   #15
MadeInGermany
Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 424

Rep: Reputation: 185
Code:
awk '!t[$1]++' in.file
is a quick-and-dirty version of
Code:
awk '!($1 in t) { print; t[$1] }' in.file
The latter also saves some memory because there is only the hashed index and no value.
 
1 member found this post helpful.