Linux - Newbie: This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!
06-22-2017, 06:35 AM | #16
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,236
The standard answer for speeding up text-processing code is to use (properly constructed) Perl.
The Python code is overly complex, which no doubt adds to the runtime. The awk shouldn't be written to mirror that code, but should use native awk idioms.
Also, the Python code in post #1 won't produce the output in post #7, as no attempt was made to account for the header. Here is a quick awk attempt - it should be (much?) faster.
Code:
awk 'BEGIN { fl = 1; i = 0 }
NR == 1 { next }                  # skip the header line
!_[$1]++ { i++ }                  # count each newly seen ID
{
    if (i % 4) { print $0 > "out" fl ".txt" }
    else { delete _; print $0 > "out" ++fl ".txt"; _[$1]++; i = 1 }
}' Input.txt
1 member found this post helpful.
06-22-2017, 06:53 AM | #17
LQ Addict | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 22,631
You can use with to open a file; see here: https://stackoverflow.com/questions/...file-in-python (for example).
You still closed only one output file, but opened a lot of them...
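The point about with applies to many output files at once, too: contextlib.ExitStack keeps any number of handles open inside a single with block and guarantees every one of them is closed on exit. A minimal sketch (the record data and filenames are illustrative, not from the thread):

```python
from contextlib import ExitStack

# Sketch: when many output files are open at the same time, ExitStack
# closes all of them on exit, even if an exception occurs mid-loop.
records = [("id1", "row1"), ("id2", "row2"), ("id1", "row3")]

with ExitStack() as stack:
    files = {}  # map: ID -> open file handle
    for rec_id, row in records:
        if rec_id not in files:
            files[rec_id] = stack.enter_context(open(f"{rec_id}.txt", "w"))
        files[rec_id].write(row + "\n")
# every handle registered on the stack is closed here
```

This avoids the bug pan64 points at: with one explicit close() call but many open() calls, all but one file is left open.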
06-22-2017, 08:23 AM | #18
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by pan64
probably you need to generate several pieces instead of that one big file.
I want to split the file into chunks grouped by ID, so that all of one ID's rows end up in the same file and never in another one. I could have split the file using csplit or split, but then rows with the same ID would not stay together in one file.
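For what it's worth, that grouping can be sketched in plain Python. The sketch below collects rows in memory rather than writing files, and assumes the ID is the first whitespace-separated column and that four distinct IDs go into each output chunk (matching the awk posted earlier); both are assumptions about the actual data:

```python
# Rough sketch of the intended split: rows sharing an ID stay in the
# same output chunk; a new chunk starts after every 4 distinct IDs.
def split_by_id(lines, ids_per_file=4):
    out = {}            # output filename -> list of rows
    seen = set()        # distinct IDs in the current chunk
    file_no = 1
    for line in lines:
        rec_id = line.split()[0]
        if rec_id not in seen:
            if len(seen) == ids_per_file:   # chunk is full: start a new file
                seen.clear()
                file_no += 1
            seen.add(rec_id)
        out.setdefault(f"out{file_no}.txt", []).append(line)
    return out

rows = ["a 1", "a 2", "b 1", "c 1", "d 1", "e 1", "a 9"]
chunks = split_by_id(rows)
```

Like the awk version, this starts a fresh ID table per chunk, so an ID that reappears after a chunk boundary lands in the later file.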
06-22-2017, 08:32 AM | #19
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by syg00
Standard answer to speed up text processing code is to use (properly constructed) perl.
The python code is overly complex, and no doubt adds to the runtime. awk shouldn't be written to mirror that code, but use awk imperatives.
Also, the python code in post #1 won't produce the output in post #7 as no attempt was made to account for the header. Here is a quick awk attempt - it should be (much ?) faster.
Code:
awk 'BEGIN{fl=1 ; i=0} (NR == 1) {next} ; !_[$1]++ {i++} ; {if (i % 4) {print $0 > "out"fl".txt"} else { delete _ ; print $0 > "out"++fl".txt" ; _[$1]++ ; i=1 }}' Input.txt
Thank you so much for the reply.
Actually, I just added the first line to indicate the IDs separately (so I just initiated NR==0). Yes, it worked perfectly.
06-22-2017, 09:37 AM | #20
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,236
Remove the test altogether - if you have a lot of data, no sense testing every record.
06-22-2017, 09:58 AM | #21
Moderator | Registered: Mar 2011 | Location: USA | Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian | Posts: 9,891
I realize that you've marked this as solved.
I would've approached this very differently. However, I'll also admit that when I saw the original examples I felt the problem was pretty simple, not noticing that you were citing a very large amount of data to be processed.
My solution would've been a compiled program rather than a script or a scripting language; if it were small files, a script.
I would've written a program that opens the original file read-only, opens a new output file, and then processes the records in a simple loop that tests the first value and decides whether or not to write each record to the output file.
Based on my experience doing similar things with text files, I believe this solution would be very fast.
06-23-2017, 02:04 AM | #22
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by rtmistler
I realize that you've marked this as solved.
I would've approached this very differently, however will also admit that I saw the original examples and felt it was pretty simple, not noticing that you were citing a very large amount of data to be processed.
My solution would've been a program over a script or scripted language. If it were small files, a script.
I would've written a program that would've opened the original file as read-only, opened a new write-to file and then processed the records in a simple loop which would test the first value and choose to write that record to the output file versus not.
I feel this possible solution, based on my experience doing similar things with text files, would be very fast.
Can you elaborate on your solution? I can try this one as well; if it is much faster... then why not!
06-23-2017, 02:06 AM | #23
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by syg00
Remove the test altogether - if you have a lot of data, no sense testing every record.
It is removed. Thanks!
Last edited by Asoo; 06-23-2017 at 09:15 AM.
06-23-2017, 06:46 AM | #24
Moderator | Registered: Mar 2011 | Location: USA | Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian | Posts: 9,891
Quote:
Originally Posted by Asoo
Can you elaborate your solution? I can try with this one also, if it is much faster... then why not!
|
My short summary would be:
- A C program.
- open() using read-only for one file and write/create for the other file.
- read() from the source file in a loop until EOF.
- Conditionally write() to the output file.
A concern is that if you didn't understand the earlier descriptive text, then you are probably not a C programmer familiar with file operations. I would therefore suggest you not follow this route unless you wish to come up to speed with C programming well enough to accomplish it.
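For readers more comfortable in Python, those four steps can be mirrored with Python's os-level wrappers around the same C calls. This is only an illustrative sketch: the filenames and the filter on the first field are made up, not from the thread.

```python
import os

# Set up a tiny input file so the sketch is self-contained.
with open("source.txt", "w") as f:
    f.write("keep 1\nskip 2\nkeep 3\n")

# Step 1: open() read-only for the source, write/create for the output.
src = os.open("source.txt", os.O_RDONLY)
dst = os.open("filtered.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)

# Step 2: read() from the source in a loop until EOF.
buf = b""
while True:
    chunk = os.read(src, 4096)
    if not chunk:          # os.read returns b"" at EOF
        break
    buf += chunk

# Step 3: test the first field and conditionally write() each record.
for line in buf.splitlines(keepends=True):
    if line.split()[0] == b"keep":
        os.write(dst, line)

os.close(src)
os.close(dst)
```

The same structure translates line for line into C's open()/read()/write()/close(), which is the point of the summary above.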
06-23-2017, 09:14 AM | #25
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by rtmistler
My short summary would be:
- C program.
- open() using read-only for one file and write/create for the other file.
- read() from the source file in a loop until EOF.
- Conditionally write() to the output file.
A concern is that if you didn't understand the earlier descriptive text, then you are not generally a C programmer, familiar with file operations. Therefore suggest you do not follow this solution, unless you wish to tackle coming up to speed well enough with C programming to be able to accomplish this.
Yeah, I have worked only in Java and Python, so coding this in C would take a lot of time. Thank you so much for your help.
06-29-2017, 08:07 AM | #26
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by syg00
Standard answer to speed up text processing code is to use (properly constructed) perl.
The python code is overly complex, and no doubt adds to the runtime. awk shouldn't be written to mirror that code, but use awk imperatives.
Also, the python code in post #1 won't produce the output in post #7 as no attempt was made to account for the header. Here is a quick awk attempt - it should be (much ?) faster.
Code:
awk 'BEGIN{fl=1 ; i=0} (NR == 1) {next} ; !_[$1]++ {i++} ; {if (i % 4) {print $0 > "out"fl".txt"} else { delete _ ; print $0 > "out"++fl".txt" ; _[$1]++ ; i=1 }}' Input.txt
The code works fine, but in some files a few columns are missing from the last entry. I have a file with more than 3 columns, and only the last few columns of the last row are missing. Any suggestions?