Linux - Newbie: This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!
06-22-2017, 06:35 AM | #16
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,236
The standard answer for speeding up text-processing code is to use (properly constructed) Perl.
The Python code is overly complex, which no doubt adds to the runtime. The awk shouldn't be written to mirror that code, but should use native awk idioms.
Also, the Python code in post #1 won't produce the output in post #7, as no attempt was made to account for the header. Here is a quick awk attempt - it should be (much?) faster.
Code:
awk 'BEGIN { fl = 1; i = 0 }
NR == 1 { next }                  # skip the header line
!_[$1]++ { i++ }                  # count each newly seen ID
{
    if (i % 4) { print $0 > "out" fl ".txt" }
    else { delete _; print $0 > "out" ++fl ".txt"; _[$1]++; i = 1 }
}' Input.txt
1 member found this post helpful.
06-22-2017, 06:53 AM | #17
LQ Addict | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 22,631
You can use with to open a file; see here: https://stackoverflow.com/questions/...file-in-python (for example).
You still closed only one output file, but opened a lot of them...
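The point about with applies to many output files at once, too: contextlib.ExitStack keeps any number of handles open inside a single with block and guarantees every one of them is closed on exit. A minimal sketch (the record data and filenames are illustrative, not from the thread):

```python
from contextlib import ExitStack

# Sketch: when many output files are open at the same time, ExitStack
# closes all of them on exit, even if an exception occurs mid-loop.
records = [("id1", "row1"), ("id2", "row2"), ("id1", "row3")]

with ExitStack() as stack:
    files = {}  # map: ID -> open file handle
    for rec_id, row in records:
        if rec_id not in files:
            files[rec_id] = stack.enter_context(open(f"{rec_id}.txt", "w"))
        files[rec_id].write(row + "\n")
# every handle registered on the stack is closed here
```

This avoids the bug pan64 points at: with one explicit close() call but many open() calls, all but one file is left open.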
06-22-2017, 08:23 AM | #18
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by pan64
probably you need to generate several pieces instead of that one big file.
I want to split the file into chunks grouped by ID, so that all of one ID's rows end up in the same file and never in another one. I could have split the file using csplit or split, but then rows with the same ID would not stay together in one file.
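For what it's worth, that grouping can be sketched in plain Python. The sketch below collects rows in memory rather than writing files, and assumes the ID is the first whitespace-separated column and that four distinct IDs go into each output chunk (matching the awk posted earlier); both are assumptions about the actual data:

```python
# Rough sketch of the intended split: rows sharing an ID stay in the
# same output chunk; a new chunk starts after every 4 distinct IDs.
def split_by_id(lines, ids_per_file=4):
    out = {}            # output filename -> list of rows
    seen = set()        # distinct IDs in the current chunk
    file_no = 1
    for line in lines:
        rec_id = line.split()[0]
        if rec_id not in seen:
            if len(seen) == ids_per_file:   # chunk is full: start a new file
                seen.clear()
                file_no += 1
            seen.add(rec_id)
        out.setdefault(f"out{file_no}.txt", []).append(line)
    return out

rows = ["a 1", "a 2", "b 1", "c 1", "d 1", "e 1", "a 9"]
chunks = split_by_id(rows)
```

Like the awk version, this starts a fresh ID table per chunk, so an ID that reappears after a chunk boundary lands in the later file.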
06-22-2017, 08:32 AM | #19
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by syg00
Standard answer to speed up text processing code is to use (properly constructed) perl.
The python code is overly complex, and no doubt adds to the runtime. awk shouldn't be written to mirror that code, but use awk imperatives.
Also, the python code in post #1 won't produce the output in post #7 as no attempt was made to account for the header. Here is a quick awk attempt - it should be (much ?) faster.
Code:
awk 'BEGIN{fl=1 ; i=0} (NR == 1) {next} ; !_[$1]++ {i++} ; {if (i % 4) {print $0 > "out"fl".txt"} else { delete _ ; print $0 > "out"++fl".txt" ; _[$1]++ ; i=1 }}' Input.txt
Thank you so much for the reply.
Actually, I just added the first line to indicate the IDs separately (so I just initiated NR==0). Yes, it worked perfectly.
06-22-2017, 09:37 AM | #20
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,236
Remove the test altogether - if you have a lot of data, no sense testing every record.
06-22-2017, 09:58 AM | #21
Moderator | Registered: Mar 2011 | Location: USA | Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian | Posts: 9,891
I realize that you've marked this as solved.
I would've approached this very differently. However, I'll also admit that when I saw the original examples I felt the problem was pretty simple, not noticing that you were citing a very large amount of data to be processed.
My solution would've been a compiled program rather than a script or a scripting language; if it were small files, a script.
I would've written a program that opens the original file read-only, opens a new output file, and then processes the records in a simple loop that tests the first value and decides whether or not to write each record to the output file.
Based on my experience doing similar things with text files, I believe this solution would be very fast.
06-23-2017, 02:04 AM | #22
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by rtmistler
I realize that you've marked this as solved.
I would've approached this very differently, however will also admit that I saw the original examples and felt it was pretty simple, not noticing that you were citing a very large amount of data to be processed.
My solution would've been a program over a script or scripted language. If it were small files, a script.
I would've written a program that would've opened the original file as read-only, opened a new write-to file and then processed the records in a simple loop which would test the first value and choose to write that record to the output file versus not.
I feel this possible solution, based on my experience doing similar things with text files, would be very fast.
Can you elaborate on your solution? I can try this one as well; if it is much faster... then why not!
06-23-2017, 02:06 AM | #23
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by syg00
Remove the test altogether - if you have a lot of data, no sense testing every record.
It is removed. Thanks!
Last edited by Asoo; 06-23-2017 at 09:15 AM.
06-23-2017, 06:46 AM | #24
Moderator | Registered: Mar 2011 | Location: USA | Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian | Posts: 9,891
Quote:
Originally Posted by Asoo
Can you elaborate your solution? I can try with this one also, if it is much faster... then why not!
|
My short summary would be:
- A C program.
- open() using read-only for one file and write/create for the other file.
- read() from the source file in a loop until EOF.
- Conditionally write() to the output file.
A concern is that if you didn't understand the earlier descriptive text, then you are probably not a C programmer familiar with file operations. I would therefore suggest you not follow this route unless you wish to come up to speed with C programming well enough to accomplish it.
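For readers more comfortable in Python, those four steps can be mirrored with Python's os-level wrappers around the same C calls. This is only an illustrative sketch: the filenames and the filter on the first field are made up, not from the thread.

```python
import os

# Set up a tiny input file so the sketch is self-contained.
with open("source.txt", "w") as f:
    f.write("keep 1\nskip 2\nkeep 3\n")

# Step 1: open() read-only for the source, write/create for the output.
src = os.open("source.txt", os.O_RDONLY)
dst = os.open("filtered.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)

# Step 2: read() from the source in a loop until EOF.
buf = b""
while True:
    chunk = os.read(src, 4096)
    if not chunk:          # os.read returns b"" at EOF
        break
    buf += chunk

# Step 3: test the first field and conditionally write() each record.
for line in buf.splitlines(keepends=True):
    if line.split()[0] == b"keep":
        os.write(dst, line)

os.close(src)
os.close(dst)
```

The same structure translates line for line into C's open()/read()/write()/close(), which is the point of the summary above.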
06-23-2017, 09:14 AM | #25
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by rtmistler
My short summary would be:
- C program.
- open() using read-only for one file and write/create for the other file.
- read() from the source file in a loop until EOF.
- Conditionally write() to the output file.
A concern is that if you didn't understand the earlier descriptive text, then you are not generally a C programmer, familiar with file operations. Therefore suggest you do not follow this solution, unless you wish to tackle coming up to speed well enough with C programming to be able to accomplish this.
Yeah, I have worked only in Java and Python, so coding this in C would take a lot of time. Thank you so much for your help.
06-29-2017, 08:07 AM | #26
Member | Registered: Apr 2017 | Posts: 33 | Original Poster
Quote:
Originally Posted by syg00
Standard answer to speed up text processing code is to use (properly constructed) perl.
The python code is overly complex, and no doubt adds to the runtime. awk shouldn't be written to mirror that code, but use awk imperatives.
Also, the python code in post #1 won't produce the output in post #7 as no attempt was made to account for the header. Here is a quick awk attempt - it should be (much ?) faster.
Code:
awk 'BEGIN{fl=1 ; i=0} (NR == 1) {next} ; !_[$1]++ {i++} ; {if (i % 4) {print $0 > "out"fl".txt"} else { delete _ ; print $0 > "out"++fl".txt" ; _[$1]++ ; i=1 }}' Input.txt
The code works fine, but in some files a few columns are missing from the last entry. I have a file with more than 3 columns, and only the last few columns of the last row are missing. Any suggestions?