LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Scripting: split file into 12 lines array (https://www.linuxquestions.org/questions/programming-9/scripting-split-file-into-12-lines-array-773866/)

zklone 12-06-2009 06:59 PM

Scripting: split file into 12 lines array
 
Hi,

I need to split a file into an array, with a split at every 12th line.
e.g.

line 1
line 2
...
line 24


then the array will look something like

items[0] = line 1 ... line 12
items[1] = line 13 ... line 24

Right now I am reading the file line by line and putting the lines into an array. This is a little slow.

If there is a better way, please point me in the right direction.

Thanks,
Mike
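If bash 4's mapfile (a.k.a. readarray) is available, one way to speed this up is to slurp the whole file in a single call and then join every 12 lines into one array element; a minimal sketch, with /path/to/file as a placeholder:

```shell
#!/bin/bash
# Read the entire file into an array in one call (bash >= 4),
# then join each group of 12 lines into a single items[] element.
mapfile -t lines < /path/to/file

items=()
for ((i = 0; i < ${#lines[@]}; i += 12)); do
    chunk=""
    for ((j = i; j < i + 12 && j < ${#lines[@]}; j++)); do
        chunk+="${lines[j]}"$'\n'
    done
    items+=("${chunk%$'\n'}")   # drop the trailing newline
done

echo "number of chunks: ${#items[@]}"
```

This still reads the whole file once, but avoids the per-line overhead of a `while read` loop.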

ghostdog74 12-06-2009 07:12 PM

Tell us exactly what problem you are solving.

Telemachos 12-06-2009 07:28 PM

Quote:

Originally Posted by zklone (Post 3782142)
Right now I am reading the file line by line and putting the lines into an array. This is a little slow.

I'm not sure how you can avoid doing this (in one way or another). You have to read the file to get the items at all, so to that degree you're I/O bound.

If the lines are uniform, you could theoretically do something involving bytes or size, but that strikes me as an unlikely possibility.
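For illustration, if every line really were a fixed width, you could seek straight to a chunk without reading anything before it; a sketch assuming an 80-byte line (79 characters plus newline), with /path/to/file and the width purely as examples:

```shell
#!/bin/sh
# Hypothetical: every line is exactly 80 bytes (79 chars + newline).
# Chunk N (0-based) is then 12*80 bytes starting at offset N*12*80,
# so dd can skip straight to it instead of reading from the top.
LINE_BYTES=80
N=3   # fetch the 4th 12-line chunk
dd if=/path/to/file bs=$((12 * LINE_BYTES)) skip="$N" count=1 2>/dev/null
```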

lwasserm 12-06-2009 08:25 PM

I don't know if it will run faster, but you could use something like this (untested code, just for concept)

Code:

INDEX=0
LINENUMBER=1

while whatever-is-appropriate; do
    # -n with p prints only the requested range; double quotes let
    # $LINENUMBER expand (single quotes would pass it literally).
    # The addr,+11 form is a GNU sed extension.
    ARRAY[INDEX]=$(sed -n "$LINENUMBER,+11p" /path/to/file)
    ((INDEX++))
    ((LINENUMBER += 12))
done

Note that sed starts line numbering at 1, not at 0.

ta0kira 12-06-2009 08:31 PM

Quote:

Originally Posted by lwasserm (Post 3782200)
I don't know if it will run faster, but you could use something like this (untested code, just for concept) [...]

This will cause sed to read the entire file every time through the loop, not just the 12 lines requested.

Please see the thread below:
sed script to parse a file into smaller files with set # of lines

Kevin Barry

syg00 12-06-2009 09:04 PM

I did some testing a while back, and found perl was faster at subsetting a (huge) file than sed, even if both were stopped after the requisite lines (only) were found rather than continuing to read.
As usual, YMMV.
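A sketch of the kind of early-exit perl subsetting being described, assuming lines 13-24 are the ones wanted (the range is illustrative):

```shell
# Print only lines 13-24, then stop reading the rest of the file.
# 'last' exits as soon as the range has been printed, so a huge
# file is not scanned to the end. (GNU sed's rough equivalent
# would be: sed -n '13,24p;24q' file)
perl -ne 'print if 13..24; last if $. == 24' /path/to/file
```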

ghostdog74 12-06-2009 11:07 PM

Quote:

Originally Posted by ta0kira (Post 3782205)

With an 80-million(?)-line file, you can (for the last post in that thread):
1) lose the cat, because it's useless;
2) avoid using bash's while-read loop to read big files;
3) if a bash solution is desired, skip the external sed call and use bash's own string substitution;
4) or use awk.
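As a small illustration of point 3, bash's parameter expansion can replace a sed call for simple substitutions (the strings here are made up):

```shell
#!/bin/bash
# Bash's ${var//pattern/replacement} does a global substitution
# without forking an external sed process for each line.
line="foo bar foo"
echo "${line//foo/baz}"   # every "foo" becomes "baz": baz bar baz
```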

ta0kira 12-07-2009 01:09 AM

Quote:

Originally Posted by ghostdog74 (Post 3782283)
With an 80-million(?)-line file, you can (for the last post in that thread):
1) lose the cat, because it's useless;
2) avoid using bash's while-read loop to read big files;
3) if a bash solution is desired, skip the external sed call and use bash's own string substitution;
4) or use awk.

Please post an example, either here or in the other thread.
Kevin Barry

ghostdog74 12-07-2009 04:50 AM

Quote:

Originally Posted by ta0kira (Post 3782358)
Please post an example, either here or in the other thread.
Kevin Barry

An example for which point? 1, 2, 3, or 4?

ta0kira 12-07-2009 01:44 PM

Quote:

Originally Posted by ghostdog74 (Post 3782521)
An example for which point? 1, 2, 3, or 4?

Your solution to the problem taking into account all 4.
Kevin Barry

ghostdog74 12-07-2009 06:08 PM

Quote:

Originally Posted by ta0kira (Post 3783030)
Your solution to the problem taking into account all 4.
Kevin Barry

1) Instead of
Code:

cat $file | while ...

use input redirection
Code:

while read -r line
do
    ...
done < "$filename"

or open/close the file descriptor explicitly
Code:

exec 4<"$filename"
while read -r line <&4
do
    ...
done
exec 4<&-

2) It's well known that processing large files with bash's while-read loop is much slower than using tools like awk. You can search some of my previous posts (way back) where I demonstrated this.

3) I am not sure what that sed line is doing, i.e. s/./&/. Care to explain?

4) Have already provided awk suggestion in that thread.

ta0kira 12-07-2009 09:39 PM

Quote:

Originally Posted by ghostdog74 (Post 3783307)
2) It's well known that processing large files with bash's while-read loop is much slower than using tools like awk. You can search some of my previous posts (way back) where I demonstrated this.

Ok, but I'm not sure how you direct to separate files with awk without running through the file more than once. That's my awk ignorance, though, which is why I was hoping you had an example.
Quote:

Originally Posted by ghostdog74 (Post 3783307)
3) I am not sure what that sed line is doing, i.e. s/./&/. Care to explain?

I'm not sure, either. It's something that OP had in his or her original script, and again, I didn't test my code. It was an example. This isn't the thread to argue about such things; therefore, it would be helpful if you'd show what you mean by "awk can do it better."
Kevin Barry

ghostdog74 12-07-2009 10:02 PM

Quote:

Originally Posted by ta0kira (Post 3783457)
Ok, but I'm not sure how you direct to separate files with awk without running through the file more than once.

you mean this?
Code:

awk 'NR % 4 == 1 { ++c } { print $0 > ("file-" c ".txt") }' file

(the parentheses around the file name make the concatenation portable across awk implementations). NR%4==1 is true on the first line of every group of 4 lines (lines 1, 5, 9, ...). eg
Code:

$ more file
1
2
3
4
5
6
7
8
9
10
$ awk 'NR%4==1' file
1
5
9

Using this concept, the OP can change it to NR%4000000==1 for his requirement. Notice that the count variable "c" is incremented at the start of every group of 4 lines; this variable is appended to the file name.
The awk one-liner above summarizes what the OP did with that bunch of seds
Code:

sed -n '1,4000000 s/./&/w $FileName.01' $FileName
...
...
sed -n '76000001,$ s/./&/w $FileName.20' $FileName

The s/./&/ followed by w is just writing the selected range to a file (my guess), which in awk is a simple print with ">" redirection.
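Putting that together, a single awk pass that reproduces the numbered output files of that sed batch might look like this; the 4,000,000-line chunk size and the .NN suffix are taken from the quoted seds, and this is untested against the original data:

```shell
# One pass over the file: start a new output file every 4,000,000
# lines and name it like the sed version did ($FileName.01, .02, ...).
awk -v base="$FileName" '
    NR % 4000000 == 1 { ++c; out = base "." sprintf("%02d", c) }
    { print > out }
' "$FileName"
```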

Quote:

I'm not sure, either. It's something that OP had in his or her original script, and again, I didn't test my code.
Whatever it is, there's no need to call sed (echo or printf will do).

Quote:

it would be helpful if you'd show what you mean by "awk can do it better."
What I mean by "better" is speed on big files, as compared to bash's while-read loop.

