split very large 200mb text file by every N lines (sed/awk fails)
Hi All,
I have a large text file with over a million lines, and I need to split the file by every N lines.
In the end, I need to have three separate files: the first will have every third line starting with the very first line (there is no header), the second every third line starting with the second line, and so on for the third.
Unfortunately, the commands I have tried so far, including sed and awk one-liners, all fail at some point and start shifting in the sequence after a certain number of lines (probably an integer overflow).
Are there any other commands I should try that can handle the entire file?
Thanks!
Doug
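For a round-robin split like this, a single awk pass is often all that's needed. A minimal sketch (the output names part0, part1, part2 are just placeholders):
Code:
$> awk '{ print > ("part" ((NR - 1) % 3)) }' ./verybigfile
Lines 1, 4, 7, ... go to part0, lines 2, 5, 8, ... to part1, and lines 3, 6, 9, ... to part2. awk keeps all three output files open, so it reads the input exactly once.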
Check out csplit ('man csplit') at the console. Something like
Code:
$> cd /home/user
$> csplit -k --prefix=smallpart ./verybigfile 3 '{*}'
$> for i in ./smallpart*; do head -n 1 "$i"; done > ./lines147etc.txt
$> for i in ./smallpart*; do head -n 2 "$i" | tail -n 1; done > ./lines258etc.txt
$> for i in ./smallpart*; do tail -n 1 "$i"; done > ./lines369etc.txt
might do it for you.
Caveats: csplit splits before the given line number, so as I tested this, the first file ./smallpart00 came out with only _two_ lines. If there are fewer than 3 lines in the last small file, its last line will still get added to one or another of the three collected-lines files, so you'd need to edit them accordingly. Watch out for line order issues. I've found the old limit of max 100 splits is _not_ valid any longer, at least not for the version distributed with Debian's GNU coreutils 6.10-6.
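With a newer GNU coreutils, split can also do the round-robin distribution directly, which avoids the intermediate small files. A sketch, assuming your split supports the -n r/K mode (I believe that arrived around coreutils 8.8, so check 'split --version' first):
Code:
$> split -n r/3 ./verybigfile part.
This writes lines 1, 4, 7, ... to part.aa, lines 2, 5, 8, ... to part.ab, and lines 3, 6, 9, ... to part.ac.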
How many millions? I tried your sed command on a 5-million-line file (generated with seq $((5 * 1000 * 1000))), and it seemed to work just fine. That is, pasting all 3 output files back together produced the same file plus 2 extra blank lines.
Unfortunately, I can guarantee you that sed does not work properly here. Each of the three row types follows one of these formats:
SomeNumber choice_of_three_words text --->
SomeNumber choice_of_two_words text --->
SomeNumber one_word text --->
Every time I have tried the sed command, one of the three result files ends up with a mix of words, starting about 22,000 rows down, that could never otherwise end up in that file. I have checked the original data file to ensure that the problem is not there.
hasienda -- are the very last lines the only ones I need to check?
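One way to check the whole result rather than just the last lines is to interleave the three output files again and compare against the original. A sketch, assuming the output names from the csplit suggestion above:
Code:
$> paste -d '\n' ./lines147etc.txt ./lines258etc.txt ./lines369etc.txt | diff - ./verybigfile
If the only differences are a couple of trailing blank lines (from a line count that isn't a multiple of 3), the split is clean; anything else shows exactly where the sequence shifted.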
This appends to any existing files, takes input from stdin, takes the output file names as arguments, and doesn't bother checking for correct usage, but it does split the lines in a round-robin fashion.
Code:
#! /bin/bash
# Split stdin round-robin across the three output files given as arguments.
my_file[0]="$1"
my_file[1]="$2"
my_file[2]="$3"
fail=0
while [ "$fail" -lt 1 ]; do
    for ((x=0; x<3; ++x)); do
        # IFS= and -r keep leading whitespace and backslashes intact
        if IFS= read -r my_line; then
            printf '%s\n' "$my_line" >> "${my_file[$x]}"
        else
            fail=1
        fi
    done
done
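A usage sketch (the script name splitter.sh is a placeholder):
Code:
$> chmod +x ./splitter.sh
$> ./splitter.sh out1.txt out2.txt out3.txt < ./verybigfile
The per-line >> reopens the output file for every line, so this is slower than the awk or split approaches on a million-line file, but it uses constant memory regardless of input size.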