LinuxQuestions.org
Old 08-03-2009, 06:05 PM   #1
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Rep: Reputation: 0
split very large 200 MB text file by every Nth line (sed/awk fail)


Hi All,

I have a large text file with over a million lines, and I need to split it by taking every Nth line.

In the end, I need three separate files: the first with every 3rd line starting from the very first line (there is no header), the second with every 3rd line starting from the second line, and the third likewise starting from the third line.

Unfortunately, the commands that I have tried so far, including:

$ sed -n '2~3p' somefile
$ awk 'NR%3==0'
$ perl -ne 'print ((0 == $. % 3) ? $_ : "")'

all fail at some point and start shifting in the sequence after a certain number of lines (probably an integer overflow).

Are there any other commands I should try which should be able to work for the entire file?

Thanks!
Doug
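[For reference, the three-way split can also be done in a single awk pass; a sketch, where the output names part1/part2/part3 are illustrative and a tiny demo input stands in for the real data file:]

```shell
# Demo input; in the real case this would be the big data file.
printf '1\n2\n3\n4\n5\n6\n' > somefile

# Round-robin: line 1 -> part1, line 2 -> part2, line 3 -> part3, line 4 -> part1, ...
awk '{ print > ("part" ((NR - 1) % 3 + 1)) }' somefile
```

This reads the input exactly once instead of scanning it three times.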
 
Old 08-03-2009, 08:04 PM   #2
hasienda
Member
 
Registered: May 2009
Location: Saxony, Germany
Distribution: Debian/GNU Linux
Posts: 36

Rep: Reputation: 18
Post a simple start

Quote:
Originally Posted by doug23
I have a large text file with over a million lines, and I need to split the file by every N lines [...] Are there any other commands I should try which should be able to work for the entire file?
Check out csplit ('man csplit') at the console. Something like
Code:
$> cd /home/user
$> csplit -k --prefix=smallpart ./verybigfile 3 {*}
$> for i in ./smallpart*; do head -n 1 "$i"; done > ./lines147etc.txt
$> for i in ./smallpart*; do head -n 2 "$i" | tail -n 1; done > ./lines258etc.txt
$> for i in ./smallpart*; do tail -n 1 "$i"; done > ./lines369etc.txt
might do it for you.

Caveats: When I tested this, csplit produced the first file ./smallpart00 with only _two_ lines. If there are fewer than 3 lines in the last small file, its last line will still get added to one or another of the three collected files, so you would need to edit them accordingly. Watch out for line-order issues (the shell glob sorts smallpart100 before smallpart11, for example). I've found the old limit of at most 100 splits is _not_ valid any longer, at least not for the version distributed with Debian's GNU coreutils 6.10-6.
 
Old 08-03-2009, 08:05 PM   #3
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by doug23
I have a large text file with over a million lines, and I need to split the file by every N lines [...] Are there any other commands I should try which should be able to work for the entire file?

So why "probably"? I.e., why wouldn't you write slightly more code and establish the exact root cause? Do you want us to do the debugging?
 
Old 08-03-2009, 09:01 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,467

Rep: Reputation: 846
Quote:
Originally Posted by doug23
I have a large text file with over a million lines, and I need to split the file by every N lines.
How many millions? I tried your sed command on a 5-million-line file (generated with seq $((5 * 1000 * 1000))), and it seemed to work just fine. That is, pasting all 3 parts back together reproduced the original file, plus 2 extra blank lines.
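[ntubski's round-trip check can be reproduced end to end; a sketch of the same idea, with illustrative file names:]

```shell
# Build a known 5-million-line file, split it three ways, interleave the
# parts again, and compare against the original.
seq $((5 * 1000 * 1000)) > big
sed -n '1~3p' big > part1      # lines 1, 4, 7, ...
sed -n '2~3p' big > part2      # lines 2, 5, 8, ...
sed -n '3~3p' big > part3      # lines 3, 6, 9, ...
# paste pads the shorter files with blank lines at the end, hence the head:
paste -d'\n' part1 part2 part3 | head -n "$(wc -l < big)" | cmp -s - big && echo "round trip OK"
```

If the sed addressing drifted anywhere in 5 million lines, the cmp would fail.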
 
Old 08-04-2009, 02:58 PM   #5
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by Sergei Steshenko View Post
So, why "probably" ? I.e. why wouldn't you write slightly more code and establish the exact root cause ? You want us to do the debugging ?
Because, Sergei, I do not know how to debug Linux code.
 
Old 08-04-2009, 03:05 PM   #6
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by ntubski
How many millions? I tried your sed command on a 5-million line file [...] it seemed to work just fine.
Unfortunately, I can guarantee you that sed does not work properly here. Each group of three rows follows this format:

SomeNumber choice_of_three_words text --->
SomeNumber choice_of_two_words text --->
SomeNumber one_word text --->

Every time I have tried the sed command, one of the three result files ends up with a mix of words, starting about 22,000 rows down, that could never otherwise appear in that file. I have checked the original data file to make sure the problem is not in the source data.

hasienda -- are the very last lines the only ones I need to check?

Thank you very much for your help,
Doug
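[One way to narrow down where the drift begins is to list the distinct second-column words in each result file; a sketch, assuming whitespace-separated columns and the output names from hasienda's csplit approach (the demo lines below are made up):]

```shell
# Fake split outputs standing in for the real ones.
printf '1 alpha text\n4 beta text\n' > lines147etc.txt
printf '2 gamma text\n' > lines258etc.txt
printf '3 delta text\n' > lines369etc.txt

# List the distinct second-column words per file; a clean split should show
# only the vocabulary expected for that file.
for f in lines147etc.txt lines258etc.txt lines369etc.txt; do
    printf '%s: %s\n' "$f" "$(awk '{ print $2 }' "$f" | sort -u | tr '\n' ' ')"
done
```

A stray word in the wrong file's list pinpoints which stream picked up the shifted lines.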
 
Old 08-04-2009, 03:34 PM   #7
jbo5112
LQ Newbie
 
Registered: Jan 2009
Posts: 20

Rep: Reputation: 1
bash script

This script appends to any existing files, takes its input from stdin, takes the output file names as arguments, and doesn't bother checking for correct usage, but it works to split the lines in round-robin fashion.

Code:
#!/bin/bash

my_file[0]="$1"
my_file[1]="$2"
my_file[2]="$3"
fail=0

while [ "$fail" -lt 1 ]; do
    for ((x=0; x<3; ++x)); do
        if IFS= read -r my_line; then    # -r: don't mangle backslashes
            echo "$my_line" >> "${my_file[$x]}"
        else
            fail=1
        fi
    done
done
# usage: ./my_script out1 out2 out3 < input

Last edited by jbo5112; 08-04-2009 at 03:38 PM.
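[As a side note, much later versions of GNU split (coreutils 8.8 and newer, so not available when this thread was written) can deal lines round-robin directly with -n r/K; a sketch with a demo input, where "out" is an arbitrary prefix:]

```shell
# Demo input standing in for the real data file.
printf '1\n2\n3\n4\n5\n6\n' > input

# Deal lines round-robin into 3 numbered files: out00, out01, out02
split -n r/3 -d input out
```

This does in one built-in command what the bash loop above does by hand.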
 
Old 08-04-2009, 04:19 PM   #8
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by doug23 View Post
Because Sergei I do not know how to debug linux code.
Nonsense. Your code has nothing to do with Linux.

For example, turn your Perl one-liner into a full-blown script and debug it.

Here is a Perl distribution for Windows, for example:

http://strawberryperl.com/ ->
http://strawberryperl.com/releases.html ->
http://strawberryperl.com/download/s...6-portable.zip .
 
Old 08-10-2009, 06:08 PM   #9
doug23
LQ Newbie
 
Registered: Aug 2009
Posts: 18

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jbo5112
This will append to any existing files, takes input from stdin, takes file names as arguments, and doesn't bother checking for correct usage, but it works to split the lines in a round-robin fashion. [...]
Worked GREAT! Thank you very much for your help!

Doug
 
  

