LinuxQuestions.org


vache 04-06-2009 12:17 PM

Read large text files (~10GB), parse for columns, output
 
Hello, world.

The Goal: Read in ASCII text files, parse out specific columns, send to standard out.

I'm currently using a simple awk {print $1, ...} script to accomplish this. It's all fine and good, but the files I'm reading in are massive (10GB is not uncommon in our environment) and I speculate (hope) that a C or C++ application can parse these files faster than awk.

My C-fu is weak at best (featured below is my "101" level C code mushed together after lots of Google searches - ha) and it's actually slower than awk. If it matters, I have access to some very powerful hardware (64 bit quad core Xeon, 6GB of RAM).

What alternatives are there to fopen/etc. for reading in large files and parsing? Thanks in advance.

Code:

#include <stdio.h>
#include <string.h>

int main( int argc, char *argv[] )
{
    /* No file supplied? */
    if ( argc == 1 )
    {
        puts( "\nYou must supply a file to parse\n" );
        return 1;
    }

    /* Open the file, read-only */
    FILE *file = fopen( argv[1], "r" );

    /* If the file could be opened... */
    if ( file != NULL )
    {
        char line[256];      /* note: lines longer than 255 chars will be split */
        char del[] = " \n";  /* strip the trailing newline along with the spaces */

        /* While we can read a line from the file... */
        while ( fgets( line, sizeof line, file ) != NULL )
        {
            /* Convert each line into tokens */
            char *result = strtok( line, del );
            int tkn = 1;

            /* "Foreach" token... */
            while( result != NULL )
            {
                /* If tkn matches our list, then print */
                /* $1, $2, $3, $4, $6, $11, $12, $13 */
                if  (
                        tkn == 1 || tkn == 2 || tkn == 3 ||
                        tkn == 4 || tkn == 6 || tkn == 11 ||
                        tkn == 12 || tkn == 13
                    )
                {
                    printf( "%s ", result );
                }
                tkn++;
                result = strtok( NULL, del );
            }
            /* One output line per input line, as awk's print would produce */
            putchar( '\n' );
        }
        fclose( file );
    } else {
        /* Report the failure instead of silently printing the file name */
        fprintf( stderr, "Could not open %s\n", argv[1] );
        return 1;
    }
    return 0;
}
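
As an illustration of the kind of alternative being asked about (a sketch, not code from the thread): the usual suspects here are the 256-byte line buffer, which silently splits long lines, and stdio's default buffer size. The version below keeps the same column selection but reads with POSIX getline(), which grows its buffer to fit any line, and hands stdio a larger buffer via setvbuf(); the 1 MiB size is an arbitrary illustrative choice.

Code:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( int argc, char *argv[] )
{
    if ( argc != 2 )
    {
        fprintf( stderr, "Usage: %s FILE\n", argv[0] );
        return 1;
    }

    FILE *file = fopen( argv[1], "r" );
    if ( file == NULL )
    {
        perror( argv[1] );
        return 1;
    }

    /* Give stdio a bigger input buffer; 1 MiB is an arbitrary choice. */
    static char iobuf[1 << 20];
    setvbuf( file, iobuf, _IOFBF, sizeof iobuf );

    char *line = NULL;
    size_t cap = 0;

    /* getline() reallocates its buffer as needed, so long lines are never split. */
    while ( getline( &line, &cap, file ) != -1 )
    {
        int tkn = 1;
        for ( char *tok = strtok( line, " \n" ); tok != NULL;
              tok = strtok( NULL, " \n" ), tkn++ )
        {
            if ( tkn == 1 || tkn == 2 || tkn == 3 || tkn == 4 ||
                 tkn == 6 || tkn == 11 || tkn == 12 || tkn == 13 )
                printf( "%s ", tok );
        }
        putchar( '\n' );
    }

    free( line );
    fclose( file );
    return 0;
}

Whether this beats awk in practice is another matter; as the replies below point out, the run time is usually dominated by disk I/O rather than by parsing.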


Telemachos 04-06-2009 12:40 PM

What in the world could you do to a text file to make it 10GB? Wow.

Maybe I'm missing something, but if the main issue is simply that you can't load the whole file into memory at once, any solution would work if it read line by line. There are good, straightforward ways to do this in many scripting languages (Perl or Python, for example), and a higher level language would allow you to leverage very powerful built-in string techniques. I guess what I'm saying is that I don't know if C would offer a significant speed increase. The main speed issue is just going through all the lines, it seems, rather than any lower level algorithm. I'll be curious to hear if others know better.

Sergei Steshenko 04-06-2009 12:48 PM

Quote:

Originally Posted by vache (Post 3500207)
Hello, world.

The Goal: Read in ASCII text files, parse out specific columns, send to standard out.

I'm currently using a simple awk {print $1, ...} script to accomplish this. It's all fine and good, but the files I'm reading in are massive (10GB is not uncommon in our environment) and I speculate (hope) that a C or C++ application can parse these files faster than awk.

My C-fu is weak at best (featured below is my "101" level C code mushed together after lots of Google searches - ha) and it's actually slower than awk. If it matters, I have access to some very powerful hardware (64 bit quad core Xeon, 6GB of RAM).

What alternatives are there to fopen/etc. for reading in large files and parsing? Thanks in advance.


I don't think you have grounds to believe that your "C" code will be faster than awk or Perl.

vache 04-06-2009 01:50 PM

Quote:

Originally Posted by Telemachos (Post 3500238)
What in the world could you do to a text file to make it 10GB? Wow.

Infrastructure hardware on a class A network :)

vache 04-06-2009 01:51 PM

Quote:

Originally Posted by Sergei Steshenko (Post 3500248)
I don't think you have grounds to believe that your "C" code will be faster than awk or Perl.

Hmm?

Sergei Steshenko 04-06-2009 01:58 PM

Quote:

Originally Posted by vache (Post 3500312)
Hmm?

Because, for example, Perl uses a highly optimized regular-expression engine, and is in general highly optimized for text parsing.

Telemachos 04-06-2009 02:32 PM

@ Vache: Think about what you're doing here:
  • Open a file.
  • Start a loop which takes one line at a time from the file, saves state and ends when you hit EOF.
  • Check each line inside the loop for a match against a number of expressions.
  • Print the line if you hit a match and move on to the next line in the file. (I assume you want to print when you hit the first match and then skip the rest of the tests. No reason to keep testing the same line after you've hit a match.)
What I think Sergei is saying, and I am certainly saying, is that you don't have any special reason to think C will be significantly faster than Perl at doing those things. In addition, if developer time matters to you, then Perl is potentially faster (to develop) since it's built to handle strings, lines and regular expressions. Edit: That said, maybe I'm missing something obvious. I do that all the damn time.
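
One concrete micro-optimization that follows from that last point, adapted to the column-extraction case (a sketch, assuming column 13 is the highest one wanted, as in the posted code): once that column has been handled there is no reason to keep tokenizing the rest of the line.

Code:

#include <stdio.h>
#include <string.h>

/* Print the selected whitespace-separated columns of one line, stopping
   as soon as the last interesting column (13) has been handled. */
static void print_columns( char *line )
{
    int tkn = 1;
    for ( char *tok = strtok( line, " \n" ); tok != NULL;
          tok = strtok( NULL, " \n" ), tkn++ )
    {
        if ( tkn == 1 || tkn == 2 || tkn == 3 || tkn == 4 ||
             tkn == 6 || tkn == 11 || tkn == 12 || tkn == 13 )
            printf( "%s ", tok );
        if ( tkn == 13 )   /* nothing beyond column 13 is needed */
            break;
    }
    putchar( '\n' );
}

int main( void )
{
    char demo[] = "a b c d e f g h i j k l m n o p q r s t";
    print_columns( demo );   /* prints: a b c d f k l m */
    return 0;
}

On lines with many trailing columns this saves some strtok() work, though it does nothing about the disk I/O discussed later in the thread.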

jglands 04-06-2009 02:56 PM

Why not use VB?

Sergei Steshenko 04-06-2009 03:18 PM

Quote:

Originally Posted by jglands (Post 3500366)
Why not use VB?

The OP mentions awk, so he's most likely on a UNIX-like system, and VB is unavailable.

jglands 04-06-2009 03:23 PM

He should install windows then.

Sergei Steshenko 04-06-2009 06:42 PM

Quote:

Originally Posted by jglands (Post 3500387)
He should install windows then.

What for? And why pay money for an OS which is definitely not necessary for the task?

syg00 04-06-2009 07:24 PM

I/O is your problem - plain and simple. No matter how fast your CPU is, they all wait (for I/O completion in this case) at the same speed.

Go parse the first Gig of the data (only) - then go do it again. See the difference; that's caching versus real I/O.
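
One rough way to see that difference from a C program (a sketch; the file path comes from the command line, and on older glibc you may need to link with -lrt for clock_gettime): read the same file twice and time each pass. A file that fits in RAM is served largely from the page cache on the second pass; a 10GB file on a 6GB machine cannot be.

Code:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Read the whole file once, discarding the data, and return elapsed seconds. */
static double timed_pass( const char *path )
{
    char buf[1 << 16];
    struct timespec t0, t1;

    FILE *f = fopen( path, "r" );
    if ( f == NULL )
    {
        perror( path );
        return -1.0;
    }

    clock_gettime( CLOCK_MONOTONIC, &t0 );
    while ( fread( buf, 1, sizeof buf, f ) > 0 )
        ;   /* just pull the data through, byte for byte */
    clock_gettime( CLOCK_MONOTONIC, &t1 );

    fclose( f );
    return ( t1.tv_sec - t0.tv_sec ) + ( t1.tv_nsec - t0.tv_nsec ) / 1e9;
}

int main( int argc, char *argv[] )
{
    if ( argc != 2 )
    {
        fprintf( stderr, "Usage: %s FILE\n", argv[0] );
        return 1;
    }
    /* Pass 1 is mostly real disk I/O; pass 2 hits the cache, if the file fits. */
    printf( "pass 1: %.2f s\n", timed_pass( argv[1] ) );
    printf( "pass 2: %.2f s\n", timed_pass( argv[1] ) );
    return 0;
}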

Sergei Steshenko 04-07-2009 04:02 AM

Quote:

Originally Posted by syg00 (Post 3500582)
I/O is your problem - plain and simple. No matter how fast your CPU is, they all wait (for I/O completion in this case) at the same speed.

Go parse the first Gig of the data (only) - then go do it again. See the difference; that's caching versus real I/O.

That's right, the OS does wonders using smart caching, but for big files one can't fool nature.

jglands 04-07-2009 08:37 AM

He should use windows, because you get what you pay for. If it's free it must be junk.

Telemachos 04-07-2009 09:39 AM

Before anyone gets all riled up: please don't feed the troll.

int0x80 04-07-2009 10:20 AM

Quote:

Originally Posted by jglands (Post 3501078)
He should use windows, because you get what you pay for. If it's free it must be junk.

I have heard of this thing called the "10% Rule". Basically, you have to be smarter than 10% of all 4-year-olds to be able to use Linux.

jglands 04-07-2009 10:24 AM

That comes from a pimple faced teenager who has no life. Use windows and you don't have to spend your nights home alone.

int0x80 04-07-2009 10:28 AM

This coming from an MCSE who gets down on his knees and prays to his gods: Ballmer and Gates. Use Linux and you don't have to spend your weekends re-installing your relatives' computers. Antivirus 2009 LOLOLOLOL.

jglands 04-07-2009 10:31 AM

I got my MCSE in six months and at least I can pronounce my guy's name. What kind of guy has the name of the thumb-sucker from Peanuts?

int0x80 04-07-2009 10:34 AM

Oh I know let's re-use the same horrible kernel over and over and just put a different UI over it. Leave the real computer science to the computer scientists and enjoy your sheltered existence at the help desk.

jglands 04-07-2009 10:37 AM

If it works why create a new kernel? At least I have a job. Most companies don't use linux and if they do they use it because they have no real budget. So how well does McDonalds pay?

int0x80 04-07-2009 10:39 AM

McDonalds is a multinational corporation with more locations than whatever lame .NET fail company you work for. Which business will survive the recession?

Telemachos 04-07-2009 10:41 AM

@ int0x80: jglands has posted only to this thread and only to troll. Please stop feeding him.

jglands 04-07-2009 10:42 AM

Microsoft has billions in the bank and sells their products for a good profit. How much profit do you get for a Linux download/big mac? More Linux companies have gone under than Microsoft has sold copies of Windows.

.NET is setting the standard out there. If the original poster was smart he would use that over C or PERL. It's so much better!

ghostdog74 04-07-2009 10:42 AM

where's the moderator?

int0x80 04-07-2009 10:44 AM

If the OP paid for a Microsoft OS/compiler/rip-off, would they solve his query for free? Or would they try to nickel and dime even more money out of him? More like Windows Genuine FAILAGE, imo.

sundialsvcs 04-07-2009 10:44 AM

:rolleyes: Stick to the subject, please... "Cheap beer and forums do not mix."

No, it probably won't be "better than awk."

"awk" is a very well-written program that is specialized for doing what you are doing.

All of the delays associated with this task will be mechanical ones: disk I/O times and network time. But "awk" knows to tell the operating-system that the file is being read sequentially, and therefore the operating system will know how to line-up lots of file buffers and other tricks to streamline the operation as much as the hardware will allow.

If the time required to do this task is problematic to the business, then there are various things that you can do:
  1. Invest in fast storage-hardware... SATA, FireWire.
  2. Instead of using the disk controllers built into the motherboard, buy a controller card. An inexpensive unit can make a dramatic difference.
  3. Put the input file and the output file on different disk volumes.
  4. Do not follow the siren that says, "put it all in memory..." Abandon all hope, ye who enter there!
Face it: when you're dealing with 10 gigabytes of data, "some things take time." If you're doing the task in "awk," and doing it well, then you are using a robust tool that was specifically designed for the task. You have not erred in the approach that you are using right now. "Diddling with it" will not improve it.
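
On the "awk knows to tell the operating system the file is being read sequentially" point: whether or not awk itself issues such a hint, a C program can give the kernel that information explicitly with posix_fadvise(2). A minimal sketch follows; the hint is purely advisory, and readahead behaviour remains up to the kernel.

Code:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main( int argc, char *argv[] )
{
    if ( argc != 2 )
    {
        fprintf( stderr, "Usage: %s FILE\n", argv[0] );
        return 1;
    }

    FILE *file = fopen( argv[1], "r" );
    if ( file == NULL )
    {
        perror( argv[1] );
        return 1;
    }

    /* Advise the kernel that the whole file (offset 0, length 0 = to EOF)
       will be read sequentially, so it can do more aggressive readahead. */
    int err = posix_fadvise( fileno( file ), 0, 0, POSIX_FADV_SEQUENTIAL );
    if ( err != 0 )
        fprintf( stderr, "posix_fadvise: %s\n", strerror( err ) );

    /* ... the normal line-by-line processing loop would go here ... */

    fclose( file );
    return 0;
}

The same call with POSIX_FADV_DONTNEED can be used after processing a chunk, so a 10GB pass doesn't evict everything else from the page cache.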

Telemachos 04-07-2009 10:45 AM

For the record, it would be unfortunate to lock the whole thread. The question (How do I deal with a mega-sized file and the associated I/O problems?) is a serious one and deserves some discussion.

jglands 04-07-2009 10:50 AM

Well he would at least have support? What does he have from linux now? Some pimple faced kids telling him he is wrong instead of helping him.

int0x80 04-07-2009 10:50 AM

Quote:

Originally Posted by sundialsvcs (Post 3501222)
:rolleyes: Stick to the subject, please... "Cheap beer and forums do not mix."

Sorry, I just get frustrated when people reply with stupid responses that are irrelevant to the original issue ("use an interpreted language", "perl can do regex", "windows > linux", etc). The last one strikes a nerve as you can imagine ;]

Sergei Steshenko 04-07-2009 10:51 AM

Quote:

Originally Posted by sundialsvcs (Post 3501222)
:rolleyes: Stick to the subject, please... "Cheap beer and forums do not mix."


By the way, if I understood the OP correctly, the lines are independent, i.e. line-by-line parsing should be OK.

If that's the case, then the very first legitimate question is: "Why is it a single 10GB file and not a number of much smaller files?".

The point is that a number of files may be stored on separate hard drives, and better yet the drives can be connected to different CPUs, so the whole processing can be done in parallel and the results can be merged.
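
A minimal sketch of that parallel idea, assuming the data really does arrive as several independent files and the relative order of output between files does not matter: fork one worker per input file, let each write its extracted columns to its own output file, then merge (e.g. concatenate) the results. The extract_columns() body and the ".columns" output naming are placeholders for the per-file loop discussed earlier.

Code:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the real per-file work: read `in`, write the selected
   columns to `out`. The single-file loop from earlier posts would go here. */
static void extract_columns( const char *in, const char *out )
{
    fprintf( stderr, "worker %d: %s -> %s\n", (int)getpid(), in, out );
}

int main( int argc, char *argv[] )
{
    /* One worker process per input file named on the command line. */
    for ( int i = 1; i < argc; i++ )
    {
        pid_t pid = fork();
        if ( pid == 0 )                          /* child */
        {
            char out[4096];
            snprintf( out, sizeof out, "%s.columns", argv[i] );
            extract_columns( argv[i], out );
            _exit( 0 );
        }
        else if ( pid < 0 )
        {
            perror( "fork" );
            return 1;
        }
    }

    /* Wait for every worker; the per-file outputs can then be merged
       in whatever order the downstream consumer needs. */
    while ( wait( NULL ) > 0 )
        ;
    return 0;
}

Whether this actually helps depends on the files sitting on independent spindles; several sequential readers competing for one disk can easily be slower than a single pass.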

int0x80 04-07-2009 10:52 AM

Quote:

Originally Posted by jglands (Post 3501233)
Well he would at least have support? What does he have from linux now? Some pimple faced kids telling him he is wrong instead of helping him.

I don't see any .NET devs on here showing him the way???

FAIL

jglands 04-07-2009 10:53 AM

That's right. He asked here instead of someplace that will help him.

int0x80 04-07-2009 10:55 AM

Quote:

Originally Posted by jglands (Post 3501240)
That's right. He asked here instead of someplace that will help him.

Just because some of the members on here (Telemachos, Sergei) can't read or aren't smart enough to solve problems before posting doesn't mean that the entire community is worthless. LQ is representative of the internet with people of varying levels of intelligence. Some are stars (sundialsvcs), and others have no light on upstairs (jglands).

jglands 04-07-2009 10:59 AM

1 Attachment(s)
Quote:

Originally Posted by int0x80 (Post 3501246)
Some are stars (sundialsvcs), and others have no light on upstairs (jglands).

Just because I have no hair doesn't mean my lights are not on. I could look like this guy.

int0x80 04-07-2009 11:01 AM

1 Attachment(s)
Quote:

Originally Posted by jglands (Post 3501251)
Just because I have no hair doesn't mean my lights are not on. I could look like this guy.

Unfortunately you look like this...

Telemachos 04-07-2009 11:01 AM

Quote:

Originally Posted by int0x80 (Post 3501246)
Just because some of the members on here (Telemachos, Sergei) can't read or aren't smart enough to solve problems before posting doesn't mean that the entire community is worthless.

Charming. Sergei and I said essentially the same thing as sundialsvcs, though I admit he said it more fully. What we all said was that the OP's C code was unlikely to beat a pre-existing tool (awk, Perl, Python, whatever) because the big issue was the simple math of the filesize.

jglands 04-07-2009 11:02 AM

1 Attachment(s)
Ok Code Monkey

vache 04-07-2009 11:03 AM

So, the short answer to my question is "no". Thanks :)

int0x80 04-07-2009 11:06 AM

Quote:

Originally Posted by jglands (Post 3501257)
Ok Code Monkey

Clearly that is an image of your kin. Notice the `language name="C#"` in your image.

FAIL

Maybe one day the MS crowd will evolve to Linux.

jglands 04-07-2009 11:06 AM

See the guy has given up on Linux. About time.

int0x80 04-07-2009 11:08 AM

The solution was not "use VB". You lose.

jglands 04-07-2009 11:08 AM

You just wish your stuff could be as good as C#. Good luck with finding your answer Vache. You won't get an answer from these preteens.

Sergei Steshenko 04-07-2009 11:09 AM

Quote:

Originally Posted by Telemachos (Post 3501254)
Charming. Sergei and I said essentially the same thing as sundialsvcs, though I admit he said it more fully. What we all said was that the OP's C code was unlikely to beat a pre-existing tool (awk, Perl, Python, whatever) because the big issue was the simple math of the filesize.

I once incidentally looked into Perl's regular-expression code, which is a derived work of a standard RE library.

The most frequent comment was "we are doing/have changed this and that for efficiency reasons".

int0x80 04-07-2009 11:10 AM

You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.

vache 04-07-2009 11:13 AM

Oh, a little more info about the logs:

The log files are cycled out daily. There are ~100 of them, ranging from a few megabytes to over 10GB in size, all raw text. They all have the same number of columns. The data I wanted out of each file were specific columns, with no bias toward the actual content inside the respective column (hence the simple awk statement).

I've done benchmarks using Perl, Ruby, PHP, and Python scripts (all were slower than awk). My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all.

I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't.

Thanks for the interesting feedback.

jglands 04-07-2009 11:20 AM

Quote:

Originally Posted by int0x80 (Post 3501281)
You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.

Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.

int0x80 04-07-2009 11:22 AM

Quote:

Originally Posted by jglands (Post 3501293)
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.

So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL

jglands 04-07-2009 11:25 AM

Quote:

Originally Posted by int0x80 (Post 3501294)
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL

You're just jealous because your coding in Linux isn't making you a dime.

WIN

int0x80 04-07-2009 11:30 AM

McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.

WIN

