I want to convert a large text file (180,000+ lines) of article titles and abstracts into a file where the title is on one line and all of the abstract text is on the next line and repeat for all of the title/abstract combos in the file. My solution is to create a C program that reads in a line at a time into a 150 character buffer before formatting it back to the output stream.
What I need to know is how to detect when it has read a blank line. Right now the only thing separating the articles is a blank line consisting of just a carriage return.
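For what it's worth, here is a rough sketch of the conversion described above. It is not the poster's program: the function name join_abstracts is made up for illustration, and it assumes that the first line of each article is the title, that every following line up to the next empty line belongs to the abstract, and that separator lines are truly empty (no stray spaces or tabs). It reads one character at a time instead of using a 150 character buffer, so line length does not matter.

```c
#include <stdio.h>

/* Copy articles from in to out. The first line of each article (the title)
   stays on its own line; the remaining lines (the abstract) are joined with
   single spaces into one output line. Articles are separated by empty lines.
   Carriage returns (0x0D) are dropped. */
void join_abstracts(FILE *in, FILE *out)
{
    int lines_done = 0;  /* complete lines already read in this article */
    int line_len = 0;    /* characters seen so far on the current input line */
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\r')
            continue;
        if (c != '\n') {
            /* Second or later abstract line: separate it with a space. */
            if (line_len == 0 && lines_done >= 2)
                putc(' ', out);
            putc(c, out);
            line_len++;
            continue;
        }
        if (line_len == 0) {          /* empty line: end of the article */
            if (lines_done >= 2)
                putc('\n', out);      /* terminate the joined abstract */
            lines_done = 0;
        } else {
            if (lines_done == 0)
                putc('\n', out);      /* terminate the title */
            lines_done++;
        }
        line_len = 0;
    }
    /* Terminate an abstract or partial line left open at end of file. */
    if (lines_done >= 2 || line_len > 0)
        putc('\n', out);
}
```

Calling join_abstracts(stdin, stdout) from main() would let it run as a filter, e.g. ./a.out < input.txt > output.txt.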
First, you should be careful about your definition of a "blank line". Were these lines entered by human beings? If so, is it possible that a line might contain nothing but one or more spaces and/or tabs? If tabs have no special meaning, then what I usually do, regardless of language, is to convert each tab into a space. Then I treat all consecutive spaces as a single space. Then I ignore any space at the beginning and at the end of a line. Only then would I consider whether the line is "blank".
Lines end with either a line feed (0x0A) or a carriage return/line feed pair (0x0D followed by 0x0A). The first way is the Unix way; the second way is the Microsoft (spit) way. Just to make things easy, I usually ignore all 0x0D's that I find. This will work for everything except old Macintosh files, where each line ends with 0x0D but no 0x0A.
So an empty line (once you've taken care of squeezing any spaces and tabs that you wish to) is any 0x0A (line feed) which meets one of these two qualifications:
It's at the beginning of the whole file.
It's preceded by another 0x0A.
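In C, the rules above might be sketched roughly like this. The function name count_blank_lines is invented for the example. Instead of literally squeezing the spaces and tabs and then looking for two adjacent line feeds, it keeps a flag that says whether anything other than a space or tab has been seen on the current line, which should come to the same thing. All 0x0D's are ignored, so Macintosh-style files are not handled.

```c
#include <stdio.h>
#include <stddef.h>

/* Count lines that contain nothing but spaces, tabs and carriage returns.
   Reads one character at a time, so there is no line-length limit.
   A final line with no terminating line feed is not counted. */
size_t count_blank_lines(FILE *in)
{
    size_t blank = 0;
    int only_space = 1;  /* nothing but whitespace seen on this line so far */
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\r')
            continue;                /* ignore every 0x0D */
        if (c == '\n') {
            if (only_space)
                blank++;             /* line feed ending an "empty" line */
            only_space = 1;          /* start of the next line */
        } else if (c != ' ' && c != '\t') {
            only_space = 0;
        }
    }
    return blank;
}
```

Calling count_blank_lines(stdin) from main() and printing the result would give a count for whatever file is redirected in.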
By the way, you could code and debug this about 20 times as fast in Perl.
Hope this helps.
Last edited by firstname.lastname@example.org; 02-14-2007 at 03:49 PM.
If the blank line is, as you say, just a carriage return, and you've read the line into a character array, a simple test of whether "array[0] == '\n'" is all it would take to detect a blank line.
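As a small sketch of that test, assuming the line was read with fgets() (which keeps the line feed in the buffer) and was short enough to fit in the buffer in one piece. The helper name is_blank_line is made up here, and the "\r\n" case is only an extra guard for files with DOS line endings:

```c
#include <string.h>

/* Return 1 if the buffer holds nothing but a line terminator, else 0.
   The buffer is assumed to come straight from fgets(), so a blank line
   shows up as "\n" (or "\r\n" for DOS-style files). */
int is_blank_line(const char *line)
{
    return strcmp(line, "\n") == 0 || strcmp(line, "\r\n") == 0;
}
```

In the read loop this would be used as: while (fgets(buf, sizeof buf, stdin) != NULL) { if (is_blank_line(buf)) ... }. A line holding only spaces or tabs is not treated as blank by this test.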
I have a 23 MB text file with 225997 lines (75321 of them blank) called `E_slvr_r.txt'.
Now I want to compare two approaches:
$ time ./a.out < E_slvr_r.txt
$ time grep '^[[:space:]]*$' E_slvr_r.txt | wc -l
$ # estimate relative overhead, %
$ echo '(0.170-0.104)/0.170 * 100' | bc -l
The time difference is about 0.07 seconds (39%), but the pipeline is much easier to write than the equivalent C program. Note also that although the input file was relatively big, I can still use the bash approach in real-time applications (e.g. a command-line dictionary/encyclopedia), because the time difference was less than 1 second.
My laptop: 1.6GHz Intel Centrino mobile, 1 GB RAM.
Another disadvantage of the C program is that it will count an additional blank line if it finds a line containing exactly 5000 characters, including the line feed. The first 4999 characters will be received by one fgets(), and a NUL character will end the data in the buffer. The concluding line feed will be received by the next fgets().
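One possible way around that, sketched here with an invented function name (count_empty_lines) and a caller-supplied buffer, is to remember whether the previous fgets() call ended on a line feed. A lone "\n" is then counted only when it really starts a new line. This sketch assumes Unix line endings and counts only truly empty lines, not lines made of spaces or tabs.

```c
#include <stdio.h>
#include <string.h>

/* Count empty lines using fgets() with a buffer of the given size.
   at_line_start records whether the previous chunk ended a line, so a
   lone "\n" that is only the tail of a long line is not miscounted. */
size_t count_empty_lines(FILE *in, char *buf, int size)
{
    size_t empty = 0;
    int at_line_start = 1;

    while (fgets(buf, size, in) != NULL) {
        size_t len = strlen(buf);
        int ends_line = (len > 0 && buf[len - 1] == '\n');

        /* buf is exactly "\n" and nothing came before it on this line */
        if (at_line_start && ends_line && len == 1)
            empty++;
        at_line_start = ends_line;
    }
    return empty;
}
```

With a 5000 character buffer the call would look like count_empty_lines(stdin, buf, (int)sizeof buf), where buf is declared as char buf[5000].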
Last edited by email@example.com; 02-26-2007 at 11:50 AM.