Detecting Blank Lines in C
I want to convert a large text file (180,000+ lines) of article titles and abstracts into a file where the title is on one line and all of the abstract text is on the next line and repeat for all of the title/abstract combos in the file. My solution is to create a C program that reads in a line at a time into a 150 character buffer before formatting it back to the output stream.
What I need to know is how to detect when it has read a blank line. Right now the only thing separating the articles is a blank line consisting of just a carriage return. Thanks for the help! |
First, you should be careful about your definition of a "blank line". Were these lines entered by human beings? If so, is it possible that a line might contain nothing but one or more spaces and/or tabs? If tabs have no special meaning, then what I usually do, regardless of language, is to convert each tab into a space. Then I treat all consecutive spaces as a single space. Then I ignore any space at the beginning and at the end of a line. Only then would I consider whether the line is "blank".
Lines either end with a line feed (0x0A) or a carriage return line feed pair (0x0D followed by 0x0A). The first way is the Unix way; the second way is the Microsoft (spit) way. Just to make things easy, I usually ignore all 0x0D's that I find. This will work for everything except Macintosh, where each line ends with 0x0D but no 0x0A. So an empty line (once you've taken care of squeezing any spaces and tabs that you wish to) is any 0x0A (line feed) which meets one of these two qualifications:
1. It is the very first character in the file.
2. It immediately follows another 0x0A.
By the way, you could code and debug this about 20 times as fast in Perl. Hope this helps. |
If the blank line is, as you say, just a carriage return, and you've copied the line into a character array, a simple test of whether "array[0] == '\n'" is all it would take to test for a blank line. . .
|
I found a method for it to work.
Code:
char word[200];
Thanks for your help! |
You can use
char buff[150]; fgets(buff, 150, <file pointer>); **same as quoted by adymroxx above**
Or, if you don't want a C program, you can use this command:
grep -v '^$' ./hello.txt > hello1.txt
This command copies all lines from hello.txt to hello1.txt, leaving out the blank lines. |
Perl, Python or awk are MUCH better suited for this sort of thing than C.
Can you provide an example of two subsequent input records? (Copy-paste them into [code] tags to preserve the original formatting.) |
deleting blank lines
I think it is simpler to do it with a shell command (grep, for example). You can call the command from C using system().
grep -v "^$" fileIn.txt > fileOut.txt
gives the file without blank lines.
Best regards, bela |
Hello!
Here is a program that counts blank lines on stdin.
Code:
// blank.c
Code:
gcc -O3 blank.c
Now I want to compare two approaches:
Code:
$ time ./a.out < E_slvr_r.txt
My laptop: 1.6GHz Intel Centrino mobile, 1 GB RAM. |
Another disadvantage of the C program is that it will count an additional blank line if it finds a line containing exactly 5000 characters, including the line feed. The first 4999 characters will be received by one fgets(), and a NUL character will end the data in the buffer. The concluding line feed will be received by the next fgets().
|
What about
Code:
time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l
|
In all seriousness, this very common task that you are undertaking can be accomplished much faster and more easily using one of the many "power tools" that are available in Linux and Unix.
For example, the awk program is specifically designed for tasks which can be generally described as "scan the file line by line, and when you see a line that looks like this, do that." Like all "power tools" programs, gawk takes this disarmingly simple concept and puts the whole thing "on steroids." |
Quote:
Code:
$ time grep '^[ \t\r]*$' E_slvr_r.txt | wc -l
|