LinuxQuestions.org

adymroxx 02-14-2007 03:10 PM

Detecting Blank Lines in C
 
I want to convert a large text file (180,000+ lines) of article titles and abstracts into a file where the title is on one line and all of the abstract text is on the next line and repeat for all of the title/abstract combos in the file. My solution is to create a C program that reads in a line at a time into a 150 character buffer before formatting it back to the output stream.

What I need to know is how to detect when it has read a blank line. Right now the only thing separating the articles is a blank line consisting of just a carriage return.

Thanks for the help!

wjevans_7d1@yahoo.co 02-14-2007 03:47 PM

First, you should be careful about your definition of a "blank line". Were these lines entered by human beings? If so, is it possible that a line might contain nothing but one or more spaces and/or tabs? If tabs have no special meaning, then what I usually do, regardless of language, is to convert each tab into a space. Then I treat all consecutive spaces as a single space. Then I ignore any space at the beginning and at the end of a line. Only then would I consider whether the line is "blank".

Lines either end with a line feed (0x0A) or a carriage return line feed pair (0x0D followed by 0x0A). The first way is the Unix way; the second way is the Microsoft (spit) way. Just to make things easy, I usually ignore all 0x0D's that I find. This will work for everything except classic (pre-OS X) Macintosh files, where each line ends with 0x0D but no 0x0A.

So an empty line (once you've taken care of squeezing any spaces and tabs that you wish to) is any 0x0A (line feed) which meets one of these two qualifications:
  1. It's at the beginning of the whole file.
  2. It's preceded by another 0x0A.
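
The rule above can be sketched in C roughly as follows. This is a minimal sketch, not code from the thread: the function name count_blank_lines is hypothetical, and it assumes that spaces, tabs and 0x0D bytes carry no meaning and can simply be skipped.

```c
#include <stdio.h>

/* Hypothetical helper: count the lines that are empty once carriage
   returns, spaces and tabs have been ignored. */
long count_blank_lines(FILE *in)
{
    long blank = 0;
    int at_line_start = 1;   /* 1 at the start of the file and after each 0x0A */
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\r' || c == ' ' || c == '\t')
            continue;                 /* assumption: these bytes are insignificant */
        if (c == '\n') {
            if (at_line_start)
                blank++;              /* 0x0A at file start or right after another 0x0A */
            at_line_start = 1;
        } else {
            at_line_start = 0;        /* the line holds real content */
        }
    }
    return blank;
}
```

Because it reads one character at a time, this version needs no line buffer, so line length does not matter.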

By the way, you could code and debug this about 20 times as fast in Perl.

Hope this helps.

oneandoneis2 02-14-2007 04:37 PM

If the blank line is, as you say, just a carriage return, and you've copied the line into a character array, a simple test of whether "array[0] == '\n'" is all it would take to test for a blank line. . .
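
As a small sketch of that test: it applies when the line was read with fgets(), which keeps the trailing newline in the buffer (gets() strips it). The helper name is_blank_line is hypothetical, and the extra "\r\n" check is an assumption added for files with DOS line endings.

```c
/* Hypothetical helper: returns 1 if a line read with fgets() holds nothing
   but its line terminator, 0 otherwise. */
int is_blank_line(const char *line)
{
    if (line[0] == '\n')
        return 1;                                /* Unix line ending */
    if (line[0] == '\r' && line[1] == '\n')
        return 1;                                /* assumption: DOS line ending */
    return 0;
}
```

A line containing only spaces or tabs is not reported as blank by this check.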

adymroxx 02-14-2007 04:50 PM

I found a method for it to work.
Code:

char word[200];

gets(word);   /* no bounds check: overflows word if a line has 200 or more characters */
if (strcmp(word, "") == 0) { /* blank line: print title */ }

I know the dangers of gets(), but this is a one-time-use application that only I will use, so I figured what the heck.
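
For anyone reusing the idea, a bounded variant built on fgets() might look like the sketch below. It is not the code actually used here: the name reformat_records and the constant LINE_MAX_LEN are hypothetical, and it assumes that no input line is longer than the buffer and that the first non-blank line of each record is the title.

```c
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 1024   /* assumption: no input line is longer than this */

/* Hypothetical helper: write each record as a title line followed by one
   line holding the whole abstract. Records are separated by blank lines. */
void reformat_records(FILE *in, FILE *out)
{
    char line[LINE_MAX_LEN];
    int lines_in_record = 0;   /* 0 means the next non-blank line is a title */

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip LF or CR LF */
        if (line[0] == '\0') {                /* blank line: end of record */
            if (lines_in_record > 0)
                fputc('\n', out);             /* terminate the abstract line */
            lines_in_record = 0;
            continue;
        }
        if (lines_in_record == 0)
            fprintf(out, "%s\n", line);       /* title on its own line */
        else                                  /* join abstract lines with a space */
            fprintf(out, "%s%s", lines_in_record > 1 ? " " : "", line);
        lines_in_record++;
    }
    if (lines_in_record > 0)
        fputc('\n', out);                     /* last record had no trailing blank line */
}
```

Calling reformat_records(stdin, stdout) would let the program be used as a filter, e.g. ./a.out < in.txt > out.txt.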

Thanks for your help!

varun_shrivastava 02-15-2007 12:54 AM

You can use

char buff[150];
fgets(buff, sizeof buff, <file pointer>);
**then test as adymroxx does above, but compare against "\n", because fgets() keeps the newline**

Or you can use this command if you don't want a C program:

grep -v '^$' ./hello.txt > hello1.txt
This command copies every line from hello.txt to hello1.txt except the empty ones.

matthewg42 02-15-2007 01:18 AM

Perl, Python or awk are MUCH better suited for this sort of thing than C.

Can you provide an example of two consecutive input records? (Copy and paste them into [code] tags to preserve the original formatting.)

abd_bela 02-23-2007 03:50 AM

deleting blank lines
 
I think it is simpler to do it with a shell command (grep, for example); you can call the command from C using system().

grep -v "^$" fileIn.txt > fileOut.txt

gives the file without blank lines
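
A minimal sketch of the system() route, assuming a POSIX shell and grep are on the PATH. The function name strip_blank_lines is hypothetical, and the quoting only works for paths that contain no single quotes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: run grep through the shell to copy in_path to
   out_path without empty lines. Returns 0 on success, -1 otherwise.
   Assumption: neither path contains a single quote. */
int strip_blank_lines(const char *in_path, const char *out_path)
{
    char cmd[1024];
    int n = snprintf(cmd, sizeof cmd, "grep -v '^$' '%s' > '%s'",
                     in_path, out_path);

    if (n < 0 || (size_t)n >= sizeof cmd)
        return -1;                    /* command did not fit in the buffer */

    /* grep exits non-zero when it selects no lines, so an input made up
       only of empty lines is also reported as -1 here. */
    return system(cmd) == 0 ? 0 : -1;
}
```

For example, strip_blank_lines("fileIn.txt", "fileOut.txt") runs the same command as above.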
Best regards,
bela

firstfire 02-24-2007 03:42 AM

Hello!

Here is a program that counts the blank lines on stdin.
Code:

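// blank.c
#include <stdio.h>
#define BUFLEN 5000

int main(void)
{
        char buf[BUFLEN], *p;
        int counter = 0;
        // Loop on fgets() itself: testing feof() first would process
        // the last line a second time after the final read fails.
        while( fgets(buf, BUFLEN, stdin) != NULL ){
                p=buf;
                while(*p==' ' || *p=='\t') p++; // skip spaces and tabs
                if(*p=='\r') p++;               // tolerate CR LF line endings
                if(*p=='\n') counter++;         // nothing else on the line: blank
        }
        printf("%d\n", counter);
        return 0;
}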

I compile it with optimization:
Code:

gcc -O3 blank.c
I have a 23 MB text file with 225997 lines (75321 of them blank) called `E_slvr_r.txt'.

Now I want to compare two approaches:
Code:

$ time ./a.out < E_slvr_r.txt
75321

real    0m0.104s
user    0m0.088s
sys    0m0.016s
$ time grep '^[ \t\r]*$' E_slvr_r.txt | wc -l
75321

real    0m0.170s
user    0m0.156s
sys    0m0.016s
$ # estimate relative overhead, %
$ echo '(0.170-0.104)/0.170 * 100' | bc -l
38.82352941176470588200

The time difference is about 0.07 seconds (39%), but the pipeline is much easier to write than the equivalent C program. Note also that although the input file was relatively big, I can still use the shell approach in real-time applications (e.g. a command-line dictionary/encyclopedia), because the whole pipeline ran in well under 1 second.

My laptop: 1.6GHz Intel Centrino mobile, 1 GB RAM.

wjevans_7d1@yahoo.co 02-26-2007 11:49 AM

Another disadvantage of the C program is that it will count an additional blank line if it finds a line containing exactly 5000 characters, including the line feed. The first 4999 characters will be received by one fgets(), and a NUL character will end the data in the buffer. The concluding line feed will be received by the next fgets().
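
One way around this, sketched under assumptions: remember whether the previous fgets() call stopped at a newline, so that a chunk which only continues a long line never counts as a line of its own. The name count_blank_lines_chunked and the constant CHUNK are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 16   /* deliberately small; any size of at least 2 works */

/* Hypothetical helper: count whitespace-only lines even when a line is
   longer than the buffer and arrives in several fgets() chunks. */
long count_blank_lines_chunked(FILE *in)
{
    char buf[CHUNK];
    long blank = 0;
    int line_start = 1;   /* does the next chunk begin a new line? */
    int only_ws = 1;      /* has the current line held only whitespace so far? */

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        if (line_start)
            only_ws = 1;                       /* fresh line, nothing seen yet */
        if (strspn(buf, " \t\r\n") != len)
            only_ws = 0;                       /* chunk contains real content */

        if (len > 0 && buf[len - 1] == '\n') { /* chunk completes the line */
            if (only_ws)
                blank++;
            line_start = 1;
        } else {
            line_start = 0;                    /* line continues in the next chunk */
        }
    }
    return blank;
}
```

A final whitespace-only line that lacks a terminating line feed is not counted by this sketch.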

nx5000 02-26-2007 12:29 PM

What about
Code:

time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l

sundialsvcs 02-26-2007 07:31 PM

In all seriousness, the very common task that you are undertaking can be accomplished much faster and more easily using one of the many "power tools" that are available in Linux and Unix.

For example, the awk program is specifically designed for tasks which can be generally described as "scan the file line-by-line and when you see a line that looks like this, do that."

Like all "power tool" programs, gawk takes this disarmingly simple concept and puts the whole thing "on steroids."

firstfire 02-27-2007 07:19 AM

Quote:

Originally Posted by nx5000
What about
Code:

time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l

Code:

$ time grep  '^[ \t\r]*$' E_slvr_r.txt | wc -l
75321

real    0m0.170s
user    0m0.132s
sys    0m0.028s
$ time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l
75321

real    0m0.162s
user    0m0.152s
sys    0m0.004s

So the `--mmap' flag improves performance by about 5% in my case. Thank you, nx5000! This way of speeding up reads is completely new to me; it's interesting!


All times are GMT -5.