LinuxQuestions.org
Old 02-14-2007, 04:10 PM   #1
adymroxx
Member
 
Registered: Mar 2005
Location: Iowa
Distribution: Fedora Core 9
Posts: 41

Rep: Reputation: 15
Detecting Blank Lines in C


I want to convert a large text file (180,000+ lines) of article titles and abstracts into a file where the title is on one line and all of the abstract text is on the next line and repeat for all of the title/abstract combos in the file. My solution is to create a C program that reads in a line at a time into a 150 character buffer before formatting it back to the output stream.

What I need to know is how to detect when it has read a blank line. Right now the only thing separating the articles is a blank line consisting of just a carriage return.

Thanks for the help!
 
Old 02-14-2007, 04:47 PM   #2
wjevans_7d1@yahoo.co
Member
 
Registered: Jun 2006
Location: Mariposa
Distribution: Slackware 9.1
Posts: 938

Rep: Reputation: 30
First, you should be careful about your definition of a "blank line". Were these lines entered by human beings? If so, is it possible that a line might contain nothing but one or more spaces and/or tabs? If tabs have no special meaning, then what I usually do, regardless of language, is to convert each tab into a space. Then I treat all consecutive spaces as a single space. Then I ignore any space at the beginning and at the end of a line. Only then would I consider whether the line is "blank".

Lines either end with a line feed (0x0A) or a carriage return line feed pair (0x0D followed by 0x0A). The first way is the Unix way; the second way is the Microsoft (spit) way. Just to make things easy, I usually ignore all 0x0D's that I find. This will work for everything except Macintosh, where each line ends with 0x0D but no 0x0A.

So an empty line (once you've taken care of squeezing any spaces and tabs that you wish to) is any 0x0A (line feed) which meets one of these two qualifications:
  1. It's at the beginning of the whole file.
  2. It's preceded by another 0x0A.
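The whitespace rules above can be sketched in C roughly as follows. This is only a minimal illustration, not code from the thread: `is_blank_line` is a hypothetical helper name, and it assumes the line is already in a NUL-terminated buffer (for example, filled by fgets()). It treats spaces, tabs, and carriage returns as ignorable, so a line counts as blank if nothing else comes before the line feed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): returns true when the
 * line holds nothing but spaces, tabs, and carriage returns before its
 * line feed, or before the end of the string for an unterminated last
 * line. */
bool is_blank_line(const char *line)
{
    assert(line != NULL);

    /* Skip everything we agreed to ignore: space, tab, 0x0D. */
    while (*line == ' ' || *line == '\t' || *line == '\r')
        line++;

    /* Blank if the next thing is the line feed or the terminator. */
    return *line == '\n' || *line == '\0';
}
```

With this, `is_blank_line(" \t\r\n")` is true and `is_blank_line("Title\n")` is false. Classic Mac files that end lines with a bare 0x0D would still need separate handling, because fgets() does not split lines on it.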

By the way, you could code and debug this about 20 times as fast in Perl.

Hope this helps.

Last edited by wjevans_7d1@yahoo.co; 02-14-2007 at 04:49 PM.
 
Old 02-14-2007, 05:37 PM   #3
oneandoneis2
Senior Member
 
Registered: Nov 2003
Location: London, England
Distribution: Ubuntu
Posts: 1,460

Rep: Reputation: 46
If the blank line is, as you say, just a carriage return, and you've copied the line into a character array, a simple test of whether "array[0] == '\n'" is all it would take to test for a blank line. . .
 
Old 02-14-2007, 05:50 PM   #4
adymroxx
Member
 
Registered: Mar 2005
Location: Iowa
Distribution: Fedora Core 9
Posts: 41

Original Poster
Rep: Reputation: 15
I found a method for it to work.
Code:
char word[200];

gets(word);
if (strcmp(word, "\0") == 0) **print title**
I know the dangers of gets(), but this is a one-time-use application that only I will use, so I figured what the heck.

Thanks for your help!
 
Old 02-15-2007, 01:54 AM   #5
varun_shrivastava
Member
 
Registered: Jun 2006
Distribution: Ubuntu 7.04 Feisty
Posts: 79

Rep: Reputation: 15
You can use

char buff[150];
fgets(buff, sizeof buff, <FILE pointer>);
**then the same test adymroxx quoted above, but compare against "\n", because fgets() keeps the newline**

Or you can use this command if you don't want a C program:

grep -v '^$' ./hello.txt > hello1.txt
This command copies all lines from hello.txt to hello1.txt, leaving out the blank lines.
 
Old 02-15-2007, 02:18 AM   #7
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Perl, Python or awk are MUCH better suited for this sort of thing than C.

Can you provide an example of two subsequent input records? (Copy-paste into [code] tags to preserve the original formatting.)
 
Old 02-23-2007, 04:50 AM   #8
abd_bela
Member
 
Registered: Dec 2002
Location: algeria
Distribution: redhat 7.3, debian lenny
Posts: 599

Rep: Reputation: 31
deleting blank lines

I think it is simpler to do it with a shell command (grep, for example). You can call the command from C using system().

grep -v "^$" fileIn.txt > fileOut.txt

This gives the file without blank lines.
Best regards,
bela
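Calling the grep command above from C through system() might look something like this. It is a sketch under a few assumptions: a POSIX system with grep on the PATH, and file names that contain no single quote (they are pasted into a quoted shell command). `strip_blank_lines` is a hypothetical wrapper name.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>   /* WIFEXITED, WEXITSTATUS (POSIX) */

/* Hypothetical wrapper: copies in_path to out_path without empty lines
 * by running grep through the shell. Returns 0 on success, -1 on
 * failure. Assumes neither path contains a single quote. */
int strip_blank_lines(const char *in_path, const char *out_path)
{
    char cmd[1024];
    int n, status;

    assert(in_path != NULL && out_path != NULL);

    n = snprintf(cmd, sizeof cmd, "grep -v '^$' '%s' > '%s'",
                 in_path, out_path);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return -1;                /* command did not fit in the buffer */

    status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;                /* shell could not be run or was killed */

    /* grep exits with 0 if it printed lines, 1 if it printed none
     * (input was all blank), and 2 or more on a real error. */
    return WEXITSTATUS(status) <= 1 ? 0 : -1;
}
```

For example, `strip_blank_lines("fileIn.txt", "fileOut.txt")` runs the same command as above. Note that the pattern `^$` only removes truly empty lines; a line holding a carriage return or spaces survives.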
 
Old 02-24-2007, 04:42 AM   #9
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Hello!

Here is a program that counts blank lines on stdin.
Code:
// blank.c
#include <stdio.h>
#define BUFLEN 5000

int main(void)
{
	char buf[BUFLEN], *p;
	int counter = 0;
	// loop on fgets() itself: testing feof() first can process the last line twice
	while( fgets(buf, BUFLEN, stdin) != NULL ){
		p=buf;
		while(*p==' ' || *p=='\t') p++; // skip leading spaces and tabs
		if(*p=='\r') p++; // allow a CR before the LF
		if(*p=='\n') counter++;
	}
	printf("%d\n", counter);
	return 0;
}
I compile it with optimization:
Code:
gcc -O3 blank.c
I have a 23 MB text file with 225997 lines (75321 of them blank) called `E_slvr_r.txt'.

Now I want to compare two approaches:
Code:
$ time ./a.out < E_slvr_r.txt 
75321

real    0m0.104s
user    0m0.088s
sys     0m0.016s
$ time grep '^[ \t\r]*$' E_slvr_r.txt | wc -l
75321

real    0m0.170s
user    0m0.156s
sys     0m0.016s
$ # estimate relative overhead, %
$ echo '(0.170-0.104)/0.170 * 100' | bc -l
38.82352941176470588200
The time difference is about 0.07 seconds (39% of the pipeline's run time), but the pipeline is much easier to write than the equivalent C program. Note also that although the input file was relatively big, I can still use the bash approach in real-time applications (e.g. a command-line dictionary/encyclopedia), because the time difference was less than 1 second.

My laptop: 1.6GHz Intel Centrino mobile, 1 GB RAM.
 
Old 02-26-2007, 12:49 PM   #10
wjevans_7d1@yahoo.co
Member
 
Registered: Jun 2006
Location: Mariposa
Distribution: Slackware 9.1
Posts: 938

Rep: Reputation: 30
Another disadvantage of the C program is that it will count an additional blank line if it finds a line containing exactly 5000 characters, including the line feed. The first 4999 characters will be received by one fgets(), and a NUL character will end the data in the buffer. The concluding line feed will be received by the next fgets().

Last edited by wjevans_7d1@yahoo.co; 02-26-2007 at 12:50 PM.
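One way around that buffer-boundary problem is to carry state across fgets() calls, so that a long line read in several pieces is still judged as a single line. The following is a sketch, not a drop-in replacement for blank.c: `count_blank_lines` is a hypothetical name, and unlike blank.c it accepts any mix of spaces, tabs, and carriage returns before the line feed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUFLEN 5000   /* same buffer size as blank.c */

/* Hypothetical sketch: counts lines made only of spaces, tabs, and
 * carriage returns, even when a line is longer than the buffer. */
long count_blank_lines(FILE *in)
{
    char buf[BUFLEN];
    long blank = 0;
    int line_is_blank = 1;   /* no non-whitespace seen on this line yet */

    assert(in != NULL);

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        /* strspn() gives the length of the leading run of whitespace;
         * if that is shorter than the chunk, the line has real text. */
        if (strspn(buf, " \t\r\n") != len)
            line_is_blank = 0;

        /* Only a chunk that ends in a line feed completes a line. */
        if (len > 0 && buf[len - 1] == '\n') {
            if (line_is_blank)
                blank++;
            line_is_blank = 1;   /* the next chunk starts a new line */
        }
    }
    return blank;
}
```

With this version, a line of 4999 characters plus a line feed is read as two pieces, but the second piece (the lone line feed) is no longer counted, because the first piece already marked the line as non-blank. A final whitespace-only line with no line feed is not counted.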
 
Old 02-26-2007, 01:29 PM   #11
nx5000
Senior Member
 
Registered: Sep 2005
Location: Out
Posts: 3,307

Rep: Reputation: 53
What about
Code:
time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l
 
Old 02-26-2007, 08:31 PM   #12
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,455

Rep: Reputation: 1172
In all seriousness, this very-common task that you are undertaking can be accomplished much faster and easier using one of the many "power tools" that are available in Linux and Unix.

For example, the awk program is specifically designed for tasks which can be generally described as "scan the file line-by-line and when you see a line that looks like this, do that."

Like all "power tools" programs, gawk takes this disarmingly-simple concept and puts the whole thing "on steroids."
 
Old 02-27-2007, 08:19 AM   #13
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 640

Rep: Reputation: 375
Quote:
Originally Posted by nx5000
What about
Code:
time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l
Code:
$ time grep  '^[ \t\r]*$' E_slvr_r.txt | wc -l
75321

real    0m0.170s
user    0m0.132s
sys     0m0.028s
$ time grep --mmap '^[ \t\r]*$' E_slvr_r.txt | wc -l
75321

real    0m0.162s
user    0m0.152s
sys     0m0.004s
So the `--mmap' flag improves performance by about 5% in my case. Thank you, nx5000! This way of speeding up reads is completely new to me, and it's interesting!

Last edited by firstfire; 02-27-2007 at 08:20 AM.
 
  

