LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 01-25-2011, 03:33 AM   #1
frater
Member
 
Registered: Jul 2008
Posts: 110

Rep: Reputation: 23
Count the line


My question is about the same as the one posed here, but I'm not satisfied with the conclusion to use 'wc -l'.
http://www.linuxquestions.org/questi...-files-583991/

I have a zabbix monitoring server and I am executing the following line each minute on each client

Code:
wc -l /proc/net/ip_conntrack
This is a piece of cake on most servers, but on one server it takes about 13 seconds, as it has about 40,000 (legitimate) connections open.

I just want to count the lines to get the number of connections. "wc -l" does more than I want it to, and I'm hoping that a program that only counts lines can do this faster.

I am not good with C. I used it 20 years ago to manipulate strings when things became too slow in the native language (Clipper). I did write some nice things in C at the time, so I wonder why I can't do it now (maybe I should try harder).

I tried to alter this program, which does a bit more than just count lines, but didn't succeed (how embarrassing).

http://www.gnu.org/software/cflow/ma...c-command.html

Could someone take a look at it?
I was thinking of calling it 'lc', and it should only return the number of '\n' characters.
Hopefully the binary is faster than 13 seconds...
 
Old 01-25-2011, 03:50 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,604

Rep: Reputation: 2934
Well I am not sure of the speed comparison, but how does something like this compare:
Code:
grep -c . /proc/net/ip_conntrack
The upside is blank lines will be skipped, the downside is a line with only white space will be counted. There are ways to counter this but of course they may slow it down.
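
To make the difference concrete, here is a minimal C sketch of the two counting rules: newline bytes (what 'wc -l' reports) versus lines that contain at least one character (roughly what 'grep -c .' reports). The function names are made up for illustration, and locale-specific behaviour of grep is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Count '\n' bytes: what `wc -l` reports. */
size_t count_newlines(const char *buf, size_t len)
{
    size_t n = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            n++;
    return n;
}

/* Count lines holding at least one character, roughly what
   `grep -c .` reports: empty lines are skipped, whitespace-only
   lines are counted, and a final line without a trailing newline
   still counts. */
size_t count_nonempty_lines(const char *buf, size_t len)
{
    size_t n = 0;
    size_t line_len = 0;            /* bytes seen since the last '\n' */

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            if (line_len > 0)
                n++;
            line_len = 0;
        } else {
            line_len++;
        }
    }
    if (line_len > 0)               /* unterminated last line */
        n++;
    return n;
}
```

For a file without blank lines and with a trailing newline, both rules give the same number.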
 
Old 01-25-2011, 04:25 AM   #3
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
I did manage to alter that C program and was able to compile an 'lc', but was disappointed to see it perform this badly. I also benchmarked the 'grep', and it performed the same as wc -l.
Here's my patched 'wc' http://pastebin.com/KUM0EwnN
I compiled it with 'gcc lc.c -o lc'

Code:
# time ./lc /proc/net/ip_conntrack
 38504 /proc/net/ip_conntrack

real    0m41.469s
user    0m0.152s
sys     0m38.682s
# time wc -l /proc/net/ip_conntrack
38059 /proc/net/ip_conntrack

real    0m10.162s
user    0m0.008s
sys     0m9.889s
# time grep -c . /proc/net/ip_conntrack
38192

real    0m10.115s
user    0m0.016s
sys     0m9.925s
 
Old 01-25-2011, 04:48 AM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Debian, Mint, Puppy, Raspbian
Posts: 3,421

Rep: Reputation: 200
wc has been developed over years by clever people.
all the bugs have been killed years ago.

I doubt very much it can be improved upon.
you are wasting your time if you think you can
improve upon basic unix tools.

I suggest your 13 seconds bottleneck is unlikely to be in wc
 
Old 01-25-2011, 04:55 AM   #5
taffy3350
LQ Newbie
 
Registered: Jan 2011
Posts: 5

Rep: Reputation: 0
Quote:
Originally Posted by bigearsbilly View Post
all the bugs have been killed years ago.
Severely doubt that. No program is completely free of bugs, since, imho, a bug can also be between the screen and the chair.

Quote:
Originally Posted by bigearsbilly View Post
I doubt very much it can be improved upon.
you are wasting your time if you think you can
improve upon basic unix tools.
If everyone thought like that then we would still be using Assembly or maybe even PunchTape.


Last edited by taffy3350; 01-25-2011 at 04:57 AM.
 
Old 01-25-2011, 07:58 AM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,348

Rep: Reputation: 1502
Code:
# time ./lc /proc/net/ip_conntrack
 38504 /proc/net/ip_conntrack

real    0m41.469s
user    0m0.152s
sys     0m38.682s
# time wc -l /proc/net/ip_conntrack
38059 /proc/net/ip_conntrack

real    0m10.162s
user    0m0.008s
sys     0m9.889s
Most of the time is in sys, which suggests that increasing the buffer size to reduce the number of calls to read(2) may help.
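
A minimal sketch of that idea, assuming a 1 MiB buffer (the size and the names LC_BUFFER_SIZE and count_lines_fd are illustrative, not taken from any posted code): each read(2) asks for a large chunk, and memchr hops from newline to newline inside it. Whether this helps on /proc/net/ip_conntrack depends on how the kernel serves that file, so it needs to be measured.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LC_BUFFER_SIZE (1024 * 1024)   /* assumed value; tune by measurement */

/* Count '\n' bytes readable from fd. Returns -1 on error. */
long count_lines_fd(int fd)
{
    long lines = 0;
    ssize_t got;
    char *buf;

    assert(fd >= 0);
    buf = malloc(LC_BUFFER_SIZE);
    if (buf == NULL)
        return -1;

    for (;;) {
        got = read(fd, buf, LC_BUFFER_SIZE);
        if (got == 0)
            break;                      /* end of file */
        if (got < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            lines = -1;
            break;
        }
        /* memchr jumps from one newline to the next inside the chunk */
        const char *p = buf;
        const char *end = buf + got;
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            ++p;
            ++lines;
        }
    }
    free(buf);
    return lines;
}
```

A small main() that opens the file named on the command line, calls count_lines_fd() and prints the result would turn this into the 'lc' tool asked for above.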
 
Old 01-25-2011, 09:33 AM   #7
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by bigearsbilly View Post
wc has been developed over years by clever people.
all the bugs have been killed years ago
I'm not suggesting at all that 'wc' has bugs or is written inefficiently.
wc can do much more than just count lines. I assume it has some code in it that can count words instead of lines, and it may repeatedly execute a bit of code to test something that never changes.

I already found this page: http://www.stokebloke.com/wordpress/...they-are-slow/
which suggests the library function 'getc' should not be used, but fread is OK.

For someone using C on a daily basis it should be a piece of cake to rewrite 'lc'

BTW, the "wc" which I patched to make "lc" is not the one which is widely used. Maybe I should get hold of that one and try to modify that (take code out) to speed it up.

I think the key is to take a big chunk of data each time you call the library function (getc, fread), put that in a piece of static memory, and count the number of '\n' characters.
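
That idea could be sketched as follows, assuming fread and a 256 KiB static buffer (the size and the names lc_buf and count_lines_stream are placeholders, not measured or taken from an existing program):

```c
#include <assert.h>
#include <stdio.h>

/* Static buffer, reused across calls (so this sketch is not thread-safe).
   The 256 KiB size is an assumption, not a measured optimum. */
static char lc_buf[256 * 1024];

/* Count '\n' bytes in an open stream with a plain loop over each chunk.
   Returns -1 on a read error. */
long count_lines_stream(FILE *fp)
{
    long lines = 0;
    size_t got;

    assert(fp != NULL);
    while ((got = fread(lc_buf, 1, sizeof lc_buf, fp)) > 0) {
        for (size_t i = 0; i < got; i++)
            if (lc_buf[i] == '\n')
                lines++;
    }
    return ferror(fp) ? -1 : lines;
}
```

The only library call left in the hot path is one fread per chunk; the counting itself is plain C.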

About improving code....
I can remember (20 years ago) speeding up a soundex() function for Clipper. They gave an example in assembly. I used my own algorithm for it and did it in C. Mine was 1000 times faster.

Last edited by frater; 01-25-2011 at 09:49 AM.
 
Old 01-25-2011, 10:27 AM   #8
marozsas
Senior Member
 
Registered: Dec 2005
Location: Campinas/SP - Brazil
Distribution: SuSE, RHEL, Fedora, Ubuntu
Posts: 1,413
Blog Entries: 1

Rep: Reputation: 65
may I suggest ...
Code:
cat -n /proc/net/ip_conntrack | tail -1 | awk '{print $1}'
"cat" is super fast and awk will receive just one line to output only the number....anyway, is just a crazy idea to test.

Last edited by marozsas; 01-25-2011 at 10:34 AM. Reason: I removed my test with /var/log/messages because I borked on copy-paste. sorry for that. never mind...
 
Old 01-25-2011, 10:33 AM   #9
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by frater View Post
I'm not suggesting at all that 'wc' has bugs or is written inefficiently.
wc can do much more than just count lines. I assume it has some code in it that can count words instead of lines, and it may repeatedly execute a bit of code to test something that never changes.
I just downloaded coreutils and took a look at the source of 'wc'.
They already have a separate loop for just counting lines, so I don't think it can be optimized that easily.

Maybe someone can still see some possibilities?
Can't it use a static piece of memory (buffer) which is then parsed and counted?
memchr is a library function; does a library function have to be used here at all?

Code:
      /* Use a separate loop when counting only lines or lines and bytes --
         but not chars or words.  */
      while ((bytes_read = safe_read (fd, buf, BUFFER_SIZE)) > 0)
        {
          char *p = buf;

          if (bytes_read == SAFE_READ_ERROR)
            {
              error (0, errno, "%s", file);
              ok = false;
              break;
            }

          while ((p = memchr (p, '\n', (buf + bytes_read) - p)))
            {
              ++p;
              ++lines;
            }
          bytes += bytes_read;
        }
 
Old 01-25-2011, 11:03 AM   #10
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by marozsas View Post
may I suggest ...
Code:
cat -n /proc/net/ip_conntrack | tail -1 | awk '{print $1}'
"cat" is super fast and awk will receive just one line to output only the number....anyway, is just a crazy idea to test.
I tested it, but "wc -l" is much faster...

Code:
# time cat -n /test.pl | tail -n1
1933524 14467 addresses are on the whitelist
real    0m0.318s
user    0m0.240s
sys     0m0.040s
# time wc -l /test.pl
1933523 /test.pl

real    0m0.093s
user    0m0.076s
sys     0m0.016s
 
Old 01-25-2011, 09:24 PM   #11
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by ntubski View Post
Most of the time is in sys, which suggests that increasing the buffer size to reduce the number of calls to read(2) may help.
I think you're right.
But isn't there also room for speed improvement by just parsing the buffer in plain C (without calling a function in the C-library)?

The problem is that I can't put it into code...
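
One way to settle the question is to measure it. The sketch below puts a hand-written loop next to a memchr loop and times both over the same in-memory buffer with clock(). All names are invented for this example, and the timings will depend on compiler, flags and C library, so no result is claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef size_t (*counter_fn)(const char *buf, size_t len);

/* Hand-written loop, no library call. */
size_t count_plain(const char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (buf[i] == '\n');
    return n;
}

/* Same job using memchr, as in the coreutils loop. */
size_t count_memchr(const char *buf, size_t len)
{
    size_t n = 0;
    const char *p = buf;
    const char *end = buf + len;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        ++p;
        ++n;
    }
    return n;
}

/* CPU seconds used by `reps` passes of fn over buf; the last count
   is stored in *result so the two functions can be cross-checked. */
double time_counter(counter_fn fn, const char *buf, size_t len,
                    int reps, size_t *result)
{
    assert(fn != NULL && result != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        *result = fn(buf, len);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Filling a buffer of a few hundred megabytes and calling time_counter() once with each function gives the user-time difference directly, without any file I/O in the way.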
 
Old 01-26-2011, 08:59 AM   #12
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,348

Rep: Reputation: 1502
Quote:
Originally Posted by frater View Post
I think you're right.
But isn't there also room for speed improvement by just parsing the buffer in plain C (without calling a function in the C-library)?
If parsing time is 0.152s that's the room for speed improvement...

Quote:
The problem is that I can't put it into code...
Have you tried setvbuf? http://pastebin.com/vsyNNTrT
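
For readers who cannot open the pastebin link, a minimal sketch of the setvbuf approach might look like this. It is not the linked code; count_lines_getc and its parameters are assumptions made for illustration. The buffer is supplied by the caller because, with a NULL buffer, the C library may choose its own size.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Count '\n' bytes in the file at `path` using getc, after giving stdio
   a fully buffered stream of `bufsize` bytes. Returns -1 on error. */
long count_lines_getc(const char *path, size_t bufsize)
{
    long lines = 0;
    int c;
    FILE *fp;
    char *buf;

    assert(path != NULL && bufsize > 0);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    buf = malloc(bufsize);
    /* setvbuf must come before any other operation on the stream */
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, bufsize) != 0) {
        fclose(fp);
        free(buf);
        return -1;
    }

    while ((c = getc(fp)) != EOF)
        if (c == '\n')
            lines++;

    if (ferror(fp))
        lines = -1;
    fclose(fp);
    free(buf);          /* only after fclose: the stream used it until now */
    return lines;
}
```

getc still runs once per byte, so this mainly reduces the number of underlying read(2) calls, not the user-space work.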
 
Old 01-27-2011, 02:47 PM   #13
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
This is the standard wc -l
Code:
# time /usr/bin/wc -l /1.3GB.txt
32128160 /1.3GB.txt

real    0m49.020s
user    0m0.588s
sys     0m4.624s

I downloaded coreutils and compiled that wc. It turned out to be slightly faster than the one that came with Ubuntu 10.04 LTS. I don't know why; it may just be due to the difference in version.

coreutils wc.c: http://pastebin.com/Z91ZFKrD

Code:
# time /opt/coreutil/coreutils-8.9/src/wc -l /1.3GB.txt
32128160 /1.3GB.txt

real    0m44.842s
user    0m1.120s
sys     0m1.584s
I modified the buffer size:
Code:
# diff wc.c wclb.c
54c54
< #define BUFFER_SIZE (16 * 1024)
---
> #define BUFFER_SIZE (1024 * 1024)
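As a rough illustration of what this change means for the number of read(2) calls, the arithmetic can be written as a small C helper. It assumes every call returns a full buffer and takes "1.3GB" as 1,300,000,000 bytes, which is a guess since the exact file size is not given; reads_needed is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of read(2) calls needed for file_size bytes if every call
   returns a full buffer (an assumption; short reads need more calls).
   The final read that returns 0 at end of file is not counted. */
uint64_t reads_needed(uint64_t file_size, uint64_t buffer_size)
{
    assert(buffer_size > 0);
    return file_size / buffer_size + (file_size % buffer_size != 0);
}
```

Under those assumptions, reads_needed(1300000000, 16 * 1024) is 79346 and reads_needed(1300000000, 1024 * 1024) is 1240, so the larger buffer cuts the system calls by a factor of about 64.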
(your) lc compiled with -O3
Code:
# gcc lc.c -o lc -O3
# time ./lc /1.3GB.txt
32128160 /1.3GB.txt

real    0m47.802s
user    0m15.565s
sys     0m2.012s
# time ./lc /1.3GB.txt
32128160 /1.3GB.txt

real    0m45.938s
user    0m15.649s
sys     0m2.008s
(your) lc compiled with default options
Code:
# gcc lc.c -o lc
# time ./lc /1.3GB.txt
32128160 /1.3GB.txt

real    0m54.932s
user    0m19.857s
sys     0m1.864s
But you're using the library function 'getc', which is possibly a slow library function (according to that webpage). Of course, that may just be a problem with the implementation on that author's machine...

In the old days, when I was still writing fast functions in C (fast in comparison with native Clipper), I never used any library functions. afaik this was not possible. There were some functions meant for parameter passing and allocating memory. I then parsed these buffers in native C. I released these functions into the public domain, but this was before the Internet became popular. They were uploaded to my brother's BBS, which was part of fidonet. I couldn't find any of my sources on the Internet...

Don't you think it's worthwhile to change the source of coreutils wc.c instead of the other one, which uses getc? wc.c uses another library function (memchr). This isn't needed, is it? Or does it actually give a speed improvement?

Code:
      /* Use a separate loop when counting only lines or lines and bytes --
         but not chars or words.  */
      while ((bytes_read = safe_read (fd, buf, BUFFER_SIZE)) > 0)
        {
          char *p = buf;

          if (bytes_read == SAFE_READ_ERROR)
            {
              error (0, errno, "%s", file);
              ok = false;
              break;
            }

          while ((p = memchr (p, '\n', (buf + bytes_read) - p)))
            {
              ++p;
              ++lines;
            }
          bytes += bytes_read;
        }



PS I googled my name in combination with Clipper and did find these files (how funny)

http://members.fortunecity.com/userg...tml/summer.htm
Trudf.zip
Set of MS C source UDFs (w/OBJs) that total numeric elements of an array, test null expressions, pad character strings, SOUNDEX(), & strip non-alphanumeric characters from strings - by J van Melis

That's more than 20 years ago.
I wish I could get hold of that file....

Last edited by frater; 01-27-2011 at 06:13 PM.
 
Old 01-28-2011, 06:59 AM   #14
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1288
One idea is, use the size of the file 'stat -c %s' as long as the file has a constant number of characters per line ... other than that you cannot get any faster than wc -l.
 
Old 01-28-2011, 12:38 PM   #15
frater
Member
 
Registered: Jul 2008
Posts: 110

Original Poster
Rep: Reputation: 23
Quote:
Originally Posted by H_TeXMeX_H View Post
One idea is, use the size of the file 'stat -c %s' as long as the file has a constant number of characters per line ... other than that you cannot get any faster than wc -l.
You can't use that trick for /proc/net/ip_conntrack
It's only a pseudo file and 'stat -c %s' returns 0 (as I found out a while ago in another situation)
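
For regular files the size trick can still be combined with a fallback. The following is a minimal sketch, assuming every line is exactly line_len bytes including its newline; lines_from_size and count_lines_auto are hypothetical names. A reported size of 0, as with /proc pseudo files, makes it fall back to reading the file.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Derive the line count from the file size, assuming every line is
   exactly line_len bytes including its '\n'. Returns -1 if the size is
   unavailable, zero (as for /proc pseudo files), or not a multiple of
   line_len, in which case the caller should fall back to counting. */
long lines_from_size(const char *path, long line_len)
{
    struct stat st;

    assert(path != NULL && line_len > 0);
    if (stat(path, &st) != 0)
        return -1;
    if (st.st_size == 0 || st.st_size % line_len != 0)
        return -1;
    return (long)(st.st_size / line_len);
}

/* Try the size shortcut first, then fall back to reading the file
   and counting '\n' bytes. Returns -1 on error. */
long count_lines_auto(const char *path, long line_len)
{
    long lines = lines_from_size(path, line_len);
    FILE *fp;
    int c;

    if (lines >= 0)
        return lines;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    lines = 0;
    while ((c = getc(fp)) != EOF)
        if (c == '\n')
            lines++;
    if (ferror(fp))
        lines = -1;
    fclose(fp);
    return lines;
}
```

Since conntrack entries vary in length, this shortcut would not apply to that file even if stat reported a size.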

But I already made progress by modifying the buffersize of "wc.c" (didn't you see the results I posted?)

I'm currently in the process of obtaining my 25 year old sources in C.
These sources don't contain calls to library functions.
Hopefully things will return and I may even pick up programming in C again.

I still think/hope there's some room for improvement.
I will keep you posted (also if I don't succeed)

Cheers and thanks for all the input,

JP
 
  

