LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 07-19-2007, 12:28 AM   #16
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15

rupertwh, the bash script is a very straightforward solution to the problem.
However, when the data size gets very large, as with a large number of DNA sequences, the script is too slow for high-performance computing applications...
 
Old 07-19-2007, 01:48 AM   #17
rupertwh
Member
 
Registered: Sep 2006
Location: Munich, Germany
Distribution: Debian / Ubuntu
Posts: 297

Rep: Reputation: 49
Quote:
Originally Posted by fs11
the script is too slow for high performance computing applications...
I was going to answer that there wouldn't be a significant difference, the hard disk being the limiting factor.
And then I made a 'quick' test with 100,000 data records. The script only managed ~800 records/sec. Much less than I expected.
So, not expecting to be equally surprised by the performance of a well-written C version, I won't write about the difference not being significant...
 
Old 07-19-2007, 03:34 AM   #18
rupertwh
Member
 
Registered: Sep 2006
Location: Munich, Germany
Distribution: Debian / Ubuntu
Posts: 297

Rep: Reputation: 49
OK, so I'm bored... ;o)
Couldn't resist and wrote a short C app to do the same. It's about 300 times faster (~250,000 recs/sec).
I'm impressed. Didn't expect that much of a difference for what is more or less a straight (albeit selective) file copy.
 
Old 07-19-2007, 04:49 AM   #19
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I'm not an expert in genetics. ( understatement of the year! )

From your description:
Code:
for example...

>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg


and rank file contains

seqA :0.5
seqB :1.5
seqc :2.5


Then the file that gets the seq A, must get this :
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
How many sequences do you have in the input file? In your example, would there be other sequences with a score of 0.5, and would they be written to the same file? Your example uses floating point numbers with one significant digit. Is that typical, or a simplification? The reason I'm asking is that if the rank has the form [[:digit:]]\.[[:digit:]], and the fractional digit is always a zero or a five, then that is a small set of possible values which can be selected using a regular expression; or, multiplying by ten first, you could use a case statement instead of a long string of if/else floating point comparisons.
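That multiply-by-ten idea could be sketched in C like this (hypothetical: `rank_to_bucket` and `file_for_rank` are made-up names, and the rank-to-file mapping is only an assumption for illustration):

```c
/* Turn a rank such as 0.5 or 1.5 into a small integer so a switch/case
 * can choose the output file, instead of a chain of floating-point
 * if/else comparisons. */
static int rank_to_bucket(double rank)
{
    /* +0.5 rounds to nearest for the non-negative ranks assumed here */
    return (int)(rank * 10.0 + 0.5);
}

static int file_for_rank(double rank)
{
    switch (rank_to_bucket(rank)) {
    case 5:  return 0;   /* rank 0.5 -> fileA */
    case 15: return 1;   /* rank 1.5 -> fileB */
    case 25: return 2;   /* rank 2.5 -> fileC */
    default: return -1;  /* rank not in the table */
    }
}
```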

Are there always a fixed number of lines for each sequence?
You have three lines in your example.

You can use a form like this in sed:
Code:
#n
/^>seq A/,+3w fileA
/^>seq B/,+3w fileB
/^>seq C/,+3w fileC
...
/^>seq Q/,+3w fileA
This will extract segments and write them to separate files. You might use the program that reads the rank file and sorts them to create the sed commands (in a sed program, let's say seqsplit.sed) and then run
sed -f seqsplit.sed data_file. I made an assumption that a different sequence such as seq Q might have the same rank
seqA :0.5
seqB :1.5
seqc :2.5
...
seqq :0.5

and so ends up in the same file.

Anyway, piping the data through a sed or awk filter will be faster than a script that reads a file line by line. The difference in speed might be enough.
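Generating the sed program from the rank file could be sketched in C along these lines (hypothetical: `emit_sed_program` and the `file_%02d` naming are made up; I'm assuming the whole-number part of the rank picks the file, purely for illustration):

```c
#include <stdio.h>

/* Read "name :rank" lines from the rank file and print one sed command
 * per sequence; records whose ranks map to the same bucket land in the
 * same file, as with seq A and seq Q above. */
static int emit_sed_program(FILE *ranks, FILE *out)
{
    char name[256];
    double rank;

    /* " :" in the format skips whitespace and matches the colon */
    while (fscanf(ranks, "%255s :%lf", name, &rank) == 2)
        fprintf(out, "/^>%s/,+3 w file_%02d\n", name, (int)rank);

    return 0;
}
```

The output of this would then be run with `sed -f` against the data file, as described above.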

By the way, you said you didn't know C++, but the code sample you posted was C++.

Here was a test run using your sample:
Code:
jschiwal@hpamd64:~/tmpdir> cat datafile
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq Q
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat selseq.sed
#n
/^>seq A/,+3 w fileA
/^>seq B/,+3 w fileB
/^>seq C/,+3 w fileC
/^>seq Q/,+3 w fileA

jschiwal@hpamd64:~/tmpdir> sed -f selseq.sed datafile
jschiwal@hpamd64:~/tmpdir> cat fileA
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq Q
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat fileB
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat fileC
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg

Last edited by jschiwal; 07-19-2007 at 05:06 AM.
 
Old 07-19-2007, 08:22 AM   #20
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Just need to take out a few lines as follows:
Code:
#include <stdio.h>
#include <string.h>

/* this is the function to parse the input data */
/* (determine_file, value, and send_file_output are defined elsewhere) */
static int parse_input_file(const char *fFile)
{
    /* open the file provided */
    FILE *input_file = fopen(fFile, "r");
    if (!input_file) return -1;

    /* this is to hold input data from the file */
    char current_line[1024];

    /* indicator of the current file */
    int current_file = -1;

    /* this loop keeps reading lines from the file until it reaches the end */
    while (fgets(current_line, 1024, input_file))
    {
        /* if there are less than 3 characters, the line is ignored ('\n' + '\0' = 2) */
        if (strlen(current_line) < 3) continue;

        /* to allow comparison to the list, remove the newline */
        current_line[strlen(current_line) - 1] = 0;

        /* if the line denotes a label, find the correct file and change it */
        if (current_line[0] == '>') current_file = determine_file(value(current_line + 1));

        /*else*/
        /* otherwise, send the output to the file and replace the newline */
        /* {*/
        send_file_output(current_line, strlen(current_line), current_file);
        send_file_output("\n", 1, current_file);
        /* }*/
    }

    fclose(input_file);
    return 0;
}
ta0kira
 
Old 07-19-2007, 08:37 AM   #21
kaz2100
Senior Member
 
Registered: Apr 2005
Location: Penguin land, with apple, no gates
Distribution: SlackWare > Debian testing woody(32) sarge etch lenny squeeze(+64) wheezy .. bullseye bookworm
Posts: 1,833

Rep: Reputation: 108
Quote:
Originally Posted by fs11
rupertwh, the bash script is a very straightforward solution to the problem.
However, when the data size gets very large, as in the case of large number of DNA sequences, the script is too slow for high performance computing applications...
If this is the case, locating the target record in your huge database will be the rate-limiting step. What you need is NOT a C program (NOT C++ either; C++ is slower than C): you need to do something different.
Many scientists in various fields have run into the same problem and already have solutions.

Happy Penguins!
 
Old 07-19-2007, 09:33 AM   #22
rupertwh
Member
 
Registered: Sep 2006
Location: Munich, Germany
Distribution: Debian / Ubuntu
Posts: 297

Rep: Reputation: 49
Quote:
Originally Posted by kaz2100
If this is the case, locating the target record in your huge database will be a rate limiting step.
Actually, no. There is no need to locate a record in a huge database (see my script for an example).

I'm not sure why the bash script is so extremely slow, though. My guess is that either bash is *really* bad at string operations or the problem is that output files are closed/reopened for every record. Or both.
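The close/reopen suspicion could be sketched in C like this (hypothetical: `open_outputs`, the file names, and the count of ten output files are assumptions for illustration; the point is that `>> file` in a shell loop reopens the file on every record, while the handles below are opened once and reused):

```c
#include <stdio.h>

#define NFILES 10

static FILE *out_files[NFILES];

/* Open all output files once up front; each record is then written
 * through an already-open handle instead of an open/append/close
 * cycle per record. */
static int open_outputs(const char *base)
{
    char name[64];

    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "%s%02d", base, i + 1);
        if (!(out_files[i] = fopen(name, "w")))
            return -1;
    }
    return 0;
}

static void close_outputs(void)
{
    for (int i = 0; i < NFILES; i++)
        if (out_files[i]) {
            fclose(out_files[i]);
            out_files[i] = NULL;
        }
}
```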

I was going to start learning Python this week -- I think I found a nice problem here to play around with...
 
Old 07-19-2007, 11:17 AM   #23
kaz2100
Senior Member
 
Registered: Apr 2005
Location: Penguin land, with apple, no gates
Distribution: SlackWare > Debian testing woody(32) sarge etch lenny squeeze(+64) wheezy .. bullseye bookworm
Posts: 1,833

Rep: Reputation: 108Reputation: 108
Quote:
Originally Posted by rupertwh
Actually, no. There is no need to locate a record in a huge database (see my script for an example).
I am not good at reading bash scripts, however..
You sequentially scan the huge database file, if I understand your script correctly. Then do something... I have seen one guy use "grep" directly, and another guy use a "perl" script to do the same thing with a similar programming concept; both choked (some two-digit hours of run time, using a cluster.) Anyway, I will not argue any more.

It is good for "nice problem here to play around with...", I agree.

Happy Penguins!

Last edited by kaz2100; 07-19-2007 at 11:19 AM.
 
Old 07-19-2007, 03:54 PM   #24
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Thanks for all this enthusiasm....
I will answer each of the queries one by one...
@jschiwal
Yes, there can be other sequences with the same score of 0.5. However, the increments in the ranks are not always 0.5. I mean, there might be ranks like 1.004 or something... I hope this answers your question. If the fractional change in the ranks were the same each time, the problem would have been much simpler, as you stated.
About C++, I would say that I try to work my way through, but I am not good at it... I have worked with Java for many years and didn't touch C/C++, but the applications that I am working on now require me to work with C/C++


Quote:
Originally by kaz2100
If this is the case, locating the target record in your huge database will be a rate limiting step. What you need is NOT C script, (NOT C++ either, C++ is slower than C), you need to do something different.
What do you suggest should be done... any insight would be highly appreciated...

@ta0kira
Thank you once again..... It wouldn't have been possible without your help...



Quote:
Originally by @rupertwh
ok, so I'm bored... ;o)
Couldn't resist and wrote a short C app to do the same. It's about 300 times faster (~250000 recs/sec).
I'm impressed. Didn't expect that much of a difference for more or less a straight (albeit selective) file copy.
Yeah, that is what I originally expected when I worked with the bash scripts initially, but the difference seems to be huge... probably you are right about the input/output behavior of the bash scripts...
 
Old 07-19-2007, 05:22 PM   #25
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by kaz2100
...C++ is slower than C...
Not necessarily true. std::list using validated iterators and no optimization vs. your own linked list: PROBABLY. The only things C++ really adds are member/virtual functions, constructors, and destructors (aside from compile-time semantics such as function overloading, templates, and namespaces.) Any floating point or *, /, or % operation will nullify any performance loss caused by the additional dereferencing of a vtable for a virtual function. Even the best C programmers are no match for a compiler when turning a C++ class into fast C code. If you need virtual functions, constructors, and destructors, you would be wasting your time trying that in C. If you don't, C might be the way to go. The STL is a whole different story, but it can be optimized to be very fast.

Creating a binary program is just the first step, which tends to be the greatest leap in performance for me. The difference between the speeds of C and C++ is trivial at most with gcc -O2 or greater when going about things the same way, and nearly all C can be put into a C++ program. The algorithm will be the limiting factor in most cases. Shell scripting is not suitable for a speed-optimizing algorithm. That much is obvious. Code that works is a start, then analysis of goal resistance vs. the code needs to take place.

Tabulating and indexing code is a lot easier to write and maintain in C++. You can modify it more easily and the concentration shifts from the many modifications C code would require to what you are really trying to do. C is no place to design a program of this sort with no definite algorithm. If you intend to redesign the algorithm, I'd recommend C++, then convert as much of it to C or asm as possible if it still isn't fast enough. If the current algorithm works for you, you needn't look any further.
ta0kira
 
Old 07-19-2007, 09:01 PM   #26
rupertwh
Member
 
Registered: Sep 2006
Location: Munich, Germany
Distribution: Debian / Ubuntu
Posts: 297

Rep: Reputation: 49
oh cool, let's have a C vs C++ flamewar.

Anyway, I did some more poking at the bash script and must say: bash is just really slow, no matter what.

But Python turned out to be pretty awesome. Still slower than C, of course, but way ahead of bash.

(this is my very first Python script, so Python hackers be gentle...)
Code:
#!/usr/bin/python

import math

datafname="data.txt"
rankfname="rank.txt"
outfbase="outfile_"

outmap = {}
outfiles = []
for i in (1,2,3,4,5,6,7,8,9,10) :
        outfiles.append(open(outfbase + '%02d' % i, 'w'))

rankfile = open(rankfname, 'r')
for line in rankfile.xreadlines() :
        tag = line.split(':')[0].strip()
        rank = float(line.split(':')[1].strip())

        rankno = int(math.ceil(rank * 10))
        if rankno == 0 :
                rankno = 1

        if rankno < 0 or rankno > 10 :
                print "Invalid rank: " + line
                continue

        outmap[tag] = outfiles[rankno-1]

rankfile.close()

tag=""
datafile = open(datafname, 'r')
for line in datafile.xreadlines() :
        if line[0] == '>' :
                tag = line[1:].strip()
                outfile = outmap.get(tag,"")
                if not outfile :
                        print "Unranked seq: " + tag
                        tag=""
                        continue

        if len(tag) > 0 :
                outfile.write(line)

datafile.close()
for f in outfiles :
        f.close()
So the score is currently (on my machine, obviously):
C: ~400,000 recs/s
Python: ~70,000 recs/s
Bash: ~1,000 recs/s

BTW: You can get some 15% speed increase on the C version by using a lookup table instead of a linked list.
 
Old 07-20-2007, 07:08 PM   #27
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by rupertwh
oh cool, let's have a C vs C++ flamewar.
Let's have a let's-have-a-flamewar vs. let's-not-have-a-flamewar flamewar
Quote:
Originally Posted by rupertwh
BTW: You can get some 15% speed increase on the C version by using a lookup table instead of a linked list.
As in a contiguous array? Yes, that would be much faster. Faster still would be numerical section labels, so that the comparison would be register vs. register rather than string vs. string. Of course, you would pay for that up front by converting all of the labels.
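The numerical-label conversion could be sketched in C like this (hypothetical: `label_key` is a made-up name, and labels longer than 8 bytes would still need a string tie-break, which is omitted here):

```c
#include <stdint.h>
#include <string.h>

/* Pack the first bytes of a label into one integer up front, so later
 * equality checks are a single integer compare (register vs. register)
 * rather than strcmp. */
static uint64_t label_key(const char *label)
{
    uint64_t key = 0;
    size_t n = strlen(label);

    if (n > sizeof key)
        n = sizeof key;
    /* byte order doesn't matter as long as every key is built this way */
    memcpy(&key, label, n);
    return key;
}
```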

Generally I like to process everything using references and then extract data from the end result. That speeds things up quite a bit because you don't have to move data multiple times. What I mean by this is figure out where things go first using pointers then copy once you've figured it out. Can't really do that here, though.

Here is another way to optimize for speed, along the lines of contiguous arrays. Not intending to provoke anything, I can objectively say that arrays of structures are much easier to handle with C++ than C if only for the fact that you can increment pointers without having to do sizeof. So back to the story. What you can do is create a 257^x matrix of labels, where x will be the number of dimensions. Each dimension will represent a different character in the label, with the last being reserved for the remaining matches.
Code:
#include <cstring>
#include <string>
#include <vector>


struct tabled_value
{
	tabled_value() : value() {}

	tabled_value(const std::string lLabel, double vValue) :
	label(lLabel), value(vValue) {}

	inline bool operator == (const char *lLabel) const
	{ return !strcmp(&this->label[0], lLabel); }

	std::string label;
	double      value;
};


class multi_table3
{
public:
	void add_value(const std::string lLabel, double vValue)
	{
	unsigned int X = 256, Y = 256, Z = 256, size = lLabel.size();

	if (size > 0) X = lLabel[0];
	if (size > 1) Y = lLabel[1];
	if (size > 2) Z = lLabel[2];

	master_table[X][Y][Z].insert(master_table[X][Y][Z].begin(), tabled_value());

	if (size > 3)
	master_table[X][Y][Z][0].label = lLabel.substr(3, size - 3);

	master_table[X][Y][Z][0].value = vValue;
	}

	double retrieve_value(const char *lLabel) const
	{
	unsigned int X = 256, Y = 256, Z = 256, size = strlen(lLabel);

	const char empty[] = "";
	const char *compare = empty;

	if (size > 0) X = lLabel[0];
	if (size > 1) Y = lLabel[1];
	if (size > 2) Z = lLabel[2];
	if (size > 3) compare = lLabel + 3;

	std::vector <tabled_value> ::const_iterator begin = master_table[X][Y][Z].begin(),
	                                            end   = master_table[X][Y][Z].end();

	while (begin < end)
	if (*begin == compare) return begin->value;
	else ++begin;

	return 0.0;
	}

private:
	std::vector <tabled_value> master_table[257][257][257];
};
Using the structures above (I'll spare you the entire test) vs. a normal vector (discounting allocation time) and a rank table of 50 labels, I get about 410,200 records per second with the 1d structure vs. 615,200 with the 3d. That's a 50% performance increase. The following caveats apply, however:
  • record size = 80B
  • labels are the numbers 1 - 10 spelled out, then repeated with 1 - 4 suffixed
  • allocation time is the limiting factor for smaller data sets, but can be compiler-optimized to a trivial level with huge sets (tested with 726,000 records)

The test code is long, but if you REALLY want to see it I will provide it. For the most part it's just a retrofit of the previous code I provided.
ta0kira

I can't believe I've started using small yellow circles embedded with poor representations of facial expressions to imply light-heartedness and scolding. I need a break from the internet...
 
Old 07-20-2007, 08:01 PM   #28
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Update: with -O1 and above I get .27s allocation time vs. 1.37s with -O0. Strangely enough, I get better improvement with 3d vs. 1d with -O0, giving me 1d @ 195,200 r/s vs. 3d @ 412,500 (111% increase in speed.)

I also tested vs. simulated contiguous arrays (effectively a C array) which gave me 283,600 r/s at -O0 (agreeably much better than 1d vector) and 376,200 at -O9 (56% improvement, 75% improvement.)

Obviously an informal test, but the approximate rates are repeatable. Not trying to start a competition, just so you know. I just like coming up with different ways to do things.
ta0kira

edit: looks like there needs to be > .5 million records involved and at least -O1 to make it worth it...

edit2: making the array of vectors 2d instead of 3d is MUCH better. ~ 30-125% improvement.

Last edited by ta0kira; 07-21-2007 at 11:16 AM.
 
Old 07-28-2007, 07:27 PM   #29
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Hi ta0kira

I wanted the code for values between 0.0 and 1.0, but originally the code was written for values between 1 and 10. How can I fix this? Help please
 
Old 07-28-2007, 08:06 PM   #30
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
This will be very simple.
  • change the data type of the value variables in both rank_spec and parse_rank_file to double
  • change strtol to strtod and remove the last argument, which previously denoted the number base
  • make the printf calls use %f instead of %i to show the rank values
  • change the return type of the value function to double
  • change the argument type of determine_file to double and rewrite it as appropriate
Really all you are doing here is allowing for decimals where integers were only allowed previously.
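The strtol-to-strtod step from the list above could look like this (a sketch; `parse_rank` is a made-up name standing in for the value function):

```c
#include <stdlib.h>

/* Parse the rank as a double instead of a long.  strtod takes no
 * number-base argument, which is why the last argument to strtol is
 * simply dropped. */
static double parse_rank(const char *text)
{
    return strtod(text, NULL);   /* was: strtol(text, NULL, 10) */
}
```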
ta0kira
 
  

