rupertwh, the bash script is a very straightforward solution to the problem.
However, when the data size gets very large, as in the case of a large number of DNA sequences, the script is too slow for high-performance computing applications...
Quote:
the script is too slow for high-performance computing applications...
I was going to answer that there wouldn't be a significant difference, the hard disk being the limiting factor.
And then I made a 'quick' test with 100,000 data records. The script only managed ~800 records/sec. Much less than I expected.
So, rather than risk being equally surprised by the performance of a well-written C version, I won't claim that the difference isn't significant...
ok, so I'm bored... ;o)
Couldn't resist and wrote a short C app to do the same. It's about 300 times faster (~250000 recs/sec).
I'm impressed. Didn't expect that much of a difference for more or less a straight (albeit selective) file copy.
I'm not an expert in genetics. (Understatement of the year!)
From your description:
Code:
for example...
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
and rank file contains
seqA :0.5
seqB :1.5
seqc :2.5
Then the file that gets the seq A, must get this :
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
How many sequences do you have in the input file? In your example, would there be other sequences with a score of 0.5, and would they be written to the same file? Your example uses floating-point numbers with one significant digit; is that typical, or a simplification? The reason I'm asking is that if the rank has the form [[:digit:]]\.[[:digit:]], and the fractional digit is always a zero or a five, then that is a small set of possible values which can be selected using a regular expression; or, multiplying by ten first, you could use a case statement instead of a long chain of if/else floating-point comparisons.
Are there always a fixed number of lines for each sequence?
You have three lines in your example.
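If the ranks really are that limited, the multiply-by-ten idea could be sketched in C rather than bash (a hypothetical `file_for_rank` helper; the three ranks and their file indices come from the example above, everything else is my assumption):

```c
#include <stdlib.h>

/* Map a rank string like "0.5" to an output-file index.
 * Multiplying by ten turns the one-fractional-digit ranks
 * into small integers, so a switch/case replaces a chain of
 * floating-point if/else comparisons. Hypothetical sketch. */
static int file_for_rank(const char *rank_text)
{
    switch ((int)(strtod(rank_text, NULL) * 10.0 + 0.5)) {
        case 5:  return 0;  /* rank 0.5 -> first output file  */
        case 15: return 1;  /* rank 1.5 -> second output file */
        case 25: return 2;  /* rank 2.5 -> third output file  */
        default: return -1; /* rank outside the expected set  */
    }
}
```

The `+ 0.5` guards against ranks like 0.5 parsing as 0.4999... and truncating to the wrong integer.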
This will extract segments and write them to separate files. You might use the program that reads the rank file and sorts the ranks to create the sed commands (save them in a sed program, let's say seqsplit.sed) and then run
sed -f seqsplit.sed data_file. I made the assumption that a different sequence, such as seq Q, might have the same rank
seqA :0.5
seqB :1.5
seqc :2.5
...
seqq :0.5
and so ends up in the same file.
Anyway, piping the data through a sed or awk filter will be faster than a script that reads a file line by line. The difference in speed might be enough.
By the way, you said you didn't know C++, but the code sample you posted was C++.
Here was a test run using your sample:
Code:
jschiwal@hpamd64:~/tmpdir> cat datafile
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq Q
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat selseq.sed
#n
/^>seq A/,+3 w fileA
/^>seq B/,+3 w fileB
/^>seq C/,+3 w fileC
/^>seq Q/,+3 w fileA
jschiwal@hpamd64:~/tmpdir> sed -f selseq.sed datafile
jschiwal@hpamd64:~/tmpdir> cat fileA
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq Q
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat fileB
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat fileC
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
Code:
#include <stdio.h>
#include <string.h>

/* this is the function to parse the input data;
 * value(), determine_file() and send_file_output() are
 * assumed to be defined elsewhere in the program */
static int parse_input_file(const char *fFile)
{
    /* open the file provided */
    FILE *input_file = fopen(fFile, "r");
    if (!input_file) return -1;

    /* this is to hold input data from the file */
    char current_line[1024];

    /* indicator of the current file */
    int current_file = -1;

    /* this loop keeps reading lines from the file until it reaches the end */
    while (fgets(current_line, 1024, input_file))
    {
        /* if there are fewer than 3 characters, the line is ignored ('\n' + '\0' = 2) */
        if (strlen(current_line) < 3) continue;

        /* to allow comparison to the list, remove the newline */
        current_line[strlen(current_line) - 1] = 0;

        /* if the line denotes a label, find the correct file and change it */
        if (current_line[0] == '>') current_file = determine_file(value(current_line + 1));

        /* send the line to the current file and restore the newline
         * (label lines are written, too) */
        send_file_output(current_line, strlen(current_line), current_file);
        send_file_output("\n", 1, current_file);
    }
    fclose(input_file);
    return 0;
}
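The helper functions referenced above (`value`, `determine_file`, `send_file_output`) are not shown in the post. A minimal sketch of what they might look like; the names come from the code, but the bodies below are entirely my guess (a tiny fixed label table standing in for the parsed rank file, and ranks bucketed by whole units):

```c
#include <stdio.h>
#include <string.h>

#define NUM_FILES 10

/* Assumed to be opened once at startup and closed at exit. */
static FILE *output_files[NUM_FILES];

/* Look up the rank associated with a sequence label.
 * A tiny fixed table stands in for the parsed rank file. */
static double value(const char *label)
{
    static const char  *labels[] = { "seq A", "seq B", "seq C" };
    static const double ranks[]  = { 0.5, 1.5, 2.5 };
    for (size_t i = 0; i < sizeof(labels) / sizeof(labels[0]); i++)
        if (strcmp(label, labels[i]) == 0) return ranks[i];
    return -1.0;   /* unknown label */
}

/* Map a rank to an output-file index (bucketed by whole units here). */
static int determine_file(double rank)
{
    int index = (int)rank;
    return (index >= 0 && index < NUM_FILES) ? index : -1;
}

/* Write a chunk of data to the selected output file; a no-op for
 * an invalid index or an unopened file. */
static void send_file_output(const char *data, size_t length, int file)
{
    if (file >= 0 && file < NUM_FILES && output_files[file])
        fwrite(data, 1, length, output_files[file]);
}
```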
Quote:
rupertwh, the bash script is a very straightforward solution to the problem.
However, when the data size gets very large, as in the case of a large number of DNA sequences, the script is too slow for high-performance computing applications...
If this is the case, locating the target record in your huge database will be the rate-limiting step. What you need is NOT a C program (NOT C++ either; C++ is slower than C); you need to do something different.
Many scientists in various fields have run into the same problem and already have solutions.
Quote:
Originally Posted by kaz2100
If this is the case, locating the target record in your huge database will be the rate-limiting step.
Actually, no. There is no need to locate a record in a huge database (see my script for an example).
I'm not sure why the bash script is so extremely slow, though. My guess is that either bash is *really* bad at string operations or the problem is that output files are closed/reopened for every record. Or both.
I was going to start learning Python this week -- I think I found a nice problem here to play around with...
Quote:
Originally Posted by rupertwh
Actually, no. There is no need to locate a record in a huge database (see my script for an example).
I am not good at reading bash scripts, however...
You sequentially scan the huge database file, if I understand your script correctly, and then do something with each record... I have seen one person use "grep" directly, and another use a "perl" script to do the same thing with a similar programming concept; both choked (double-digit hours of run time, on a cluster). Anyway, I will not argue any more.
It is good as a "nice problem to play around with", I agree.
Thanks for all this enthusiasm...
I will answer each of your queries one by one...
@jschiwal
Yes, there can be other sequences with the same score of 0.5. However, the increments of the ranks are not always 0.5; I mean, there might be ranks like 1.004 or something... I hope this answers your question. If the fractional change in the ranks were the same each time, the problem would have been much simpler, as you stated.
About C++, I would say that I try to work my way through it, but I am not good at it... I have worked with Java for many years and didn't touch C/C++, but the applications I am working on now require me to interact with C/C++.
Quote:
Originally Posted by kaz2100
If this is the case, locating the target record in your huge database will be the rate-limiting step. What you need is NOT a C program (NOT C++ either; C++ is slower than C); you need to do something different.
What do you suggest should be done? Any insight would be highly appreciated...
@ta0kira
Thank you once again... It wouldn't have been possible without your help...
Quote:
Originally Posted by rupertwh
ok, so I'm bored... ;o)
Couldn't resist and wrote a short C app to do the same. It's about 300 times faster (~250000 recs/sec).
I'm impressed. Didn't expect that much of a difference for more or less a straight (albeit selective) file copy.
Yeah, that is what I originally expected when I worked with bash scripts initially, but the difference turns out to be huge... You are probably right about the input/output behavior of bash scripts...
Not necessarily true. std::list using validated iterators and no optimization vs. your own linked list: PROBABLY. The only thing C++ really adds is member/virtual functions, constructors, and destructors (aside from compile-time semantics such as function overloading, templates, and namespaces). Any floating-point or *, /, or % operation will nullify any performance loss caused by the additional dereferencing of a vtable for a virtual function. Even the best C programmers are no match for a compiler when turning a C++ class into fast C code. If you need virtual functions, constructors, and destructors, you would be wasting your time trying that in C. If you don't, C might be the way to go. The STL is a whole different story, but it can be optimized to be very fast.
Creating a binary program is just the first step, which tends to be the greatest leap in performance for me. The difference between the speeds of C and C++ is trivial at most with gcc -O2 or greater when going about things the same way, and nearly all C can be put into a C++ program. The algorithm will be the limiting factor in most cases. Shell scripting is not suitable for a speed-optimizing algorithm. That much is obvious. Code that works is a start, then analysis of goal resistance vs. the code needs to take place.
Tabulating and indexing code is a lot easier to write and maintain in C++. You can modify it more easily and the concentration shifts from the many modifications C code would require to what you are really trying to do. C is no place to design a program of this sort with no definite algorithm. If you intend to redesign the algorithm, I'd recommend C++, then convert as much of it to C or asm as possible if it still isn't fast enough. If the current algorithm works for you, you needn't look any further.
ta0kira
Anyways, did some more poking at the bash script and must say: bash is just really slow, no matter what.
But Python turned out to be pretty awesome. Still slower than C, of course, but way ahead of bash.
(this is my very first Python script, so Python hackers be gentle...)
Code:
#!/usr/bin/python
import math

datafname = "data.txt"
rankfname = "rank.txt"
outfbase  = "outfile_"

outmap   = {}
outfiles = []
for i in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
    outfiles.append(open(outfbase + '%02d' % i, 'w'))

rankfile = open(rankfname, 'r')
for line in rankfile.xreadlines():
    tag  = line.split(':')[0].strip()
    rank = float(line.split(':')[1].strip())
    rankno = int(math.ceil(rank * 10))
    if rankno == 0:
        rankno = 1
    if rankno < 0 or rankno > 10:
        print "Invalid rank: " + line
        continue
    outmap[tag] = outfiles[rankno - 1]
rankfile.close()

tag = ""
datafile = open(datafname, 'r')
for line in datafile.xreadlines():
    if line[0] == '>':
        tag = line[1:].strip()
        outfile = outmap.get(tag, "")
        if not outfile:
            print "Unranked seq: " + tag
            tag = ""
            continue
    if len(tag) > 0:
        outfile.write(line)
datafile.close()

for f in outfiles:
    f.close()
So the score is currently (on my machine, obviously):
C: ~400,000 recs/s
Python: ~70,000 recs/s
Bash: ~1,000 recs/s
BTW: You can get some 15% speed increase on the C version by using a lookup table instead of a linked list.
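The lookup-table change described above might look roughly like this: instead of scanning one flat linked list for every label, index an array by the label's first byte so each lookup only walks a short chain. This is a sketch, not the actual code; the `list_node` type and its field names are invented for illustration:

```c
#include <stddef.h>
#include <string.h>

/* One entry of a label -> output-file-index mapping. */
struct list_node {
    const char       *label;
    int               file_index;
    struct list_node *next;
};

/* Linked-list version: O(n) scan over every entry per lookup. */
static int lookup_list(const struct list_node *head, const char *label)
{
    for (; head; head = head->next)
        if (strcmp(head->label, label) == 0)
            return head->file_index;
    return -1;   /* label not found */
}

/* Lookup-table version: one slot per possible first byte, so a
 * lookup only scans the entries sharing that first character. */
static struct list_node *table[256];

static void table_insert(struct list_node *node)
{
    unsigned char slot = (unsigned char)node->label[0];
    node->next  = table[slot];
    table[slot] = node;
}

static int lookup_table(const char *label)
{
    return lookup_list(table[(unsigned char)label[0]], label);
}
```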
Let's have a let's-have-a-flamewar vs. let's-not-have-a-flamewar flamewar
Quote:
Originally Posted by rupertwh
BTW: You can get some 15% speed increase on the C version by using a lookup table instead of a linked list.
As in a contiguous array? Yes, that would be much faster. Faster still would be numerical section labels, so that the comparison would be register vs. register rather than string vs. string. Of course, you would pay for that up front by converting all of the labels.
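The register-vs-register comparison could be achieved by converting each label to an integer key once, up front. A sketch (packing the first eight bytes into a `uint64_t` is my own choice, not something from the thread):

```c
#include <stdint.h>
#include <string.h>

/* Pack up to the first 8 bytes of a label into one integer.
 * Short tags like "seq A" fit entirely, so two labels compare
 * with a single integer comparison instead of a strcmp loop.
 * Labels longer than 8 bytes would still need a tie-breaking
 * string comparison when their packed prefixes collide. */
static uint64_t label_key(const char *label)
{
    uint64_t key = 0;
    size_t length = strlen(label);
    if (length > sizeof key) length = sizeof key;
    memcpy(&key, label, length);
    return key;
}
```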
Generally I like to process everything using references and then extract data from the end result. That speeds things up quite a bit because you don't have to move data multiple times. What I mean by this is figure out where things go first using pointers then copy once you've figured it out. Can't really do that here, though.
Here is another way to optimize for speed, along the lines of contiguous arrays. Not intending to provoke anything, I can objectively say that arrays of structures are much easier to handle with C++ than C if only for the fact that you can increment pointers without having to do sizeof. So back to the story. What you can do is create a 257^x matrix of labels, where x will be the number of dimensions. Each dimension will represent a different character in the label, with the last being reserved for the remaining matches.
Code:
#include <cstring>
#include <string>
#include <vector>

struct tabled_value
{
    tabled_value() : value() {}

    tabled_value(const std::string lLabel, double vValue) :
        label(lLabel), value(vValue) {}

    inline bool operator == (const char *lLabel) const
    { return !strcmp(this->label.c_str(), lLabel); }

    std::string label;
    double      value;
};

class multi_table3
{
public:
    void add_value(const std::string lLabel, double vValue)
    {
        unsigned int X = 256, Y = 256, Z = 256, size = lLabel.size();
        if (size > 0) X = (unsigned char) lLabel[0];
        if (size > 1) Y = (unsigned char) lLabel[1];
        if (size > 2) Z = (unsigned char) lLabel[2];
        master_table[X][Y][Z].insert(master_table[X][Y][Z].begin(), tabled_value());
        if (size > 3)
            master_table[X][Y][Z][0].label = lLabel.substr(3, size - 3);
        master_table[X][Y][Z][0].value = vValue;
    }

    double retrieve_value(const char *lLabel) const
    {
        unsigned int X = 256, Y = 256, Z = 256, size = strlen(lLabel);
        const char empty[] = "";
        const char *compare = empty;
        if (size > 0) X = (unsigned char) lLabel[0];
        if (size > 1) Y = (unsigned char) lLabel[1];
        if (size > 2) Z = (unsigned char) lLabel[2];
        if (size > 3) compare = lLabel + 3;
        std::vector <tabled_value>::const_iterator begin = master_table[X][Y][Z].begin(),
                                                   end   = master_table[X][Y][Z].end();
        while (begin < end)
            if (*begin == compare) return begin->value;
            else ++begin;
        return 0.0;
    }

private:
    std::vector <tabled_value> master_table[257][257][257];
};
Using the structures above (I'll spare you the entire test) vs. a normal vector (discounting allocation time) and a rank table of 50 labels, I get about 410,200 records per second with the 1d structure vs. 615,200 with the 3d. That's a 50% performance increase. The following caveats apply, however:
record size = 80B
labels are the numbers 1 - 10 spelled out, then repeated with 1 - 4 suffixed
allocation time is the limiting factor for smaller data sets, but can be compiler-optimized to a trivial level with huge sets (tested with 726,000 records)
The test code is long, but if you REALLY want to see it I will provide it. For the most part it's just a retrofit of the previous code I provided.
ta0kira
I can't believe I've started using small yellow circles embedded with poor representations of facial expressions to imply light-heartedness and scolding. I need a break from the internet...
Update: with -O1 and above I get .27s allocation time vs. 1.37s with -O0. Strangely enough, I get better improvement with 3d vs. 1d with -O0, giving me 1d @ 195,200 r/s vs. 3d @ 412,500 (111% increase in speed.)
I also tested vs. simulated contiguous arrays (effectively a C array) which gave me 283,600 r/s at -O0 (agreeably much better than 1d vector) and 376,200 at -O9 (56% improvement, 75% improvement.)
Obviously an informal test, but the approximate rates are repeatable. Not trying to start a competition, just so you know. I just like coming up with different ways to do things.
ta0kira
edit: looks like there needs to be > .5 million records involved and at least -O1 to make it worth it...
edit2: making the array of vectors 2d instead of 3d is MUCH better. ~ 30-125% improvement.