rupertwh, the bash script is a very straightforward solution to the problem.
However, when the data size gets very large, as in the case of a large number of DNA sequences, the script is too slow for high-performance computing applications...
Quote:
the script is too slow for high-performance computing applications...
I was going to answer that there wouldn't be a significant difference, the hard disk being the limiting factor.
And then I made a 'quick' test with 100,000 data records. The script only managed ~800 records/sec. Much less than I expected.
So, rather than risk being equally surprised by the performance of a well-written C version, I won't claim that the difference isn't significant...
ok, so I'm bored... ;o)
Couldn't resist and wrote a short C app to do the same. It's about 300 times faster (~250000 recs/sec).
I'm impressed. Didn't expect that much of a difference for more or less a straight (albeit selective) file copy.
I'm not an expert in genetics. (Understatement of the year!)
From your description:
Code:
for example...
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
and rank file contains
seqA :0.5
seqB :1.5
seqc :2.5
Then the file that gets the seq A, must get this :
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
How many sequences do you have in the input file? In your example, would there be other sequences with a score of 0.5, and would they be written to the same file? Your example uses floating-point numbers with one significant digit; is that typical, or a simplification? The reason I'm asking is that if the rank has the form [[:digit:]]\.[[:digit:]], and the fractional digit is always a zero or a five, then that is a small set of possible values which can be selected using a regular expression; or, multiplying by ten first, you could use a case statement instead of a long chain of if/else floating-point comparisons.
Are there always a fixed number of lines for each sequence?
You have three lines in your example.
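If the ranks really are that limited, the multiply-by-ten idea could be sketched in C rather than bash (a hypothetical `file_for_rank` helper; the three ranks and their file indices come from the example above, everything else is my assumption):

```c
#include <stdlib.h>

/* Map a rank string like "0.5" to an output-file index.
 * Multiplying by ten turns the one-fractional-digit ranks
 * into small integers, so a switch/case replaces a chain of
 * floating-point if/else comparisons. Hypothetical sketch. */
static int file_for_rank(const char *rank_text)
{
    switch ((int)(strtod(rank_text, NULL) * 10.0 + 0.5)) {
        case 5:  return 0;  /* rank 0.5 -> first output file  */
        case 15: return 1;  /* rank 1.5 -> second output file */
        case 25: return 2;  /* rank 2.5 -> third output file  */
        default: return -1; /* rank outside the expected set  */
    }
}
```

The `+ 0.5` guards against ranks like 0.5 parsing as 0.4999... and truncating to the wrong integer.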
This will extract segments and write them to separate files. You might use the program that reads the rank file and sorts the ranks to create the sed commands (save them in a sed program, let's say seqsplit.sed) and then run
sed -f seqsplit.sed data_file. I made the assumption that a different sequence, such as seq Q, might have the same rank
seqA :0.5
seqB :1.5
seqc :2.5
...
seqq :0.5
and so ends up in the same file.
Anyway, piping the data through a sed or awk filter will be faster than a script that reads a file line by line. The difference in speed might be enough.
By the way, you said you didn't know C++, but the code sample you posted was C++.
Here was a test run using your sample:
Code:
jschiwal@hpamd64:~/tmpdir> cat datafile
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq Q
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat selseq.sed
#n
/^>seq A/,+3 w fileA
/^>seq B/,+3 w fileB
/^>seq C/,+3 w fileC
/^>seq Q/,+3 w fileA
jschiwal@hpamd64:~/tmpdir> sed -f selseq.sed datafile
jschiwal@hpamd64:~/tmpdir> cat fileA
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq Q
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat fileB
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
jschiwal@hpamd64:~/tmpdir> cat fileC
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
Code:
#include <stdio.h>
#include <string.h>

/* this is the function to parse the input data;
 * value(), determine_file() and send_file_output() are
 * assumed to be defined elsewhere in the program */
static int parse_input_file(const char *fFile)
{
    /* open the file provided */
    FILE *input_file = fopen(fFile, "r");
    if (!input_file) return -1;

    /* this is to hold input data from the file */
    char current_line[1024];

    /* indicator of the current file */
    int current_file = -1;

    /* this loop keeps reading lines from the file until it reaches the end */
    while (fgets(current_line, 1024, input_file))
    {
        /* if there are fewer than 3 characters, the line is ignored ('\n' + '\0' = 2) */
        if (strlen(current_line) < 3) continue;

        /* to allow comparison to the list, remove the newline */
        current_line[strlen(current_line) - 1] = 0;

        /* if the line denotes a label, find the correct file and change it */
        if (current_line[0] == '>') current_file = determine_file(value(current_line + 1));

        /* send the line to the current file and restore the newline
         * (label lines are written, too) */
        send_file_output(current_line, strlen(current_line), current_file);
        send_file_output("\n", 1, current_file);
    }
    fclose(input_file);
    return 0;
}
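The helper functions referenced above (`value`, `determine_file`, `send_file_output`) are not shown in the post. A minimal sketch of what they might look like; the names come from the code, but the bodies below are entirely my guess (a tiny fixed label table standing in for the parsed rank file, and ranks bucketed by whole units):

```c
#include <stdio.h>
#include <string.h>

#define NUM_FILES 10

/* Assumed to be opened once at startup and closed at exit. */
static FILE *output_files[NUM_FILES];

/* Look up the rank associated with a sequence label.
 * A tiny fixed table stands in for the parsed rank file. */
static double value(const char *label)
{
    static const char  *labels[] = { "seq A", "seq B", "seq C" };
    static const double ranks[]  = { 0.5, 1.5, 2.5 };
    for (size_t i = 0; i < sizeof(labels) / sizeof(labels[0]); i++)
        if (strcmp(label, labels[i]) == 0) return ranks[i];
    return -1.0;   /* unknown label */
}

/* Map a rank to an output-file index (bucketed by whole units here). */
static int determine_file(double rank)
{
    int index = (int)rank;
    return (index >= 0 && index < NUM_FILES) ? index : -1;
}

/* Write a chunk of data to the selected output file; a no-op for
 * an invalid index or an unopened file. */
static void send_file_output(const char *data, size_t length, int file)
{
    if (file >= 0 && file < NUM_FILES && output_files[file])
        fwrite(data, 1, length, output_files[file]);
}
```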
Quote:
rupertwh, the bash script is a very straightforward solution to the problem.
However, when the data size gets very large, as in the case of a large number of DNA sequences, the script is too slow for high-performance computing applications...
If this is the case, locating the target record in your huge database will be the rate-limiting step. What you need is NOT a C program (NOT C++ either; C++ is slower than C); you need to do something different.
Many scientists in various fields have run into the same problem and already have solutions.
Quote:
Originally Posted by kaz2100
If this is the case, locating the target record in your huge database will be the rate-limiting step.
Actually, no. There is no need to locate a record in a huge database (see my script for an example).
I'm not sure why the bash script is so extremely slow, though. My guess is that either bash is *really* bad at string operations or the problem is that output files are closed/reopened for every record. Or both.
I was going to start learning Python this week -- I think I found a nice problem here to play around with...
Quote:
Originally Posted by rupertwh
Actually, no. There is no need to locate a record in a huge database (see my script for an example).
I am not good at reading bash scripts, however...
You sequentially scan the huge database file, if I understand your script correctly, and then do something with each record... I have seen one person use "grep" directly, and another use a "perl" script to do the same thing with a similar programming concept; both choked (double-digit hours of run time, on a cluster). Anyway, I will not argue any more.
It is good as a "nice problem to play around with", I agree.
Thanks for all this enthusiasm...
I will answer each of your queries one by one...
@jschiwal
Yes, there can be other sequences with the same score of 0.5. However, the increments of the ranks are not always 0.5; I mean, there might be ranks like 1.004 or something... I hope this answers your question. If the fractional change in the ranks were the same each time, the problem would have been much simpler, as you stated.
About C++, I would say that I try to work my way through it, but I am not good at it... I have worked with Java for many years and didn't touch C/C++, but the applications I am working on now require me to interact with C/C++.
Quote:
Originally Posted by kaz2100
If this is the case, locating the target record in your huge database will be the rate-limiting step. What you need is NOT a C program (NOT C++ either; C++ is slower than C); you need to do something different.
What do you suggest should be done? Any insight would be highly appreciated...
@ta0kira
Thank you once again... It wouldn't have been possible without your help...
Quote:
Originally Posted by rupertwh
ok, so I'm bored... ;o)
Couldn't resist and wrote a short C app to do the same. It's about 300 times faster (~250000 recs/sec).
I'm impressed. Didn't expect that much of a difference for more or less a straight (albeit selective) file copy.
Yeah, that is what I originally expected when I worked with bash scripts initially, but the difference turns out to be huge... You are probably right about the input/output behavior of bash scripts...
Not necessarily true. std::list using validated iterators and no optimization vs. your own linked list: PROBABLY. The only thing C++ really adds is member/virtual functions, constructors, and destructors (aside from compile-time semantics such as function overloading, templates, and namespaces). Any floating-point or *, /, or % operation will nullify any performance loss caused by the additional dereferencing of a vtable for a virtual function. Even the best C programmers are no match for a compiler when turning a C++ class into fast C code. If you need virtual functions, constructors, and destructors, you would be wasting your time trying that in C. If you don't, C might be the way to go. The STL is a whole different story, but it can be optimized to be very fast.
Creating a binary program is just the first step, which tends to be the greatest leap in performance for me. The difference between the speeds of C and C++ is trivial at most with gcc -O2 or greater when going about things the same way, and nearly all C can be put into a C++ program. The algorithm will be the limiting factor in most cases. Shell scripting is not suitable for a speed-optimizing algorithm. That much is obvious. Code that works is a start, then analysis of goal resistance vs. the code needs to take place.
Tabulating and indexing code is a lot easier to write and maintain in C++. You can modify it more easily and the concentration shifts from the many modifications C code would require to what you are really trying to do. C is no place to design a program of this sort with no definite algorithm. If you intend to redesign the algorithm, I'd recommend C++, then convert as much of it to C or asm as possible if it still isn't fast enough. If the current algorithm works for you, you needn't look any further.
ta0kira
Anyways, did some more poking at the bash script and must say: bash is just really slow, no matter what.
But Python turned out to be pretty awesome. Still slower than C, of course, but way ahead of bash.
(this is my very first Python script, so Python hackers be gentle...)
Code:
#!/usr/bin/python
import math

datafname = "data.txt"
rankfname = "rank.txt"
outfbase  = "outfile_"

outmap   = {}
outfiles = []
for i in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
    outfiles.append(open(outfbase + '%02d' % i, 'w'))

rankfile = open(rankfname, 'r')
for line in rankfile.xreadlines():
    tag  = line.split(':')[0].strip()
    rank = float(line.split(':')[1].strip())
    rankno = int(math.ceil(rank * 10))
    if rankno == 0:
        rankno = 1
    if rankno < 0 or rankno > 10:
        print "Invalid rank: " + line
        continue
    outmap[tag] = outfiles[rankno - 1]
rankfile.close()

tag = ""
datafile = open(datafname, 'r')
for line in datafile.xreadlines():
    if line[0] == '>':
        tag = line[1:].strip()
        outfile = outmap.get(tag, "")
        if not outfile:
            print "Unranked seq: " + tag
            tag = ""
            continue
    if len(tag) > 0:
        outfile.write(line)
datafile.close()

for f in outfiles:
    f.close()
So the score is currently (on my machine, obviously):
C: ~400,000 recs/s
Python: ~70,000 recs/s
Bash: ~1,000 recs/s
BTW: You can get some 15% speed increase on the C version by using a lookup table instead of a linked list.
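The lookup-table change described above might look roughly like this: instead of scanning one flat linked list for every label, index an array by the label's first byte so each lookup only walks a short chain. This is a sketch, not the actual code; the `list_node` type and its field names are invented for illustration:

```c
#include <stddef.h>
#include <string.h>

/* One entry of a label -> output-file-index mapping. */
struct list_node {
    const char       *label;
    int               file_index;
    struct list_node *next;
};

/* Linked-list version: O(n) scan over every entry per lookup. */
static int lookup_list(const struct list_node *head, const char *label)
{
    for (; head; head = head->next)
        if (strcmp(head->label, label) == 0)
            return head->file_index;
    return -1;   /* label not found */
}

/* Lookup-table version: one slot per possible first byte, so a
 * lookup only scans the entries sharing that first character. */
static struct list_node *table[256];

static void table_insert(struct list_node *node)
{
    unsigned char slot = (unsigned char)node->label[0];
    node->next  = table[slot];
    table[slot] = node;
}

static int lookup_table(const char *label)
{
    return lookup_list(table[(unsigned char)label[0]], label);
}
```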
Let's have a let's-have-a-flamewar vs. let's-not-have-a-flamewar flamewar
Quote:
Originally Posted by rupertwh
BTW: You can get some 15% speed increase on the C version by using a lookup table instead of a linked list.
As in a contiguous array? Yes, that would be much faster. Faster still would be numerical section labels, so that the comparison would be register vs. register rather than string vs. string. Of course, you would pay for that up front by converting all of the labels.
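The register-vs-register comparison could be achieved by converting each label to an integer key once, up front. A sketch (packing the first eight bytes into a `uint64_t` is my own choice, not something from the thread):

```c
#include <stdint.h>
#include <string.h>

/* Pack up to the first 8 bytes of a label into one integer.
 * Short tags like "seq A" fit entirely, so two labels compare
 * with a single integer comparison instead of a strcmp loop.
 * Labels longer than 8 bytes would still need a tie-breaking
 * string comparison when their packed prefixes collide. */
static uint64_t label_key(const char *label)
{
    uint64_t key = 0;
    size_t length = strlen(label);
    if (length > sizeof key) length = sizeof key;
    memcpy(&key, label, length);
    return key;
}
```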
Generally I like to process everything using references and then extract data from the end result. That speeds things up quite a bit because you don't have to move data multiple times. What I mean by this is figure out where things go first using pointers then copy once you've figured it out. Can't really do that here, though.
Here is another way to optimize for speed, along the lines of contiguous arrays. Not intending to provoke anything, I can objectively say that arrays of structures are much easier to handle with C++ than C if only for the fact that you can increment pointers without having to do sizeof. So back to the story. What you can do is create a 257^x matrix of labels, where x will be the number of dimensions. Each dimension will represent a different character in the label, with the last being reserved for the remaining matches.
Code:
#include <cstring>
#include <string>
#include <vector>

struct tabled_value
{
    tabled_value() : value() {}

    tabled_value(const std::string lLabel, double vValue) :
        label(lLabel), value(vValue) {}

    inline bool operator == (const char *lLabel) const
    { return !strcmp(this->label.c_str(), lLabel); }

    std::string label;
    double      value;
};

class multi_table3
{
public:
    void add_value(const std::string lLabel, double vValue)
    {
        unsigned int X = 256, Y = 256, Z = 256, size = lLabel.size();
        if (size > 0) X = (unsigned char) lLabel[0];
        if (size > 1) Y = (unsigned char) lLabel[1];
        if (size > 2) Z = (unsigned char) lLabel[2];
        master_table[X][Y][Z].insert(master_table[X][Y][Z].begin(), tabled_value());
        if (size > 3)
            master_table[X][Y][Z][0].label = lLabel.substr(3, size - 3);
        master_table[X][Y][Z][0].value = vValue;
    }

    double retrieve_value(const char *lLabel) const
    {
        unsigned int X = 256, Y = 256, Z = 256, size = strlen(lLabel);
        const char empty[] = "";
        const char *compare = empty;
        if (size > 0) X = (unsigned char) lLabel[0];
        if (size > 1) Y = (unsigned char) lLabel[1];
        if (size > 2) Z = (unsigned char) lLabel[2];
        if (size > 3) compare = lLabel + 3;
        std::vector <tabled_value>::const_iterator begin = master_table[X][Y][Z].begin(),
                                                   end   = master_table[X][Y][Z].end();
        while (begin < end)
            if (*begin == compare) return begin->value;
            else ++begin;
        return 0.0;
    }

private:
    std::vector <tabled_value> master_table[257][257][257];
};
Using the structures above (I'll spare you the entire test) vs. a normal vector (discounting allocation time) and a rank table of 50 labels, I get about 410,200 records per second with the 1d structure vs. 615,200 with the 3d. That's a 50% performance increase. The following caveats apply, however:
record size = 80B
labels are the numbers 1 - 10 spelled out, then repeated with 1 - 4 suffixed
allocation time is the limiting factor for smaller data sets, but can be compiler-optimized to a trivial level with huge sets (tested with 726,000 records)
The test code is long, but if you REALLY want to see it I will provide it. For the most part it's just a retrofit of the previous code I provided.
ta0kira
I can't believe I've started using small yellow circles embedded with poor representations of facial expressions to imply light-heartedness and scolding. I need a break from the internet...
Update: with -O1 and above I get .27s allocation time vs. 1.37s with -O0. Strangely enough, I get better improvement with 3d vs. 1d with -O0, giving me 1d @ 195,200 r/s vs. 3d @ 412,500 (111% increase in speed.)
I also tested vs. simulated contiguous arrays (effectively a C array) which gave me 283,600 r/s at -O0 (agreeably much better than 1d vector) and 376,200 at -O9 (56% improvement, 75% improvement.)
Obviously an informal test, but the approximate rates are repeatable. Not trying to start a competition, just so you know. I just like coming up with different ways to do things.
ta0kira
edit: looks like there needs to be > .5 million records involved and at least -O1 to make it worth it...
edit2: making the array of vectors 2d instead of 3d is MUCH better. ~ 30-125% improvement.