Counting number of characters in a file (but not as simple as wc)
I'm working with sequence files in FASTA format. For the non-geneticist, it's basically the standard file format for DNA sequence data, and it follows this form:
Maybe you used the wrong type of quote (perhaps you used a backtick, `, instead of an apostrophe, '). Here's a version to put in a file to make a program you can run, with comments to explain it:
Code:
#!/usr/bin/perl
# the previous line says this file is a script to be interpreted
# by the program /usr/bin/perl. You might need to change the path
# if perl is installed somewhere else on your system
my $n = 0; # running character count; starting at 0 means empty input prints 0
while(<>) # read each line of the input and execute the following code for each line
{
    if (!/^>/) { # if line does not start with a '>' character ...
        chomp; # remove new line character from $_ (the current input line)
        $n += length($_); # add the length of the line to the variable $n
    }
}
print "$n\n"; # output the character count which is in the variable $n
              # followed by a new line character, \n.
To use this, save it into a file and put the file somewhere in your PATH, e.g. /usr/local/bin. Maybe call the file "fasta_count". Then use chmod to make the file executable:
Code:
chmod 755 /usr/local/bin/fasta_count
As long as /usr/local/bin is in your PATH, you can then run it on files like this:
Code:
fasta_count input_file
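If you are not sure whether /usr/local/bin is in your PATH, something like the following can check. This is only a sketch for a POSIX shell; the function name in_path is made up for illustration and is not a standard command.

```shell
# in_path: hypothetical helper (not a standard command).
# Succeeds if the directory given as $1 is one of the entries in PATH.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;   # found as a complete PATH entry
        *)        return 1 ;;   # not listed
    esac
}

if in_path /usr/local/bin; then
    echo "/usr/local/bin is in PATH"
else
    echo "/usr/local/bin is not in PATH"
fi
```

Wrapping PATH in colons means only whole entries match, so /usr/local/bin does not falsely match /usr/local/bin2.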
Last edited by matthewg42; 09-18-2008 at 06:54 AM.
Matthew -- thanks for the great Perl advice. For the most part I program solely in Perl, but I have had some difficulty writing quick one-liners.
Going back over your code (I haven't figured out how to do a code block yet):
while(<>) # As a first line, how come you did not have to open a file path, such as open(FASTA,"<fasta.fa"); @line=<FASTA>; close(FASTA)?
Does Perl automatically take the first argument as the input for the while loop? If so, could you write while(<ARGV[0]>), since the file name is the argument?
if (!/^>/) # I'm still getting used to Perl's shorthand, but I assume this is the same as having a for loop iterate through all of @line and then using:
if($line[$i] !~ m/^>/) # I do like your way much better; I have been really bad about storing input first instead of reading it in on the fly
$n += length($_); # length() just tells you the number of characters, correct?
Yves: initially that is exactly what I was looking for but I'm glad to have added all these other hacks to my arsenal. I've never seen the tail command so I need to read up on that in the man pages
tr -d '\r\n' means delete newlines (\n), and what is \r? Do they have to follow each other like that, or does that mean just delete all \r and \n occurrences?
I need to look up what the flag -L means in wc
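On the tr question: \n is the newline (line feed) character and \r is the carriage return that DOS/Windows line endings put in front of it. tr -d treats its argument as a set of characters and deletes every occurrence of each one, wherever it appears; they do not have to be adjacent. A small sketch to try it, assuming a POSIX shell with grep, tr and wc (the function name count_bases and the sample data are made up for illustration):

```shell
# count_bases: hypothetical helper, not taken from the thread.
# Drops FASTA header lines, deletes every \r and every \n, counts what is left.
count_bases() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# Sample file: one line ends in \r\n (DOS style), one in \n (Unix style).
tmp=$(mktemp)
printf '>seq1\nACGT\r\nAC\n' > "$tmp"
count_bases "$tmp"    # prints 6: ACGT plus AC, no line endings counted
rm -f "$tmp"
```

The final tr -d ' ' only strips the padding that some wc implementations put in front of the number.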
Ghostdog: I just ordered a quick reference guide for sed and awk. It seems to be the way to go for streaming data manipulation.
If I understand this correctly, !/^>/ means don't take lines that start with >, but {total+=length}END{print "Total chars:"total} has me stumped. Is that just awk syntax?
You can try it out:
Code:
awk '{print length}' file
Once you run this, you will see what it means. And yes, it's awk syntax.
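Spelled out, the counting one-liner has two parts: a pattern-action pair that runs for every line not starting with >, and an END block that runs once after the last line has been read. With no argument, length is the length of the current line. Here is a sketch of the same idea wrapped in a shell function. The name fasta_total is an assumption for illustration, and the + 0 is an addition so that a file with no sequence lines prints 0.

```shell
# fasta_total: hypothetical wrapper around the awk one-liner discussed above.
fasta_total() {
    awk '
        # Runs for each line that does not start with ">": add its length.
        !/^>/ { total += length }
        # Runs once after all input; total + 0 prints 0 if nothing was counted.
        END   { print "Total chars:" (total + 0) }
    ' "$1"
}
```

Once the function is defined in your shell, run it as fasta_total input_file. Note that awk counts a trailing \r as a character, so files with DOS line endings will come out one higher per line.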
I must be missing something small. Actually, your code, as I understand it, is the most compatible with what I am doing, as I just remembered that there are multiple sequences in a file, such that you would have many occurrences of:
In Perl, <FILE> in the scalar context will read a line from the open file, FILE. Subsequent calls will read the next line of the file.
If you don't provide anything between the < and >, Perl does something a little weird, but super-useful.
If there is anything in @ARGV, each item will be treated as an input file name. Perl will automatically open them and read each line of each file in order. If @ARGV is empty, Perl will read standard input.
Thus, this Perl program does the same thing as the utility "cat":
Code:
#!/usr/bin/perl
while(<>)
{
    print;
}
* * *
This code:
Code:
if ( ! /^>/ ) { ... }
will execute "..." if the current input line (which is held in the variable "$_") does not match the regular expression "^>". "^" means the start of the line, and ">" is just the character. This is shorthand for:
Code:
if ( $_ !~ /^>/ ) { ... }
For many operations and functions, "$_" is the default argument, which will be used if no argument is explicitly specified. The same applies to the chomp function, used in the example.
* * *
length() does indeed return the number of characters in a string.
* * *
For documentation on any Perl built-in function, you can run this command in a terminal:
Code:
perldoc -f function_name
You can also look in the perlfunc manual page for a big list of all the built-in functions with their documentation.
The regular expression operators (like the matching operator used above) are documented in the perlop manual page.