Counting number of characters in a file (but not as simple as wc)

bioinformatics_guy · 09-18-2008, 05:42 AM

I'm working with sequence files in FASTA format. For the non-geneticist, it's basically the standard file format for DNA sequences, and it follows this form:

>Sometext_numbers_whatever
ACCAGCGAGCGAGCGAGCAGC
AGCGATCGATCGTAGCTAGCTGACTCG
ACTAGCTAGTCAGTGCTAGTCGATCGAGCAG
TCAGTACGTACGTAGCTAGCTGACTCATG
CGTAGCTAGCTAGCTAGCTGAACGTACG

What I would like to do is count all characters (excluding carriage returns and the first line, which will always start with a >).

Is there a one liner that will do this?

matthewg42 · 09-18-2008, 05:53 AM

You could use a little perl program on the command line:

Code:

perl -e 'END { print "$n\n"; } while(<>) { if (!/^>/) { chomp; $n+=length; } }' input_file

bioinformatics_guy · 09-18-2008, 06:38 AM

Perl is definitely right up my alley. I tried that piece of code and was returned this error:

/: Event not found.

Would you mind breaking down that code?

matthewg42 · 09-18-2008, 06:52 AM

Maybe you used the wrong type of quote (perhaps a backtick, `, instead of an apostrophe, '). Here's a version to put in a file to make a program you can run, with comments to explain it:

Code:

#!/usr/bin/perl
# the previous line says this file is a script to be interpreted 
# by the program /usr/bin/perl.  You might need to change the path
# if perl is installed somewhere else on your system

while(<>)  # read each line of the input and execute the following code for each line
{
    if (!/^>/) {          # if line does not start with a '>' character ... 
        chomp;            # remove new line character from $_ (the current input line)
        $n += length($_); # add the length of the line to the variable $n
    }
} 

print "$n\n";  # output the character count which is in the variable $n
               # followed by a new line character, \n.

To use this, save it into a file and put the file somewhere in your PATH, e.g. /usr/local/bin. Maybe call the file "fasta_count". Then use chmod to make the file executable:

Code:

chmod 755 /usr/local/bin/fasta_count

As long as /usr/local/bin is in your PATH, you can then run it on files like this:

Code:

fasta_count input_file

theYinYeti · 09-18-2008, 07:08 AM

tail -n +2 YOUR_FILE | tr -d '\r\n' | wc -L

Yves.

ghostdog74 · 09-18-2008, 08:14 AM

You can try Perl's simpler cousin:

Code:

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' file

bioinformatics_guy · 09-18-2008, 09:10 AM

Matthew -- thanks for the great perl advice. For the most part I am solely programming in perl, but have had some difficulty in writing quick one liners.

Going back over your code (I haven't figured out how to do a code block yet):

while(<>) # As a first line, how come you did not have to open a file path, such as open(FASTA,"<fasta.fa") ; @line=<FASTA> ; close(FASTA)?
Does Perl automatically take the first argument as the input for the while, and if so, could you write while(<ARGV[0]>) since the file name is the argument?

if (!/^>/) # I'm still getting used to Perl's shorthand, but I am assuming this is the same as having a for loop iterate through all of my @line and then using:

if($line[$i] != m/^>/) # I do like your way much better, I have been real bad about not reading stuff in on the fly but storing it first

$n += length($_); # length() just tells you the number of characters correct?

Yves: initially that is exactly what I was looking for but I'm glad to have added all these other hacks to my arsenal. I've never seen the tail command so I need to read up on that in the man pages

tr -d '\r\n' means delete carriage returns (\n), and what is \r? Do they have to follow each other like that, or does that just mean delete all \r and \n occurrences?

I need to look up what the flag -L means in wc

Ghostdog: I just ordered a quick reference guide for sed,awk. It seems to be the way to go for streaming data manipulation

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' file

If I understand this correctly, !/^>/ means don't take lines that start with >, but {total+=length}END{print "Total chars:"total}' has me stumped. Is that just awk syntax?

Thank you all for your help!

ghostdog74 · 09-18-2008, 09:48 AM

Quote:

Originally Posted by bioinformatics_guy

Ghostdog: I just ordered a quick reference guide for sed,awk. It seems to be the way to go for streaming data manipulation

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' file

If I understand this correctly, !/^>/ means don't take lines that start with >, but {total+=length}END{print "Total chars:"total}' has me stumped. Is that just awk syntax?

you can try it out

Code:

awk '{print length}' file

Once you run this, you'll see what it means. And yes, it's awk syntax.
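
To spell out the structure: an awk program is a list of "pattern { action }" rules, and each action runs for every input line that matches its pattern. The sketch below is the same one-liner from earlier in the thread, spread over several lines with comments. The sample file and the function name count_chars are made up for illustration.

```shell
# Made-up sample file: two records, 4 + 2 + 3 sequence characters.
sample=$(mktemp)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > "$sample"

count_chars() {
    awk '
        # Rule 1. Pattern: the line does NOT start with ">".
        # Action: add the length of the line ($0) to total.
        !/^>/ { total += length($0) }

        # Rule 2. END is a special pattern: its action runs once,
        # after the last input line has been read.
        END { print "Total chars: " total }
    ' "$@"
}

count_chars "$sample"
```

For this sample it should print "Total chars: 9". Calling length with no argument, as in the original one-liner, means the length of the current line.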

bioinformatics_guy · 09-18-2008, 09:53 AM

Ghostdog,

I tried running your command and received the following error:

awk '^>/{ total+=length}END{ print "Total chars: "total}' test.fa
awk: ^>/{ total+=length}END{ print "Total chars: "total}
awk: ^ syntax error
awk: ^>/{ total+=length}END{ print "Total chars: "total}
awk: ^ unterminated regexp

Am I missing something? I first had to delete the ` and replace with ' as my system seems to do something screwy when I cut and paste it

jschiwal · 09-18-2008, 10:02 AM

You could combine simple tools:
sed 1d testfile | tr -d '\n\t \r' | wc -c

Most distros will add ~/bin/ to your path if it exists. That would be a better place to put your user scripts than /usr/local/bin/.

ghostdog74 · 09-18-2008, 10:05 AM

Quote:

Originally Posted by bioinformatics_guy

Ghostdog,

Am I missing something? I first had to delete the ` and replace with ' as my system seems to do something screwy when I cut and paste it

Check the syntax again. It's definitely not what I gave you.

bioinformatics_guy · 09-18-2008, 10:23 AM

I must be missing something small. Actually, your code, as I understand it, is the most compatible with what I am doing, as I just remembered that there are multiple sequences in a file, so you would have many occurrences of:

>This_Structure
AJAKJKFKAKFKJ
AKJKLFAKJFK
AKLJFLKALFJ
>This_Structure
AKJLFLAKALFAJ
AKLFKALFLKALKFJ
FJKLAJFLAKFLKJ
etc...
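
With several records in one file, the pipelines earlier in the thread that drop only line 1 (tail -n +2, sed 1d) would count the later header lines as sequence. Here is a sketch of a pipeline variant that filters out every header line instead, using a made-up sample file:

```shell
# Made-up sample with two records: 4 + 2 + 3 sequence characters.
sample=$(mktemp)
printf '>This_Structure\nACGT\nAC\n>This_Structure\nGGG\n' > "$sample"

# Drop every line starting with ">", delete all \r and \n characters,
# then count the bytes that are left.
grep -v '^>' "$sample" | tr -d '\r\n' | wc -c
```

For this sample it should report 9 (some wc implementations pad the number with leading spaces).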

bioinformatics_guy · 09-18-2008, 10:28 AM

And I did notice that the ! was omitted, my mistake.

It's weird: if I copy directly from the forum

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' test.fa

I get the error, "/: Event not found."

Then if I cycle back a step to the last command, it gives me this:

awk '^>/{ total+=length}END{ print "Total chars: "total}' test.fa

Omitting the !. If I go back and put the ! in,

it says "Bad ! arg selector" or

awk '!>/{ total+=length}END{ print "Total chars: "total}' test.fa
awk: !>/{ total+=length}END{ print "Total chars: "total}
awk: ^ syntax error
awk: !>/{ total+=length}END{ print "Total chars: "total}
awk: ^ unterminated regexp

Am I missing something?
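
The messages "/: Event not found." and "Bad ! arg selector" look like csh/tcsh history expansion rather than anything coming from awk or Perl. Those shells treat ! as a history reference even inside single quotes, which would also explain why the recalled command has lost its !. Assuming that is the cause, one way around it is to write the rule with no ! at all, as in this sketch (the sample file is made up):

```shell
# Made-up sample file: 4 + 2 + 3 sequence characters.
sample=$(mktemp)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > "$sample"

# Same count as the awk one-liner above, but header lines are skipped
# with "next" instead of being excluded by negation, so no "!" is needed.
awk '/^>/ { next } { total += length($0) } END { print "Total chars: " total }' "$sample"
```

Only the awk line itself is needed in csh or tcsh. The lines that build the sample file use sh syntax. Escaping the character as \! should also work in tcsh.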

matthewg42 · 09-18-2008, 03:08 PM

In Perl, <FILE> in the scalar context will read a line from the open file, FILE. Subsequent calls will read the next line of the file.

If you don't provide anything between the < and >, Perl does something a little weird, but super-useful.

If there is anything in @ARGV, each item will be treated as an input file name. Perl will automatically open them and read each line of each file in order. If @ARGV is empty, Perl will read standard input.

Thus, this Perl program does the same thing as the utility "cat"

Code:

#!/usr/bin/perl

while(<>)
{
    print;
}

* * *

This code:

Code:

if ( ! /^>/ ) { ... }

will execute "..." if the current input line (which is held in the variable "$_") does not match the regular expression "^>". "^" means the start of the line, and ">" is just the character. This is shorthand for:

Code:

if ( $_ !~ /^>/ ) { ... }

For many operations and functions, "$_" is the default argument, which will be used if no argument is explicitly specified. The same applies to the chomp function, used in the example.

* * *

length() does indeed return the number of characters in a string.

* * *

For documentation on any Perl built-in function, you can run this command in a terminal:

Code:

perldoc -f function_name

You can also look in the perlfunc manual page for a big list of all the built-in functions with their documentation.

The regular expression operators (like the matching operator used above) are documented in the perlop manual page.

chrism01 · 09-18-2008, 08:15 PM

You may find this useful if you don't already have it bookmarked: http://perldoc.perl.org/