LinuxQuestions.org
Old 09-18-2008, 06:42 AM   #1
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Rep: Reputation: 15
Counting number of characters in a file (but not as simple as wc)


I'm working with sequence files in FASTA format. For the non-geneticist, it's basically the standard file format for DNA sequences, and it follows this form:

>Sometext_numbers_whatever
ACCAGCGAGCGAGCGAGCAGC
AGCGATCGATCGTAGCTAGCTGACTCG
ACTAGCTAGTCAGTGCTAGTCGATCGAGCAG
TCAGTACGTACGTAGCTAGCTGACTCATG
CGTAGCTAGCTAGCTAGCTGAACGTACG

What I would like to do is count all characters, excluding carriage returns and the first line, which will always start with a >.

Is there a one-liner that will do this?
 
Old 09-18-2008, 06:53 AM   #2
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
You could use a little perl program on the command line:
Code:
perl -e 'END { print "$n\n"; } while(<>) { if (!/^>/) { chomp; $n+=length; } }' input_file
 
Old 09-18-2008, 07:38 AM   #3
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
Perl is definitely right up my alley. I tried that piece of code and got this error:

/: Event not found.

Would you mind breaking down that code?
 
Old 09-18-2008, 07:52 AM   #4
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Maybe you used the wrong type of quote (perhaps you used a backtick, `, instead of an apostrophe, '). Here's a version to put in a file to make a program you can run, with comments to explain it:
Code:
#!/usr/bin/perl
# the previous line says this file is a script to be interpreted 
# by the program /usr/bin/perl.  You might need to change the path
# if perl is installed somewhere else on your system

while(<>)  # read each line of the input and execute the following code for each line
{
    if (!/^>/) {          # if line does not start with a '>' character ... 
        chomp;            # remove new line character from $_ (the current input line)
        $n += length($_); # add the length of the line to the variable $n
    }
} 

print "$n\n";  # output the character count which is in the variable $n
               # followed by a new line character, \n.
To use this, save it into a file and put the file somewhere in your PATH, e.g. /usr/local/bin. Maybe call the file "fasta_count". Then use chmod to make the file executable:
Code:
chmod 755 /usr/local/bin/fasta_count
As long as /usr/local/bin is in your PATH, you can then run it on files like this:
Code:
fasta_count input_file

Last edited by matthewg42; 09-18-2008 at 07:54 AM.
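
The install steps above (save the file, make it executable, put its directory on PATH) can be tried without touching /usr/local/bin. The following is only a sketch under stated assumptions: it uses a throwaway directory, and the fasta_count here is a hypothetical awk wrapper standing in for the Perl script, with a made-up sample file.

```shell
# Sketch only: installs into a throwaway directory instead of /usr/local/bin.
bindir=$(mktemp -d)               # stand-in for a directory on your PATH

# Hypothetical fasta_count written as an awk wrapper (not the Perl script above).
cat > "$bindir/fasta_count" <<'EOF'
#!/bin/sh
# Sum the lengths of all lines that do not start with '>'.
awk '/^>/ { next } { n += length } END { print n + 0 }' "$@"
EOF

chmod 755 "$bindir/fasta_count"   # make it executable
PATH="$bindir:$PATH"              # put the directory on PATH for this session

# Try it on a small made-up sample file.
sample=$(mktemp)
printf '>seq1\nACGT\nAC\n' > "$sample"
count=$(fasta_count "$sample")
echo "$count"
```

Because the script is found through PATH, it is called by name only, exactly as in the fasta_count input_file example.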
 
Old 09-18-2008, 08:08 AM   #5
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 61
tail -n +2 YOUR_FILE | tr -d '\r\n' | wc -L

Yves.
 
Old 09-18-2008, 09:14 AM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by bioinformatics_guy View Post
I'm working with sequence files in FASTA format. For the non-geneticist, it's basically the standard file format for DNA sequences, and it follows this form:

>Sometext_numbers_whatever
ACCAGCGAGCGAGCGAGCAGC
AGCGATCGATCGTAGCTAGCTGACTCG
ACTAGCTAGTCAGTGCTAGTCGATCGAGCAG
TCAGTACGTACGTAGCTAGCTGACTCATG
CGTAGCTAGCTAGCTAGCTGAACGTACG

What I would like to do is count all characters, excluding carriage returns and the first line, which will always start with a >.

Is there a one-liner that will do this?
You can try Perl's simpler cousin, awk:
Code:
awk '!/^>/{ total+=length}END{ print "Total chars: "total}' file
 
Old 09-18-2008, 10:10 AM   #7
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
Matthew -- thanks for the great Perl advice. I program almost exclusively in Perl, but have had some difficulty writing quick one-liners.

Going back over your code (I haven't figured out how to do a code block yet):


while(<>) # As a first line, how come you did not have to open a file, e.g. open(FASTA,"<fasta.fa"); @line=<FASTA>; close(FASTA)?
Does Perl automatically take the first argument as the input for the while loop, and if so, could you write while(<$ARGV[0]>), since the file name is the argument?


if (!/^>/) # I'm still getting used to Perl's shorthand, but I am assuming this is the same as having a for loop iterate through all of @line and then using:

if ($line[$i] !~ m/^>/) # I do like your way much better; I have been really bad about storing input first instead of reading it on the fly

$n += length($_); # length() just tells you the number of characters, correct?

Yves: initially that is exactly what I was looking for, but I'm glad to have added all these other hacks to my arsenal. I've never seen the tail command, so I need to read up on it in the man pages.

tr -d '\r\n' means delete carriage returns (\n) and what is \r? Do they have to follow each other like that, or does that mean just delete all \r and \n occurrences?

I need to look up what the flag -L means in wc.

Ghostdog: I just ordered a quick reference guide for sed and awk. They seem to be the way to go for streaming data manipulation.

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' file

If I understand this correctly, !/^>/ means don't take lines that start with >, but { total+=length}END{ print "Total chars: "total}' has me stumped. Is that just awk syntax?

Thank you all for your help!
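
On the tr and wc questions above: in tr's notation \n is the newline (line feed) and \r is the carriage return, which shows up in files saved with Windows-style line endings. tr -d treats its argument as a set, so every occurrence of either character is deleted wherever it appears; they do not need to be adjacent. In GNU wc, -L prints the length of the longest line, while -c counts bytes and is the more portable choice. A small sketch with a made-up two-line input:

```shell
# Made-up input: two lines with Windows-style (\r\n) endings.
# 10 bytes in total: 6 letters + 2 carriage returns + 2 newlines.
with_endings=$(printf 'ACGT\r\nAC\r\n' | wc -c)

# Deleting the set '\r\n' removes every CR and every LF: 6 bytes remain.
letters_only=$(printf 'ACGT\r\nAC\r\n' | tr -d '\r\n' | wc -c)

# Deleting only '\n' leaves the two carriage returns behind: 8 bytes remain.
only_lf_removed=$(printf 'ACGT\r\nAC\r\n' | tr -d '\n' | wc -c)

echo "$with_endings $letters_only $only_lf_removed"
```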
 
Old 09-18-2008, 10:48 AM   #8
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by bioinformatics_guy View Post
Ghostdog: I just ordered a quick reference guide for sed and awk. They seem to be the way to go for streaming data manipulation.

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' file

If I understand this correctly, !/^>/ means don't take lines that start with >, but { total+=length}END{ print "Total chars: "total}' has me stumped. Is that just awk syntax?
You can try it out:
Code:
awk '{print length}' file
Once you run this, you will see what it means. And yes, it's awk syntax.
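
To spell the syntax out a little: an awk program is a list of pattern { action } pairs that are tried against every input line. Here !/^>/ is the pattern, { total += length } is the action, length with no argument means the length of the current line, and the END block runs once after the last line has been read. A minimal sketch with made-up input:

```shell
# Per-line view: print the length of every line, header included.
per_line=$(printf '>seq1\nACGT\nAC\n' | awk '{ print length }')

# Same input, but only lines NOT starting with '>' are added up, and the
# END block prints the running total once all input has been read.
total=$(printf '>seq1\nACGT\nAC\n' | awk '!/^>/ { total += length } END { print total + 0 }')

echo "$per_line"
echo "$total"
```

The "+ 0" in the END block is an addition of mine so that an input with no sequence lines prints 0 rather than an empty string.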
 
Old 09-18-2008, 10:53 AM   #9
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
Ghostdog,

I tried running your command and received the following error:

awk '^>/{ total+=length}END{ print "Total chars: "total}' test.fa
awk: ^>/{ total+=length}END{ print "Total chars: "total}
awk: ^ syntax error
awk: ^>/{ total+=length}END{ print "Total chars: "total}
awk: ^ unterminated regexp

Am I missing something? I first had to delete the ` and replace it with ', as my system seems to do something screwy when I cut and paste it.
 
Old 09-18-2008, 11:02 AM   #10
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 670
You could combine simple tools:
sed 1d testfile | tr -d '\n\t \r' | wc -c

Most distros will add ~/bin/ to your PATH if it exists. That would be a better place for your user scripts than /usr/local/bin/.

Last edited by jschiwal; 09-18-2008 at 11:06 AM.
 
Old 09-18-2008, 11:05 AM   #11
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by bioinformatics_guy View Post
Ghostdog,

Am I missing something? I first had to delete the ` and replace with ' as my system seems to do something screwy when I cut and paste it
Check the syntax again. It's definitely not what I gave you.
 
Old 09-18-2008, 11:23 AM   #12
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
I must be missing something small. Actually, your code as I understand it is the most compatible with what I am doing, as I just remembered that there are multiple sequences in a file, so you would have many occurrences of:

>This_Structure
AJAKJKFKAKFKJ
AKJKLFAKJFK
AKLJFLKALFJ
>This_Structure
AKJLFLAKALFAJ
AKLFKALFLKALKFJ
FJKLAJFLAKFLKJ
etc...
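
With several records in one file, the difference between "skip line 1" and "skip every line that starts with >" becomes visible. The sketch below runs both approaches from this thread on the sample above, written to a temporary file (the file name is arbitrary):

```shell
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>This_Structure
AJAKJKFKAKFKJ
AKJKLFAKJFK
AKLJFLKALFJ
>This_Structure
AKJLFLAKALFAJ
AKLFKALFLKALKFJ
FJKLAJFLAKFLKJ
EOF

# Skipping only line 1 leaves the second header line in the count.
first_line_only=$(sed 1d "$fasta" | tr -d '\n\t \r' | wc -c)

# Filtering on the '>' prefix skips every header, however many records there are.
all_headers=$(awk '!/^>/ { total += length } END { print total + 0 }' "$fasta")

echo "$first_line_only $all_headers"
```

The two numbers differ by the length of the second header line, which is why the pattern-based filter is the safer choice for multi-record files.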
 
Old 09-18-2008, 11:28 AM   #13
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
And I did notice that the ! was omitted, my mistake.

It's weird: if I copy directly from the forum

awk '!/^>/{ total+=length}END{ print "Total chars: "total}' test.fa

I get the error, "/: Event not found."

Then if I cycle back a step to the last command, it gives me this:

awk '^>/{ total+=length}END{ print "Total chars: "total}' test.fa

Omitting the !. If I go back and put the ! in,

it says "Bad ! arg selector" or

awk '!>/{ total+=length}END{ print "Total chars: "total}' test.fa
awk: !>/{ total+=length}END{ print "Total chars: "total}
awk: ^ syntax error
awk: !>/{ total+=length}END{ print "Total chars: "total}
awk: ^ unterminated regexp

Am I missing something?
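
A possible explanation, offered as an assumption since the thread never states which shell is in use: "Event not found" and "Bad ! arg selector" are messages that csh and tcsh print when history expansion fails, and in those shells ! triggers history expansion even inside single quotes. If that is the case here, one option is to escape it as \!; another is to write the filter without any ! at all. The sketch below takes the second route. The variable assignments around it are POSIX sh and only there to make the example self-contained; the awk and grep commands themselves are what would be typed at the prompt, and the sample file is made up.

```shell
# Made-up sample standing in for test.fa.
fa=$(mktemp)
printf '>seq1\nACGT\nAC\n>seq2\nGG\n' > "$fa"

# awk: skip header lines with "next" instead of negating the pattern.
awk_total=$(awk '/^>/ { next } { total += length } END { print total + 0 }' "$fa")

# grep -v drops the header lines, tr strips line endings, wc -c counts the rest.
grep_total=$(grep -v '^>' "$fa" | tr -d '\r\n' | wc -c)

echo "$awk_total $grep_total"
```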
 
Old 09-18-2008, 04:08 PM   #14
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
In Perl, <FILE> in the scalar context will read a line from the open file, FILE. Subsequent calls will read the next line of the file.

If you don't provide anything between the < and >, Perl does something a little weird, but super-useful.

If there is anything in @ARGV, each item will be treated as an input file name. Perl will automatically open them and read each line of each file in order. If @ARGV is empty, Perl will read standard input.

Thus, this Perl program does the same thing as the utility "cat"
Code:
#!/usr/bin/perl

while(<>)
{
    print;
}
* * *

This code:
Code:
if ( ! /^>/ ) { ... }
will execute "..." if the current input line (which is held in the variable "$_") does not match the regular expression "^>". "^" means the start of the line, and ">" is just the character. This is shorthand for:
Code:
if ( $_ !~ /^>/ ) { ... }
For many operations and functions, "$_" is the default argument, which will be used if no argument is explicitly specified. The same applies to the chomp function, used in the example.

* * *

length() does indeed return the number of characters in a string.

* * *

For documentation on any Perl built-in function, you can run this command in a terminal:
Code:
perldoc -f function_name
You can also look in the perlfunc manual page for a big list of all the built-in functions with their documentation.

The regular expression operators (like the matching operator used above) are documented in the perlop manual page.
 
Old 09-18-2008, 09:15 PM   #15
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
You may find this useful if you don't already have it bookmarked: http://perldoc.perl.org/
 
  



All times are GMT -5. The time now is 10:50 AM.
