Perl script that skips a header, sorts the rest, then operates on the data
As you can see the order of the first column (below the header) is lexical, and not alphanumeric. I want to be able to skip the header, sort the data correctly, and then perform functions on the data. I figured out how to sort the data properly (an unnecessarily complicated procedure in Perl IMO), but that was on a test file with no header.
Currently I can do exactly what I want by calling my Perl script from a shell script like this:
Here sed removes the header, sort -V sorts alphanumerically (a version sort), and the -var= value is passed to "perl-script.pl".
However, I want to integrate all of this into a single Perl script. There is a lot more I plan to do, and I need to learn more about Perl structure and syntax.
Here's the Perl script I've written so far.
Code:
#!/usr/bin/perl
use warnings;
no warnings 'uninitialized';
use Math::Round;
use Getopt::Long;

@xls = $ARGV[0];
open ( IN, "@xls" ) or die "Can't open file: $!";
$count = 1;
my $args;
my %args;
GetOptions(\%args,"var=f") or die "D'oh!";
die "Missing -var=[num]!\n" unless $args{var};
while (<IN>) {
    next if /#/; # remove the header
    chomp;
    my @fields = split ("\t",$_);
    if ($fields[8] <= $args{var}) {
        my @new_score = nearest(1,$fields[6]);
        my @name = $count;
        print "$fields[0]\t$fields[1]\t$fields[2]\tData_@name\t@new_score\n";
        $count++
    }
}
close IN;
This is the other Perl script I wrote, which can sort alphanumerically:
Code:
#!/usr/bin/perl
use warnings;
no warnings 'uninitialized';
use Sort::Naturally;

@xls = $ARGV[0];
open (IN, "@xls") or die "Can't open this shit $!";
my @sort = map {$_->[0]}
           sort {ncmp($a->[1], $b->[1])}
           map {chomp;[$_,split(/\t/)]} <IN>;
print "$_\n" for @sort;
close IN;
There seems to be something I don't get about how to perform a function on an entire array (e.g. sort) and then take that entire sorted array and perform other functions on it as if it were the $ARGV[0] file. Does that make sense? This is a sticking point for me in other programs I'm writing, so any help here should help me there too.
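As a rough illustration of the read, sort, then process flow asked about here, the sketch below keeps each stage as its own array. Everything in it is hypothetical: the sample data, the four-column layout, and the use of sprintf as a stand-in for Math::Round's nearest() (sprintf rounds exact halves differently). The numeric comparison also assumes every name looks like "chr<number>".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tab-separated input, held in a string so the sketch is self-contained.
my $data = join "", map { "$_\n" } (
    "#chrom\tstart\tend\tscore",
    "chr10\t5\t9\t2.6",
    "chr2\t1\t4\t7.2",
    "chr1\t3\t8\t0.4",
);
open(my $in, '<', \$data) or die "Can't open in-memory file: $!";

# Step 1: read everything once, dropping header lines as we go.
my @rows;
while (my $line = <$in>) {
    next if $line =~ /^\s*#/;
    chomp $line;
    push @rows, [ split /\t/, $line ];   # one array reference per row
}
close $in;

# Step 2: sort the whole array; the result is just another array.
# Assumption: column 1 looks like "chr<number>", so compare the number.
my @sorted = sort {
    ($a->[0] =~ /(\d+)/)[0] <=> ($b->[0] =~ /(\d+)/)[0]
} @rows;

# Step 3: loop over the sorted array exactly as you would over file lines.
my @out;
my $count = 1;
for my $row (@sorted) {
    my $rounded = sprintf "%.0f", $row->[3];   # stand-in for nearest(1, ...)
    push @out, join "\t", @$row[0 .. 2], "Data_$count", $rounded;
    $count++;
}
print "$_\n" for @out;
```

The point is that a sorted array can be looped over just like lines from a file, so there is no need to feed it back in as if it were $ARGV[0].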
Anchor the regex at the start of the line, e.g. /^\s*#/. It's much faster, it optionally allows leading white space, and it's probably more accurate.
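To make the accuracy point concrete, here is a small sketch with made-up sample lines (the 'peak#1' field is purely hypothetical). It compares an unanchored /#/ with an anchored /^\s*#/ and does not try to measure speed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines: one header and two data rows, one of which
# happens to contain a '#' inside a field.
my @lines = (
    "#chrom\tstart\tend\tname\n",
    "chr2\t100\t200\tpeak#1\n",
    "chr10\t300\t400\tpeak2\n",
);

# Unanchored: matches '#' anywhere, so the 'peak#1' row is dropped too.
my @loose = grep { !/#/ } @lines;

# Anchored: only lines whose first non-blank character is '#' are skipped.
my @anchored = grep { !/^\s*#/ } @lines;

printf "unanchored kept %d rows, anchored kept %d rows\n",
    scalar @loose, scalar @anchored;
```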
After skipping the heading, you can add the line to an array: push(@list, $line);
Then sort it:
# @list holds plain lines here, so split out the first column for the comparison
my @newlist = sort { ncmp((split /\t/, $a)[0], (split /\t/, $b)[0]) } @list;
After sorting, you can iterate and split out the columns:
foreach my $x (@newlist)
{
    my @columns = split("\t", $x);
    # ... work with @columns here
}
That's one way. You say you'll do a lot of stuff to the numbers, so there might be a faster way, but it depends on whether the first column is unique. Will the first column (chr1, chr2, etc.) ever have duplicates?
Thanks for the reply, codeguy. Using the code below I can skip the header and print out the data, but it's not sorted the same way sort -V does it. I'm not saying it's wrong, though. The column I'm sorting holds chromosome names, and as it is now the order puts chr10 ahead of chr2. My output puts the numbers in the right order (e.g. chr10 comes after chr2), but chrX comes before them all. Linux sort -V doesn't come out like that. I think I see why: sort -V is a version sort, but Sort::Naturally puts non-numeric values first. http://search.cpan.org/~bingos/Sort-...t/Naturally.pm
So, I need to try the Sort::Version module to get this right (chrX should come last).
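For what it's worth, the ordering described here (numbered chromosomes in numeric order, chrX last) can also be had without any extra module. The sketch below is only one possible approach: chrom_key and by_chrom are hypothetical helper names, and it assumes names of the form "chr" followed by either digits or letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a chromosome name into a sortable triple.
# Numbered chromosomes get rank 0 and sort by number; everything else
# (chrX, chrY, chrM, ...) gets rank 1 and sorts by name after them.
sub chrom_key {
    my ($name) = @_;
    return $name =~ /^chr(\d+)$/ ? (0, $1, '') : (1, 0, $name);
}

# Hypothetical comparator built on chrom_key.
sub by_chrom {
    my ($x, $y) = @_;
    my @kx = chrom_key($x);
    my @ky = chrom_key($y);
    return $kx[0] <=> $ky[0]     # numbered before lettered
        || $kx[1] <=> $ky[1]     # numeric order within numbered
        || $kx[2] cmp $ky[2];    # string order within lettered
}

my @chroms = qw(chr10 chrX chr2 chr1 chrY chr21);
my @sorted = sort { by_chrom($a, $b) } @chroms;
print "@sorted\n";   # chr1 chr2 chr10 chr21 chrX chrY
```

To sort whole rows rather than bare names, the same comparator can be called on the first column of each row, e.g. by_chrom($a->[0], $b->[0]) when rows are stored as array references.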
I'll figure that out and come back. I have some other questions regarding the rest of the code that wasn't working for me. Also, I commented out the push part and changed @list back to <IN> in the sort routine because for some reason it simply returned the original array, sans header.
Code:
#!/usr/bin/perl
use warnings;
no warnings 'uninitialized';
use Math::Round;
use Getopt::Long;
use Sort::Naturally;

@xls = $ARGV[0];
open ( IN, "@xls" ) or die "Can't open file: $!";
#$count = 1;
my $args;
my %args;
GetOptions(\%args,"FDR=f") or die "D'oh!";
die "Missing -FDR=[num]!\n" unless $args{FDR};
while (<IN>) {{
    next if /#/; # remove the header #
    chomp;
    # my @list;
    # push(@list, $_);
    my @newlist = map {$_->[0]}
                  sort {ncmp($a->[1], $b->[1])}
                  map {chomp;[$_,split(/\t/)]} <IN>;
    print "$_\n" for @newlist;
}}
close IN;
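One thing worth checking in a script shaped like the one above: reading the filehandle in list context inside a while loop consumes every remaining line at once, so the line the loop has already read never reaches the sort. The sketch below shows that behaviour on a hypothetical four-line input. It illustrates how Perl's readline works and is not a diagnosis of the exact file used in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical four-line input held in memory.
my $data = "#header\nchr2\nchr10\nchr1\n";
open(my $in, '<', \$data) or die "Can't open in-memory file: $!";

my $iterations = 0;
my @slurped;
while (my $line = <$in>) {       # scalar context: reads ONE line per pass
    $iterations++;
    next if $line =~ /^\s*#/;    # pass 1: the header is skipped
    # List context: reads ALL remaining lines in one go, so the line
    # already held in $line ("chr2") is not part of @slurped.
    @slurped = <$in>;
}
close $in;

chomp @slurped;
print "loop ran $iterations times; slurped: @slurped\n";
```

Collecting the rows inside the loop and sorting once after it ends avoids losing that line.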
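Code:
#!/usr/bin/perl
use strict;
use warnings;
use Sort::Naturally;   # CPAN module that provides ncmp()

my @list;
my $file = shift;      # first command-line argument
print "Reading file: $file\n";
open(my $fh, '<', $file) or die "Can't open $file: $!";
while (<$fh>)
{
    next if (/^\s*#/);      # skip header lines (optional leading white space, then #)
    chomp;
    my @cols = split("\t");
    push(@list, \@cols);    # store each row as an array reference
}
close($fh);

# Natural sort on the first column, done once after all rows are read
my @sorted = sort {ncmp($a->[0], $b->[0])} @list;
foreach my $row (@sorted)
{
    print "[", $row->[0], "]","[", $row->[1], "]","[", $row->[2], "]\n";
}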
Last edited by codeguy; 11-02-2013 at 08:51 AM.
Reason: removed "use Data::Dumper;", was for testing only