Perl script that skips a header, sorts the rest, then operates on the data

captainentropy · 10-31-2013, 09:01 PM

I have these data files in this format:

Code:

##header
##to 
##be
##ignored	
chr1	numbers	more-numbers
chr10	numbers	more-numbers
chr2	numbers	more-numbers

As you can see, the order of the first column (below the header) is lexical rather than natural (chr10 sorts before chr2). I want to be able to skip the header, sort the data correctly, and then perform functions on the data. I figured out how to sort the data properly (an unnecessarily complicated procedure in Perl IMO), but that was on a test file with no header.

Currently I can do exactly what I want by calling my Perl script from a shell script like this:

Code:

#!/bin/bash
sed '1,18d' "$1" |
sort -V > $X.sort 
perl-script.pl $X.sort -var=$2

The sed removes the header (the first 18 lines), sort -V does a version sort so that chr2 comes before chr10, and the -var= value is passed to perl-script.pl.

However, I want to integrate all of this into a single Perl script. There is a lot more I plan to do, and I need to learn more about Perl structure, syntax, etc.

Here's the Perl script I've written so far.

Code:

#!/usr/bin/perl
use warnings;
no warnings 'uninitialized';
use Math::Round;
use Getopt::Long;

@xls = $ARGV[0];
open ( IN, "@xls" ) or die "Can't open file: $!";
$count = 1;
my $args;
my %args;
GetOptions(\%args,"var=f") or die "D'oh!";
die "Missing -var=[num]!\n" unless $args{var};
while (<IN>)	{
	next if /#/;		# remove the header	  
	chomp;
	my @fields = split ("\t",$_);
	if ($fields[8] <= $args{var})	{
	my @new_score = nearest(1,$fields[6]);
	my @name = $count;
	print "$fields[0]\t$fields[1]\t$fields[2]\tData_@name\t@new_score\n";
	$count++
					}
		}
close IN;

This is the other Perl script I wrote that can sort alphanumerically

Code:

#!/usr/bin/perl
use warnings;
no warnings 'uninitialized';
use Sort::Naturally;

@xls = $ARGV[0];
open (IN, "@xls") or die "Can't open this shit $!";
my @sort = map {$_->[0]}
	   sort {ncmp($a->[1], $b->[1])}
	   map  {chomp;[$_,split(/\t/)]} <IN>;
print "$_\n" for @sort;
close IN;

There seems to be something I don't get about how to perform a function on an entire array (e.g. sort) and then take that entire sorted array and perform other functions on it as if it were the $ARGV[0]. Does that make sense? This is a sticking point for me in other programs I'm writing too, so any help here should help with those as well.

codeguy · 10-31-2013, 09:45 PM

Quote:

next if /#/; # remove the header

You might consider:

Code:

next if (/^\s*#/);

Anchor the regex at the start: it fails fast on data lines, optionally allows leading whitespace, and is more accurate because it won't skip a data line that merely contains a # somewhere.
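
To illustrate the difference, here is a minimal sketch using made-up sample lines (the "chr#3" row is hypothetical, chosen only to show a false match). The unanchored pattern also drops a data line that happens to contain a #, while the anchored one only drops lines whose first non-blank character is #.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines; "chr#3" is a hypothetical data row containing a '#'.
my @lines = ("##header", "  # indented comment", "chr1\t10\t20", "chr#3\t30\t40");

# Unanchored: drops every line with a '#' anywhere in it.
my @loose = grep { !/#/ } @lines;

# Anchored: drops only lines whose first non-blank character is '#'.
my @strict = grep { !/^\s*#/ } @lines;

print scalar(@loose),  " line(s) kept by /#/\n";
print scalar(@strict), " line(s) kept by /^\\s*#/\n";
```

With these four lines the loose pattern keeps only the chr1 row, while the anchored pattern keeps both data rows.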

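After skipping the header, you can add the line to an array (declared before the loop): push(@list, $_);

Then sort it on the first column. The array holds plain strings at this point, so split inside the comparator:

my @newlist = sort { ncmp((split /\t/, $a)[0], (split /\t/, $b)[0]) } @list;

After sorting, you can iterate and split out the columns:
foreach my $x (@newlist)
{
    my @columns = split("\t", $x);
    # ... work with @columns here ...
}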

That's one way. You say you'll do a lot more with the numbers, so there might be a faster way, but it depends on whether the first column is unique. Will the first column (chr1, chr2, etc.) ever have duplicates?

-Andy

captainentropy · 11-01-2013, 09:35 PM

Thanks for the reply, codeguy. Using the script below I can skip the header and print out the data, but it's not sorted the same way sort -V does it. I'm not saying it's wrong, though. The column I'm sorting holds chromosomes, and in the input file chr10 comes ahead of chr2. My output puts the numbers in the right order (e.g. chr10 comes after chr2), but chrX comes before them all. Linux sort -V doesn't come out like that. I think I see why: sort -V is a version sort, but Sort::Naturally will put non-numeric values first: http://search.cpan.org/~bingos/Sort-...t/Naturally.pm

So, I need to try the Sort::Versions module to get this right (chrX should come last).
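
If no module gives exactly the wanted order, a hand-written comparator is another option. The following is only a sketch under assumptions: chrom_cmp is a hypothetical helper name, the sample rows are made up, and the choice to order named chromosomes among themselves by plain string comparison is mine, not something established in this thread. It sorts numbered chromosomes numerically and puts named ones such as chrX after them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: numbered chromosomes sort numerically,
# named ones (chrX, chrY, ...) come after all numbered ones.
sub chrom_cmp {
    my ($x, $y) = @_;
    # Strip the "chr" prefix from copies of both names.
    my ($kx, $ky) = map { my $n = $_; $n =~ s/^chr//i; $n } ($x, $y);
    my $x_num = $kx =~ /^\d+$/;
    my $y_num = $ky =~ /^\d+$/;
    return $kx <=> $ky if $x_num && $y_num;   # both numbered
    return -1          if $x_num;             # numbered before named
    return  1          if $y_num;
    return $kx cmp $ky;                       # both named: string order (assumption)
}

# Made-up rows, already split into columns.
my @rows = map { [ split /\t/ ] }
    ("chrX\t5\t9", "chr10\t1\t2", "chr2\t3\t4", "chr1\t7\t8");

# Sort on the first column only.
my @sorted = sort { chrom_cmp($a->[0], $b->[0]) } @rows;

print join("\t", @$_), "\n" for @sorted;
```

For these four rows the first column comes out as chr1, chr2, chr10, chrX. Rows that share a chromosome compare as equal here, so a second key (for example the start position) would have to be added to the comparator if their relative order matters.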

I'll figure that out and come back. I have some other questions regarding the rest of the code that wasn't working for me. Also, I commented out the push part and changed @list back to <IN> in the sort routine because for some reason it simply returned the original array, sans header.

Code:

#!/usr/bin/perl
use warnings;
no warnings 'uninitialized';
use Math::Round;
use Getopt::Long;
use Sort::Naturally;

@xls = $ARGV[0];
open ( IN, "@xls" ) or die "Can't open file: $!";
#$count = 1;
my $args;
my %args;
GetOptions(\%args,"FDR=f") or die "D'oh!";
die "Missing -FDR=[num]!\n" unless $args{FDR};
while (<IN>)	{{
	next if /#/;		# remove the header	#  
	chomp;	
#	my @list;
#	push(@list, $_);
	my @newlist = map {$_->[0]}
	   sort {ncmp($a->[1], $b->[1])}
	   map  {chomp;[$_,split(/\t/)]} <IN>;
	print "$_\n" for @newlist;	
		}}
close IN;
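
One thing that may be biting the script above: by the time the map reads <IN> in list context, the while loop has already consumed the first data row, so that row never reaches the sort. The sketch below reproduces this with a made-up in-memory file (the data and variable names are illustrative only) and contrasts it with declaring the array before the loop and pushing every data line into it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up data: one header line followed by three rows.
my $data = "##header\nchr1\t1\nchr10\t2\nchr2\t3\n";

# Pitfall: while() has already read the chr1 row when <$fh> slurps the rest.
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";
my @rest;
while (<$fh>) {
    next if /^\s*#/;
    @rest = <$fh>;      # list context: only the lines AFTER the current one
}
close $fh;
chomp @rest;
print "slurped inside the loop: ", scalar(@rest), " row(s)\n";

# Safer: declare the array before the loop and push every data line.
open($fh, '<', \$data) or die "Can't open in-memory file: $!";
my @all;
while (<$fh>) {
    next if /^\s*#/;
    chomp;
    push @all, $_;
}
close $fh;
print "pushed in the loop: ", scalar(@all), " row(s)\n";
```

Once the loop has finished, the full array can be sorted a single time and then processed row by row.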

codeguy · 11-02-2013, 08:50 AM

Um, here, how about this:

Code:

#!/usr/bin/perl

use strict;
use warnings;
use Sort::Naturally;


my @list;
my $file = shift;
print "Reading file: $file\n";
open(F, '<', $file) or die "Can't open $file: $!";
while (<F>)
{
	next if (/^\s*#/);
	chomp;
	my @cols = split("\t");
	push(@list, \@cols);
}
close(F);
my @sorted = sort {ncmp($a->[0], $b->[0])} @list;

foreach my $row (@sorted)
{
	print "[", $row->[0], "]","[", $row->[1], "]","[", $row->[2], "]\n";
}

bigearsbilly · 11-11-2013, 02:32 AM

my go ;-)

Code:

#!/usr/bin/perl 


sub your_func {
    print ">@_<\n";
}

chomp (my @L = grep {!/^#/} (<>));
map {your_func $_}  sort { (split " ", $a)[1] <=> (split " ", $b)[1] }  @L;