[SOLVED] A question about shell scripting: getting multiple input files for an application

zeratul111 · 09-21-2010, 01:38 AM

Hi everyone,

I'm new to Linux and shell scripting and am having trouble figuring out how to script the following. Any help would be much appreciated.

I am running an application called QuantiSNP (http://groups.google.co.uk/group/qua...uantisnp-usage). The binary file is "quantisnp2", which is called by the "run_quantisnp2.sh" script supplied by the authors. I am only able to run the application in single-file mode (i.e. 1 input file for 1 sample); I can't use batch processing because I don't have the necessary BeadStudio report files, which use a different input format.

The difficulty is that I have 300 samples (300 unique sample IDs) and 3 input files for each sample, for a total of 900 runs of this application. How can I automate this with a shell script instead of manually changing the sample ID and input file every time a run completes? In the single-file command below, the options that need to change for each run are "--sampleid" and "--input-files". The "--sampleid" option sets the name given to the output files for the sample of interest, and there are 3 input files for each sample.

/home2/jason/QuantiSNP/quantisnp/linux64/run_quantisnp2.sh /home2/jason/QuantiSNP/MCR/v79/ --config /home2/jason/QuantiSNP/quantisnp/config/params.dat --levels /home2/jason/QuantiSNP/quantisnp/config/levels-affy.dat --outdir /home2/jason/QuantiSNP/quantisnp_out/ --sampleid sample1 --gender female --input-files /home2/jason/files/sample1_input.txt

-------------

Note that each sample has 3 input files, for a total of 3 runs of "quantisnp2" for each sample.

e.g.
SAMPLEID INFILE
sample1 /home2/jason/files/sample1_input.txt
sample1 /home2/jason/files/sample1_input2.txt
sample1 /home2/jason/files/sample1_input3.txt
sample2 /home2/jason/files/sample2_input.txt
sample2 /home2/jason/files/sample2_input2.txt
sample2 /home2/jason/files/sample2_input3.txt
...etc.

-------------

Thanks again! Please feel free to let me know if anything I wrote above needs clarification.

quanta · 09-21-2010, 01:48 AM

Code:

while read -r id infile; do run_quantisnp2.sh ... --sampleid "$id" --input-files "$infile"; done < SAMPLEID.INFILE

David the H. · 09-21-2010, 01:50 AM

Well, the actual iteration can be done with a simple loop. The big question is making sure you're using the right files in the loop at the right time.

Could you break it down in just a bit more detail? What is the exact sequence of files that need to be processed? Is there any variability in the filenames or locations? Do the names always correspond to each other?

Finally, please use [code][/code] tags around the contents of scripts and text files, to preserve formatting and improve readability.

grail · 09-21-2010, 01:56 AM

Assuming all the files are in the same directory (i.e. /home2/jason/files/):

Code:

#!/bin/bash

for file in /home2/jason/files/*
do
    id_name=${file%_*}
    /home2/jason/QuantiSNP/quantisnp/linux64/run_quantisnp2.sh /home2/jason/QuantiSNP/MCR/v79/ --config /home2/jason/QuantiSNP/quantisnp/config/params.dat --levels \
    /home2/jason/QuantiSNP/quantisnp/config/levels-affy.dat --outdir /home2/jason/QuantiSNP/quantisnp_out/ --sampleid $id_name --gender female --input-files $file
done

This is of course untested, so I would copy 3 of your associated sample files into a temp directory for testing.

zeratul111 · 09-21-2010, 02:05 AM

Thanks quanta, David, and grail for your replies.

Code:

/home2/jason/QuantiSNP/quantisnp/linux64/run_quantisnp2.sh /home2/jason/QuantiSNP/MCR/v79/ --config /home2/jason/QuantiSNP/quantisnp/config/params.dat --levels /home2/jason/QuantiSNP/quantisnp/config/levels-affy.dat --outdir /home2/jason/QuantiSNP/quantisnp_out/ --sampleid sample1 --gender female --input-files /home2/jason/files/sample1_input.txt

For the above, the only things that vary from run to run are the --sampleid and --input-files options. --sampleid is simply a name, of which there are 300. The files passed to --input-files are all in the same location. The input text files vary in name, but each name contains the sample ID.

Because the text file names do not correspond exactly to the sample IDs, would it be better if I create a text file that lists the sample IDs and their corresponding input files and work this into the shell script somehow? (I hope this isn't too confusing.)
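A minimal sketch of that mapping-file idea, assuming a whitespace-separated file with the sample ID in the first column and the input file in the second, as in the SAMPLEID/INFILE listing in the first post. The function name run_from_map, the QS variable, and the DRY_RUN switch are hypothetical names used only for illustration; the paths and options are copied from the command above. By default it only prints the commands so they can be checked first.

```shell
#!/bin/bash
# Sketch: drive the runs from a two-column mapping file (sample ID, input file).
# run_from_map, QS and DRY_RUN are hypothetical names, not part of QuantiSNP.

QS=/home2/jason/QuantiSNP   # install prefix taken from the command above

run_from_map() {
    local map=$1 id infile
    # read -r splits each line on whitespace: first field -> id, rest -> infile
    while read -r id infile; do
        # skip blank lines and the "SAMPLEID INFILE" header row
        [ -z "$id" ] && continue
        [ "$id" = "SAMPLEID" ] && continue
        # build the command as an array so each path stays a single argument
        local cmd=("$QS/quantisnp/linux64/run_quantisnp2.sh" "$QS/MCR/v79/"
                   --config "$QS/quantisnp/config/params.dat"
                   --levels "$QS/quantisnp/config/levels-affy.dat"
                   --outdir "$QS/quantisnp_out/"
                   --sampleid "$id" --gender female
                   --input-files "$infile")
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "${cmd[@]}"          # dry run: only print the command
        else
            "${cmd[@]}" < /dev/null   # stop the program from reading the map on stdin
        fi
    done < "$map"
}
```

Calling `run_from_map samples.txt` (samples.txt being a hypothetical file name) prints one command per data row; `DRY_RUN=0 run_from_map samples.txt` would actually run them.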

Thanks! I will look at the code you guys provided in more detail right now.

zeratul111 · 09-21-2010, 05:42 PM

Hello again,

I ran grail's code above. For some reason it is not outputting the files correctly (e.g. nothing in the folder defined in --outdir).

From the output log for a processed file:

Quote:

QuantiSNP: Single-file mode input found.
QuantiSNP: Processing file: /home2/jason/QuantiSNP/testinput/gw6.P4A10_SNP6_R2
QuantiSNP. Chr23 is the X chromosome
QuantiSNP. Reading data for chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
QuantiSNP. Using EM for parameter estimation. Chromosome: 1.
QuantiSNP. Using EM for parameter estimation. Chromosome: 21.
QuantiSNP. Using EM for parameter estimation. Chromosome: 22.
QuantiSNP. Using EM for parameter estimation. Chromosome: 23.
QuantiSNP. CNV Calling: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
QuantiSNP. Writing QC file: /home2/jason/QuantiSNP/testoutput///home2/jason/QuantiSNP/testinput/gw6.P4A10_SNP6.qc

However, when I just run a single file only (original commands), the output is the following:

Quote:

QuantiSNP: Single-file mode input found.
QuantiSNP: Processing file: /home2/jason/QuantiSNP/gw6.P4A11_SNP6
QuantiSNP. Chr23 is the X chromosome
QuantiSNP. Reading data for chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
QuantiSNP. Using EM for parameter estimation. Chromosome: 1.
QuantiSNP. Using EM for parameter estimation. Chromosome: 2.
QuantiSNP. Using EM for parameter estimation. Chromosome: 3.
QuantiSNP. Using EM for parameter estimation. Chromosome: 4.
QuantiSNP. Using EM for parameter estimation. Chromosome: 5.
QuantiSNP. Using EM for parameter estimation. Chromosome: 6.
QuantiSNP. Using EM for parameter estimation. Chromosome: 7.
QuantiSNP. Using EM for parameter estimation. Chromosome: 8.
QuantiSNP. Using EM for parameter estimation. Chromosome: 9.
QuantiSNP. Using EM for parameter estimation. Chromosome: 10.
QuantiSNP. Using EM for parameter estimation. Chromosome: 11.
QuantiSNP. Using EM for parameter estimation. Chromosome: 12.
QuantiSNP. Using EM for parameter estimation. Chromosome: 13.
QuantiSNP. Using EM for parameter estimation. Chromosome: 14.
QuantiSNP. Using EM for parameter estimation. Chromosome: 15.
QuantiSNP. Using EM for parameter estimation. Chromosome: 16.
QuantiSNP. Using EM for parameter estimation. Chromosome: 17.
QuantiSNP. Using EM for parameter estimation. Chromosome: 18.
QuantiSNP. Using EM for parameter estimation. Chromosome: 19.
QuantiSNP. Using EM for parameter estimation. Chromosome: 20.
QuantiSNP. Using EM for parameter estimation. Chromosome: 21.
QuantiSNP. Using EM for parameter estimation. Chromosome: 22.
QuantiSNP. Using EM for parameter estimation. Chromosome: 23.
QuantiSNP. CNV Calling: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
QuantiSNP. Writing QC file: /home2/jason/QuantiSNP/quantisnp_out//P4A11_SNP6.qc
QuantiSNP. Writing output to file: /home2/jason/QuantiSNP/quantisnp_out//P4A11_SNP6.cnv
QuantiSNP. Writing genotypes to file: /home2/jason/QuantiSNP/quantisnp_out//P4A11_SNP6.gn
QuantiSNP. Done in 0.52 mins.

So, in summary, each processed file should produce 3 output files, and processing should be very fast. For the run above (using grail's code) I used 3 input files. Although the output log suggests that all three input files were processed, no output files were written, and the run took more than 30 minutes without finishing (compared to under 1 minute for the original single-file run).
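For checking many runs at once, something like the following might help. It is only a sketch: the function name missing_outputs is hypothetical, and it assumes the three output files are named after the sample ID with the .qc, .cnv and .gn extensions seen in the successful log above.

```shell
#!/bin/bash
# Sketch: list which of the three expected output files are missing for one
# sample. missing_outputs is a hypothetical helper, not part of QuantiSNP.
# Assumes outputs are written as <outdir>/<sampleid>.qc, .cnv and .gn.
missing_outputs() {
    local outdir=$1 id=$2 ext missing=0
    for ext in qc cnv gn; do
        if [ ! -e "$outdir/$id.$ext" ]; then
            echo "missing: $outdir/$id.$ext"
            missing=1
        fi
    done
    # exit status 0 if all three files exist, 1 otherwise
    return "$missing"
}
```

For example, `missing_outputs /home2/jason/QuantiSNP/quantisnp_out P4A11_SNP6` prints nothing and returns 0 when all three files are present.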

Help will be much appreciated. Thank you very much!

grail · 09-21-2010, 07:13 PM

Yeah, my bad there.

I forgot that the path would still be in front of the filename, so it ended up in the sample ID.
Give this a whirl:

Code:

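#!/bin/bash

for path_file in /home2/jason/files/*
do
    file=${path_file##*/}    # strip the leading directories, leaving the bare file name
    id_name=${file%_*}       # strip the last "_..." suffix to get the sample ID
    # Note: $file is a bare file name, so this assumes the script is run from
    # /home2/jason/files/; use $path_file instead to pass the full path.
    echo "/home2/jason/QuantiSNP/quantisnp/linux64/run_quantisnp2.sh /home2/jason/QuantiSNP/MCR/v79/ --config /home2/jason/QuantiSNP/quantisnp/config/params.dat --levels \
    /home2/jason/QuantiSNP/quantisnp/config/levels-affy.dat --outdir /home2/jason/QuantiSNP/quantisnp_out/ --sampleid $id_name --gender female --input-files $file"
done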

This will initially only echo out the command which you need to check against the one you are issuing from the command line.
If it looks correct then remove the echo and the quotes.
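To see why the first version put the directory into the sample ID, here is a small sketch of what the two parameter expansions do to a single path. The path is one of the example input files from the first post; the variable name with_path is made up for illustration.

```shell
#!/bin/bash
# Sketch: how the two parameter expansions behave on one example path.
path_file=/home2/jason/files/sample1_input.txt

with_path=${path_file%_*}   # first version: removes the shortest trailing "_*" but keeps the directory
file=${path_file##*/}       # removes the longest leading "*/", leaving the file name
id_name=${file%_*}          # removes the shortest trailing "_*" from the file name

echo "$with_path"   # /home2/jason/files/sample1
echo "$file"        # sample1_input.txt
echo "$id_name"     # sample1
```

With the directory still attached, --sampleid received a full path, which would explain the doubled path in the "Writing QC file" line of the failed run.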

zeratul111 · 09-21-2010, 07:59 PM

Hi grail, thanks for the code! It works perfectly now.