We'll be happy to help you; so post what YOU have written/done/tried, and tell us where you're stuck. But we aren't going to write your scripts for you; there are lots of scripting tutorials, sample scripts, and other things you can find with simple Internet searches for things like "how to parse a file in linux scripts".
Beyond that, your requested output doesn't match the input, you don't say what you want to do with the data, or how you want it output. If this is a one-off, you can probably import it into a spreadsheet program and parse it with some simple search/replace criteria.
Post these things in CODE tags, please, so they can be read more easily. And again, your suggested output has things that aren't in the INPUT...where, exactly, do you expect them to come from?? And you're hard-coding the 'b37' in? And there are more input fields than output fields, and you're not telling us what the separator is...a tab? Spaces (if so, how many?)? Where is the "Id" supposed to come from? How are you dealing with the CHROM/POS/etc. that's in your input?? And is the input all on one line, or separate lines, as shown?
If you want to remove the square brackets, then prepend them with a backslash...your first sed statement is escaping the opening bracket (\[) but not the closing bracket.
To delete everything within the brackets (INCLUDING the brackets): **NOTE: BELOW COMMANDS ARE UNTESTED**
Code:
sed -e 's/\[[^][]*\]//g'
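A quick way to try the bracket-deleting command without touching the real file is to feed it an invented line. The sample header below is made up for illustration, and the function name `strip_brackets` is hypothetical:

```shell
# Delete every [...] group, brackets included (same regex as the command above).
strip_brackets() {
    sed -e 's/\[[^][]*\]//g' "$@"
}

# Made-up, tab-separated header line used only for testing.
printf '#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT\n' | strip_brackets
```

If that prints the column names without the bracketed numbers, the same function can be pointed at a file, e.g. `strip_brackets chr22_ann.vcf`.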
You can use sed to parse down the HG's:
Code:
sed 's/_HG.*//g'
...and remove everything up to and including the last HG on the line (.* is greedy, so it does not stop at the first HG):
Code:
sed 's/.*HG//'
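Both HG commands are easy to sanity-check on an invented string containing two HG-style IDs before running them on real data. The sample string and the function names are made up for this sketch:

```shell
# Cut from the first "_HG" to the end of the line.
cut_from_first_hg() { sed 's/_HG.*//' "$@"; }

# Cut everything up to and including the LAST "HG" (".*" is greedy).
cut_through_last_hg() { sed 's/.*HG//' "$@"; }

echo 'GT_HG00096_HG00097' | cut_from_first_hg     # prints: GT
echo 'GT_HG00096_HG00097' | cut_through_last_hg   # prints: 00097
```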
You can then use the above as examples to remove the ".0" after the numbers, leaving you with the first digits (which is seemingly what you want). After that, you can use awk to print out the rest, but you will most likely have to script some things to get the dual-line output you're after. Again, see the bash scripting tutorials.
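For the trailing ".0", one possible sketch, assuming the values are whole numbers written as 0.0, 1.0, 2.0 and that anything like 1.05 should be left alone. It uses `sed -E` (extended regex, available in GNU and BSD sed); the function name and sample values are hypothetical:

```shell
# Drop a ".0" that directly follows digits, but only when the next
# character is not another digit (so 1.05 is left untouched).
strip_trailing_zero() {
    sed -E 's/([0-9]+)\.0([^0-9]|$)/\1\2/g' "$@"
}

# Made-up, tab-separated values for testing.
printf '0.0\t1.0\t2.0\t1.05\n' | strip_trailing_zero
```

If the real data contains other decimals (0.5, 1.25, ...), this leaves them as they are, which may or may not be what is wanted.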
I tried to do this:
sed 's/\[ ] //g' chr22_ann.vcf ## but this doesn't delete the []
Because the pattern you supplied ([blank]blank) doesn't exist in the data. You have to define the data you want to match exactly, which usually means a regex as TB0ne showed. Note those commands he gave really are untested. You have to be really specific with the data: for example, are all the decimals always .0, and never anything else?
Is all that data one record, or multiple lines?
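Both questions can be checked against the data before writing any substitution. A rough sketch, with hypothetical function names; it relies on `grep -o`, which GNU and BSD grep support but which is not in POSIX:

```shell
# Count lines containing the literal text "[ ] " (what the sed attempt
# was looking for). "|| true" keeps the exit status 0 when the count is 0.
count_pattern_lines() { grep -c '\[ ] ' "$@" || true; }

# List the distinct decimal parts that occur (".0", ".5", ...).
decimal_parts() { grep -oE '\.[0-9]+' "$@" | sort -u; }
```

Running `count_pattern_lines chr22_ann.vcf` should print 0 if the pattern never occurs, and `decimal_parts chr22_ann.vcf` shows whether anything other than .0 is present.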
Quote:
I also tried to awk '{new_var=$1"_"$2"_"$3"_"$4"_"+b37; print new_var}' chr22_ann.vcf
For this to work you'll need to loop over the remaining fields to get the entire record on your print statement - best done with printf.
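A minimal sketch of that loop. Note that in awk `+` is arithmetic, so `"_"+b37` evaluates to 0 rather than appending the text b37; the suffix has to be part of a string. The column layout is an assumption (tab-separated, with CHROM, POS, REF, ALT in fields 1 to 4 and lines starting with # treated as headers), and the function name is hypothetical:

```shell
# Join fields 1-4 into an ID like 22_16050075_A_G_b37, then print the
# remaining fields unchanged, tab-separated.
add_variant_id() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip header lines
         {
             printf "%s_%s_%s_%s_b37", $1, $2, $3, $4
             for (i = 5; i <= NF; i++)       # loop over the remaining fields
                 printf "%s%s", OFS, $i
             printf "\n"
         }' "$@"
}

# Made-up data line for testing.
printf '22\t16050075\tA\tG\t0\t1\n' | add_variant_id
```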
@OP. What TB0ne said in #4. As is, your sample data look totally messed up to me. I'd expect them to be organized in tabular form, with chromosome number (22) coming under the header #[1]CHROM, and so on.
Also, check examples in PrediXcan Sample Data and their tutorial. To me, it looks like they can use VCF data as direct input, without any conversion. In that case, you'll probably have to specify --on_the_fly_mapping METADATA.
Quote:
Originally Posted by TB0ne
your suggested output has things that aren't in the INPUT...where, exactly, do you expect them to come from?? And you're hard-coding the 'b37' in? And there are more input fields than output fields, and you're not telling us what the separator is...a tab? Spaces (if so, how many?)? Where is the "Id" supposed to come from? How are you dealing with the CHROM/POS/etc. that's in your input?? And is the input all on one line, or separate lines, as shown?
I probably can answer some of the questions. Both the input (VCF) and the output data (text dosage format) are TSV. For the latter, the official documentation states:
Quote:
The dosage format consists of gzipped, tab separated text files without header. Each line looks like:
Code:
chromosome variant_id position allele1 allele2 MAF id1 ... idn
with:
chromosome: the chromosome identifier (i.e. 1)
variant_id: a unique string identifying the variant. It can be an rsid, it can be a string encoding properties (chr1_123_C_T_b38), or whatever.
position: position of variant in the chromosome
allele1: The non-effect allele (sometimes called "ancestral", or "ref" in Thousand Genomes)
allele2: effect allele (the one which dosage will be used to predict expression/splicing/etc)
MAF: allele frequency of allele 2 (unused at the moment)
id1 ... idn: each entry is the dosage (count/probabilities of allele 2 across the chromosomes)
[...]
But I don't think converting VCF data to the text dosage format is even necessary.
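If a conversion does turn out to be needed, the quoted layout could be produced along these lines. This is only a sketch under several assumptions not confirmed in the thread: the input is tab-separated with CHROM, POS, REF, ALT in fields 1 to 4 and one numeric allele-2 dosage (0 to 2) per sample after that; the variant_id follows OP's CHROM_POS_REF_ALT_b37 attempt; and the sixth column is estimated as mean dosage divided by 2. The function name is hypothetical:

```shell
# Emit: chromosome variant_id position allele1 allele2 MAF id1 ... idn
to_dosage_rows() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                            # the format has no header
         {
             n = NF - 4; sum = 0
             for (i = 5; i <= NF; i++) sum += $i  # total allele-2 dosage
             freq = (n > 0) ? sum / (2 * n) : 0   # assumed frequency estimate
             printf "%s\t%s_%s_%s_%s_b37\t%s\t%s\t%s\t%.4f", \
                    $1, $1, $2, $3, $4, $2, $3, $4, freq
             for (i = 5; i <= NF; i++) printf "\t%s", $i
             printf "\n"
         }' "$@"
}
```

Since the documentation describes the files as gzipped, the output would be piped through gzip, for example `to_dosage_rows input.tsv | gzip > chr22.dosage.txt.gz` (both file names are placeholders).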