Old 01-14-2022, 04:33 PM   #1
rheab
LQ Newbie
 
Registered: Jan 2022
Posts: 2

Rep: Reputation: 0
edit columns in file using linux


Hello, I have an imputed dosage file for chr 22. Its content is:

#[1]CHROM [2]POS [3]REF [4]ALT [5]HG00096_HG00096 [6]HG00097_HG00097 [7]HG00099_HG00099 [8]HG00100_HG00100 [9]HG00101_HG00101
[10]HG00102_HG00102 [11]HG00103_HG00103 [12]HG00105_HG00105 [13]HG00106_HG00106 [14]HG00107_HG00107 [15]HG00108_HG00108 [16]HG00109_HG00109 [17]HG00110_HG00110 22 16051249 T C 0.0 1.0 1.0 0.0
0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0




I want to remove the [] prefixes and anything coming after the _ in the column names. Also, instead of the separate columns 22, 16051249, T and C, I want to write:

22_16051249_T_C_b37

So the final file should look like:
Id HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00104 HG00105 HG00106 HG00108 HG00109 HG00110 HG00111 HG00112 HG00114 HG00115 HG00116
22_16051249_T_C_b37 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 2
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0
Can anyone please help me out with this?
 
Old 01-14-2022, 04:48 PM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,279

Rep: Reputation: 7897
We'll be happy to help you; so post what YOU have written/done/tried, and tell us where you're stuck. But we aren't going to write your scripts for you; there are lots of scripting tutorials, sample scripts, and other things you can find with simple Internet searches for things like "how to parse a file in linux scripts".

Beyond that, your requested output doesn't match the input, you don't say what you want to do with the data, or how you want it output. If this is a one-off, you can probably import it into a spreadsheet program and parse it with some simple search/replace criteria.
 
Old 01-14-2022, 05:13 PM   #3
rheab
LQ Newbie
 
Registered: Jan 2022
Posts: 2

Original Poster
Rep: Reputation: 0
I tried to do this:
sed 's/\[ ] //g' chr22_ann.vcf ## but this doesn't delete the []

I also tried to awk '{new_var=$1"_"$2"_"$3"_"$4"_"+b37; print new_var}' chr22_ann.vcf


Basically this is the input file content:
#[1]CHROM [2]POS [3]REF [4]ALT [5]HG00096_HG00096 [6]HG00097_HG00097 [7]HG00099_HG00099
22 16051249 T C 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0

The output file should be:
Id HG00096 HG00097 HG00099
22_16051249_T_C_b37 0 1 1 0 0 1 0 1
 
Old 01-14-2022, 06:26 PM   #4
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,279

Rep: Reputation: 7897
Post these things in CODE tags, please, so they can be read more easily. And again, your suggested output has things that aren't in the INPUT...where, exactly, do you expect them to come from?? And you're hard-coding the 'b37' in? And there are more input fields than output fields, and you're not telling us what the separator is...a tab? Spaces (if so, how many?)? Where is the "Id" supposed to come from? How are you dealing with the CHROM/POS/etc. that's in your input?? And is the input all on one line, or separate lines, as shown?

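If you want to remove the brackets, the opening one needs a backslash in front of it (a lone closing bracket is literal in sed, so escaping it is optional). Your first sed statement does escape the opening bracket (\[), but the pattern as a whole only matches the literal text "[ ] " (bracket, space, bracket, space), which isn't in your data.

To delete everything within the brackets (INCLUDING the brackets): **NOTE: BELOW COMMANDS ARE UNTESTED**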
Code:
sed -e 's/\[[^][]*\]//g'
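You can use sed to trim the duplicated "_HG..." suffix from each sample ID (matching only the digits after it, so the rest of the line is left alone):
Code:
sed 's/_HG[0-9]*//g'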
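...and remove everything before the first HG, i.e. everything up to and including the ALT column (a plain .*HG is greedy and would eat everything up to the last HG instead):
Code:
sed 's/.*ALT[[:space:]]*//'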
You can then use the above as examples to remove the ".0" after the numbers, leaving you with the first digits (which is seemingly what you want). After that, you can use awk to print out the rest, but you will most likely have to script some things to get the dual-line output you're after. Again, see the bash scripting tutorials.
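As a rough sketch of how the pieces above could be chained into one sed call: this assumes the file is tab-separated with the header on line 1, and the helper name clean_dosage and the cut-down sample are made up for illustration. The patterns are deliberately narrow (digits only inside the brackets and after _HG) so they don't swallow neighbouring columns. It does not join the first four columns of the data lines; that part is easier in awk.

```shell
# Sketch only: assumes tab-separated input with the header on line 1.
# clean_dosage is a made-up helper name; it reads stdin and writes stdout.
tab=$(printf '\t')

# Expressions, in order:
#   1. drop the "[n]" column indices
#   2. drop the duplicated "_HG..." suffix from each sample ID
#   3. on the header line, replace everything through ALT with "Id"
#   4+5. strip a trailing ".0" from values (before a separator, or at end of line)
clean_dosage() {
    sed -e 's/\[[0-9]*]//g' \
        -e 's/_HG[0-9]*//g' \
        -e "1s/^.*ALT[[:space:]]*/Id${tab}/" \
        -e 's/\.0\([[:space:]]\)/\1/g' \
        -e 's/\.0$//'
}

# Cut-down sample based on the data posted in this thread.
printf '#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT\t[5]HG00096_HG00096\t[6]HG00097_HG00097\n22\t16051249\tT\tC\t0.0\t1.0\n' |
    clean_dosage
```

On a real file that would be something like clean_dosage < chr22_ann.vcf > cleaned.tsv (the output name is just a placeholder). If the decimals are not always .0, the last two expressions need rethinking.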
 
Old 01-14-2022, 09:41 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,008

Rep: Reputation: 4099
Quote:
Originally Posted by rheab View Post
I tried to do this:
sed 's/\[ ] //g' chr22_ann.vcf ## but this doesn't delete the []
Because the text your pattern describes ("[", blank, "]", blank) doesn't exist in the data. You have to define the data you want to match exactly, which usually means a regex as TB0ne showed. Note that the commands he gave really are untested. You have to be really specific about the data: for example, are all the decimals always .0, and never anything else?
Is all that data one record, or multiple lines?
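A small experiment along these lines may make the point clearer. The header string is a shortened, space-separated stand-in for the real one, and the three helper names are invented for this sketch. The first function is the pattern that was tried, the second describes what is actually in the data, and the third lists the distinct decimal values so you can check whether they really are all .0 (grep -o is assumed to be available, as it is in GNU grep).

```shell
header='#[1]CHROM [2]POS [3]REF'

# What was tried: matches only the literal text "[ ] ", which never occurs.
tried() { sed 's/\[ ] //g'; }

# A pattern describing what is really there: "[", some digits, "]".
strip_index() { sed 's/\[[0-9]*]//g'; }

# List the distinct decimal values found in the input.
distinct_decimals() { grep -Eo '[0-9]+\.[0-9]+' | sort -u; }

printf '%s\n' "$header" | tried         # prints the header unchanged
printf '%s\n' "$header" | strip_index   # prints: #CHROM POS REF
printf '0.0 1.0 1.0 0.5\n' | distinct_decimals
```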
Quote:
I also tried to awk '{new_var=$1"_"$2"_"$3"_"$4"_"+b37; print new_var}' chr22_ann.vcf
For this to work you'll need to loop over the remaining fields to get the entire record on your print statement - best done with printf.
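Something like the following is one way that loop could look. It is only a sketch: it assumes the header is on line 1, the data lines start with CHROM, POS, REF, ALT, and the fields are separated by tabs or spaces; join_id is a made-up name. Note also that in the awk attempt above, +b37 is arithmetic on an unset variable rather than the text "b37", so the literal has to go inside the quotes.

```shell
# Sketch only: join_id is a hypothetical helper; reads stdin, writes tab-separated output.
join_id() {
    awk 'BEGIN { OFS = "\t" }
    NR == 1 {
        # Header: print "Id", then each sample name without "[n]" and "_HG..." parts.
        printf "Id"
        for (i = 5; i <= NF; i++) {
            name = $i
            sub(/^\[[0-9]+\]/, "", name)
            sub(/_.*/, "", name)
            printf "%s%s", OFS, name
        }
        printf "\n"
        next
    }
    {
        # Data: glue the first four fields together with the hard-coded "b37" tag...
        printf "%s_%s_%s_%s_b37", $1, $2, $3, $4
        # ...then loop over the remaining fields; "+ 0" turns 1.0 into 1.
        for (i = 5; i <= NF; i++) printf "%s%s", OFS, $i + 0
        printf "\n"
    }'
}

printf '#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT\t[5]HG00096_HG00096\t[6]HG00097_HG00097\t[7]HG00099_HG00099\n22\t16051249\tT\tC\t0.0\t1.0\t1.0\n' |
    join_id
```

The "+ 0" conversion keeps non-integer dosages such as 0.5 as they are, so it is safe even if the decimals turn out not to be always .0.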
 
Old 01-15-2022, 09:36 AM   #6
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,667

Rep: Reputation: Disabled
Cross-posted to Bioinformatics SE.

@OP. What TB0ne said in #4. As is, your sample data look totally messed up to me. I'd expect them to be organized in tabular form, with chromosome number (22) coming under the header #[1]CHROM, and so on.

Considering your other question on Bioinformatics SE, have a look at DosageConvertor. The documentation for Michigan Imputation Server mentions some conversion tools as well.

Also, check examples in PrediXcan Sample Data and their tutorial. To me, it looks like they can use VCF data as direct input, without any conversion. In that case, you'll probably have to specify --on_the_fly_mapping METADATA.

Quote:
Originally Posted by TB0ne View Post
your suggested output has things that aren't in the INPUT...where, exactly, do you expect them to come from?? And you're hard-coding the 'b37' in? And there are more input fields than output fields, and you're not telling us what the separator is...a tab? Spaces (if so, how many?)? Where is the "Id" supposed to come from? How are you dealing with the CHROM/POS/etc. that's in your input?? And is the input all on one line, or separate lines, as shown?
I can probably answer some of the questions. Both the input (VCF) and the output data (text dosage format) are TSV. For the latter, the official documentation states:
Quote:
The dosage format consists of gzipped, tab separated text files without header. Each line looks like:
Code:
chromosome variant_id position allele1 allele2 MAF id1 ... idn
with:
  • chromosome: the chromosome identifier (i.e. 1)
  • variant_id: a unique string identifying the variant. It can be an rsid, it can be a string encoding properties (chr1_123_C_T_b38), or whatever.
  • position: position of variant in the chromosome
  • allele1: The non-effect allele (sometimes called "ancestral", or "ref" in Thousand Genomes)
  • allele2: effect allele (the one which dosage will be used to predict expression/splicing/etc)
  • MAF: allele frequency of allele 2 (unused at the moment)
  • id1 ... idn: each entry is the dosage (count/probabilities of allele 2 across the chromosomes)
[...]
But I don't think converting VCF data to the text dosage format is even necessary.
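
If the conversion does turn out to be needed, the column layout quoted above could be produced roughly like this. Treat it as a sketch under several assumptions: the input is the OP's table with CHROM, POS, REF, ALT followed by one dosage per sample; the dosages count the ALT allele, so ALT is taken as allele2; and the MAF column (documented as unused) is filled with mean dosage divided by 2, which is my own guess at a sensible value. to_text_dosage is a made-up name.

```shell
# Sketch only: to_text_dosage is a hypothetical helper; reads stdin, writes TSV to stdout.
# Output columns: chromosome variant_id position allele1 allele2 MAF id1 ... idn
to_text_dosage() {
    awk '
    /^#/ { next }                 # the dosage format has no header line
    NF > 4 {
        n = NF - 4                # number of samples
        sum = 0
        for (i = 5; i <= NF; i++) sum += $i
        maf = sum / (2 * n)       # assumption: frequency of allele2 = mean dosage / 2
        printf "%s\t%s_%s_%s_%s_b37\t%s\t%s\t%s\t%s", $1, $1, $2, $3, $4, $2, $3, $4, maf
        for (i = 5; i <= NF; i++) printf "\t%s", $i   # dosages are passed through as-is
        printf "\n"
    }'
}

printf '#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT\t[5]HG00096_HG00096\n22\t16051249\tT\tC\t0.0\t1.0\t1.0\t0.0\n' |
    to_text_dosage
```

Since the format is described as gzipped, the result would then be piped through gzip, e.g. to_text_dosage < chr22_ann.vcf | gzip > chr22.dosage.txt.gz (the output file name is a placeholder).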

Last edited by shruggy; 01-15-2022 at 11:59 AM.
 
  

