Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place! |
05-18-2012, 08:06 AM
|
#1
|
LQ Newbie
Registered: May 2012
Posts: 24
Rep:
|
printing columns that has specific delimiter
Hi all,
My file looks like this.
consensus_3 . GT:PL:DP:SP:GQ 0/0:0,255,106:119:0:99
consensus_6 . GT:PL:DP:SP:GQ 0/0:0,51,39:114:0:50
What I want to do is print the 1st and 3rd sub-fields of every field that uses : as a delimiter. For example, I want the output to look like
GT DP 0/0 119
GT DP 0/0 114
What I did was try to loop over the columns and print the 1st and 3rd sub-fields of the fields that use : as a delimiter.
The following is the code I tried for that.
awk '{for(x=1;x<NF;x++);split($x,a,":");print a[1],a[3]}'
But it is not working as I expected. Instead, it prints just the 1st and 3rd sub-fields from the last field.
0/0 119
0/0 114.
How can I print the specified sub-fields from all the fields that use : as a delimiter, and not just from one field?
Any thoughts would be appreciated.
Thanks
Last edited by jv61; 05-18-2012 at 08:09 AM.
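A note on why the one-liner above prints only the last field: the ";" directly after for(...) ends the loop with an empty body, so split() and print run once, after the loop, when x has already reached NF. The condition x<NF would also skip the last field once the loop body is attached. The following is a minimal sketch of a corrected loop, not the poster's final solution; the function name print_subfields is hypothetical and only there to make the example easy to run.

```shell
# Hypothetical helper: reads rows from stdin and prints sub-fields 1 and 3
# of every whitespace-separated field that contains a colon.
print_subfields() {
    awk '{
        out = ""
        for (x = 1; x <= NF; x++) {
            # split() returns the number of pieces; more than 1 means
            # the field contained at least one colon.
            if (split($x, a, ":") > 1) {
                sep = (out == "" ? "" : " ")
                out = out sep a[1] " " a[3]
            }
        }
        print out
    }'
}

printf '%s\n' 'consensus_3 . GT:PL:DP:SP:GQ 0/0:0,255,106:119:0:99' | print_subfields
```

For the sample row this prints "GT DP 0/0 119". Building the whole line in a variable and printing it once avoids having to track where separators and the final newline go.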
|
|
|
05-18-2012, 08:24 AM
|
#2
|
Senior Member
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 3,541
|
You need to use the FS (field separator).
Code:
BEGIN {
    FS=":"
}
{
    printf ("%s %s\n", $1, $2);
}
Hope this helps some.
|
|
|
05-18-2012, 11:11 AM
|
#3
|
LQ Newbie
Registered: May 2012
Posts: 24
Original Poster
Rep:
|
Many thanks for the reply. I tried it with the field separator, but it still prints only from the first column that has : as a delimiter.
For example this is my file:
Chrom Sample1 Sample2 Sample3
1 AD:DP:GL:CG 1/1:119:23 0/1:110:22
2 AD:DP:GL:GC 0/1:120:24 1/1:100:80
I would like to print something like this
Chrom Sample1 Sample2
1 AD GL 1/1 23
2 AD GL 0/1 24
When I try this command
BEGIN {
    FS=":"
}
{
    printf ("%s %s\n", $1, $3);
}
it prints only from Sample1 column like given below.
AD GL
AD GL
But I would like to print it from all the sample columns.
Any thoughts of how to do that, thank you.
|
|
|
05-19-2012, 08:09 AM
|
#4
|
Senior Member
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 3,541
|
Here's a hint. You've got a file with two different delimiters in it, spaces and colons:
Code:
Chrom Sample1 Sample2 Sample3
1 AD:DP:GL:CG 1/1:119:23 0/1:110:22
2 AD:DP:GL:GC 0/1:120:24 1/1:100:80
Run it through sed, changing the spaces to colons (requiring the use of FS) or changing the colons to spaces (the default delimiter in AWK):
Code:
sed 's/:/ /g' chrom
Chrom Sample1 Sample2 Sample3
1 AD DP GL CG 1/1 119 23 0/1 110 22
2 AD DP GL GC 0/1 120 24 1/1 100 80
or
Code:
sed 's/ /:/g' chrom
Chrom:Sample1:Sample2:Sample3
1:AD:DP:GL:CG:1/1:119:23:0/1:110:22
2:AD:DP:GL:GC:0/1:120:24:1/1:100:80
Thus you'll have something that AWK can deal with easily.
Hope this helps some.
Last edited by tronayne; 05-19-2012 at 08:10 AM.
|
|
|
05-19-2012, 07:40 PM
|
#5
|
LQ Guru
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,326
|
another quick-and-dirty hack would be to use two awks:
Code:
[schneidz@hyper ~]$ awk -F : '{print $1 " " $3 " " $5 " " $7}' jv61.lst | awk '{print $3 " " $4 " " $6 " " $7}'
GT DP 0/0 119
GT DP 0/0 114
|
|
|
05-20-2012, 03:48 PM
|
#6
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
***Please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.***
You need a check to ensure that you are only printing fields that have delimiters. You also need to use printf to keep all the output on a single line. This makes the final line formatting a bit trickier however.
Here's the version I came up with, formatted as a stand-alone script:
Code:
#!/usr/bin/awk -f
{
    for ( x=1 ; x<=NF ; x++ )
    {
        if ( $x ~ /[:]/ )
        {
            split( $x , a , ":" )
            printf( "%s %s" , a[1] , a[3] )
            if ( x != NF )
            {
                printf( " " )
            }
        }
    }
    print ""
}
I don't doubt there are cleaner ways to go about it, though.
Edit: Here's a revised version as mentioned in my follow-up post. To target an arbitrary number of fields, just replace NF with the maximum number you want (3, in this case). You can also print the first line by testing the NR variable. Be sure to set the number of "%s" entries in printf to match the field count.
I'm thinking that the if ":" test may be superfluous in this case, and perhaps even detrimental, if you're going to limit the fields printed by count. But in the absence of further clarification I left it in.
Finally, I took the liberty of changing the format of the output to use tabs instead of spaces, assuming you want something human-readable. Just go back and replace all the "\t"s with spaces if you don't desire that behavior.
Code:
#!/usr/bin/awk -f
{
    if ( NR == 1 )
    {
        printf( "%s\t%s\t%s\n" ,$1,$2,$3 )
        next
    }
    printf( "%s\t", $1 )
    for ( x=1 ; x<=3 ; x++ )
    {
        if ( $x ~ /[:]/ )
        {
            split( $x , a , ":" )
            printf( "%s %s" , a[1] , a[3] )
            if ( x != 3 )
            {
                printf( "\t" )
            }
        }
    }
    print ""
}
Last edited by David the H.; 05-22-2012 at 10:31 AM.
Reason: 1) minor rewording 2) forgot the -f on the shebang 3) as posted
|
|
1 members found this post helpful.
|
05-20-2012, 05:54 PM
|
#7
|
Senior Member
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 3,541
|
If your input file, chrom looks like this:
Code:
Chrom Sample1 Sample2 Sample3
1 AD:DP:GL:CG 1/1:119:23 0/1:110:22
2 AD:DP:GL:GC 0/1:120:24 1/1:100:80
and your AWK program, chrom.awk, looks like this:
Code:
# chrom.awk: expects the colons to have been converted to spaces by sed
NR == 1 {
    # header line: print the first three column names
    printf ("%s %s %s\n", $1, $2, $3);
    next;
}
{
    printf ("%s %s %s %s %s\n", $1, $2, $4, $6, $8);
}
then
Code:
sed 's/:/ /g' chrom | awk -f chrom.awk
Chrom Sample1 Sample2
1 AD GL 1/1 23
2 AD GL 0/1 24
You could use tabs to space things a little better, but what the heck.
That about what you wanted?
Hope this helps some.
|
|
1 members found this post helpful.
|
05-22-2012, 10:06 AM
|
#8
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
I must say, all of the supplied suggestions so far that use sed or a second awk command for pre-processing assume that all input is exactly like the example the OP posted. But we can't take it for granted, at this point, that every line has exactly the same number of fields and sub-fields. Maybe they do, but the OP hasn't yet said so.
If you read the OP's actual stated requirements, you'll realize that just trying to replicate the example input>>output may not be enough to satisfy it under all conditions. My solution is so far the only one that correctly applies the criteria as described in the OP (print only sub-fields 1 and 3 of fields that contain colon delimiters), and can handle lines of arbitrary length.
(Although I did miss the part from his second post about printing the header and only a subset of fields. I'm going to go back and edit it to add a modified version after this post).
If the OP will come back and further clarify the structure of the input and the desired output, however, it may be possible that one of the above solutions, or a similar simpler command, would be satisfactory. Indeed, if the data structure is absolutely fixed, then you can simply use space OR colon as the delimiter and print only exactly the fields you want:
Code:
awk -F '[ :]' 'NR==1 { print $1,$2,$3 } ; NR>1 { print $2,$4,$6,$8 }'
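To see which field numbers the command above relies on, here is a small sketch that labels every field produced by the same -F '[ :]' separator. The function name number_fields is hypothetical; the input row is taken from the sample file earlier in the thread.

```shell
# Hypothetical helper: prints each field of a line together with its
# field number, using "one space OR one colon" as the separator.
number_fields() {
    awk -F '[ :]' '{
        for (i = 1; i <= NF; i++)
            printf "%s$%d=%s", (i > 1 ? " " : ""), i, $i
        print ""
    }'
}

printf '%s\n' '1 AD:DP:GL:CG 1/1:119:23 0/1:110:22' | number_fields
```

For this row the fields come out as $1=1, $2=AD, $3=DP, $4=GL, $5=CG, $6=1/1, $7=119, $8=23 and so on, so $2,$4,$6,$8 gives "AD GL 1/1 23"; adding $1 to the second print would keep the row label as well. Note that with a bracket expression like this every single space or colon separates fields, so two spaces in a row produce an empty field. That is one reason the approach only suits input with an absolutely fixed layout.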
|
|
1 members found this post helpful.
|
05-22-2012, 11:15 AM
|
#9
|
LQ Newbie
Registered: May 2012
Posts: 24
Original Poster
Rep:
|
Hi all,
Many thanks for all your replies & helpful hints. Sorry for my delayed response.
The following code from David comes close to answering my question.
Code:
#!/usr/bin/awk -f
{
    for ( x=1 ; x<=NF ; x++ )
    {
        if ( $x ~ /[:]/ )
        {
            split( $x , a , ":" )
            printf( "%s %s" , a[1] , a[3] )
            if ( x != NF )
            {
                printf( " " )
            }
        }
    }
    print ""
}
The problem I have here is, the above prints the data from each column in a separate line. For example the above code prints the results like this.
Code:
GT DP
0/0 119
0/0 92
0/0 109
0/1 22
GT DP
0/0 114
0/0 101
0/0 56
0/0 13
GT DP
1/1 99
1/1 73
0/1 101
0/0 12
But I would like to print like this
Code:
GT DP 0/0 119 0/0 92 0/0 109 0/1 22
GT DP 0/0 114 0/0 101 0/0 56 0/0 13
GT DP 1/1 99 1/1 73 0/1 101 0/0 12
David, in answer to your question about my input file type: my input file has 96 Sample columns that have : as a delimiter, and the other columns don't. The following is a portion of my input file. I have shown just two sample columns here, but I have 96 sample columns with similar data, from Sample 1 to Sample 96.
Code:
#CHROM POS ID FORMAT Sample1 Sample2
consensus_3 67 . GT:PL:DP:SP:GQ 0/0:0,255,106:119:0:99 0/0:0,255,96:92:0:99
consensus_6 48 . GT:PL:DP:SP:GQ 0/0:0,51,39:114:0:50 0/0:0,83,45:101:0:58
consensus_48 20 . GT:PL:DP:SP:GQ 1/1:98,255,0:99:0:99 1/1:90,220,0:73:0:98
consensus_93 48 . GT:PL:DP:SP:GQ 1/1:84,205,0:82:0:87 0/1:45,0,53:63:0:48
What I want to do is print the 1st and 3rd sub-fields from the fields that have : as a delimiter and print the rest of the columns unchanged. My output should look something like this
Code:
#CHROM POS ID FORMAT Sample1 Sample2
consensus_3 67 . GT DP 0/0 119 0/0 92
consensus_6 48 . GT DP 0/0 114 0/0 101
consensus_48 20 . GT DP 1/1 99 1/1 73
consensus_93 48 . GT DP 1/1 82 0/1 63
Thanks
|
|
|
05-22-2012, 12:58 PM
|
#10
|
LQ Newbie
Registered: May 2012
Posts: 24
Original Poster
Rep:
|
Solved
That's awesome, thanks for the revised version David. It solved my problem. This forum & post has been really a learning experience for me. Thank you all again for sharing your ideas & help. Much appreciated :)
Code:
#!/usr/bin/awk -f
{
    if ( NR == 1 )
    {
        printf( "%s\n", $0 )
        next
    }
    printf( "%s %s %s ", $1,$2,$3 )
    for ( x=1 ; x<=NF ; x++ )
    {
        if ( $x ~ /[:]/ )
        {
            split( $x , a , ":" )
            printf( "%s %s " , a[1] , a[3] )
            if ( x != NF )
            {
                printf( " " )
            }
        }
    }
    print " "
}
|
|
|
05-22-2012, 02:27 PM
|
#11
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
Quote:
Originally Posted by jv61
The problem I have here is, the above prints the data from each column in a separate line. For example the above code prints the results like this.
|
That's quite odd. There's no place in the original script where a newline should be inserted except after the whole line has been processed. Unless you modified it to insert one yourself somehow.
But now that I understand your exact requirements, we can simplify things quite a bit.
Code:
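#!/usr/bin/awk -f
BEGIN{ OFS=" " }
{
    for ( x=1 ; x<=NF ; x++ )
    {
        if ( $x ~ /[:]/ )
        {
            # keep only sub-fields 1 and 3 of a colon-delimited field
            split( $x , a , ":" )
            $x=a[1] OFS a[3]
        }
        else $x=$x    # reassign so the line is rebuilt with OFS
    }
    print
}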
Just scan every field on the line, and if it contains a colon, split it and replace it with the modified version. Then print the line.
The addition of the BEGIN block allows you to set whatever output separator you wish between fields. The "else $x=$x" is also there so that otherwise unmodified lines such as the first one also print as separate fields according to OFS, rather than as an unmodified unit. I'm not really sure why that's necessary, to tell the truth, but according to my testing it is.
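On why the reassignment is needed: awk only rebuilds $0 (joining the fields with OFS) when some field is assigned to. A line on which no field was touched is printed exactly as it was read, original separators included. The sketch below shows this with a tab as OFS so the difference is visible; the function name show_rebuild and the "a b c" input are made up for illustration.

```shell
# Hypothetical demo: the first line is printed untouched, the second is
# rebuilt with OFS because a field was assigned.
show_rebuild() {
    awk 'BEGIN { OFS = "\t" }
         NR == 1 { print; next }   # no field assigned: record printed as read
         { $1 = $1; print }        # assignment forces a rebuild using OFS
    '
}

printf 'a b c\na b c\n' | show_rebuild
```

The first output line keeps its spaces, while the second comes out tab-separated. With OFS set to a single space, as in the script above, the visible effect is presumably just that runs of tabs or spaces in otherwise unmodified lines get collapsed to one space.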
|
|
|
05-22-2012, 05:52 PM
|
#12
|
LQ Newbie
Registered: May 2012
Posts: 24
Original Poster
Rep:
|
Quote:
Originally Posted by David the H.
That's quite odd. There's no place in the original script where a newline should be inserted except after the whole line has been processed. Unless you modified it to insert one yourself somehow.
|
Yes, I modified that one. Your simplified version is a smarter way to do it.
Thanks
Last edited by jv61; 05-22-2012 at 05:53 PM.
|
|
|
All times are GMT -5.