Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place. |
Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
|
 |
|
09-24-2013, 10:33 AM
|
#1
|
LQ Newbie
Registered: May 2013
Posts: 17
Rep: 
|
Modify I tabular file
Hi,
I am working with a tabular file and I need to make some changes.
So In my tab file I need to collapse all the raw that have the same value for column x but also the same value for column y. For example:
So in the attachment I highlight the raw that I need to collapse to count as 1.
I was thinking of using awk or sed but I have no idea on how to do it...
Can anyone help??
F.
|
|
|
09-24-2013, 10:46 AM
|
#2
|
Senior Member
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683
|
what do you mean "collapse"
no column x or y
same as what?
Which highlight? the yellow?
What you probably want is a "pivot table"
what spreadsheet software are you using?
|
|
|
09-24-2013, 10:47 AM
|
#3
|
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,895
|
Help us to help you. Provide a sample input file (10-15 lines will do). Not a screen capture, but a real file posted here, bracketed with code tags. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.
Daniel B. Martin
|
|
|
09-24-2013, 11:05 AM
|
#4
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
Thank you for the suggestions...
and sorry for the confusion and the image( I had some trouble with the tab.. )
Superkingdom E value q_star query_end h_start h_end positives identical
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 11 62 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 20 81 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%
what I want to do is get rid of the lines that have the same value in column "h_start" but they need to have the same value for the column "h_end" (in the image I highlighted yellow and pink) ...
So my goal is to remove the raw/lines that are identical for both h_start and h_end...
Is it better???
I apologize for the confusion,hope made more sense...
Last edited by flacchy; 09-24-2013 at 11:10 AM.
|
|
|
09-24-2013, 11:18 AM
|
#5
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Code:
awk -F"\t" '!_[$5]++ && !__[$6]++' file
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?
|
|
1 members found this post helpful.
|
09-24-2013, 11:19 AM
|
#6
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
so ...Firerat...
"wthat do you mean "collapse"": I want to remove them and keep only the first
"same as what?" : if they have the same value for the column hit start but also for the column hit end
"Which highlight? the yellow?" : I did highlight in yellow and pink only because they were consecutive but I want to keep only the first raw for yellow and pink (you can see these reads have the same value for column hit start but also for the column hit end)
"What you probably want is a "pivot table"": I wold like to do it on linux because the file is huge and a pivot table for the hit start and hit end will be too big
"what spreadsheet software are you using?" I was thinking about modifying the file using something like awk and then open with any program ...
F.
|
|
|
09-24-2013, 11:22 AM
|
#7
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
Colucix:
seems what I want !! I only need to test it!
The other values now are not important because I already filter for them ( so these are the good hits I need) now I am getting more strict
so these are the parameters?
'!_[$5]++ && !__[$6]++'
where [$5} is the column ?
|
|
|
09-24-2013, 11:28 AM
|
#8
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Well, I tested my own code more carefully and it has some problems, in the sense that if the value in the fifth ($5) or sixth ($6) column already appeared, even if the other number and the resulting pair is different, the row is not printed out. Here is a straightforward solution, using two indexes arrays:
Code:
awk -F"\t" '!_[$5,$6]++' file
This simply uses the concept of true and false in awk. The exclamation point is the negation and _ is simply the name of an array.
Last edited by colucix; 09-24-2013 at 11:30 AM.
|
|
1 members found this post helpful.
|
09-24-2013, 11:31 AM
|
#9
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
Quote:
Originally Posted by colucix
Code:
awk -F"\t" '!_[$5]++ && !__[$6]++' file
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?
|
it did work!!! Thank you so so much
|
|
|
09-24-2013, 11:40 AM
|
#10
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Here is what I get from your sample file:
Code:
$ awk -F"\t" '!_[$5,$6]++' file
Superkingdom E value q_star query_end h_start h_end positives identical
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%
$ awk -F"\t" '!_[$5]++ && !__[$6]++' file
Superkingdom E value q_star query_end h_start h_end positives identical
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%
The lines highlighted in blue are removed from the output of the first suggested awk command, but the pair h_start/h_end are different from the pair of the line immediately above. I'm still confused about your exact requirement.
|
|
|
09-24-2013, 12:00 PM
|
#11
|
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,895
|
With this InFile ...
Code:
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 11 62 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 20 81 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%
... this awk ...
Code:
awk 'p!=p=$5,$6' $InFile >$OutFile
... produced this OutFile ...
Code:
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%
Daniel B. Martin
|
|
|
09-24-2013, 12:27 PM
|
#12
|
Senior Member
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683
|
the posted awks do what you ask
but I would question what you asked...
consider these lines
Code:
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
they appear to be unique data sets, yet they are both removed.
You have to give very good reasons for deleting data
but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern
|
|
|
09-24-2013, 12:32 PM
|
#13
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
The first one work better
because I want to remove when they have the same start and same end but sometimes it is shift by couples of numbers.
I could still use the second command and then look manually when they are in close range ...
thank you for both
|
|
|
09-24-2013, 12:36 PM
|
#14
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
Quote:
Originally Posted by Firerat
the posted awks do what you ask
but I would question what you asked...
consider these lines
Code:
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
they appear to be unique data sets, yet they are both removed.
You have to give very good reasons for deleting data
but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern
|
They will represent teh same hit so I have to count them as one not as two, as is the same gene ...
|
|
|
09-25-2013, 04:40 AM
|
#15
|
LQ Newbie
Registered: May 2013
Posts: 17
Original Poster
Rep: 
|
What if I want to add another field like:
Superkingdom Species E value q_star query_end h_start h_end positives identical
Bacteria HU 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria HU 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria HU 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria Ver 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria Ver 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria Ver 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria Ver 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria HIL 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria HIL 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria HIL 1.00E-09 5 49 5584 5628 95.60% 95.60%
How can I modify the formula
Code:
awk -F"\t" '!_[$5]++ && !__[$6]++' file
to remove only if they have also the same value for column species? should it be like?
Code:
awk -F"\t" '!_[$2]++ && !__[$6]++ && !__[$7]++' file
|
|
|
All times are GMT -5. The time now is 08:12 AM.
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|