Modify I tabular file

flacchy · 09-24-2013, 10:33 AM

Hi,

I am working with a tabular file and I need to make some changes.
So In my tab file I need to collapse all the raw that have the same value for column x but also the same value for column y. For example:
So in the attachment I highlight the raw that I need to collapse to count as 1.

I was thinking of using awk or sed but I have no idea on how to do it...

Can anyone help??

F.

Firerat · 09-24-2013, 10:46 AM

what do you mean "collapse"
no column x or y
same as what?
Which highlight? the yellow?

What you probably want is a "pivot table"
what spreadsheet software are you using?

danielbmartin · 09-24-2013, 10:47 AM

Help us to help you. Provide a sample input file (10-15 lines will do). Not a screen capture, but a real file posted here, bracketed with code tags. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

flacchy · 09-24-2013, 11:05 AM

Thank you for the suggestions...

and sorry for the confusion and the image( I had some trouble with the tab.. )

Superkingdom E value q_star query_end h_start h_end positives identical
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 11 62 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 20 81 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%

what I want to do is get rid of the lines that have the same value in column "h_start" but they need to have the same value for the column "h_end" (in the image I highlighted yellow and pink) ...
So my goal is to remove the raw/lines that are identical for both h_start and h_end...

Is it better???

I apologize for the confusion,hope made more sense...

colucix · 09-24-2013, 11:18 AM

Code:

awk -F"\t" '!_[$5]++ && !__[$6]++' file

This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?

flacchy · 09-24-2013, 11:19 AM

so ...Firerat...

"wthat do you mean "collapse"": I want to remove them and keep only the first

"same as what?" : if they have the same value for the column hit start but also for the column hit end

"Which highlight? the yellow?" : I did highlight in yellow and pink only because they were consecutive but I want to keep only the first raw for yellow and pink (you can see these reads have the same value for column hit start but also for the column hit end)

"What you probably want is a "pivot table"": I wold like to do it on linux because the file is huge and a pivot table for the hit start and hit end will be too big

"what spreadsheet software are you using?" I was thinking about modifying the file using something like awk and then open with any program ...

F.

flacchy · 09-24-2013, 11:22 AM

Colucix:

seems what I want !! I only need to test it!

The other values now are not important because I already filter for them ( so these are the good hits I need) now I am getting more strict

so these are the parameters?
'!_[$5]++ && !__[$6]++'

where [$5} is the column ?

colucix · 09-24-2013, 11:28 AM

Well, I tested my own code more carefully and it has some problems, in the sense that if the value in the fifth ($5) or sixth ($6) column already appeared, even if the other number and the resulting pair is different, the row is not printed out. Here is a straightforward solution, using two indexes arrays:

Code:

awk -F"\t" '!_[$5,$6]++' file

This simply uses the concept of true and false in awk. The exclamation point is the negation and _ is simply the name of an array.

flacchy · 09-24-2013, 11:31 AM

Quote:

Originally Posted by colucix

Code:

awk -F"\t" '!_[$5]++ && !__[$6]++' file

This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?

it did work!!! Thank you so so much

colucix · 09-24-2013, 11:40 AM

Here is what I get from your sample file:

Code:

$ awk -F"\t" '!_[$5,$6]++' file
Superkingdom    E value  q_star query_end h_start h_end positives identical
Bacteria        1.00E-21        2       78      290     366     93.50%  93.50%
Bacteria        1.00E-09        1       49      1377    1425    93.90%  93.90%
Bacteria        1.00E-11        13      64      3857    3908    94.20%  94.20%
Bacteria        3.00E-11        1       51      3858    3908    94.10%  94.10%
Bacteria        2.00E-17        8       69      4821    4882    95.20%  95.20%
Bacteria        8.00E-17        1       61      4821    4881    95.10%  95.10%
Bacteria        3.00E-16        1       60      4821    4880    95.00%  95.00%
Bacteria        1.00E-09        2       45      5516    5559    95.50%  95.50%
Bacteria        2.00E-08        1       38      5516    5553    97.40%  97.40%
Bacteria        1.00E-09        5       49      5584    5628    95.60%  95.60%
$ awk -F"\t" '!_[$5]++ && !__[$6]++' file
Superkingdom    E value  q_star query_end h_start h_end positives identical
Bacteria        1.00E-21        2       78      290     366     93.50%  93.50%
Bacteria        1.00E-09        1       49      1377    1425    93.90%  93.90%
Bacteria        1.00E-11        13      64      3857    3908    94.20%  94.20%
Bacteria        2.00E-17        8       69      4821    4882    95.20%  95.20%
Bacteria        1.00E-09        2       45      5516    5559    95.50%  95.50%
Bacteria        1.00E-09        5       49      5584    5628    95.60%  95.60%

The lines highlighted in blue are removed from the output of the first suggested awk command, but the pair h_start/h_end are different from the pair of the line immediately above. I'm still confused about your exact requirement.

danielbmartin · 09-24-2013, 12:00 PM

With this InFile ...

Code:

Bacteria	1.00E-21	2	78	290	366	93.50%	93.50%
Bacteria	1.00E-09	1	49	1377	1425	93.90%	93.90%
Bacteria	1.00E-09	1	49	1377	1425	93.90%	93.90%
Bacteria	1.00E-11	13	64	3857	3908	94.20%	94.20%
Bacteria	1.00E-11	13	64	3857	3908	94.20%	94.20%
Bacteria	1.00E-11	11	62	3857	3908	94.20%	94.20%
Bacteria	3.00E-11	1	51	3858	3908	94.10%	94.10%
Bacteria	2.00E-17	8	69	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	18	79	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	20	81	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	18	79	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	37	98	4821	4882	95.20%	95.20%
Bacteria	8.00E-17	1	61	4821	4881	95.10%	95.10%
Bacteria	3.00E-16	1	60	4821	4880	95.00%	95.00%
Bacteria	1.00E-09	2	45	5516	5559	95.50%	95.50%
Bacteria	2.00E-08	1	38	5516	5553	97.40%	97.40%
Bacteria	1.00E-09	5	49	5584	5628	95.60%	95.60%

... this awk ...

Code:

awk 'p!=p=$5,$6' $InFile >$OutFile

... produced this OutFile ...

Code:

Bacteria	1.00E-21	2	78	290	366	93.50%	93.50%
Bacteria	1.00E-09	1	49	1377	1425	93.90%	93.90%
Bacteria	1.00E-11	13	64	3857	3908	94.20%	94.20%
Bacteria	3.00E-11	1	51	3858	3908	94.10%	94.10%
Bacteria	2.00E-17	8	69	4821	4882	95.20%	95.20%
Bacteria	1.00E-09	2	45	5516	5559	95.50%	95.50%
Bacteria	1.00E-09	5	49	5584	5628	95.60%	95.60%

Daniel B. Martin

Firerat · 09-24-2013, 12:27 PM

the posted awks do what you ask

but I would question what you asked...

consider these lines

Code:

Bacteria	2.00E-17	37	98	4821	4882	95.20%	95.20%
Bacteria	8.00E-17	1	61	4821	4881	95.10%	95.10%

they appear to be unique data sets, yet they are both removed.

You have to give very good reasons for deleting data

but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern

flacchy · 09-24-2013, 12:32 PM

The first one work better
because I want to remove when they have the same start and same end but sometimes it is shift by couples of numbers.

I could still use the second command and then look manually when they are in close range ...

thank you for both

flacchy · 09-24-2013, 12:36 PM

Quote:

Originally Posted by Firerat

the posted awks do what you ask

but I would question what you asked...

consider these lines

Code:

Bacteria	2.00E-17	37	98	4821	4882	95.20%	95.20%
Bacteria	8.00E-17	1	61	4821	4881	95.10%	95.10%

they appear to be unique data sets, yet they are both removed.

You have to give very good reasons for deleting data

but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern

They will represent teh same hit so I have to count them as one not as two, as is the same gene ...

flacchy · 09-25-2013, 04:40 AM

What if I want to add another field like:

Superkingdom Species E value q_star query_end h_start h_end positives identical
Bacteria HU 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria HU 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria HU 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria Ver 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria Ver 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria Ver 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria Ver 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria HIL 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria HIL 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria HIL 1.00E-09 5 49 5584 5628 95.60% 95.60%

How can I modify the formula

Code:

awk -F"\t" '!_[$5]++ && !__[$6]++' file

to remove only if they have also the same value for column species? should it be like?

Code:

awk -F"\t" '!_[$2]++ && !__[$6]++ && !__[$7]++' file