LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   Modify I tabular file (https://www.linuxquestions.org/questions/linux-general-1/modify-i-tabular-file-4175478359/)

flacchy 09-24-2013 10:33 AM

Modify I tabular file
 
1 Attachment(s)
Hi,

I am working with a tabular file and I need to make some changes.
So In my tab file I need to collapse all the raw that have the same value for column x but also the same value for column y. For example:
So in the attachment I highlight the raw that I need to collapse to count as 1.

I was thinking of using awk or sed but I have no idea on how to do it...

Can anyone help??

F.

Firerat 09-24-2013 10:46 AM

what do you mean "collapse"
no column x or y
same as what?
Which highlight? the yellow?



What you probably want is a "pivot table"
what spreadsheet software are you using?

danielbmartin 09-24-2013 10:47 AM

Help us to help you. Provide a sample input file (10-15 lines will do). Not a screen capture, but a real file posted here, bracketed with code tags. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

flacchy 09-24-2013 11:05 AM

1 Attachment(s)
Thank you for the suggestions...

and sorry for the confusion and the image( I had some trouble with the tab.. )

Superkingdom E value q_star query_end h_start h_end positives identical
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 11 62 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 20 81 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%


what I want to do is get rid of the lines that have the same value in column "h_start" but they need to have the same value for the column "h_end" (in the image I highlighted yellow and pink) ...
So my goal is to remove the raw/lines that are identical for both h_start and h_end...

Is it better???

I apologize for the confusion,hope made more sense...

colucix 09-24-2013 11:18 AM

Code:

awk -F"\t" '!_[$5]++ && !__[$6]++' file
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?

flacchy 09-24-2013 11:19 AM

so ...Firerat...

"wthat do you mean "collapse"": I want to remove them and keep only the first

"same as what?" : if they have the same value for the column hit start but also for the column hit end

"Which highlight? the yellow?" : I did highlight in yellow and pink only because they were consecutive but I want to keep only the first raw for yellow and pink (you can see these reads have the same value for column hit start but also for the column hit end)

"What you probably want is a "pivot table"": I wold like to do it on linux because the file is huge and a pivot table for the hit start and hit end will be too big

"what spreadsheet software are you using?" I was thinking about modifying the file using something like awk and then open with any program ...


F.

flacchy 09-24-2013 11:22 AM

Colucix:

seems what I want !! I only need to test it!

The other values now are not important because I already filter for them ( so these are the good hits I need) now I am getting more strict

so these are the parameters?
'!_[$5]++ && !__[$6]++'

where [$5} is the column ?

colucix 09-24-2013 11:28 AM

Well, I tested my own code more carefully and it has some problems, in the sense that if the value in the fifth ($5) or sixth ($6) column already appeared, even if the other number and the resulting pair is different, the row is not printed out. Here is a straightforward solution, using two indexes arrays:
Code:

awk -F"\t" '!_[$5,$6]++' file
This simply uses the concept of true and false in awk. The exclamation point is the negation and _ is simply the name of an array.

flacchy 09-24-2013 11:31 AM

Quote:

Originally Posted by colucix (Post 5033948)
Code:

awk -F"\t" '!_[$5]++ && !__[$6]++' file
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?


it did work!!! Thank you so so much

colucix 09-24-2013 11:40 AM

Here is what I get from your sample file:
Code:

$ awk -F"\t" '!_[$5,$6]++' file
Superkingdom    E value  q_star query_end h_start h_end positives identical
Bacteria        1.00E-21        2      78      290    366    93.50%  93.50%
Bacteria        1.00E-09        1      49      1377    1425    93.90%  93.90%
Bacteria        1.00E-11        13      64      3857    3908    94.20%  94.20%
Bacteria        3.00E-11        1      51      3858    3908    94.10%  94.10%
Bacteria        2.00E-17        8      69      4821    4882    95.20%  95.20%
Bacteria        8.00E-17        1      61      4821    4881    95.10%  95.10%
Bacteria        3.00E-16        1      60      4821    4880    95.00%  95.00%

Bacteria        1.00E-09        2      45      5516    5559    95.50%  95.50%
Bacteria        2.00E-08        1      38      5516    5553    97.40%  97.40%
Bacteria        1.00E-09        5      49      5584    5628    95.60%  95.60%
$ awk -F"\t" '!_[$5]++ && !__[$6]++' file
Superkingdom    E value  q_star query_end h_start h_end positives identical
Bacteria        1.00E-21        2      78      290    366    93.50%  93.50%
Bacteria        1.00E-09        1      49      1377    1425    93.90%  93.90%
Bacteria        1.00E-11        13      64      3857    3908    94.20%  94.20%
Bacteria        2.00E-17        8      69      4821    4882    95.20%  95.20%
Bacteria        1.00E-09        2      45      5516    5559    95.50%  95.50%
Bacteria        1.00E-09        5      49      5584    5628    95.60%  95.60%

The lines highlighted in blue are removed from the output of the first suggested awk command, but the pair h_start/h_end are different from the pair of the line immediately above. I'm still confused about your exact requirement.

danielbmartin 09-24-2013 12:00 PM

With this InFile ...
Code:

Bacteria        1.00E-21        2        78        290        366        93.50%        93.50%
Bacteria        1.00E-09        1        49        1377        1425        93.90%        93.90%
Bacteria        1.00E-09        1        49        1377        1425        93.90%        93.90%
Bacteria        1.00E-11        13        64        3857        3908        94.20%        94.20%
Bacteria        1.00E-11        13        64        3857        3908        94.20%        94.20%
Bacteria        1.00E-11        11        62        3857        3908        94.20%        94.20%
Bacteria        3.00E-11        1        51        3858        3908        94.10%        94.10%
Bacteria        2.00E-17        8        69        4821        4882        95.20%        95.20%
Bacteria        2.00E-17        18        79        4821        4882        95.20%        95.20%
Bacteria        2.00E-17        20        81        4821        4882        95.20%        95.20%
Bacteria        2.00E-17        18        79        4821        4882        95.20%        95.20%
Bacteria        2.00E-17        37        98        4821        4882        95.20%        95.20%
Bacteria        8.00E-17        1        61        4821        4881        95.10%        95.10%
Bacteria        3.00E-16        1        60        4821        4880        95.00%        95.00%
Bacteria        1.00E-09        2        45        5516        5559        95.50%        95.50%
Bacteria        2.00E-08        1        38        5516        5553        97.40%        97.40%
Bacteria        1.00E-09        5        49        5584        5628        95.60%        95.60%

... this awk ...
Code:

awk 'p!=p=$5,$6' $InFile >$OutFile
... produced this OutFile ...
Code:

Bacteria        1.00E-21        2        78        290        366        93.50%        93.50%
Bacteria        1.00E-09        1        49        1377        1425        93.90%        93.90%
Bacteria        1.00E-11        13        64        3857        3908        94.20%        94.20%
Bacteria        3.00E-11        1        51        3858        3908        94.10%        94.10%
Bacteria        2.00E-17        8        69        4821        4882        95.20%        95.20%
Bacteria        1.00E-09        2        45        5516        5559        95.50%        95.50%
Bacteria        1.00E-09        5        49        5584        5628        95.60%        95.60%

Daniel B. Martin

Firerat 09-24-2013 12:27 PM

the posted awks do what you ask

but I would question what you asked...

consider these lines
Code:

Bacteria        2.00E-17        37        98        4821        4882        95.20%        95.20%
Bacteria        8.00E-17        1        61        4821        4881        95.10%        95.10%

they appear to be unique data sets, yet they are both removed.

You have to give very good reasons for deleting data

but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern

flacchy 09-24-2013 12:32 PM

The first one work better
because I want to remove when they have the same start and same end but sometimes it is shift by couples of numbers.

I could still use the second command and then look manually when they are in close range ...

thank you for both

flacchy 09-24-2013 12:36 PM

Quote:

Originally Posted by Firerat (Post 5033979)
the posted awks do what you ask

but I would question what you asked...

consider these lines
Code:

Bacteria        2.00E-17        37        98        4821        4882        95.20%        95.20%
Bacteria        8.00E-17        1        61        4821        4881        95.10%        95.10%

they appear to be unique data sets, yet they are both removed.

You have to give very good reasons for deleting data

but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern

They will represent teh same hit so I have to count them as one not as two, as is the same gene ...

flacchy 09-25-2013 04:40 AM

What if I want to add another field like:

Superkingdom Species E value q_star query_end h_start h_end positives identical
Bacteria HU 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria HU 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria HU 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria Ver 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria Ver 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria Ver 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria Ver 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria HIL 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria HIL 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria HIL 1.00E-09 5 49 5584 5628 95.60% 95.60%

How can I modify the formula
Code:

awk -F"\t" '!_[$5]++ && !__[$6]++' file
to remove only if they have also the same value for column species? should it be like?
Code:

awk -F"\t" '!_[$2]++ && !__[$6]++ && !__[$7]++' file


All times are GMT -5. The time now is 07:04 PM.