LinuxQuestions.org
Old 09-24-2013, 11:33 AM   #1
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Rep: Reputation: Disabled
Modify a tabular file


Hi,

I am working with a tabular file and I need to make some changes.
In my tab file I need to collapse all the rows that have the same value for column x and also the same value for column y. For example:
in the attachment I highlight the rows that I need to collapse so that they count as 1.

I was thinking of using awk or sed but I have no idea on how to do it...

Can anyone help??

F.
Attached Images
File Type: jpg example.jpg (215.6 KB, 17 views)
 
Old 09-24-2013, 11:46 AM   #2
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian Jessie / sid
Posts: 1,471

Rep: Reputation: 444
What do you mean by "collapse"?
There is no column x or y; the same as what?
Which highlight? The yellow?



What you probably want is a "pivot table"
what spreadsheet software are you using?
 
Old 09-24-2013, 11:47 AM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,164

Rep: Reputation: 305
Help us to help you. Provide a sample input file (10-15 lines will do). Not a screen capture, but a real file posted here, bracketed with code tags. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin
 
Old 09-24-2013, 12:05 PM   #4
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
Thank you for the suggestions...

and sorry for the confusion and the image (I had some trouble with the tabs...)

Superkingdom E value q_star query_end h_start h_end positives identical
Bacteria 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria 1.00E-11 11 62 3857 3908 94.20% 94.20%
Bacteria 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 20 81 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 18 79 4821 4882 95.20% 95.20%
Bacteria 2.00E-17 37 98 4821 4882 95.20% 95.20%
Bacteria 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria 1.00E-09 5 49 5584 5628 95.60% 95.60%


What I want to do is get rid of the lines that have the same value in column "h_start" and also the same value in column "h_end" (in the image I highlighted them in yellow and pink)...
So my goal is to remove the rows/lines that are identical for both h_start and h_end...

Is that better???

I apologize for the confusion, hope this makes more sense...
Attached Files
File Type: txt try.txt (902 Bytes, 11 views)

Last edited by flacchy; 09-24-2013 at 12:10 PM.
 
Old 09-24-2013, 12:18 PM   #5
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
Code:
awk -F"\t" '!_[$5]++ && !__[$6]++' file
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?
 
1 member found this post helpful.
Old 09-24-2013, 12:19 PM   #6
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
so ...Firerat...

"wthat do you mean "collapse"": I want to remove them and keep only the first

"same as what?" : if they have the same value for the column hit start but also for the column hit end

"Which highlight? the yellow?" : I did highlight in yellow and pink only because they were consecutive but I want to keep only the first raw for yellow and pink (you can see these reads have the same value for column hit start but also for the column hit end)

"What you probably want is a "pivot table"": I wold like to do it on linux because the file is huge and a pivot table for the hit start and hit end will be too big

"what spreadsheet software are you using?" I was thinking about modifying the file using something like awk and then open with any program ...


F.
 
Old 09-24-2013, 12:22 PM   #7
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
Colucix:

seems to be what I want!! I only need to test it!

The other values are not important now, because I already filtered on them (so these are the good hits I need); now I am getting more strict.

so these are the parameters?
'!_[$5]++ && !__[$6]++'

where [$5] is the column?
 
Old 09-24-2013, 12:28 PM   #8
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
Well, I tested my own code more carefully and it has a problem: if the value in the fifth ($5) or sixth ($6) column has already appeared on any earlier row, the row is not printed, even when the other number (and therefore the resulting pair) is different. Here is a more robust solution, using a single array indexed by the pair of values:
Code:
awk -F"\t" '!_[$5,$6]++' file
This simply uses the concept of true and false in awk. The exclamation point is the negation and _ is simply the name of an array.
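As a quick sanity check of the pair-keyed idiom, here is a minimal sketch on a made-up three-row sample (the file name and values are invented for illustration only):

```shell
# Build a small tab-separated sample: columns 5 and 6 play the role of
# h_start / h_end. Rows "a" and "b" share the same (100, 200) pair.
printf 'a\t1\t2\t3\t100\t200\nb\t1\t2\t3\t100\t200\nc\t1\t2\t3\t100\t201\n' > sample.tsv

# !_[$5,$6]++ is true only the first time a (h_start, h_end) pair is seen,
# so only the first row of each duplicate pair survives ("a" and "c" here).
awk -F"\t" '!_[$5,$6]++' sample.tsv
```

The comma in `_[$5,$6]` joins the two fields with awk's built-in `SUBSEP` separator, so a row is suppressed only when the exact pair has been seen before, not when either field alone repeats.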

Last edited by colucix; 09-24-2013 at 12:30 PM.
 
1 member found this post helpful.
Old 09-24-2013, 12:31 PM   #9
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by colucix View Post
Code:
awk -F"\t" '!_[$5]++ && !__[$6]++' file
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?

it did work!!! Thank you so so much
 
Old 09-24-2013, 12:40 PM   #10
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1957
Here is what I get from your sample file:
Code:
$ awk -F"\t" '!_[$5,$6]++' file
Superkingdom    E value  q_star query_end h_start h_end positives identical
Bacteria        1.00E-21        2       78      290     366     93.50%  93.50%
Bacteria        1.00E-09        1       49      1377    1425    93.90%  93.90%
Bacteria        1.00E-11        13      64      3857    3908    94.20%  94.20%
Bacteria        3.00E-11        1       51      3858    3908    94.10%  94.10%
Bacteria        2.00E-17        8       69      4821    4882    95.20%  95.20%
Bacteria        8.00E-17        1       61      4821    4881    95.10%  95.10%
Bacteria        3.00E-16        1       60      4821    4880    95.00%  95.00%
Bacteria        1.00E-09        2       45      5516    5559    95.50%  95.50%
Bacteria        2.00E-08        1       38      5516    5553    97.40%  97.40%
Bacteria        1.00E-09        5       49      5584    5628    95.60%  95.60%
$ awk -F"\t" '!_[$5]++ && !__[$6]++' file
Superkingdom    E value  q_star query_end h_start h_end positives identical
Bacteria        1.00E-21        2       78      290     366     93.50%  93.50%
Bacteria        1.00E-09        1       49      1377    1425    93.90%  93.90%
Bacteria        1.00E-11        13      64      3857    3908    94.20%  94.20%
Bacteria        2.00E-17        8       69      4821    4882    95.20%  95.20%
Bacteria        1.00E-09        2       45      5516    5559    95.50%  95.50%
Bacteria        1.00E-09        5       49      5584    5628    95.60%  95.60%
The lines removed by the first suggested awk command ('!_[$5]++ && !__[$6]++') but kept by the second have h_start/h_end pairs that differ from the pair of the line immediately above, yet they are dropped because one of the two values repeats on its own. I'm still confused about your exact requirement.
 
Old 09-24-2013, 01:00 PM   #11
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 1,164

Rep: Reputation: 305
With this InFile ...
Code:
Bacteria	1.00E-21	2	78	290	366	93.50%	93.50%
Bacteria	1.00E-09	1	49	1377	1425	93.90%	93.90%
Bacteria	1.00E-09	1	49	1377	1425	93.90%	93.90%
Bacteria	1.00E-11	13	64	3857	3908	94.20%	94.20%
Bacteria	1.00E-11	13	64	3857	3908	94.20%	94.20%
Bacteria	1.00E-11	11	62	3857	3908	94.20%	94.20%
Bacteria	3.00E-11	1	51	3858	3908	94.10%	94.10%
Bacteria	2.00E-17	8	69	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	18	79	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	20	81	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	18	79	4821	4882	95.20%	95.20%
Bacteria	2.00E-17	37	98	4821	4882	95.20%	95.20%
Bacteria	8.00E-17	1	61	4821	4881	95.10%	95.10%
Bacteria	3.00E-16	1	60	4821	4880	95.00%	95.00%
Bacteria	1.00E-09	2	45	5516	5559	95.50%	95.50%
Bacteria	2.00E-08	1	38	5516	5553	97.40%	97.40%
Bacteria	1.00E-09	5	49	5584	5628	95.60%	95.60%
... this awk ...
Code:
awk 'p!=p=$5,$6' $InFile >$OutFile
... produced this OutFile ...
Code:
Bacteria	1.00E-21	2	78	290	366	93.50%	93.50%
Bacteria	1.00E-09	1	49	1377	1425	93.90%	93.90%
Bacteria	1.00E-11	13	64	3857	3908	94.20%	94.20%
Bacteria	3.00E-11	1	51	3858	3908	94.10%	94.10%
Bacteria	2.00E-17	8	69	4821	4882	95.20%	95.20%
Bacteria	1.00E-09	2	45	5516	5559	95.50%	95.50%
Bacteria	1.00E-09	5	49	5584	5628	95.60%	95.60%
Daniel B. Martin
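A note on the one-liner above: `p!=p=$5,$6` leans on awk's range-pattern parsing, and as far as I can tell it effectively keeps a row only when column 5 differs from the previous row (the `,$6` turns it into a one-line range), which matches the OutFile shown. A sketch that states the "keep the first row of each consecutive run with the same ($5, $6) pair" intent explicitly, using an invented sample file and the illustrative variable name `prev`, might be:

```shell
# Sample where rows r1/r2 share a (col5, col6) pair, r3 breaks the run,
# and r4 repeats r1's pair non-consecutively.
printf 'r1\t.\t.\t.\t100\t200\nr2\t.\t.\t.\t100\t200\nr3\t.\t.\t.\t100\t201\nr4\t.\t.\t.\t100\t200\n' > infile.tsv

# Print a row only when its (col5, col6) pair differs from the previous
# row's pair; prev starts empty, so the first row always prints.
awk -F"\t" '($5 SUBSEP $6) != prev { print } { prev = $5 SUBSEP $6 }' infile.tsv
```

Unlike the `!_[$5,$6]++` version, this only collapses *consecutive* duplicates: r4 is kept even though its pair matches r1.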
 
Old 09-24-2013, 01:27 PM   #12
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian Jessie / sid
Posts: 1,471

Rep: Reputation: 444
the posted awks do what you ask

but I would question what you asked...

consider these lines
Code:
Bacteria	2.00E-17	37	98	4821	4882	95.20%	95.20%
Bacteria	8.00E-17	1	61	4821	4881	95.10%	95.10%
they appear to be unique data sets, yet they are both removed.

You need very good reasons for deleting data.

But at the end of the day, I don't know where the data came from or what it will be used for... so I guess it's not my concern.
 
Old 09-24-2013, 01:32 PM   #13
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
The first one works better, because I want to remove rows that have the same start and same end, but sometimes they are shifted by a couple of positions.

I could still use the second command and then look manually for the ones that are in close range...

Thank you for both.
 
Old 09-24-2013, 01:36 PM   #14
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Firerat View Post
the posted awks do what you ask

but I would question what you asked...

consider these lines
Code:
Bacteria	2.00E-17	37	98	4821	4882	95.20%	95.20%
Bacteria	8.00E-17	1	61	4821	4881	95.10%	95.10%
they appear to be unique data sets, yet they are both removed.

You have to give very good reasons for deleting data

but at the end of the day, I don't know where the data came from, or what it will be used for.. so I guess not my concern
They represent the same hit, so I have to count them as one, not two, since it is the same gene...
 
Old 09-25-2013, 05:40 AM   #15
flacchy
LQ Newbie
 
Registered: May 2013
Posts: 17

Original Poster
Rep: Reputation: Disabled
What if I want to add another field like:

Superkingdom Species E value q_star query_end h_start h_end positives identical
Bacteria HU 1.00E-21 2 78 290 366 93.50% 93.50%
Bacteria HU 1.00E-09 1 49 1377 1425 93.90% 93.90%
Bacteria HU 1.00E-11 13 64 3857 3908 94.20% 94.20%
Bacteria Ver 3.00E-11 1 51 3858 3908 94.10% 94.10%
Bacteria Ver 2.00E-17 8 69 4821 4882 95.20% 95.20%
Bacteria Ver 8.00E-17 1 61 4821 4881 95.10% 95.10%
Bacteria Ver 3.00E-16 1 60 4821 4880 95.00% 95.00%
Bacteria HIL 1.00E-09 2 45 5516 5559 95.50% 95.50%
Bacteria HIL 2.00E-08 1 38 5516 5553 97.40% 97.40%
Bacteria HIL 1.00E-09 5 49 5584 5628 95.60% 95.60%

How can I modify the command
Code:
awk -F"\t" '!_[$5]++ && !__[$6]++' file
to remove rows only if they also have the same value in the Species column? Should it be something like this?
Code:
awk -F"\t" '!_[$2]++ && !__[$6]++ && !__[$7]++' file
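The thread ends without an answer here, but following colucix's compound-subscript idiom, one plausible sketch (untested against the real data; it assumes Species is the new second field, pushing h_start/h_end to $6/$7) would key a single array on all three fields rather than using three separate arrays:

```shell
# Invented sample: Superkingdom, Species, E value, q_star, query_end,
# h_start, h_end, positives. Rows 1 and 2 are duplicates within species
# "HU"; row 3 has the same coordinates but a different species.
printf 'Bacteria\tHU\t1.00E-09\t1\t49\t1377\t1425\t93.90%%\nBacteria\tHU\t1.00E-09\t1\t49\t1377\t1425\t93.90%%\nBacteria\tVer\t1.00E-09\t1\t49\t1377\t1425\t93.90%%\n' > with_species.tsv

# Keep a row only the first time the (Species, h_start, h_end) triple is
# seen: within-species duplicates go, the "Ver" row stays.
awk -F"\t" '!_[$2,$6,$7]++' with_species.tsv
```

The guessed version with `&&` and separate arrays would again suppress rows whenever any single field repeats on its own, which is the same over-filtering problem colucix corrected earlier in the thread.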
 
  

