Linux - General: This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
I am working with a tab-delimited file and I need to make some changes.
In my file I need to collapse all the rows that have the same value in column x and also the same value in column y. For example:
In the attachment I highlighted the rows that I need to collapse so that they count as 1.
I was thinking of using awk or sed but I have no idea how to do it...
Help us to help you. Provide a sample input file (10-15 lines will do). Not a screen capture, but a real file posted here, bracketed with code tags. Construct a sample output file which corresponds to your sample input and post both samples here. With "Before and After" examples we can better understand your needs and also judge if our proposed solution fills those needs.
What I want to do is get rid of the lines that have the same value in column "h_start", but only when they also have the same value in column "h_end" (in the image I highlighted them in yellow and pink)...
So my goal is to remove the rows/lines that are identical for both h_start and h_end...
Is that clearer?
I apologize for the confusion, I hope this makes more sense...
This preserves the first row and removes the other ones with identical h_start and h_end pairs. Is this what you're looking for? What about the other values in the rows if they are different from the first (preserved) row?
"What do you mean by "collapse"?": I want to remove them and keep only the first one.
"Same as what?": if they have the same value in the hit start column and also in the hit end column.
"Which highlight? The yellow?": I highlighted rows in yellow and pink only because they were consecutive, but I want to keep only the first row of the yellow group and of the pink group (you can see these reads have the same value for hit start and also for hit end).
"What you probably want is a "pivot table"": I would like to do it on Linux because the file is huge and a pivot table for hit start and hit end would be too big.
"What spreadsheet software are you using?": I was thinking about modifying the file using something like awk and then opening it with any program...
Well, I tested my own code more carefully and it has a problem: if the value in the fifth ($5) or sixth ($6) column has already appeared, the row is not printed, even when the other number (and therefore the resulting pair) is different. Here is a straightforward solution, using an array indexed by both columns:
Code:
awk -F"\t" '!_[$5,$6]++' file
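This simply uses the concept of true and false in awk. The exclamation point is the negation and _ is simply the name of an array. The first time a given ($5,$6) pair is seen, its array entry is still 0, so the negation is true and the row is printed (printing is the default action when a pattern has no action block); the ++ then makes the entry non-zero, so every later row with the same pair is skipped.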
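To make the idiom easier to follow, here is a rough long-hand sketch of the same test. The sample file, its name (sample.tsv) and every column other than h_start and h_end are made up for illustration; the only thing taken from the thread is that h_start is column 5 and h_end is column 6. The array is called seen instead of _ purely for readability.

```shell
# Hypothetical sample: tab-separated, h_start in column 5, h_end in column 6.
printf 'read\tchr\tstrand\tscore\th_start\th_end\n' > sample.tsv
printf 'r1\tchr1\t+\t50\t100\t200\n' >> sample.tsv
printf 'r2\tchr1\t+\t48\t100\t200\n' >> sample.tsv
printf 'r3\tchr1\t-\t47\t100\t250\n' >> sample.tsv
printf 'r4\tchr1\t+\t51\t300\t400\n' >> sample.tsv
printf 'r5\tchr1\t+\t45\t300\t400\n' >> sample.tsv

# Long form of '!_[$5,$6]++'
awk -F'\t' '
{
    key = $5 SUBSEP $6      # the same key awk builds for seen[$5,$6]
    if (seen[key] == 0)     # an entry that was never set compares equal to 0
        print               # first row with this pair: keep it
    seen[key]++             # remember the pair so later rows are skipped
}' sample.tsv > dedup.tsv

cat dedup.tsv
```

On this sample the header, r1, r3 and r4 are kept: r2 and r5 repeat an earlier pair, while r3 shares h_start with r1 but has a different h_end. Note that the array holds one entry per distinct pair, so memory use grows with the number of distinct pairs, not with the number of lines.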
The lines highlighted in blue are removed from the output of the first suggested awk command, even though their h_start/h_end pair is different from the pair on the line immediately above. I'm still confused about your exact requirement.
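The first suggested command is not reproduced in this thread, so the following is only a guess at what a command with the described behaviour could look like: a test that tracks each column in its own array (the array names s and e, the file names and the sample data are all hypothetical). Comparing it with the pair-keyed command shows which rows the two approaches disagree on.

```shell
# Hypothetical sample: tab-separated, h_start in column 5, h_end in column 6.
printf 'read\tchr\tstrand\tscore\th_start\th_end\n' > hits.tsv
printf 'r1\tchr1\t+\t50\t100\t200\n' >> hits.tsv
printf 'r2\tchr1\t+\t48\t100\t200\n' >> hits.tsv
printf 'r3\tchr1\t-\t47\t100\t250\n' >> hits.tsv
printf 'r4\tchr1\t+\t51\t300\t400\n' >> hits.tsv
printf 'r5\tchr1\t+\t45\t300\t400\n' >> hits.tsv

# Assumed per-column test: a row is dropped as soon as EITHER value was seen before.
awk -F'\t' '!s[$5]++ && !e[$6]++' hits.tsv > per_column.tsv

# Pair test from the thread: a row is dropped only if the PAIR was seen before.
awk -F'\t' '!_[$5,$6]++' hits.tsv > per_pair.tsv

# Rows kept by the pair test but lost by the per-column test.
grep -vxFf per_column.tsv per_pair.tsv > only_in_pair.tsv
cat only_in_pair.tsv
```

With this sample the only row that differs is r3: its h_start (100) already appeared in r1, so the per-column test drops it, but its pair (100, 250) is new, so the pair test keeps it. If the requirement is "same h_start and same h_end", the pair test is the one that matches it.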