bash script to count number of lines with a specific property
Hello folks,
I would like to parse an input file that has two columns per row. We want to count how many lines are duplicated, where two lines are duplicates if they have the same second field but different first fields. For instance, if the input file looks like the following:
79874 13131
79873 12309
79820 13131
79873 12309
The output should be 1, because only lines 1 and 3 are duplicates of each other. So we have 1 duplicate entry. Note that since both fields of lines 2 and 4 are the same, they are not duplicates under the above definition.
Now, I know that it is trivial to write a Python script or the like to calculate this duplicate count for a given input file, but I'm curious to see if it is possible to write such a bash script using available Linux tools like awk, sed, uniq, and so on. Any ideas, Linux freaks?
Well, the only straightforward way that I can immediately think of is sorting the file on the second column (e.g., sort -k 2,2) and then running a Python script like this:
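import sys

dup = 0
last_seq = None
last_id = None
# The input file (first argument) must already be sorted on the second column.
with open(sys.argv[1]) as f:
    for line in f:
        cur_id, cur_seq = line.split()
        # Same second field as the previous row, but a different first field.
        if cur_seq == last_seq and cur_id != last_id:
            dup += 1
        last_id = cur_id
        last_seq = cur_seq
print(dup)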
But like I said, I'm looking for a cute, tricky script that would only use available Linux tools like sed and awk. There could be a one-line solution for such a thing! Please feel free to share if one of your many ideas falls into this category.
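A minimal awk sketch of the kind of one-pass solution asked for here, assuming whitespace-separated fields and assuming that a second-column value should be counted once no matter how many different first-column values it appears with. The function name count_dup_seqs_awk and the array names are made up for illustration. Unlike the Python loop above, it needs no sorted input, but awk keeps every distinct row in memory.

```shell
# Hypothetical helper: prints how many second-column values occur with
# more than one distinct first-column value in the file named by $1.
count_dup_seqs_awk() {
    awk '
        !seen[$1, $2]++ {          # first time this exact (id, seq) row is seen
            if (++ids[$2] == 2)    # this seq just gained a second distinct id
                dups++
        }
        END { print dups + 0 }     # "+ 0" prints 0 rather than an empty line
    ' "$1"
}
```

Called as count_dup_seqs_awk input.txt on the four-line sample from the first post, this prints 1. If three rows shared a second field with three different first fields, this sketch would still count 1, whereas the Python loop would count 2, so which behaviour is wanted depends on how "duplicate" is defined.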
I agree awk/sed will be a better solution. It is possible with sort/uniq, but it will be slow on big files. But it's usually better to optimize later. Reading your description I think it can be written like this:
Thanks for your note. I'm looking for something along the lines of what you are suggesting. However, I'm not sure how you're checking the constraint that the first columns of two duplicate entries are different. Can you explain this a little more? Are you sure this code will do this?
The first uniq filters out all fully duplicate rows, that is, all rows where both numbers are the same. Then the first column is removed. We sort again and run uniq -d, which outputs only the values that occur more than once. Isn't that what you wanted?
Anyway, my point was that it's often better to play with commands like that. Test it with real data, and if it works like it should, you can optimize.
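The sort/uniq pipeline described above could be sketched as follows. This is a reconstruction from the description, not necessarily the exact command that was posted: the function name count_dup_seqs is hypothetical, sort -u stands in for sort followed by uniq, and fields are assumed to be separated consistently so that identical rows compare equal as whole lines.

```shell
# Hypothetical helper: prints how many second-column values occur with
# more than one distinct first-column value in the file named by $1.
count_dup_seqs() {
    sort -u "$1" |            # drop rows where both fields are identical
        awk '{ print $2 }' |  # keep only the second column
        sort |                # bring equal values together for uniq
        uniq -d |             # print each value that appears more than once
        wc -l                 # count those values
}
```

For the four-line sample from the first post, count_dup_seqs input.txt prints 1. The two sort passes are what make this approach slower on big files than a single awk pass.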