Old 08-11-2010, 02:52 AM   #1
hhamid
LQ Newbie
 
Registered: Sep 2005
Location: California
Distribution: Ubuntu
Posts: 17

Rep: Reputation: 0
bash script to count number of lines with a specific property


Hello folks,

I would like to parse an input file in which there are two columns per row. We want to see how many lines are duplicated, where we define a duplicate as a line that has the same second field as another line but a different first field. For instance, if the input file looks like the following:

79874 13131
79873 12309
79820 13131
79873 12309

The output should be 1, because essentially only line 1 and line 3 are duplicates of each other, so we have 1 duplicate entry. Note that since both fields of lines 2 and 4 are the same, they are not duplicates under the above definition.

Now, I know that it is trivial to write a Python script or the like to calculate this duplicate count for a given input file, but I'm curious to see whether it is possible to do it in a bash script using standard Linux tools like awk, sed, uniq, and so on. Any ideas, Linux freaks?

Thanks
--Hamid
 
Old 08-11-2010, 03:11 AM   #2
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988
Plenty of ideas, but maybe you could show what you have tried and / or what is not working?
 
0 members found this post helpful.
Old 08-11-2010, 03:33 AM   #3
hhamid
LQ Newbie
 
Registered: Sep 2005
Location: California
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: 0
Well, the only straightforward way that I can immediately think of is sorting the file on the second column (e.g., sort -k2) and then running a Python or bash script like this:

import sys

dup = 0
last_seq = None
last_id = None
for line in open(sys.argv[1], 'r'):
    line_split = line.split()
    cur_id = line_split[0]
    cur_seq = line_split[1]
    # count a duplicate when the second field repeats but the first field changes
    if cur_seq == last_seq and cur_id != last_id:
        dup += 1
    last_id = cur_id
    last_seq = cur_seq
print(dup)

But like I said, I'm looking for a cute, tricky script which would only use standard Linux tools like sed and awk. There could be a one-line solution for such a thing! Please feel free to share if one of your many ideas falls into this category.
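For reference, a sketch of how the snippet above might be invoked, assuming it is saved as count_dups.py (the filename is just for illustration) and the input is pre-sorted on the second column so that matching rows sit next to each other:

Code:
sort -k2,2 -k1,1 datafile.txt > sorted.txt
python count_dups.py sorted.txt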

Thanks
--Hamid
 
Old 08-11-2010, 05:31 AM   #4
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988
Well, I see where you're going, and I am guessing you have not used sed or awk and have not found any good sites to help you, sooo:

awk - http://www.gnu.org/manual/gawk/html_node/index.html

sed - http://www.grymoire.com/Unix/Sed.html

My personal suggestion would be for you to look at the awk site first, as this application deals particularly well with delimited information.
 
0 members found this post helpful.
Old 08-11-2010, 07:01 AM   #5
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,159

Rep: Reputation: 258
Hi

I agree awk/sed would be a better solution. It is possible with sort/uniq, but it will be slow on big files. Still, it's usually better to optimize later. Reading your description, I think it can be written like this:

cat datafile.txt | sort | uniq | cut -d" " -f2 | sort | uniq -d | wc -l
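Assuming the sample data from the first post is in datafile.txt, this should print 1:

Code:
$ cat datafile.txt | sort | uniq | cut -d" " -f2 | sort | uniq -d | wc -l
1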
 
1 members found this post helpful.
Old 08-12-2010, 12:48 AM   #6
hhamid
LQ Newbie
 
Registered: Sep 2005
Location: California
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: 0
Hello Guttorm,

Thanks for your note. I'm looking for something along the lines of what you are suggesting. However, I'm not sure how you're checking the constraint that the first columns of two duplicate entries are different. Can you explain this a little more? Are you sure this code does that?

Thanks again

Last edited by hhamid; 08-12-2010 at 12:51 AM.
 
Old 08-12-2010, 05:10 AM   #7
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,159

Rep: Reputation: 258
Hi

The first uniq filters out all duplicate rows, i.e. rows where both numbers are the same as in another row. Then the first column is removed. We sort again, then apply uniq -d, which outputs only the duplicated rows. Isn't that what you wanted?

Anyway, my point was that it's often better to play with commands like that. Test it with real data, and if it works like it should, you can optimize.
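To make the stages concrete, here is roughly what each step produces on the sample data from the first post:

Code:
# input:                 79874 13131 / 79873 12309 / 79820 13131 / 79873 12309
# after sort | uniq:     79820 13131 / 79873 12309 / 79874 13131
# after cut -d" " -f2:   13131 / 12309 / 13131
# after sort | uniq -d:  13131
# wc -l then counts that single remaining line, giving 1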
 
1 members found this post helpful.
Old 08-12-2010, 11:13 AM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988
Code:
awk '!_[$0]++{sum++}END{print (NR-sum)}' file
It seems you just want the answer
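For readers new to that idiom: _[$0]++ counts occurrences of each whole line in an associative array, so !_[$0]++ is true only the first time a given line is seen. A roughly equivalent, spelled-out version (just a sketch of the same idea) would be:

Code:
awk '{ if (!(seen[$0]++)) unique++ } END { print NR - unique }' file

In other words, it prints the number of lines that are exact repeats of an earlier line (both fields identical).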
 
1 members found this post helpful.
Old 08-12-2010, 04:14 PM   #9
hhamid
LQ Newbie
 
Registered: Sep 2005
Location: California
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: 0
Thanks Guttorm! Brilliant!

Quote:
Originally Posted by Guttorm View Post
Hi

The first uniq filters out all duplicate rows, i.e. rows where both numbers are the same as in another row. Then the first column is removed. We sort again, then apply uniq -d, which outputs only the duplicated rows. Isn't that what you wanted?

Anyway, my point was that it's often better to play with commands like that. Test it with real data, and if it works like it should, you can optimize.
 
Old 08-12-2010, 04:44 PM   #10
hhamid
LQ Newbie
 
Registered: Sep 2005
Location: California
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: 0
Thanks. Although your code doesn't work, I got the idea. Something like this should do the job:

BEGIN {
    last_id = -1
    last_seq = -1
    dup = 0
}
{
    if ($1 != last_id && $2 == last_seq) {
        dup++
    }
    last_id = $1
    last_seq = $2
}
END {
    print dup
}

The fact that awk and sed iterate over all the lines of a file makes them really useful in such a scenario.
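Presumably the script above would be run on input sorted by the second field, something like this (dupcount.awk is just an illustrative filename):

Code:
sort -k2,2 -k1,1 datafile.txt | awk -f dupcount.awk

Without the sort, lines sharing a second field would not be adjacent and the line-to-line comparison above would miss them.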

Quote:
Originally Posted by grail View Post
Code:
awk '!_[$0]++{sum++}END{print (NR-sum)}' file
It seems you just want the answer

Last edited by hhamid; 08-12-2010 at 04:45 PM.
 
Old 08-13-2010, 02:35 AM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,698

Rep: Reputation: 1988
Quote:
Although your code doesn't work, I got the idea.
Would you like to explain further what did not work? It seemed fine on your test data.
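For what it's worth, the two approaches do diverge on input like the following, where the second fields match but the first fields differ (test.txt is just a made-up example):

Code:
$ printf '1 100\n2 100\n' > test.txt
$ awk '!_[$0]++{sum++}END{print (NR-sum)}' test.txt
0

By the definition in the first post that input contains 1 duplicate, since the two lines share a second field but have different first fields.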
 
  

