LinuxQuestions.org
Old 12-11-2011, 10:35 AM   #1
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Rep: Reputation: 30
remove similar (not identical) lines of text


Hi guys and girls, I have a list of songs in which a few entries are duplicated, and I want to remove the duplicates.

here's an example:

Quote:
3 Doors Down When I'm Gone
4 Non Blondes What's Up
4 Non Blondes What's Up?
a-ha Take On Me
Aerosmith Cryin
Aerosmith Cryin'
As you can see, '4 Non Blondes' has 2 entries and so does 'Aerosmith', but they are not exact duplicate lines.

I'd normally use something like the below to display all the duplicate lines, but as the lines aren't 100% the same it doesn't detect any duplicates.

Code:
sort songs.txt | uniq -d
My question, of course, is: are there any ways to find 'similar lines' rather than 'duplicate lines'?

thanks

Last edited by steve51184; 12-11-2011 at 10:39 AM.
 
Old 12-11-2011, 11:21 AM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
How do you want to define 'similar'?

For example, this awk snippet compares lines using only their letters (a-z), case-insensitively, and ignores all other characters, including digits and punctuation. Among lines that match, it outputs only the first one.
Code:
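awk '{ t = tolower($0) ;              # comparison key: lowercase copy of the line
       gsub(/[^a-z]+/, "", t) ;       # drop everything except the letters a-z
       if (!(t in seen)) print $0 ;   # print only the first line seen for each key
       seen[t] = NR                   # remember the key (the stored value is unused)
     }' input-file > output-file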
For your example input, it will output
Code:
3 Doors Down When I'm Gone
4 Non Blondes What's Up
a-ha Take On Me
Aerosmith Cryin
Note how important the input order is. If you reverse-sort the input, i.e. run
Code:
sort -rbd input-file | awk '
 { t = tolower($0) ;
   gsub(/[^a-z]+/, "", t) ;
   if (!(t in seen)) print $0 ;
   seen[t] = NR
 }' | sort -bd > output-file
the output will be
Code:
3 Doors Down When I'm Gone
4 Non Blondes What's Up?
Aerosmith Cryin' 
a-ha Take On Me
Finally, you can change the gsub(/pattern/,replacement,t) to edit the comparison version of each line however you want. If you wish, you can also add gsub(/pattern/,replacement) lines to edit the output lines.
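As a rough sketch of that kind of customization (the sample lines and the temporary file are made up for illustration and are not from the original list): the version below keeps digits in the comparison key, so "Vol 1" and "Vol 2" stay distinct, and adds one gsub on the output line to strip trailing whitespace.

```shell
# Hypothetical sample data written to a temporary file.
songs=$(mktemp)
cat > "$songs" <<'EOF'
4 Non Blondes What's Up
4 Non Blondes What's Up?
Greatest Hits Vol 1
Greatest Hits Vol 2
EOF

result=$(awk '{ gsub(/[ \t]+$/, "")          # edit the output line: strip trailing blanks
       t = tolower($0)                       # comparison key starts as a lowercase copy
       gsub(/[^a-z0-9]+/, "", t)             # keep letters AND digits in the key
       if (!(t in seen)) print $0            # print the first line seen for each key
       seen[t] = NR
     }' "$songs")
printf '%s\n' "$result"
rm -f "$songs"
```

With the letters-only pattern from above, the two "Vol" lines would collapse into one; keeping digits in the key is one way to avoid that, at the cost of treating "Track 1" and "Track 01" as different.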
 
Old 12-11-2011, 11:43 AM   #3
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
There are "fuzzy search" facilities in at least MySQL and PHP according to the results of this netsearch, so one possibility would be to load the lines into a MySQL database and then use something (bash comes to mind, assuming your song list is in hundreds rather than thousands of lines) to loop through each line of the list and do an SQL fuzzy match on it. Familiarity with MySQL would help -- or a taste for learning adventures.

With any sort of fuzzy matching it would be dangerous to change data automatically. For example "The Best of Hot Cam and the Four Valve Head Vol I" would fuzzily match Vol II but you know they are different! The best idea might be to generate a list of potential changes and then edit them manually after applying human judgement.
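A minimal sketch of the "list candidates, then decide by hand" idea, without a database. It assumes the same letters-only comparison key as the awk snippet in post #2 (so it is not a true fuzzy match), and the temporary file and variable names are placeholders. Nothing is deleted; it only prints groups of lines that share a key, separated by blank lines.

```shell
# Sample data taken from the first post, written to a temporary file.
songs=$(mktemp)
cat > "$songs" <<'EOF'
3 Doors Down When I'm Gone
4 Non Blondes What's Up
4 Non Blondes What's Up?
a-ha Take On Me
Aerosmith Cryin
Aerosmith Cryin'
EOF

# Group lines by their letters-only key and report only groups with 2+ lines.
report=$(awk '{ t = tolower($0)
       gsub(/[^a-z]+/, "", t)
       if (!(t in group)) order[++n] = t      # remember first-seen order of keys
       group[t] = group[t] $0 "\n"            # collect every line for this key
       count[t]++
     }
     END { for (i = 1; i <= n; i++)
             if (count[order[i]] > 1) printf "%s\n", group[order[i]]
     }' "$songs")
printf '%s\n' "$report"
rm -f "$songs"
```

The report can then be reviewed and the unwanted lines removed from the list by hand, which keeps a human in the loop for cases like the Vol I / Vol II example.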
 
Old 12-11-2011, 05:17 PM   #4
steve51184
Member
 
Registered: Dec 2006
Posts: 381

Original Poster
Rep: Reputation: 30
I used Nominal Animal's method and it works quite nicely.

thank you both very much for the help
 
Old 04-24-2012, 07:51 AM   #5
jorisrafael
LQ Newbie
 
Registered: Apr 2012
Posts: 1

Rep: Reputation: Disabled
How do I implement this on a Mac? I am very new to all of this so I appreciate your help very much. Please make it as simple as possible.
Thank you very much
Joris
 
Old 04-24-2012, 11:45 PM   #6
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by jorisrafael View Post
How do I implement this on a mac. I am very new to all of this so I appreciate your help very much. Please make it as simple as possible.
Thank you very much
Joris
1st of all: welcome to LQ!

2nd: you shouldn't hijack current threads or reanimate dead ones if your question
isn't exactly the same as in the original (it isn't).

3rd: Macs have terminals, and they have awk, so use it in exactly the same way as
in Linux.



Cheers,
Tink
 
  


All times are GMT -5.
