LinuxQuestions.org

steve51184 12-11-2011 10:35 AM

remove similar (not identical) lines of text
 
Hi guys and girls, I have a list of songs in which a few entries are duplicated, and I want to remove the duplicates.

here's an example:

Quote:

3 Doors Down When I'm Gone
4 Non Blondes What's Up
4 Non Blondes What's Up?
a-ha Take On Me
Aerosmith Cryin
Aerosmith Cryin'

As you can see, '4 Non Blondes' has 2 entries and so does 'Aerosmith', but they are not exact duplicate lines.

I'd normally use something like the command below to display all the duplicate lines, but as the lines aren't 100% the same, it doesn't detect any duplicates.

Code:

sort songs.txt | uniq -d

My question, of course, is: are there any ways to find 'similar lines' rather than 'duplicate lines'?

thanks

Nominal Animal 12-11-2011 11:21 AM

How do you want to define 'similar'?

For example, this awk snippet compares lines case-insensitively using only their letters (a to z) and ignores all other characters, including digits, spaces and punctuation. Of the lines that match each other, it outputs only the first one.
Code:

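awk '{ t = tolower($0) ;               # comparison key: the line in lowercase
      gsub(/[^a-z]+/, "", t) ;        # keep only the letters a-z in the key
      if (!(t in seen)) print $0 ;    # print a line only if its key is new
      seen[t] = NR                    # remember the key (value: line number)
    }' input-file > output-file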

For your example input, it will output
Code:

3 Doors Down When I'm Gone
4 Non Blondes What's Up
a-ha Take On Me
Aerosmith Cryin

Note how important the input order is. If you reverse-sort the input, i.e. run
Code:

sort -rbd input-file | awk '
 { t = tolower($0) ;
  gsub(/[^a-z]+/, "", t) ;
  if (!(t in seen)) print $0 ;
  seen[t] = NR
 }' | sort -bd > output-file

the output will be
Code:

3 Doors Down When I'm Gone
4 Non Blondes What's Up?
Aerosmith Cryin'
a-ha Take On Me

Finally, you can change the gsub(/pattern/,replacement,t) to edit the comparison version of each line however you want. If you wish, you can also add gsub(/pattern/,replacement) lines to edit the output lines.
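
As one illustration of changing the pattern, here is a minimal sketch that also keeps digits in the comparison key, so that titles differing only by a number (say "Vol 1" and "Vol 2") are not treated as the same line. The function name dedupe_similar and the sample titles are made up for this example; the rest is the same logic as the snippet above.

```shell
#!/bin/sh
# dedupe_similar (hypothetical helper name): print the first line of each
# group of similar lines. Reads the named files, or stdin if none are given.
dedupe_similar() {
    awk '{
        key = tolower($0)               # compare case-insensitively
        gsub(/[^a-z0-9]+/, "", key)     # keep letters AND digits in the key
        if (!(key in seen)) print       # first line with this key wins
        seen[key] = NR
    }' "$@"
}

# Made-up sample input: the third line only differs in case and punctuation.
printf '%s\n' "Best Of Vol 1" "Best Of Vol 2" "best of vol. 1" | dedupe_similar
```

With this key the first two lines are both kept and only the third is dropped as similar to the first. With the letters-only pattern, all three would collapse into one.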

catkin 12-11-2011 11:43 AM

There are "fuzzy search" facilities in at least MySQL and PHP, according to the results of this netsearch, so one possibility would be to load the lines into a MySQL database and then use something (bash comes to mind, assuming your song list is in the hundreds rather than thousands of lines) to loop through each line of the list and do an SQL fuzzy match on it. Familiarity with MySQL would help -- or a taste for learning adventures :D

With any sort of fuzzy matching it would be dangerous to change data automatically. For example "The Best of Hot Cam and the Four Valve Head Vol I" would fuzzily match Vol II but you know they are different! The best idea might be to generate a list of potential changes and then edit them manually after applying human judgement.
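
One way to produce such a review list is sketched below. It reuses the letters-only comparison key from the awk snippet earlier in the thread, but instead of dropping anything it prints each group of similar lines together, followed by a blank line, so you can decide by hand which one to keep. The function name list_similar is hypothetical, and the grouping is only as good as the key.

```shell
#!/bin/sh
# list_similar (hypothetical helper name): print only the lines whose
# comparison key occurs more than once, grouped, for manual review.
# Nothing is deleted. Reads the named files, or stdin if none are given.
list_similar() {
    awk '{
        key = tolower($0)
        gsub(/[^a-z]+/, "", key)                     # letters-only key
        if (key == "") next                          # skip lines with no letters
        if (!(key in lines)) order[++n] = key        # remember first-seen order
        count[key]++
        lines[key] = lines[key] $0 "\n"              # collect the original lines
    }
    END {
        for (i = 1; i <= n; i++) {
            key = order[i]
            if (count[key] > 1) printf "%s\n", lines[key]
        }
    }' "$@"
}

# The sample list from the first post.
list_similar <<'EOF'
3 Doors Down When I'm Gone
4 Non Blondes What's Up
4 Non Blondes What's Up?
a-ha Take On Me
Aerosmith Cryin
Aerosmith Cryin'
EOF
```

For the sample list this prints the two '4 Non Blondes' lines, a blank line, and then the two 'Aerosmith' lines. Run it on your own file with list_similar songs.txt.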

steve51184 12-11-2011 05:17 PM

I used Nominal Animal's method and it works quite nicely :)

thank you both very much for the help

jorisrafael 04-24-2012 07:51 AM

How do I implement this on a Mac? I am very new to all of this, so I appreciate your help very much. Please make it as simple as possible. ;)
Thank you very much
Joris

Tinkster 04-24-2012 11:45 PM

Quote:

Originally Posted by jorisrafael (Post 4661644)
How do I implement this on a Mac? I am very new to all of this, so I appreciate your help very much. Please make it as simple as possible. ;)
Thank you very much
Joris

1st of all: welcome to LQ!

2nd: you shouldn't hijack current threads or reanimate dead ones if your question
isn't exactly the same as in the original (it isn't).

3rd: Macs have terminals, and they have awk, so use it in exactly the same way as
in Linux.



Cheers,
Tink


All times are GMT -5.