LinuxQuestions.org - [SOLVED] remove similar (not identical) lines of text

remove similar (not identical) lines of text

Hi guys and girls, I have a list of songs in which a few entries are duplicated, and I want to remove the duplicates.

here's an example:

Quote:

3 Doors Down When I'm Gone
4 Non Blondes What's Up
4 Non Blondes What's Up?
a-ha Take On Me
Aerosmith Cryin
Aerosmith Cryin'

As you can see, '4 Non Blondes' has two entries and so does 'Aerosmith', but the lines are not exact duplicates.

I'd normally use something like the command below to display duplicate lines, but because these lines aren't 100% identical it doesn't detect any.

Code:

sort songs.txt | uniq -d

My question, of course, is whether there is a way to find 'similar' lines rather than exact duplicates.

Thanks

How do you want to define 'similar'?

For example, this awk snippet compares lines using letters only, case-insensitively, and ignores every other character (digits, spaces, punctuation). Among lines that compare equal, it outputs only the first one.

Code:

awk '{ t = tolower($0) ;
       gsub(/[^a-z]+/, "", t) ;
       if (!(t in seen)) print $0 ;
       seen[t] = NR
     }' input-file > output-file

For your example input, it will output

Code:

3 Doors Down When I'm Gone
4 Non Blondes What's Up
a-ha Take On Me
Aerosmith Cryin

Note how important the input order is. If you reverse-sort the input, i.e. run

Code:

sort -rbd input-file | awk '
  { t = tolower($0) ;
    gsub(/[^a-z]+/, "", t) ;
    if (!(t in seen)) print $0 ;
    seen[t] = NR
  }' | sort -bd > output-file

the output will be

Code:

3 Doors Down When I'm Gone
4 Non Blondes What's Up?
Aerosmith Cryin'
a-ha Take On Me

Finally, you can change the gsub(/pattern/, replacement, t) call to shape the comparison version of each line however you want. If you wish, you can also add gsub(/pattern/, replacement) calls, which operate on the line itself, to edit the output lines.
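As a rough sketch of that kind of change, the variant below keeps digits as well as letters in the comparison key, so titles that differ only by a number stay distinct. The file names songs-sample.txt and songs-dedup.txt and the sample titles are made up for illustration.

```shell
# Hypothetical sample input: two volumes that differ only by a digit,
# plus a differently punctuated copy of the first.
cat > songs-sample.txt <<'EOF'
Best Of Vol 1
Best Of Vol 2
best of vol. 1
EOF

# Same approach as above, but the pattern now keeps letters AND digits
# in the comparison key t; everything else is still stripped.
awk '{ t = tolower($0)
       gsub(/[^a-z0-9]+/, "", t)
       if (!(t in seen)) print $0
       seen[t] = NR
     }' songs-sample.txt > songs-dedup.txt

cat songs-dedup.txt
```

With the original letters-only pattern, all three sample lines would share one key and only the first would survive; here the third line is dropped and "Vol 2" is kept.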

There are "fuzzy search" facilities in at least MySQL and PHP, according to the results of this netsearch, so one possibility would be to load the lines into a MySQL database and then use something (bash comes to mind, assuming your song list runs to hundreds rather than thousands of lines) to loop through each line of the list and do an SQL fuzzy match on it. Familiarity with MySQL would help -- or a taste for learning adventures :D

With any sort of fuzzy matching it would be dangerous to change data automatically. For example "The Best of Hot Cam and the Four Valve Head Vol I" would fuzzily match Vol II but you know they are different! The best idea might be to generate a list of potential changes and then edit them manually after applying human judgement.
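One way to produce such a review list, sketched here with the same letters-only comparison used earlier in the thread (not a true fuzzy match), is to print the groups of similar lines with their line numbers instead of deleting anything. The file names songs-review-sample.txt and review.txt are placeholders.

```shell
# Sample input taken from the example at the top of the thread.
cat > songs-review-sample.txt <<'EOF'
3 Doors Down When I'm Gone
4 Non Blondes What's Up
4 Non Blondes What's Up?
a-ha Take On Me
Aerosmith Cryin
Aerosmith Cryin'
EOF

# Group lines by their letters-only key and report only the groups
# that contain more than one line. Nothing is removed from the input.
awk '{ t = tolower($0)
       gsub(/[^a-z]+/, "", t)
       if (!(t in count)) order[++n] = t      # remember first-seen order of keys
       count[t]++
       lines[t] = lines[t] "  " NR ": " $0 "\n"
     }
     END {
       for (i = 1; i <= n; i++) {
         t = order[i]
         if (count[t] > 1) printf "possible duplicates:\n%s", lines[t]
       }
     }' songs-review-sample.txt > review.txt

cat review.txt
```

The report lists lines 2 and 3 as one group and lines 5 and 6 as another, so you can decide by hand which copy to keep.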

I used Nominal Animal's method and it works quite nicely :)

Thank you both very much for the help.

How do I implement this on a Mac? I am very new to all of this, so I would appreciate your help very much. Please make it as simple as possible. ;)
Thank you very much
Joris

1st of all: welcome to LQ!

2nd: you shouldn't hijack current threads or reanimate dead ones if your question
isn't exactly the same as the original (it isn't).

3rd: Macs have terminals, and they have awk, so use it in exactly the same way as
on Linux.
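A minimal sketch of what that could look like in the Terminal application, assuming you have first changed (cd) to the folder that holds your list. The names mac-songs.txt and mac-songs-unique.txt are made up, and the printf line only creates a small stand-in list so the example runs on its own.

```shell
# Stand-in for your own list; the file names here are hypothetical.
printf '%s\n' "Aerosmith Cryin" "Aerosmith Cryin'" "a-ha Take On Me" > mac-songs.txt

# Same awk program as earlier in the thread, written on one line.
# The original list is left untouched; the result goes to a new file.
awk '{ t = tolower($0); gsub(/[^a-z]+/, "", t); if (!(t in seen)) print; seen[t] = NR }' mac-songs.txt > mac-songs-unique.txt

cat mac-songs-unique.txt
```

With your own list you would skip the printf line and put your file's name in place of mac-songs.txt.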

Cheers,
Tink