remove similar (not identical) lines of text
Hi guys and girls, I have a list of songs in which a few are duplicated, and I want to remove the duplicates.
here's an example: Quote:
I'd normally use something like the below to display all the duplicate lines, but as the lines aren't 100% the same it doesn't detect any duplicates. Code:
sort songs.txt | uniq -d
Thanks |
How do you want to define 'similar'?
For example, this awk snippet compares only the letters, case-insensitively, and ignores all other characters. When several lines match, it outputs only the first one. Code:
awk '{ t = tolower($0) ;
Code:
3 Doors Down When I'm Gone
Code:
sort -rbd input-file | awk '
Code:
3 Doors Down When I'm Gone |
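Since the snippets above are cut off, here is a minimal sketch of the approach as described (compare letters only, case-insensitively, and keep the first line of each match). It is not the original code: the function name `dedupe_similar` and the sample input lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the first line seen for each "letters-only" key.
dedupe_similar() {
    awk '{
        key = tolower($0)          # compare case-insensitively
        gsub(/[^a-z]/, "", key)    # drop everything except a-z (digits, spaces, punctuation)
        if (!(key in seen)) {      # first line with this key
            seen[key] = 1
            print                  # print the original line unchanged
        }
    }' "$@"
}

# Made-up sample input: the first two lines differ only in case and punctuation.
printf '%s\n' \
    "3 Doors Down - When I'm Gone" \
    "3 doors down when im gone" \
    "Some Other Song" | dedupe_similar
# prints:
# 3 Doors Down - When I'm Gone
# Some Other Song
```

For a file it would be something like `dedupe_similar songs.txt > songs-unique.txt`. Note that with this key, lines containing no letters at all share an empty key, so only the first of them survives, and accented letters are dropped along with the punctuation.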
There are "fuzzy search" facilities in at least MySQL and PHP, according to the results of this netsearch. So one possibility would be to load the lines into a MySQL database and then use something (bash comes to mind, assuming your song list is hundreds rather than thousands of lines long) to loop through each line of the list and do an SQL fuzzy match on it. Familiarity with MySQL would help, or a taste for learning adventures :D
With any sort of fuzzy matching it would be dangerous to change data automatically. For example "The Best of Hot Cam and the Four Valve Head Vol I" would fuzzily match Vol II but you know they are different! The best idea might be to generate a list of potential changes and then edit them manually after applying human judgement. |
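As a rough sketch of the "list first, edit by hand" idea, the following prints groups of lines that look alike instead of deleting anything. It assumes the same letters-only, case-insensitive key discussed earlier in the thread; the function name `list_similar_groups` and the sample lines are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show groups of similar lines for manual review.
list_similar_groups() {
    awk '{
        key = tolower($0)          # compare case-insensitively
        gsub(/[^a-z]/, "", key)    # letters only; digits and punctuation are ignored
        count[key]++
        if (key in lines)
            lines[key] = lines[key] "\n" $0   # append to the existing group
        else
            lines[key] = $0                   # start a new group
    }
    END {
        # Print only the groups with more than one member,
        # each followed by a blank line. Group order is unspecified.
        for (key in count)
            if (count[key] > 1)
                print lines[key] "\n"
    }' "$@"
}

# Made-up sample input: the two "Vol" lines share a key because digits are dropped.
printf '%s\n' "Best Of Vol 1" "best of vol 2" "Another Song" | list_similar_groups
# prints:
# Best Of Vol 1
# best of vol 2
```

The sample shows exactly the danger described: because the key ignores digits, "Vol 1" and "Vol 2" are reported as candidates even though they are different albums, so the output is only a list to check by eye.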
I used Nominal Animal's method and it works quite nicely :)
Thank you both very much for the help |
How do I implement this on a Mac? I am very new to all of this, so I appreciate your help very much. Please make it as simple as possible. ;)
Thank you very much Joris |
Quote:
2nd: you shouldn't hijack current threads or reanimate dead ones if your question isn't exactly the same as the original (it isn't). 3rd: Macs have terminals, and they have awk, so use it in exactly the same way as in Linux. Cheers, Tink |