Please help compare 2 files via awk/sed

Seemoi · 06-21-2012, 09:40 AM

Using Debian testing, mawk/sed

Objective:
Match "Title", "Artist" of a song in file1: "Playlist", to
"Title", "Artist" in file2: "Music-List". Both lists are Tab delimited.

Print the result only if the Song Title and Artist match the Playlist, otherwise print "Not Found"

Playlist; <artist> <title>

ZZ Top Tush
Peter Gabriel Sledgehammer (Remix)

Music-List; <artist> <title> <bitrate> <path>

ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3
Peter Gabriel 32000 /Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack Tush 128000 /Music/C/Cactus Jack - Tush.mp3

Others' suggestions:
1. Use an awk index to avoid special characters (-.[) etc.
2. Use an awk array for the Playlist file <artist> <title>

For example, the Song "Tush" appears 2 times with different artists,
if a line in Playlist searches for "Tush", it should also check for
the Artist "ZZ Top", to avoid accidentally printing the other "Tush" song by Cactus Jack.

Could someone please recommend a one-liner to accomplish this?

Thank you much.

David the H. · 06-21-2012, 10:34 AM

Please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.

Unless the files are formatted with a clear delimiter between the fields that's not found anywhere in the text itself, I can't see any easy or reliable way to accomplish this. It will certainly take more than a one-liner.

On the other hand, it would probably be trivial to do if the input was, say, tab-delimited.

BTW, is this a homework question? It reads like one.

Quote:

Do not post homework assignments verbatim. We're happy to assist if you have specific questions or have hit a stumbling point, however. Let us know what you've already tried and what references you have used (including class notes, books, and searches) and we'll do our best to help. Keep in mind that your instructor might also be an LQ member.

http://www.linuxquestions.org/linux/rules.html

grail · 06-21-2012, 11:13 AM

I would also be curious what output you would expect from your example. Assuming, as David has mentioned, that an appropriate delimiter is used, the only match to your criteria would be the ZZ Top entry ... is this correct?

Seemoi · 06-21-2012, 12:20 PM

No code was entered.
Both lists are Tab delimited.
This is not homework, I am consolidating many music playlists.

Expected result? OK...

Playlist Entry: ZZ Top Tush
For each entry in Playlist...

*Search field 2 "Tush" in Music-List field 2
*If matched, next check for a match for "ZZ Top" in field 1 of Music-List
If both title and artist match (in Music-List), print the line in Music-List

Result: ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3

Hope that helps.
Thank you.

grail · 06-21-2012, 01:59 PM

Quote:

No code was entered.

Actually if you read a little closer, David said both code and data do well from being placed in code tags.

Quote:

Both lists are Tab delimited.

This is an important detail to omit, and the tabs would have been preserved had you used code tags.

Quote:

Result: ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3

So just to confirm again, based on the following input data:

Code:

$ cat playlist
ZZ Top Tush
Peter Gabriel Sledgehammer (Remix)
$ cat music-list
ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3
Peter Gabriel 32000 /Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack Tush 128000 /Music/C/Cactus Jack - Tush.mp3

Then the result is still only ZZ Top?

Assuming all the above is correct, I would say awk is your friend and that there are several examples on LQ never mind the net.
Standard format is:

1. Store the relevant fields of every record from the first comparison file in an array
2. Check whether the corresponding fields of each record in the second file are in the array
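
The two steps above can be sketched roughly as follows. This is only a minimal example, assuming tab-delimited files with the artist in field 1 and the title in field 2, and an exact match on both (no handling of typos or annotations). The file names and sample data are illustrative.

```shell
#!/bin/sh
# Illustrative sample data in a temporary directory (file names are assumptions).
dir=$(mktemp -d)
printf 'ZZ Top\tTush\n' > "$dir/playlist"
printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\n' > "$dir/music-list"
printf 'Cactus Jack\tTush\t128000\t/Music/C/Cactus Jack - Tush.mp3\n' >> "$dir/music-list"

# Step 1: while reading the first file (NR == FNR), store artist+title as array keys.
# Step 2: for the second file, print the records whose artist+title is in the array.
result=$(awk '
    BEGIN { FS = "\t" }
    { key = $1 FS $2 }
    NR == FNR { seen[key]; next }
    key in seen
' "$dir/playlist" "$dir/music-list")

printf '%s\n' "$result"
rm -r "$dir"
```

Because the key contains both the artist and the title, the Cactus Jack version of "Tush" is not printed. Note that the NR == FNR test assumes the first file is not empty.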

Here is a link for the gawk manual if you do not already have one:

http://www.gnu.org/software/gawk/man...ode/index.html

Nominal Animal · 06-21-2012, 04:10 PM

Code:

#!/usr/bin/mawk -f
BEGIN {
    # Each line (using any convention) is a separate record.
    # Also remove any leading and trailing whitespace on a line.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

    # Fields are separated by a single tab character.
    FS = "[\t]"

    # For output, use linefeeds.
    ORS = "\n"

    # Input file number.
    file = 0
}

# Increase input file number before processing its first record.
(FNR == 1) {
    file++
}

# Compute an associative array key from the artist and song names (for all input files).
{
    # Start with "artist" "|" "song", converted to lower case.
    key = tolower($1 "|" $2)

    # Remove all non-alphanumeric characters from the key.    
    gsub(/[^0-9a-z|]+/, "", key)
}

# First file is the playlist we compare against. Remember the keys seen here.
(file == 1) {
    playlist[key]
}

# All other files are music lists.
(file > 1) {
    # Output the record to standard output, if this artist-song
    # was listed in the initial playlist.
    # Otherwise, output the record to standard error.
    if (key in playlist)
        printf("%s%s", $0, ORS)
    else
        printf("%s%s", $0, ORS) > "/dev/stderr"
}

The script takes two or more file names. The first names the file containing just the artist and song names. For all the other files, the script will output the record to standard output if the artist and song were named in the first file, and to standard error otherwise.

The idea is that the first two fields from each record in the first file are saved as keys in associative array playlist. Usually, there are some typos in the names, so the script converts the array key to lower case, then removes everything except digits, letters, and the pipe character. The pipe character is used to keep the artist and song names separate (so that "Artist" "Song Name" and "Artist Song" "Name" are distinguishable from each other).
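
To illustrate why the separator matters, here is a small sketch. It uses a hypothetical helper function make_key (not part of the script above) that builds the key by the same rule, once with the pipe and once without it.

```shell
#!/bin/sh
# make_key() is a hypothetical helper that mirrors the key rule of the script.
keys=$(printf 'Artist\tSong Name\nArtist Song\tName\n' | awk '
    function make_key(artist, song, sep,    k) {
        k = tolower(artist sep song)
        gsub(/[^0-9a-z|]+/, "", k)   # keep only digits, lower-case letters and "|"
        return k
    }
    BEGIN { FS = "\t" }
    { print make_key($1, $2, "|"), make_key($1, $2, "") }
')
printf '%s\n' "$keys"
```

With the pipe the two records give artist|songname and artistsong|name; without it both collapse to artistsongname and could no longer be told apart.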

Given first file

Code:

ZZ Top	Tush
Peter Gabriel	Sledgehammer (Remix)

and second file

Code:

ZZ Top	Tush	32000	/Music/Z/ZZ Top - Tush.mp3
Peter Gabriel	32000	/Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack	Tush	128000	/Music/C/Cactus Jack - Tush.mp3

the script will output to standard output

Code:

ZZ Top	Tush	32000	/Music/Z/ZZ Top - Tush.mp3

and to standard error

Code:

Peter Gabriel	32000	/Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack	Tush	128000	/Music/C/Cactus Jack - Tush.mp3

Note that the "Peter Gabriel" entry in the second file is missing the song name; the script sees "Peter Gabriel" as the artist, and "32000" as the song name, therefore it does not match "Sledgehammer (Remix)" in the play list.
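
Records like this can be looked for before merging. A rough sketch, assuming every Music-List record should have exactly four tab-separated fields (artist, title, bitrate, path), as in the layout given in the first post:

```shell
#!/bin/sh
# Report records that do not have exactly 4 tab-separated fields.
# The expected count of 4 is an assumption based on the sample layout.
bad=$(printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\nPeter Gabriel\t32000\t/Music/P/Peter Gabriel - Sledgehammer.mp3\n' | awk '
    BEGIN { FS = "\t" }
    NF != 4 { printf("line %d has %d fields: %s\n", NR, NF, $0) }
')
printf '%s\n' "$bad"
```

For the sample data this reports only the Peter Gabriel record, as line 2 with 3 fields.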

To get better results in practice, I'd edit the key string. For example, if the playlist has annotations in parentheses or brackets you want to ignore, you could replace the key code with

Code:

# Compute an associative array key from the artist and song names (for all input files).
{
    # Start with "artist" "|" "song", converted to lower case.
    key = tolower($1 "|" $2)

    # Remove stuff in parentheses
    gsub(/ *\([^\)]*\) */, " ", key)

    # Remove stuff in brackets
    gsub(/ *\[[^\]]*\] */, " ", key)

    # Remove all non-alphanumeric characters from the key.    
    gsub(/[^0-9a-z|]+/, "", key)
}

With a bit of hacking the above -- depending on how much variance and typos there are in your playlists -- you can probably save yourself a lot of hand-editing.
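
As a quick check of this key rule, the sketch below runs three made-up sample titles through the same three substitutions. The bracket expressions are written here as [^)] and [^]], which should behave the same as the escaped forms and are the more portable spelling.

```shell
#!/bin/sh
# Sample input is made up for illustration; only the key is printed.
out=$(printf 'Peter Gabriel\tSledgehammer (Remix)\nPeter Gabriel\tSledgehammer [Live]\nPeter Gabriel\tSledgehammer\n' | awk '
    BEGIN { FS = "\t" }
    {
        key = tolower($1 "|" $2)
        gsub(/ *\([^)]*\) */, " ", key)    # drop "(...)" annotations
        gsub(/ *\[[^]]*\] */, " ", key)    # drop "[...]" annotations
        gsub(/[^0-9a-z|]+/, "", key)       # keep digits, letters and "|"
        print key
    }
')
printf '%s\n' "$out"
```

All three records end up with the key petergabriel|sledgehammer, so "Sledgehammer (Remix)" in a playlist would match a plain "Sledgehammer" in a music list.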

Questions? Comments?

Seemoi · 06-22-2012, 12:53 PM

Nominal Animal,

Thank you for helping me without reprimand and non-answers.
I very much appreciate it.

I am not a programmer so I will have to study what you did to try and get this into
one line. It's been difficult transitioning from using a FileMaker database to accomplish this.

I realized the omission for the Peter Gabriel too late, thanks for catching that.
I've tried so many incarnations that didn't work so far...

cat Playlist | while read z; do tit="$z" ; awk -v title="$tit" '{FS = "\t"} $2 ~ title {print $0}' Music-List ; done

awk -F'\t' '{for(N in var){if(index($2,N)){print; next}}}' Playlist Music-List

awk -F'-' 'NR==FNR{art=$1;gsub(/[()-%$@]|^[ ]*/,"",$2);tit=$2;next}{l=split($0,a,"\t");gsub(/[()-%$@]/,"",a[2]);mas=a[1];for(i=2;i<=l;i++)mas=mas" "a[i];if(mas~tit)if(mas~art)print a[l];else print "No Match"}' Playlist Music-List

Thanks for your time in helping.

Regards,
Seemoi

Nominal Animal · 06-22-2012, 07:58 PM

You're welcome, Seemoi.

It is difficult to decipher what is asked when the data and inputs do not match the description or the intent. I basically just guessed.

While the responses here may read like they were reprimands, they really were just requests for clarification. While I may be overstepping some social bounds, I can guarantee you that both grail and David the H. only wanted to help you, but found your description frustratingly difficult to understand.

Quote:

Originally Posted by Seemoi

I am not a programmer so I will have to study what you did to try and get this into one line.

If you save the entire script from my previous post, just as written, into a file named, say, merge-playlist in your home directory (not desktop, your home directory), you can make it executable by simply running the command

Code:

chmod a+x ~/merge-playlist

once; one time only. The ~/ refers to your home directory.

Then you can simply run

Code:

~/merge-playlist Playlist Music-List > New-List

or even

Code:

~/merge-playlist Playlist Music-List1 Music-List2 Music-List3 > Combined-List

to save all the Music-List entries that match the Playlist into New-List or Combined-List. All the Music-List entries that were not listed in the Playlist (and are not saved in New-List or Combined-List) will be shown on-screen.
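
If one line per Playlist entry is wanted instead, with "Not Found" for songs missing from the Music-List (as in the original objective), the roles of the two files can be swapped. A minimal sketch, assuming exact matches on the first two tab-separated fields and reading the Music-List first; the sample data and temporary file names are illustrative:

```shell
#!/bin/sh
# Illustrative sample data in a temporary directory.
dir=$(mktemp -d)
printf 'ZZ Top\tTush\nPeter Gabriel\tSledgehammer (Remix)\n' > "$dir/Playlist"
printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\n' > "$dir/Music-List"
printf 'Cactus Jack\tTush\t128000\t/Music/C/Cactus Jack - Tush.mp3\n' >> "$dir/Music-List"

# The Music-List is read first and remembered by artist+title;
# then each Playlist entry prints its stored record or "Not Found".
report=$(awk '
    BEGIN { FS = "\t" }
    { key = $1 FS $2 }
    NR == FNR { music[key] = $0; next }
    { if (key in music) print music[key]; else print "Not Found" }
' "$dir/Music-List" "$dir/Playlist")

printf '%s\n' "$report"
rm -r "$dir"
```

Here the ZZ Top record is printed for the first Playlist entry and "Not Found" for the second. If the same artist and title appear more than once in the Music-List, this sketch keeps only the last record seen. The exact key could be swapped for the normalized key from the script if typos are a concern.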

You can even edit the script (using gedit, emacs, vim, nano, or any text editor you wish -- just don't use a word processor like Abiword, OpenOffice Writer, LibreOffice Writer, or so on). All editors I've used retain the executable flag, so you won't need to run the chmod command again; you can just save your changes to the script, and run the ~/merge-playlist command immediately.

If you seriously need the mawk command to work on a single command line, then you can use this:

Code:

mawk 'BEGIN { RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" ; FS = "[\t]" ; ORS = "\n" ; file = 0 }
      (FNR == 1) { file++ }
      { key = tolower($1 "|" $2) ; gsub(/[^0-9a-z|]+/, "", key) }
      (file == 1) { playlist[key] }
      (file > 1) { if (key in playlist) printf("%s%s", $0, ORS); else printf("%s%s", $0, ORS) > "/dev/stderr" }' Playlist Music-List > New-List

You can either keep it as it is -- because the script part is in single quotes, it will be parsed as a single command even if it is on more than one line -- or just omit the newlines, putting it all on a single very long line. Both will work exactly the same. I basically just omitted all comments and added semicolons to separate statements, and that's about it.

Hope you find this useful, and cut some slack to grail and David the H.; they too were just trying to help you.

Seemoi · 06-22-2012, 11:35 PM

Once again, you are generous with your time... it's much appreciated.

Thank you so much.

I'll take a pass on a reply to your comment to cut slack... you were the only one who offered an actual contribution towards a problem.

Best regards.

Seemoi.

[Solved]

David the H. · 06-24-2012, 10:03 AM

Nominal Animal is correct. We get a lot of poorly-worded requests here, so perhaps we sometimes get a little impatient, but we really just want to clarify the requirements first so that we can give the most appropriate solutions, instead of trying to guess what you really want.

We also hesitate to simply give out complete scripting solutions, as we know you'll learn more by doing it yourself. We expect you to do as much as you can on your own, and to come back for more help whenever you get stuck. We volunteer our time here as guides and helpers, not tech support.

To get the maximum benefit in help forums like this, please read Eric S. Raymond's excellent How To Ask Questions The Smart Way when you have the time.

grail · 06-24-2012, 01:23 PM

+1 to both David and NA. I have provided too many solutions to what I 'thought' was where the question was going, only to find it had nothing to do with my solution.

Also, sorry if my replies sounded harsh.