LinuxQuestions.org
Old 06-21-2012, 09:40 AM   #1
Seemoi
LQ Newbie
 
Registered: Feb 2009
Posts: 29

Rep: Reputation: 1
Please help compare 2 files via awk/sed


Using Debian testing, mawk/sed

Objective:
Match "Title", "Artist" of a song in file1: "Playlist", to
"Title", "Artist" in file2: "Music-List". Both lists are Tab delimited.

Print the result only if the Song Title and Artist match the Playlist, otherwise print "Not Found"

Playlist: <artist> <title>
  • ZZ Top Tush
  • Peter Gabriel Sledgehammer (Remix)

Music-List: <artist> <title> <bitrate> <path>
  • ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3
  • Peter Gabriel 32000 /Music/P/Peter Gabriel - Sledgehammer.mp3
  • Cactus Jack Tush 128000 /Music/C/Cactus Jack - Tush.mp3

Others' suggestions:
1. Use an awk index to avoid special characters (-.[) etc.
2. Use an awk array for the Playlist file <artist> <title>

For example, the Song "Tush" appears 2 times with different artists,
if a line in Playlist searches for "Tush", it should also check for
the Artist "ZZ Top", to avoid accidentally printing the other "Tush" song by Cactus Jack.

Could someone please recommend a one-liner to accomplish this?

Thank you much.

Last edited by Seemoi; 06-21-2012 at 12:11 PM. Reason: Corrections
 
Old 06-21-2012, 10:34 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.


Unless the files are formatted with a clear delimiter between the fields that's not found anywhere in the text itself, I can't see any easy or reliable way to accomplish this. It will certainly take more than a one-liner.

On the other hand, it would probably be trivial to do if the input was, say, tab-delimited.

BTW, is this a homework question? It reads like one.

Quote:
Do not post homework assignments verbatim. We're happy to assist if you have specific questions or have hit a stumbling point, however. Let us know what you've already tried and what references you have used (including class notes, books, and searches) and we'll do our best to help. Keep in mind that your instructor might also be an LQ member.
http://www.linuxquestions.org/linux/rules.html

Last edited by David the H.; 06-21-2012 at 10:41 AM. Reason: expandum postum
 
Old 06-21-2012, 11:13 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
I would also be curious what output you would expect from your example. Assuming, as David has mentioned, that an appropriate delimiter is used, the only match to your criteria would be the ZZ Top entry ... is this correct?
 
Old 06-21-2012, 12:20 PM   #4
Seemoi
LQ Newbie
 
Registered: Feb 2009
Posts: 29

Original Poster
Rep: Reputation: 1
No code was entered.
Both lists are Tab delimited.
This is not homework, I am consolidating many music playlists.

Expected result? OK...

Playlist Entry: ZZ Top Tush
For each entry in Playlist...

*Search field 2 "Tush" in Music-List field 2
*If matched, next check for a match for "ZZ Top" in field 1 of Music-List
If both title and artist match (in Music-List), print the line in Music-List

Result: ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3
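
The lookup described above, together with the "Not Found" requirement from the first post, can be sketched roughly as follows. This is only an illustration under assumptions: the sample data is hypothetical (the Peter Gabriel line has its title filled in), the match is exact and case-sensitive, and if the same artist/title pair appears twice in Music-List only the last line is kept.

```shell
#!/bin/sh
# Hypothetical tab-delimited sample data; the Peter Gabriel title is filled in here.
tmp=$(mktemp -d)
printf 'ZZ Top\tTush\nPeter Gabriel\tSledgehammer (Remix)\n' > "$tmp/Playlist"
{
    printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\n'
    printf 'Peter Gabriel\tSledgehammer\t32000\t/Music/P/Peter Gabriel - Sledgehammer.mp3\n'
    printf 'Cactus Jack\tTush\t128000\t/Music/C/Cactus Jack - Tush.mp3\n'
} > "$tmp/Music-List"

# Read Music-List first and remember each line under its artist+title key.
# Then walk the Playlist and print the stored line, or a "Not Found" note.
found=$(awk -F'\t' '
    NR == FNR { line[$1 FS $2] = $0; next }
    {
        k = $1 FS $2
        if (k in line) print line[k]
        else           print "Not Found: " $1 " - " $2
    }
' "$tmp/Music-List" "$tmp/Playlist")

printf '%s\n' "$found"
rm -rf "$tmp"
```

Because the key is built from both fields, the Cactus Jack "Tush" is never confused with the ZZ Top one. "Sledgehammer (Remix)" is reported as not found here, since an exact comparison does not ignore the annotation.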

Hope that helps.
Thank you.

Last edited by Seemoi; 06-21-2012 at 12:22 PM.
 
Old 06-21-2012, 01:59 PM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
Quote:
No code was entered.
Actually, if you read a little closer, David said that both code and data benefit from being placed in code tags.
Quote:
Both lists are Tab delimited.
This is an important detail to omit, and the tabs would have been preserved had you used code tags.
Quote:
Result: ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3
So just to confirm again, based on the following input data:
Code:
$ cat playlist
ZZ Top Tush
Peter Gabriel Sledgehammer (Remix)
$ cat music-list
ZZ Top Tush 32000 /Music/Z/ZZ Top - Tush.mp3
Peter Gabriel 32000 /Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack Tush 128000 /Music/C/Cactus Jack - Tush.mp3
Then the result is still only ZZ Top?

Assuming all the above is correct, I would say awk is your friend, and that there are several examples on LQ, never mind the rest of the net.
Standard format is:

1. Store the relevant fields of every record from the first comparison file in an array
2. For each record in the second file, check whether its corresponding fields are in the array
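
As a rough illustration of those two steps, here is a minimal sketch using the common NR == FNR idiom. The file names and sample data are hypothetical, the comparison is exact (no tolerance for typos or case differences), and it assumes the first file is not empty.

```shell
#!/bin/sh
# Hypothetical sample data, tab-delimited, modeled on the example in this thread.
tmp=$(mktemp -d)
printf 'ZZ Top\tTush\nPeter Gabriel\tSledgehammer (Remix)\n' > "$tmp/playlist"
{
    printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\n'
    printf 'Cactus Jack\tTush\t128000\t/Music/C/Cactus Jack - Tush.mp3\n'
} > "$tmp/music-list"

# NR == FNR is true only while awk is reading the first file.
# Step 1: remember every artist/title pair from the playlist.
# Step 2: print music-list lines whose first two fields were remembered.
matches=$(awk -F'\t' '
    NR == FNR { seen[$1 FS $2]; next }
    ($1 FS $2) in seen
' "$tmp/playlist" "$tmp/music-list")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

Merely referencing seen[...] creates the array element, so no value needs to be assigned; the second rule has no action, which means matching lines are printed as they are.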

Here is a link for the gawk manual if you do not already have one:

http://www.gnu.org/software/gawk/man...ode/index.html
 
Old 06-21-2012, 04:10 PM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Code:
#!/usr/bin/mawk -f
BEGIN {
    # Each line (using any convention) is a separate record.
    # Also remove any leading and trailing whitespace on a line.
    RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

    # Fields are separated by a single tab character.
    FS = "[\t]"

    # For output, use linefeeds.
    ORS = "\n"

    # Input file number.
    file = 0
}

# Increase input file number before processing its first record.
(FNR == 1) {
    file++
}

# Compute an associative array key from the artist and song names (for all input files).
{
    # Start with "artist" "|" "song", converted to lower case.
    key = tolower($1 "|" $2)

    # Remove all non-alphanumeric characters from the key.    
    gsub(/[^0-9a-z|]+/, "", key)
}

# First file is the playlist we compare against. Remember the keys seen here.
(file == 1) {
    playlist[key]
}

# All other files are music lists.
(file > 1) {
    # Output the record to standard output, if this artist-song
    # was listed in the initial playlist.
    # Otherwise, output the record to standard error.
    if (key in playlist)
        printf("%s%s", $0, ORS)
    else
        printf("%s%s", $0, ORS) > "/dev/stderr"
}
The script takes two or more file names. The first names the file containing just the artist and song names. For all the other files, the script will output the record to standard output if the artist and song was named in the first file, and to standard error otherwise.

The idea is that the first two fields from each record in the first file are saved as keys in the associative array playlist. Usually, there are some typos in the names, so the script converts the array key to lowercase, then removes everything except digits, letters, and the pipe character. The pipe character keeps the artist and song names separate (so that "Artist" "Song Name" and "Artist Song" "Name" are distinguishable from each other).
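
To see what the key computation does, the two key lines from the script can be run on their own. The input names below are made up for illustration; the tolower() and gsub() calls are the ones from the script above.

```shell
#!/bin/sh
# Four hypothetical artist<TAB>title records, fed through the key computation only.
keys=$(printf 'ZZ Top\tTush\nzz top\tTUSH!\nArtist\tSong Name\nArtist Song\tName\n' |
awk -F'\t' '{
    key = tolower($1 "|" $2)          # "artist|song" in lower case
    gsub(/[^0-9a-z|]+/, "", key)      # drop everything but digits, letters and the pipe
    print key
}')

printf '%s\n' "$keys"
# Prints:
#   zztop|tush
#   zztop|tush
#   artist|songname
#   artistsong|name
```

The first two records collapse to the same key despite the differences in case and punctuation, while the last two stay distinct because the pipe marks where the artist ends.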

Given first file
Code:
ZZ Top	Tush
Peter Gabriel	Sledgehammer (Remix)
and second file
Code:
ZZ Top	Tush	32000	/Music/Z/ZZ Top - Tush.mp3
Peter Gabriel	32000	/Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack	Tush	128000	/Music/C/Cactus Jack - Tush.mp3
the script will output to standard output
Code:
ZZ Top	Tush	32000	/Music/Z/ZZ Top - Tush.mp3
and to standard error
Code:
Peter Gabriel	32000	/Music/P/Peter Gabriel - Sledgehammer.mp3
Cactus Jack	Tush	128000	/Music/C/Cactus Jack - Tush.mp3
Note that the "Peter Gabriel" entry in the second file is missing the song name; the script sees "Peter Gabriel" as the artist and "32000" as the song name, so it does not match "Sledgehammer (Remix)" in the playlist.
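
Records like that are easy to overlook in a long list. A quick check on the field count can flag them before matching; this is a sketch that assumes every Music-List record should have exactly four tab-separated fields (artist, title, bitrate, path), as in the example data.

```shell
#!/bin/sh
# Two sample records; the second one is missing its title field.
bad=$(printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\nPeter Gabriel\t32000\t/Music/P/Peter Gabriel - Sledgehammer.mp3\n' |
awk -F'\t' 'NF != 4 { printf("line %d: expected 4 fields, found %d: %s\n", NR, NF, $0) }')

printf '%s\n' "$bad"
```

Only the Peter Gabriel record is reported, as line 2 with 3 fields.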

To get better results in practice, I'd edit the key string. For example, if the playlist has annotations in parentheses or brackets you want to ignore, you could replace the key code with
Code:
# Compute an associative array key from the artist and song names (for all input files).
{
    # Start with "artist" "|" "song", converted to lower case.
    key = tolower($1 "|" $2)

    # Remove stuff in parentheses
    gsub(/ *\([^\)]*\) */, " ", key)

    # Remove stuff in brackets
    gsub(/ *\[[^\]]*\] */, " ", key)

    # Remove all non-alphanumeric characters from the key.    
    gsub(/[^0-9a-z|]+/, "", key)
}
With a bit of hacking on the above, depending on how much variance and how many typos there are in your playlists, you can probably save yourself a lot of hand-editing.
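
As a quick check of the annotation-stripping key, the sketch below runs three hypothetical spellings of the same song through it. The bracket expressions are written as [^)] and [^]] here, without the backslashes; that is an equivalent spelling chosen for portability across awk implementations, not a change in behavior.

```shell
#!/bin/sh
# Three hypothetical variants of one song, differing only in their annotations.
keys=$(printf 'Peter Gabriel\tSledgehammer (Remix)\nPeter Gabriel\tSledgehammer [Live]\nPeter Gabriel\tSledgehammer\n' |
awk -F'\t' '{
    key = tolower($1 "|" $2)
    gsub(/ *\([^)]*\) */, " ", key)   # remove stuff in parentheses
    gsub(/ *\[[^]]*\] */, " ", key)   # remove stuff in brackets
    gsub(/[^0-9a-z|]+/, "", key)      # remove all other non-alphanumerics
    print key
}')

printf '%s\n' "$keys"
```

All three lines come out as petergabriel|sledgehammer, so a playlist entry "Sledgehammer (Remix)" would match a plain "Sledgehammer" in the music list. Whether that is desirable depends on whether remixes should count as the same song.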

Questions? Comments?
 
Old 06-22-2012, 12:53 PM   #7
Seemoi
LQ Newbie
 
Registered: Feb 2009
Posts: 29

Original Poster
Rep: Reputation: 1
Nominal Animal,

Thank you for helping me without reprimand and non-answers.
I very much appreciate it.

I am not a programmer so I will have to study what you did to try and get this into
one line. It's been difficult transitioning from using a FileMaker database to accomplish this.

I realized the omission for the Peter Gabriel too late, thanks for catching that.
I've tried so many incarnations that didn't work so far...

cat Playlist | while read z; do tit="$z" ; awk -v title="$tit" '{FS = "\t"} $2 ~ title {print $0}' Music-List ; done

awk -F'\t' '{for(N in var){if(index($2,N)){print; next}}}' Playlist Music-List

awk -F'-' 'NR==FNR{art=$1;gsub(/[()-%$@]|^[ ]*/,"",$2);tit=$2;next}{l=split($0,a,"\t");gsub(/[()-%$@]/,"",a[2]);mas=a[1];for(i=2;i<=l;i++)mas=mas" "a[i];if(mas~tit)if(mas~art)print a[l];else print "No Match"}' Playlist Music-List

Thanks for your time in helping.

Regards,
Seemoi

Last edited by Seemoi; 06-22-2012 at 02:41 PM.
 
Old 06-22-2012, 07:58 PM   #8
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
You're welcome, Seemoi.

It is difficult to decipher what is asked when the data and inputs do not match the description or the intent. I basically just guessed.

While the responses here may read like they were reprimands, they really were just requests for clarification. While I may be overstepping some social bounds, I can guarantee you that both grail and David the H. only wanted to help you, but found your description frustratingly difficult to understand.

Quote:
Originally Posted by Seemoi View Post
I am not a programmer so I will have to study what you did to try and get this into one line.
If you save the entire script from my previous post, just as written, into a file named, say, merge-playlist in your home directory (not the desktop, your home directory), you can make it executable by running the command
Code:
chmod a+x ~/merge-playlist
once; one time only. The ~/ refers to your home directory.

Then you can simply run
Code:
~/merge-playlist Playlist Music-List > New-List
or even
Code:
~/merge-playlist Playlist Music-List1 Music-List2 Music-List3 > Combined-List
to save all the Music-List entries that match the Playlist into New-List or Combined-List. All the Music-List entries that were not listed in the Playlist (and are not saved in New-List or Combined-List) will be shown on-screen.
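
If you would rather keep the unmatched entries than just see them on-screen, standard error can be redirected to its own file with 2>. The sketch below is self-contained, so it uses a much-simplified stand-in for the script (exact matching only) and hypothetical file names; with the real script the redirections would be used the same way.

```shell
#!/bin/sh
# Hypothetical sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'ZZ Top\tTush\n' > "$tmp/Playlist"
{
    printf 'ZZ Top\tTush\t32000\t/Music/Z/ZZ Top - Tush.mp3\n'
    printf 'Cactus Jack\tTush\t128000\t/Music/C/Cactus Jack - Tush.mp3\n'
} > "$tmp/Music-List"

# Matches go to standard output, everything else to standard error.
# "> file" captures the former, "2> file" the latter.
awk -F'\t' '
    NR == FNR { seen[$1 FS $2]; next }
    ($1 FS $2) in seen { print; next }
    { print > "/dev/stderr" }
' "$tmp/Playlist" "$tmp/Music-List" > "$tmp/New-List" 2> "$tmp/Unmatched-List"

matched=$(cat "$tmp/New-List")
unmatched=$(cat "$tmp/Unmatched-List")
printf 'matched:   %s\nunmatched: %s\n' "$matched" "$unmatched"
rm -rf "$tmp"
```

Here New-List ends up with the ZZ Top line and Unmatched-List with the Cactus Jack line.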

You can even edit the script (using gedit, emacs, vim, nano, or any text editor you wish -- just don't use a word processor like Abiword, OpenOffice Writer, LibreOffice Writer, or so on). All editors I've used retain the executable flag, so you won't need to run the chmod command again; you can just save your changes to the script, and run the ~/merge-playlist command immediately.

If you seriously need the mawk command to work on a single command line, then you can use this:
Code:
mawk 'BEGIN { RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" ; FS = "[\t]" ; ORS = "\n" ; file = 0 }
      (FNR == 1) { file++ }
      { key = tolower($1 "|" $2) ; gsub(/[^0-9a-z|]+/, "", key) }
      (file == 1) { playlist[key] }
      (file > 1) { if (key in playlist) printf("%s%s", $0, ORS) else printf("%s%s", $0, ORS) > "/dev/stderr" }' Playlist Music-List > New-List
You can either keep it as it is (because the script part is in single quotes, it will be parsed as a single command even if it is on more than one line) or just omit the newlines, putting it all on a single very long line. Both will work exactly the same. I basically just omitted all comments and added semicolons to separate statements, and that's about it.

Hope you find this useful, and cut some slack to grail and David the H.; they too were just trying to help you.
 
Old 06-22-2012, 11:35 PM   #9
Seemoi
LQ Newbie
 
Registered: Feb 2009
Posts: 29

Original Poster
Rep: Reputation: 1
Once again, you are generous with your time... it's much appreciated.

Thank you so much.

I'll take a pass on a reply to your comment to cut slack... you were the only one who offered an actual contribution towards a problem.

Best regards.

Seemoi.

[Solved]
 
Old 06-24-2012, 10:03 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Nominal Animal is correct. We get a lot of poorly-worded requests here, so perhaps we sometimes get a little impatient, but we really just want to clarify the requirements first so that we can give the most appropriate solutions, instead of trying to guess what you really want.

We also hesitate to simply give out complete scripting solutions, as we know you'll learn more by doing it yourself. We expect you to do as much as you can on your own, and to come back for more help whenever you get stuck. We volunteer our time here as guides and helpers, not tech support.

To get the maximum benefit in help forums like this, please read Eric S. Raymond's excellent How To Ask Questions The Smart Way when you have the time.
 
Old 06-24-2012, 01:23 PM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191
+1 to both David and NA. I have provided too many solutions to what I 'thought' was where the question was going, only to find it had nothing to do with my solution.

Also, sorry if my replies sounded harsh.
 
  



Tags
array, awk, index, sed, shell


