Is there a common technique for detecting duplicate records based only on key fields? I figure that this should be possible with tools like sort, uniq, cut, and join, but the fact that I can only join on a single field seems to complicate matters.
For example, I work with lots of files where the fields are tab-delimited and the records are newline delimited (they are dumps of database tables). Sometimes, when I want to bulk-load data into a table, the operation will be refused because some rows I am trying to load would cause a primary key violation with other rows already in the table. I want to be able to quickly identify what rows in the file I am trying to load are duplicates.
For example, suppose I have two data files, each with three fields: color, shape, and size. Let's further assume that color and shape form the primary key.
file 1 (new data)
Code:
green circle small
green square large
green triangle medium
red circle large
red triangle small
yellow square large
yellow triangle medium
file 2 (existing data)
Code:
purple circle small
purple triangle small
green circle medium
yellow rectangle large
yellow trapezoid small
red circle medium
pink circle large
If I sort both files together, I can run them through uniq with the -d -W 2 options. However, I am not sure whether that will show the duplicates from file 1 or from file 2. What I'd like is to be able to extract the duplicates from file 1 so I can examine them, and then remove them from file 1 so the remaining contents can be inserted into the table. Is there a simple pipeline that can do this, or am I just better off writing a small program?
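(One workaround for join's single-field limitation, sketched on the assumption that the dumps are tab-delimited and the first two fields form the key: glue the key fields into one synthetic column, join on that, and keep only the original columns of the new file. The scratch file names here are just illustrative.)
Code:
# prepend a composite key column so join(1) has a single field to match on;
# "\034" is just an unlikely byte used as glue between the two key fields
awk -F'\t' -v OFS='\t' '{print $1 "\034" $2, $0}' file1 | sort -t $'\t' -k1,1 > new.keyed
awk -F'\t' -v OFS='\t' '{print $1 "\034" $2, $0}' file2 | sort -t $'\t' -k1,1 > old.keyed
# records of file 1 whose key also exists in file 2 (the would-be violations);
# -o keeps only file 1's original three columns from the example
join -t $'\t' -j 1 -o 1.2,1.3,1.4 new.keyed old.keyed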
1. Pull the key fields out of each file into sorted key lists, k1 and k2 (for example, cut -f1,2 file1 | sort > k1, and the same for file 2).
2. Generate the set of all keys in both files:
Code:
cat k1 k2 | sort | uniq > k_all
3. Now diff k1 against the full key set; the ">" lines are the keys that appear in k2 but not in k1:
Code:
diff k1 k_all
0a1
>
3a5,7
> pink circle
> purple circle
> purple triangle
5a10
> yellow rectangle
6a12
> yellow trapezoid
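If the key files are kept sorted, the keys that occur in *both* of them -- the actual collisions -- can also be pulled out in one step with comm (just a sketch):
Code:
# -1 and -2 suppress the keys unique to k1 and k2, leaving only the shared keys
comm -12 k1 k2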
'Hope that helps .. PSM
PS:
I also hope this isn't a homework assignment or anything ;-)
Well, my school days are long over.
Actually neither of the proposed solutions really accomplishes what I wanted. I guess I wasn't entirely clear.
paulsm4, I initially tried doing something very similar to what you did by creating the index files. However, the drawback I encountered is that I want to end up with the *entire* record from the first dump that was a duplicate, not just the keys. So in my example, I wanted:
Code:
green circle small
red circle large
as the result. I initially thought join would help, but for some reason, it seems you can only join on a single field.
archtoad6, your example will return the unique union of both the old and new sets of data. This is ultimately what I want to be in the database in most cases. However, I often am not able to drop the existing data for one reason or another.
I think I understand the man pages for these tools pretty well-- I just think that maybe I am missing some obvious way of applying them together to accomplish what I want to do. On the other hand, it may be that what I want to do requires a full-blown script instead of a simple pipeline.
The closest I have been able to get is this awk script:
Code:
#!/bin/bash
#Sort the existing table dump and the new data by primary key (fields 1 and 2).
#The -s flag makes the sort stable, so duplicate keys from file1.txt (listed second)
#follow the matching records from file2.txt.
#The awk script then prints only those duplicates from file1.txt.
sort -s -k 1,2 file2.txt file1.txt | awk '{if ( ($1,$2) in set) print $0; else set[$1,$2] = 1;}'
The basic problem I have with doing this is that it is somewhat messy-- if I have a new dump with 4 fields for the keys, I have to edit this pipeline in 3 different places (once in the sort and twice in the awk).
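For what it's worth, here is a rough sketch of how the key count might be factored out so it only needs changing in one place, assuming the key is always the leading fields. The nk variable and the SUBSEP-joined subscript are just one way to do it:
Code:
# nk = number of leading key fields; the sort range and the awk subscript both derive from it
nk=2
sort -s -k 1,$nk file2.txt file1.txt \
  | awk -v nk=$nk '{ k = $1; for (i = 2; i <= nk; i++) k = k SUBSEP $i;
                     if (k in set) print; else set[k] = 1 }'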
The reason I want the entire set of bad lines is so I can just do:
Code:
$ sort bad_lines.txt new_lines.txt | uniq -u
in order to just pull out the lines that can be added without a problem. This is sometimes useful if a customer can't resist going in and entering data even though I told them to wait until I finished loading their tables.
Just a thought: when you're munging your input files, why not get sed (or similar) to add the filename to each line? If you can persuade your matching technology to only match on the first two fields, it'll be easy to see where the duplicates are.
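A minimal sketch of that idea, using awk's FILENAME rather than sed and assuming the same tab-delimited layout with the key in the first two fields:
Code:
# tag each record with its source file, then sort on the original key fields
# (now fields 2 and 3); duplicate keys end up adjacent, and the first column
# shows which file each copy came from
awk -v OFS='\t' '{print FILENAME, $0}' file2.txt file1.txt | sort -s -t $'\t' -k2,3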
I cobbled together a little bash script to write out my awk script. I am somewhat new to shell scripting, so I would like feedback if I am doing things the hard way. For example, I couldn't figure out how to access the first and second keys of the range without turning them into a bash array and then pulling out the values. It seems to me that maybe there is an easier way to do that.
The script seems to work, though I haven't tested it very thoroughly. For the future, I'd like to be able to give it multiple key ranges, somewhat like the sort command.
Thanks for all the advice.
Code:
#! /bin/bash

usage ()
{
    echo "Usage: $0 [-k keys] oldfile newfile"
}

process_key_range ()
{
    local range len start end
    range=($(echo $1 | tr ',' ' '))
    len=${#range[@]}
    if [ $len -gt 2 ]; then
        echo "-k option takes the form \"start-key[,end-key]\"" >&2
        usage >&2
        exit 1
    fi
    if [ $len -eq 2 ]; then
        start=${range[0]}
        end=${range[1]}
        if [ $start -gt $end ]; then
            echo "Illegal range for -k option." >&2
            usage >&2
            exit 1
        fi
        echo ${range[@]}
        return 0
    fi
    echo "${range[0]} ${range[0]}"
}

enum_range ()
{
    local range start end count flag
    range=($1)
    start=${range[0]}
    end=${range[1]}
    count=$start
    flag=0
    while [ $count -le $end ]
    do
        if [ $flag -eq 1 ]; then
            echo -n ","
        fi
        flag=1
        echo -n "\$$count"
        count=$(($count + 1))
    done
}

#Set default key range.
keys="1 1"

#Process options.
while getopts :k:v opt
do
    case $opt in
        k)   keys=$(process_key_range "$OPTARG") || exit 1  # exit inside $( ) only ends the subshell, so propagate the failure
             ;;
        v)   echo "Version 1.0"
             exit 0
             ;;
        '?') echo "$0: invalid option -$OPTARG" >&2
             usage >&2
             exit 1
             ;;
    esac
done
shift $((OPTIND - 1))

#Process arguments.
if [ $# -ne 2 ]; then
    usage
    exit 1
fi
old_file=$1
new_file=$2

awk_keys=$(enum_range "$keys")
arr_keys=($keys)
start_key=${arr_keys[0]}
end_key=${arr_keys[1]}

cmdline="sort -s -k $start_key,$end_key $old_file $new_file | awk "\'"{if ( ($awk_keys) in set) print \$0; else set[$awk_keys] = 1;}"\'
eval $cmdline
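As an aside on the array question: the two ends of a "start,end" range can also be split with plain parameter expansion, no array needed. A small illustration:
Code:
keys="1,2"
start_key=${keys%%,*}   # strip from the first comma onward  -> 1
end_key=${keys##*,}     # strip up to the last comma          -> 2
# if there is no comma at all, both expansions just return the whole string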
Originally posted by archtoad6: I'm sure there is a simpler way.
Q: Who are the users of your script?
Just you? -- Then why all the syntax checking? After all, you wrote it.
Others? -- How sophisticated are they on the CLI? It could use more bulletproofing.
Let me try to restate the background of the problem:
1. You are working on several databases.
2. These db's have different key structures.
3. You do not work directly on the db's, but with text file dumps.
4. You need to purge new data of records that would cause primary key violations.
5. You want to list the offending records.
6. You want to remove the offending records from the new data.
7. Ultimately, you want a file of "clean" (purged) new data to add to the db.
8. The files you want could be called "OLD", "NEW", "BAD", "ADD".
9. As usual, the users cannot be trusted to leave things alone during maintenance.
More Q's:
Anything to add or subtract from the above analysis?
How much of this would be solved if #9 were fixed?
Can't you lock the appropriate files during updates?
I would be the primary user. I threw in some syntax checking mostly out of habit.
For the most part, I will arrange with a client ahead of time that they must be out of their system before I load their new data. However, several of my colleagues have already found themselves in situations where someone was adding records via the application software when they shouldn't have, and that can really throw a wrench in the works.
For at least part of the year (normally the summer months), I spend a lot of time migrating data from customers' old application software to our company's software. It's not really an exact science, so sometimes a conversion will have errors, and I might have actually put in duplicate rows myself for one reason or another, which is why I like to see what the bad rows are first. Sometimes seeing what is duplicated can give some insight into what went wrong.
Of course, the need for this script would probably not be as great if #9 were fixed, but I would still find it useful. Sometimes a SQL query is more appropriate, but sometimes it seems like it is easier to analyze the data dumps for problems.
It may be possible to lock users out of the database in some scenarios. In others, I am not sure it would be a good idea, as the database is shared by distinct groups, and not everyone is affected by the data being converted.
Lots of comments -- my habit. In fact, almost too many, but maybe someone else can learn from them.
Trying to have nothing that might need editing in more than one place -- similar to using variables.
Attempting to self-document. -- Ergo "USAGE" & "_usage"
Different naming style. -- I get the impression that bash programmers like uppercase variable names.
Similar variables have names of the same (short) length.
Possibly 2 or 3 unnecessary functions.
Narrower indents.
IMO, simpler commands -- i.e. ones I know or want to learn. (Like awk.)
Code:
#! /bin/bash
#0 db_purge === compare & purge text files from database dump
USAGE=" Usage: $0 database | [keys] oldfile newfile"   # self documenting
#@ f.a.archibald.iii copyright (c) 2005 all rights reserved

#echo -e "$0\t$# -- $@\n\n"   # debugging

# ===== Functions =====
_usage () {
    echo "$USAGE"
    exit
}

_no_db () {
    echo -e " Sorry, you have not added database \"$1\" to this script.\n"
}

_files () {
    OLD="$1"        # $OLD is the dump of the existing data
    NEW="$2"        # $NEW is the dump of the data to purge (& add)
    BAD="$NEW.bad"  # $BAD is the bad data from $NEW to be purged
    ADD="$OLD.add"  # $ADD is the good data from $NEW to be added
    TMP="unique"    # $TMP is the unique old & added data,
                    # it predicts the dump of data after $ADD has been added
    #End of files
    FILES=`awk '$0 == "_files () {", $1 == "#End"' $0 \
           | awk -F= '$0 ~ "=" {print $1}' `
}

_list_files () {
    #echo -e "OLD=$OLD \t NEW=$NEW \t BAD=$BAD \t ADD=$ADD"
    # This may be too cute,
    # but it avoids remembering to edit this function if _files () is changed.
    for F in $FILES
    do
        echo -ne "$F=${!F}\t"
    done
    echo
}

_show_files () {
    for F in $FILES
    do
        less -~ ${!F}
    done
}

# ===== Main =====
case $# in
    1)  case $1 in
            test) _files "file2" "file1" ;;
            *)    _no_db $1; _usage ;;
        esac ;;
    2)  _files $1 $2 ;;
    3)  KEYS="$1"
        shift
        _files $1 $2 ;;
    *)  _usage ;;
esac ; _list_files   # Debugging & info -- kill w/ '#' after ';'

# place ALL the unique (good) data in $TMP
cat $OLD $NEW | sort -suk1,2 > $TMP

# place the bad data in $BAD
echo -e "\n\nBad data: \t\t(Also stored in $BAD)\n"
cat $OLD $NEW | sort -sk1,2 | diff - $TMP | sed -n '/^[<>]/s,^[<>] ,,p' | tee $BAD
# cat & sort put all the data on STDIN, sorted
# "diff - $TMP" compares STDIN (-) to $TMP
# "sed ... " removes diff artifacts leaving only the bad data
# tee displays & stores the bad data

# place the unique (good) data to add in $ADD
echo -e "\n\nGood data: \t\t(Also stored in $ADD)\n"
diff $NEW $TMP | sed -n '/^>/s,^> ,,p' | tee $ADD
# diff'ing the new w/ all the good gives the additions
# sed cleans the output
# tee displays & stores

# uncomment next line to always view files at finish
#echo -e "\nReady to show files" ; read ; _show_files
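For reference, assuming the script is saved as db_purge (the name in its header comment), it can be run against the example dumps either through the built-in "test" shortcut or by naming the files explicitly:
Code:
./db_purge test            # shorthand wired to file2/file1 in the case statement
./db_purge file2 file1     # same thing, oldfile and newfile given explicitly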
Thanks. It will probably take me a little while to figure out everything you are doing, though. For example, the _files() function had me scratching my head for a while until I realized you were having the script read itself. That is a somewhat new twist in my experience. Also, I was not aware of the ${!variable} syntax. After some time, I realized you must be doing the equivalent of an eval, but I had trouble finding the syntax in any of my books on shell scripting. I found a small paragraph reference to it in O'Reilly's "Learning Bash". I haven't made it through the diff and sed parts yet, and I still haven't figured out what the KEYS variable is used for, but it is an interesting approach.
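For reference, ${!name} is bash's indirect expansion: the parameter is taken as the *name* of another variable, and that variable's value is substituted. A two-line illustration (the file name is made up):
Code:
OLD="dump_old.txt"
F="OLD"
echo "${!F}"    # prints the value of the variable named by $F, i.e. dump_old.txt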
Originally posted by paulsm4: One silly question: doesn't your DB have a "unique" constraint, so that duplicate records cannot be inserted in the first place ;-)?
Yes, but the command for bulk loading the data will fail if there are any errors (no rows will be loaded). In some cases, just being able to say, "load the data you can, ignore the rest" would be ideal. Other times, it's better to know exactly why you were getting duplicates in the first place.
Personally, given how fiddly this is getting, I'd write a Perl program to analyse the flat files. Given that hashes are a basic data type in Perl, checking for duplicates is trivial.
You could also get it to loop over the records and insert them into the DB one at a time (it's a quick language), logging any that get rejected.
I've done similar stuff in the past, and it's the sort of thing Perl is very good at.