Is there a common technique for detecting duplicate records based only on key fields? I figure that this should be possible with tools like sort, uniq, cut, and join, but the fact that I can only join on a single field seems to complicate matters.
For example, I work with lots of files where the fields are tab-delimited and the records are newline delimited (they are dumps of database tables). Sometimes, when I want to bulk-load data into a table, the operation will be refused because some rows I am trying to load would cause a primary key violation with other rows already in the table. I want to be able to quickly identify what rows in the file I am trying to load are duplicates.
For example, suppose I have two data files, each with three fields: color, shape, and size. Let's further assume that color and shape form the primary key.
file 1 (new data)
Code:
green circle small
green square large
green triangle medium
red circle large
red triangle small
yellow square large
yellow triangle medium
file 2 (existing data)
Code:
purple circle small
purple triangle small
green circle medium
yellow rectangle large
yellow trapezoid small
red circle medium
pink circle large
If I sort both files together, I can run them through uniq with the -d -W 2 options. However, I am not sure whether that will show the duplicates from file 1 or from file 2. What I'd like is to be able to extract the duplicates from file 1 so I can examine them, and then remove them from file 1 so the remaining contents can be inserted into the table. Is there a simple pipeline that can do this, or am I just better off writing a small program?
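(One workaround for join's single-field limitation, sketched on the assumption that the dumps are tab-delimited and the first two fields form the key: glue the key fields into one synthetic column, join on that, and keep only the original columns of the new file. The scratch file names here are just illustrative.)
Code:
# prepend a composite key column so join(1) has a single field to match on;
# "\034" is just an unlikely byte used as glue between the two key fields
awk -F'\t' -v OFS='\t' '{print $1 "\034" $2, $0}' file1 | sort -t $'\t' -k1,1 > new.keyed
awk -F'\t' -v OFS='\t' '{print $1 "\034" $2, $0}' file2 | sort -t $'\t' -k1,1 > old.keyed
# records of file 1 whose key also exists in file 2 (the would-be violations);
# -o keeps only file 1's original three columns from the example
join -t $'\t' -j 1 -o 1.2,1.3,1.4 new.keyed old.keyed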
1. Pull the key fields out of each file into sorted key lists, k1 and k2 (for example, cut -f1,2 file1 | sort > k1, and the same for file 2).
2. Generate the set of all keys in both files:
Code:
cat k1 k2 | sort | uniq > k_all
3. Now diff k1 against the full key set; the ">" lines are the keys that appear in k2 but not in k1:
Code:
diff k1 k_all
0a1
>
3a5,7
> pink circle
> purple circle
> purple triangle
5a10
> yellow rectangle
6a12
> yellow trapezoid
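If the key files are kept sorted, the keys that occur in *both* of them -- the actual collisions -- can also be pulled out in one step with comm (just a sketch):
Code:
# -1 and -2 suppress the keys unique to k1 and k2, leaving only the shared keys
comm -12 k1 k2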
'Hope that helps .. PSM
PS:
I also hope this isn't a homework assignment or anything ;-)
Well, my school days are long over.
Actually neither of the proposed solutions really accomplishes what I wanted. I guess I wasn't entirely clear.
paulsm4, I initially tried doing something very similar to what you did by creating the index files. However, the drawback I encountered is that I want to end up with the *entire* record from the first dump that was a duplicate, not just the keys. So in my example, I wanted:
Code:
green circle small
red circle large
as the result. I initially thought join would help, but for some reason, it seems you can only join on a single field.
archtoad6, your example will return the unique union of both the old and new sets of data. This is ultimately what I want to be in the database in most cases. However, I often am not able to drop the existing data for one reason or another.
I think I understand the man pages for these tools pretty well-- I just think that maybe I am missing some obvious way of applying them together to accomplish what I want to do. On the other hand, it may be that what I want to do requires a full-blown script instead of a simple pipeline.
The closest I have been able to get is this awk script:
Code:
#!/bin/bash
#Sort the existing table dump and the new data by primary key (fields 1 and 2).
#The -s flag makes the sort stable, so duplicate keys from file1.txt (listed second)
#follow the matching records from file2.txt.
#The awk script then prints only those duplicates from file1.txt.
sort -s -k 1,2 file2.txt file1.txt | awk '{if ( ($1,$2) in set) print $0; else set[$1,$2] = 1;}'
The basic problem I have with doing this is that it is somewhat messy-- if I have a new dump with 4 fields for the keys, I have to edit this pipeline in 3 different places (once in the sort and twice in the awk).
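For what it's worth, here is a rough sketch of how the key count might be factored out so it only needs changing in one place, assuming the key is always the leading fields. The nk variable and the SUBSEP-joined subscript are just one way to do it:
Code:
# nk = number of leading key fields; the sort range and the awk subscript both derive from it
nk=2
sort -s -k 1,$nk file2.txt file1.txt \
  | awk -v nk=$nk '{ k = $1; for (i = 2; i <= nk; i++) k = k SUBSEP $i;
                     if (k in set) print; else set[k] = 1 }'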
The reason I want the entire set of bad lines is so I can just do:
Code:
$ sort bad_lines.txt new_lines.txt | uniq -u
in order to just pull out the lines that can be added without a problem. This is sometimes useful if a customer can't resist going in and entering data even though I told them to wait until I finished loading their tables.
Just a thought: when you're munging your input files, why not get sed (or similar) to add the filename to each line? If you can persuade your matching technology to only match on the first two fields, it'll be easy to see where the duplicates are.
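A minimal sketch of that idea, using awk's FILENAME rather than sed and assuming the same tab-delimited layout with the key in the first two fields:
Code:
# tag each record with its source file, then sort on the original key fields
# (now fields 2 and 3); duplicate keys end up adjacent, and the first column
# shows which file each copy came from
awk -v OFS='\t' '{print FILENAME, $0}' file2.txt file1.txt | sort -s -t $'\t' -k2,3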
I cobbled together a little bash script to write out my awk script. I am somewhat new to shell scripting, so I would like feedback if I am doing things the hard way. For example, I couldn't figure out how to access the first and second keys of the range without turning them into a bash array and then pulling out the values. It seems to me that maybe there is an easier way to do that.
The script seems to work, though I haven't tested it very thoroughly. For the future, I'd like to be able to give it multiple key ranges, somewhat like the sort command.
Thanks for all the advice.
Code:
#! /bin/bash

usage ()
{
    echo "Usage: $0 [-k keys] oldfile newfile"
}

process_key_range ()
{
    local range len start end
    range=($(echo $1 | tr ',' ' '))
    len=${#range[@]}
    if [ $len -gt 2 ]; then
        echo "-k option takes the form \"start-key[,end-key]\"" >&2
        usage >&2
        exit 1
    fi
    if [ $len -eq 2 ]; then
        start=${range[0]}
        end=${range[1]}
        if [ $start -gt $end ]; then
            echo "Illegal range for -k option." >&2
            usage >&2
            exit 1
        fi
        echo ${range[@]}
        return 0
    fi
    echo "${range[0]} ${range[0]}"
}

enum_range ()
{
    local range start end count flag
    range=($1)
    start=${range[0]}
    end=${range[1]}
    count=$start
    flag=0
    while [ $count -le $end ]
    do
        if [ $flag -eq 1 ]; then
            echo -n ","
        fi
        flag=1
        echo -n "\$$count"
        count=$(($count + 1))
    done
}

#Set default key range.
keys="1 1"

#Process options.
while getopts :k:v opt
do
    case $opt in
        k)   keys=$(process_key_range "$OPTARG") || exit 1  # exit inside $( ) only ends the subshell, so propagate the failure
             ;;
        v)   echo "Version 1.0"
             exit 0
             ;;
        '?') echo "$0: invalid option -$OPTARG" >&2
             usage >&2
             exit 1
             ;;
    esac
done
shift $((OPTIND - 1))

#Process arguments.
if [ $# -ne 2 ]; then
    usage
    exit 1
fi
old_file=$1
new_file=$2

awk_keys=$(enum_range "$keys")
arr_keys=($keys)
start_key=${arr_keys[0]}
end_key=${arr_keys[1]}

cmdline="sort -s -k $start_key,$end_key $old_file $new_file | awk "\'"{if ( ($awk_keys) in set) print \$0; else set[$awk_keys] = 1;}"\'
eval $cmdline
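As an aside on the array question: the two ends of a "start,end" range can also be split with plain parameter expansion, no array needed. A small illustration:
Code:
keys="1,2"
start_key=${keys%%,*}   # strip from the first comma onward  -> 1
end_key=${keys##*,}     # strip up to the last comma          -> 2
# if there is no comma at all, both expansions just return the whole string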
Originally posted by archtoad6: I'm sure there is a simpler way.
Q: Who are the users of your script?
Just you? -- Then why all the syntax checking? After all, you wrote it.
Others? -- How sophisticated are they on the CLI? It could use more bulletproofing.
Let me try to restate the background of the problem:
1. You are working on several databases.
2. These db's have different key structures.
3. You do not work directly on the db's, but with text file dumps.
4. You need to purge new data of records that would cause primary key violations.
5. You want to list the offending records.
6. You want to remove the offending records from the new data.
7. Ultimately, you want a file of "clean" (purged) new data to add to the db.
8. The files you want could be called "OLD", "NEW", "BAD", "ADD".
9. As usual, the users cannot be trusted to leave things alone during maintenance.
More Q's:
Anything to add or subtract from the above analysis?
How much of this would be solved if #9 were fixed?
Can't you lock the appropriate files during updates?
I would be the primary user. I threw in some syntax checking mostly out of habit.
For the most part, I will arrange with a client ahead of time that they must be out of their system before I load their new data. However, several of my colleagues have already found themselves in situations where someone was adding records via the application software when they shouldn't have, and that can really throw a wrench in the works.
For at least part of the year (normally the summer months), I spend a lot of time migrating data from customers' old application software to our company's software. It's not really an exact science, so sometimes a conversion will have errors, and I might have actually put in duplicate rows myself for one reason or another, which is why I like to see what the bad rows are first. Sometimes seeing what is duplicated can give some insight into what went wrong.
Of course, the need for this script would probably not be as great if #9 were fixed, but I would still find it useful. Sometimes a SQL query is more appropriate, but sometimes it seems like it is easier to analyze the data dumps for problems.
It may be possible to lock users out of the database in some scenarios. In others, I am not sure it would be a good idea, as the database is shared by distinct groups, and not everyone is affected by the data being converted.
Lots of comments -- my habit. In fact, almost too many, but maybe someone else can learn from them.
Trying to have nothing that might need editing in more than one place -- similar to using variables.
Attempting to self-document. -- Ergo "USAGE" & "_usage"
Different naming style. -- I get the impression that bash programmers like uppercase variable names.
Similar variables have names of the same (short) length.
Possibly 2 or 3 unnecessary functions.
Narrower indents.
IMO, simpler commands -- i.e. ones I know or want to learn. (Like awk.)
Code:
#! /bin/bash
#0 db_purge === compare & purge text files from database dump
USAGE=" Usage: $0 database | [keys] oldfile newfile"   # self documenting
#@ f.a.archibald.iii copyright (c) 2005 all rights reserved

#echo -e "$0\t$# -- $@\n\n"   # debugging

# ===== Functions =====
_usage () {
    echo "$USAGE"
    exit
}

_no_db () {
    echo -e " Sorry, you have not added database \"$1\" to this script.\n"
}

_files () {
    OLD="$1"        # $OLD is the dump of the existing data
    NEW="$2"        # $NEW is the dump of the data to purge (& add)
    BAD="$NEW.bad"  # $BAD is the bad data from $NEW to be purged
    ADD="$OLD.add"  # $ADD is the good data from $NEW to be added
    TMP="unique"    # $TMP is the unique old & added data,
                    # it predicts the dump of data after $ADD has been added
    #End of files
    FILES=`awk '$0 == "_files () {", $1 == "#End"' $0 \
           | awk -F= '$0 ~ "=" {print $1}' `
}

_list_files () {
    #echo -e "OLD=$OLD \t NEW=$NEW \t BAD=$BAD \t ADD=$ADD"
    # This may be too cute,
    # but it avoids remembering to edit this function if _files () is changed.
    for F in $FILES
    do
        echo -ne "$F=${!F}\t"
    done
    echo
}

_show_files () {
    for F in $FILES
    do
        less -~ ${!F}
    done
}

# ===== Main =====
case $# in
    1)  case $1 in
            test) _files "file2" "file1" ;;
            *)    _no_db $1; _usage ;;
        esac ;;
    2)  _files $1 $2 ;;
    3)  KEYS="$1"
        shift
        _files $1 $2 ;;
    *)  _usage ;;
esac ; _list_files   # Debugging & info -- kill w/ '#' after ';'

# place ALL the unique (good) data in $TMP
cat $OLD $NEW | sort -suk1,2 > $TMP

# place the bad data in $BAD
echo -e "\n\nBad data: \t\t(Also stored in $BAD)\n"
cat $OLD $NEW | sort -sk1,2 | diff - $TMP | sed -n '/^[<>]/s,^[<>] ,,p' | tee $BAD
# cat & sort put all the data on STDIN, sorted
# "diff - $TMP" compares STDIN (-) to $TMP
# "sed ... " removes diff artifacts leaving only the bad data
# tee displays & stores the bad data

# place the unique (good) data to add in $ADD
echo -e "\n\nGood data: \t\t(Also stored in $ADD)\n"
diff $NEW $TMP | sed -n '/^>/s,^> ,,p' | tee $ADD
# diff'ing the new w/ all the good gives the additions
# sed cleans the output
# tee displays & stores

# uncomment next line to always view files at finish
#echo -e "\nReady to show files" ; read ; _show_files
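For reference, assuming the script is saved as db_purge (the name in its header comment), it can be run against the example dumps either through the built-in "test" shortcut or by naming the files explicitly:
Code:
./db_purge test            # shorthand wired to file2/file1 in the case statement
./db_purge file2 file1     # same thing, oldfile and newfile given explicitly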
Thanks. It will probably take me a little while to figure out everything you are doing, though. For example, the _files() function had me scratching my head for a while until I realized you were having the script read itself. That is a somewhat new twist in my experience. Also, I was not aware of the ${!variable} syntax. After some time, I realized you must be doing the equivalent of an eval, but I had trouble finding the syntax in any of my books on shell scripting. I found a small paragraph reference to it in O'Reilly's "Learning Bash". I haven't made it through the diff and sed parts yet, and I still haven't figured out what the KEYS variable is used for, but it is an interesting approach.
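For reference, ${!name} is bash's indirect expansion: the parameter is taken as the *name* of another variable, and that variable's value is substituted. A two-line illustration (the file name is made up):
Code:
OLD="dump_old.txt"
F="OLD"
echo "${!F}"    # prints the value of the variable named by $F, i.e. dump_old.txt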
Originally posted by paulsm4: One silly question: doesn't your DB have a "unique" constraint, so that duplicate records cannot be inserted in the first place ;-)?
Yes, but the command for bulk loading the data will fail if there are any errors (no rows will be loaded). In some cases, just being able to say, "load the data you can, ignore the rest" would be ideal. Other times, it's better to know exactly why you were getting duplicates in the first place.
Personally, given how fiddly this is getting, I'd write a Perl program to analyse the flat files. Given that hashes are a basic data type in Perl, checking for duplicates is trivial.
You could also get it to loop over the records and insert them into the DB one at a time (it's a quick language), logging any that get rejected.
I've done similar stuff in the past, and it's the sort of thing Perl is very good at.