I would want to ignore the 1st and 3rd columns (the byte offsets are always the same).
I have two files with about 30 columns (one has 39,952 rows; the other has 40,215) and I want to compare which serial numbers (last column) have different data (the date, time, processor, ... fields will always be different).
I don't have GNU extensions, but I do have access to a C compiler.
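One portable approach worth considering is POSIX awk, which is available without GNU extensions. This is only a sketch: it assumes comma-separated fields, the serial in the last column, that columns 1 and 3 are the ones to ignore, and the file names s1.lst/s2.lst are placeholders.

```shell
# Read the first file into an array keyed by serial, then compare
# against the second file. NR==FNR is true only for the first file.
awk -F, '
    {
        key = $NF              # serial number, last field
        $1 = ""; $3 = ""       # blank out the ignored columns
        rest = $0              # remaining fields (rebuilt, space-joined)
        if (NR == FNR) { seen[key] = rest; next }
        if (key in seen && seen[key] != rest)
            print key, "differs"
    }
' s1.lst s2.lst
```

Assigning to a field rebuilds $0 with the output field separator, but since both files are reformatted the same way, the comparison stays fair.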
You really need to provide better sample data. I am having difficulty understanding what you are actually trying to achieve. I am also not sure if the serial numbers will be unique in each file. If they are not, does it matter?
Anyway, provide some representative samples and the results that you expect. Then we can test if the following might function as a usable filter:
#!/bin/bash
while read serial
do
stuff
done < serial.number
but what is it you want to do?
you might not even need that serial.number file
Code:
#!/bin/bash
while IFS=, read -r -a MYDATA
do
    while IFS=, read -r -a MYDATAFILTERED
    do
        printf "%s\t%s\t%s\n" \
            "${MYDATAFILTERED[3]}" \
            "${MYDATAFILTERED[23]}" \
            "${MYDATAFILTERED[5]}"
    done < <( grep ",${MYDATA[-1]}$" s2.lst )
done < s1.lst
not very useful, but what you actually want is not very clear
In ${MYDATA[-1]}, the -1 is the last index; -2 would be second to last.
Bash array indices start at 0, so column 30 would be ${MYDATA[29]}.
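A minimal, self-contained illustration of that indexing (the data line is made up):

```shell
# Split a comma-separated line into a bash array and index it
line='2019-11-07,alpha,beta,gamma,SER123'
IFS=, read -r -a F <<< "$line"
echo "${F[0]}"    # first column  -> 2019-11-07
echo "${F[-1]}"   # last column (bash 4.3+; use ${F[4]} on older bash) -> SER123
```

Note this is bash syntax; run it with bash, not a plain POSIX sh.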
compare?
what do you mean by that?
look for things that are the same or different?
Code:
#!/bin/bash
while IFS=, read -r -a MYDATA
do
    while IFS=, read -r -a MYDATAFILTERED
    do
        if [[ ${MYDATA[5]} == "${MYDATAFILTERED[5]}" ]]
        then
            echo "${MYDATA[-1]} column 6 matches data in s2.lst"
        else
            sep=","
            for i in "${!MYDATAFILTERED[@]}"
            do
                [[ $i == $(( ${#MYDATAFILTERED[@]} - 1 )) ]] && sep=$'\n'
                printf "%s${sep}" "${MYDATAFILTERED[$i]}"
            done >> someoutputfile
        fi
    done < <( grep ",${MYDATA[-1]}$" s2.lst )
done < s1.lst
You can probably guess I have just made up numbers for the "columns".
If column 6 matches, it just says so.
If it doesn't match, it spits out the line using commas as field separator.
that does assume the serials in s1.lst are "unique"
if not then you will duplicate work/output
I don't use or have access to AIX
it would help to know bash version
ultimately we are going to need sample data that resembles the real data you are working with as well as "rules" for the compare
Thanks. You guys have given me some ideas and syntax to play around with.
It will take me a while to sanitize some production data. I work in healthcare informatics, so there will be HIPAA concerns.
The 2 reports are mostly identical (s1 has a few hundred rows that don't exist in s2, and s2 has a few dozen that don't exist in s1; they have more than 39,000 in common).
For the time being I'll try a more detailed description:
I have 2 reports with about 30 columns each.
They have member-IDs that don't map to each other (some members have multiple plans).
I ran a db2 query with 3 columns (select distinct s1 member-id, serial-#, s2 member-id) and copied it to the unix server.
Overnight I let it grep the member-id and stub in the serial at the end for both files.
Now I want a list of rows that are not paying correctly.
The date, time, member-id, and processor-# will be different -- all other fields should match (serial is the key).
I might try trimming out lines that are not common; and sort by serial; then diff|cut?
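That plan can be sketched with standard tools. This assumes the serial is field 30 of a comma-separated line and that columns 1 and 3 are the always-different ones; the file names are placeholders.

```shell
# 1. sort both files on the serial column
sort -t, -k30,30 s1.lst > s1.sorted
sort -t, -k30,30 s2.lst > s2.sorted
# 2. list the serials common to both files
cut -d, -f30 s1.sorted > s1.serials
cut -d, -f30 s2.sorted > s2.serials
comm -12 s1.serials s2.serials > common.serials
# 3. keep only the common rows and drop the always-different
#    columns (here 1 and 3) before comparing
grep -F -f common.serials s1.sorted | cut -d, -f2,4- > s1.common
grep -F -f common.serials s2.sorted | cut -d, -f2,4- > s2.common
```

diff s1.common s2.common then shows only rows whose remaining fields disagree. Caveat: grep -F -f matches the serial anywhere on the line, so this is only safe if serials cannot appear in other columns.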
agreed, I really dislike writing temp files to disk
I'm unsure of the relationship between these two files
or what exactly is being compared
I have thrown together this script
it removes the date field, and "counts" "unique" lines for each serial
I have had to guess the structure of the log
the first field is date/time; "2019-11-07 10:55" is 16 chars long, hence the $( cut -b 17- )
the last is serial ( added later or in original ? )
in-between is "data", knowing the fields of interest would make a better script!
I have assumed serial is in the log, but that memberid_serial map makes me suspect that is not true
this is where I'm confused
why add serial based on memberid ?
anyway, script follows
Code:
#!/bin/bash
File1="$1"
File2="$2"
MakeArrays1 () {
    while IFS=, read -a line
    do
        unset line[0]   # discard date time
        Hash=($( md5sum <<<"${line[*]}" ))
        [[ -z ${dataset1[sn${line[29]}]} ]] \
            && declare -g -A dataset1[sn${line[29]}]="${Hash}" \
            || {
                [[ ${Hash} =~ ${dataset1[sn${line[29]}]} ]] \
                    || declare -g -A dataset1[sn${line[29]}]+="|${Hash}"
            }
    done
}
# I don't like near copies of functions
#TODO figure out how to have just one function
MakeArrays2 () {
    while IFS=, read -a line
    do
        unset line[0]   # discard date time
        Hash=($( md5sum <<<"${line[*]}" ))
        [[ -z ${dataset2[sn${line[29]}]} ]] \
            && declare -g -A dataset2[sn${line[29]}]="${Hash}" \
            || {
                [[ ${Hash} =~ ${dataset2[sn${line[29]}]} ]] \
                    || declare -g -A dataset2[sn${line[29]}]+="|${Hash}"
            }
    done
}
MakeArrays1 <"$File1"
MakeArrays2 <"$File2"
# note the ! , it lists the index not the values
for i in ${!dataset1[@]}
do
    IFS='|' read -a Count <<<${dataset1[$i]}
    printf "%s has %d unique datasets ( excluding date/time )\n" $i ${#Count[@]}
done | sort -k1
for i in ${!dataset1[@]}
do
    IFS=${IFS/#/|}
    [[ $i =~ ${!dataset2[*]} ]] \
        && echo "$i exists in dataset2" \
        || echo "$i does not exist in dataset2"
    IFS=${IFS#|}
done | sort -k1
I have not added checking lines in file1 match lines in file2
to do that, for each serial in dataset1, load its pattern into an array
iterate over that array and check against the serial's pattern in dataset2
I've not counted duplicates in the same file.
duplicates could be counted by creating a new array
e.g.
you will end up with some really long patterns
but the retained data allows for further analysis
each element ( serial ) of dataset contains two arrays, one with field separator "|" and the other ","
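The cross-check described above could look like this. It is only a sketch: it assumes the dataset1/dataset2 associative arrays built by the script are in scope and hold "|"-separated patterns per serial.

```shell
# For each serial in dataset1, test each of its patterns against
# the entry for the same serial in dataset2
for sn in "${!dataset1[@]}"; do
    IFS='|' read -r -a pats <<< "${dataset1[$sn]}"
    for p in "${pats[@]}"; do
        if [[ ${dataset2[$sn]} == *"$p"* ]]; then
            echo "$sn: pattern present in dataset2"
        else
            echo "$sn: pattern missing from dataset2"
        fi
    done
done
```

Using == with a quoted pattern is a literal substring test, so it sidesteps the regex metacharacter problem with =~.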
to be honest, when you start getting that deep it is probably time to think about getting that data into a proper database.
Please don't tell me this data was exported from a database
Quote:
I'm unsure of the relationship between these two files or what exactly is being compared. I have thrown together this script ...
(script quoted in full above)
First of all, without knowing the data structure and what exactly is desired, this is pretty much an exercise in futility. Also, the script does not work "as expected" (as far as any script can work with the given information).
The following sample data was used (it only has two columns, which is sufficient for demonstration purposes):
This means that for multiple duplicates of the same serial number the script will always count 2. I assume the code responsible for this was supposed to be a fancy way of saving the line for the declaration:
There are two problems with fancy stuff in general:
it needlessly obfuscates the code
quite often it is just a recipe for future unpleasantness
If the second 'declare' is removed then it correctly counts the duplicates in the first file, though not in the second. This implies that the first file would have to be authoritative with regard to duplicates of the serial number. The information supplied by OP does not support that assumption.
I also do not see any benefit in calculating a hashsum - storing the line *should* be fine.
Anyway with regard to the near duplicate function:
Code:
#!/usr/bin/bash
field=1
File1="$1"
File2="$2"
MakeArrays () {
    local -n ref=$1
    declare -g -A ${!ref}
    while IFS=, read -a line; do
        chk="${line[*]}"
        [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
            || ref[sn${line[$field]}]+="|${chk}"
    done
}
MakeArrays dataset1 <"$File1"
MakeArrays dataset2 <"$File2"
# debug
#echo ${dataset1[@]}
#echo ${dataset2[@]}
...
PS:
OP said he is working on an AIX and does not have access to GNU extensions. There is a good chance that associative arrays are not supported on the target system.
Last edited by crts; 11-07-2019 at 09:45 AM.
Reason: Added PS
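On that portability point, a quick probe of whether the target shell supports what these scripts need (this assumes a bash binary is on the PATH; it is not AIX-specific):

```shell
# Associative arrays need bash >= 4; negative array indices need 4.3+
bash -c 'echo "bash ${BASH_VERSION}"'
bash -c 'declare -A t 2>/dev/null && echo "assoc arrays: yes" || echo "assoc arrays: no"'
```

If the probe reports "no", the associative-array scripts in this thread will not run on that system as written.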
Removing the date/time from the array, it returns 4.
If I add those 4 lines to the input, I still get 4 unique.
But commenting out the unset I get 9.
Why?
The + in the date/time is messing with the RE.
Now, this is where md5sum comes in.
I do not know what is in these input files,
so I have no control over the patterns, unless..
Code:
MakeArrays () {
    local -n ref=$1
    declare -g -A ${!ref}
    field=29
    while IFS=, read -a line; do
        #unset line[0]
        #chk="${line[*]}"
        chk=( $( md5sum <<<${line[*]} ) )
        [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
            || ref[sn${line[$field]}]+="|${chk}"
    done
}
No more problems with the + in the date.
It does take longer,
but knowing what is expected in the input it can be dealt with.
Now, why is it counting the extra 1?
Again, that is the + from the date/time messing with the RE,
so the first line is added twice: initially when the array is empty, and then again when it fails the RE match.
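That failure can be reproduced in isolation (the timestamp below is made up):

```shell
# The "+" makes "55+" a regex quantifier, so the string fails to
# match a pattern built from itself
s='2019-11-07 10:55+0200'
[[ $s =~ $s ]] && echo matches || echo 'no match'   # prints: no match
```

Hashing the line first, as above, removes every metacharacter from the pattern, which is why the md5sum version behaves.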
Faster and cleaner, but prone to special chars in the data.
Remember, the chk does not need to be the whole line --
only the fields that are of interest.
If the data is likely to have things like []{}+*$^ in it, either hash or escape.
Code:
#!/usr/bin/bash
field=29
File1="$1"
File2="$2"
MakeArrays () {
    local -n ref=$1
    declare -g -A ${!ref}
    while IFS=, read -a line; do
        unset line[0]
        chk="${line[*]}"
        [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
            || ref[sn${line[$field]}]+="|${chk}"
    done
}
MakeArrays dataset1 <"${File1}"
MakeArrays dataset2 <"${File2}"
# note the ! , it lists the index not the values
for i in ${!dataset1[@]}
do
    IFS='|' read -a Count <<<${dataset1[$i]}
    printf "%s has %d unique datasets ( excluding date/time )\n" $i ${#Count[@]}
done | sort -k1
# this is mostly an example, not really useful
for i in ${!dataset1[@]}
do
    IFS=${IFS/#/|}
    [[ $i =~ ${!dataset2[*]} ]] \
        && echo "$i exists in dataset2" \
        || echo "$i does not exist in dataset2"
    IFS=${IFS#|}
done | sort -k1
Yes, I know. I think that the function should expect a preconditioned set of data and not do any filtering by itself. As OP stated
Quote:
Originally Posted by schneidz
i would want to ignore the 1st and 3rd columns
So you will have to at least also filter 'line[2]', if the above statement is still valid.
Furthermore,
Quote:
Originally Posted by schneidz
i ran a db2 query with 3 columns ...
This leads me to suspect that the data may be coming from a database. If the query is changed to not output the unneeded fields, then the function - if it is filtering - will have to be adjusted too. Therefore I think it is more maintainable if the filtering (and any other required preconditioning) is done in a separate function or maybe even in a dedicated script.
The RegEx issue briefly crossed my mind. I disregarded it, however, because if it turns out that the input has "funny" characters then the characters in question might be changed to something else (or escaped, as you suggested). This should also be done in the preconditioning phase.
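One way to do that escaping in the preconditioning phase is a small helper that backslash-escapes regex metacharacters. re_escape is a hypothetical name; the sed expression assumes POSIX sed.

```shell
# Backslash-escape ERE metacharacters so a data field can sit on the
# right-hand side of [[ =~ ]] and match only itself
re_escape() {
    printf '%s' "$1" | sed 's/[][(){}.*+?^$|\\]/\\&/g'
}
```

For example, re_escape '10:55+0200' prints '10:55\+0200', and that escaped form matches the original string literally under =~.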
Of course, hashes can also take care of the problem. If you want to use hashes, however, I would recommend using two different hash functions to minimize the risk of collisions.
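A sketch of that idea follows. Note that md5sum may be absent on AIX (csum or openssl could stand in); cksum is POSIX, and the data line is made up.

```shell
# Combine two independent checksums into one key; a collision would
# have to occur in both functions at once to go unnoticed
line='d1,A,p1,100,S1'
h1=$(printf '%s' "$line" | md5sum | cut -d' ' -f1)   # 32 hex chars
h2=$(printf '%s' "$line" | cksum  | cut -d' ' -f1)   # CRC checksum
key="${h1}-${h2}"
echo "$key"
```

The combined key can then be stored in the associative arrays exactly where the single md5sum value was used before.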
Since all the rows have the same structure, some may be similar; thus the chance of collisions is slightly higher than for arbitrary data, as demonstrated here. Even if the chance of a collision is still low, why take it?