Old 11-05-2019, 03:26 PM   #1
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
bash: diff ignoring certain columns


[AIX]:

Code:
cat s1.lst
chun-li,hello,0451,world
cat s2.lst
ryu-ken,hello,0000,world
I want to ignore the 1st and 3rd columns (the byte-offsets are always the same).
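For the sample above, a field-based cut (plain POSIX, no GNU needed) would keep just the 2nd and 4th columns; since the byte-offsets are fixed, cut -b works the same way:
Code:
cut -d, -f2,4 s1.lst
hello,world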

I have two files with about 30 columns (one has 39,952 rows; the other has 40,215) and I want to compare which serial-numbers (last column) have different data (the date, time, processor, ... fields will always be different).

I don't have GNU extensions, but I do have access to a C compiler.

maybe
Code:
cat serial.number | while read serial
do
 grep $serial s1.lst | cut -b 100-125,134- > s1.buff
 grep $serial s2.lst | cut -b 100-125,134- > s2.buff
 diff s1.buff s2.buff
done > report.lst
will work?

Last edited by schneidz; 11-05-2019 at 03:37 PM.
 
Old 11-05-2019, 04:22 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Please provide a larger sample input file and a hand-built example of the desired output file. That will help us understand the problem statement.

My first thought is that you need only this:
Code:
grep $serial s1.lst | cut -b 100-125,134- > s1.buff
grep $serial s2.lst | cut -b 100-125,134- > s2.buff
diff s1.buff s2.buff >report.lst
Daniel B. Martin

 
1 member found this post helpful.
Old 11-05-2019, 04:26 PM   #3
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
You really need to provide better sample data. I am having difficulty understanding what you are actually trying to achieve. I am also not sure if the serial numbers will be unique in each file. If they are not, does it matter?
Anyway, provide some representative samples and the results that you expect. Then we can test if the following might function as a usable filter:
Code:
sort -t, -k30 s1.lst s2.lst|awk 'BEGIN{FS=",";OFS=","}{$1="";$3="";print}'| uniq
It can be further refined based on your requirements, once we know exactly what they are.
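On the two sample lines from the first post, for instance, both rows collapse to the same output line:
Code:
$ sort -t, -k30 s1.lst s2.lst | awk 'BEGIN{FS=",";OFS=","}{$1="";$3="";print}' | uniq
,hello,,world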

Last edited by crts; 11-05-2019 at 04:29 PM.
 
1 member found this post helpful.
Old 11-05-2019, 07:17 PM   #4
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683

Rep: Reputation: 783
I'm a little lost as to what you want,

but it appears awk would work well, though it may be overkill, and most people tend to write for gawk.

I assume serial.number is simply a list of serial numbers
avoid UUOC (Useless Use Of Cat): http://porkmail.org/era/unix/award.html

Code:
#!/bin/bash
while read serial
do
  stuff
done < serial.number
but what is it you want to do?

you might not even need that serial.number file

Code:
#!/bin/bash
while IFS=, read -r -a MYDATA
do
  while IFS=, read -r -a MYDATAFILTERED
  do
    printf "%s\t%s\t%s\n" \
      "${MYDATAFILTERED[3]}" \
      "${MYDATAFILTERED[23]}" \
      "${MYDATAFILTERED[5]}"
  done < <( grep ",${MYDATA[-1]}$" s2.lst )
done < s1.lst
not very useful, but what you actually want is not very clear

${MYDATA[-1]}: the -1 is the "last index"; -2 would be second to last.
bash array indexing starts at 0, so column 30 would be ${MYDATA[29]}.
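e.g.:
Code:
arr=(a b c)
echo "${arr[-1]}"   # prints c  (negative subscripts need bash 4.3 or newer)
echo "${arr[0]}"    # prints a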

compare?
what do you mean by that?
look for things that are the same or different?

Code:
#!/bin/bash
while IFS=, read -r -a MYDATA
do
  while IFS=, read -r -a MYDATAFILTERED
  do
    if [[ ${MYDATA[5]} == "${MYDATAFILTERED[5]}" ]]
    then
      echo "${MYDATA[-1]} column 6 matches data in s2.lst"
    else
      sep=","
      for i in "${!MYDATAFILTERED[@]}"
      do
        [[ $i == $(( ${#MYDATAFILTERED[@]} - 1 )) ]] && sep=$'\n'
        printf "%s${sep}" "${MYDATAFILTERED[$i]}"
      done >> someoutputfile
    fi
  done < <( grep ",${MYDATA[-1]}$" s2.lst )
done < s1.lst
you can probably guess I have just made up numbers for the "columns"
if column 6 matches, it just says so
if it doesn't match, it spits out the line using commas as the field separator

that does assume the serials in s1.lst are "unique"
if not then you will duplicate work/output

I don't use or have access to AIX
it would help to know the bash version

ultimately we are going to need sample data that resembles the real data you are working with as well as "rules" for the compare
 
1 member found this post helpful.
Old 11-05-2019, 07:54 PM   #5
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Original Poster
Rep: Reputation: 918
Thanks. You guys have given me some ideas and syntax to play around with.
It will take me a while to sanitize some production data. I work in healthcare informatics, so there will be HIPAA concerns.

The 2 reports are mostly identical (s1 has a few hundred rows that don't exist in s2, and s2 has a few dozen that don't exist in s1; they have more than 39,000 in common).

For the time being I'll try a more detailed description:
I have 2 reports with about 30 columns each.
They have member-ids that don't map to each other (some members have multiple plans).
I ran a DB2 query with 3 columns (select distinct s1 member-id, serial-#, s2 member-id) and copied it to the Unix server.
Overnight I let it grep the member-id and stub in the serial at the end for both files.

Now I want a list of rows that are not paying correctly.
The date, time, member-id, processor-# will be different -- all other fields should match (serial is the key).

I might try trimming out lines that are not common, sorting by serial, then diff|cut?
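A rough sketch of that idea (POSIX tools only; it assumes the serial has already been appended as field 30, and the cut field lists are just placeholders for the always-different columns):
Code:
sort -t, -k30,30 s1.lst > s1.srt
sort -t, -k30,30 s2.lst > s2.srt

cut -d, -f30 s1.srt | sort -u > ser.1     # serial lists
cut -d, -f30 s2.srt | sort -u > ser.2
comm -12 ser.1 ser.2 | sed 's/.*/,&$/' > ser.common   # comm needs sorted input; anchor each serial as ,SERIAL at line end

grep -f ser.common s1.srt | cut -d, -f2,4-30 > s1.trim   # drop the always-different fields, keep the serial
grep -f ser.common s2.srt | cut -d, -f2,4-30 > s2.trim
diff s1.trim s2.trim > report.lst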

Last edited by schneidz; 11-05-2019 at 08:52 PM.
 
Old 11-05-2019, 09:03 PM   #6
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Original Poster
Rep: Reputation: 918
Quote:
Originally Posted by Firerat View Post
...
avoid UUOC (Useless Use Of Cat): http://porkmail.org/era/unix/award.html
Code:
#!/bin/bash
while read serial
do
  stuff
done < serial.number
...
Big up. But I prefer to read top-to-bottom, left-to-rite; rather than jump to the bottom and scan back up.
 
Old 11-05-2019, 10:09 PM   #7
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683

Rep: Reputation: 783
Quote:
Originally Posted by schneidz View Post
Big up. But I prefer to read top-to-bottom, left-to-rite; rather than jump to the bottom and scan back up.
The good old readability argument.

Fair enough.

Please do consider applying that to your written word:
left-to-right makes more sense.
 
Old 11-06-2019, 04:19 PM   #8
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Original Poster
Rep: Reputation: 918
This is what I developed:
Code:
[schneidz@lq] cat s1-s2-compare.ksh
#!/bin/bash

cat memberid-serial-map.1 | while read line
do
 c=`echo $line | awk '{print $1}'`
 u=`echo $line | awk '{print $2}'`
 awk -v c=$c -v u=$u "/$c/ {print \$0u}" $1
done > s1.lst # append matching serial to end of each row

cat memberid-serial-map.2 | while read line
do
 c=`echo $line | awk '{print $1}'`
 u=`echo $line | awk '{print $2}'`
 awk -v c=$c -v u=$u "/$c/ {print \$0u}" $2
done > s2.lst # append matching serial to end of each row

cut -b 17- memberid-serial-map.1 | sort > serial.1
cut -b 17- memberid-serial-map.2 | sort > serial.2

#comm -23 serial.1 serial.2 > not-in-s2.lst # these were
#comm -13 serial.1 serial.2 > not-in-s1.lst # buggy
#comm -12 serial.1 serial.2 > serial.comm   # for some reason

cat serial.1 serial.1 serial.2 | sort | uniq -c | grep "   1" | cut -b 6- > not-in-s1.lst
cat serial.1 serial.1 serial.2 | sort | uniq -c | grep "   2" | cut -b 6- > not-in-s2.lst
cat serial.1 serial.1 serial.2 | sort | uniq -c | grep "   3" | cut -b 6- > serial.comm

cat serial.comm | while read s
do
 grep $s s1.lst | cut -b 42-72,82- > s.1
 grep $s s2.lst | cut -b 42-72,82- > s.2
 diff s.1 s.2 > /dev/null
 if [ $? -eq 1 ]
 then
  grep $s s[12].lst
 fi
done > unmatched-report.lst

cut -b 339- unmatched-report.lst | sort | uniq > s1-s2-differences
wc -l not-in-s1.lst not-in-s2.lst serial.comm s1-s2-differences
rm s.1 s.2 s1-s2-differences

tar -cf - $1 $2 unmatched-report.lst | bzip2 > s1-s2-diffs.tar.bz2
[schneidz@lq] ./s1-s2-compare.ksh mainframe.dataset.1 mainframe.dataset.2
     233 not-in-s1.lst
       5 not-in-s2.lst
   42893 serial.comm
   14178 s1-s2-differences
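(One fragile bit, in hindsight: picking the counts with fixed-width grep can misfire -- "   1" also matches a count like 13 if a serial repeats. Letting awk compare the count field numerically is sturdier, e.g.:)
Code:
cat serial.1 serial.1 serial.2 | sort | uniq -c | awk '$1 == 1 {print $2}' > not-in-s1.lst
cat serial.1 serial.1 serial.2 | sort | uniq -c | awk '$1 == 2 {print $2}' > not-in-s2.lst
cat serial.1 serial.1 serial.2 | sort | uniq -c | awk '$1 == 3 {print $2}' > serial.comm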
 
Old 11-06-2019, 05:05 PM   #9
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683

Rep: Reputation: 783
backticks are awful

they are hard to see and a nightmare to nest
instead use $(...)

Code:
 c="$( echo $line | awk '{print $1}' )"
and.. echo?

Code:
 c="$( <<<$line  awk '{print $1}' )"
but you don't even need awk
Code:
while read -a line
do
  c="${line[0]}"
  u="${line[1]}"
....
..
alternative

Code:
while read -r c u rest    # the first two fields land in c and u; anything left lands in "rest"
do
  echo "$c"
done
I prefer to work with arrays; it is easier to check that I have the correct number of fields.

Fact is, you don't need all your cats, echos, greps, cuts and awks.

pure bash can do most if not all of what you want ( and will be much faster )

https://mywiki.wooledge.org/BashGuide
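For example, pure-bash equivalents of the echo|awk and cut -b patterns above (using the sample line from the first post):
Code:
line="chun-li,hello,0451,world"
IFS=, read -r -a f <<<"$line"   # split on commas into an array
echo "${f[0]}"                  # first field, like awk -F, '{print $1}'  -> chun-li
echo "${line:8:5}"              # 5 chars from offset 8, like cut -b 9-13 -> hello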
 
2 members found this post helpful.
Old 11-07-2019, 01:25 AM   #10
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by Firerat View Post
backticks are awful
So is shuffling temporary files.
I wholeheartedly agree:
Quote:
I prefer to work with arrays; it is easier to check that I have the correct number of fields.

Fact is, you don't need all your cats, echos, greps, cuts and awks.

pure bash can do most if not all of what you want ( and will be much faster )

https://mywiki.wooledge.org/BashGuide
It means re-thinking and re-implementing the whole thing.
A lot of work, but do it once and all your subsequent scripts will get better, too.
 
2 members found this post helpful.
Old 11-07-2019, 05:23 AM   #11
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683

Rep: Reputation: 783
agreed, I really dislike writing temp files to disk

I'm unsure of the relationship between these two files
or what exactly is being compared

I have thrown together this script

it removes the date field, and "counts" "unique" lines for each serial

I have had to guess the structure of the log:
the first field is date/time; "2019-11-07 10:55" is 16 chars long, hence the cut -b 17-
the last is serial (added later, or in the original?)
in between is "data"; knowing the fields of interest would make a better script!

I have assumed the serial is in the log, but that memberid-serial map makes me suspect that is not true.
This is where I'm confused:
why add the serial based on member-id?

anyway, script follows
Code:
#!/bin/bash
File1="$1"
File2="$2"

MakeArrays1 () {
while IFS=, read -a line
do
  unset line[0] # discard date time
  Hash=($( md5sum <<<"${line[*]}" ))

  [[ -z ${dataset1[sn${line[29]}]} ]] \
    && declare -g -A dataset1[sn${line[29]}]="${Hash}" \
    ||  {
          [[ ${Hash} =~ ${dataset1[sn${line[29]}]} ]] \
            || declare -g -A dataset1[sn${line[29]}]+="|${Hash}"
        }
done
}
# I don't like near copies of functions
#TODO figure out how to have just one function
MakeArrays2 () {
while IFS=, read -a line
do
  unset line[0] # discard date time
  Hash=($( md5sum <<<"${line[*]}" ))

  [[ -z ${dataset2[sn${line[29]}]} ]] \
    && declare -g -A dataset2[sn${line[29]}]="${Hash}" \
    ||  {
          [[ ${Hash} =~ ${dataset2[sn${line[29]}]} ]] \
            || declare -g -A dataset2[sn${line[29]}]+="|${Hash}"
        }
done
}

MakeArrays1 <"$File1"
MakeArrays2 <"$File2"

# note the !, it lists the indices, not the values
for i in ${!dataset1[@]}
do
    IFS='|' read -a Count <<<${dataset1[$i]}
    printf "%s has %d unique datasets ( excluding date/time )\n" $i ${#Count[@]}
done | sort -k1

for i in ${!dataset1[@]}
do
  IFS=${IFS/#/|}
  [[ $i =~ ${!dataset2[*]} ]] \
    && echo "$i exists in dataset2" \
    || echo "$i does not exist in dataset2"
  IFS=${IFS#|}
done | sort -k1
I have not added checking that lines in file1 match lines in file2.
To do that, for each serial in dataset1, load its pattern into an array,
then iterate over that array and check against the serial's pattern in dataset2.
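An untested sketch of that cross-check, reusing the dataset arrays built above:
Code:
for sn in "${!dataset1[@]}"
do
  [[ -z ${dataset2[$sn]} ]] && continue   # serial missing from file2 entirely, reported elsewhere
  IFS='|' read -r -a pats <<<"${dataset1[$sn]}"
  for p in "${pats[@]}"
  do
    # glob (literal substring) match instead of =~, so regex chars in the data can't bite
    [[ ${dataset2[$sn]} == *"$p"* ]] \
      || echo "$sn: dataset from file1 not found in file2"
  done
done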

I've not counted duplicates in the same file.

duplicates could be counted by creating a new array
e.g.
Code:
...
||  {
      [[ ${Hash} =~ ${dataset2[sn${line[29]}]} ]] \
        && dupeCount[sn${line[29]}]=$(( dupeCount[sn${line[29]}] + 1 )) \
        || declare -g -A dataset2[sn${line[29]}]+="|${Hash}"
    }
...
but since you have been using uniq I'm guessing duplicates don't "count"

if you need/want to do more analysis the Hash bit may "get in the way"
Code:
...
IFS=${IFS/#/,}
  Hash="${line[*]}"
IFS=${IFS#,}
...
you will end up with some really long patterns
but the retained data allows for further analysis
each element (serial) of dataset then holds "|"-separated records whose fields are ","-separated
to be honest, when you start getting that deep it is probably time to think about getting that data into a proper database.
Please don't tell me this data was exported from a database
 
Old 11-07-2019, 05:56 AM   #12
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Original Poster
Rep: Reputation: 918
Quote:
Originally Posted by ondoho View Post
So is shuffling temporary files.
I wholeheartedly agree:

It means re-thinking and re-implementing the whole thing.
A lot of work, but do it once and all your subsequent scripts will get better, too.
You guys are right. This is a lazy attempt. I am begrudgingly doing the homework to improve, little by little.
 
Old 11-07-2019, 09:40 AM   #13
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by Firerat View Post
I'm unsure of the relationship between these two files
or what exactly is being compared

I have thrown together this script

...
Code:
#!/bin/bash
File1="$1"
File2="$2"

MakeArrays1 () {
while IFS=, read -a line
do
  unset line[0] # discard date time
  Hash=($( md5sum <<<"${line[*]}" ))

  [[ -z ${dataset1[sn${line[29]}]} ]] \
    && declare -g -A dataset1[sn${line[29]}]="${Hash}" \
    ||  {
          [[ ${Hash} =~ ${dataset1[sn${line[29]}]} ]] \
            || declare -g -A dataset1[sn${line[29]}]+="|${Hash}"
        }
done
}
# I don't like near copies of functions
#TODO figure out how to have just one function
MakeArrays2 () {
while IFS=, read -a line
do
  unset line[0] # discard date time
  Hash=($( md5sum <<<"${line[*]}" ))

  [[ -z ${dataset2[sn${line[29]}]} ]] \
    && declare -g -A dataset2[sn${line[29]}]="${Hash}" \
    ||  {
          [[ ${Hash} =~ ${dataset2[sn${line[29]}]} ]] \
            || declare -g -A dataset2[sn${line[29]}]+="|${Hash}"
        }
done
}

MakeArrays1 <"$File1"
MakeArrays2 <"$File2"

...
First of all, without knowing the data structure and what exactly is desired this is pretty much an exercise in futility. Also, the script does not work (as far as any script can work with the given information) as "expected".
The following sample data was used (it only has two columns, which is sufficient for demonstration purposes):
Code:
user@kronos$ cat s1.lst 
data1,1111
data2,2222
data3,3333
data1,1111
data11,1111
data111,1111
user@kronos$ cat s2.lst 
data1,1111
data3,3333
data4,4444
data11,1111
data111,1111
data1111,1111
If there are duplicate serial numbers, the second 'declare' statement in the function results in the array looking like this:
Code:
|56baf66ff8ff800696fcfd56256688e1 856f07fad133660d6e16eb93d1414546 d532f61d1a327ca53c685c6d7dbe2ab9
This means that for multiple duplicates of the same serial number the script will always count 2. I assume the code responsible for this was supposed to be a fancy way of saving a line on the declaration:
Code:
  [[ -z ${dataset1[sn${line[29]}]} ]] \
    && declare -g -A dataset1[sn${line[29]}]="${Hash}" \
    ||  {
          [[ ${Hash} =~ ${dataset1[sn${line[29]}]} ]] \
            || declare -g -A dataset1[sn${line[29]}]+="|${Hash}"
        }
There are two problems with fancy stuff in general:
  1. it needlessly obfuscates the code
  2. quite often it is just a recipe for future unpleasantness
If the second 'declare' is removed, it correctly counts the duplicates in the first file, though not in the second. This implies that the first file would have to be authoritative with regard to duplication of the serial numbers. The information supplied by OP does not support that assumption.

I also do not see any benefit in calculating a hashsum - storing the line *should* be fine.

Anyway, with regard to the near-duplicate function:
Code:
#!/usr/bin/bash

field=1
File1="$1"
File2="$2"

MakeArrays () {
        local -n ref=$1
        declare -g -A ${!ref}

        while IFS=, read -a line; do
                chk="${line[*]}"
                [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
                        || ref[sn${line[$field]}]+="|${chk}"
        done
}


MakeArrays dataset1 <"$File1"
MakeArrays dataset2 <"$File2"

# debug
#echo ${dataset1[@]}
#echo ${dataset2[@]}

...
PS:
OP said he is working on AIX and does not have access to GNU extensions. There is a good chance that associative arrays are not supported on the target system.
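A quick way to check on the target box, for what it's worth:
Code:
bash -c 'echo "$BASH_VERSION"; declare -A t 2>/dev/null && echo "associative arrays OK" || echo "no associative arrays (needs bash >= 4.0)"'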

Last edited by crts; 11-07-2019 at 09:45 AM. Reason: Added PS
 
1 member found this post helpful.
Old 11-07-2019, 11:46 AM   #14
Firerat
Senior Member
 
Registered: Oct 2008
Distribution: Debian sid
Posts: 2,683

Rep: Reputation: 783
From OP: "about 30 columns";
the last is serial (but I have some doubts about that after the later posts).

I guessed the first was yyyy-mm-dd hh:mm, as that is 16 chars long, is always different, and is stripped with cut -b 17- in a later script.

a very crude script to make random data, very few rules to it
Code:
#!/bin/bash

somestrings=( edxe2Bt0iB umVwOVmPB5 pXPf4jyERd Lk5yRJdz8r J3R03MQiUD t5lv9ewBIm 6rii1QdmUe NfgpDKtZgE mjmG3nB2RO 5rGS2VlrYW )

cheapstring () {
printf "%s" "${somestrings[$((RANDOM%${#somestrings[@]}))]}"
}
cheapTS () {
date --iso-8601=seconds -d "@$(( $(date +%s ) + ${RANDOM} ))"
}
cheapserial () {
printf "%03d" $((RANDOM%200 ))
}

for i in {0000..1000}
do
  for i in {1..30}
  do
    case $i in
      1) printf "%s," $( cheapTS )
          ;;
    30)  printf "%s\n" $( cheapserial )
          ;;
      *) printf "%s," $( cheapstring )
          ;;
    esac
  done
done
flawed, yes

Quote:
Originally Posted by Firerat View Post
<snip>
I don't use or have access to AIX
it would help to know the bash version

ultimately we are going to need sample data that resembles the real data you are working with as well as "rules" for the compare
http://www.perzl.org/aix/index.php%3Fn%3DMain.Bash
chances are it is bash v4


not sure what is going on with the spaces
this is a sample of what I get
Code:
+ [[ -z 08ab0b2b9dfaa654d3b5387b86aa69f7|03293782aa51226e836ba3ceb175ca37|a9c8d717660e9887c50a54889f804a52 ]]
+ [[ 0c119a413caf6d68bcd11c7261597c9f =~ 08ab0b2b9dfaa654d3b5387b86aa69f7|03293782aa51226e836ba3ceb175ca37|a9c8d717660e9887c50a54889f804a52 ]]
+ declare -g -A 'dataset1[sn006]+=|0c119a413caf6d68bcd11c7261597c9f'

thanks for the dedupe MakeArray
Initially I had just the one, then at the last minute added the second (basically at time of posting).

Code:
MakeArrays () {
local -n ref=$1
declare -g -A ${!ref}
field=29

while IFS=, read -a line; do
  unset line[0] # strips date/time
  chk="${line[*]}"
  [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
     || ref[sn${line[$field]}]+="|${chk}"
done
}
there is a problem with your counts
Code:
bash -x testing.sh input1 input2 2> debug | head -n1 ; grep ,000$ input1
sn000 has 5 unique datasets ( excluding date/time )
2019-11-06T21:14:46+00:00,6rii1QdmUe,umVwOVmPB5,6rii1QdmUe,edxe2Bt0iB,NfgpDKtZgE,Lk5yRJdz8r,NfgpDKtZgE,mjmG3nB2RO,Lk5yRJdz8r,edxe2Bt0iB,J3R03MQiUD,Lk5yRJdz8r,6rii1QdmUe,edxe2Bt0iB,J3R03MQiUD,J3R03MQiUD,t5lv9ewBIm,5rGS2VlrYW,NfgpDKtZgE,pXPf4jyERd,NfgpDKtZgE,5rGS2VlrYW,5rGS2VlrYW,umVwOVmPB5,edxe2Bt0iB,edxe2Bt0iB,6rii1QdmUe,6rii1QdmUe,000
2019-11-06T21:38:36+00:00,mjmG3nB2RO,J3R03MQiUD,J3R03MQiUD,edxe2Bt0iB,5rGS2VlrYW,Lk5yRJdz8r,Lk5yRJdz8r,pXPf4jyERd,6rii1QdmUe,mjmG3nB2RO,5rGS2VlrYW,edxe2Bt0iB,umVwOVmPB5,Lk5yRJdz8r,Lk5yRJdz8r,umVwOVmPB5,NfgpDKtZgE,NfgpDKtZgE,J3R03MQiUD,NfgpDKtZgE,J3R03MQiUD,Lk5yRJdz8r,mjmG3nB2RO,NfgpDKtZgE,umVwOVmPB5,5rGS2VlrYW,edxe2Bt0iB,t5lv9ewBIm,000
2019-11-06T21:17:51+00:00,pXPf4jyERd,umVwOVmPB5,mjmG3nB2RO,edxe2Bt0iB,edxe2Bt0iB,umVwOVmPB5,6rii1QdmUe,mjmG3nB2RO,NfgpDKtZgE,edxe2Bt0iB,t5lv9ewBIm,5rGS2VlrYW,edxe2Bt0iB,t5lv9ewBIm,mjmG3nB2RO,t5lv9ewBIm,Lk5yRJdz8r,t5lv9ewBIm,6rii1QdmUe,5rGS2VlrYW,Lk5yRJdz8r,t5lv9ewBIm,edxe2Bt0iB,t5lv9ewBIm,pXPf4jyERd,J3R03MQiUD,umVwOVmPB5,Lk5yRJdz8r,000
2019-11-07T00:54:33+00:00,edxe2Bt0iB,J3R03MQiUD,J3R03MQiUD,6rii1QdmUe,NfgpDKtZgE,umVwOVmPB5,mjmG3nB2RO,edxe2Bt0iB,pXPf4jyERd,6rii1QdmUe,mjmG3nB2RO,5rGS2VlrYW,5rGS2VlrYW,NfgpDKtZgE,edxe2Bt0iB,6rii1QdmUe,J3R03MQiUD,pXPf4jyERd,6rii1QdmUe,mjmG3nB2RO,6rii1QdmUe,umVwOVmPB5,5rGS2VlrYW,5rGS2VlrYW,pXPf4jyERd,5rGS2VlrYW,NfgpDKtZgE,5rGS2VlrYW,000
it is counting 5, but I only have 4 in the input

removing the date/time from the array and it returns 4

if I add those 4 lines to the input, I still get 4 unique
but commenting out the unset I get 9

why?
the + in the date/time is messing with the RE (the right-hand side of =~ is treated as a regular expression, so + acts as a quantifier)

now, this is where md5sum comes in
I do not know what is in these input files
so I have no control over the patterns, unless..
Code:
MakeArrays () {
local -n ref=$1
declare -g -A ${!ref}
field=29

while IFS=, read -a line; do
  #unset line[0]
  #chk="${line[*]}"
  chk=( $( md5sum <<<${line[*]} ) )
  [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
     || ref[sn${line[$field]}]+="|${chk}"
done
}
no more problems with the + in the date
it does take longer
but knowing what is expected in the input it can be dealt with.

now, why is it counting the extra 1?
again, that is the + from the date/time messing with the RE
so the first is added twice, initially when the array is empty, and then when it fails the RE


faster and cleaner, but prone to special chars in the data.
Remember, the chk does not need to be the whole line,
only the fields that are of interest.
If the data is likely to have things like []{}+*$^ in it, either hash or escape.
Code:
#!/usr/bin/bash

field=29
File1="$1"
File2="$2"

MakeArrays () {
local -n ref=$1
declare -g -A ${!ref}

while IFS=, read -a line; do
  unset line[0]
  chk="${line[*]}"
  [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
     || ref[sn${line[$field]}]+="|${chk}"
done
}

MakeArrays dataset1 <"${File1}"
MakeArrays dataset2 <"${File2}"

# note the !, it lists the indices, not the values
for i in ${!dataset1[@]}
do
    IFS='|' read -a Count <<<${dataset1[$i]}
    printf "%s has %d unique datasets ( excluding date/time )\n" $i ${#Count[@]}
done | sort -k1

# this is mostly an example, not really useful
for i in ${!dataset1[@]}
do
  IFS=${IFS/#/|}
  [[ $i =~ ${!dataset2[*]} ]] \
    && echo "$i exists in dataset2" \
    || echo "$i does not exist in dataset2"
  IFS=${IFS#|}
done | sort -k1
 
Old 11-08-2019, 05:12 PM   #15
crts
Senior Member
 
Registered: Jan 2010
Posts: 2,020

Rep: Reputation: 757
Quote:
Originally Posted by Firerat View Post
there is a problem with your counts
Yes, I know. I think that the function should expect a preconditioned set of data and not do any filtering by itself. As OP stated
Quote:
Originally Posted by schneidz View Post
I want to ignore the 1st and 3rd columns
So you will have to at least also filter 'line[2]', if the above statement is still valid.
Furthermore,
Quote:
Originally Posted by schneidz View Post
I ran a DB2 query with 3 columns ...
This leads me to suspect that the data may be coming from a database. If the query is changed to not output the unneeded fields, then the function - if it is filtering - will have to be adjusted, too. Therefore I think it is more maintainable if the filtering (and any other required preconditioning) is done in a separate function, or maybe even in a dedicated script.

The RegEx issue briefly crossed my mind. I disregarded it, however, because if it turns out that the input has "funny" characters then the characters in question might be changed to something else (or escaped, as you suggested). This should also be done in the preconditioning phase.

Of course, hashes can also take care of the problem. If you want to use hashes, however, I would recommend using two different hash functions to minimize the risk of collisions, e.g.:
Code:
chk="$( md5sum <<<${line[*]} )$( sha512sum <<<${line[*]} )"
chk="${chk// /}"
Since all the rows have the same structure, some may be similar; thus the chance of collisions is slightly higher than for arbitrary data, as is demonstrated here. Even if the chance of a collision is still low, why take it?
 
1 member found this post helpful.
  

