I would want to ignore the 1st and 3rd columns (the byte offsets are always the same).
I have two files with about 30 columns (one has 39,952 rows; the other has 40,215) and I want to compare which serial numbers (last column) have different data (the date, time, processor, ... fields will always be different).
I don't have GNU extensions, but I do have access to a C compiler.
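One portable approach worth considering is POSIX awk, which is available without GNU extensions. This is only a sketch: it assumes comma-separated fields, the serial in the last column, that columns 1 and 3 are the ones to ignore, and the file names s1.lst/s2.lst are placeholders.

```shell
# Read the first file into an array keyed by serial, then compare
# against the second file. NR==FNR is true only for the first file.
awk -F, '
    {
        key = $NF              # serial number, last field
        $1 = ""; $3 = ""       # blank out the ignored columns
        rest = $0              # remaining fields (rebuilt, space-joined)
        if (NR == FNR) { seen[key] = rest; next }
        if (key in seen && seen[key] != rest)
            print key, "differs"
    }
' s1.lst s2.lst
```

Assigning to a field rebuilds $0 with the output field separator, but since both files are reformatted the same way, the comparison stays fair.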
You really need to provide better sample data. I am having difficulty understanding what you are actually trying to achieve. I am also not sure if the serial numbers will be unique in each file. If they are not, does it matter?
Anyway, provide some representative samples and the results that you expect. Then we can test if the following might function as a usable filter:
#!/bin/bash
while read serial
do
stuff
done < serial.number
but what is it you want to do?
you might not even need that serial.number file
Code:
#!/bin/bash
while IFS=, read -r -a MYDATA
do
    while IFS=, read -r -a MYDATAFILTERED
    do
        printf "%s\t%s\t%s\n" \
            "${MYDATAFILTERED[3]}" \
            "${MYDATAFILTERED[23]}" \
            "${MYDATAFILTERED[5]}"
    done < <( grep ",${MYDATA[-1]}$" s2.lst )
done < s1.lst
not very useful, but what you actually want is not very clear
In ${MYDATA[-1]}, the -1 is the last index; -2 would be second to last.
Bash array indices start at 0, so column 30 would be ${MYDATA[29]}.
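A minimal, self-contained illustration of that indexing (the data line is made up):

```shell
# Split a comma-separated line into a bash array and index it
line='2019-11-07,alpha,beta,gamma,SER123'
IFS=, read -r -a F <<< "$line"
echo "${F[0]}"    # first column  -> 2019-11-07
echo "${F[-1]}"   # last column (bash 4.3+; use ${F[4]} on older bash) -> SER123
```

Note this is bash syntax; run it with bash, not a plain POSIX sh.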
compare?
what do you mean by that?
look for things that are the same or different?
Code:
#!/bin/bash
while IFS=, read -r -a MYDATA
do
    while IFS=, read -r -a MYDATAFILTERED
    do
        if [[ ${MYDATA[5]} == "${MYDATAFILTERED[5]}" ]]
        then
            echo "${MYDATA[-1]} column 6 matches data in s2.lst"
        else
            sep=","
            for i in "${!MYDATAFILTERED[@]}"
            do
                [[ $i == $(( ${#MYDATAFILTERED[@]} - 1 )) ]] && sep=$'\n'
                printf "%s${sep}" "${MYDATAFILTERED[$i]}"
            done >> someoutputfile
        fi
    done < <( grep ",${MYDATA[-1]}$" s2.lst )
done < s1.lst
You can probably guess I have just made up numbers for the "columns".
If column 6 matches, it just says so.
If it doesn't match, it spits out the line using commas as field separator.
that does assume the serials in s1.lst are "unique"
if not then you will duplicate work/output
I don't use or have access to AIX
it would help to know bash version
ultimately we are going to need sample data that resembles the real data you are working with as well as "rules" for the compare
Thanks. You guys have given me some ideas and syntax to play around with.
It will take me a while to sanitize some production data. I work in healthcare informatics, so there will be HIPAA concerns.
The 2 reports are mostly identical (s1 has a few hundred rows that don't exist in s2, and s2 has a few dozen that don't exist in s1; they have more than 39,000 in common).
For the time being I'll try a more detailed description:
I have 2 reports with about 30 columns each.
They have member-IDs that don't map to each other (some members have multiple plans).
I ran a db2 query with 3 columns (select distinct s1 member-id, serial-#, s2 member-id) and copied it to the unix server.
Overnight I let it grep the member-id and stub in the serial at the end for both files.
Now I want a list of rows that are not paying correctly.
The date, time, member-id, and processor-# will be different -- all other fields should match (serial is the key).
I might try trimming out lines that are not common; and sort by serial; then diff|cut?
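That plan can be sketched with standard tools. This assumes the serial is field 30 of a comma-separated line and that columns 1 and 3 are the always-different ones; the file names are placeholders.

```shell
# 1. sort both files on the serial column
sort -t, -k30,30 s1.lst > s1.sorted
sort -t, -k30,30 s2.lst > s2.sorted
# 2. list the serials common to both files
cut -d, -f30 s1.sorted > s1.serials
cut -d, -f30 s2.sorted > s2.serials
comm -12 s1.serials s2.serials > common.serials
# 3. keep only the common rows and drop the always-different
#    columns (here 1 and 3) before comparing
grep -F -f common.serials s1.sorted | cut -d, -f2,4- > s1.common
grep -F -f common.serials s2.sorted | cut -d, -f2,4- > s2.common
```

diff s1.common s2.common then shows only rows whose remaining fields disagree. Caveat: grep -F -f matches the serial anywhere on the line, so this is only safe if serials cannot appear in other columns.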
agreed, I really dislike writing temp files to disk
I'm unsure of the relationship between these two files
or what exactly is being compared
I have thrown together this script
it removes the date field, and "counts" "unique" lines for each serial
I have had to guess the structure of the log
the first field is date/time; "2019-11-07 10:55" is 16 chars long, hence the $( cut -b 17- )
the last is serial ( added later or in original ? )
in-between is "data", knowing the fields of interest would make a better script!
I have assumed serial is in the log, but that memberid_serial map makes me suspect that is not true
this is where I'm confused
why add serial based on memberid ?
anyway, script follows
Code:
#!/bin/bash
File1="$1"
File2="$2"
MakeArrays1 () {
    while IFS=, read -a line
    do
        unset line[0]   # discard date time
        Hash=($( md5sum <<<"${line[*]}" ))
        [[ -z ${dataset1[sn${line[29]}]} ]] \
            && declare -g -A dataset1[sn${line[29]}]="${Hash}" \
            || {
                [[ ${Hash} =~ ${dataset1[sn${line[29]}]} ]] \
                    || declare -g -A dataset1[sn${line[29]}]+="|${Hash}"
            }
    done
}
# I don't like near copies of functions
#TODO figure out how to have just one function
MakeArrays2 () {
    while IFS=, read -a line
    do
        unset line[0]   # discard date time
        Hash=($( md5sum <<<"${line[*]}" ))
        [[ -z ${dataset2[sn${line[29]}]} ]] \
            && declare -g -A dataset2[sn${line[29]}]="${Hash}" \
            || {
                [[ ${Hash} =~ ${dataset2[sn${line[29]}]} ]] \
                    || declare -g -A dataset2[sn${line[29]}]+="|${Hash}"
            }
    done
}
MakeArrays1 <"$File1"
MakeArrays2 <"$File2"
# note the ! , it lists the index not the values
for i in ${!dataset1[@]}
do
    IFS='|' read -a Count <<<${dataset1[$i]}
    printf "%s has %d unique datasets ( excluding date/time )\n" $i ${#Count[@]}
done | sort -k1
for i in ${!dataset1[@]}
do
    IFS=${IFS/#/|}
    [[ $i =~ ${!dataset2[*]} ]] \
        && echo "$i exists in dataset2" \
        || echo "$i does not exist in dataset2"
    IFS=${IFS#|}
done | sort -k1
I have not added checking lines in file1 match lines in file2
to do that, for each serial in dataset1, load its pattern into an array
iterate over that array and check against the serial's pattern in dataset2
I've not counted duplicates in the same file.
duplicates could be counted by creating a new array
e.g.
you will end up with some really long patterns
but the retained data allows for further analysis
each element ( serial ) of dataset contains two arrays, one with field separator "|" and the other ","
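The cross-check described above could look like this. It is only a sketch: it assumes the dataset1/dataset2 associative arrays built by the script are in scope and hold "|"-separated patterns per serial.

```shell
# For each serial in dataset1, test each of its patterns against
# the entry for the same serial in dataset2
for sn in "${!dataset1[@]}"; do
    IFS='|' read -r -a pats <<< "${dataset1[$sn]}"
    for p in "${pats[@]}"; do
        if [[ ${dataset2[$sn]} == *"$p"* ]]; then
            echo "$sn: pattern present in dataset2"
        else
            echo "$sn: pattern missing from dataset2"
        fi
    done
done
```

Using == with a quoted pattern is a literal substring test, so it sidesteps the regex metacharacter problem with =~.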
to be honest, when you start getting that deep it is probably time to think about getting that data into a proper database.
Please don't tell me this data was exported from a database
Quote:
I'm unsure of the relationship between these two files or what exactly is being compared. I have thrown together this script ...
(script quoted in full above)
First of all, without knowing the data structure and what exactly is desired, this is pretty much an exercise in futility. Also, the script does not work "as expected" (as far as any script can work with the given information).
The following sample data was used (it only has two columns, which is sufficient for demonstration purposes):
This means that for multiple duplicates of the same serial number the script will always count 2. I assume the code responsible for this was supposed to be a fancy way of saving the line for the declaration:
There are two problems with fancy stuff in general:
it needlessly obfuscates the code
quite often it is just a recipe for future unpleasantness
If the second 'declare' is removed then it correctly counts the duplicates in the first file, though not in the second. This implies that the first file would have to be authoritative with regard to duplicates of the serial number. The information supplied by OP does not support that assumption.
I also do not see any benefit in calculating a hashsum - storing the line *should* be fine.
Anyway with regard to the near duplicate function:
Code:
#!/usr/bin/bash
field=1
File1="$1"
File2="$2"
MakeArrays () {
    local -n ref=$1
    declare -g -A ${!ref}
    while IFS=, read -a line; do
        chk="${line[*]}"
        [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
            || ref[sn${line[$field]}]+="|${chk}"
    done
}
MakeArrays dataset1 <"$File1"
MakeArrays dataset2 <"$File2"
# debug
#echo ${dataset1[@]}
#echo ${dataset2[@]}
...
PS:
OP said he is working on an AIX and does not have access to GNU extensions. There is a good chance that associative arrays are not supported on the target system.
Last edited by crts; 11-07-2019 at 09:45 AM.
Reason: Added PS
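On that portability point, a quick probe of whether the target shell supports what these scripts need (this assumes a bash binary is on the PATH; it is not AIX-specific):

```shell
# Associative arrays need bash >= 4; negative array indices need 4.3+
bash -c 'echo "bash ${BASH_VERSION}"'
bash -c 'declare -A t 2>/dev/null && echo "assoc arrays: yes" || echo "assoc arrays: no"'
```

If the probe reports "no", the associative-array scripts in this thread will not run on that system as written.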
Removing the date/time from the array, it returns 4.
If I add those 4 lines to the input, I still get 4 unique.
But commenting out the unset I get 9.
Why?
The + in the date/time is messing with the RE.
Now, this is where md5sum comes in.
I do not know what is in these input files,
so I have no control over the patterns, unless..
Code:
MakeArrays () {
    local -n ref=$1
    declare -g -A ${!ref}
    field=29
    while IFS=, read -a line; do
        #unset line[0]
        #chk="${line[*]}"
        chk=( $( md5sum <<<${line[*]} ) )
        [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
            || ref[sn${line[$field]}]+="|${chk}"
    done
}
No more problems with the + in the date.
It does take longer,
but knowing what is expected in the input it can be dealt with.
Now, why is it counting the extra 1?
Again, that is the + from the date/time messing with the RE,
so the first line is added twice: initially when the array is empty, and then again when it fails the RE match.
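That failure can be reproduced in isolation (the timestamp below is made up):

```shell
# The "+" makes "55+" a regex quantifier, so the string fails to
# match a pattern built from itself
s='2019-11-07 10:55+0200'
[[ $s =~ $s ]] && echo matches || echo 'no match'   # prints: no match
```

Hashing the line first, as above, removes every metacharacter from the pattern, which is why the md5sum version behaves.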
Faster and cleaner, but prone to special chars in the data.
Remember, the chk does not need to be the whole line --
only the fields that are of interest.
If the data is likely to have things like []{}+*$^ in it, either hash or escape.
Code:
#!/usr/bin/bash
field=29
File1="$1"
File2="$2"
MakeArrays () {
    local -n ref=$1
    declare -g -A ${!ref}
    while IFS=, read -a line; do
        unset line[0]
        chk="${line[*]}"
        [[ ${chk} =~ ${ref[sn${line[$field]}]:=${chk}} ]] \
            || ref[sn${line[$field]}]+="|${chk}"
    done
}
MakeArrays dataset1 <"${File1}"
MakeArrays dataset2 <"${File2}"
# note the ! , it lists the index not the values
for i in ${!dataset1[@]}
do
    IFS='|' read -a Count <<<${dataset1[$i]}
    printf "%s has %d unique datasets ( excluding date/time )\n" $i ${#Count[@]}
done | sort -k1
# this is mostly an example, not really useful
for i in ${!dataset1[@]}
do
    IFS=${IFS/#/|}
    [[ $i =~ ${!dataset2[*]} ]] \
        && echo "$i exists in dataset2" \
        || echo "$i does not exist in dataset2"
    IFS=${IFS#|}
done | sort -k1
Yes, I know. I think that the function should expect a preconditioned set of data and not do any filtering by itself. As OP stated
Quote:
Originally Posted by schneidz
i would want to ignore the 1st and 3rd columns
So you will have to at least also filter 'line[2]', if the above statement is still valid.
Furthermore,
Quote:
Originally Posted by schneidz
i ran a db2 query with 3 columns ...
This leads me to suspect that the data may be coming from a database. If the query is changed to not output the unneeded fields, then the function - if it is filtering - will have to be adjusted too. Therefore I think it is more maintainable if the filtering (and any other required preconditioning) is done in a separate function or maybe even in a dedicated script.
The RegEx issue briefly crossed my mind. I disregarded it, however, because if it turns out that the input has "funny" characters then the characters in question might be changed to something else (or escaped, as you suggested). This should also be done in the preconditioning phase.
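One way to do that escaping in the preconditioning phase is a small helper that backslash-escapes regex metacharacters. re_escape is a hypothetical name; the sed expression assumes POSIX sed.

```shell
# Backslash-escape ERE metacharacters so a data field can sit on the
# right-hand side of [[ =~ ]] and match only itself
re_escape() {
    printf '%s' "$1" | sed 's/[][(){}.*+?^$|\\]/\\&/g'
}
```

For example, re_escape '10:55+0200' prints '10:55\+0200', and that escaped form matches the original string literally under =~.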
Of course, hashes can also take care of the problem. If you want to use hashes, however, I would recommend using two different hash functions to minimize the risk of collisions.
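A sketch of that idea follows. Note that md5sum may be absent on AIX (csum or openssl could stand in); cksum is POSIX, and the data line is made up.

```shell
# Combine two independent checksums into one key; a collision would
# have to occur in both functions at once to go unnoticed
line='d1,A,p1,100,S1'
h1=$(printf '%s' "$line" | md5sum | cut -d' ' -f1)   # 32 hex chars
h2=$(printf '%s' "$line" | cksum  | cut -d' ' -f1)   # CRC checksum
key="${h1}-${h2}"
echo "$key"
```

The combined key can then be stored in the associative arrays exactly where the single md5sum value was used before.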
Since all the rows have the same structure, some may be similar; thus the chance of collisions is slightly higher than for arbitrary data, as demonstrated here. Even if the chance of a collision is still low, why take it?