[SOLVED] uniq ignoring non [[:alnum:]] chars...

masavini · 11-23-2018, 07:41 AM

hi,
is there a bash way to make a list of strings unique ignoring non [[:alnum:]] chars, and returning the longest strings?

i mean something like this:

Code:

$ cat list.txt
1string
1str-ing
2s-tring
2str.+ing
$ elastic_uniq list.txt
1str-ing
2str.+ing

in this example the strings '1string' and '2s-tring' are purged because they're equivalent, respectively, to '1str-ing' and '2str.+ing' and they're shorter than their "synonyms".

thanks!

l0f4r0 · 11-23-2018, 08:19 AM

I don't know how to return the longest pattern but here is a solution for the other aspects of your problem:

Code:

sed 's/[^[:alnum:]]//g' list.txt | sort -u

EDIT: maybe it's not totally what you're looking for because this command can sometimes return data that was not initially in your file (depending on the substitutions)...
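A minimal sketch of one way around that caveat, assuming a POSIX awk: use the stripped string only as a comparison key and print the longest original line for each key, so nothing is output that was not in the file. The function name elastic_uniq is just the hypothetical name from the question, and keeping the first line on a length tie is an assumption.

```shell
#!/bin/bash
# elastic_uniq FILE...  (hypothetical name taken from the question)
# Prints, for each group of lines that are equal once non-alphanumeric
# characters are ignored, the longest original line of that group.
elastic_uniq() {
  awk '
    {
      key = $0
      gsub(/[^[:alnum:]]/, "", key)          # comparison key: alphanumerics only
      if (!(key in best)) order[++n] = key   # remember keys in first-seen order
      if (length($0) > length(best[key])) best[key] = $0   # ties keep the first line
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  ' "$@"
}

# Usage: elastic_uniq list.txt
```

The comparison is case-sensitive here; lowercasing the key with tolower(key) would make it case-insensitive.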

lougavulin · 11-23-2018, 09:13 AM

That will keep the longest of all after removing synonyms:

Code:

tr -dc '[:alnum:]\n' < file | sort -u | awk '{ L=length($0); if ( L > M ) { M=L; C=$0;} } END{ print C; }'

masavini · 11-23-2018, 09:43 AM

thanks for your help, here is how i used your hints:

Code:

$ cat loose_uniq_test.sh
function loose_uniq_test () {

  echo "1string
1str-ing
2s-tring
2str.+ing" > /tmp/list.txt

  declare -a needles=(
    $(
      sed 's/[^[:alnum:]]//g' /tmp/list.txt \
        | sort -u
    )
  )

  declare needle
  for needle in "${needles[@]}"; do
    
    needle="$(sed 's/\([[:alnum:]]\)/\1[^[:alnum:]]*/Ig' <<< "${needle}")"

    grep -i "${needle}" /tmp/list.txt \
      | awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }'

  done

  return 0
}
$ . loose_uniq_test.sh
$ loose_uniq_test
1str-ing
2str.+ing

is it possible to combine the 'grep ... | awk ...' command in a single awk command?

l0f4r0 · 11-23-2018, 10:12 AM

Here is another suggestion:

Code:

#!/bin/bash
set -o nounset

lineCounter=1

while IFS= read -r line
do
	originalWord[${lineCounter}]="${line}"
	modifiedWord[${lineCounter}]="${line//[^[:alnum:]]/}"
	(( lineCounter++ ))
done <list.txt

for (( i=1;i<lineCounter;i++ ))
do
	for (( j=1;j<lineCounter;j++ ))
	do
		[[ "${modifiedWord[$i]}" == "${modifiedWord[$j]}" ]] && (( ${#originalWord[$i]} < ${#originalWord[$j]} )) && originalWord[$i]=""
	done
done

for word in "${originalWord[@]}"
do
	if [[ "${word}" != "" ]]; then echo "${word}";fi;
done

...but I think I complicated things a little bit compared to lougavulin
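For comparison, a single-pass pure-bash sketch of the same idea, assuming bash 4 or later for associative arrays. The function name loose_uniq_bash and the choice to skip lines with no alphanumeric characters are assumptions made for this illustration, not something from the posts above.

```shell
#!/bin/bash
# loose_uniq_bash FILE  (hypothetical name; needs bash 4+ for "local -A")
loose_uniq_bash() {
  local -A best=()    # normalized key -> longest original line seen so far
  local -a order=()   # keys in first-seen order, to keep the output stable
  local line key
  while IFS= read -r line || [[ -n $line ]]; do   # also reads a last line without newline
    key="${line//[^[:alnum:]]/}"
    [[ -z $key ]] && continue                     # an empty key is not a valid subscript
    if [[ -z ${best[$key]+set} ]]; then
      order+=("$key")
      best[$key]=$line
    elif (( ${#line} > ${#best[$key]} )); then
      best[$key]=$line
    fi
  done < "$1"
  for key in "${order[@]}"; do
    printf '%s\n' "${best[$key]}"
  done
}

# Usage: loose_uniq_bash list.txt
```

Each line is looked up once by its key, so the nested loop over all pairs is not needed.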

l0f4r0 · 11-23-2018, 10:21 AM

Quote:

Originally Posted by masavini

is it possible to combine the 'grep ... | awk ...' command in a single awk command?

Yes.
grep "pattern" file | awk '{...}' --> awk '/pattern/{...}' file

So try to replace:

Code:

grep -i "${needle}" /tmp/list.txt | awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }'

with

Code:

awk -v myNeedle="$needle" 'IGNORECASE = 1;/myNeedle/{ if (length > max_length) { max_length = length; longest_line = $0 } END { print longest_line }}' /tmp/list.txt

Does it work?

masavini · 11-23-2018, 11:45 AM

Quote:

Originally Posted by l0f4r0

Yes.

Code:

awk -v myNeedle="$needle" 'IGNORECASE = 1;/myNeedle/{ if (length > max_length) { max_length = length; longest_line = $0 } END { print longest_line }}' /tmp/list.txt

Does it work?

unfortunately, no...

Code:

awk: cmd. line:1: IGNORECASE = 1;/myNeedle/{ if (length > max_length) { max_length = length; longest_line = $0 } END { print longest_line }}
awk: cmd. line:1:                                                                                                ^ syntax error

pan64 · 11-23-2018, 11:56 AM

here you can find some tips on how to solve that ignorecase issue:
https://stackoverflow.com/questions/...orecase-in-awk

l0f4r0 · 11-23-2018, 01:17 PM

Ok considering what has been said previously, try this:

Code:

awk -v myNeedle="$needle" 'tolower($0) ~ tolower(myNeedle) && length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' /tmp/list.txt
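Some background on why this form works where the earlier attempt did not: in awk, /myNeedle/ is a regex literal that matches the text "myNeedle", so a pattern held in a variable has to be applied with the ~ operator. END must also be a rule of its own, not nested inside another action, and IGNORECASE is a gawk extension, which is why tolower() is the portable route. A small sketch wrapping this up, where the helper name longest_match is hypothetical:

```shell
#!/bin/bash
# longest_match PATTERN FILE  (hypothetical helper name)
# Prints the longest line of FILE matching PATTERN, ignoring case.
longest_match() {
  awk -v pat="$1" '
    BEGIN { pat = tolower(pat) }   # lowercase the pattern once, up front
    tolower($0) ~ pat && length($0) > max { max = length($0); longest = $0 }
    END { if (max) print longest } # END is a separate top-level rule
  ' "$2"
}

# Usage: longest_match "$needle" /tmp/list.txt
```

Note that awk -v interprets backslash escapes in the assigned value, so a pattern containing backslashes would need them doubled.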

masavini · 11-24-2018, 04:01 AM

Quote:

Originally Posted by l0f4r0

Ok considering what has been said previously, try this:

Code:

awk -v myNeedle="$needle" 'tolower($0) ~ tolower(myNeedle) && length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' /tmp/list.txt

perfect solution, thanks!

this is the final function:

Code:

function loose_uniq_test () {

  echo "1string
1Str-ing
2s-tring
2str.+ing" > /tmp/list.txt

  declare -a needles=(
    $(
      sed 's/[^[:alnum:]]//g;
        s/\([[:alnum:]]\)/\1[^[:alnum:]]*/Ig' \
        /tmp/list.txt \
        | sort -u --ignore-case
    )
  )

  declare needle
  for needle in "${needles[@]}"; do

    awk \
      -v myNeedle="$needle" \
      'tolower($0) ~ myNeedle && length > len_max { len_max = length; longest_line = $0 } END { print longest_line }' \
      /tmp/list.txt

  done

  return 0
}

l0f4r0 · 11-24-2018, 12:34 PM

^ Beware: your script doesn't seem to work with words containing capital letters if there is no duplicate in lowercase.

Are

Code:

sort -u --ignore-case

and
flag "I" in your first sed
really useful because I'm not sure...

Your needles array contains unnecessary data that reduces performance.
If I were you, I would keep only lowercase patterns inside:

Code:

declare -a needles=(
	$(
sed 's/[^[:alnum:]]//g;
s/\([[:alnum:]]\)/\1[^[:alnum:]]*/g;
s/[[:upper:]]/[[:lower:]]/g' \
list.txt \
|sort -u
)
)

masavini · 11-24-2018, 03:06 PM

Quote:

Originally Posted by l0f4r0

^ Beware your script doesn't seem to work with capital words if there is no duplicate in lowercase.

thank you, great hint! i didn't notice that...

pan64 · 11-25-2018, 04:39 AM

you can convert case in sed using \U or \L. But I don't know if that helps to solve it. https://stackoverflow.com/questions/...e-to-lowercase
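A sketch of how lowercasing could fit into building the needles. One thing to keep in mind is that in the replacement part of a sed s command, [[:lower:]] is inserted as literal text rather than converting anything, so a real case conversion needs \L (GNU sed only) or tr. The helper names below are hypothetical, and dropping lines that have no alphanumeric characters is an assumption.

```shell
#!/bin/bash
# Both helpers read lines on stdin and write them lowercased to stdout.
lower_gnu_sed()  { sed 's/.*/\L&/'; }             # \L is a GNU sed extension
lower_portable() { tr '[:upper:]' '[:lower:]'; }  # POSIX tr, works everywhere

# make_needles FILE  (hypothetical helper)
# Prints one lowercase pattern per distinct alphanumeric key in FILE.
make_needles() {
  sed 's/[^[:alnum:]]//g; /^$/d' "$1" \
    | lower_portable \
    | sort -u \
    | sed 's/[[:alnum:]]/&[^[:alnum:]]*/g'   # allow junk after every kept character
}

# Usage: make_needles /tmp/list.txt
```

Lowercasing before sort -u means '1string' and '1String' collapse into a single needle, which can then be matched against tolower($0) in awk.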

l0f4r0 · 11-26-2018, 08:37 AM

^ Interesting pan64! Didn't know about \L and \U.
Actually my suggestion in #11 about the needles array should already solve the problem.