LinuxQuestions.org
Old 11-23-2018, 07:41 AM   #1
masavini
Member
 
Registered: Jun 2008
Posts: 285

Rep: Reputation: 6
uniq ignoring non [[:alnum:]] chars...


hi,
is there a bash way to make a list of strings unique ignoring non [[:alnum:]] chars, and returning the longest strings?

i mean something like this:
Code:
$ cat list.txt
1string
1str-ing
2s-tring
2str.+ing
$ elastic_uniq list.txt
1str-ing
2str.+ing
in this example the strings '1string' and '2s-tring' are purged because they're equivalent, respectively, to '1str-ing' and '2str.+ing' and they're shorter than their "synonyms".

thanks!

Last edited by masavini; 11-23-2018 at 07:44 AM.
 
Old 11-23-2018, 08:19 AM   #2
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
I don't know how to return the longest string, but here is a solution for the other aspects of your problem:
Code:
sed 's/[^[:alnum:]]//g' list.txt | sort -u
EDIT: maybe it's not totally what you're looking for because this command can sometimes return data that was not initially in your file (depending on the substitutions)...
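
To make that caveat concrete, here is a small sketch using the sample lines from post #1 (fed through printf instead of list.txt; the function name strip_and_uniq is made up for illustration). The pipeline prints the stripped keys rather than lines from the input, so "2string" appears even though no input line reads that way:

```shell
# Delete every non-alphanumeric character, then deduplicate the results.
strip_and_uniq() {
  sed 's/[^[:alnum:]]//g' "$@" | sort -u
}

printf '%s\n' '1string' '1str-ing' '2s-tring' '2str.+ing' | strip_and_uniq
# prints:
# 1string
# 2string
```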

Last edited by l0f4r0; 11-23-2018 at 08:23 AM.
 
Old 11-23-2018, 09:13 AM   #3
lougavulin
Member
 
Registered: Jul 2018
Distribution: Slackware,x86_64,current
Posts: 279

Rep: Reputation: 100
That will keep the longest of all after removing synonyms:
Code:
cat file | tr -dc '[:alnum:]\n' | sort -u | awk '{ L=length($0); if ( L > M ) { M=L; C=$0;} } END{ print C; }'
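
The pipeline above yields a single line for the whole file. If one line per group of synonyms is wanted, as in post #1, a one-pass awk sketch could look like the following. elastic_uniq is the hypothetical command name from the question, and the tie-breaking rule (the first line seen wins when lengths are equal) is an assumption, since the thread does not say what should happen in that case:

```shell
# Print the longest original line for each alphanumeric-only key,
# in the order the keys first appear.
elastic_uniq() {
  awk '{
    key = $0
    gsub(/[^[:alnum:]]/, "", key)              # key = line without punctuation
    if (!(key in best)) order[++n] = key       # remember first-seen order
    if (length($0) > length(best[key])) best[key] = $0
  }
  END { for (i = 1; i <= n; i++) print best[order[i]] }' "$@"
}

printf '%s\n' '1string' '1str-ing' '2s-tring' '2str.+ing' | elastic_uniq
# prints:
# 1str-ing
# 2str.+ing
```

This compares keys case-sensitively; wrapping the key in tolower() would be one way to ignore case as well.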
 
Old 11-23-2018, 09:43 AM   #4
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
thanks for your help, here is how i used your hints:

Code:
$ cat loose_uniq_test.sh
function loose_uniq_test () {

  echo "1string
1str-ing
2s-tring
2str.+ing" > /tmp/list.txt

  declare -a needles=(
    $(
      sed 's/[^[:alnum:]]//g' /tmp/list.txt \
        | sort -u
    )
  )

  declare needle
  for needle in "${needles[@]}"; do
    
    needle="$(sed 's/\([[:alnum:]]\)/\1[^[:alnum:]]*/Ig' <<< "${needle}")"

    grep -i "${needle}" /tmp/list.txt \
      | awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }'

  done

  return 0
}
$ . loose_uniq_test.sh
$ loose_uniq_test
1str-ing
2str.+ing
is it possible to combine the 'grep ... | awk ...' command in a single awk command?
 
Old 11-23-2018, 10:12 AM   #5
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
Here is another suggestion:
Code:
#!/bin/bash
set -o nounset

lineCounter=1

while IFS= read -r line
do
	originalWord[${lineCounter}]="${line}"
	modifiedWord[${lineCounter}]="${line//[^[:alnum:]]/}"
	(( lineCounter++ ))
done <list.txt

# lineCounter is one past the last index, so loop while i<lineCounter
for (( i=1;i<lineCounter;i++ ))
do
	for (( j=1;j<lineCounter;j++ ))
	do
		[[ "${modifiedWord[$i]}" == "${modifiedWord[$j]}" ]] && (( ${#originalWord[$i]} < ${#originalWord[$j]} )) && originalWord[$i]=""
	done
done

for word in "${originalWord[@]}"
do
	if [[ "${word}" != "" ]]; then echo "${word}";fi;
done
...but I think I complicated things a little bit compared to lougavulin
 
Old 11-23-2018, 10:21 AM   #6
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
Quote:
Originally Posted by masavini
is it possible to combine the 'grep ... | awk ...' command in a single awk command?
Yes.
grep "pattern" file | awk '{...}' --> awk '/pattern/{...}' file

So try to replace:
Code:
grep -i "${needle}" /tmp/list.txt | awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }'
with
Code:
awk -v myNeedle="$needle" 'IGNORECASE = 1;/myNeedle/{ if (length > max_length) { max_length = length; longest_line = $0 } END { print longest_line }}' /tmp/list.txt
Does it work?
 
Old 11-23-2018, 11:45 AM   #7
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by l0f4r0
Yes.
Code:
awk -v myNeedle="$needle" 'IGNORECASE = 1;/myNeedle/{ if (length > max_length) { max_length = length; longest_line = $0 } END { print longest_line }}' /tmp/list.txt
Does it work?
unfortunately, no...
Code:
awk: cmd. line:1: IGNORECASE = 1;/myNeedle/{ if (length > max_length) { max_length = length; longest_line = $0 } END { print longest_line }}
awk: cmd. line:1:                                                                                                ^ syntax error
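
Two things go wrong in that command, independently of IGNORECASE (which is a gawk-only variable): END { ... } is nested inside the action's braces, where awk does not accept a pattern, and /myNeedle/ is a regex literal that searches for the text "myNeedle" rather than the contents of the variable. A minimal sketch of the second point, with a made-up needle and input and hypothetical function names; a variable is matched with $0 ~ var:

```shell
needle='1[^[:alnum:]]*s'   # made-up pattern for illustration

# Regex literal: looks for the characters "myNeedle", ignoring the variable.
match_literal() {
  awk -v myNeedle="$needle" '/myNeedle/'
}

# Dynamic regex: the value of the variable is used as the pattern.
match_variable() {
  awk -v myNeedle="$needle" '$0 ~ myNeedle'
}

printf '%s\n' '1string' '1-string' 'myNeedle' | match_literal    # prints: myNeedle
printf '%s\n' '1string' '1-string' 'myNeedle' | match_variable   # prints: 1string and 1-string
```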
 
Old 11-23-2018, 11:56 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,892

Rep: Reputation: 7317
here you can find some tips on how to solve that ignorecase issue:
https://stackoverflow.com/questions/...orecase-in-awk
 
Old 11-23-2018, 01:17 PM   #9
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
Ok considering what has been said previously, try this:
Code:
awk -v myNeedle="$needle" 'tolower($0) ~ tolower(myNeedle) && length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' /tmp/list.txt

Last edited by l0f4r0; 11-23-2018 at 01:27 PM.
 
Old 11-24-2018, 04:01 AM   #10
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by l0f4r0
Ok considering what has been said previously, try this:
Code:
awk -v myNeedle="$needle" 'tolower($0) ~ tolower(myNeedle) && length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' /tmp/list.txt
perfect solution, thanks!

this is the final function:
Code:
function loose_uniq_test () {

  echo "1string
1Str-ing
2s-tring
2str.+ing" > /tmp/list.txt

  declare -a needles=(
    $(
      sed 's/[^[:alnum:]]//g;
        s/\([[:alnum:]]\)/\1[^[:alnum:]]*/Ig' \
        /tmp/list.txt \
        | sort -u --ignore-case
    )
  )

  declare needle
  for needle in "${needles[@]}"; do

    awk \
      -v myNeedle="$needle" \
      'tolower($0) ~ myNeedle && length > len_max { len_max = length; longest_line = $0 } END { print longest_line }' \
      /tmp/list.txt

  done

  return 0
}
 
Old 11-24-2018, 12:34 PM   #11
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
^ Beware your script doesn't seem to work with capital words if there is no duplicate in lowercase.
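
A small sketch of that failure mode, with a made-up needle and input and hypothetical function names: the final function lowercases the line but not the needle, so a needle that kept a capital letter can never match. Lowercasing both sides, as in post #9, avoids it (this is safe here because the POSIX class names inside the needle are already lowercase):

```shell
needle='A[^[:alnum:]]*b'   # made-up needle that still contains a capital

# Lowercases only the line, as the final function does.
match_line_only() {
  awk -v myNeedle="$needle" 'tolower($0) ~ myNeedle'
}

# Lowercases both the line and the needle.
match_both() {
  awk -v myNeedle="$needle" 'tolower($0) ~ tolower(myNeedle)'
}

printf '%s\n' 'A-b' | match_line_only   # prints nothing
printf '%s\n' 'A-b' | match_both        # prints: A-b
```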

Are
Code:
sort -u --ignore-case
and
the "I" flag in your sed substitution
really useful? I'm not sure they are...

Your needles array contains unnecessary data that reduces performance.
If I were you, I would keep only lowercase patterns inside:
Code:
# tr lowercases the patterns; a sed replacement of [[:lower:]] would be inserted literally
declare -a needles=(
	$(
sed 's/[^[:alnum:]]//g;
s/\([[:alnum:]]\)/\1[^[:alnum:]]*/g' \
list.txt \
| tr '[:upper:]' '[:lower:]' \
| sort -u
)
)

Last edited by l0f4r0; 11-24-2018 at 12:40 PM.
 
Old 11-24-2018, 03:06 PM   #12
masavini
Member
 
Registered: Jun 2008
Posts: 285

Original Poster
Rep: Reputation: 6
Quote:
Originally Posted by l0f4r0
^ Beware your script doesn't seem to work with capital words if there is no duplicate in lowercase.
thank you, great hint! i didn't notice that...
 
Old 11-25-2018, 04:39 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,892

Rep: Reputation: 7317
you can convert case in sed using \U or \L. But I don't know if that helps to solve it. https://stackoverflow.com/questions/...e-to-lowercase
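
For reference, a minimal sketch of that idea. \L and \U are GNU sed extensions rather than POSIX, and & stands for the matched text; the function names and sample string are made up for illustration, and tr is shown as a portable alternative:

```shell
# Lowercase a whole line with GNU sed's \L case-conversion escape.
to_lower_sed() {
  sed 's/.*/\L&/'
}

# Portable alternative using tr.
to_lower_tr() {
  tr '[:upper:]' '[:lower:]'
}

printf '%s\n' '1Str-ING' | to_lower_sed   # prints: 1str-ing
printf '%s\n' '1Str-ING' | to_lower_tr    # prints: 1str-ing
```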
 
Old 11-26-2018, 08:37 AM   #14
l0f4r0
Member
 
Registered: Jul 2018
Location: Paris
Distribution: Debian
Posts: 900

Rep: Reputation: 290
^ Interesting pan64! Didn't know about \L and \U.
Actually my suggestion in #11 about the needles array should already solve the problem.
 
  

