[SOLVED] Regular Expression Problem

rm_-rf_windows · 03-07-2012, 11:11 AM

Hi,

I'm trying to convert a text-file dictionary into an .sql file so that I can load it into a database table. The table will contain the word, its parts of speech, and whether or not it is a lemma (the dictionary form of the word). There are several parts of speech; to simplify, let's say there are four: a, d, n, v for adjective, adverb, noun and verb respectively. For a given word I might have (one entry per line in the text file):

Code:

word a a d v v d v n
word2 v v v v v
word3 v v d v n n v d n

which I'd like to convert into:

Code:

word adnv
word2 v
word3 dnv

i.e., in alphabetical order and getting rid of repetitions.

I've been using sed and learning a lot, but I've realized that it's not that easy: repeated letters can be separated by other letters, so identical letters are not necessarily grouped together. I'm therefore stuck! There are 250,000 words.

Any takers?

Thanks.

sundialsvcs · 03-07-2012, 12:14 PM

Well, if I were doing this in Perl, I would use split() to separate the string by spaces, perhaps after using a regular expression to reduce any doubled spaces to singles.

Then, I would shift the zeroth entry (the word) from the resulting list, and I would initialize an empty hashref.

Next, I would again shift the remaining entries one-at-a-time until I ran out of entries (undef), and for each string, assign a hash-entry with a value of "1" thereby eliminating dupes. Now, once I've done that, join(sort keys) to create the string.

I can name that tune:

Code:

#!/usr/bin/env perl
use strict; 
use warnings; 
my $wd="word a a d v v d v n"; 
$wd =~ s/\s+/ /g;                 # COMPRESS RUNS OF WHITESPACE TO A SINGLE BLANK
my @list=split(" ", $wd);         # SPLIT BY BLANKS
my $w=shift @list;                # GLOM FIRST WORD
my $hash={};                      # SET UP HASH TO RECEIVE KEYS
while (defined(my $w2=shift @list))   # defined() SO THAT A "0" ENTRY DOESN'T END THE LOOP
  { 
     $$hash{$w2}=1;    # VALUE DOESN'T MATTER. WE ONLY WANT THE KEYS.
  } 
print $w . " " . join("", sort keys %$hash) . "\n";

Expanding this to read from a file and to print output to another file (or STDOUT) is a trivial exercise for the reader.

whizje · 03-07-2012, 02:42 PM

An example in bash.

Code:

bash-4.1# temp='word a a d v v  d v n' ; echo ${temp%% *}' ' | tr -d '\n' ; echo $temp | grep -o [[^:alpha:]]*.* | \
grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo   
word adnv

temp='word a a d v v d v n'    save the string in temp
echo ${temp%% *}               write the first word of the string (word)
' '                            write a space after the first word
| tr -d '\n'                   remove the newline
echo $temp |                   feed the string to grep
grep -o [[^:alpha:]]*.* | \    option -o prints only the part of the string that matches
[[^:alpha:]]*                  intended to skip the first word
.* |                           feed the letters to the next grep
grep -o [[:alpha:]] |          put a linefeed between the letters for sort
sort |                         sort the letters
uniq |                         remove duplicate letters
tr -d '\n'                     remove the newlines (\n)
;echo                          add one final newline

romagnolo · 03-07-2012, 08:39 PM

Hi,
I have exactly what you are looking for:

Code:

#!/bin/bash
# Database swiffer

db_dirt=$1
db_clean=$2

: >"$db_clean"

while read l; do
    [ "$l" ] || continue
    word=$(echo "$l" | grep -Eo '^ *[[:graph:]]+')
    attribute=${l#"$word"}

    echo "Working on: $word"

    best=$attribute
    while :; do
        cleaner=$(echo "$best" | sed -nr 's/(.*)([[:graph:]]) (.*)\2/\1\2\3/gp')
        [ "$cleaner" ] && best=$cleaner || break
    done

    best="$(echo $best | sed -r '{s/ //g
s/([[:graph:]])/\1\n/g
}'  | sort | xargs echo -n | sed -r 's/ //g')"

    echo "$word" "$best" >>"$db_clean"

done <"$db_dirt"

Example:
Save the program in a file named swiffer.sh, then:

Code:

./swiffer.sh spurious_database clean_database

It will process your whole database and write the cleaned version to clean_database.

For further assistance you can contact me by e-mail via the user menu.

rm_-rf_windows · 03-08-2012, 11:25 AM

Thank you all for your response.

I've chosen whizje's solution because it was very short. I eventually got it working after encountering some problems with [[:alpha:]], which was hit and miss. I changed it to [a-zA-Z], which should be equivalent, but for some reason I got better results.

Thanks whizje, for paving a small part of my rough road... which I hope leads to the stars!

rm
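
One thing that can make an unquoted [[:alpha:]] behave inconsistently, independent of grep itself: the shell treats it as a filename pattern first, so if the current directory happens to contain a file with a one-letter name, grep receives that filename instead of the bracket expression. [a-zA-Z] is affected in the same way. Whether this is what happened here is only a guess; the sketch below just demonstrates the effect in a scratch directory (the file name v is arbitrary and not from the thread):

```shell
#!/bin/bash
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

touch v    # a file with a one-letter name (arbitrary choice for the demo)

# Unquoted: the shell expands [[:alpha:]] to the filename "v" before grep runs.
unquoted=$(echo 'a v d' | grep -o [[:alpha:]] | tr -d '\n')
# Quoted: the bracket expression reaches grep intact.
quoted=$(echo 'a v d' | grep -o '[[:alpha:]]' | tr -d '\n')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

cd / && rm -rf "$scratch"
```

Quoting the pattern in single quotes avoids the problem regardless of what is in the directory.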

rm_-rf_windows · 03-08-2012, 03:03 PM

Hi again all,

I thought I found the solution, but have been going nuts trying to figure this out for hours. Sometimes it works and sometimes it doesn't, and I can't figure out why!

Code:

$ unset temp
$ temp="hdflkjkj l i j i e f e"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
defhijkl 
#this didn't work, took into account "hdflkjkj"

$ temp="word l i f j e i f"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
efijl 
#worked! didn't take into account "word" 

$ temp="word l i f j e i d"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
defijl
#worked! despite the common letter in both "word" and "i f j e i d"!

WHY??!!!!

Ughhhhhhhhh!!!

whizje · 03-08-2012, 03:34 PM

change

Code:

echo $temp to echo ${temp%% *}' '

you forgot this line which prints the first word

Code:

 echo ${temp%% *}' ' | tr -d '\n' ;

whizje · 03-08-2012, 03:37 PM

But it is easier if you convert it to a script.

whizje · 03-08-2012, 04:04 PM

Strange: when the word starts with a letter lower than those in the letter list, it isn't suppressed.
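
The alphabetical position of the first letter is probably not the cause. Under POSIX bracket-expression rules, [[^:alpha:]] is not a negated character class: it parses as the set of characters [ ^ : a l p h followed by a literal ], so [[^:alpha:]]*.* matches from the first of those characters to the end of the line. A first word containing a, l, p or h is therefore not skipped. A minimal sketch that reproduces the three test strings posted earlier (the helper name tail_match is made up for illustration):

```shell
#!/bin/bash
# tail_match is a hypothetical helper: it prints what the first grep
# of the one-liner hands on to the rest of the pipeline.
tail_match() {
    echo "$1" | grep -o '[[^:alpha:]]*.*'
}

tail_match 'hdflkjkj l i j i e f e'   # "h" is in the set, so the whole line is kept
tail_match 'word l i f j e i f'       # first hit is the "l" after "word"
tail_match 'word l i f j e i d'       # same: "word" has none of a, l, p, h
```

This is why the later fix with ${temp#* }, which cuts the first word in the shell instead of in grep, does not depend on which letters the word contains.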

whizje · 03-08-2012, 04:08 PM

I'll work on a version with sed, which should work better for this sort of problem.

PTrenholme · 03-08-2012, 04:12 PM

There are other tools besides sed and perl for solving this type of problem. Here's a gawk solution:

Code:

$ cat prep.dat # Test data
word a a d v v d v n
word2 v v v v v
word3 v v d v n n v d n

$ cat prep.gawk # The program (If you copy it, verify the quotes: they are sometimes unicode quotes.)
#!/usr/bin/gawk -f
{
  word=$1
  for (i=2;i<=NF;++i) {
    ++class[$i]
  }
  printf("%s",word)
  n=asorti(class,sorted)
  for (i=1;i<=n;++i) {
    printf(" %s", sorted[i])
  }
  printf("\n")
  delete class
  delete sorted
}

$ chmod +x prep.gawk # Make the program executable

$ ./prep.gawk prep.dat # Run it using the test data as input
word a d n v
word2 v
word3 d n v

<edit>
I don't know if you'd need it, but printing class[sorted[i]] with sorted[i] would give you the number of times the word was assigned to each class.

Also, if you gave your table a unique index on word,class and selected using ORDER BY, the sorting would be done inside the database, and the duplicates would be removed automatically when you load the data into the table. So preparing the data this way is probably unnecessary.
</edit>

rm_-rf_windows · 03-08-2012, 04:15 PM

Nope, still doesn't work, not on my end.

But I like the conciseness of your syntax... A pity I can't get it working!

whizje · 03-08-2012, 06:25 PM

Pff, fixed: I replaced the faulty word removal with a bash string operation.

Code:

bash-4.1$ temp='hdflkjkj l i j i e f e' ; echo ${temp%% *}' ' | tr -d '\n' ;echo ${temp#* } | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo 
hdflkjkj efijl
bash-4.1$ temp='word a a d v v  d v n' ; echo ${temp%% *}' ' | tr -d '\n' ;echo ${temp#* } | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
word adnv

whizje · 03-08-2012, 07:05 PM

script version

Code:

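#!/bin/bash
# Usage: wlsort WORD LETTER [LETTER...]
temp=$*
# Print the first word followed by one space, without a newline.
printf '%s ' "${temp%% *}"
# Drop the first word, put each remaining letter on its own line, sort, deduplicate.
echo "${temp#* }" | grep -o '[[:alpha:]]' | sort -u | tr -d '\n'
echo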

Remember to make the script executable.

Code:

chmod +x wlsort

example

Code:

bash-4.1$ wlsort word a a d v v  d v nmc i o p
word acdimnopv
bash-4.1$ wlsort eindelijk t e z z a b c y r y
eindelijk abcertyz
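
To run this over the whole dictionary file rather than one line at a time, one option is a loop in the same style. A minimal sketch, assuming one entry per line with the word first; the function name dedupe_tags is a placeholder and not part of the thread:

```shell
#!/bin/bash
# dedupe_tags (hypothetical name): read "word letter letter ..." lines on
# stdin and print each word followed by its sorted, de-duplicated letters.
dedupe_tags() {
    local word rest
    while read -r word rest; do
        [ -n "$word" ] || continue        # skip blank lines
        printf '%s ' "$word"
        # One letter per line, sort, drop repeats, glue back together.
        printf '%s\n' "$rest" | grep -o '[[:alpha:]]' | sort -u | tr -d '\n'
        echo
    done
}

# Demo on the sample data from the first post.
printf '%s\n' 'word a a d v v d v n' 'word2 v v v v v' 'word3 v v d v n n v d n' | dedupe_tags
```

For a real file it would be called with redirections, reading the dictionary on stdin and writing the result to a new file. Starting three processes per line will be slow for 250,000 entries, so a single-process tool such as the gawk program may be the better fit at that size.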

romagnolo · 03-08-2012, 10:27 PM

Albert, I wrote that script only for you. It does exactly that.
Use it.