LinuxQuestions.org
Old 03-07-2012, 11:11 AM   #1
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Rep: Reputation: 27
Regular Expression Problem


Hi,

I'm trying to convert a text-file dictionary into an .sql file so that I can load it into a database table. The table will contain the word, the part of speech, and whether or not it is a lemma (the dictionary form of the word). There are several parts of speech; to simplify, let's say there are four: a, d, n, v for adjective, adverb, noun and verb respectively. For a given word I might have (one entry per line in the text file):

Code:
word a a d v v d v n
word2 v v v v v
word3 v v d v n n v d n
which I'd like to convert into:
PHP Code:
word adnv
word2 v
word3 dnv 
i.e., in alphabetical order and getting rid of repetitions.

I've been using sed and learning a lot, but have realized that it's not that easy: repeated letters can be separated by other letters, so identical letters are not necessarily grouped together. I'm therefore stuck! There are 250,000 words.

Any takers?

Thanks.
 
Old 03-07-2012, 12:14 PM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3940
Well, if I were doing this in Perl, I would use split() to separate the string by spaces, perhaps after using a regular expression to reduce any doubled spaces to singles.

Then, I would shift the zeroth entry (the word) from the resulting list, and I would initialize an empty hashref.

Next, I would again shift the remaining entries one-at-a-time until I ran out of entries (undef), and for each string, assign a hash-entry with a value of "1" thereby eliminating dupes. Now, once I've done that, join(sort keys) to create the string.

I can name that tune:
Code:
#!/usr/bin/env perl
use strict;
use warnings;
my $wd = "word a a d v v d v n";
$wd =~ s/\s+/ /g;                 # COMPRESS RUNS OF WHITESPACE TO ONE BLANK
my @list = split(" ", $wd);       # SPLIT BY BLANKS
my $w = shift @list;              # GLOM FIRST WORD
my $hash = {};                    # SET UP HASH TO RECEIVE KEYS
while (defined(my $w2 = shift @list))   # defined(): AN ENTRY OF "0" MUST NOT END THE LOOP
  {
     $$hash{$w2} = 1;    # VALUE DOESN'T MATTER. WE ONLY WANT THE KEYS.
  }
print $w . " " . join("", sort keys %$hash) . "\n";
Expanding this to read from a file and to print output to another file (or STDOUT) is a trivial exercise for the reader.
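For readers who want to see the file-reading step spelled out, here is a minimal sketch in shell rather than Perl. It swaps the hash for `sort -u`, which gives the same "unique, sorted" result. The function name `dedupe_file` is made up for illustration, and the sketch assumes the tags contain no shell glob characters. It starts three small processes per input line, so on 250,000 lines it will probably be much slower than a single Perl or awk process.

```shell
#!/bin/sh
# Sketch: read "word tag tag ..." lines on stdin, write "word sortedtags".
# Assumption: tags contain no shell glob characters (*, ?, [).
dedupe_file() {
    while read -r word rest; do
        [ -n "$word" ] || continue      # skip blank lines
        # $rest is unquoted on purpose so that it splits into one tag per line
        tags=$(printf '%s\n' $rest | LC_ALL=C sort -u | tr -d '\n')
        printf '%s %s\n' "$word" "$tags"
    done
}

printf 'word a a d v v d v n\nword3 v v d v n n v d n\n' | dedupe_file
```

The demo line at the end prints "word adnv" and "word3 dnv". For a real file, something like `dedupe_file < dictionary.txt > cleaned.txt` should work, where both file names are placeholders.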
 
Old 03-07-2012, 02:42 PM   #3
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
An example in bash.
Code:
bash-4.1# temp='word a a d v v  d v n' ; echo ${temp%% *}' ' | tr -d '\n' ; echo $temp | grep -o [[^:alpha:]]*.* | \
grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo   
word adnv

temp='word a a d v v d v n'    save the string in temp
echo ${temp%% *}               write the first word of the string (word)
' '                            write a space after the first word
| tr -d '\n'                   remove the newline
echo $temp |                   feed the string to grep
grep -o [[^:alpha:]]*.* | \    option -o prints only the part of the string that matches
[[^:alpha:]]*                  dismiss the first word
.* |                           feed the letters to the next grep
grep -o [[:alpha:]] |          put a linefeed between the letters for sort
sort |                         sort the letters
uniq |                         remove duplicate letters
tr -d '\n'                     remove the newlines (\n)
;echo                          add one newline at the end
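The letter-handling half of that pipeline can be tried on its own. The following is only a sketch: the helper name `uniq_letters` is hypothetical, the pattern is quoted so the shell passes it to grep untouched, and `LC_ALL=C` is added to make the sort order predictable.

```shell
#!/bin/sh
# Hypothetical helper: reduce a string to its distinct letters, sorted.
uniq_letters() {
    printf '%s\n' "$1" |
        grep -o '[[:alpha:]]' |   # one letter per line
        LC_ALL=C sort |           # identical letters become adjacent
        uniq |                    # collapse each run to a single line
        tr -d '\n'                # glue the letters back together
    echo                          # final newline
}

uniq_letters 'a a d v v  d v n'
```

The last line prints "adnv". `sort | uniq` could be shortened to `sort -u` with the same result.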
 
Old 03-07-2012, 08:39 PM   #4
romagnolo
Member
 
Registered: Jul 2009
Location: Montaletto
Distribution: Debian GNU/Linux
Posts: 107

Rep: Reputation: 5
Hi,
I have exactly what you are looking for:

Code:
#!/bin/bash
# Database swiffer

db_dirt=$1
db_clean=$2

: >"$db_clean"

while read l; do
    [ "$l" ] || continue
    word=$(echo "$l" | grep -Eo '^ *[[:graph:]]+')
    attribute=${l#"$word"}

    echo "Working on: $word"

    best=$attribute
    while :; do
        cleaner=$(echo "$best" | sed -nr 's/(.*)([[:graph:]]) (.*)\2/\1\2\3/gp')
        [ "$cleaner" ] && best=$cleaner || break
    done

    best="$(echo $best | sed -r '{s/ //g
s/([[:graph:]])/\1\n/g
}'  | sort | xargs echo -n | sed -r 's/ //g')"

    echo "$word" "$best" >>"$db_clean"

done <"$db_dirt"
Example:
Save the program in a file named swiffer.sh, then:

Code:
./swiffer.sh spurious_database clean_database
It will process your whole database and write the cleaned version to clean_database.

For further assistance you can contact me via the e-mail link in the user menu.

Last edited by romagnolo; 03-07-2012 at 08:50 PM.
 
Old 03-08-2012, 11:25 AM   #5
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Thank you all for your response.

I've chosen Whizje's solution because it was very short. I eventually got it working after encountering some problems with [[:alpha:]], which was hit and miss. I changed it to [a-zA-Z], which should be equivalent, but for some reason I got better results.

Thanks, Whizje, for paving a small part of my rough road... which I hope leads to the stars!

rm
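One thing that can make an unquoted [[:alpha:]] look hit and miss is shell pathname expansion: if the current directory happens to contain a file whose name is a single letter, the shell replaces the pattern with that file name before grep ever sees it. Whether that is what happened here cannot be told from the thread, so the sketch below only illustrates the effect. The function name `demo_glob` and the scratch directory are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: show how a stray one-letter file name changes an unquoted pattern.
demo_glob() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        : > x                                   # a file whose name is a single letter
        # Unquoted: the shell replaces [[:alpha:]] with the file name "x"
        printf '%s\n' 'x y z' | grep -o [[:alpha:]] | tr -d '\n'; echo
        # Quoted: grep receives the bracket expression unchanged
        printf '%s\n' 'x y z' | grep -o '[[:alpha:]]' | tr -d '\n'; echo
    )
    rm -rf "$dir"
}

demo_glob
```

With the file x present, the unquoted command prints "x" and the quoted one prints "xyz". Quoting the pattern avoids the problem regardless of what is in the directory.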
 
Old 03-08-2012, 03:03 PM   #6
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Hi again all,

I thought I found the solution, but have been going nuts trying to figure this out for hours. Sometimes it works and sometimes it doesn't, and I can't figure out why!

Code:
$ unset temp
$ temp="hdflkjkj l i j i e f e"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
defhijkl 
#this didn't work, took into account "hdflkjkj"

$ temp="word l i f j e i f"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
efijl 
#worked! didn't take into account "word" 

$ temp="word l i f j e i d"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
defijl
#worked! despite the common letter in both "word" and "i f j e i d"!
WHY??!!!!

Ughhhhhhhhh!!!
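A likely explanation, consistent with all three outputs above: [[^:alpha:]] is not a negated character class. The regex parser reads `[[^:alph]` as a bracket expression containing the literal characters [ ^ : a l p h, followed by a literal `]`. The whole pattern therefore means "one of those characters, any number of `]`, then the rest of the line", and grep -o prints from the first a, l, p or h onward. "word" contains none of those letters, so it is skipped by accident; "hdflkjkj" starts with h, so the whole line gets through. The sketch below isolates the first grep; `first_grep` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: show what the first grep of the pipeline lets through.
first_grep() {
    printf '%s\n' "$1" | LC_ALL=C grep -o '[[^:alpha:]]*.*'
}

first_grep 'hdflkjkj l i j i e f e'   # whole line: "h" is in the bracket set
first_grep 'word l i f j e i f'       # match starts at the first "l"
```

The first call prints the whole line and the second prints "l i f j e i f".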
 
Old 03-08-2012, 03:34 PM   #7
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
change
Code:
echo $temp to echo ${temp%% *}' '
you forgot this line which prints the first word
Code:
 echo ${temp%% *}' ' | tr -d '\n' ;

Last edited by whizje; 03-08-2012 at 03:40 PM. Reason: clarification
 
Old 03-08-2012, 03:37 PM   #8
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
But it is easier if you convert it to a script.
 
Old 03-08-2012, 04:04 PM   #9
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
Strange: when the word starts with a letter lower than those in the letter list, it isn't suppressed.
 
Old 03-08-2012, 04:08 PM   #10
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
I'll work on a version with sed, which should work better for this sort of problem.
 
Old 03-08-2012, 04:12 PM   #11
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
There are other tools besides sed and perl for solving this type of problem. Here's a gawk solution:
Code:
$ cat prep.dat # Test data
word a a d v v d v n
word2 v v v v v
word3 v v d v n n v d n

$ cat prep.gawk # The program (If you copy it, verify the quotes: they are sometimes unicode quotes.)
#!/usr/bin/gawk -f
{
  word=$1
  for (i=2;i<=NF;++i) {
    ++class[$i]
  }
  printf("%s",word)
  n=asorti(class,sorted)
  for (i=1;i<=n;++i) {
    printf(" %s", sorted[i])
  }
  printf("\n")
  delete class
  delete sorted
}

$ chmod +x prep.gawk # Make the program executable

$ ./prep.gawk prep.dat # Run it using the test data as input
word a d n v
word2 v
word3 d n v
<edit>
I don't know if you'd need it, but printing class[sorted[i]] along with sorted[i] would give you the number of times the word was assigned to each class.

Also, if you gave your table a unique index on (word, class) and selected using ORDER BY, the sorting would be done inside the database, and the duplicates would be automatically removed when you load the data into the table. So preparing the data this way is probably unnecessary.
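To make the one-row-per-(word, class) idea concrete, here is a small sketch that flattens the dictionary into tab-separated pairs suitable for a bulk load. The function name `pairs` and the tab-separated layout are assumptions, not something from the thread. How a database treats duplicate rows on load depends on the database and the load command, so this sketch drops them beforehand with `sort -u`.

```shell
#!/bin/sh
# Sketch: turn "word tag tag ..." lines into unique "word<TAB>tag" rows.
pairs() {
    # awk prints one row per tag; sort -u orders the rows and removes repeats
    awk '{ for (i = 2; i <= NF; i++) print $1 "\t" $i }' "$@" | LC_ALL=C sort -u
}

printf 'word a a d v\nword2 v v\n' | pairs
```

The demo prints four rows: word/a, word/d, word/v and word2/v. With a file name argument, as in `pairs prep.dat`, the function reads that file instead of standard input.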
</edit>

Last edited by PTrenholme; 03-08-2012 at 04:22 PM.
 
Old 03-08-2012, 04:15 PM   #12
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Nope, still doesn't work, not on my end.

But I like the conciseness of your syntax... A pity I can't get it working!
 
Old 03-08-2012, 06:25 PM   #13
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
Pff, fixed. I replaced the faulty word removal with a bash string operation.
Code:
bash-4.1$ temp='hdflkjkj l i j i e f e' ; echo ${temp%% *}' ' | tr -d '\n' ;echo ${temp#* } | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo 
hdflkjkj efijl
bash-4.1$ temp='word a a d v v  d v n' ; echo ${temp%% *}' ' | tr -d '\n' ;echo ${temp#* } | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
word adnv
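The two string operations used in that fix can be looked at in isolation. This is just an illustrative sketch; the variable names `first` and `rest` are not from the thread.

```shell
#!/bin/sh
# Sketch: split a line into its first word and the remainder
# using only parameter expansion.
temp='hdflkjkj l i j i e f e'
first=${temp%% *}   # strip the longest suffix matching " *": leaves the first word
rest=${temp#* }     # strip the shortest prefix matching "* ": leaves what follows the first blank
printf '%s|%s\n' "$first" "$rest"
```

This prints "hdflkjkj|l i j i e f e". Because the word is removed by the shell and never goes through a regular expression, its letters cannot leak into the tag list.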
 
Old 03-08-2012, 07:05 PM   #14
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
Script version:
Code:
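#!/bin/bash
# Usage: wlsort WORD LETTER...
temp=$*                        # join all arguments with single spaces
printf '%s ' "${temp%% *}"     # print the word and a space, without a newline
# One letter per line, sorted, duplicates dropped, then joined again.
printf '%s\n' "${temp#* }" | grep -o '[[:alpha:]]' | sort -u | tr -d '\n'
echo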
Remember to make the script executable.
Code:
chmod +x wlsort
example
Code:
bash-4.1$ wlsort word a a d v v  d v nmc i o p
word acdimnopv
bash-4.1$ wlsort eindelijk t e z z a b c y r y
eindelijk abcertyz
 
Old 03-08-2012, 10:27 PM   #15
romagnolo
Member
 
Registered: Jul 2009
Location: Montaletto
Distribution: Debian GNU/Linux
Posts: 107

Rep: Reputation: 5
Albert, I wrote that script only for you. It does exactly that.
Use it.
 
  

