Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux related and doesn't seem to fit in any other forum, then this is the place.
I'm trying to convert a text-file dictionary into an .sql file so I can load it into a database table. The table will contain the word, the part of speech, and whether or not it is a lemma (the dictionary form of the word). There are several parts of speech; to simplify, let's say there are 4: a, d, n, v for adjective, adverb, noun and verb respectively. For a given word I might have (one entry per line in the text file):
Code:
word a a d v v d v n
word2 v v v v v
word3 v v d v n n v d n
which I'd like to convert into:
Code:
word adnv
word2 v
word3 dnv
i.e., in alphabetical order and getting rid of repetitions.
I've been using sed and learning a lot, but I've realized it's not that easy: repeated letters can be separated by other, different letters, so identical letters are not necessarily grouped together. I'm therefore stuck! There are 250,000 words.
Well, if I were doing this in Perl, I would use split() to separate the string by spaces, perhaps after using a regular expression to reduce any doubled spaces to singles.
Then, I would shift the zeroth entry (the word) from the resulting list, and I would initialize an empty hashref.
Next, I would shift the remaining entries one at a time until I ran out of entries (undef), and for each string assign a hash entry with a value of 1, thereby eliminating dupes. Once that's done, join(sort keys) creates the string.
I can name that tune:
Code:
#!/usr/bin/env perl
use strict;
use warnings;
my $wd = "word a a d v v d v n";
$wd =~ s/\s+/ /g;             # COMPRESS MULTIPLE BLANKS
my @list = split(" ", $wd);   # SPLIT BY BLANKS
my $w = shift @list;          # GLOM FIRST WORD
my $hash = {};                # SET UP HASH TO RECEIVE KEYS
while (my $w2 = shift @list)
{
    $$hash{$w2} = 1;          # VALUE DOESN'T MATTER. WE ONLY WANT THE KEYS.
}
print $w . " " . join("", sort keys %$hash) . "\n";
Expanding this to read from a file and to print output to another file (or STDOUT) is a trivial exercise for the reader.
bash-4.1# temp='word a a d v v d v n' ; echo ${temp%% *}' ' | tr -d '\n' ; echo $temp | grep -o [[^:alpha:]]*.* | \
grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
word adnv
temp='word a a d v v d v n'   save the string in temp
echo ${temp%% *}              write the first word of the string (word)
' '                           append a space after the first word
| tr -d '\n'                  delete the trailing newline
echo $temp |                  feed the string to grep
grep -o [[^:alpha:]]*.* | \   option -o prints only the part of the string that matches
[[^:alpha:]]*                 meant to discard the first word
.* |                          feeds the letters to the next grep
grep -o [[:alpha:]] |         puts a linefeed between the letters for sort
sort |                        sorts the letters
uniq |                        removes duplicate letters
tr -d '\n'                    deletes the newlines (\n)
;echo                         adds one final newline
I've chosen Whizje's solution because it was very short. I eventually got it working after encountering some bugs with [[:alpha:]], which was hit and miss. I changed it to [a-zA-Z], which should be equivalent, but for some reason I got better results.
Thanks Whizje, for paving a small part of my rough road... which I hope leads to the stars!
I thought I found the solution, but have been going nuts trying to figure this out for hours. Sometimes it works and sometimes it doesn't, and I can't figure out why!
Code:
$ unset temp
$ temp="hdflkjkj l i j i e f e"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
defhijkl
#this didn't work, took into account "hdflkjkj"
$ temp="word l i f j e i f"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
efijl
#worked! didn't take into account "word"
$ temp="word l i f j e i d"
$ echo $temp | grep -o [[^:alpha:]]*.* | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
defijl
#worked! despite the common letter in both "word" and "i f j e i d"!
There are other tools beside sed and perl for solving this type of problem. Here's a gawk solution:
Code:
cat prep.dat # Test data
word a a d v v d v n
word2 v v v v v
word3 v v d v n n v d n
$ cat prep.gawk # The program (If you copy it, verify the quotes: they are sometimes unicode quotes.)
#!/usr/bin/gawk -f
{
    word = $1
    for (i = 2; i <= NF; ++i) {
        ++class[$i]
    }
    printf("%s", word)
    n = asorti(class, sorted)
    for (i = 1; i <= n; ++i) {
        printf(" %s", sorted[i])
    }
    printf("\n")
    delete class
    delete sorted
}
$ chmod +x prep.gawk # Make the program executable
$ ./prep.gawk prep.dat # Run it using the test data as input
word a d n v
word2 v
word3 d n v
<edit>
I don't know if you'd need it, but printing class[sorted[i]] with sorted[i] would give you the number of times the word was assigned to each class.
Also, if you indexed your table using (word, class) as a unique index and selected with ORDER BY, the sorting would be internal to the database, and the duplicates would be removed automatically when you load the data into the table. So preparing the data this way is probably unnecessary.
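For instance, that load could be sketched like this; the table name entries, SQLite's INSERT OR IGNORE syntax, and the file names are all assumptions:

```shell
# Emit one INSERT per word/class pair; a UNIQUE(word, class) index plus
# INSERT OR IGNORE (SQLite syntax) then drops the duplicates on load.
awk -v q="'" '{
    for (i = 2; i <= NF; ++i)
        printf("INSERT OR IGNORE INTO entries (word, class) VALUES (%s%s%s, %s%s%s);\n",
               q, $1, q, q, $i, q)
}' prep.dat > prep.sql
```

Other databases spell the same idea differently (e.g. ON CONFLICT DO NOTHING), so adjust to taste.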
</edit>
Last edited by PTrenholme; 03-08-2012 at 04:22 PM.
Pff, fixed: I replaced the faulty word-removal grep with a bash string operation.
Code:
bash-4.1$ temp='hdflkjkj l i j i e f e' ; echo ${temp%% *}' ' | tr -d '\n' ;echo ${temp#* } | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
hdflkjkj efijl
bash-4.1$ temp='word a a d v v d v n' ; echo ${temp%% *}' ' | tr -d '\n' ;echo ${temp#* } | grep -o [[:alpha:]] | sort | uniq | tr -d '\n';echo
word adnv
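To run the corrected pipeline over the whole dictionary, a plain read loop would do (the file names are assumptions; sort -u is shorthand for sort | uniq):

```shell
# Apply the fixed pipeline line by line.
while IFS= read -r temp; do
    printf '%s ' "${temp%% *}"   # the word itself, plus a space
    echo "${temp#* }" | grep -o '[[:alpha:]]' | sort -u | tr -d '\n'
    echo                         # final newline
done < dictionary.txt > dictionary.out
```

Note this forks several processes per line, so for 250,000 words the gawk approach above will be much faster.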