LinuxQuestions.org
Old 03-09-2012, 12:09 PM   #16
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141

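You can replace sort | uniq with sort -u, and grep -o [[:alpha:]] can be replaced with tr ' ' '\n':
Code:
#!/bin/bash
temp=$*                       # all arguments joined into one string
# first field: the word, followed by one space and no newline
printf '%s ' "${temp%% *}"
# remaining fields: one per line, sorted, duplicates dropped, joined again
echo "${temp#* }" | tr ' ' '\n' | sort -u | tr -d '\n'
echo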
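A quick way to check that the two pipelines give the same result on a list of tags. This is only a sketch; the function names tags_old and tags_new are made up for the comparison and do not come from the script above.

```shell
#!/bin/bash
# tags_old and tags_new are hypothetical helper names for this comparison.
# Each prints the distinct arguments, sorted and joined without separators.
tags_old() { printf '%s\n' "$@" | sort | uniq | tr -d '\n'; echo; }
tags_new() { printf '%s\n' "$@" | sort -u | tr -d '\n'; echo; }

tags_old n v n n v           # prints: nv
tags_new n v n n v           # prints: nv
tags_new a a d v v d v n     # prints: adnv
```

sort -u does the work of sort | uniq in a single process, which adds up when the pipeline runs once per input line.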
 
Old 03-09-2012, 06:24 PM   #17
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Hey guys,

I tried Romagnolo's script. I'm not advanced enough to understand it completely, maybe I'll study it some tomorrow.

I've finally written a script that works, more or less (this was only a small part of the script). However, it takes forever to execute: about 2 hours on my netbook. I think the problem is the while loop:
Code:
: > temp.txt                # start with an empty temp file
while read line; do
    # first field: the word
    word=`echo -e "${line%% *}" | tr -d '\n'`
    # remaining fields: unique letters, sorted and joined
    pos_info=`echo "${line#* }" | grep -o '[[:alpha:]]' | sort | uniq | tr -d '\n'`
    echo -e "$word    $pos_info" >> temp.txt
done < "$2"
mv temp.txt "$2"
Writing into a text file seems to take a long, long time. Is there a better way of doing this? Via a variable? Some sort of pipe or something? Perhaps Romagnolo's script would be faster. I don't know. But I'd have to adapt it to my script.
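A hedged sketch of one likely improvement. The loop above starts several processes for every input line (two command substitutions, plus echo, grep, sort, uniq and tr) and reopens temp.txt on each pass, so on a large file most of the time probably goes to process start-up rather than to the write itself. The version below takes the word with a parameter expansion, uses a single sort -u pipeline, and opens the output file once for the whole loop. The function name dedupe_file and the use of mktemp are assumptions for illustration, and the sketch assumes space-separated tags.

```shell
#!/bin/bash
# dedupe_file is a hypothetical name; it rewrites FILE in place.
dedupe_file() {
    local file=$1 tmp line word tags
    tmp=$(mktemp) || return 1
    while IFS= read -r line; do
        word=${line%% *}        # first field, no subshell needed
        # remaining fields: one per line, sorted, duplicates dropped, rejoined
        tags=$(tr -s ' ' '\n' <<< "${line#* }" | sort -u | tr -d '\n')
        printf '%s    %s\n' "$word" "$tags"
    done < "$file" > "$tmp"     # the output file is opened only once
    mv "$tmp" "$file"
}

# Usage: dedupe_file words.txt
```

This still runs one pipeline per line, so it reduces the cost rather than removing it; doing the whole file in a single awk process, as in the gawk scripts posted in this thread, avoids the per-line processes altogether.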

I'm really enjoying learning how to use all of these command line tools. Many thanks for all of your responses.

sundialsvcs, PTrenholme, sorry, I haven't tried yours out yet. The problem is that I'm not all that advanced yet, but I am going to study your scripts.

Romagnolo, who is "Albert"? Me? Why "Albert"?

rm
 
Old 03-09-2012, 07:02 PM   #18
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Hi all,

Whizje, I have another question. The words are going to be slightly different, and my script doesn't work completely when I try to populate my database.

The format of words will be:
Code:
"word" n v n n v
"multiword expression" n v v n
"another different multiword expression" a d d a
etc.
In other words, the word (or "entry", because there will be multiword expressions) part will be between double quotes and could contain spaces, the part of speech part (v v v n d d a) remains the same.

I can't figure out how to get the entire line except what's in quotes. I've learned "grep -Ev ...", but that would exclude an entire line containing what I don't want.
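One possible way to separate the two parts without grep is bash parameter expansion. This is a minimal sketch, assuming every line starts with a double-quoted entry and that the entry itself contains no double quotes; split_entry is a made-up name.

```shell
#!/bin/bash
# split_entry is a hypothetical helper name for this sketch.
# It prints the quoted entry on one line and the rest of the line on the next.
split_entry() {
    local line=$1 rest entry
    rest=${line#\"*\"}        # drop the shortest leading "..." match
    entry=${line%"$rest"}     # what was dropped: the quoted entry
    rest=${rest# }            # trim the single separating space
    printf '%s\n' "$entry" "$rest"
}

split_entry '"multiword expression" n v v n'
# prints:
# "multiword expression"
# n v v n
```

Unlike grep -Ev, which keeps or drops whole lines, the expansion removes only the matching part of each line.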

I'll get through this script eventually. Meanwhile, I'm learning a lot and enjoying it.

Thanks again.
 
Old 03-10-2012, 06:33 AM   #19
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
A few questions: are the input lines coming from a file or from a variable, and how many input lines are we talking about?
You don't need a temp file; you can use a variable for it:
Code:
str3="$word $pos_info"
Also, are the input lines as long as in the examples you gave?

Last edited by whizje; 03-10-2012 at 07:00 AM.
 
Old 03-10-2012, 12:32 PM   #20
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by rm_-rf_windows
The format of words will be:
Code:
"word" n v n n v
"multiword expression" n v v n
"another different multiword expression" a d d a
etc.
In other words, the word (or "entry", because there will be multiword expressions) part will be between double quotes and could contain spaces, the part of speech part (v v v n d d a) remains the same.
Consider this approach.

Input file:
Code:
"Frank" a a d v v d v n
"Frank Lloyd" v v v v v
"Frank Lloyd Wright" v v d v n n v d n
Code:
Code:
# sed to put quoted string in a temporary file
sed 's/\(.*" \).*/\1/' "$InFile" > "$Work01"

# cut to remove quoted string
# awk to sort horizontally (asort is a gawk extension)
# tr to squeeze out duplicates
# paste to restore quoted string to each line
cut -d'"' -f3- "$InFile" \
|awk '{split($0,a); asort(a); for(i=1;i<=NF;i++){printf("%s",a[i])} print ""}' \
|tr -s '[:alpha:]' \
|paste -d '' "$Work01" - > "$OutFile"
Output file:
Code:
"Frank" adnv
"Frank Lloyd" v
"Frank Lloyd Wright" dnv
Daniel B. Martin
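A small sketch of why the letters are sorted before tr -s is applied: tr -s only collapses repeats that sit next to each other, so the duplicates have to be brought together first. The function name sort_then_squeeze is hypothetical, and the sort here is done with tr and sort rather than with awk, purely to keep the example free of gawk-specific functions.

```shell
#!/bin/bash
# sort_then_squeeze is a hypothetical name; it takes one string of
# space-separated letters and prints the distinct letters in sorted order.
sort_then_squeeze() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | sort | tr -d '\n' | tr -s '[:alpha:]'
    echo
}

sort_then_squeeze 'v v d v n n v d n'          # prints: dnv
printf '%s\n' 'vvdvnnvdn' | tr -s '[:alpha:]'   # unsorted input prints: vdvnvdn
```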

Last edited by danielbmartin; 03-10-2012 at 02:04 PM. Reason: Improve code example
 
Old 03-10-2012, 02:13 PM   #21
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Thank you, Daniel, for your solution. It doesn't work on my end. I've altered the code to take an input file and to run from a script:
Code:
# sed to put quoted string in a temporary file
#Work01=""
sed 's/\(.*" \).*/\1/' $1 > $Work01

# cut to remove quoted string
# awk to sort horizontally
# tr to squeeze out duplicates
# paste to restore quoted string to each line
cut -d'"' -f3- $1 \
|awk '{split($0,a); asort(a); for(i=1;i<=NF;i++){printf("%s",a[i])} print ""}' \
|tr -s '[:alpha:]' \
|paste -d '' $Work01 - > output.txt
This is the error I get:
Quote:
$ ./dannyscode.sh inputFile.txt
./dannyscode.sh: line 3: $Work01: ambiguous redirect
$
I noticed you altered the code once or twice. I tried the other versions you had posted too, but got a similar message.

Thanks,

rm
 
Old 03-10-2012, 03:43 PM   #22
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
Well, a small modification to my above gawk script does it:
Code:
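#!/usr/bin/gawk -f
{
  # The entry starts in the first field; if it opens with a double quote,
  # keep appending fields until one ends with the closing quote.
  # The NF test stops the loop if a closing quote is missing.
  word=$1
  start_class_list=2
  while ((word ~ /^"/) && (word !~ /"$/) && (start_class_list <= NF)) {
      word = word " " $start_class_list
      ++start_class_list
  }
  # Count each class letter; the array indices are the distinct letters.
  for (i=start_class_list; i<=NF; ++i) {
    ++class[$i]
  }
  printf("%s ",word)
  # asorti (gawk only) puts the sorted indices of class into sorted.
  n=asorti(class,sorted)
  for (i=1;i<=n;++i) {
    printf("%s", sorted[i])
  }
  printf("\n")
  # Clear both arrays before the next input line.
  delete class
  delete sorted
}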
Producing this with your quoted examples appended to your first ones:
Code:
$ ./prep.gawk prep.dat
word adnv
word2 v
word3 dnv
"word" nv
"multiword expression" nv
"another different multiword expression" ad
<edit>
At the expense of readability, that can be done as a (long) "one-line" program:
Code:
$ gawk '{w=$1;s=2;while((w~/^"/)&&(w!~/"$/)&&(s<=NF)){w=w" "$s;++s;};for(i=s;i<=NF;++i)++c[$i];printf("%s ",w);n=asorti(c,d);for(i=1;i<=n;++i)printf("%s",d[i]);printf("\n");delete c}' prep.dat
word adnv
word2 v
word3 dnv
"word" nv
"multiword expression" nv
"another different multiword expression" ad
</edit>
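asorti is a gawk extension, so the script above will not run under mawk or other awk implementations. Below is a hedged sketch of the same idea in plain awk, wrapped in a shell function: it collects the distinct tags and orders them with a small insertion sort. The name dedupe_tags and the variable names are made up for this sketch.

```shell
#!/bin/bash
# dedupe_tags is a hypothetical name. It reads lines from standard input and
# prints each entry followed by its distinct tags in sorted order.
dedupe_tags() {
    awk '{
        word = $1; s = 2
        # join fields until the closing double quote is found
        while (word ~ /^"/ && word !~ /"$/ && s <= NF) { word = word " " $s; s++ }
        # collect each tag once
        n = 0
        for (i = s; i <= NF; i++) {
            if (!($i in seen)) { seen[$i] = 1; tag[++n] = $i }
        }
        # insertion sort of tag[1..n]
        for (i = 2; i <= n; i++) {
            t = tag[i]
            for (j = i - 1; j >= 1 && tag[j] > t; j--) tag[j + 1] = tag[j]
            tag[j + 1] = t
        }
        out = ""
        for (i = 1; i <= n; i++) out = out tag[i]
        print word " " out
        # forget the tags before the next line
        for (k in seen) delete seen[k]
    }'
}

# Usage: dedupe_tags < prep.dat
```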

Last edited by PTrenholme; 03-10-2012 at 04:16 PM. Reason: Forgot to clear the array in the one-line example
 
Old 03-10-2012, 04:21 PM   #23
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by rm_-rf_windows
I noticed you altered the code once or twice.
I did, but only to "polish" the code and add explanatory comments. Every posted version was tested and worked.
Quote:
This is the error I get:
$Work01: ambiguous redirect
$Work01 identifies a work file on my machine. In my script, above the code, I have this:
Code:
# File Identifications  
InFile='/home/daniel/Desktop/Voters/dbm258inp.txt'
OutFile='/home/daniel/Desktop/Voters/dbm258out.txt'
Work01='/home/daniel/Desktop/Voters/dbm258w01.txt'
You would "plug in" your own fully-qualified file identifiers for input, output, and work files.
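For background: bash reports "ambiguous redirect" when the word after > expands to nothing, which is what happens with an unset, unquoted $Work01. A minimal sketch of one way to avoid hard-coded paths is to let mktemp pick the work file name. The wrapper name save_entries is hypothetical and not part of the script above.

```shell
#!/bin/bash
# save_entries is a hypothetical wrapper around the sed step.
# It writes the quoted part of each line of the input file to a fresh
# work file and prints that work file's name.
save_entries() {
    local src=$1 work
    work=$(mktemp) || return 1      # unique, non-empty file name
    sed 's/\(.*" \).*/\1/' "$src" > "$work" || return 1
    printf '%s\n' "$work"
}

# Usage inside a script that takes the input file as its first argument:
#   Work01=$(save_entries "$1") || exit 1
```

Quoting the variables in the redirects also helps: an empty name then fails with "No such file or directory", which points more directly at the missing assignment.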

Daniel B. Martin
 
Old 03-11-2012, 11:53 AM   #24
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
A new version for the quoted words:
Code:
#!/bin/bash
filename="$1"
declare -a arr
total=""
str3=""
while IFS= read -r line
 do
    # split the line on " instead of on spaces:
    #   arr[1] now contains the quoted words (without the quotes)
    #   arr[2] contains the letters
    IFS='"' read -r -a arr <<< "$line"

# \"  put quotes in str3; they were lost when the line was split
# ${arr[1]} copy the quoted words to str3
# \" put quotes in str3 so the words are quoted again
# $(.......) command substitution: execute command and use the resulting string
# tr -s ' ' '\n'  replace spaces with newlines so sort can use the input
# |sort -u   sort the letters and delete duplicates
# |tr -d '\n'  delete the newlines
# $'\n' add 1 newline
    str3="\"${arr[1]}\" $(tr -s ' ' '\n' <<< "${arr[2]}"|sort -u|tr -d '\n')"$'\n'
    total="$total$str3"   # add result to total
done < "$filename"
printf '%s' "$total"      # print total
Example:
Code:
bash-4.1$ cat names.txt
"word" n v n n v
"multiword expression" n v v n
"another different multiword expression" a d d a
bash-4.1$ wlsort names.txt
"word" nv
"multiword expression" nv
"another different multiword expression" ad
12000 lines on a Phenom X4 3.4 GHz took:
Code:
real    1m13.496s
user    0m51.104s
sys     0m7.264s

Last edited by whizje; 03-11-2012 at 02:13 PM. Reason: extra info
 
Old 03-11-2012, 02:42 PM   #25
whizje
Member
 
Registered: Sep 2008
Location: The Netherlands
Distribution: Slackware64 current
Posts: 594

Rep: Reputation: 141
The gawk version from PTrenholme is approximately 600 times faster.
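A rough sketch of how such a comparison could be repeated. The generator make_sample and the file name sample.txt are assumptions for illustration; wlsort and prep.gawk are the script names used earlier in the thread, and the timings will differ from machine to machine.

```shell
#!/bin/bash
# make_sample is a hypothetical generator: make_sample N prints N test lines.
make_sample() {
    local i
    for ((i = 0; i < $1; i++)); do
        printf '"entry %d" n v n n v\n' "$i"
    done
}

make_sample 2    # prints two lines: "entry 0" n v n n v, then "entry 1" n v n n v

# Possible use:
#   make_sample 12000 > sample.txt
#   time ./wlsort sample.txt > /dev/null
#   time ./prep.gawk sample.txt > /dev/null
```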

Last edited by whizje; 03-11-2012 at 02:44 PM. Reason: error
 
Old 03-17-2012, 05:22 PM   #26
rm_-rf_windows
Member
 
Registered: Jun 2007
Location: Europe
Distribution: Ubuntu
Posts: 292

Original Poster
Rep: Reputation: 27
Many thanks for all of your answers!

This thread is more than solved!
 
  

