LinuxQuestions.org
Old 10-05-2011, 12:17 PM   #1
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Rep: Reputation: Disabled
Bash - Spell Checker


Hey guys, this is a homework assignment for my systems programming class and I'm having a lot of difficulty planning how I'm going to do any of the four shell programs that are asked of me.

I started with a spell checker, which is supposed to take an input file of text, compare it to /usr/share/dict/words, and output the file (ideally) as the original input file with misspelled words underlined.

Here is a link to the actual homework assignment, if anybody has questions:

http://www.cse.psu.edu/~christma/cgi-bin/311/hw2.html

I'm even just having trouble parsing the input and then comparing it to the words file.

What I have right now is.. (I hope code tags are code and /code)

Code:
#!/bin/bash

touch tempText.txt

cat $1

sed 's/[\.,?]//g' $1 | tr " " "\n" | sed 's/ //' | diff - /usr/share/dict/words | grep '<' | sed 's/< //' | grep '.' >> tempText.txt

cat tempText.txt

rm tempText.txt
It creates a temporary text file, then does the following (I would hope)
-removes punctuation marks ., and ?
-changes spaces into new lines, splitting each word in the input to its
own line,
-removes any leftover spaces (might not be necessary)
-compares the parsed input line by line with the words list
-grabs all lines of < (such that a word appeared in the input and not in
the dict list
-removes the "< " from the line
-grabs any non-empty lines and throws them into the text file.
-Then I display the text file, however most of my words are "misspelled"

Here is my sample input/output.

Input: "Happy almonds never really tell you, they just show you."
Output: Happy
you
just
show

Any idea why this is happening?
 
Old 10-05-2011, 12:49 PM   #2
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
You can run it with "bash -x " which will provide extra debugging information.

I would have approached it slightly differently:
Code:
-Remove punctuation marks
-Change spaces into new lines, splitting each word in the input to its
own line
-Remove any leftover spaces

-Sort the input words (using 'sort')
-Remove duplicates (using 'uniq')
-Compare with dictionary file

-Grab all lines of < (such that a word appeared in the input and not in
the dict list)
-Remove the "< " from the line
-Grab any non-empty lines and throw them into the text file.
I don't know if the fact the words are in a different order in your input file might be messing it up. I need to go for food now (so I haven't been able to work through it properly) but I'll be able to have a better look later on

EDIT: Right, had a proper look at it. I don't have your dictionary, but from the playing I've done with a sample dictionary, the approach I suggested works (ie. yours, but with the extra steps of sorting and removing duplicates from the input file) but you also need to add the extra step of converting all the upper case letters into lower case letters. Of course, this stops your spellchecker correcting "london" into "London"; if you need this functionality, then you're going to need to have some extra code to detect the start of sentences. So - your code works as posted, but you need to add
Code:
... | tr '[A-Z]' '[a-z]' | sort | uniq ...
at an appropriate place. Finally, you've got an extra "cat $1" line in there which prints out the original text file: this isn't in your sample output, so you might want to remove it
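Putting those steps together, a minimal sketch might look like the following. The function name list_misspelled and its optional dictionary argument are made up for illustration. It also assumes the dictionary has to be lowercased and sorted the same way as the input, so that diff compares like with like.

```shell
#!/bin/bash
# list_misspelled TEXT [DICT]  (hypothetical helper, not part of the assignment)
# Prints each distinct word of TEXT that does not appear in DICT.
list_misspelled() {
    local text=$1 dict=${2:-/usr/share/dict/words}
    local sorted_dict
    sorted_dict=$(mktemp) || return 1

    # Normalise the dictionary exactly like the input: lower case, sorted, unique.
    tr 'A-Z' 'a-z' < "$dict" | LC_ALL=C sort -u > "$sorted_dict"

    tr -d '.,?' < "$text" |          # strip the punctuation handled in the thread
        tr ' ' '\n' |                # one word per line
        tr 'A-Z' 'a-z' |             # ignore case
        grep . |                     # drop empty lines
        LC_ALL=C sort -u |           # sort and remove duplicates
        diff - "$sorted_dict" |      # "<" lines exist only in the input
        sed -n 's/^< //p'            # keep just the word

    rm -f "$sorted_dict"
}
```

Called as list_misspelled input.txt, it should print one suspect word per line. On a very large dictionary diff may not always produce the smallest possible set of differences, so treat the result as a first pass.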

Hope this helps,

Last edited by Snark1994; 10-05-2011 at 01:56 PM.
 
Old 10-05-2011, 04:08 PM   #3
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Original Poster
Rep: Reputation: Disabled
Cool

Cool man yeah I asked my professor and he said that the error I was probably getting was either from unsorted input or multiple tokens, so the sort | uniq pipe fixed up that problem.

Now I'm thinking about how I'm going to run through the initial input, locate the misspelled words, and then underline them.

I know that basic txt files do not support underlining, so I was curious as to how he expected that to happen (or if there was a linux utility that would do that for me either using print or echo).

Anyway, I was planning on just looping through the misspelled list and grep-ing the input for each word, or possibly stream editing the input for every word in the misspelled list and replacing it with an underlined version of that word. However, not too sure how to do that.. is there a way to write to and display an rtf file or something else that supports underlining?

EDIT:

I'm actually thinking about doing something like this for the rest of the program.

Code:
BU=`tput smul`
EU=`tput rmul`

for word in 'cat misspelled.txt'; do
	echo | sed "s/$word/${BU}$word${EU}/" $1
done
However, I don't know if I'm using the for variables or the sed command correctly.
When I run this, it just echoes the basic input without any underlining.
However, when I replace "s/$word/${BU}$word${EU}/" with "s/a/${BU}a${EU}/", it replaces the first letter a in every line with an underlined a.

I know this is probably the route to go, I just can't really figure out how to do it

Last edited by ericvolpone; 10-05-2011 at 04:36 PM.
 
Old 10-05-2011, 04:51 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,239

Rep: Reputation: 1405
Instead of diffing against the wordlist, check out grep's -f option (also -F for efficiency).
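As a rough sketch of that suggestion (the function name and the dictionary argument are illustrative assumptions): -f FILE reads one pattern per line from FILE, -F treats the patterns as fixed strings rather than regular expressions, -x only accepts whole-line matches, and -v inverts the result. What gets printed is every input word that matches no dictionary entry, and unlike diff this does not need sorted input.

```shell
#!/bin/bash
# not_in_dict DICT  (hypothetical helper)
# Reads one word per line on stdin and prints those missing from DICT.
not_in_dict() {
    grep -v -x -F -f "$1"
}
```

Something like tr ' ' '\n' < input.txt | not_in_dict /usr/share/dict/words would then list the suspect words; add -i if case should be ignored.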

Quote:
I know that basic txt files do not support underlining, so I was curious as to how he expected that to happen (or if there was a linux utility that would do that for me either using print or echo).
According to the assignment spec you linked to, you should do underlining by outputting "-"s in the following line:
Code:
EXTRA CREDIT: Instead of merely listing the words that are in error, underline where they appear in the text. E.g.

2:forth on this continent a new nation, concieved in liberty,
                                        ---------            
5:but that the government of the poeple, by the poeple, and for the poeple
                                 ------         ------              ------
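One possible way to produce that layout is sketched below. It is only an illustration under some assumptions: the helper name is invented, the misspelled words are already available in a file with one lowercase word per line, and a "word" is simply a run of ASCII letters (so "don't" is treated as two words).

```shell
#!/bin/bash
# underline_misspelled TEXT BADWORDS  (hypothetical helper)
# BADWORDS holds one lowercase misspelled word per line.
underline_misspelled() {
    awk '
        # Build a string of n copies of ch.
        function rep(ch, n,    s) { s = ""; while (n-- > 0) s = s ch; return s }

        # First file: remember every misspelled word.
        FILENAME == ARGV[1] { bad[$0] = 1; next }

        # Second file: walk each line word by word.
        {
            rest = $0; under = ""; found = 0
            while (match(rest, /[A-Za-z]+/)) {
                word = substr(rest, RSTART, RLENGTH)
                if (tolower(word) in bad) { mark = "-"; found = 1 } else mark = " "
                under = under rep(" ", RSTART - 1) rep(mark, RLENGTH)
                rest = substr(rest, RSTART + RLENGTH)
            }
            if (found) {
                sub(/ +$/, "", under)          # drop trailing padding
                prefix = FNR ":"
                print prefix $0
                print rep(" ", length(prefix)) under
            }
        }
    ' "$2" "$1"
}
```

Only lines containing at least one listed word are printed, each followed by a line of dashes that is shifted right by the width of the "N:" prefix.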
 
Old 10-05-2011, 05:05 PM   #5
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Original Poster
Rep: Reputation: Disabled
Since I have diff's parsing working at the moment, I think I'm going to just stick with that for the time being until I finish the rest of this up.

An explanation of how I could use grep -f as opposed to diff would be appreciated though!

Quote:
Originally Posted by ntubski View Post
According to the assignment spec you linked to, you should do underlining by outputting "-"s in the following line:
While this is one way of doing it, I can't help but think that the way I was approaching the problem before is a more logical method (and if possible, much easier as well).

Like I said, I'm very new to shell programming and don't really understand the utilities as well as I could.

I don't even think I understand how the for loop works...

When I run this loop:

Code:
for word in 'cat misspelled.txt'; do
	echo $word
done
I'd expect it to output every misspelled word, but instead it just outputs "cat misspelled.txt", which is the "list" in my for loop.

Sigh
 
Old 10-05-2011, 05:30 PM   #6
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,239

Rep: Reputation: 1405
Code:
for word in `cat misspelled.txt`; do
	echo $word
done
You used the wrong quotes. That's one of the reasons why it's considered a better idea to use $() instead of ``.
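To make the difference concrete, here is a small sketch (the function names are made up). Single quotes give the loop one literal string, while $(...) runs the command and splits its output into words. A while read loop is a common alternative that takes one line at a time and avoids word splitting and glob expansion of the list.

```shell
#!/bin/bash
# Both helpers print every word listed in the file given as $1.

# Command substitution: the shell runs cat and splits its output into words.
print_words_for() {
    local word
    for word in $(cat "$1"); do
        printf '%s\n' "$word"
    done
}

# Reading line by line: no word splitting, no glob expansion.
print_words_read() {
    local word
    while IFS= read -r word; do
        printf '%s\n' "$word"
    done < "$1"
}
```

For a plain list such as misspelled.txt both print the same thing; the second form is the safer habit once lines can contain spaces or characters like *.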
 
Old 10-06-2011, 05:00 PM   #7
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Original Poster
Rep: Reputation: Disabled
Haha awesome, yeah I figured that out after a bit thanks

Finally, one last bit that I haven't gotten to work would be the option to import multiple word lists to add to the dictionary.

What I have at the moment is...

Code:
cat /usr/share/dict/words > wordsList.txt

# Accept extra word files and put them into the word list
while getopts "f:" wordfiles; do
	cat $OPTARG >> wordsList.txt
	shift $(($OPTIND - 1))
done

# Create a second word list to use for the sorted word list.
# Had to do this because, whenever I tried to do the command
# sort wordsList.txt > wordsList.txt, wordsList.txt became empty and every
# word was misspelled.  Creating an extra wordsList seemed to fix it.
sort wordsList.txt | uniq > wordsList2.txt
However, this only allows me to run this program like this:

bash spellchk.sh -f wordsList1.txt input.txt

However, when I try to use multiple -f tags or do something like :

bash spellchk.sh { -f wordsList1.txt wordsList2.txt } input.txt

It doesn't accept the multiple word files.

any clue how to fix this?
 
Old 10-07-2011, 08:30 AM   #8
Snark1994
Senior Member
 
Registered: Sep 2010
Location: Wales, UK
Distribution: Arch
Posts: 1,632
Blog Entries: 3

Rep: Reputation: 346
I think you just want

Code:
while getopts "f:" wordfiles; do
	cat "$OPTARG" >> wordsList.txt
done
and then run the script with

Code:
bash spellchk.bash -f wordsList1.txt -f wordsList2.txt input.txt
I'm not sure what the best way to access the non-option argument is (ie. input.txt in that example). If you can assume that it's the last argument passed then you can add

Code:
shift $((OPTIND - 1))
echo "$1"
after the loop (you had the 'shift' line inside the loop) but this won't catch things like:

Code:
bash spellchk.bash input.txt -f wordsList1.txt -f wordsList2.txt
Hope this helps,
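For reference, a sketch of the whole option loop with the shift placed after it. The function name and the printed labels are invented for the example. getopts stops at the first argument that is not an option, which is why putting input.txt first is not handled.

```shell
#!/bin/bash
# parse_args [-f wordlist]... input  (hypothetical helper)
# Prints each extra word list it was given, then the input file.
parse_args() {
    local opt
    OPTIND=1                          # restart option parsing on every call
    while getopts "f:" opt; do
        case $opt in
            f) printf 'wordlist: %s\n' "$OPTARG" ;;
            *) echo "usage: spellchk.sh [-f wordlist]... input.txt" >&2
               return 1 ;;
        esac
    done
    shift $((OPTIND - 1))             # drop the options that were parsed
    printf 'input: %s\n' "$1"         # first remaining argument
}
```

In a real script the f) branch would append "$OPTARG" to the word list instead of printing it.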
 
Old 10-07-2011, 10:57 AM   #9
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Original Poster
Rep: Reputation: Disabled
Awesome guys, thanks for all the help - this forum is teaching me tons haha. Being new to a language sucks but this kind of programming is pretty interesting.

I got the last two programs of the assignment working the way they are expected, but am having difficulty thinking of how to tackle the first two.

Keep in mind that this is for homework, so I'm not expecting any answers given to me. Just thoughts on how to approach the problem - I like doing things on my own for the most part and only like help when I'm stuck with something.

For the first program, I'm supposed to be able to call the bash file:

Code:
bash deconstruct.sh input.c
This will take a c file that compiles in c99 and convert it into c89 structure.

The thing that I will be worried about changing is structure initialization.

For instance, if I have the structs:

Code:
struct node {
   int val;
};

typedef struct {
   int numer;
   int denom;
} Rational;
And the C file initializes them using compound literals (what I was calling enumerations):

Code:
struct node head;
Rational half, *newf = malloc(sizeof(Rational));

head = (struct node){ 5, NULL };
half = (Rational){ 1, 2 };
*newf = (Rational){ 2, 3 };
Since that isn't accepted in C89 (I guess), I'm supposed to write a bash script that finds these compound-literal assignments and converts them into code that C89 will compile, by creating a constructor function for each struct and calling it, like this:

Code:
void init_node( struct node *nd, int v)
{
   nd->val = v;
}

void init_Rational( Rational *ret, int numer, int denom ) 
{
   ret->numer = numer;
   ret->denom = denom;
}

init_node( &head,  5 );
init_Rational( &half,  1, 2  );
init_Rational( newf,  2, 3  );
This seems pretty difficult to me.

There's a few hints on the assignment, saying to traverse the input once and get all structure definitions on one line, then making 2 extra copies of them.

On the second traversal, the first copy restores the structure declaration; the second becomes the function heading; and the third becomes the function body.

I kind of understand how he wants me to do this, but I'm having trouble getting started.

I think that it'd be best to maybe grep the line numbers that have "struct xxxxx" and "typedef struct" in them, and use sed's holdspace to accumulate the whole declaration (until I hit a }). I don't have to worry about nested structs, unions or enumerations, so curly brackets aren't going to be too difficult to handle.

Anyway, I have this much started

Code:
cat $txtfile | grep -n 'struct [A-z]*'
where txtfile is my input file.

I'm just trying to search the file for struct name.

Also, is there a way using regular expressions to account for any amount of spaces? I tried [ ]*, but don't think it worked the way I wanted it to.

thanks a lot for any help! and sorry for the huge post.
 
Old 10-07-2011, 11:39 AM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,239

Rep: Reputation: 1405
Code:
# Create a second word list to use for the sorted word list.
# Had to do this because, whenever I tried to do the command
# sort wordsList.txt > wordsList.txt, wordsList.txt became empty and every
# word was misspelled.  Creating an extra wordsList seemed to fix it.
sort wordsList.txt | uniq > wordsList2.txt
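Yeah, redirecting a command's output into the file it is reading doesn't work: the shell truncates wordsList.txt when it sets up the > redirection, before sort has read anything. More generally, you should never have 2 programs writing and reading from the same file concurrently. You can actually make the wordlist without using an extra temporary though: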

Code:
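{
    cat /usr/share/dict/words
    while getopts "f:" wordfiles; do
        cat "$OPTARG"
    done
    # Note: this group is part of a pipeline, so it runs in a subshell.
    # OPTIND is still 1 afterwards; a later shift $((OPTIND - 1)) skips nothing.
} | sort | uniq > wordList.txt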
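If an in-place sort is wanted anyway, sort's -o option is specified to allow the output file to be one of the input files, and -u folds in the uniq step. A minimal sketch, with a made-up helper name:

```shell
#!/bin/bash
# sort_in_place FILE  (hypothetical helper)
# Sorts FILE and removes duplicate lines, writing the result back to FILE.
sort_in_place() {
    sort -u -o "$1" "$1"
}
```

This avoids the empty-file problem because sort itself opens the output only after it has read all of its input.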
Quote:
Originally Posted by ericvolpone View Post
However, when I try to use multiple -f tags or do something like :

bash spellchk.sh { -f wordsList1.txt wordsList2.txt } input.txt
Just out of curiosity, where did you come up with that syntax for multiple -f options?


Code:
cat $txtfile | grep -n 'struct [A-z]*'
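[A-z] is not what you want: in ASCII it also matches the punctuation characters that sit between Z and a (such as [, ], ^ and _), and outside the C locale the meaning of a range is unspecified. You should use [A-Za-z], or rather [_A-Za-z][_A-Za-z0-9]* since a C identifier can have digits and underscores. Also that's a Useless Use of Cat: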

Code:
grep -n 'struct [_A-Za-z][_A-Za-z0-9]*' < "$txtfile"
grep -n 'struct [_A-Za-z][_A-Za-z0-9]*' "$txtfile"

Quote:
Also, is there a way using regular expressions to account for any amount of spaces? I tried [ ]*, but don't think it worked the way I wanted it to.
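[ ]* would be 0 or more spaces. You may mean '[ ]\+', which is 1 or more spaces (that is GNU syntax; '[ ]\{1,\}' is the portable spelling). You probably also want to match tabs and such: [[:blank:]] matches a space or a tab, and [[:space:]] matches any whitespace. Note that in grep a \t inside a bracket expression is not a tab; [ \t] there matches a space, a backslash or the letter t.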
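Combining the identifier pattern with [[:space:]], a sketch for locating the lines that open a struct definition could look like this. The helper name is invented, and it assumes the opening brace is on the same line as the struct keyword.

```shell
#!/bin/bash
# find_struct_defs FILE  (hypothetical helper)
# Prints "line:text" for lines that open a struct definition, i.e.
# "struct name {" or "typedef struct {", with any mix of spaces and tabs.
find_struct_defs() {
    grep -n -E 'struct[[:space:]]+([_A-Za-z][_A-Za-z0-9]*[[:space:]]*)?[{]' "$1"
}
```

Plain declarations such as "struct node head;" are skipped because no brace follows the name.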
 
Old 10-09-2011, 03:32 PM   #11
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Original Poster
Rep: Reputation: Disabled
Cool beans guys this has been a great help.

Now I'm just wondering about one thing to finish up the last parts of the project.

What I want to do is search a text file for a struct definition in C, either by the use of struct structname {blah blah } or typedef struct {jioeqjfoipeje} structname;

What I want to do with this is take the whole declaration and put it on one line.

This is the code that I came up with, which I'm not too sure is going to work.

Code:
cat $txtfile | sed -e ':begin' \
-e '/^ *struct *[A-Za-z0-9]*[ {]*/N' \
-e '/}/!b begin'
What this (to my eyes) is supposed to do is display the input file, create a loop that finds a line with a struct definition (in the struct structname {} format), and keep adding the next line to the original, then unless they find a }, repeat.

This will return the structure definition with embedded new lines, in which I can tr "\n" " " to get rid of the newlines and display it all on one line.

However, I'm not sure if I understand the loop in the sed idea too well.

Any ideas what I'm doing wrong?
 
Old 10-10-2011, 11:07 AM   #12
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,239

Rep: Reputation: 1405
Quote:
Originally Posted by ericvolpone View Post
This is the code that I came up with, which I'm not too sure is going to work.
This kind of sounds like you haven't even tried the code yet...
Code:
cat $txtfile | sed -e ':begin' \
-e '/^ *struct *[A-Za-z0-9]*[ {]*/N' \
-e '/}/!b begin'
Remember that struct names can contain underscores, and I think you meant [ ]*{ instead of [ {]*. Also, that's another Useless Use of Cat.

Quote:
What this (to my eyes) is supposed to do is display the input file, create a loop that finds a line with a struct definition (in the struct structname {} format), and keep adding the next line to the original, then unless they find a }, repeat.
What this actually means is:
  1. if the pattern space matches the struct pattern, append the next input line to it (the N command). Go to step 2.
  2. if the pattern space doesn't contain a }, go back to step 1
So if you have a line that is neither a struct definition nor contains a }, no new input is ever read and it's an infinite loop.

Quote:
However, I'm not sure if I understand the loop in the sed idea too well.
It can be somewhat confusing, you're probably better off with awk. From the GNU sed info manual:
Quote:
In most cases, use of these [branch/label] commands indicates that you are probably better off programming in something like awk or Perl. But occasionally one is committed to sticking with sed, and these commands can enable one to write quite convoluted scripts.
 
Old 10-10-2011, 03:42 PM   #13
ericvolpone
LQ Newbie
 
Registered: Oct 2011
Posts: 7

Original Poster
Rep: Reputation: Disabled
Quote:
It can be somewhat confusing, you're probably better off with awk. From the GNU sed info manual:
Unfortunately, I'm required to code this project in Bourne, Bourne Again, or some similar shell.

My professor said that this task can be done with the use of sed's holdspace, however. Perhaps I could sed the initial line, add it to the holdspace, continue to the next line and until it has a }, continue. Then finally output the hold space with tr "\n" " "

However, I'm not sure how this would fix the problem of the infinite loop.

I'm pretty unfamiliar with the awk utility, You think it'd be better at accomplishing this task? How so?
 
Old 10-10-2011, 04:15 PM   #14
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,239

Rep: Reputation: 1405
Quote:
Originally Posted by ericvolpone View Post
However, I'm not sure how this would fix the problem of the infinite loop.
To fix the infinite loop you should add a rule to do something on lines between the { and }.
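One way to read that advice, as a sketch rather than the intended solution: only enter the loop on a line that opens a struct, and inside the loop always append the next line until a } shows up. The helper name is made up, and it assumes the opening brace sits on the same line as the struct keyword and that structs are not nested.

```shell
#!/bin/bash
# join_structs FILE  (hypothetical helper)
# Copies FILE to stdout with each struct definition folded onto one line.
join_structs() {
    # 1: lines that do not open a struct are printed untouched
    # 2-5: otherwise keep appending the next line (N) until a } is present
    # 6-7: then turn each newline plus indentation into a single space
    sed -e '/struct[^;]*{/!b' \
        -e ':join' \
        -e '/}/b done' \
        -e 'N' \
        -e 'b join' \
        -e ':done' \
        -e 's/\n[[:space:]]*/ /g' "$1"
}
```

Because every pass through the loop either finds the } or consumes another input line, it cannot spin forever on a single line.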

Quote:
I'm pretty unfamiliar with the awk utility, You think it'd be better at accomplishing this task? How so?
sed essentially has only 2 variables: the hold space and the pattern space. For control flow the only option is if-then-goto. All the commands have single letter names which makes sed programs even more difficult to read.

In awk you can have as many variables as you like, and give them meaningful names. For control flow you can use the standard structured programming constructs (if-else, while, for). The builtin functions have short but still readable names.

I think any sed program much beyond s/old/new would be better done in awk.
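For comparison, a sketch of the same one-line folding in awk, called from a shell function. The names joining and buffer and the function name are illustrative, and it makes the same assumptions as before: brace on the struct line, no nesting.

```shell
#!/bin/bash
# join_structs_awk FILE  (hypothetical helper)
# Copies FILE to stdout with each struct definition folded onto one line.
join_structs_awk() {
    awk '
        # A line that opens a struct definition starts a new buffer.
        !joining && /struct[^;]*[{]/ { joining = 1; buffer = "" }

        joining {
            sub(/^[ \t]+/, "")                       # trim leading indentation
            buffer = (buffer == "") ? $0 : (buffer " " $0)
            if ($0 ~ /[}]/) {                        # closing brace: emit and reset
                print buffer
                joining = 0
            }
            next
        }

        { print }                                    # everything else passes through
    ' "$1"
}
```

The state lives in two named variables instead of the pattern and hold spaces, which is the readability gain being described.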
 
  


All times are GMT -5. The time now is 09:25 PM.
