LinuxQuestions.org
Old 06-04-2015, 08:03 PM   #1
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Rep: Reputation: 45
grep search returns non existent words


Ladies & Gents,

Thanks again for the great insight and guidance the users of this site provide.

I am trying to search through 970 htm files for Old / Middle English words so that they can be updated to Modern English: answereth vs. answers, changest vs. change, and so on. At this point I have some 800 words, but grep is returning words that do not exist in the data files.

The problem is that the grep command I am using is returning nonexistent words out of the files. Said command is
Code:
:~$ find ~/path/to/*.htm -exec grep -oh "\w*est" * {} \+ > ~/work/path/wordList.txt
Then I get rid of the duplicate lines with
Code:
:~$ awk 'a !~ $0; {a=$0}' ~/work/path/wordList.txt > ~/work/path/TrimedWordList.txt
Here is a snippet of the results with the nonexistent words in it.
Code:
bbadest
bbringest
bcamest
bcriest
bdeliverest
bdest
bdoest
beatest
becamest
begettest
beholdest
belongest
best
bgavest
bgoest
bknowest
blackest
blessest
bliest
bliftest
blovest
bmayest
boastest
borest
boverthrowest
bplantest
breakest
bringest
broughtest
bsavest
bseest
bsellest
bsendest
bshouldest
btillest
buildest
buyest
bwouldest
byest
candlest
If I do a search for the first word with
Code:
:~$ grep -rnw '/path/to/files/' -e "bbadest"
:~$
it returns nothing. Notice that the word seems to have an extra b prepended to it, but if I search for the word without the b I still get nothing. It is the same for the others that look odd.

This only seems to happen to words that start with a b.

So I have concluded that there is something not quite right about the initial search string somehow. Any clues?

Thanks
 
Old 06-04-2015, 08:26 PM   #2
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 491
You are not using the regex correctly. The '*' matches any number of the previous character. If the previous character is 'b' it will match any number of 'b's. What you probably want is '.*', which will match any number of any character. You may also find '\<' and '\>' useful for matching the beginning and end of a word.
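A minimal sketch of the two points above, assuming GNU grep. The sample text and the helper names (b_then_est, whole_est_words) are made up for illustration and are not from the thread's data files. Note that in "\w*est" the atom in front of the '*' is "\w", so there the '*' repeats "any word character" rather than one literal letter.

```shell
# Hypothetical sample text, not taken from the 970 htm files.
sample='thou knowest the best nest'

# '*' repeats the atom right before it: zero or more b's, then "est".
b_then_est() { grep -o 'b*est'; }

# '\<' and '\>' (GNU grep) anchor at word boundaries: whole words ending in "est".
whole_est_words() { grep -o '\<\w*est\>'; }

printf '%s\n' "$sample" | b_then_est        # prints: est, best, est (one per line)
printf '%s\n' "$sample" | whole_est_words   # prints: knowest, best, nest (one per line)
```

The first pattern shows why a bare "est" fragment can come back from the middle of a word; the anchored version only reports complete words.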
 
Old 06-04-2015, 08:46 PM   #3
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Original Poster
Rep: Reputation: 45
Thanks metaschima

I added a . to the command in what I thought was the right place but I am still getting the same results.

Code:
$ find ~/path/to/*.htm -exec egrep -oh "\w.*est" * {} \+ > ~/path/to/work/wordList.txt
 
Old 06-04-2015, 09:13 PM   #4
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
Quote:
Originally Posted by rbees
Ladies & Gents,
get rid of the duplicate lines with
Why not uniq?
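A quick sketch of that suggestion, with a made-up word list. uniq only drops duplicates that sit on adjacent lines, so the list is sorted first; the helper name dedupe is just for this example.

```shell
# Sort first so identical words end up next to each other, then drop repeats.
# `sort -u` does the same thing in one step.
dedupe() { sort | uniq; }

# Hypothetical input with non-adjacent duplicates.
printf '%s\n' best knowest best nest knowest | dedupe
# prints: best, knowest, nest (one per line)
```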
 
Old 06-05-2015, 01:14 PM   #5
fatmac
Senior Member
 
Registered: Sep 2011
Location: Upper Hale, Surrey/Hants Border, UK
Distribution: AntiX
Posts: 2,345

Rep: Reputation: Disabled
You may find sed is a better tool for this job; publishers seem to use it to change words in very long texts.
 
Old 06-05-2015, 03:23 PM   #6
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Original Poster
Rep: Reputation: 45
Thanks linosaurusroot

I think I like uniq better than awk for getting rid of the duplicate lines.

Thanks fatmac,

I have a sed command all ready to do the changes; what I am having issues with is getting the database for the sed command built. When grep is returning nonexistent words from the data files, all 970 of them, and one is not sure just how they should be changed, it makes the task more difficult than it needs to be.
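For illustration only, here is one way such a "database" could feed sed, assuming GNU sed and a hand-checked list of old/new pairs. The pairs, the helper name build_sed_script and the temporary file are all hypothetical; this is not the actual command or word list from this project.

```shell
# Hypothetical mapping: old word, a space, then its modern replacement.
pairs='knowest know
bringest bring'

# Turn each pair into a whole-word substitution such as s/\<knowest\>/know/g
build_sed_script() { awk '{ printf "s/\\<%s\\>/%s/g\n", $1, $2 }'; }

script=$(mktemp)
printf '%s\n' "$pairs" | build_sed_script > "$script"

echo 'Thou knowest what thou bringest.' | sed -f "$script"
# prints: Thou know what thou bring.
```

The \< and \> anchors keep a short entry from rewriting the middle of a longer word, which matters when the word list itself may contain fragments.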

Thanks again
 
Old 06-06-2015, 12:46 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,608

Rep: Reputation: 2935
May I ask you to explain the find command you are using? I.e., what do you think each part is doing?
 
Old 06-06-2015, 09:48 PM   #8
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Original Poster
Rep: Reputation: 45
Thanks grail

Code:
:~$ find ~/path/to/*.htm -exec grep -oh "\w*est" * {} \+ > ~/work/path/wordList.txt
find with the path to the source files

-exec tell find to execute the grep command on said files

the -oh switches tell grep to print only the matched (non-empty) parts of a matching line, one on each line, and to skip the path/filename details

"\w*est" tells grep to find words ending in est

* tells grep all said words

No guess on {} or \+ but will try to search it out when I get some time. But I assume that it is some kind of inclusive wrapper around the output.

> redirects the output

I have intentions of omitting the -h to see if I can figure out just what word it is returning the bad data for, but I have not had a chance yet.
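For reference on the two pieces left as a guess above: in find's -exec, {} is replaced by the names of the files that were found, and a closing + tells find to pass as many names as fit to a single run of the command. The backslash in \+ only shields the + from the shell, which does not treat + specially anyway. A small sketch in a throwaway directory, assuming GNU grep; the file names and contents are made up.

```shell
# Throwaway directory so no real files are touched.
demo=$(mktemp -d)
printf 'thou knowest\n' > "$demo/a.htm"
printf 'the best\n'     > "$demo/b.htm"

# -name '*.htm' is quoted so that find, not the shell, matches the names.
# {} stands for the found files; + batches them into one grep invocation.
find "$demo" -name '*.htm' -exec grep -oh '\w*est' {} + | sort
# prints: best, knowest (one per line)
```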

Thanks
 
Old 06-06-2015, 10:12 PM   #9
joec@home
Member
 
Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep: Reputation: 70
Just personal preference: I'm not too partial to using find in scripts like these and would much rather use loops. Not as fancy, but easier to see what you are doing, and easier to see what you might be doing wrong when it breaks. A rough example; you might have to tweak it a little. It is also good practice to sort before uniq, because uniq only removes adjacent duplicate lines; one sort is enough: "sort | uniq"

Build your loop:
DIR=~/path/to; for i in "$DIR"/*.htm; do echo "$i"; done

Add your additional variables and test:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do echo "$i" "$SEARCH"; done

Add the commands to do the work:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do grep -i "\w.*$SEARCH" "$i"; done

Add additional filters and verify your work:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do grep -i "\w.*$SEARCH" "$i"; done | sort | uniq

Output to file
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do grep -i "\w.*$SEARCH" "$i"; done | sort | uniq > path/to/file
 
Old 06-07-2015, 04:54 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,608

Rep: Reputation: 2935
Quote:
* tells grep all said words
Not sure the above is doing what you expect. If we remove find from the equation for the moment, what would you expect from the following:
Code:
grep -oh "\w*est" * ~/path/to/actual_file_name_here.htm
In the above, what would you expect to happen to the asterisk (*) sitting on its own?

I am not trying to be difficult, but I am trying to lead you to the answer so you learn from it.

I would add that this is not necessarily the solution to your problem, but it is introducing unwanted side effects all the same.

Last edited by grail; 06-07-2015 at 04:56 AM.
 
Old 06-07-2015, 07:08 AM   #11
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Original Poster
Rep: Reputation: 45
Thanks grail

Quote:
I am not trying to be difficult, but I am trying to lead you to the answer so you learn from it
I truly appreciate your tutoring.

According to my grep pocket reference (O'Reilly), the * causes grep to match any number of characters. Which in this case I think should match any word that has the est in it more than once.
 
Old 06-07-2015, 10:19 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,608

Rep: Reputation: 2935
As the asterisk is unprotected by any quotes it will be dealt with by the shell prior to grep seeing it. So it will actually be expanded to be all the file and directory names in the current directory.
This in turn means that your regex is not being run simply on the html files found but also on all the files and directories (dirs will cause an error ... I think) in the current directory.

Like I said before, I am not sure that will fix the issue, but try removing it and you should get some different results ... assuming of course you are not running it from the directory where all the html files are anyway.
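A small sketch of that expansion, run in a throwaway directory. The file names and the show_args helper are made up for the demo; target.htm is just a placeholder argument and does not need to exist.

```shell
# Print each argument on its own line so the expansion is visible.
show_args() { printf '%s\n' "$@"; }

# Throwaway directory holding two hypothetical files.
demo=$(mktemp -d)
touch "$demo/notes.txt" "$demo/wordList.txt"

# Unquoted *: the shell swaps in the file names before the command starts.
unquoted=$(cd "$demo" && show_args grep -oh '\w*est' * target.htm)

# Quoted '*': the command receives a literal asterisk.
quoted=$(cd "$demo" && show_args grep -oh '\w*est' '*' target.htm)

printf 'unquoted:\n%s\n\nquoted:\n%s\n' "$unquoted" "$quoted"
```

In the unquoted case notes.txt and wordList.txt show up as extra arguments, which is how files from the current directory can end up being searched along with the htm files.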
 
Old 06-07-2015, 12:28 PM   #13
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Original Poster
Rep: Reputation: 45
Well, it seems that the most recent round of updates has broken grep.

Per the man page:
Quote:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each
such part on a separate output line.
But when I run the command above I am getting the whole line anyway. I have played with the switches, both long and short versions, and I get the same output.

Code:
:~/bin/shabbat/data/JPS$ grep -o "\w.*est" et0103.htm
B>3</B> but of the fruit of the tree which is in the middle of the garden, God has said: You will not eat of it, neither will you touch it, lest
P><B>22</B> And HaShem God said: 'Behold, the man is become as one of us, to know good and evil; and now, lest
B><A NAME="Mail">Got a quest

Last edited by rbees; 06-07-2015 at 12:47 PM. Reason: more info
 
Old 06-07-2015, 12:34 PM   #14
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 891

Original Poster
Rep: Reputation: 45
I do think that I may know why I am getting the results I am. I think the bad words are beginning-of-line words that are continued from the previous line and so are missing some of their letters. Still, I question why they only seem to exist when starting with a "b".
 
Old 06-07-2015, 01:36 PM   #15
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: CentOS
Posts: 3,938

Rep: Reputation: 1717
Quote:
Originally Posted by rbees
Well, it seems that the most recent round of updates has broken grep.

Per the man page [...] But when I run the command above I am getting the whole line anyway. I have played with the switches, both long and short versions, and I get the same output.

Code:
:~/bin/shabbat/data/JPS$ grep -o "\w.*est" et0103.htm
B>3</B> but of the fruit of the tree which is in the middle of the garden, God has said: You will not eat of it, neither will you touch it, lest
P><B>22</B> And HaShem God said: 'Behold, the man is become as one of us, to know good and evil; and now, lest
B><A NAME="Mail">Got a quest
That ".*" is not doing what you want. As written, the match will begin with the first word character ("\w" is a synonym for "[_[:alnum:]]", i.e. letters, digits, and underscore) on the line (probably near the beginning of the line), then include any number of any character, and end with the last instance of the string "est". Try this:
Code:
"\w\+est\w*"
That's one or more word characters, followed by the string "est", and then however many word characters immediately follow.
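A short comparison on a made-up line, assuming GNU grep. The sample text is hypothetical; the third pattern is an extra variation, not something posted in this thread, for the case where only words that end in "est" are wanted.

```shell
# Hypothetical sample line with markup, similar in shape to the htm data.
line='<B>3</B> Got a question? thou knowest best'

# Greedy ".*": starts at the first word character and runs to the last "est".
printf '%s\n' "$line" | grep -o '\w.*est'
# prints: B>3</B> Got a question? thou knowest best

# Suggested pattern: word characters, "est", then any trailing word characters.
printf '%s\n' "$line" | grep -o '\w\+est\w*'
# prints: question, knowest, best (one per line)

# Variation (assumption): require a word boundary right after "est".
printf '%s\n' "$line" | grep -o '\w\+est\>'
# prints: knowest, best (one per line)
```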
 
  

