LinuxQuestions.org


rbees 06-04-2015 08:03 PM

grep search returns non existent words
 
Ladies & Gents,

Thanks again for the great insight and guidance the users of this site provide.

I am trying to search through 970 htm files for Old / Middle English words so that they can be updated to modern English: answereth vs. answers, or changest vs. change, and so on. At this point I have some 800 words, but grep is returning words that do not exist in the data files.

The problem is that the grep command I am using is returning nonexistent words out of the files. Said command is
Code:

:~$ find ~/path/to/*.htm -exec grep -oh "\w*est" * {} \+ > ~/work/path/wordList.txt
Then I get rid of the duplicate lines with
Code:

:~$ awk 'a !~ $0; {a=$0}' ~/work/path/wordList.txt > ~/work/path/TrimedWordList.txt
Here is a snippet of the results with the nonexistent words in it.
Code:

bbadest
bbringest
bcamest
bcriest
bdeliverest
bdest
bdoest
beatest
becamest
begettest
beholdest
belongest
best
bgavest
bgoest
bknowest
blackest
blessest
bliest
bliftest
blovest
bmayest
boastest
borest
boverthrowest
bplantest
breakest
bringest
broughtest
bsavest
bseest
bsellest
bsendest
bshouldest
btillest
buildest
buyest
bwouldest
byest
candlest

If I do a search for the first word with
Code:

:~$ grep -rnw '/path/to/files/' -e "bbadest"
:~$

it returns nothing. Notice that the word seems to have an extra b prepended to it, but if I search for the word without the b I still get nothing. It is the same for the others that look odd.

This only seems to happen to words that start with a b.

So I have concluded that there is something not quite right about the initial search string somehow. Any clues?

Thanks

metaschima 06-04-2015 08:26 PM

You are not using the regex correctly. The '*' matches any number of the previous character. If the previous character is 'b' it will match any number of 'b's. What you probably want is '.*', which will match any number of any character. You may also find '\<' and '\>' useful for matching the beginning and end of a word.
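
To make the quantifier and the word anchors concrete, here is a minimal sketch, assuming GNU grep (for `\w`, `\<`, `\>` and `-o`). The sample sentence is made up. It shows that `*` repeats whatever item comes right before it: in `\w*est` that item is the class `\w`, while in `b*est` it is only the letter b.

```shell
# Made-up sample text for illustration.
sample='thou knowest the best question'

# '*' repeats the preceding item; here that item is the class \w.
ends_in_est=$(printf '%s\n' "$sample" | grep -o '\w*est')

# '\<' and '\>' anchor the match to the start and end of a word,
# so the "quest" inside "question" no longer matches.
whole_words=$(printf '%s\n' "$sample" | grep -o '\<\w*est\>')

# Here '*' repeats only the literal letter b.
b_only=$(printf '%s\n' "$sample" | grep -o 'b*est')

printf '%s\n' "$ends_in_est" '--' "$whole_words" '--' "$b_only"
```

The first pattern reports the fragment "quest" taken from "question", which is one way a word list can end up with entries that are not whole words in the text.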

rbees 06-04-2015 08:46 PM

Thanks metaschima

I added a . to the command in what I thought was the right place but I am still getting the same results.

Code:

$ find ~/path/to/*.htm -exec egrep -oh "\w.*est" * {} \+ > ~/path/to/work/wordList.txt

linosaurusroot 06-04-2015 09:13 PM

Quote:

Originally Posted by rbees (Post 5372351)
Ladies & Gents,
get rid of the duplicate lines with

Why not uniq?
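
One caveat worth sketching: uniq only collapses duplicate lines that are adjacent, so the list normally has to be sorted first (or `sort -u` can do both steps). The word list below is made up for illustration.

```shell
# Made-up word list with duplicates that are not next to each other.
words=$(printf '%s\n' best knowest best doest knowest)

# uniq alone: no two equal lines are adjacent, so nothing is removed.
uniq_only=$(printf '%s\n' "$words" | uniq)

# Sorting first puts duplicates side by side; sort -u also drops them.
sorted_unique=$(printf '%s\n' "$words" | sort -u)

printf '%s\n' "$sorted_unique"
```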

fatmac 06-05-2015 01:14 PM

You may find sed is a better tool for this job; publishers seem to use it to change words in very long texts.
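
As a rough sketch of what such a sed replacement could look like, assuming GNU sed (for the `\<` and `\>` word anchors): the sentence and the two word pairs are illustrative only, borrowed from the examples in the first post, and are not the poster's actual command.

```shell
# Made-up sample sentence; the replacement pairs are illustrative only.
text='He answereth, and thou changest not.'

# One -e expression per old/new pair; \< and \> keep each match to whole words.
modern=$(printf '%s\n' "$text" |
  sed -e 's/\<answereth\>/answers/g' -e 's/\<changest\>/change/g')

printf '%s\n' "$modern"
```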

rbees 06-05-2015 03:23 PM

Thanks linosaurusroot

I think I like uniq better than awk for getting rid of the duplicate lines.

Thanks fatmac,

I have a sed command all ready to do the changes; what I am having issues with is getting the database for the sed command built. When grep is returning nonexistent words from the data files, all 970 of them, and one is not sure just how each should be changed, it makes the task more difficult than it needs to be.

Thanks again

grail 06-06-2015 12:46 PM

May I ask you to explain the find command you are using? I.e., what do you think each part is doing?

rbees 06-06-2015 09:48 PM

Thanks grail

Code:

:~$ find ~/path/to/*.htm -exec grep -oh "\w*est" * {} \+ > ~/work/path/wordList.txt
find with the path to the source files

-exec tell find to execute the grep command on said files

the -oh switches tell grep to print only the matched (non-empty) parts of a matching line, one on each line, and to skip the path/filename details

"\w*est" tells grep to find words ending in est

* tells grep all said words

No guess on {} or \+ but will try to search it out when I get some time. But I assume that it is some kind of inclusive wrapper around the output.

> redirects the output

I have intentions of omitting the -h to see if I can figure out just what word it is returning the bad data for, but I have not had a chance yet.

Thanks
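
For the `{}` and `+` parts, a minimal sketch may help, assuming GNU grep and a find that supports `-exec ... {} +`. The temporary directory and the two sample files are made up. find replaces `{}` with the names of the files it found, and the trailing `+` hands as many names as will fit to a single grep run. The backslash in `\+` is harmless, since `+` is not special to the shell. Using `-H` instead of `-h` keeps the file name on each match, which helps when tracking down where an odd word comes from.

```shell
# Throwaway directory with two made-up sample files.
demo=$(mktemp -d)
printf 'thou knowest\n' > "$demo/a.htm"
printf 'he is the best\n' > "$demo/b.htm"

# {} stands for the found file names; + batches them into one grep call.
# -H prefixes every match with the name of the file it came from.
matches=$(find "$demo" -name '*.htm' -exec grep -oH '\w*est' {} + | sort)

printf '%s\n' "$matches"
rm -rf "$demo"
```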

joec@home 06-06-2015 10:12 PM

Just personal preference: I'm not too partial to using find in scripts such as these and would much rather use loops. They are not as fancy, but it is easier to see what you are doing, and easier to see what you might be doing wrong when it breaks. Here is a rough example; you might have to tweak it a little. It is also good practice to sort before uniq, because uniq only removes adjacent duplicate lines; a single sort is enough ("sort | uniq").

Build your loop (the variable is named DIR because assigning to PATH would break command lookup):
DIR=~/path/to; for i in "$DIR"/*.htm; do echo "$i"; done

Add your additional variables and test:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do echo "$i" "$DIR" "$SEARCH"; done

Add the commands to do the work:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do grep -i "\w.*$SEARCH" "$i"; done

Add additional filters and verify your work:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do grep -i "\w.*$SEARCH" "$i"; done | sort | uniq

Output to file:
DIR=~/path/to; SEARCH="est"; for i in "$DIR"/*.htm; do grep -i "\w.*$SEARCH" "$i"; done | sort | uniq > path/to/file

grail 06-07-2015 04:54 AM

Quote:

* tells grep all said words
Not sure the above is doing what you expect. If we remove find from the equation for the moment, what would you expect from the following:
Code:

grep -oh "\w*est" * ~/path/to/actual_file_name_here.htm
In the above, what would you expect to happen to the asterisk (*) sitting on its own?

I am not trying to be difficult, but I am trying to lead you to the answer so you learn from it :)

I would add that this is not necessarily the solution to your problem, but it is introducing unwanted side effects all the same.

rbees 06-07-2015 07:08 AM

Thanks grail

Quote:

I am not trying to be difficult, but I am trying to lead you to the answer so you learn from it
I truly appreciate your tutoring.

According to my grep pocket reference (O'Reilly), the * causes grep to match any number of characters. Which in this case I think should match any word that has the est in it more than once.

grail 06-07-2015 10:19 AM

As the asterisk is unprotected by any quotes it will be dealt with by the shell prior to grep seeing it. So it will actually be expanded to be all the file and directory names in the current directory.
This in turn means that your regex is not being run simply on the html files found but also on all the files and directories (dirs will cause an error ... I think) in the current directory.

Like I said before, not sure that will fix the issue, but try removing it and you should get some different results ... assuming of course you are not running it from the directory where all the html files are anyway :)
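
The expansion can be seen without running grep at all. In this minimal sketch, printf stands in for grep and simply prints each argument it receives in brackets. The temporary directory and the file names notes.txt, todo.txt and target.htm are hypothetical.

```shell
# Throwaway directory holding two hypothetical files.
demo=$(mktemp -d)
touch "$demo/notes.txt" "$demo/todo.txt"

# Unquoted: the shell replaces * with the names in the current directory
# before the command (printf here, grep in the thread) ever runs.
unquoted=$(cd "$demo" && printf '[%s]' grep -oh '\w*est' * target.htm)

# Quoted: the * reaches the command untouched.
quoted=$(cd "$demo" && printf '[%s]' grep -oh '\w*est' '*' target.htm)

printf '%s\n%s\n' "$unquoted" "$quoted"
rm -rf "$demo"
```

In the unquoted case the command receives notes.txt and todo.txt as extra file arguments, so grep would search those files as well as the intended one.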

rbees 06-07-2015 12:28 PM

Well, it seems that the most recent round of updates has broken grep.

Per the man page:
Quote:

-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each
such part on a separate output line.
But when I run the command above I am getting the whole line anyway. I have played with the switches, both long and short versions, and I get the same output.

Code:

:~/bin/shabbat/data/JPS$ grep -o "\w.*est" et0103.htm
B>3</B> but of the fruit of the tree which is in the middle of the garden, God has said: You will not eat of it, neither will you touch it, lest
P><B>22</B> And HaShem God said: 'Behold, the man is become as one of us, to know good and evil; and now, lest
B><A NAME="Mail">Got a quest


rbees 06-07-2015 12:34 PM

I do think that I may know why I am getting the results I am. I think the bad words are beginning-of-line words that are continued from the previous line and so are missing some of their letters. Still, I question why they only seem to exist when starting with a "b".

rknichols 06-07-2015 01:36 PM

Quote:

Originally Posted by rbees (Post 5373579)
Well, it seems that the most recent round of updates has broken grep.

Per the man page [...] But when I run the command above I am getting the whole line anyway. I have played with the switches, both long and short versions, and I get the same output.

Code:

:~/bin/shabbat/data/JPS$ grep -o "\w.*est" et0103.htm
B>3</B> but of the fruit of the tree which is in the middle of the garden, God has said: You will not eat of it, neither will you touch it, lest
P><B>22</B> And HaShem God said: 'Behold, the man is become as one of us, to know good and evil; and now, lest
B><A NAME="Mail">Got a quest


That ".*" is not doing what you want. As written, the match will begin with the first word character on the line ("\w" is a synonym for "[_[:alnum:]]", i.e. letters, digits, and underscore), which is probably near the beginning of the line, then include any number of any character, and end with the last instance of the string "est". Try this:
Code:

"\w\+est\w*"
That's one or more word characters (letters, digits, or underscore), followed by the string "est", and then however many word characters immediately follow.
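
The difference between the two patterns can be sketched as follows, assuming GNU grep. The sample line is a shortened, made-up imitation of the HTML shown earlier in the thread. The third pattern is an assumption beyond this reply: if only words that end in "est" are wanted, `\>` can close the match.

```shell
# Shortened, made-up line in the style of the HTML from the thread.
line='<B>3</B> neither will you touch it, lest thou diest. Got a question?'

# Greedy: from the first word character to the LAST "est" on the line.
greedy=$(printf '%s\n' "$line" | grep -o '\w.*est')

# Bounded: only word characters may surround "est".
bounded=$(printf '%s\n' "$line" | grep -o '\w\+est\w*')

# Assumption: \> keeps only words that END in "est".
ending=$(printf '%s\n' "$line" | grep -o '\w\+est\>')

printf '%s\n' "$greedy" '--' "$bounded" '--' "$ending"
```

The greedy result is nearly the whole line, which matches the behaviour reported above and explains why `-o` appeared not to work.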


All times are GMT -5.