grep search returns nonexistent words
Ladies & Gents,
Thanks again for the great insight and guidance the users of this site provide. I am trying to search through 970 .htm files for old/Middle English words so that they can be updated to modern English: answereth vs. answers, changest vs. change, and so on. At this point I have some 800 words, but grep is returning nonexistent words, i.e. words that do not appear anywhere in the data files. Said command is Code:
:~$ find ~/path/to/*.htm -exec grep -oh "\w*est" * {} \+ > ~/work/path/wordList.txt
Code:
:~$ awk 'a !~ $0; {a=$0}' ~/work/path/wordList.txt > ~/work/path/TrimedWordList.txt Code:
bbadest
Code:
:~$ grep -rnw '/path/to/files/' -e "bbadest"
This only seems to happen to words that start with a "b", so I have concluded that there is something not quite right about the initial search string somehow. Any clues? Thanks
You are not using the regex correctly. The '*' matches any number (including zero) of the previous character: if the previous character is 'b', it will match any number of 'b's. What you probably want is '.*', which matches any number of any character. You may also find '\<' and '\>' useful for matching the beginning and end of a word.
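To make the difference concrete, here is a minimal sketch using an invented sample file (the path and contents are made up for illustration; GNU grep's `\<` and `\>` word anchors are assumed):

```shell
# Hypothetical sample file to show the effect of word-boundary anchors.
printf 'established interest\n' > /tmp/est_sample.txt

# "\w*est" happily matches a fragment inside a longer word:
grep -o '\w*est' /tmp/est_sample.txt
# est
# interest

# Anchoring with \< and \> keeps only whole words ending in "est":
grep -o '\<\w*est\>' /tmp/est_sample.txt
# interest
```

The unanchored pattern pulls "est" out of the middle of "established", which is exactly the kind of fragment that looks like a nonexistent word in the output list.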
Thanks metaschima
I added a . to the command in what I thought was the right place, but I am still getting the same results. Code:
$ find ~/path/to/*.htm -exec egrep -oh "\w.*est" * {} \+ > ~/path/to/work/wordList.txt
Quote:
You may find sed is a better tool for this job; publishers seem to use it to change words in very long texts.
Thanks linosaurusroot
I think I like uniq better than awk for getting rid of the duplicate lines. Thanks, fatmac. I have a sed command all ready to do the changes; what I am having issues with is getting the word list for the sed command built. When grep is returning nonexistent words from the data files, all 970 of them, and one is not sure just how a word should be changed, it makes the task more difficult than it needs to be. Thanks again
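For the replacement side of the task, here is a minimal sketch of the kind of sed pass being described (the file names and word pairs are invented for illustration; GNU sed's `-i` in-place editing and `\<`/`\>` word anchors are assumed):

```shell
# Hypothetical "old new" replacement list, one pair per line.
printf 'answereth answers\nchangest change\n' > /tmp/wordpairs.txt

# Hypothetical page to update.
printf 'He answereth and changest.\n' > /tmp/page.htm

# Apply each pair as a whole-word substitution, editing the file in place.
while read -r old new; do
  sed -i "s/\\<$old\\>/$new/g" /tmp/page.htm
done < /tmp/wordpairs.txt

cat /tmp/page.htm
# He answers and change.
```

The word anchors keep "changest" from being rewritten inside a longer word; everything hinges on the word list being accurate, which is why the bogus grep output is the real blocker.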
May I ask you to explain the find command you are using? i.e., what do you think each part is doing?
Thanks grail
Code:
:~$ find ~/path/to/*.htm -exec grep -oh "\w*est" * {} \+ > ~/work/path/wordList.txt
-exec tells find to execute the grep command on the files it finds.
The -oh switches tell grep to print only the matched (non-empty) parts of a matching line, one on each line, and to skip the path/filename details.
"\w*est" tells grep to find words ending in est.
* tells grep all said words.
No guess on {} or \+, but I will try to search it out when I get some time; I assume it is some kind of inclusive wrapper around the output.
> redirects the output.
I have intentions of omitting the -h to see if I can figure out which file the bad data is coming from, but I have not had a chance yet. Thanks
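For comparison, here is a sketch of the same pipeline without the stray `*` (the sample directory and contents are invented; `{} +` batches the found files onto one grep command line, much like xargs):

```shell
# Invented sample data.
mkdir -p /tmp/htmdemo
printf 'thou changest and answerest the oldest\n' > /tmp/htmdemo/a.htm

# No bare * between the pattern and {}: grep sees only what find hands it.
find /tmp/htmdemo -name '*.htm' -exec grep -oh '\w*est' {} + | sort -u
# answerest
# changest
# oldest
```

`sort -u` replaces the separate awk/uniq dedup step, since uniq only removes adjacent duplicates anyway.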
Just personal preference, but I'm not partial to using find in scripts like these; I'd much rather use loops. Not as fancy, but it's easier to see what you are doing, and easier to see what you might be doing wrong when it breaks. A rough example follows; you might have to tweak it a little. It's also good practice to sort before uniq, since uniq only collapses adjacent duplicate lines.
Build your loop (note: don't name the variable PATH, or you will clobber the shell's command search path):
files=~/path/to/*.htm; for i in $files; do echo "$i"; done
Add your additional variables and test:
files=~/path/to/*.htm; search="est"; for i in $files; do echo "$i" "$search"; done
Add the commands to do the work:
files=~/path/to/*.htm; search="est"; for i in $files; do grep -io "\w*$search" "$i"; done
Add additional filters and verify your work:
files=~/path/to/*.htm; search="est"; for i in $files; do grep -io "\w*$search" "$i"; done | sort | uniq
Output to file:
files=~/path/to/*.htm; search="est"; for i in $files; do grep -io "\w*$search" "$i"; done | sort | uniq > path/to/file
Quote:
Code:
grep -oh "\w*est" * ~/path/to/actual_file_name_here.htm
I am not trying to be difficult; I am trying to lead you to the answer so you learn from it :) I would add that this is not necessarily the solution to your problem, but it is introducing unwanted side effects all the same.
Thanks grail
Quote:
According to my grep pocket reference (O'Reilly), the * causes grep to match any number of characters, which in this case I think should match any word that has est in it, more than once.
As the asterisk is unprotected by any quotes, it will be dealt with by the shell before grep ever sees it, so it will actually be expanded to all the file and directory names in the current directory.
This in turn means that your regex is not being run only on the .htm files found by find, but also on all the files and directories (the directories will cause an error ... I think) in the current directory. Like I said before, I'm not sure that will fix the issue, but try removing it and you should get some different results ... assuming, of course, you are not running it from the directory where all the .htm files are anyway :)
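The expansion is easy to see with echo (the scratch directory is invented for the demo):

```shell
# Create a scratch directory with a couple of files.
mkdir -p /tmp/globdemo
cd /tmp/globdemo
touch file1.htm file2.htm

# The quoted pattern survives untouched, but the bare * is expanded by
# the shell before grep would ever run:
echo grep -oh "\w*est" *
# grep -oh \w*est file1.htm file2.htm
```

So the actual grep invocation silently gains extra file arguments: whatever happens to be in the current directory when the command is run.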
Well, it seems that the most recent round of updates has broken grep
Per the man page Quote:
Code:
:~/bin/shabbat/data/JPS$ grep -o "\w.*est" et0103.htm
I do think I may know why I am getting the results I am: the bad words are beginning-of-line words that are continued from the previous line, and so are missing some of their letters. Still, I question why they only seem to exist when starting with a "b".
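That hypothesis is easy to reproduce with an invented file in which a word is split across a hard line break:

```shell
# "humblest" split across two lines, as hard-wrapped text often is.
printf 'he was the hum\nblest and best of men\n' > /tmp/split.htm

grep -oh '\w*est' /tmp/split.htm
# blest
# best
```

grep matches line by line, so the tail end "blest" of "humblest" is reported as if it were a word of its own; a fragment like "bbadest" could arise the same way.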
Quote:
Code:
"\w\+est\w*"
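A quick check of that pattern against a made-up sample line (the file name is invented):

```shell
printf 'He answereth; thou changest the oldest text.\n' > /tmp/final.htm

# \w\+ requires at least one word character before "est", and the
# trailing \w* extends the match to the end of the word.
grep -oh '\w\+est\w*' /tmp/final.htm | sort -u
# changest
# oldest
```

Combined with the earlier fixes (removing the stray `*` and anchoring on word characters), this yields whole words only, rather than bare "est" fragments.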