Old 06-16-2015, 06:59 AM   #1
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Wheezy, Jessie
Posts: 905

Rep: Reputation: 46
grep with two arrays


Ladies & Gents,

As always thank you for helping me get things figured out so they work as I intend them to.

I am trying to update the language in some text files. I have a short script that will do so, but it takes fifteen minutes or so to complete on my Dell laptop with an i5 processor, which is not super speedy by today's standards but not a slouch either.

I believe I can speed that up quite a bit if I can generate a specific list of changes that need to be made to each file, instead of throwing a bulk database at every file. There are 970 files and some 800+ antiquated words that need to be changed. Some of the files are only a hundred or so words long, so checking all 800+ words against those short files clearly wastes CPU cycles and time for no gain.

The problem I am having is finding a way to feed grep each element from the array that contains the word list, one at a time. I have tried everything I know, and Google has not yielded any help that I have been able to translate into what I want to accomplish.

The word list is a text file with one word (or phrase) per line, which I then pull into an array for looping through. (The loop is not working correctly.)

The file list also gets pulled into an array and appears to be working correctly.

I am sure it is something simple; I have just not managed to figure out what it is.

Code:
#!/bin/bash
set -x

WORKDIR="$HOME/bin/scripting/BibleTextUpdate"
DPATH="$HOME/bin/shabbat/data/JPS"

# Read list of words to search for into array
declare -a filecontent
filecontent=$(cat $WORKDIR/MyFindWordList.txt)

# Read list of files into array
files_in_dirs=($DPATH/et*htm)

for f in "${files_in_dirs[@]}" ;do
  oldIFS="$IFS"	# have commented out
  IFS=$'\n'	# have commented out
  echo $f
  for e in "${filecontent}" ;do  # have had [$] [@] and[*] here too
    grep -n $e $f > "$f.txt"
  done
  IFS="$oldIFS"	# have commented out
done
Ideally I want the output file to look something like this, as I don't need the context, only where in the file to look to see the changes that need to be made.

Code:
12:word
24:other word
48:another word or phrase
As the script stands, it is generating an empty file and grepping for the whole word list at one time, which will never match. This script need not be super efficient, but the script that will be built from the data it returns will need to be.

Thanks
 
Old 06-16-2015, 07:04 AM   #2
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 11,515

Rep: Reputation: 3461
the "usual" double-loop problem, with an external command inside (which is an additional loop, and contains a fork too).
I would reimplement it in a single Perl script; that will definitely run faster.
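Something like this single-pass sketch, run from the shell (untested; file names are placeholders, and it writes everything to one combined file rather than one .txt per input):
Code:
perl -ne '
  BEGIN {
    open my $wl, "<", "MyFindWordList.txt" or die "word list: $!";
    chomp(@words = <$wl>);
  }
  for my $w (@words) {
    print "$ARGV:$.:$w\n" if $w ne "" and index($_, $w) >= 0;
  }
  close ARGV if eof;   # reset $. so line numbers restart per file
' et*.htm > all-matches.txt
The word list is read once, and every file is scanned in a single pass instead of forking grep per word per file.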
 
Old 06-16-2015, 07:11 AM   #3
rtmistler
Moderator
 
Registered: Mar 2011
Location: MA, USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 7,214
Blog Entries: 12

Rep: Reputation: 2656
I don't understand why you're just using grep. I'd use sed to search and replace, then redirect the output to a parallel directory, so that the original content is retained while the changes get completed.

I'm assuming you know the syntax of a sed search and replace, but a minor example:
Code:
$ cat 1.txt
This is my file
$ sed -e s/This/That/g 1.txt
That is my file
sed doesn't change the file; it writes to stdout. You can, however, redirect that into another file name with "> 2.txt".

Granted, you would be iteratively running sed many times on the same file, but right now you're iteratively running grep in the same manner without changing the files. The way you have it, you'd generate your change list, and then how would you be changing the files? Manually?
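For instance, a minimal untested sketch along those lines, using your $DPATH and $WORKDIR variables and a hypothetical substitutions.sed holding one s/old/new/g command per line:
Code:
mkdir -p "$DPATH/updated"
for f in "$DPATH"/et*htm; do
  # originals stay untouched; changed copies land in the parallel directory
  sed -f "$WORKDIR/substitutions.sed" "$f" > "$DPATH/updated/$(basename "$f")"
done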
 
Old 06-16-2015, 07:41 AM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,387

Rep: Reputation: 1553
Code:
  for e in "${filecontent}" ;do  # have had [$] [@] and[*] here too
    grep -n $e $f > "$f.txt"
  done
done
Not entirely following what you're doing, but the above loop should be approximately equivalent to
Code:
grep -Fn -f "$WORKDIR/MyFindWordList.txt" "$f" > "$f.txt"
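(-F treats each pattern as a fixed string rather than a regex, and -f reads one pattern per line from the given file, so the whole word list is matched in a single pass over the file.)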
 
1 member found this post helpful.
Old 06-16-2015, 10:25 AM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,695

Rep: Reputation: 561
Quote:
Originally Posted by rbees
... I believe I can speed that up quite a bit if I can generate a specific list of changes that need to be made to each file, instead of throwing a bulk database at every file. ...
I prefer to respond to LQ posts with running tested code. In this case the sheer volume of data is an obstacle, so I'll outline an approach which you might try; a rough, untested sketch follows the steps.

1) For each file which requires substitutions, construct a temporary file called a lexicon: a file in which each line contains a single word which appears somewhere in the input file. If the input contains multiple instances of a word, the lexicon has it only once.

2) For each file which requires substitutions, bounce the newly constructed lexicon against your large list of possible substitutions to create a subset substitution file (another temporary file). This subset contains only those substitutions which will be needed.

3) Use sed or grep or awk (whichever suits your fancy) with the second temporary file to make the substitutions.

4) Blow away both temporary files; they have served their purpose.
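Something like this, with hypothetical temp-file names (untested; it assumes the word list holds single words, since multi-word phrases would need extra handling in step 2):
Code:
for f in "$DPATH"/et*htm; do
  # 1) lexicon: every word that appears in this file, once each
  grep -oE '[[:alpha:]]+' "$f" | sort -u > lexicon.tmp
  # 2) subset: only those substitution words this file actually contains
  grep -Fx -f lexicon.tmp "$WORKDIR/MyFindWordList.txt" > subset.tmp
  # 3) apply the small subset to the file (here: report line numbers)
  grep -Fn -f subset.tmp "$f" > "$f.txt"
  # 4) blow away the temporary files
  rm -f lexicon.tmp subset.tmp
done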

Will this proposed code be faster than what you already have? It depends on the content of the files you are processing.

Daniel B. Martin
 
Old 06-16-2015, 01:00 PM   #6
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Wheezy, Jessie
Posts: 905

Original Poster
Rep: Reputation: 46
Thanks all,

pan64
Quote:
I would reimplement it in a single Perl script; that will definitely run faster.
That would all be fine except that I have almost zero knowledge of Perl.


rtmistler
Quote:
I don't understand why you're just using grep. I'd use sed
I have a sed script
Quote:
I have a short script that will do so, but it takes fifteen minutes or so to complete
I am not trying to do a find/replace with this code, only to generate a more efficient data structure for the final script that actually makes the changes.

ntubski, that section of code is supposed to take one of the words/phrases ($e), grep one of the files ($f), write the line numbers out to a text file, and then repeat the process with the next word/phrase.

danielbmartin, your list is pretty much what I am trying to do. This step happens to be item 1 on your list. I have the majority of the words that will eventually need to be changed in the data file
Code:
filecontent=$(cat $WORKDIR/MyFindWordList.txt)
Reading 780 files, some of which are quite large, looking for 800+ words is just asking for a year-long process to generate the individual data files. Somehow I think a little bit of coding will speed the process up quite a lot.

If I can just figure out how to pass grep the word list one line at a time for each file and write the output to a text file, I'll be a lot closer than I am now.
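Something like this per-line loop is what I have in mind (rough and untested; note the >> so each word's matches get appended rather than overwriting the output file on every pass):
Code:
while IFS= read -r word; do
  [ -n "$word" ] || continue             # skip blank lines
  grep -Fn -- "$word" "$f" >> "$f.txt"   # append each word's matches
done < "$WORKDIR/MyFindWordList.txt"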

Code:
wget  -O "$STOREDIR/JPS/JPS.zip" "http://www.mechon-mamre.org/htmlzips/et002.zip"
will get you a copy of the files to be changed, but NOT the word list.

Thanks
 
Old 06-16-2015, 01:51 PM   #7
rtmistler
Moderator
 
Registered: Mar 2011
Location: MA, USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 7,214
Blog Entries: 12

Rep: Reputation: 2656
The issue is that this isn't like a modified shell sort, where once you've processed part of it you're done; you have multiple, different search criteria.

Another thing to consider: any search algorithm, be it handwritten or an existing utility, can never avoid reading the entire file; otherwise, how could it guarantee not to miss any strings?

Therefore, besides limiting the search to a single pass per term so you're not being wasteful, the only other thing to do here is to ensure that your system is not doing extraneous work. I'd boot to single-user mode and eliminate any side processing in order to devote as much CPU time as possible to this.
 
Old 06-16-2015, 08:00 PM   #8
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 905

Original Poster
Rep: Reputation: 46
ntubski

That line did the trick. So the new script is
Code:
#!/bin/bash
set -x

# Script to pull the antiquated words out of a text file and generate a list 
# for updating the language to more modern English.

# get the words
# find ~/bin/shabbat/data/JPS/*.htm -exec grep -oh "\w*eth\b" * {} \+ > ~/bin/scripting/BibleTextUpdate/wordList.txt

# sort the list
# sort ~/bin/scripting/BibleTextUpdate/wordList.txt > SortedWordList.txt

# delete duplicate lines
# uniq ~/bin/scripting/BibleTextUpdate/SortedWordList.txt ~/bin/scripting/BibleTextUpdate/UniqWordList.txt

WORKDIR="$HOME/bin/scripting/BibleTextUpdate"
DPATH="$HOME/bin/shabbat/data/JPS"


# Read list of files into array
files_in_dirs=($DPATH/et*htm)

for f in "${files_in_dirs[@]}" ;do
  echo $f
  grep -Fn -f "$WORKDIR/MyFindWordList.txt" "$f" > "$f.txt"
done
I guess I was just trying to make it too complicated. On the plus side, I do understand why the line works. Now it is generating the individual files with some relevant data in them, which will help in getting the final versions done.

It is, however, returning for some files the last line, which contains only HTML tags and whatnot; there is no reason that line should be returned, as there are no antiquated words in it. And it does not happen for every file, only for some of them. I can deal with that, though.

Thanks again
 
Old 06-16-2015, 09:54 PM   #9
rbees
Member
 
Registered: Mar 2004
Location: northern michigan usa
Distribution: Debian Squeeze, Whezzy, Jessie
Posts: 905

Original Poster
Rep: Reputation: 46
Update on the output.

I have another thread about getting corrupted data from grep, and I am still being afflicted by that problem: http://www.linuxquestions.org/questi...ds-4175544533/ I thought it was solved by a system update and reboot, but it seems not. So after removing the corrupted data, I am no longer getting that last line returned.
 
Old 06-17-2015, 01:31 AM   #10
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 11,515

Rep: Reputation: 3461
Check whether the word list contains illegal lines (for example, an empty line at the end) and/or illegal characters (od -xc).
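For example (sketch, paths assumed):
Code:
grep -n '^[[:space:]]*$' "$WORKDIR/MyFindWordList.txt"  # blank or whitespace-only lines
od -xc "$WORKDIR/MyFindWordList.txt" | head             # inspect the raw bytes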
 
  

