LinuxQuestions.org
Old 06-01-2009, 08:08 AM   #1
jitendriya.dash
LQ Newbie
 
Registered: May 2009
Posts: 7

Rep: Reputation: 0
Performing fast search with a bash script


Hi,

Here is a tough requirement to be handled by a bash script.

I want to perform 300,000 * 10,000 searches.

i.e. I have 10,000 doc files and 300,000 html files in the file-system. I want to check which of the doc files are referred to in any of the html files (e.g. <a href="abc.doc">abc</a>). Finally, I want to remove all the doc files which are not referenced from any of the html files.

Approach 1:
Initially I tried nested loops: the outer loop over the list of html files and the inner loop over the list of doc files. Inside the inner loop I checked (with the fgrep command) whether one doc file is referenced in one html file.
# html_list         : list of all html files
# doc_file_list     : list of all doc files (file-name <TAB> file-path per line)
# tmp_doc_file_list : temp list of doc files not found so far

while read l_line_outer            # one html file per iteration
do
    > tmp_doc_file_list
    while read l_line_inner        # one doc entry per iteration
    do
        l_alias_name_file=`echo "$l_line_inner" | awk -F '\t' '{ print $1 }'`
        l_alias_path_file=`echo "$l_line_inner" | awk -F '\t' '{ print $2 }'`
        fgrep -s "$l_alias_name_file" "$l_line_outer" > /dev/null
        return_code=$?
        if [ $return_code -ne 0 ]
        then
            # keep only the docs not yet referenced, for the next html file
            printf "%s\t%s\n" "$l_alias_name_file" "$l_alias_path_file" >> tmp_doc_file_list
        fi
    done < doc_file_list
    mv tmp_doc_file_list doc_file_list
done < html_list

This approach gave correct output, but it took far too long to perform this huge number of searches.

Approach 2:
Then we switched to a different logic, launching many fgrep processes in parallel.

1. Outer loop over doc_file_list and inner loop over html_list.
2. Inside the inner loop, a single fgrep checked for one doc file in 30 html files at once.
3. I launched 10 such fgrep processes in parallel (by putting & at the end).

The sample code is as follows.
........
while read l_line_outer
do
    .......
    < logic to advance the outer loop pointer in steps of 10, i.e. the first pass
      starts from the 1st doc file and the next pass starts from the 11th >
    .......
    while read l_line_inner
    do
        < logic to advance the inner loop pointer in steps of 30, i.e. the first pass
          starts from the 1st html file and the next pass starts from the 31st >
        ........
        # loop to launch multiple fgrep searches in parallel
        for ((i=1; i<=10; i++))
        do
            ( fgrep -s -m 1 <file{i}> <html1> <html2> <html3> ... <html30> > /dev/null ; echo $? >> thread_status{i} ) &
        done
        ....
    done < html_list
    .....
    < logic to prepare doc_file_list for the next pass and to handle the multiple background jobs >
    .....
done < doc_file_list
......
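
To make the batching idea concrete, here is a stripped-down, self-contained sketch of the pattern I was aiming for (the batchN.html names and the referenced_docs output file are made up for illustration, and doc names are assumed to contain no whitespace):
Code:
# sketch only: at most 10 background fgrep jobs at a time, each checking one
# doc name against a fixed batch of html files, then wait for the whole batch
jobs_running=0
while read -r l_doc_name l_doc_path
do
    ( fgrep -s -m 1 "$l_doc_name" batch1.html batch2.html batch3.html > /dev/null \
        && echo "$l_doc_name" >> referenced_docs ) &
    jobs_running=$((jobs_running + 1))
    if [ $jobs_running -ge 10 ]
    then
        wait                 # let the current batch of fgrep jobs finish
        jobs_running=0
    fi
done < doc_file_list
wait                         # collect any jobs left over from the last partial batch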

However, this approach also did not work well.
a) I get the correct output on a small number of files/folders.
b) While performing 300,000 * 10,000 searches, my shell script deadlocks somewhere and execution halts.
c) Even if I manage the dead-locking (job management) to some extent, it will still take a long time to finish such a huge search.


Is there any alternative approach to make this search faster, so that it can be finished within 2-3 days at most?

Please help.

Thanks and Regards,

Jitendriya Dash.
 
Old 06-01-2009, 02:30 PM   #2
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
It's highly inefficient to search all those html files for EVERY SINGLE doc. Instead, make an index of all the doc references you find in the html files and search that one file instead. I'd say this will improve efficiency a lot.

So, something like:

Code:
# build an index of every html line that mentions a .doc file
> docs_found_in_html
while read html
do
  grep '\.doc' "$html" >> docs_found_in_html
done < html_list

# check each doc name against the index instead of against every html file
> found_docs
while read doc
do
  if grep -F -m 1 "$doc" docs_found_in_html > /dev/null
  then
    echo "$doc" >> found_docs
  fi
done < doc_file_list

sort doc_file_list > sorted_doc_file_list
sort found_docs > sorted_found_docs

# lines only in sorted_doc_file_list, i.e. docs not referenced by any html file
comm -13 sorted_found_docs sorted_doc_file_list > missing_docs

Last edited by H_TeXMeX_H; 06-03-2009 at 08:25 AM.
 
Old 06-03-2009, 04:03 AM   #3
jitendriya.dash
LQ Newbie
 
Registered: May 2009
Posts: 7

Original Poster
Rep: Reputation: 0
Thanks a lot.

Thanks a lot. This idea really worked like a miracle: simple and effective. Now the entire search can even be completed within a single day.

Thank you Very much.
 
Old 06-03-2009, 08:24 AM   #4
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Really? So it worked. Very cool.

If you used some of what I posted, you could technically speed it up even more by using the '-m 1' option for the second grep; I forgot about that.
 
Old 06-03-2009, 08:47 AM   #5
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 61
Were I to do that, I would do it this way:
— sed all HTML files to extract the doc files, normalize all file names, and sort -u,
— list all doc files, normalized the same way, and sort -u
— diff both,
— lines with a “<” are files referenced in HTML that do not exist; lines with a “>” are existing files that do not appear in the HTML.
— alternately, a uniq -d would give those that are in both HTML and filesystem.

Something like that:
Code:
diff <(sed '...' *.html | sort -u) <(find ... -printf ...)
or
{ sed '...' *.html; find ... -printf ...; } | uniq -d
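Filled in with assumed values (illustrative only: the sed expression supposes links like <a href="subdir/abc.doc"> with at most one .doc link per line, and /data/docs is a made-up location for the doc files), the first form could look like:
Code:
# strip each href down to a bare file name, then compare with the files on disk
diff <(sed -n 's!.*href="\([^"]*\.doc\)".*!\1!p' *.html | sed 's!.*/!!' | sort -u) \
     <(find /data/docs -name '*.doc' -printf '%f\n' | sort -u)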
In my experience, processing data as a stream like this is much faster than iterating file by file.

Yves.
 
Old 06-03-2009, 09:56 AM   #6
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
That's going to be a more difficult solution. I doubt you can isolate just the name of the .doc unless the names begin with something specific.
 
Old 06-03-2009, 10:52 AM   #7
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
If you have Python:
Code:
import os
import re

doc = []
path = os.path.join("/home", "path1")
# capture the target of an <a href="....doc"> link
pat = re.compile(r'<a\s+href="([^"]*\.doc)"')

# first pass: collect the names of all .doc files on disk
for r, d, f in os.walk(path):
    for files in f:
        if files.endswith(".doc"):
            doc.append(files)

# second pass: drop every doc that is referenced from some html file
for r, d, f in os.walk(path):
    for files in f:
        if files.endswith(".html"):
            for line in open(os.path.join(r, files)):
                for found in pat.findall(line):
                    print "found ", found
                    if found in doc:
                        doc.remove(found)

print "These are docs not referenced: "
for d in doc:
    print d
 
Old 06-04-2009, 12:54 AM   #8
jitendriya.dash
LQ Newbie
 
Registered: May 2009
Posts: 7

Original Poster
Rep: Reputation: 0
The applied logic that worked...

Hi,

This is the logic I applied.

# l_html_list           : file with all html paths
# l_conf_list_extension : contains all extensions that need to be
#                         searched, e.g. .doc, .xls
# l_tmp_parse_path      : file with 2 columns: first column is the
#                         file name, second column is the file path

# Pass 1: collect every html line that mentions one of the extensions
while read l_line_html
do
    fgrep -f $l_conf_list_extension "$l_line_html" >> $l_tmp_html_content
done < $l_html_list

# Pass 2: check each doc name against the collected html content
while read l_line_docs
do
    l_alias_name_docs=`echo "${l_line_docs}" | $AWK_COMMAND -F '\t' '{ print $1 }'`
    l_alias_path_docs=`echo "${l_line_docs}" | $AWK_COMMAND -F '\t' '{ print $2 }'`

    #-- Grep the file name against the temp file holding all relevant html lines.
    fgrep -s -m 1 "$l_alias_name_docs" "$l_tmp_html_content" > /dev/null
    l_return_code=$?
    if [ $l_return_code -ne 0 ]
    then
        printf "%s\t%s\n" "$l_alias_name_docs" "$l_alias_path_docs" >> $l_tmp_parse_final
    fi
done < $l_tmp_parse_path

Anyway, thanks to you all again. This task now completes successfully, searching for 10,000 files across 300,000 html files in only 2 hours. Previously it was taking a very long time; the performance has improved drastically.

Thanks and Regards,

Jitendriya Dash.
 
  

