Old 12-14-2017, 12:42 PM   #1
nadiawicket
LQ Newbie
 
Fastest way to search a 500 thousand part array in BASH?


I really need help finding the fastest way to search for every element of a 500,000-item one-dimensional array in a DATA file.

Every line of the ArrayDataFile has to be searched for in the DATA file. I declare the array, readarray it from the ArrayDataFile, and then loop over the elements. Is this the best code for the job? Each array element is searched for in the DATA file, but the list is 500,000 items long. The samples below are exact replicas of the real file contents, cut off after the first 40 lines or so; the real list is 600,000 lines long. I need to optimize this search so it runs as fast as possible on my outdated hardware:
DATA file is
Code:
1321646154654654654654654546513213321378787542119
The ArrayDataFile is
Code:
    12
    36
    3321
    31
    51
    52
    21
    13
    32     ##### .. . .. . on and on until 500000 items
BASH Script code that I have been able to put together to accomplish this:
Code:
#!/bin/bash
declare -a Array
readarray -t Array < ArrayDataFile   # -t strips the trailing newline from each element
for each in "${Array[@]}"
do
  # one grep per element: count occurrences of $each in DATA
  LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
To quickly search for 500,000 array elements taken from the lines of the ArrayDataFile, what is the absolute best way to optimize the code for search speed? I need the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; whatever method is fastest is fine, be it sed, awk, grep or anything else.

Is bash the best way at all? I have heard it's slow.

Last edited by nadiawicket; 12-14-2017 at 02:57 PM.
 
Old 12-14-2017, 01:26 PM   #2
danielbmartin
Senior Member
 
Quote:
Originally Posted by nadiawicket View Post
Really need help trying to find the fastest way to search a 500000 part unidimensional array Array in the DATA file.
If the array can be sorted, a binary search is a fast search technique.

https://en.wikipedia.org/wiki/Binary_search_algorithm
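For illustration, a minimal bash sketch of the idea, assuming the array is sorted numerically and only a membership test is needed (requires bash 4.3+ for namerefs; all names here are illustrative):

Code:
#!/bin/bash
# print the index of $2 in the sorted array named by $1, or -1 if absent
bsearch() {
    local -n arr=$1
    local key=$2 lo=0 hi=$(( ${#arr[@]} - 1 )) mid
    while (( lo <= hi )); do
        mid=$(( (lo + hi) / 2 ))
        if   (( arr[mid] < key )); then lo=$(( mid + 1 ))
        elif (( arr[mid] > key )); then hi=$(( mid - 1 ))
        else echo "$mid"; return 0
        fi
    done
    echo -1; return 1
}

sorted=(12 13 21 31 36 51 52 3321)
bsearch sorted 51    # prints 5
Each lookup then touches at most log2(500000), about 19, elements instead of scanning the whole array.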

Daniel B. Martin

Last edited by danielbmartin; 12-14-2017 at 01:30 PM. Reason: Add Wikipedia reference.
 
Old 12-14-2017, 01:37 PM   #3
astrogeek
Moderator
 
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file, not simply whether it appears in the DATA file.

Before spending much time identifying the fastest way, I would try some simple methods built from existing tools and look for a fast-enough solution among them.

The first that comes to mind is a simple grep|sort|uniq...

Code:
grep -wf ArrayDataFile DATA |sort -n |uniq -c
...which will produce the count per matched array element itself, without additional bash array iteration.
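For illustration, assuming DATA really is one value per line as originally posted, the output would look something like this (the counts shown are made up):

Code:
$ grep -wf ArrayDataFile DATA |sort -n |uniq -c
      3 21
      1 36
      2 3321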

Last edited by astrogeek; 12-14-2017 at 02:12 PM. Reason: Added "number of..." note
 
Old 12-14-2017, 02:02 PM   #4
MadeInGermany
Member
 
Of course grep -c is faster than grep | wc -l (though note that -c counts matching lines, not individual occurrences as -o | wc -l does),
and you can bundle the output into one stream instead of an open-append-write-close cycle for each element:
Code:
export LC_ALL=C   # export, so the setting reaches fgrep
for each in "${Array[@]}"
do
  fgrep -c "$each" '/home/USER/Documents/DATA'
done > GrepResultsOutputFile
What does your DATA file look like? It would help to search only in a certain area of the line, ideally at the beginning or end.
 
Old 12-14-2017, 02:14 PM   #5
MadeInGermany
Member
 
I see now that you gave a sample of the DATA file. Then please also give a sample of the ArrayDataFile.
If there were a 1 in the ArrayDataFile, should it match all of the lines
Code:
1321
61
651
321
?
 
Old 12-14-2017, 02:40 PM   #6
nadiawicket
LQ Newbie
 
Original Poster
Quote:
Originally Posted by astrogeek View Post
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file...
Running grep -wf ArrayDataFile DATA |sort -n with the DATA file being
Code:
98787879878465321321646513218513213123216478322354798798432543216846513184351684313512131213211311321217982114645465465421313215431324651321651321653546513546515465454654654644546546545465465464321654659876544846546132654324013216513543184654
and the ArrayDataFile being
Code:
78
22
8
7
so as to count how many times the value on each line of ArrayDataFile occurs inside the DATA file (which is just one long string of numbers),
Code:
grep -wf ArrayDataFile DATA |sort -n |uniq -c
did not produce the desired output. In fact, it produces nothing.
 
Old 12-14-2017, 02:51 PM   #7
astrogeek
Moderator
 
Quote:
Originally Posted by nadiawicket View Post
Running grep -wf ArrayDataFile DATA |sort -n with the DATA file being [the long digit string]...
No, it wouldn't produce anything with a single long string.

But your example DATA file as originally posted was one element per line, not a single long string. Which is it?

Quote:
DATA file is
Code:
1321
64
61
546
54
654
654
654
654
54
651
321
3
321
37
87
87
54
211
9       #####. . . . . . . . on to 600000
In your example array you have both 54 and 654; how would you expect those to be distinguished? Should both match, or only the longest, or the first one found?

Perhaps you should provide a more complete description of the problem, and give some examples of input and expected results. It would be very difficult to provide help based on the information provided so far.

Last edited by astrogeek; 12-14-2017 at 03:00 PM.
 
Old 12-14-2017, 02:58 PM   #8
nadiawicket
LQ Newbie
 
Original Poster
Quote:
Originally Posted by astrogeek View Post
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file...
Sorry, I was not clear: my DATA file just contains one large string of data, and the ArrayDataFile is the one that contains the 500,000 or so items.
 
Old 12-14-2017, 03:01 PM   #9
sundialsvcs
LQ Guru
 
This, to me, is an excellent use-case for SQLite, a very sophisticated database-engine that is entirely based on disk files, not servers.

(And a case for using a real programming language to do the work, with a #! shebang line to leverage the language of your choice.)

Putting 500,000 records (or many times that many ...) into an SQLite table is (yawn ...) "no big deal," and once they are there you can query the information to get the answers that you need, exploiting "indexes" to get those answers efficiently. You will probably find that queries can completely replace a lot of the "procedural logic" that you might now be contemplating.

Furthermore, since many spreadsheets and other tools can readily use SQLite databases as a data-source, "who knows?" ... You very well might be able to eliminate the use of "a custom computer program" altogether! (It happens. A lot.)
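For illustration, a minimal sketch of the import step, assuming the sqlite3 command-line shell is installed (the database, table, and index names are made up):

Code:
sqlite3 keys.db <<'EOF'
-- one row per array element
CREATE TABLE IF NOT EXISTS keys(k TEXT);
.import ArrayDataFile keys
CREATE INDEX IF NOT EXISTS keys_k ON keys(k);
EOF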
 
Old 12-14-2017, 03:53 PM   #10
MadeInGermany
Member
 
Are you saying the DATA file is one line with 500 000 characters?
 
Old 12-14-2017, 10:16 PM   #11
michaelk
Moderator
 
Quote:
my DATA file just contained a large string of data and that the ArrayDataFile is the one that contains 500 000 or so items.
You have a list of 500,000 patterns that you want to find in a single string that you call your DATA file, which is not the same thing as searching an array, if I understand the question.

grep is efficient, and you should be able to search for all of the patterns at once like

Code:
grep -o -f ArrayDataFile DATA
but it isn't producing the same results as the individual searches, so I must be doing something wrong.
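For what it's worth, a sketch (untested against the real data) of turning those matches into per-key counts; note that grep -o consumes the text it matches, so overlapping occurrences are not counted:

Code:
grep -o -f ArrayDataFile DATA | sort -n | uniq -c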

https://stackoverflow.com/questions/...ep-run-so-fast
https://en.wikipedia.org/wiki/Boyer%...arch_algorithm
 
Old 12-15-2017, 12:38 AM   #12
MadeInGermany
Member
 
If the DATA file is much smaller, then I would put it in memory and process the big file. The following assumes the DATA file is one line.
Code:
#!/bin/sh
# load the single-line DATA string into memory once
IFS="" read -r data < /home/USER/Documents/DATA
while read -r key
do
  # print 1 if the key occurs anywhere in the string, 0 otherwise
  case $data in
  (*$key*) echo 1;;
  (*) echo 0;;
  esac
done < ArrayDataFile > GrepResultsOutputFile
 
Old 12-15-2017, 02:02 PM   #13
MadeInGermany
Member
 
Looking at your grep -o in post #1, I think you want to display how often each key is found in DATA.
This is not a good fit for pure shell, but in awk the gsub() function returns the number of substitutions it makes, which counts the matches.
Code:
awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
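One caveat: gsub() treats each key as a regular expression and does not count overlapping hits (33 occurs five times in 333333 if overlaps count, but gsub() reports three). If fixed-string, overlapping counts are wanted, a sketch along these lines should do it (same file layout assumed):

Code:
awk 'NR==FNR {data=$0; next}
     {n = 0; pos = 1
      # advance one character past each match start so overlaps are counted
      while ((hit = index(substr(data, pos), $1)) > 0) { n++; pos += hit }
      printf "\"%s\" found %d times\n", $1, n}' DATA ArrayDataFile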
 
Old 12-15-2017, 02:40 PM   #14
starrysky1
LQ Newbie
 
Thanks all for the answers.

Is AWK faster than SQLite for this job? I just need a very fast count of each word (taken from the ArrayDataFile) in the string.
 
Old 12-15-2017, 03:18 PM   #15
astrogeek
Moderator
 
Quote:
Originally Posted by starrysky1 View Post
Thanks all for the answers.

Is AWK faster than SQLite for this job? I just need a very fast word count (taken from the arraydatafile) applied to the string.
It is not possible for anyone to answer that question, as you still have not provided a very useful description of the task, even with the changes made to the original question.

Specific things we would need to know: how overlapping array items should be handled (e.g. 54 and 654); whether there is a pattern to the array items (does 500,000 mean the integers from 0 to 500,000? is order significant?); how repetitions should be handled; and how to count patterns that can match more than one way (how would 33 match the sequence 333333, or 131 match 1313131?).

As asked previously, please provide a more complete problem description with a few examples of input and expected results, especially for cases where array items overlap, of which there must be very many based on the samples provided so far.

With regard to SQLite, I doubt this would be a good use case for it, as the main benefit would come from the use of indexes, and you cannot index the string to be searched in a way that is useful to this search. In fact, generating such an index would itself require an implementation of the very search you appear to be trying to perform.

Please review the Site FAQ for guidance in formulating a more complete question, and especially follow the links at bottom of that page.

Finally, it is not acceptable to create multiple login accounts here at LQ. Please decide which member name you would like to use and request that the other(s) be closed. Using multiple member names is confusing to others and may be considered to be deceitful by some members.

Last edited by astrogeek; 12-15-2017 at 04:18 PM. Reason: Typo, added repetitions case
 
  

