LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Fastest way to search a 500 thousand part array in BASH? (https://www.linuxquestions.org/questions/programming-9/fastest-way-to-search-a-500-thousand-part-array-in-bash-4175619614/)

nadiawicket 12-14-2017 12:42 PM

Fastest way to search a 500 thousand part array in BASH?
 
Really need help finding the fastest way to search for every element of a 500,000-item one-dimensional array in a DATA file.

I need every line of the ArrayDataFile searched for in the DATA file. I generate the array with declare, then readarray from a file in my Documents folder, and loop over it. Is this the best code for the job? Each array element is searched for in the DATA file, but it is a 500,000-item list. The sample below replicates the start of the DATA file contents; the real file is 600,000 lines long. I need to optimize this search so it runs as fast as possible on my outdated hardware:
DATA file is
Code:

1321646154654654654654654546513213321378787542119
The ArrayDataFile is
Code:

    12
    36
    3321
    31
    51
    52
    21
    13
    32    ##### ... and so on, up to 500000 items

BASH Script code that I have been able to put together to accomplish this:
Code:

#!/bin/bash
declare -a Array
readarray Array < ArrayDataFile 
for each in "${Array[@]}"
do
LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done

To quickly search for all 500,000 array elements taken from the lines of the ArrayDataFile, what is the absolute best way to optimize the search for speed? I need the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is fastest will do, be it sed, awk, grep, or anything else.

Is BASH even the best tool for this? I've heard it's slow.

danielbmartin 12-14-2017 01:26 PM

Quote:

Originally Posted by nadiawicket (Post 5793214)
Really need help finding the fastest way to search for every element of a 500,000-item one-dimensional array in a DATA file.

If the array can be sorted, a binary search is a fast search technique.

https://en.wikipedia.org/wiki/Binary_search_algorithm
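A minimal sketch of the idea in bash, assuming the array is already numerically sorted (the function name and sample values here are illustrative, not from the original post):

```shell
#!/bin/bash
# Binary search for a key in a sorted bash array; prints the index if found.
bsearch() {
  local key=$1; shift
  local -a arr=("$@")
  local lo=0 hi=$(( ${#arr[@]} - 1 )) mid
  while (( lo <= hi )); do
    mid=$(( (lo + hi) / 2 ))
    if (( arr[mid] == key )); then
      echo "$mid"; return 0      # found: print index
    elif (( arr[mid] < key )); then
      lo=$(( mid + 1 ))
    else
      hi=$(( mid - 1 ))
    fi
  done
  return 1                       # not found
}

sorted=(3 7 12 21 36 51 3321)
bsearch 21 "${sorted[@]}"        # prints 3 (the index of 21)
```

Note that this answers membership in O(log n) comparisons; it does not by itself count how often a key occurs inside a longer string.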

Daniel B. Martin

astrogeek 12-14-2017 01:37 PM

If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file, not simply whether it is in the DATA file.

I think before I would spend much time identifying the fastest way, I would try some simple methods with existing functions and look for a fast enough solution among them.

The first that comes to mind is a simple grep|sort|uniq...

Code:

grep -wf ArrayDataFile DATA |sort -n |uniq -c
...which will produce the count per matched array element itself, without additional bash array iteration.
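For example, with tiny hypothetical sample files where DATA has one value per line (these are made-up values, not the poster's real data):

```shell
# Hypothetical sample files for illustration only
printf '%s\n' 12 36 12 51 12 > DATA
printf '%s\n' 12 51 > ArrayDataFile

# One pass: word-match every pattern, then count per matched element
grep -wf ArrayDataFile DATA | sort -n | uniq -c
# prints a count column: 3 for "12", 1 for "51"
```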

MadeInGermany 12-14-2017 02:02 PM

Of course grep -c is faster than grep | wc -l, and you can bundle the output into one stream instead of an open-append-write-close cycle on every iteration
Code:

LC_ALL=C
for each in "${Array[@]}"
do
  fgrep -c "$each" '/home/USER/Documents/DATA'
done > GrepResultsOutputFile

What does your DATA file look like? It would help to search only a certain area of each line, ideally anchored at the beginning or end.

MadeInGermany 12-14-2017 02:14 PM

I see now that you gave a sample of the DATA file. Then please give a sample of the ArrayDataFile as well.
If there were a 1 in the ArrayDataFile, should it match all the lines
Code:

1321
61
651
321

?

nadiawicket 12-14-2017 02:40 PM

Quote:

Originally Posted by astrogeek (Post 5793241)
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file, not simply whether it is in the DATA file.

I think before I would spend much time identifying the fastest way, I would try some simple methods with existing functions and look for a fast enough solution among them.

The first that comes to mind is a simple grep|sort|uniq...

Code:

grep -wf ArrayDataFile DATA |sort -n |uniq -c
...which will produce the count per matched array element itself, without additional bash array iteration.

Running grep -wf ArrayDataFile DATA |sort -n with the DATA file being
Code:

98787879878465321321646513218513213123216478322354798798432543216846513184351684313512131213211311321217982114645465465421313215431324651321651321653546513546515465454654654644546546545465465464321654659876544846546132654324013216513543184654
and the ArrayDataFile being
Code:

78
22
8
7

so as to count how many times each value from the ArrayDataFile occurs inside the DATA file, which is just one long string of digits,
Code:

grep -wf ArrayDataFile DATA |sort -n |uniq -c
did not produce the desired output. In fact, it produces nothing at all.

astrogeek 12-14-2017 02:51 PM

Quote:

Originally Posted by nadiawicket (Post 5793273)
Running grep -wf ArrayDataFile DATA |sort -n with the DATA file being
Code:

98787879878465321321646513218513213123216478322354798798432543216846513184351684313512131213211311321217982114645465465421313215431324651321651321653546513546515465454654654644546546545465465464321654659876544846546132654324013216513543184654
and the ArrayDataFile being
Code:

78
22
8
7


No, it wouldn't produce anything with a single long string.

But your example DATA file as originally stated was one element per line, not a single long string. Which is it?

Quote:

DATA file is
Code:

1321
64
61
546
54
654
654
654
654
54
651
321
3
321
37
87
87
54
211
9      #####. . . . . . . . on to 600000


In your example array you have both the values 54 and 654; how would you expect those to be distinguished? Should both match, or only the longest, or the first found?

Perhaps you should provide a more complete description of the problem, and give some examples of input and expected results. It would be very difficult to provide help based on the information provided so far.

nadiawicket 12-14-2017 02:58 PM

Quote:

Originally Posted by astrogeek (Post 5793241)
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file, not simply whether it is in the DATA file.

I think before I would spend much time identifying the fastest way, I would try some simple methods with existing functions and look for a fast enough solution among them.

The first that comes to mind is a simple grep|sort|uniq...

Code:

grep -wf ArrayDataFile DATA |sort -n |uniq -c
...which will produce the count per matched array element itself, without additional bash array iteration.

Sorry, I was not clear: my DATA file just contains one large string of data, and the ArrayDataFile is the one that contains the 500,000 or so items.

sundialsvcs 12-14-2017 03:01 PM

This, to me, is an excellent use-case for SQLite, a very sophisticated database-engine that is entirely based on disk files, not servers.

(And :tisk: consider using a real programming language to do the work, with a #! shebang to leverage the language of your choice.)

Putting 500,000 records (or, many times that much ...) into an SQLite table is (yawn ...) "no big deal," and now you can query the information to get the answers that you need, exploiting "indexes" to let you get these answers efficiently. You will probably find that queries can completely replace a lot of the "procedural logic" that you might now be contemplating.

Furthermore, since many spreadsheets and other tools can readily use SQLite databases as a data-source, "who knows?" ... You very well might be able to eliminate the use of "a custom computer program" altogether! :eek: (It happens. A lot.)
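A minimal sketch of the load-and-query idea, assuming the sqlite3 command-line shell is available (the database name, table name, and key values are made up for illustration). Note that this demonstrates an indexed key lookup; as discussed later in the thread, it does not by itself count substring occurrences inside one long string:

```shell
# Throwaway database: load a few keys, index them, query via the index.
rm -f demo.db
sqlite3 demo.db <<'SQL'
CREATE TABLE keys(k TEXT);
INSERT INTO keys VALUES ('12'),('36'),('3321');
CREATE INDEX idx_keys_k ON keys(k);
-- With the index, this lookup does not scan the whole table:
SELECT count(*) FROM keys WHERE k = '12';
SQL
```

With 500,000 rows the bulk load would typically use the .import dot-command inside a transaction rather than individual INSERT statements.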

MadeInGermany 12-14-2017 03:53 PM

Are you saying the DATA file is one line with 500 000 characters?

michaelk 12-14-2017 10:16 PM

Quote:

my DATA file just contained a large string of data and that the ArrayDataFile is the one that contains 500 000 or so items.
You have a list of 500,000 patterns that you want to find in a single string (what you call your data file), which is not the same thing as searching an array, if I understand the question.

grep is efficient and you should be able to search all at once like

grep -o -f ArrayDataFile DATA

but it isn't producing the same results as the individual searches, so I must be doing something wrong.

https://stackoverflow.com/questions/...ep-run-so-fast
https://en.wikipedia.org/wiki/Boyer%...arch_algorithm
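The difference likely comes from how grep consumes matches: with a combined pattern file, matching is leftmost-longest and non-overlapping, so one long match can swallow a shorter key that a separate per-key search would still find. A tiny hypothetical illustration:

```shell
# Hypothetical sample files for illustration only
printf '54\n654\n' > patterns
printf '654\n' > data

grep -o -f patterns data        # prints only "654"; the "54" inside it is consumed
grep -o '54' data | wc -l       # a separate search for "54" still finds it once
```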

MadeInGermany 12-15-2017 12:38 AM

If the DATA file is much smaller, then I would put it in memory and process the big file. The following assumes the DATA file is one line.
Code:

#!/bin/sh
IFS="" read -r data < /home/USER/Documents/DATA
while read -r key
do
  case $data in
  (*$key*) echo 1;;
  (*) echo 0;;
  esac
done < ArrayDataFile > GrepResultsOutputFile


MadeInGermany 12-15-2017 02:02 PM

Looking at your grep -o in post #1, I think you want to display how often each key is found in DATA.
This is awkward in the shell, but in awk the gsub() function counts the matches.
Code:

awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
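A quick hypothetical run of this one-liner with tiny sample files (the values are made up; note the counts are non-overlapping, since gsub() consumes each match):

```shell
# Hypothetical sample files for illustration only
printf '987822\n' > DATA
printf '78\n8\n22\n' > ArrayDataFile

# First file (DATA) is stored in `data`; each key from the second file
# is then counted against it. gsub() replaces matches with themselves,
# so `data` is unchanged for the next key.
awk 'NR==FNR {data=$0; next}
     {printf "\"%s\" found %d times\n", $1, gsub($1, $1, data)}' DATA ArrayDataFile
# prints: "78" found 1 times / "8" found 2 times / "22" found 1 times
```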

starrysky1 12-15-2017 02:40 PM

Thanks all for the answers.

Is AWK faster than SQLite for this job? I just need a very fast word count (taken from the arraydatafile) applied to the string.

astrogeek 12-15-2017 03:18 PM

Quote:

Originally Posted by starrysky1 (Post 5793650)
Thanks all for the answers.

Is AWK faster than SQLite for this job? I just need a very fast word count (taken from the arraydatafile) applied to the string.

It is not possible for anyone to answer that question as you still have not provided a very useful description of the task, even with the changes made to the original question.

Specific things we would need to know are: how overlapping array items should be handled (e.g., 54 vs. 654); whether there is a pattern to the array items (does 500,000 mean the integers from 0 to 500,000? is order significant?); how to handle repetitions; and how to count patterns that can match in more than one way (how would 33 match the sequence 333333, or 131 match 1313131?).

As asked previously, please provide a more complete problem description with a few examples of the expected results especially of cases where array items overlap, of which there must be very many based on the sample provided so far.
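To make the overlap question concrete: the standard tools count non-overlapping matches, so an explicit sliding-window scan is needed if overlaps should count. A hypothetical sketch:

```shell
printf '333333\n' > DATA        # hypothetical one-line sample

grep -o '33' DATA | wc -l       # 3: grep counts non-overlapping matches

# Overlapping occurrences (5 here) need a sliding scan over every position:
awk -v pat=33 '{n = 0; w = length(pat)
  for (i = 1; i <= length($0) - w + 1; i++)
    if (substr($0, i, w) == pat) n++
  print n}' DATA                # prints 5
```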

With regard to SQLite, I doubt this would be a good use case for it as the main benefit would come from use of indexes, and you cannot index the string to be searched in a way useful to this search. In fact, generating such an index itself would require an implementation of the search you appear to be trying to perform.

Please review the Site FAQ for guidance in formulating a more complete question, and especially follow the links at bottom of that page.

Finally, it is not acceptable to create multiple login accounts here at LQ. Please decide which member name you would like to use and request that the other(s) be closed. Using multiple member names is confusing to others and may be considered to be deceitful by some members.

