Fastest way to search a 500 thousand part array in BASH?
Really need help finding the fastest way to search for a 500000-element one-dimensional array in a DATA file.
I need all of the lines in the ArrayDataFile searched for in the DATA file. I generate the array with declare, then readarray a DATA file in my Documents folder and loop over it with a for loop. Is this the best code for this job? The array elements are searched for in the DATA file, but it's a 500000-item list. The following is an exact replica of the DATA file contents up to line 40 or so. The real file to be used is 600000 lines long. I need to optimize this search so I can search the DATA file as fast as possible on my outdated hardware. DATA file is
Code:
1321646154654654654654654546513213321378787542119
Code:
12
Code:
#!/bin/bash

Is BASH the best way at all? Heard it's slow.
Quote:
https://en.wikipedia.org/wiki/Binary_search_algorithm

Daniel B. Martin
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file, not simply whether it is in the DATA file.
I think before I would spend much time identifying the fastest way, I would try some simple methods with existing tools and look for a fast-enough solution among them. The first that comes to mind is a simple grep|sort|uniq...
Code:
grep -wf ArrayDataFile DATA | sort -n | uniq -c
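To illustrate the pipeline, here is a small sketch using tiny made-up stand-ins for the two files (the file names come from the post, but the contents here are mine, not the real data):

```shell
# Throwaway sample files standing in for the real (much larger) ones.
printf '12\n78\n' > ArrayDataFile   # keys to look up, one per line
printf '12\n78\n12\n999\n' > DATA   # data values, one per line

# -w matches whole words only, -f reads the patterns from a file;
# sort -n | uniq -c then tallies how often each matching line occurred.
grep -wf ArrayDataFile DATA | sort -n | uniq -c
```

With this sample input, 12 is counted twice and 78 once; uniq -c prints each count before its value.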
Of course "grep -c is faster than grep | wc -l",
and you can bundle the output into one stream instead of an open-append-write-close each time.
Code:
LC_ALL=C
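For context (my addition, not from the thread): LC_ALL=C makes grep compare plain bytes instead of doing locale-aware multibyte matching, which is often dramatically faster on large ASCII files. A small sketch with made-up stand-in files:

```shell
# Throwaway sample files; the real ArrayDataFile/DATA are far larger.
printf '12\n' > ArrayDataFile
printf '12\n34\n12\n' > DATA

# LC_ALL=C forces byte-wise matching; -c prints the count of matching
# lines directly, avoiding a separate "| wc -l" process.
LC_ALL=C grep -c -wf ArrayDataFile DATA
```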
I see now, you gave a sample of the DATA file. Then please give a sample of the ArrayDataFile.
If there were a 1 in the ArrayDataFile, should it match all the lines such as
Code:
1321
Quote:
Code:
98787879878465321321646513218513213123216478322354798798432543216846513184351684313512131213211311321217982114645465465421313215431324651321651321653546513546515465454654654644546546545465465464321654659876544846546132654324013216513543184654
Code:
78
Code:
grep -wf ArrayDataFile DATA | sort -n | uniq -c

Quote:
But your example DATA file as originally stated was one element per line, not a single long string. Which is it?

Quote:
Perhaps you should provide a more complete description of the problem, and give some examples of input and expected results. It would be very difficult to provide help based on the information provided so far.
This, to me, is an excellent use-case for SQLite, a very sophisticated database engine that is entirely based on disk files, not servers. (And consider using a real programming language to do the work, using a #!shebang to leverage the language of your choice.)

Putting 500,000 records (or many times that much ...) into an SQLite table is (yawn ...) "no big deal," and now you can query the information to get the answers that you need, exploiting "indexes" to let you get these answers efficiently. You will probably find that queries can completely replace a lot of the "procedural logic" that you might now be contemplating.

Furthermore, since many spreadsheets and other tools can readily use SQLite databases as a data source, "who knows?" ... You very well might be able to eliminate the use of "a custom computer program" altogether! :eek: (It happens. A lot.)
Are you saying the DATA file is one line with 500 000 characters?
Quote:
grep is efficient, and you should be able to search for everything at once with something like grep -o -f ArrayDataFile DATA, but it isn't producing the same results as an individual search, so I must be doing something wrong.
https://stackoverflow.com/questions/...ep-run-so-fast
https://en.wikipedia.org/wiki/Boyer%...arch_algorithm
If the DATA file is much smaller, then I would put IT in memory and process the big file. The following assumes the DATA file is one line.
Code:
#!/bin/sh
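The script body itself did not come through above; a minimal POSIX-sh sketch of the stated idea (load the one-line DATA into memory, then count each key's non-overlapping occurrences) could look like the following. The sample files and the sed-based counting trick are my assumptions, not the original poster's code, and the keys must not contain sed metacharacters:

```shell
#!/bin/sh
# Throwaway sample input, standing in for the real files.
printf '3333331313131\n' > DATA            # one long line of digits
printf '33\n131\n' > ArrayDataFile         # keys, one per line

data=$(cat DATA)                           # keep the one-line DATA in memory
while IFS= read -r key; do
    # Delete every (non-overlapping) occurrence of the key, then derive
    # the count from how much shorter the string became.
    stripped=$(printf '%s' "$data" | sed "s/$key//g")
    count=$(( (${#data} - ${#stripped}) / ${#key} ))
    printf '"%s" found %d times\n' "$key" "$count"
done < ArrayDataFile
```

With these stand-ins it reports 33 three times and 131 twice, i.e. non-overlapping counts.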
Looking at your grep -o in post #1, I think you want to display how often each key is found in DATA.
This is not easy to do in the shell, but in awk the gsub() function counts the matches.
Code:
awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
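As a quick sanity check of the awk one-liner (the tiny sample files here are mine, not from the thread):

```shell
# Tiny stand-in files to exercise the awk command from the post.
printf '3333331313131\n' > DATA        # DATA as a single long line
printf '33\n131\n' > ArrayDataFile     # keys, one per line

# First file: remember the line in "data". Second file: for each key,
# gsub() counts its non-overlapping matches in data (replacing each
# match with itself, so data is effectively unchanged).
awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
```

Note that gsub() counts non-overlapping matches, so 131 is found twice in 1313131.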
Thanks all for the answers.
Is AWK faster than SQLite for this job? I just need a very fast word count (taken from the ArrayDataFile) applied to the string.
Quote:
Specific things we would need to know are how overlapping array items should be handled (e.g. 54, 654), whether there is a pattern to the array items (i.e. does 500,000 mean the integers from 0 to 500,000, is order significant, etc.?), how to handle repetitions, and patterns that can match more than one way (how would 33 match the sequence 333333, or 131 match 1313131?).

As asked previously, please provide a more complete problem description with a few examples of the expected results, especially of cases where array items overlap, of which there must be very many based on the sample provided so far.

With regard to SQLite, I doubt this would be a good use case for it, as the main benefit would come from the use of indexes, and you cannot index the string to be searched in a way useful to this search. In fact, generating such an index would itself require an implementation of the search you appear to be trying to perform.

Please review the Site FAQ for guidance in formulating a more complete question, and especially follow the links at the bottom of that page.

Finally, it is not acceptable to create multiple login accounts here at LQ. Please decide which member name you would like to use and request that the other(s) be closed. Using multiple member names is confusing to others and may be considered deceitful by some members.
Quote:
33 would match the sequence 333333 3 times. 131 would match 1313131 2 times.

ArrayDataFile (500000-part list of items)
Code:
23
Quote:
This is a MUCH faster solution that does exactly the job I wanted, compared to my original code. Can't thank you enough!! Still going to try to find an even faster solution, because the more speed I can get, the better.
Awesome!!!!
I made a C version as an exercise; it seems to run a bit faster than awk.
Code:
#include <stdio.h>
Code:
gcc -o program_name find_data.c
Code:
./program_name data.txt data_array_list.txt
Quote:
Can't thank you enough for this holiday gift. Would Python be faster?
Quote:
You can do a simple benchmark with the bash time command: time program arg1 arg2...
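For example (file names and contents here are placeholders of my choosing; time is a bash keyword, so run this under bash), timing the awk approach might look like:

```shell
# Build a small throwaway test case; real inputs would be much larger.
printf '3333331313131\n' > DATA
printf '33\n131\n' > ArrayDataFile

# "time" reports real/user/sys on stderr, leaving the program's own
# output (key and count per line) on stdout.
time awk 'NR==FNR {data=$0; next} {print $1, gsub($1,$1,data)}' DATA ArrayDataFile
```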
Here's a python3 implementation.

Example
Code:
$ wc -m data
Code:
./count_sequences.py -h
Code:
#!/usr/bin/env python3