Fastest way to search a 500 thousand part array in BASH?
Really need help finding the fastest way to search for a 500000-element one-dimensional array in a DATA file.
I need all of the lines in the ArrayDataFile searched for in the DATA file. I generate the array with declare, then readarray a DATA file in my Documents folder and loop over it with a for loop. Is this the best code for this job? The array elements are searched for in the DATA file, but it's a 500000-item list. The following is an exact replica of the DATA file contents up to line 40 or so. The real file to be used is 600000 lines long. I need to optimize this search so I can search the DATA file as fast as possible on my outdated hardware. DATA file is
Code:
1321646154654654654654654546513213321378787542119
Code:
12
Code:
#!/bin/bash

Is BASH the best way at all? Heard it's slow.
Quote:
https://en.wikipedia.org/wiki/Binary_search_algorithm

Daniel B. Martin
If I understand your question and example correctly, you are looking for the number of occurrences of each array element in the DATA file, not simply whether it is in the DATA file.
I think before I would spend much time identifying the fastest way, I would try some simple methods with existing tools and look for a fast-enough solution among them. The first that comes to mind is a simple grep|sort|uniq...
Code:
grep -wf ArrayDataFile DATA | sort -n | uniq -c
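To illustrate the pipeline, here is a small sketch using tiny made-up stand-ins for the two files (the file names come from the post, but the contents here are mine, not the real data):

```shell
# Throwaway sample files standing in for the real (much larger) ones.
printf '12\n78\n' > ArrayDataFile   # keys to look up, one per line
printf '12\n78\n12\n999\n' > DATA   # data values, one per line

# -w matches whole words only, -f reads the patterns from a file;
# sort -n | uniq -c then tallies how often each matching line occurred.
grep -wf ArrayDataFile DATA | sort -n | uniq -c
```

With this sample input, 12 is counted twice and 78 once; uniq -c prints each count before its value.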
Of course "grep -c is faster than grep | wc -l",
and you can bundle the output into one stream instead of an open-append-write-close each time.
Code:
LC_ALL=C
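For context (my addition, not from the thread): LC_ALL=C makes grep compare plain bytes instead of doing locale-aware multibyte matching, which is often dramatically faster on large ASCII files. A small sketch with made-up stand-in files:

```shell
# Throwaway sample files; the real ArrayDataFile/DATA are far larger.
printf '12\n' > ArrayDataFile
printf '12\n34\n12\n' > DATA

# LC_ALL=C forces byte-wise matching; -c prints the count of matching
# lines directly, avoiding a separate "| wc -l" process.
LC_ALL=C grep -c -wf ArrayDataFile DATA
```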
I see now, you gave a sample of the DATA file. Then please give a sample of the ArrayDataFile.
If there were a 1 in the ArrayDataFile, should it match all the lines such as
Code:
1321
Quote:
Code:
98787879878465321321646513218513213123216478322354798798432543216846513184351684313512131213211311321217982114645465465421313215431324651321651321653546513546515465454654654644546546545465465464321654659876544846546132654324013216513543184654
Code:
78
Code:
grep -wf ArrayDataFile DATA | sort -n | uniq -c

Quote:
But your example DATA file as originally stated was one element per line, not a single long string. Which is it?

Quote:
Perhaps you should provide a more complete description of the problem, and give some examples of input and expected results. It would be very difficult to provide help based on the information provided so far.
This, to me, is an excellent use-case for SQLite, a very sophisticated database engine that is entirely based on disk files, not servers. (And consider using a real programming language to do the work, using a #!shebang to leverage the language of your choice.)

Putting 500,000 records (or many times that much ...) into an SQLite table is (yawn ...) "no big deal," and now you can query the information to get the answers that you need, exploiting "indexes" to let you get these answers efficiently. You will probably find that queries can completely replace a lot of the "procedural logic" that you might now be contemplating.

Furthermore, since many spreadsheets and other tools can readily use SQLite databases as a data source, "who knows?" ... You very well might be able to eliminate the use of "a custom computer program" altogether! :eek: (It happens. A lot.)
Are you saying the DATA file is one line with 500 000 characters?
Quote:
grep is efficient, and you should be able to search for everything at once with something like grep -o -f ArrayDataFile DATA, but it isn't producing the same results as an individual search, so I must be doing something wrong.
https://stackoverflow.com/questions/...ep-run-so-fast
https://en.wikipedia.org/wiki/Boyer%...arch_algorithm
If the DATA file is much smaller, then I would put IT in memory and process the big file. The following assumes the DATA file is one line.
Code:
#!/bin/sh
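The script body itself did not come through above; a minimal POSIX-sh sketch of the stated idea (load the one-line DATA into memory, then count each key's non-overlapping occurrences) could look like the following. The sample files and the sed-based counting trick are my assumptions, not the original poster's code, and the keys must not contain sed metacharacters:

```shell
#!/bin/sh
# Throwaway sample input, standing in for the real files.
printf '3333331313131\n' > DATA            # one long line of digits
printf '33\n131\n' > ArrayDataFile         # keys, one per line

data=$(cat DATA)                           # keep the one-line DATA in memory
while IFS= read -r key; do
    # Delete every (non-overlapping) occurrence of the key, then derive
    # the count from how much shorter the string became.
    stripped=$(printf '%s' "$data" | sed "s/$key//g")
    count=$(( (${#data} - ${#stripped}) / ${#key} ))
    printf '"%s" found %d times\n' "$key" "$count"
done < ArrayDataFile
```

With these stand-ins it reports 33 three times and 131 twice, i.e. non-overlapping counts.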
Looking at your grep -o in post #1, I think you want to display how often each key is found in DATA.
This is not easy to do in the shell, but in awk the gsub() function counts the matches.
Code:
awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
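As a quick sanity check of the awk one-liner (the tiny sample files here are mine, not from the thread):

```shell
# Tiny stand-in files to exercise the awk command from the post.
printf '3333331313131\n' > DATA        # DATA as a single long line
printf '33\n131\n' > ArrayDataFile     # keys, one per line

# First file: remember the line in "data". Second file: for each key,
# gsub() counts its non-overlapping matches in data (replacing each
# match with itself, so data is effectively unchanged).
awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
```

Note that gsub() counts non-overlapping matches, so 131 is found twice in 1313131.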
Thanks all for the answers.
Is AWK faster than SQLite for this job? I just need a very fast word count (taken from the ArrayDataFile) applied to the string.
Quote:
Specific things we would need to know are how overlapping array items should be handled (e.g. 54, 654), whether there is a pattern to the array items (i.e. does 500,000 mean the integers from 0 to 500,000, is order significant, etc.?), how to handle repetitions, and patterns that can match more than one way (how would 33 match the sequence 333333, or 131 match 1313131?).

As asked previously, please provide a more complete problem description with a few examples of the expected results, especially of cases where array items overlap, of which there must be very many based on the sample provided so far.

With regard to SQLite, I doubt this would be a good use case for it, as the main benefit would come from the use of indexes, and you cannot index the string to be searched in a way useful to this search. In fact, generating such an index would itself require an implementation of the search you appear to be trying to perform.

Please review the Site FAQ for guidance in formulating a more complete question, and especially follow the links at the bottom of that page.

Finally, it is not acceptable to create multiple login accounts here at LQ. Please decide which member name you would like to use and request that the other(s) be closed. Using multiple member names is confusing to others and may be considered deceitful by some members.
Quote:
33 would match the sequence 333333 3 times. 131 would match 1313131 2 times.

ArrayDataFile (500000-part list of items)
Code:
23
Quote:
This is a MUCH faster solution that does exactly the job I wanted, compared to my original code. Can't thank you enough!! Still going to try to find an even faster solution, because the more speed I can get, the better.
Awesome!!!!
I made a C version as an exercise; it seems to run a bit faster than awk.
Code:
#include <stdio.h>
Code:
gcc -o program_name find_data.c
Code:
./program_name data.txt data_array_list.txt
Quote:
Can't thank you enough for this holiday gift. Would Python be faster?
Quote:
You can do a simple benchmark with the bash time command: time program arg1 arg2...
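For example (file names and contents here are placeholders of my choosing; time is a bash keyword, so run this under bash), timing the awk approach might look like:

```shell
# Build a small throwaway test case; real inputs would be much larger.
printf '3333331313131\n' > DATA
printf '33\n131\n' > ArrayDataFile

# "time" reports real/user/sys on stderr, leaving the program's own
# output (key and count per line) on stdout.
time awk 'NR==FNR {data=$0; next} {print $1, gsub($1,$1,data)}' DATA ArrayDataFile
```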
Here's a python3 implementation.

Example
Code:
$ wc -m data
Code:
./count_sequences.py -h
Code:
#!/usr/bin/env python3