LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 12-16-2017, 12:59 PM   #16
starrysky1
LQ Newbie
 
Registered: Dec 2017
Posts: 26

Rep: Reputation: Disabled

Quote:
Specific things we would need to know are how overlapping array items should be handled (i.e. 54, 654), is there a pattern to the array items (i.e. does 500,000 mean the integers from 0 to 500,000, is order significant, etc.?), how to handle repetitions, patterns that can match more than one way (How would 33 match the sequence 333333, or 131 match 1313131).
Overlapping array items are to be treated as individual items for the word count: each one gets a separate count. The items themselves are just unique strings. There is no pattern to them at all; it's just a long list of unique strings with no particular relationship to one another.

33 would match the sequence 333333 3 times.
131 would match 1313131 2 times.
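
To make the counting rule concrete: both examples above are non-overlapping counts, where the scan resumes right after each match. Here is a minimal Python sketch of that rule. The function name `count_non_overlapping` is made up for illustration and does not come from any code in this thread.

```python
def count_non_overlapping(haystack: str, needle: str) -> int:
    """Count matches of needle, resuming the scan right after each match."""
    if not needle:
        return 0  # an empty needle would never advance the scan
    count = 0
    pos = haystack.find(needle)
    while pos != -1:
        count += 1
        # jump past the whole match, so matches cannot overlap
        pos = haystack.find(needle, pos + len(needle))
    return count


print(count_non_overlapping("333333", "33"))    # 3
print(count_non_overlapping("1313131", "131"))  # 2
```

Python's built-in `str.count()` follows the same non-overlapping rule, so it should give the same numbers for these inputs.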

ArrayDataFile (500000 part list of items)
Code:
23
22
1
0068
0
354
00
83
3
92 ## etc etc, all UNIQUE numbers that should be treated as strings because 0 should not count as the same as 00000 for example
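
A small Python sketch of why the items have to stay strings, using the ten sample items listed above (the variable names are only for illustration):

```python
# The sample items from the list above, kept exactly as written
raw_lines = ["23", "22", "1", "0068", "0", "354", "00", "83", "3", "92"]

as_strings = set(raw_lines)
as_integers = {int(item) for item in raw_lines}

print(len(as_strings))   # 10: every item stays distinct
print(len(as_integers))  # 9: "0" and "00" collapse into the same number
```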

Last edited by starrysky1; 12-16-2017 at 01:24 PM.
 
Old 12-16-2017, 01:04 PM   #17
starrysky1
LQ Newbie
 
Registered: Dec 2017
Posts: 26

Rep: Reputation: Disabled
Quote:
Originally Posted by MadeInGermany
Looking at your grep -o in post #1, I think you want to display how often each key is found in DATA.
A shell loop is a poor fit for this, but in awk the gsub() function returns the number of matches.
Code:
awk 'NR==FNR {data=$0; next} {printf "\"%s\" found %d times\n",$1,gsub($1,$1,data)}' DATA ArrayDataFile
Looks like this reads the DATA file only once, instead of a zillion times like my for loop!!!!

This is MUCH faster than my original code and does exactly the job I wanted. Can't thank you enough!!

Still going to attempt to find an even faster solution because the more speed I can get the better.
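
For anyone following along, the "read DATA once, then loop over the keys" idea could be sketched in Python roughly like this. It is only a sketch under assumptions: `count_keys` is a made-up name, the file layout is assumed to be the same as for the awk one-liner (one key per line in the second file), and keys are matched literally with `str.count()`, whereas awk's gsub() treats its first argument as a regular expression. For plain digit strings the two should agree.

```python
from pathlib import Path


def count_keys(data_path, keys_path):
    """Yield (key, count) pairs; the data file is read exactly once."""
    data = Path(data_path).read_text().rstrip("\n")
    with open(keys_path) as keys:
        for line in keys:
            key = line.rstrip("\n")
            if key:  # skip blank lines
                # literal, non-overlapping count of this key in the data
                yield key, data.count(key)
```

Usage would be something like `for key, count in count_keys("DATA", "ArrayDataFile"): print(key, count)`.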

Last edited by starrysky1; 12-16-2017 at 01:34 PM.
 
Old 12-16-2017, 01:14 PM   #18
starrysky1
LQ Newbie
 
Registered: Dec 2017
Posts: 26

Rep: Reputation: Disabled
Awesome!!!!

Last edited by starrysky1; 12-16-2017 at 01:15 PM.
 
Old 12-16-2017, 04:37 PM   #19
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,230

Rep: Reputation: 713
I made a C version as an exercise; it seems to run a bit faster than awk.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *data = NULL;

void trim(char *line) {
    size_t len = strlen(line);
    /* Guard against an empty string before indexing len - 1 */
    if (len > 0 && line[len - 1] == '\n')
        line[len - 1] = '\0';
}

int loadData(char *file) {
    size_t len = 0;
    ssize_t read;
    
    FILE *f = fopen(file, "r");
    if (f == NULL) {
        fprintf(stderr, "Failed to open %s\n", file);
        return 0;
    }
    
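    read = getline(&data, &len, f);
    if (read == -1) {   /* empty file or read error */
        fprintf(stderr, "Failed to read data from %s\n", file);
        free(data);     /* getline() may allocate even on failure */
        data = NULL;
        fclose(f);
        return 0;
    }
    printf("Loaded %zd chars of data\n", read);
    trim(data);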
    fclose(f);
    return 1;
}

int findData(char *findMe) {
    int count = 0;
    char *tmp = data;
    size_t len = strlen(findMe);

    /* An empty key would make strstr() match forever without advancing */
    if (len == 0)
        return 0;

    while((tmp = strstr(tmp, findMe))) {
        count++;
        tmp += len;
    }
    return count;
}

int countData(char *file) {
    size_t len = 0;
    ssize_t read;
    char *line = NULL;
    int found = 0;
    
    FILE *f = fopen(file, "r");
    if (f == NULL) {
        fprintf(stderr, "Failed to open %s\n", file);
        return 0;
    }
    
    while ((read = getline(&line, &len, f)) != -1) {
        trim(line);
        if ((found = findData(line)))
            printf("\"%s\" found %d times\n", line, found);
    }
    free(line);
    fclose(f);
    return 1;
}


int main(int argc, char *argv[]) {
    if( argc != 3 ) {
        fprintf(stderr, "Usage: %s <data file> <array_data file>\n", argv[0]);
        return EXIT_FAILURE;
    }
   
    if (!loadData(argv[1]))
        return EXIT_FAILURE;
   
    if (!countData(argv[2])) {
        free(data);
        return EXIT_FAILURE;
    }

    free(data);
    return EXIT_SUCCESS;
}
Save as find_data.c (or any name you like) and compile with:
Code:
gcc -o program_name find_data.c
Usage:
Code:
./program_name data.txt data_array_list.txt

Last edited by keefaz; 12-16-2017 at 09:24 PM. Reason: slight improvement
 
Old 12-17-2017, 05:58 PM   #20
starrysky1
LQ Newbie
 
Registered: Dec 2017
Posts: 26

Rep: Reputation: Disabled
Quote:
Originally Posted by keefaz
I made a C version as an exercise; it seems to run a bit faster than awk.
Really faster?? Going to try it out.

Can't thank you enough for this holiday gift.

Would Python be faster?
 
Old 12-17-2017, 06:46 PM   #21
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,230

Rep: Reputation: 713
Quote:
Originally Posted by starrysky1
Really faster?? Going to try it out.

Can't thank you enough for this holiday gift.

Would Python be faster?
I don't know Python well.

You can do a simple benchmark with the bash time command:

time program arg1 arg2...
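
If you would rather collect timings from a script, here is a rough Python sketch of the same idea. `time_command` is a made-up helper name, and the command being timed is only a placeholder; swap in the real program and its arguments. It measures wall-clock time for a single run, so expect some noise.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


# Placeholder command; replace it with the real one, for example
# ["./program_name", "data.txt", "data_array_list.txt"]
elapsed = time_command([sys.executable, "-c", "pass"])
print("{:.3f} s".format(elapsed))
```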
 
Old 12-17-2017, 06:56 PM   #22
starrysky1
LQ Newbie
 
Registered: Dec 2017
Posts: 26

Rep: Reputation: Disabled
Quote:
Originally Posted by keefaz
I don't know Python well.

You can do a simple benchmark with the bash time command:

time program arg1 arg2...
Will check all of this out once I get my electric service back.
 
Old 12-27-2017, 11:02 PM   #23
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 579

Rep: Reputation: 267
Here's a Python 3 implementation.

Example
Code:
$ wc -m data
106037249 data
$ ./count_sequences.py -s asdfds,dsgasd,ikd,idk,23523,fdg ./data
asdfds                        0
dsgasd                        0
ikd                          90
idk                          83
23523                         0
fdg                          85
Usage
Code:
./count_sequences.py -h
usage: count_sequences.py [-h] [-s SEQUENCES]

Count occurrences of string in array

optional arguments:
  -h, --help            show this help message and exit
  -s SEQUENCES, --sequences SEQUENCES
                        Sequences to detect, delimit with ","
Code
Code:
#!/usr/bin/env python3

import argparse
import re

from collections import Counter

# Create command-line flags
parser = argparse.ArgumentParser(description='Count occurrences of string in array')
parser.add_argument('-s', '--sequences', help='Sequences to detect, delimit with ","')
args, data_files = parser.parse_known_args()

# Join data files together
data = ''.join((open(data_file).read().strip() for data_file in data_files))

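# Count each sequence on its own so keys that contain one another (e.g. 54
# and 654) each get a count; re.escape() makes every key match literally
counted_sequences = Counter({
    sequence: len(re.findall(re.escape(sequence), data))
    for sequence in args.sequences.split(',') if sequence
})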

for sequence in args.sequences.split(','):
    msg = '{sequence} {count}'.format(
            sequence=sequence,
            count=str(counted_sequences[sequence]).rjust(30 - len(sequence)),
            )
    print(msg)
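
A caveat worth checking when keys contain one another (54 and 654 came up earlier in the thread): a regex alternation consumes each character at most once, so one combined pattern can undercount the shorter key, while counting each key separately does not. A minimal sketch with made-up sample data:

```python
import re
from collections import Counter

data = "654654"
keys = ["54", "654"]

# One combined pattern: each character is consumed by at most one match
combined = Counter(re.findall("|".join(re.escape(k) for k in keys), data))

# One pass per key: every key is counted independently
per_key = {k: data.count(k) for k in keys}

print(dict(combined))  # {'654': 2}
print(per_key)         # {'54': 2, '654': 2}
```

The trade-off is speed: the per-key version scans the data once per key, which adds up with a 500,000-item list.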
 
  

