LinuxQuestions.org - C++: need helping searching for multiple items in a string

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - C++: need helping searching for multiple items in a string (https://www.linuxquestions.org/questions/programming-9/c-need-helping-searching-for-multiple-items-in-a-string-621492/)

C++: need helping searching for multiple items in a string

say I have a string:

Code:

string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";

... and I want to find where "ab" "mn" and "st" are. Is there a way I can do that without searching through the string 3 times with a find?

The reason I ask is that this string may be millions of characters long & I may be searching for 10-20 (maybe more) different items. That's a lot of overhead if I have to search for each string separately.

edited to add: To clarify, what I'm really after is a way to make one pass of the string, searching for all of these things at once, like "egrep 'ab|mn|st' some_file". What I'm trying to avoid is foo.find("ab"); foo.find("mn"); foo.find("st");

have you tried regular expressions? i cant think of a RE at the moment to do this, but that might work.
is your input space-separated? if so you might be able to use a tokenizer and split everything up into some data structure (set maybe?) and use search functions on that. yes this would still be calling a 'find' function more than once, but using the data structure and related (efficient) search functions might be ok.

a way of going through the entire string just once, wouldnt be any more efficient, i would guess, than using something like the above method. some simple brief pseudo-code for this might be

Code:

tokenizedStrings = tokenize(foo)

for each token in tokenizedStrings

  if token == tok1

    print "tok1 found at position" x

  else if token == tok2

    print "tok2 found at position" x

  ...

hope it helps

How much memory do you have? If you have a ridiculous amount of memory then you can hash the letters into tables with one table per letter listing every position it occurs at. That way your search can select the table with the same starting letter and jump to only those positions for the searchs.
ta0kira

Well, you can go the low-level C-style way, and it's not pretty :)

Code:

string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";



// first get the c string pointer

const char * fptr = foo.c_str();

// store beginning

const char * fbegin = fptr; 

// storage of the locations

int indexes[] = { -1, -1, -1 };

int count = 0;



// parse the whole thing, looking for: "ab" "mn" and "st"

while (*fptr != '\0')

{

    if (*fptr == 'a' && *(fptr+1) == 'b')

    {

        // first one found => store index

        indexes[0] = (int)(fptr - fbegin);

        ++fptr; // advance pointer

        ++count; // one less to search for

    }

    else if (*fptr == 'm' && *(fptr+1) == 'n')

    {

        // second one found => store index

        indexes[1] = (int)(fptr - fbegin);

        ++fptr; // advance pointer

        ++count; // one less to search for

    }

    else if (*fptr == 's' && *(fptr+1) == 't')

    {

        // third one found => store index

        indexes[2] = (int)(fptr - fbegin);

        ++fptr; // advance pointer

        ++count; // one less to search for

    }

    if (count == 3) break; // all found

    ++fptr; // else advance pointer

}

Just from the top of my head, but basically it should already work as it is. Duplicate entries are not taken into consideration here.

Here's a similar solution to the raw C, but uses std::string methods to do it.

Code:

std::string::size_type find_stuff(const std::string &str, std::string::size_type start_pos)

{

  std::string::size_type find_pos = str.find_first_of("ams", start_pos);

  while(find_pos != std::string::npos)

  {

      switch(str[find_pos])

      {

        case 'a':

            if(str.compare(find_pos, 2, "ab") == 0)

            {

              return find_pos;

            }

            break;

        case 'm':

            if(str.compare(find_pos, 2, "mn") == 0)

            {

              return find_pos;

            }

            break;

        case 's':

            if(str.compare(find_pos, 2, "st") == 0)

            {

              return find_pos;

            }

            break;

      }

      find_pos = str.find_first_of("ams", find_pos + 1);

  }

  return find_pos;

}

With a little work it can be generalized so that you can pass in a std::vector of strings and it'll simultaneously search for them.

It's a question of algorithms and data structures, not of languages.

Ta0kira is closest to the mark.

The thing you *don't* want to do is do (perhaps multiple) sequential searches in turn through a sequential list. You want to check for "membership" in a "set".

Good luck .. PSM

Quote:

Originally Posted by paulsm4 (Post 3060404)

It's a question of algorithms and data structures, not of languages.

I couldn't disagree more (and noone here said it's about languages either).

Unless you really just look for an academic solution which doesn't take efficiency into account.
I can see no reason why the OP should double buffer the data if it's just for a one-time lookup of the indexes. Maybe it should have been made clear beforehand what is going on afterwards, but assuming the data is not used otherwise it'd be a waste (of time and memory).

Another way would be to derive properties common to all items to be found. For example, if all start with a-z then you can compare every position for values[I] >= 'a' && values[I] <= 'z' before delving into any string comparisons. Then you can hash your "find" list by first letter:

Code:

#include <string>

#include <vector>

#include <iostream>

#include <string.h>





typedef std::string data;

typedef std::vector <data> data_list;

typedef std::vector <data_list> hash_table;







class find_hash

{

public:

        find_hash(const data_list &lList) : number(0)

        {

        hash_data.resize(set_range(lList));



        for (unsigned int I = 0; I < lList.size(); I++)

        {

        if (!lList[I].size() || !range_check(lList[I][0])) continue;

        data temp(lList[I].substr(1, lList[I].size()));

        hash_data[ translate(lList[I][0]) ].push_back(temp);

        }

        }





        bool operator () (const char *dData)

        {

        if (!dData || !range_check(*dData)) return false;

        data_list::const_iterator search_current = hash_data[ translate(*dData) ].begin(),

                                  search_last    = hash_data[ translate(*dData++) ].end();



        while(search_current < search_last)

        {

        number++;

        if (strncmp(search_current->c_str(), dData, search_current->size()) == 0) return true;

        search_current++;

        }



        return false;

        }





        unsigned int comparisons() const

        { return number; }





private:

        unsigned int set_range(const data_list &lList)

        {

        if (!lList.size()) return 0;

        bool is_set = false;



        for (unsigned int I = 1; I < lList.size(); I++)

        {

        if (!lList[I].size()) continue;

        if (!is_set)

          {

        minimum = maximum = lList[I][0];

        is_set = true;

        continue;

          }



        if (lList[I][0] < minimum) minimum = lList[I][0];

        if (lList[I][0] > maximum) maximum = lList[I][0];

        }



        return is_set? (maximum - minimum + 1) : 0;

        }





        inline bool range_check(char dData) const

        { return hash_data.size()? (dData >= minimum && dData <= maximum) : false; }



        inline unsigned int translate(char dData) const

        { return dData - minimum; }





        char minimum, maximum;

        unsigned int number;

        hash_table  hash_data;

};







int main(int argc, char *argv[])

{

        const char *test_data = "trvtvey 4ycv 6y45 t4c5 tv ub86 vcyw";



        data_list test_criteria;

        test_criteria.push_back("cv 6");

        test_criteria.push_back("tv4");

        test_criteria.push_back("tve");

        test_criteria.push_back("ub8");

        test_criteria.push_back("vcy");

        test_criteria.push_back("ycv");



        find_hash test_finder(test_criteria);



        const char *current_point = test_data;

        unsigned int position = 0;



        std::cout << "data: '" << test_data << "'\n";



        while (*current_point)

        {

        if (test_finder(current_point++))

        std::cout << "match found @ " << position << "\n";



        position++;

        }



        std::cout << "comparisons: " << test_finder.comparisons()

                  << " (of a possible " << (strlen(test_data) * test_criteria.size()) << ")\n";



        return 0;

}

ta0kira

Quote:

Originally Posted by fantas (Post 3059739)

Well, you can go the low-level C-style way, and it's not pretty :)

strncmp can replace each comparison in that if/else tree.
ta0kira

Quote:

Originally Posted by ta0kira (Post 3060906)

strncmp can replace each comparison in that if/else tree.
ta0kira

Indeed yes, would be much nicer to look at and get rid of one of the branches ;), but for this simple task I thought I wouldn't call a function to do the job. Now after a bit of tinkering, if I'd needed to optimize the two comparisons I'd probably do it this way:

Code:

// e.g. for the "ab"



// originial:

// if (*fptr == 'a' && *(fptr+1) == 'b')



// optimized:

if (*((short*)fptr) == ((short)('a') + ((short)('b') << 8)))

... which the compiler could optimize quite a bit by itself (right hand side equals a constant value).

Not tested, much uglier, but probably very fast.

... but yeah, for longer strings this would definitely be an improvement.

You may be right, but it's the iteration magnitude that will get OP down. That's where the effort needs to be spent first.
ta0kira

So rather first tokenize and then compare ? Well, I can see that with more and more error checking (which is needed if the string token lengths vary) the branches will also get more and more ...

IMO this all will lead to a decision between CPU/memory usage

- tight loop -> lower mem / higher cpu
- tokenize/compare -> higher mem / lower cpu

what to choose might depend on the whole context finally (# and variety of tokens, reuse of tokens ...)

I think a good compromise can always be found when it comes to CPU/memory trade-offs. A main point to consider is that processor usage is linear based on the number of instructions, but effective use of a small amount of memory can reduce the number of instructions geometrically. The proper fit all depends on the dimensions of the comparison vs. how much memory it would take to greatly reduce the number of instructions for the given ratio of data sets being compared.
ta0kira

Good point about the effective memory usage :)

What already ran through my mind was that there could also be a solution that tokenizes into a temporary buffer of size MAX_TOKEN_LENGTH (first copy a chunk into the buffer and then tokenize with e.g. strtok) and then do the comparison on this buffer, with e.g. strncmp.

Like said before, it all depends on the situation, I think there are already enough alternatives shown in this thread (feel free to add more though).