C++: need helping searching for multiple items in a string
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
C++: need helping searching for multiple items in a string
say I have a string:
Code:
string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";
... and I want to find where "ab" "mn" and "st" are. Is there a way I can do that without searching through the string 3 times with a find?
The reason I ask is that this string may be millions of characters long & I may be searching for 10-20 (maybe more) different items. That's a lot of overhead if I have to search for each string separately.
edited to add: To clarify, what I'm really after is a way to make one pass of the string, searching for all of these things at once, like "egrep 'ab|mn|st' some_file". What I'm trying to avoid is foo.find("ab"); foo.find("mn"); foo.find("st");
have you tried regular expressions? i cant think of a RE at the moment to do this, but that might work.
is your input space-separated? if so you might be able to use a tokenizer and split everything up into some data structure (set maybe?) and use search functions on that. yes this would still be calling a 'find' function more than once, but using the data structure and related (efficient) search functions might be ok.
a way of going through the entire string just once, wouldnt be any more efficient, i would guess, than using something like the above method. some simple brief pseudo-code for this might be
Code:
tokenizedStrings = tokenize(foo)
for each token in tokenizedStrings
if token == tok1
print "tok1 found at position" x
else if token == tok2
print "tok2 found at position" x
...
How much memory do you have? If you have a ridiculous amount of memory then you can hash the letters into tables with one table per letter listing every position it occurs at. That way your search can select the table with the same starting letter and jump to only those positions for the searchs.
ta0kira
Well, you can go the low-level C-style way, and it's not pretty
Code:
string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";
// first get the c string pointer
const char * fptr = foo.c_str();
// store beginning
const char * fbegin = fptr;
// storage of the locations
int indexes[] = { -1, -1, -1 };
int count = 0;
// parse the whole thing, looking for: "ab" "mn" and "st"
while (*fptr != '\0')
{
if (*fptr == 'a' && *(fptr+1) == 'b')
{
// first one found => store index
indexes[0] = (int)(fptr - fbegin);
++fptr; // advance pointer
++count; // one less to search for
}
else if (*fptr == 'm' && *(fptr+1) == 'n')
{
// second one found => store index
indexes[1] = (int)(fptr - fbegin);
++fptr; // advance pointer
++count; // one less to search for
}
else if (*fptr == 's' && *(fptr+1) == 't')
{
// third one found => store index
indexes[2] = (int)(fptr - fbegin);
++fptr; // advance pointer
++count; // one less to search for
}
if (count == 3) break; // all found
++fptr; // else advance pointer
}
Just from the top of my head, but basically it should already work as it is. Duplicate entries are not taken into consideration here.
It's a question of algorithms and data structures, not of languages.
Ta0kira is closest to the mark.
The thing you *don't* want to do is do (perhaps multiple) sequential searches in turn through a sequential list. You want to check for "membership" in a "set".
It's a question of algorithms and data structures, not of languages.
I couldn't disagree more (and noone here said it's about languages either).
Unless you really just look for an academic solution which doesn't take efficiency into account.
I can see no reason why the OP should double buffer the data if it's just for a one-time lookup of the indexes. Maybe it should have been made clear beforehand what is going on afterwards, but assuming the data is not used otherwise it'd be a waste (of time and memory).
Another way would be to derive properties common to all items to be found. For example, if all start with a-z then you can compare every position for values[I] >= 'a' && values[I] <= 'z' before delving into any string comparisons. Then you can hash your "find" list by first letter:
Code:
#include <string>
#include <vector>
#include <iostream>
#include <string.h>
typedef std::string data;
typedef std::vector <data> data_list;
typedef std::vector <data_list> hash_table;
class find_hash
{
public:
find_hash(const data_list &lList) : number(0)
{
hash_data.resize(set_range(lList));
for (unsigned int I = 0; I < lList.size(); I++)
{
if (!lList[I].size() || !range_check(lList[I][0])) continue;
data temp(lList[I].substr(1, lList[I].size()));
hash_data[ translate(lList[I][0]) ].push_back(temp);
}
}
bool operator () (const char *dData)
{
if (!dData || !range_check(*dData)) return false;
data_list::const_iterator search_current = hash_data[ translate(*dData) ].begin(),
search_last = hash_data[ translate(*dData++) ].end();
while(search_current < search_last)
{
number++;
if (strncmp(search_current->c_str(), dData, search_current->size()) == 0) return true;
search_current++;
}
return false;
}
unsigned int comparisons() const
{ return number; }
private:
unsigned int set_range(const data_list &lList)
{
if (!lList.size()) return 0;
bool is_set = false;
for (unsigned int I = 1; I < lList.size(); I++)
{
if (!lList[I].size()) continue;
if (!is_set)
{
minimum = maximum = lList[I][0];
is_set = true;
continue;
}
if (lList[I][0] < minimum) minimum = lList[I][0];
if (lList[I][0] > maximum) maximum = lList[I][0];
}
return is_set? (maximum - minimum + 1) : 0;
}
inline bool range_check(char dData) const
{ return hash_data.size()? (dData >= minimum && dData <= maximum) : false; }
inline unsigned int translate(char dData) const
{ return dData - minimum; }
char minimum, maximum;
unsigned int number;
hash_table hash_data;
};
int main(int argc, char *argv[])
{
const char *test_data = "trvtvey 4ycv 6y45 t4c5 tv ub86 vcyw";
data_list test_criteria;
test_criteria.push_back("cv 6");
test_criteria.push_back("tv4");
test_criteria.push_back("tve");
test_criteria.push_back("ub8");
test_criteria.push_back("vcy");
test_criteria.push_back("ycv");
find_hash test_finder(test_criteria);
const char *current_point = test_data;
unsigned int position = 0;
std::cout << "data: '" << test_data << "'\n";
while (*current_point)
{
if (test_finder(current_point++))
std::cout << "match found @ " << position << "\n";
position++;
}
std::cout << "comparisons: " << test_finder.comparisons()
<< " (of a possible " << (strlen(test_data) * test_criteria.size()) << ")\n";
return 0;
}
strncmp can replace each comparison in that if/else tree.
ta0kira
Indeed yes, would be much nicer to look at and get rid of one of the branches , but for this simple task I thought I wouldn't call a function to do the job. Now after a bit of tinkering, if I'd needed to optimize the two comparisons I'd probably do it this way:
Code:
// e.g. for the "ab"
// originial:
// if (*fptr == 'a' && *(fptr+1) == 'b')
// optimized:
if (*((short*)fptr) == ((short)('a') + ((short)('b') << 8)))
... which the compiler could optimize quite a bit by itself (right hand side equals a constant value).
So rather first tokenize and then compare ? Well, I can see that with more and more error checking (which is needed if the string token lengths vary) the branches will also get more and more ...
IMO this all will lead to a decision between CPU/memory usage
- tight loop -> lower mem / higher cpu
- tokenize/compare -> higher mem / lower cpu
what to choose might depend on the whole context finally (# and variety of tokens, reuse of tokens ...)
I think a good compromise can always be found when it comes to CPU/memory trade-offs. A main point to consider is that processor usage is linear based on the number of instructions, but effective use of a small amount of memory can reduce the number of instructions geometrically. The proper fit all depends on the dimensions of the comparison vs. how much memory it would take to greatly reduce the number of instructions for the given ratio of data sets being compared.
ta0kira
What already ran through my mind was that there could also be a solution that tokenizes into a temporary buffer of size MAX_TOKEN_LENGTH (first copy a chunk into the buffer and then tokenize with e.g. strtok) and then do the comparison on this buffer, with e.g. strncmp.
Like said before, it all depends on the situation, I think there are already enough alternatives shown in this thread (feel free to add more though).
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.