C++: need help searching for multiple items in a string
say I have a string:
Code:
string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";
The reason I ask is that this string may be millions of characters long, and I may be searching for 10-20 (maybe more) different items. That's a lot of overhead if I have to search for each string separately.
Edited to add: To clarify, what I'm really after is a way to make one pass of the string, searching for all of these things at once, like "egrep 'ab|mn|st' some_file". What I'm trying to avoid is foo.find("ab"); foo.find("mn"); foo.find("st"); |
Have you tried regular expressions? I can't think of an RE at the moment to do this, but that might work.
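If a C++11 compiler is available, the regex suggestion could look something like this. This is a sketch, not from the original thread: the function name is illustrative, and `std::regex` alternation is used to mirror the `egrep 'ab|mn|st'` idea from the question.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Find every occurrence of any alternative in a single scan of the text,
// mirroring `egrep 'ab|mn|st'`. Returns the starting positions of matches.
std::vector<std::size_t> find_all(const std::string &text,
                                  const std::string &alternation) {
    std::vector<std::size_t> hits;
    std::regex re(alternation);  // e.g. "ab|mn|st"
    for (auto it = std::sregex_iterator(text.begin(), text.end(), re);
         it != std::sregex_iterator(); ++it)
        hits.push_back(static_cast<std::size_t>(it->position()));
    return hits;
}
```

Note that `std::regex` carries its own per-character overhead, so for very long inputs it may or may not beat repeated `find` calls in practice.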
Is your input space-separated? If so, you might be able to use a tokenizer, split everything up into some data structure (a set, maybe?), and use search functions on that. Yes, this would still be calling a 'find' function more than once, but using the data structure and its (efficient) search functions might be OK. A way of going through the entire string just once wouldn't be any more efficient, I would guess, than using something like the above method. Some simple, brief pseudo-code for this might be Code:
tokenizedStrings = tokenize(foo) |
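Assuming the input really is space-separated, the pseudo-code above could be fleshed out like this. The `tokenize` name comes from the post; the use of `std::istringstream` and `std::set` is one illustrative choice among several.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Split a space-separated string into a set, so each later lookup is an
// O(log n) tree probe instead of a scan of the whole multi-megabyte string.
std::set<std::string> tokenize(const std::string &text) {
    std::set<std::string> tokens;
    std::istringstream in(text);
    std::string tok;
    while (in >> tok)
        tokens.insert(tok);
    return tokens;
}
```

The set is built in one pass; after that, checking 10-20 items costs 10-20 cheap lookups rather than 10-20 full scans. The trade-off is the memory for the copy of every token.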
How much memory do you have? If you have a ridiculous amount of memory, then you can hash the letters into tables, with one table per letter listing every position it occurs at. That way your search can select the table with the same starting letter and jump to only those positions for the searches.
ta0kira |
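A sketch of this position-table idea, not from the original post (the function names are illustrative). The text is indexed once; a search for "mn" then only probes the positions where 'm' actually occurs, at a memory cost of roughly one `size_t` per character of input.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Index the text once: for each byte value, record every position where it
// occurs. Memory cost is about sizeof(size_t) per character of input.
std::vector<std::vector<std::size_t> > index_positions(const std::string &text) {
    std::vector<std::vector<std::size_t> > table(256);
    for (std::size_t i = 0; i < text.size(); ++i)
        table[static_cast<unsigned char>(text[i])].push_back(i);
    return table;
}

// Check only the candidate positions for one pattern against the text.
std::vector<std::size_t> find_with_index(
        const std::string &text,
        const std::vector<std::vector<std::size_t> > &table,
        const std::string &pattern) {
    std::vector<std::size_t> hits;
    if (pattern.empty()) return hits;
    const std::vector<std::size_t> &candidates =
        table[static_cast<unsigned char>(pattern[0])];
    for (std::size_t k = 0; k < candidates.size(); ++k)
        if (text.compare(candidates[k], pattern.size(), pattern) == 0)
            hits.push_back(candidates[k]);
    return hits;
}
```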
Well, you can go the low-level C-style way, and it's not pretty :)
Code:
string foo = "ab cd ef gh ij kl mn op qr st uv wx yz"; |
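The code in this post was truncated in the archive. A hypothetical reconstruction of the general idea, walking the buffer once with raw pointers and `strncmp`, might look like this (names and details are guesses, not the original):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// One pass over the text; at every offset, try each needle with strncmp.
// Returns the offset of the first match of any needle, or -1 if none match.
long find_any(const char *text, const char *const needles[], std::size_t count) {
    for (const char *p = text; *p != '\0'; ++p)
        for (std::size_t i = 0; i < count; ++i)
            if (std::strncmp(p, needles[i], std::strlen(needles[i])) == 0)
                return p - text;
    return -1;
}
```

This is a single pass, but each position still costs one comparison per needle, so it trades the multiple `foo.find()` scans for a tighter (if uglier) inner loop.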
Here's a similar solution to the raw C, but uses std::string methods to do it.
Code:
std::string::size_type find_stuff(const std::string &str, std::string::size_type start_pos) |
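Only the signature of `find_stuff` survived in the archive. A hypothetical completion consistent with that signature (the pattern list and the earliest-match semantics are my assumptions, and C++11 is assumed for brevity):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical pattern list; in the original this would hold the 10-20 items.
static const std::vector<std::string> patterns = { "ab", "mn", "st" };

// Return the position of the earliest match of any pattern at or after
// start_pos, or std::string::npos if nothing matches.
std::string::size_type find_stuff(const std::string &str,
                                  std::string::size_type start_pos) {
    std::string::size_type best = std::string::npos;
    for (const std::string &pat : patterns) {
        std::string::size_type pos = str.find(pat, start_pos);
        if (pos != std::string::npos &&
            (best == std::string::npos || pos < best))
            best = pos;
    }
    return best;
}
```

Note this still calls `find` once per pattern per invocation, so it is the readable std::string version of the approach rather than a true single pass.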
It's a question of algorithms and data structures, not of languages.
Ta0kira is closest to the mark. The thing you *don't* want to do is run (perhaps multiple) sequential searches in turn through a sequential list. You want to check for "membership" in a "set". Good luck .. PSM |
Quote:
Unless you are really just looking for an academic solution that doesn't take efficiency into account. I can see no reason why the OP should double-buffer the data if it's just for a one-time lookup of the indexes. Maybe it should have been made clear beforehand what is going on afterwards, but assuming the data is not used otherwise, it'd be a waste (of time and memory). |
Another way would be to derive properties common to all items to be found. For example, if all start with a-z then you can compare every position for values[i] >= 'a' && values[i] <= 'z' before delving into any string comparisons. Then you can hash your "find" list by first letter:
Code:
#include <string> |
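The code here was also cut short in the archive. A sketch of the described approach (grouping the pattern list by first letter, with the cheap a-z range check done before any full comparison; the container choice and names are mine, not the original):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<char, std::vector<std::string> > PatternMap;

// Group the patterns by their first character.
PatternMap by_first_letter(const std::vector<std::string> &patterns) {
    PatternMap m;
    for (std::size_t i = 0; i < patterns.size(); ++i)
        if (!patterns[i].empty())
            m[patterns[i][0]].push_back(patterns[i]);
    return m;
}

// Single scan of the text; only patterns whose first character matches the
// current byte are compared in full. Returns match start positions.
std::vector<std::size_t> scan(const std::string &text, const PatternMap &m) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c < 'a' || c > 'z') continue;      // cheap range check first
        PatternMap::const_iterator it = m.find(c);
        if (it == m.end()) continue;
        for (std::size_t j = 0; j < it->second.size(); ++j)
            if (text.compare(i, it->second[j].size(), it->second[j]) == 0)
                hits.push_back(i);
    }
    return hits;
}
```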
Quote:
ta0kira |
Quote:
Code:
// e.g. for the "ab"
Not tested, much uglier, but probably very fast. |
... but yeah, for longer strings this would definitely be an improvement.
|
You may be right, but it's the iteration magnitude that will get OP down. That's where the effort needs to be spent first.
ta0kira |
So rather first tokenize and then compare? Well, I can see that with more and more error checking (which is needed if the string token lengths vary), the branches will also get more and more numerous ...
IMO this all will lead to a decision between CPU and memory usage:
- tight loop -> lower mem / higher cpu
- tokenize/compare -> higher mem / lower cpu
What to choose might depend on the whole context finally (# and variety of tokens, reuse of tokens ...) |
I think a good compromise can always be found when it comes to CPU/memory trade-offs. A main point to consider is that processor usage is linear based on the number of instructions, but effective use of a small amount of memory can reduce the number of instructions geometrically. The proper fit all depends on the dimensions of the comparison vs. how much memory it would take to greatly reduce the number of instructions for the given ratio of data sets being compared.
ta0kira |
Good point about the effective memory usage :)
What already ran through my mind was that there could also be a solution that tokenizes into a temporary buffer of size MAX_TOKEN_LENGTH (first copy a chunk into the buffer and then tokenize with e.g. strtok) and then does the comparison on this buffer, with e.g. strncmp. As said before, it all depends on the situation; I think there are already enough alternatives shown in this thread (feel free to add more though). |
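A simplified sketch of the buffered strtok idea above. It is not the fixed-size MAX_TOKEN_LENGTH chunking described (for clarity it copies the whole string into one writable buffer, since strtok modifies its input); names are illustrative.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Copy the text into a writable buffer (strtok modifies its input), split
// on spaces, and collect every token that appears in the wanted list.
std::vector<std::string> matching_tokens(const std::string &text,
                                         const std::vector<std::string> &wanted) {
    std::vector<std::string> found;
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');
    for (char *tok = std::strtok(&buf[0], " "); tok != 0;
         tok = std::strtok(0, " "))
        for (std::size_t i = 0; i < wanted.size(); ++i)
            if (wanted[i] == tok)
                found.push_back(wanted[i]);
    return found;
}
```

A real chunked version would refill a MAX_TOKEN_LENGTH-sized buffer from the source and take care of tokens straddling chunk boundaries, which is exactly the extra error checking discussed above.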