LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 02-15-2008, 08:34 PM   #1
BrianK
Senior Member
 
Registered: Mar 2002
Location: Los Angeles, CA
Distribution: Debian, Ubuntu
Posts: 1,334

Rep: Reputation: 51
C++: need helping searching for multiple items in a string


say I have a string:

Code:
string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";
... and I want to find where "ab" "mn" and "st" are. Is there a way I can do that without searching through the string 3 times with a find?

The reason I ask is that this string may be millions of characters long & I may be searching for 10-20 (maybe more) different items. That's a lot of overhead if I have to search for each string separately.

edited to add: To clarify, what I'm really after is a way to make one pass of the string, searching for all of these things at once, like "egrep 'ab|mn|st' some_file". What I'm trying to avoid is foo.find("ab"); foo.find("mn"); foo.find("st");

Last edited by BrianK; 02-15-2008 at 09:50 PM.
 
Old 02-15-2008, 09:59 PM   #2
nadroj
Senior Member
 
Registered: Jan 2005
Location: Canada
Distribution: ubuntu
Posts: 2,539

Rep: Reputation: 60
have you tried regular expressions? i cant think of a RE at the moment to do this, but that might work.
is your input space-separated? if so you might be able to use a tokenizer and split everything up into some data structure (set maybe?) and use search functions on that. yes this would still be calling a 'find' function more than once, but using the data structure and related (efficient) search functions might be ok.

a way of going through the entire string just once, wouldnt be any more efficient, i would guess, than using something like the above method. some simple brief pseudo-code for this might be
Code:
tokenizedStrings = tokenize(foo)
for each token in tokenizedStrings
  if token == tok1
    print "tok1 found at position" x
  else if token == tok2
    print "tok2 found at position" x
  ...
hope it helps
 
Old 02-16-2008, 09:49 AM   #3
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
How much memory do you have? If you have a ridiculous amount of memory then you can hash the letters into tables with one table per letter listing every position it occurs at. That way your search can select the table with the same starting letter and jump to only those positions for the searchs.
ta0kira
 
Old 02-16-2008, 06:44 PM   #4
fantas
Member
 
Registered: Jun 2007
Location: Bavaria
Distribution: slackware, xubuntu
Posts: 143

Rep: Reputation: 22
Well, you can go the low-level C-style way, and it's not pretty

Code:
string foo = "ab cd ef gh ij kl mn op qr st uv wx yz";

// first get the c string pointer
const char * fptr = foo.c_str();
// store beginning
const char * fbegin = fptr; 
// storage of the locations
int indexes[] = { -1, -1, -1 };
int count = 0;

// parse the whole thing, looking for: "ab" "mn" and "st"
while (*fptr != '\0')
{
    if (*fptr == 'a' && *(fptr+1) == 'b')
    {
        // first one found => store index
        indexes[0] = (int)(fptr - fbegin);
        ++fptr; // advance pointer
        ++count; // one less to search for
    }
    else if (*fptr == 'm' && *(fptr+1) == 'n')
    {
        // second one found => store index
        indexes[1] = (int)(fptr - fbegin);
        ++fptr; // advance pointer
        ++count; // one less to search for
    }
    else if (*fptr == 's' && *(fptr+1) == 't')
    {
        // third one found => store index
        indexes[2] = (int)(fptr - fbegin);
        ++fptr; // advance pointer
        ++count; // one less to search for
    }
    if (count == 3) break; // all found
    ++fptr; // else advance pointer
}
Just from the top of my head, but basically it should already work as it is. Duplicate entries are not taken into consideration here.
 
Old 02-17-2008, 11:43 AM   #5
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115Reputation: 115
Here's a similar solution to the raw C, but uses std::string methods to do it.
Code:
std::string::size_type find_stuff(const std::string &str, std::string::size_type start_pos)
{
   std::string::size_type find_pos = str.find_first_of("ams", start_pos);
   while(find_pos != std::string::npos)
   {
      switch(str[find_pos])
      {
         case 'a':
            if(str.compare(find_pos, 2, "ab") == 0)
            {
               return find_pos;
            }
            break;
         case 'm':
            if(str.compare(find_pos, 2, "mn") == 0)
            {
               return find_pos;
            }
            break;
         case 's':
            if(str.compare(find_pos, 2, "st") == 0)
            {
               return find_pos;
            }
            break;
      }
      find_pos = str.find_first_of("ams", find_pos + 1);
   }
   return find_pos;
}
With a little work it can be generalized so that you can pass in a std::vector of strings and it'll simultaneously search for them.
 
Old 02-17-2008, 12:04 PM   #6
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
It's a question of algorithms and data structures, not of languages.

Ta0kira is closest to the mark.

The thing you *don't* want to do is do (perhaps multiple) sequential searches in turn through a sequential list. You want to check for "membership" in a "set".

Good luck .. PSM
 
Old 02-17-2008, 02:20 PM   #7
fantas
Member
 
Registered: Jun 2007
Location: Bavaria
Distribution: slackware, xubuntu
Posts: 143

Rep: Reputation: 22
Quote:
Originally Posted by paulsm4 View Post
It's a question of algorithms and data structures, not of languages.
I couldn't disagree more (and noone here said it's about languages either).

Unless you really just look for an academic solution which doesn't take efficiency into account.
I can see no reason why the OP should double buffer the data if it's just for a one-time lookup of the indexes. Maybe it should have been made clear beforehand what is going on afterwards, but assuming the data is not used otherwise it'd be a waste (of time and memory).
 
Old 02-17-2008, 05:23 PM   #8
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Another way would be to derive properties common to all items to be found. For example, if all start with a-z then you can compare every position for values[I] >= 'a' && values[I] <= 'z' before delving into any string comparisons. Then you can hash your "find" list by first letter:
Code:
#include <string>
#include <vector>
#include <iostream>
#include <string.h>


typedef std::string data;
typedef std::vector <data> data_list;
typedef std::vector <data_list> hash_table;



class find_hash
{
public:
	find_hash(const data_list &lList) : number(0)
	{
	hash_data.resize(set_range(lList));

	for (unsigned int I = 0; I < lList.size(); I++)
	 {
	if (!lList[I].size() || !range_check(lList[I][0])) continue;
	data temp(lList[I].substr(1, lList[I].size()));
	hash_data[ translate(lList[I][0]) ].push_back(temp);
	 }
	}


	bool operator () (const char *dData)
	{
	if (!dData || !range_check(*dData)) return false;
	data_list::const_iterator search_current = hash_data[ translate(*dData) ].begin(),
	                          search_last    = hash_data[ translate(*dData++) ].end();

	while(search_current < search_last)
	 {
	number++;
	if (strncmp(search_current->c_str(), dData, search_current->size()) == 0) return true;
	search_current++;
	 }

	return false;
	}


	unsigned int comparisons() const
	{ return number; }


private:
	unsigned int set_range(const data_list &lList)
	{
	if (!lList.size()) return 0;
	bool is_set = false;

	for (unsigned int I = 1; I < lList.size(); I++)
	 {
	if (!lList[I].size()) continue;
	if (!is_set)
	  {
	minimum = maximum = lList[I][0];
	is_set = true;
	continue;
	  }

	if (lList[I][0] < minimum) minimum = lList[I][0];
	if (lList[I][0] > maximum) maximum = lList[I][0];
	 }

	return is_set? (maximum - minimum + 1) : 0;
	}


	inline bool range_check(char dData) const
	{ return hash_data.size()? (dData >= minimum && dData <= maximum) : false; }

	inline unsigned int translate(char dData) const
	{ return dData - minimum; }


	char minimum, maximum;
	unsigned int number;
	hash_table   hash_data;
};



int main(int argc, char *argv[])
{
	const char *test_data = "trvtvey 4ycv 6y45 t4c5 tv ub86 vcyw";

	data_list test_criteria;
	test_criteria.push_back("cv 6");
	test_criteria.push_back("tv4");
	test_criteria.push_back("tve");
	test_criteria.push_back("ub8");
	test_criteria.push_back("vcy");
	test_criteria.push_back("ycv");

	find_hash test_finder(test_criteria);

	const char *current_point = test_data;
	unsigned int position = 0;

	std::cout << "data: '" << test_data << "'\n";

	while (*current_point)
	{
	if (test_finder(current_point++))
	std::cout << "match found @ " << position << "\n";

	position++;
	}

	std::cout << "comparisons: " << test_finder.comparisons()
	          << " (of a possible " << (strlen(test_data) * test_criteria.size()) << ")\n";

	return 0;
}
ta0kira

Last edited by ta0kira; 02-17-2008 at 05:28 PM.
 
Old 02-17-2008, 10:47 PM   #9
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Quote:
Originally Posted by fantas View Post
Well, you can go the low-level C-style way, and it's not pretty
strncmp can replace each comparison in that if/else tree.
ta0kira
 
Old 02-18-2008, 06:13 PM   #10
fantas
Member
 
Registered: Jun 2007
Location: Bavaria
Distribution: slackware, xubuntu
Posts: 143

Rep: Reputation: 22
Quote:
Originally Posted by ta0kira View Post
strncmp can replace each comparison in that if/else tree.
ta0kira
Indeed yes, would be much nicer to look at and get rid of one of the branches , but for this simple task I thought I wouldn't call a function to do the job. Now after a bit of tinkering, if I'd needed to optimize the two comparisons I'd probably do it this way:

Code:
// e.g. for the "ab"

// originial:
// if (*fptr == 'a' && *(fptr+1) == 'b')

// optimized:
if (*((short*)fptr) == ((short)('a') + ((short)('b') << 8)))
... which the compiler could optimize quite a bit by itself (right hand side equals a constant value).

Not tested, much uglier, but probably very fast.
 
Old 02-18-2008, 07:13 PM   #11
fantas
Member
 
Registered: Jun 2007
Location: Bavaria
Distribution: slackware, xubuntu
Posts: 143

Rep: Reputation: 22
... but yeah, for longer strings this would definitely be an improvement.
 
Old 02-18-2008, 07:42 PM   #12
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
You may be right, but it's the iteration magnitude that will get OP down. That's where the effort needs to be spent first.
ta0kira
 
Old 02-18-2008, 08:58 PM   #13
fantas
Member
 
Registered: Jun 2007
Location: Bavaria
Distribution: slackware, xubuntu
Posts: 143

Rep: Reputation: 22
So rather first tokenize and then compare ? Well, I can see that with more and more error checking (which is needed if the string token lengths vary) the branches will also get more and more ...

IMO this all will lead to a decision between CPU/memory usage

- tight loop -> lower mem / higher cpu
- tokenize/compare -> higher mem / lower cpu

what to choose might depend on the whole context finally (# and variety of tokens, reuse of tokens ...)
 
Old 02-18-2008, 09:14 PM   #14
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
I think a good compromise can always be found when it comes to CPU/memory trade-offs. A main point to consider is that processor usage is linear based on the number of instructions, but effective use of a small amount of memory can reduce the number of instructions geometrically. The proper fit all depends on the dimensions of the comparison vs. how much memory it would take to greatly reduce the number of instructions for the given ratio of data sets being compared.
ta0kira
 
Old 02-19-2008, 08:06 AM   #15
fantas
Member
 
Registered: Jun 2007
Location: Bavaria
Distribution: slackware, xubuntu
Posts: 143

Rep: Reputation: 22
Good point about the effective memory usage

What already ran through my mind was that there could also be a solution that tokenizes into a temporary buffer of size MAX_TOKEN_LENGTH (first copy a chunk into the buffer and then tokenize with e.g. strtok) and then do the comparison on this buffer, with e.g. strncmp.

Like said before, it all depends on the situation, I think there are already enough alternatives shown in this thread (feel free to add more though).
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
LDAP API - Searching through multiple OU smurff Programming 3 09-22-2006 03:22 AM
searching for a string of charcters in some files hhegab Programming 2 04-16-2005 05:07 PM
Searching for a string krazykow Solaris / OpenSolaris 1 03-17-2005 11:55 AM
searching for multiple files ryedunn Linux - Newbie 4 09-27-2004 03:21 PM
mvoing multiple items gonus Linux - General 2 01-20-2003 07:18 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 09:58 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration