LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 07-27-2009, 04:59 AM   #1
culin
Member
 
Registered: Sep 2006
Distribution: Fedora Core 10
Posts: 254

Rep: Reputation: 32
Search for an element in a file


Hi friends,
I have a file containing around 3000 elements, and I want to search for a particular element in it.
What would be an efficient way to search, without consuming much memory or time?
1) Is it efficient to read all 3000 elements from the file into a list or array, sort them, and then search?
2) Or is it more efficient to search directly inside the file?
Can anyone please suggest an algorithm for an efficient search in this case?

I am using C for coding...

Thanks
 
Old 07-27-2009, 07:25 AM   #2
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Stretch (Fluxbox WM)
Posts: 1,389
Blog Entries: 52

Rep: Reputation: 355
more information required

You might need to explain what you mean by 'element', as there are different search methods for various kinds of data (eg fixed length vs variable length).

It will also depend on whether you are looking for a known element (or sequence of elements), and/or need to do wild card matching.

It is not efficient to load the file and then sort and search: sorting takes on the order of n log(n) operations, whereas a plain linear search takes on the order of n. Loading the whole file would also be wasteful of memory.

However, if you are doing lots of searches, then this might become more reasonable (especially if you have only 3000 elements, and the elements are not too large - it isn't like we are short on memory these days).

I would strongly recommend the third volume, "Sorting and Searching", of Donald Knuth's magnum opus "The Art of Computer Programming", which comprehensively covers many search algorithms.

A typical string search algorithm is Boyer-Moore. But this is only relevant if the sequences you are looking for consist of more than one element (eg strings of characters).

To use a string search algorithm for a file, you might buffer the file through a circular array.

Of course, if you aren't actually required to write a program, you could use grep :-)

Last edited by neonsignal; 07-27-2009 at 07:30 AM.
 
Old 07-27-2009, 09:51 AM   #3
knudfl
LQ 5k Club
 
Registered: Jan 2008
Location: Copenhagen, DK
Distribution: pclos2017 CentOS6.9 CentOS7.3 + 50+ other Linux OS, for test only.
Posts: 16,742

Rep: Reputation: 3318
*
cat -n <file> | grep <word>

and you will get the matches with line numbers.
 
Old 07-28-2009, 12:44 AM   #4
culin
Member
 
Registered: Sep 2006
Distribution: Fedora Core 10
Posts: 254

Original Poster
Rep: Reputation: 32
Wow, thanks for all the replies.
I will make it clearer.
An element in my case is not an integer or float; it is a string of fewer than 300 characters. Yes, I am looking for a known element, and the elements are not sorted. "Best" for me means: it should not take much RAM and it should run in as little time as possible (if I had to compromise on one, I would accept a second or two more runtime rather than use much more RAM).
One more clarification: say I write a program for the searching myself. But what if I instead use system("cat -n <file> | grep <word>"); — will this take less time and memory, or does it just save coding time? Anything that takes less time and RAM is fine for me.
 
Old 07-28-2009, 12:57 AM   #5
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Stretch (Fluxbox WM)
Posts: 1,389
Blog Entries: 52

Rep: Reputation: 355
Quote:
what if i use system("cat -n <file> | grep <word>"); will this take less time and memory
Or alternatively "grep -nF <word> <file>". The '-n' prints line numbers (you can use '-b' if you want file position instead), and '-F' uses a version of grep that is optimized for fixed strings.

Yes, grep will be quite memory efficient and fast; it is written in C, and it wouldn't be easy to improve on it.
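For what it's worth, a C program can read grep's matches back directly through popen() instead of just firing system(). A minimal sketch, assuming grep is present on the target; the function name grep_count is made up, and no shell-quoting of the word is attempted:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

// run "grep -nF <word> <file>" through a pipe and print each matching
// line; returns the match count, or -1 if the pipe could not be opened.
// note: 'word' is not shell-quoted in this sketch, so it must not
// contain single quotes or other shell metacharacters
int grep_count(const char *word, const char *file)
{
    char cmd[512];
    snprintf(cmd, sizeof cmd, "grep -nF '%s' '%s'", word, file);
    FILE *pipe = popen(cmd, "r");
    if (!pipe) return -1;
    char line[1024];
    int matches = 0;
    while (fgets(line, sizeof line, pipe))
    {
        fputs(line, stdout);    // pass the numbered match through
        ++matches;
    }
    pclose(pipe);
    return matches;
}
```

Unlike system(), this lets the program inspect each matching line rather than only grep's exit status.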
 
Old 07-29-2009, 12:57 AM   #6
culin
Member
 
Registered: Sep 2006
Distribution: Fedora Core 10
Posts: 254

Original Poster
Rep: Reputation: 32
Okay, but what if only C is allowed and I cannot use any Linux command via system()? How can I do it efficiently in pure C?
 
Old 07-29-2009, 02:53 AM   #7
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,411

Rep: Reputation: 2397
Sounds a lot like homework, in which case show us what you've tried?
In any case, use the Report button to ask the mods to move this thread to the Programming forum for better/quicker help.
 
Old 07-29-2009, 02:57 AM   #8
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Stretch (Fluxbox WM)
Posts: 1,389
Blog Entries: 52

Rep: Reputation: 355
Just checking - is this a real world application or a programming assignment?

Here is an example, but it won't win any prizes :-)

Code:
// quick and dirty record search
// include files
	#define _POSIX_C_SOURCE 200809L	// getline is POSIX 2008
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
// takes two arguments: filename of record list and string to be matched
	int main(int argc, char **argv)
	{
	// open file
		if (argc != 3) return 1;
		FILE *pFile;
		if (!(pFile = fopen(argv[1], "r"))) return 1;
	// read each record
		ssize_t length;
		char *record = 0; size_t lBuffer = 0;
		unsigned int index = 0;
		while ((length = getline(&record, &lBuffer, pFile)) != -1)
		{
		// strip the trailing newline, if any
			if (length && record[length-1] == '\n') record[length-1] = 0;
		// if record matches, show matching index
			if (strcmp(record, argv[2]) == 0)
				printf("%u\n", index);
			++index;
		}
	// free getline space and exit
		if (record) free(record);
		fclose(pFile);
		return 0;
	}

Last edited by neonsignal; 07-29-2009 at 06:38 AM.
 
Old 07-30-2009, 02:47 AM   #9
culin
Member
 
Registered: Sep 2006
Distribution: Fedora Core 10
Posts: 254

Original Poster
Rep: Reputation: 32
Quote:
Sounds a lot like homework, in which case show us what you've tried?

Ha ha, no, it's not homework; if it were, I would not have struggled this much. The thing is, my code has to run on an embedded system where BusyBox is not installed, i.e. I can't use grep or system() like that, and RAM is very limited, around 64 MB I think. So I was looking for an efficient design that takes minimal memory and minimal time!


Thanks for the code, neonsignal,
but it compares the strings in the file directly. I may have around 3000 to 4000 entries, so if the element is at position 2999, the 2998 preceding iterations are wasted (i.e. it takes more time). What might be the best compromise? I also don't want to load all the elements into a local array, since that would take roughly 3000 × (string length) bytes.
 
Old 07-30-2009, 03:16 AM   #10
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,411

Rep: Reputation: 2397
Well, it's a lot clearer now you've said that.
It sounded odd originally.

Generally there's a trade-off between RAM and CPU.
Have you considered generating a hash table?
Better still, store the file as a hash instead of keeping both the file and a hash table.

Alternately:
Are the entries stored in a fixed-length format? A simple index calculation would then suffice, even if you had to sacrifice a small amount of storage to pad the file that way.
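The fixed-length idea can be sketched as follows. RECORD_SIZE is a made-up value, and a real file would pad every record to exactly that length:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

#define RECORD_SIZE 32  /* hypothetical fixed record length, padding included */

// read record number 'index' from a file of fixed-length records;
// returns 0 on success, -1 on failure
int read_record(const char *path, long index, char out[RECORD_SIZE + 1])
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    // with fixed-length records the byte offset is a simple multiplication
    if (fseek(f, index * RECORD_SIZE, SEEK_SET) != 0) { fclose(f); return -1; }
    size_t n = fread(out, 1, RECORD_SIZE, f);
    fclose(f);
    if (n != RECORD_SIZE) return -1;
    out[RECORD_SIZE] = '\0';    // make the record usable as a C string
    return 0;
}
```

The point is that no searching happens at all: record i always lives at byte i × RECORD_SIZE.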
 
Old 07-30-2009, 03:49 AM   #11
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Stretch (Fluxbox WM)
Posts: 1,389
Blog Entries: 52

Rep: Reputation: 355
If you need better than linear speed, then you will either have to pre-sort the file, or use a better data structure to store the file (eg an AVL tree, or a hash table as chrism01 suggested).

Perhaps a sample of the sort of strings you are working with would be helpful? Is the file a fixed set of data, or will it change over time? What is the file system stored on, a disk or an SD card?
 
Old 07-30-2009, 06:40 AM   #12
culin
Member
 
Registered: Sep 2006
Distribution: Fedora Core 10
Posts: 254

Original Poster
Rep: Reputation: 32
Thanks so much for the replies...
Assuming the entries in the file are sorted, can anyone please give me a simple hash table implementation for this (I mean code showing how the hash table is generated)? I vaguely know how to generate an index file for the entries, but I still want to know exactly.
Please give me a simple code example: a file containing 10 string entries in sorted order, and how to generate a hash table for it.
 
Old 07-31-2009, 12:44 AM   #13
neonsignal
Senior Member
 
Registered: Jan 2005
Location: Melbourne, Australia
Distribution: Debian Stretch (Fluxbox WM)
Posts: 1,389
Blog Entries: 52

Rep: Reputation: 355
Wikipedia provides a good overview of hash tables, and there are plenty of examples of hash map code on the internet, eg http://bd-things.net/hash-maps-with-...rate-chaining/.

To adapt this to a file based database, you would store file offset indexes in the hash table instead of the strings themselves (to save memory). I leave this as an exercise for the reader.
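As a toy illustration of that idea (not taken from the link above), each bucket below chains byte offsets rather than the strings themselves; the table size and the djb2-style hash are arbitrary choices:

```c
#include <stdlib.h>

#define NBUCKETS 64  /* arbitrary small table size for illustration */

// one chain node: the bucket stores the byte offset of the record
// in the data file, not the string itself, to save memory
struct node { long offset; struct node *next; };
static struct node *table[NBUCKETS];

// djb2 string hash (one common choice, not the only one)
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

// record that the string found at byte 'offset' hashes to this bucket
void insert_offset(const char *key, long offset)
{
    unsigned long b = hash(key) % NBUCKETS;
    struct node *n = malloc(sizeof *n);
    n->offset = offset;
    n->next = table[b];
    table[b] = n;
}

// return the chain for a key; the caller still has to fseek to each
// offset and compare the stored string, because different keys can
// land in the same bucket
struct node *lookup_chain(const char *key)
{
    return table[hash(key) % NBUCKETS];
}
```

A production version would also need deletion and resizing, but this shows the shape: one hash computation replaces a scan over all records.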

If you aren't confident with writing the code, try to start with something simple (such as the linear search). You might even find it works well enough for your application.

A lot of decisions depend on the specifications of your embedded platform, and you have provided little information about what this is running on, what the purpose is, or how it is to function. It isn't even clear to me why speed is the main issue, given how small the set of strings is.

Whether the data set is fixed or dynamic will also make a lot of difference to the choice of algorithms (for example, a fixed data set can be perfect hashed, or can use dictionary compression so that it fits in a small space in memory). The file system could be linearly accessible (like a disk), or randomly accessible (like an SD card) - also relevant to the choice of algorithms. Or how often searches occur compared to inserts? And many other factors.

No one method is superior, they all have advantages and disadvantages depending on the application.

Should you want to provide detail and/or code, then people will be able to give suggestions for improvements.
 
Old 07-31-2009, 01:45 AM   #14
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,411

Rep: Reputation: 2397
If it's sorted, and it sounds like a relatively small number of records, possibly a simple binary chop algorithm would suffice:
http://en.wikipedia.org/wiki/Binary_search_algorithm
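For a sorted in-memory array of strings, the binary chop might look like the sketch below (the standard bsearch() from <stdlib.h> would also do the job):

```c
#include <string.h>

// binary search over a sorted array of strings;
// returns the index of 'key', or -1 if it is not present
int find_string(const char **sorted, int count, const char *key)
{
    int lo = 0, hi = count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;   // midpoint without integer overflow
        int cmp = strcmp(key, sorted[mid]);
        if (cmp == 0) return mid;       // found it
        if (cmp < 0) hi = mid - 1;      // key sorts before the middle entry
        else lo = mid + 1;              // key sorts after the middle entry
    }
    return -1;
}
```

For 3000 entries this needs at most about 12 comparisons per lookup instead of up to 3000.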
 
Old 07-31-2009, 05:27 AM   #15
culin
Member
 
Registered: Sep 2006
Distribution: Fedora Core 10
Posts: 254

Original Poster
Rep: Reputation: 32
Thanks for all that information. I implemented a very simple indexing method whose entries contain an offset and a size, but I need one more piece of information:
say I have a file file1.txt and an index file for it, say file1_index.txt. I read file1_index.txt for a string and got the offset 234; now I have to go to the 234th line in file1.txt. How can I go directly to that line in file1.txt?
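Two sketches of what the lookup could do, depending on what the index really stores: if 234 is a byte offset, fseek() jumps straight to it; if it is a line number, the preceding lines have to be skipped by reading. The function names here are made up:

```c
#include <stdio.h>

// jump straight to a byte offset recorded in the index file;
// O(1) -- this is what storing byte offsets in the index buys you
int seek_to_offset(FILE *f, long offset)
{
    return fseek(f, offset, SEEK_SET);
}

// skip to the start of line 'lineno' (0-based) by counting newlines;
// O(n) in the file size, so a byte-offset index is preferable
int seek_to_line(FILE *f, long lineno)
{
    rewind(f);
    int c;
    long line = 0;
    while (line < lineno && (c = fgetc(f)) != EOF)
        if (c == '\n') ++line;
    return (line == lineno) ? 0 : -1;
}
```

So if the index can store byte offsets (e.g. from ftell() when the index is built) rather than line numbers, each lookup becomes a single seek.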
 
  

