Efficient search technique for a text file of 2 MB or more

topworld · 04-01-2006, 12:54 AM

Hi all,

I want to write a C program that finds a user-specified number or word in a text file of around 2 MB or more.
(The text file is a combination of words and numbers.)

What would be an efficient search technique?

Thank you.

ta0kira · 04-01-2006, 02:19 PM

I recommend a bash script as a front end to find possible text files, then submit the files to your C program. Look at 'man find'. I'm pretty sure that can take care of the size thing. Then look at 'man file' or 'man stat'; after obtaining a list of everything > 2MB, you can use these to determine if they are text files or not. You'll have to 'grep' and/or 'sed' to get something pretty looking out of it, though.
ta0kira

Mara · 04-01-2006, 03:06 PM

If the file is not sorted and you have no clue where the thing you're searching for might be, linear search is the way to go. A C program that reads data into a buffer, searches the buffer, and then reads the next fragment is a simple and rather effective approach.
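The buffered approach described above could be sketched roughly as follows. This is only an illustration, not a tuned implementation: the function name `search_stream` and the 64 KB `CHUNK_SIZE` are made up for the example, and it assumes the file was opened in binary mode ("rb") so the returned offset is a plain byte position. The one subtle point is carrying the last few bytes of each chunk over to the next, so a match that straddles two reads is not missed.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 65536  /* assumed chunk size; must be larger than the key */

/* Returns the byte offset of the first occurrence of key in the stream,
 * or -1 if it is absent (or the key is empty or too long for the buffer).
 * Uses a static buffer, so it is not reentrant. */
long search_stream(FILE *fp, const char *key)
{
    static char buf[CHUNK_SIZE];
    size_t klen = strlen(key);
    size_t keep = 0;   /* bytes carried over from the previous chunk */
    long base = 0;     /* file offset corresponding to buf[0] */
    size_t got;

    if (klen == 0 || klen > CHUNK_SIZE / 2)
        return -1;

    while ((got = fread(buf + keep, 1, CHUNK_SIZE - keep, fp)) > 0) {
        size_t len = keep + got;

        /* plain linear scan over every position where a full key fits */
        for (size_t i = 0; i + klen <= len; i++) {
            if (buf[i] == key[0] && memcmp(buf + i, key, klen) == 0)
                return base + (long)i;
        }

        /* keep the last klen-1 bytes so a match spanning two chunks is found */
        keep = (len < klen - 1) ? len : klen - 1;
        memmove(buf, buf + len - keep, keep);
        base += (long)(len - keep);
    }
    return -1;
}
```

A caller would open the file with fopen(path, "rb"), pass the stream and the search string, and treat -1 as "not found".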

addy86 · 04-01-2006, 04:49 PM

Read
http://en.wikipedia.org/wiki/String_searching_algorithm
Considering the length of the text (2M characters), a simple brute-force search (O(mn) in the worst case, for a pattern of length m and a text of length n) is almost certainly not the fastest way.
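As one example of what the linked article covers, here is a rough sketch of the Boyer-Moore-Horspool variant. It is picked here purely for illustration (the post above does not name a specific algorithm), and the function name `horspool_search` is hypothetical. It assumes the text is already in memory. The idea is to look at the last byte of the current window and skip ahead by a precomputed distance, so most text positions are never compared against the pattern.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search.
 * Returns a pointer to the first occurrence of pat (length m) in
 * text (length n), or NULL if there is none. */
const char *horspool_search(const char *text, size_t n,
                            const char *pat, size_t m)
{
    size_t shift[UCHAR_MAX + 1];

    if (m == 0 || m > n)
        return NULL;

    /* default: a byte that is not in the pattern allows a full shift of m */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;

    /* bytes in the pattern (except its last one) allow a smaller shift */
    for (size_t i = 0; i + 1 < m; i++)
        shift[(unsigned char)pat[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos <= n - m) {
        if (memcmp(text + pos, pat, m) == 0)
            return text + pos;
        /* skip ahead based on the last byte of the current window */
        pos += shift[(unsigned char)text[pos + m - 1]];
    }
    return NULL;
}
```

The worst case is still O(mn), but on ordinary text the skips usually make it considerably faster than checking every position.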

paulsm4 · 04-01-2006, 05:12 PM

Your first question, as Mara noted, is whether there's any order in the file itself (is the file sorted? Can you read it a line at a time, or is it just a random byte stream? And so on.)

The next question is whether you need to parse the entire file for each query, or whether it makes sense to index the file (as the Wikipedia article addy86 linked suggests).

It would be interesting to do some tests, but I think it's unlikely you could easily write a C program that would necessarily out-perform "grep" or "awk" for basic pattern matching (i.e. "search") speed and efficiency. (I'm prepared to be 100% wrong about that statement, by the way ;-))

'Hope that helps .. PSM

topworld · 04-03-2006, 01:56 AM

Thank you all for your help :-)

The inputs to the program are as follows:

1) The C program takes one text file as input (it has been prepared beforehand, and the program uses it directly).

2) The user input is a word and a number, such as xyz.c and 15.

This will appear as xyz.c:15 somewhere in the 2 MB file.
There is no specific pattern in the text file.

So I think linear search is the only option left.

Thank you
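For the input format described above (a word and a number joined as xyz.c:15), a minimal sketch of the lookup might look like the following. Several things here are assumptions rather than anything stated in the thread: the function name `find_entry`, the 512-byte key buffer, and the idea that the whole file has already been read into one NUL-terminated buffer (2 MB fits comfortably in memory). The one detail worth noting is the check after the match, so that a search for xyz.c:15 does not stop at xyz.c:150.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns the offset of the first "word:number" entry in text, or -1.
 * text must be NUL-terminated, e.g. the whole file read into memory. */
long find_entry(const char *text, const char *word, long number)
{
    char key[512];  /* assumed upper bound on the key length */
    int klen = snprintf(key, sizeof key, "%s:%ld", word, number);

    if (klen < 0 || (size_t)klen >= sizeof key)
        return -1;  /* key did not fit in the buffer */

    for (const char *p = strstr(text, key); p != NULL; p = strstr(p + 1, key)) {
        /* reject a longer number such as "xyz.c:150" when looking for "xyz.c:15" */
        if (!isdigit((unsigned char)p[klen]))
            return (long)(p - text);
    }
    return -1;
}
```

This only guards the right-hand side of the match; if the file could also contain names like abcxyz.c:15, a similar check on the character before the match would be needed.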