LinuxQuestions.org > Linux - Newbie

Possibly the silliest question ever been asked on Linux Questions (https://www.linuxquestions.org/questions/linux-newbie-8/possibly-the-silliest-question-ever-been-asked-on-linux-questions-758633/)

++nick++ 10-03-2009 02:21 AM

Hi,

What the others have said is true in its own respect, but to build a search engine you need to know about web spiders, which can be built efficiently in Ruby, Perl and so on. So build up your skills with Ruby and Perl first; then you will certainly understand what all these people have said.

Thanks,

SaintDanBert 10-03-2009 02:33 AM

some bread crumbs
 
Quote:

Originally Posted by beckettisdogg (Post 3701594)
I want to build an image search engine running on Linux. Something that looks like images.google.com, but mine can be simpler.

You do not have to give me a very detailed step-by-step, but could you give me a brief step-by-step?

1. Over here you have your image files on disk, organized somehow.
Every image has a unique name of some sort.

2. Over there you have a database.

3. In the database, store the image name and whatever facts you wish to store about that image.

4. You search your database and find the matching records.

5. Using the list of matching records, you draw your page, fetching each image from its place on disk by its stored name.

You asked for a brief description; I think this does it. There are all sorts of things you can do to add chrome and bells and whistles. For example, did you store a thumbnail edition of the image or will you make the thumbnail on the fly? If your thumbnail is small enough and your database software is smart enough you could even store the thumbnail inside the database as a binary-large-object (BLOB) data type. If you have images protected by licenses or intellectual property, your database could hold that information, restrict display and grant or deny access accordingly.
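A minimal sketch of steps 1 through 5, assuming SQLite and Python purely for illustration (the steps above don't prescribe any particular database or language, and the table layout and sample row below are my own example):

Code:

# image_index.py -- toy version of steps 1-5 above.
import sqlite3

conn = sqlite3.connect("images.db")
conn.execute("""CREATE TABLE IF NOT EXISTS images (
                    name     TEXT PRIMARY KEY,  -- unique image name (step 1)
                    path     TEXT,              -- where the file lives on disk
                    keywords TEXT,              -- whatever facts you store (step 3)
                    width    INTEGER,
                    height   INTEGER)""")

# step 3: store the image name and the facts about it
conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?, ?, ?, ?)",
             ("ABCDE12345.jpg", "/srv/images/ABCDE12345.jpg",
              "penguin cold snow", 600, 450))
conn.commit()

# step 4: search the database for matching records
rows = conn.execute("SELECT name, path FROM images WHERE keywords LIKE ?",
                    ("%penguin%",)).fetchall()

# step 5: use the stored name/path to fetch each image while drawing the page
for name, path in rows:
    print(name, "->", path)

A thumbnail column could be added to the same table as a BLOB, as mentioned above.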

Just thinking out loud,
~~~ 0;-Dan

beckettisdogg 10-13-2009 08:04 AM

I asked an IT company how much they would charge to build a clone of http://images.google.ca/advanced_image_search?hl=en, and their response was:

They are not going to do it if it is going to be an image search engine that deals with the Internet in general; they will not even consider it.

If it is going to be for my own database and my own pictures, it will cost $10,000 and take 40 days.

Quote:

It depends on where the images are. If the images are all over the internet
and not with you, then the task is too big for us to even consider.

If you have the images, then are they tagged with the keywords already?
If yes, then search and display functionality could be done in about
30 to 40 working days at a cost of approximately $10,000. The search
itself takes hardly any time to program; most of the time goes into
presentation of search results. So if the presentation is simple, it
could be done for half as much.

If you need software to tag your images, then it is additional
work of, say, 20 working days. In addition, you need several people
working for several days to tag thousands of your images.

These are only initial estimates, and the final cost depends on exact
specifications for the work.

r3sistance 10-13-2009 08:33 AM

What they are offering is for something that searches YOUR sites or a database that already has the keywords/tags, NOT the internet. As they said, if you're after something that searches the internet, "then the task is too big for us to even consider."

OK, let's put a few things out there about Google. Google has potentially over a million servers worldwide; how much storage do you think Google has across all those servers? Do you think Google uses the storage on those servers, or do they use NASs and SANs for their storage, directly connected to clusters of database servers? Does Google get dedicated fibre optic lines for their operations, or at least 1 Gbps connections?

When Google began, Larry Page and Sergey Brin (not sure why I looked up their names) went to all their friends and family to get anything they could, computer-wise, to build a cluster in a garage... a full cluster in a garage. That's substantially more power and storage than you're talking about, and that was just for the basic web search engine in the first versions of Google. The company has grown a lot since then and the web search engine has become a lot more advanced; what they had back then was extremely basic compared to what they have now.

I am telling you now to drop this; it's delusional beyond belief. First things first: if you can make a web/text search engine, that's one thing, but an image search engine? That's a whole different kettle of fish. Let's just start with the basic requirements.

First off, 300GB isn't going to handle many images at all. You have to store information on the website each image came from, potentially offer to cache the image, and potentially offer to thumbnail it too. You need a spider to crawl the internet that appears to be, and IS, legitimate. You need to be able to age-rate images (how do you know whether an image is 18+ or not if all it's called is girl.jpg?) and handle a system where you can hide the 18+ pictures by default. This isn't even 1% of the actual requirements for building the system you are after; even as a cut-down version of an image search engine this isn't 1%. To be honest, outside of a business with 100 employees (most of them software developers) and a private suite (able to house about 150 amps worth of equipment) in a datacenter, I would say this isn't even worth considering, and even if you did have all that, a feasibility study would probably show it not to be worthwhile due to lack of resources even at that point...

beckettisdogg 10-13-2009 06:28 PM

Ah well, I will gladly repeat some of the things I already said in this thread.

1. 300GB: I am not sure where your figure of 300GB came from. www.geocities.com, currently owned by Yahoo, provides unlimited space and unlimited bandwidth for about 10 dollars a month, roughly the equivalent of two McDonald's meals.

2. Potentially offer to cache the image: unnecessary. Now THAT's ridiculous.

3. There are quite a few open source spiders on the internet, and IBM developerWorks has a step-by-step guide to building a spider. (The source code of the web spider was surprisingly short, by the way; see the rough sketch at the end of this list.)

4. Thumbnails: of course. I have never seen an image search engine that has no thumbnails. Thumbnails will be created automatically with ImageMagick and the GD library on the Geocities server. I will have to build a bot to do that... that will be one of the most difficult parts of the project.

5. Google will certainly need more than a million servers, as you indicated, because they try to cache everything: the content of HTML files including all the styling tags, and for some reason they try to convert PowerPoint slides, MS Word documents and Adobe PDF files into HTML files, and each of those converted files is usually bigger than 150 KB. My image search engine will not cache any of those, not even the text contents of the HTML files.

6. As for the database: my image search engine will automatically skip indexing images smaller than 400x300. My database will look like this, a few million lines of...

http://www.website.com/directory/picture.jpg penguin cold snow ABCDE12345.jpg 600x450

(address of the picture, keywords, thumbnail name, resolution)

All image search engines stop generating results when the results reach a thousand.

7. girl.jpg... ah, my image engine does not support that feature. (But the filtering feature of Google Images is quite useless, because anyone can disable it with one click.) And how Google's image search engine automatically filters is quite simple: check the keywords associated with the picture!! If the keywords contain adult-related words, it filters.

8. I know of an image search engine that was built by one man.

9. Actually, that IT company I contacted is ridiculous; either that, or they meant they were going to build the whole project from scratch. Another world-class IT expert I contacted told me that if I use open source programs it won't be very difficult or take long, and that it will cost about half of what the IT company estimated. He told me he could build a clone of the entire Yahoo website using various open source programs in a few weeks.
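Just to make items 3 and 6 concrete, here is a rough sketch of a spider that pulls <img> tags off a single page and appends records in the flat format above. It is only an illustration: it assumes Python with the standard library, the start URL and file names are placeholders, and a real spider would also need robots.txt handling, politeness delays, size checks and deduplication.

Code:

# tiny_spider.py -- illustrative only: fetch one page, find <img> tags and
# append index records in the "url keywords thumbnail resolution" format
# shown above. Standard library only; the URL and file names are placeholders.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

START_URL = "http://www.example.com/"       # placeholder, not a real target
INDEX_FILE = "image_index.txt"

class ImgCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []                    # (absolute URL, alt text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src"):
                self.images.append((urljoin(START_URL, a["src"]),
                                    a.get("alt") or ""))

html = urllib.request.urlopen(START_URL, timeout=10).read().decode("utf-8", "replace")
collector = ImgCollector()
collector.feed(html)

with open(INDEX_FILE, "a", encoding="utf-8") as index:
    for url, alt in collector.images:
        # A real spider would download each image, skip anything below
        # 400x300, build the thumbnail and record the true resolution;
        # the thumbnail name and resolution here are just placeholders.
        keywords = alt if alt else "untagged"
        index.write(f"{url} {keywords} THUMB.jpg 0x0\n")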

r3sistance 10-13-2009 06:59 PM

OK, the point with girl.jpg is that that one click is a legal requirement: if you don't have it, you're breaking the law, so it's not just a case of "oh, it's pointless". You have to have a step where people are shown that content is 18+, or else it's illegal.

Also, no, you don't get unlimited bandwidth or disk space with Geocities; it's just a marketing ploy. If you read the small print it's all limited by the TOS, and I think you'd find you won't even get 1GB very easily within the TOS, let alone the 100TB you'd need for this system. As for 300GB, it's about average for a dedicated server (160GB to 1TB, most commonly around 250GB), at least that's what I see a lot of where I work. For this system, I am not kidding when I say you'd probably need 2 or 3 ~10TB SANs that are going to cost about $40,000 each and about $1,000 a month to host...

Yes, caching would be crazy with so many images, but with what you have been saying it's hard to tell what you are after, because it's all crazy. Even so, for the thumbnails you have to download the actual picture, you have to have a system to downsize it, and then you have to host the reduced-size image.
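For what it's worth, a rough sketch of that download-and-downsize step might look like the following. This is only an illustration: it assumes ImageMagick's convert is installed (the tool mentioned earlier in the thread), and the URL and file names are placeholders.

Code:

# make_thumbnail.py -- illustrative download + downsize step.
import subprocess
import urllib.request

IMAGE_URL = "http://www.example.com/directory/picture.jpg"   # placeholder
ORIGINAL  = "original.jpg"
THUMBNAIL = "thumb.jpg"

# 1. download the actual picture
urllib.request.urlretrieve(IMAGE_URL, ORIGINAL)

# 2. downsize it (fit within 150x150 while keeping the aspect ratio)
subprocess.run(["convert", ORIGINAL, "-thumbnail", "150x150", THUMBNAIL],
               check=True)

# 3. the reduced-size image is what you would then have to host
print("wrote", THUMBNAIL)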

As for the spider: you need a machine to run the spider, and that machine needs an application/service/etc. to run it. The Geocities site may run PHP and Perl, but you're talking about running a compiled application, which is not going to be supported by most shared hosting; you are going to need a dedicated server that is constantly browsing networks and crawling around to build up its library.

Number 9 sounds like phishing... but that aside, any single developer who thinks they can build a site that does everything you see on yahoo.com is crazy, more so in a few weeks; that's well past the realm of single-person development and into the realm of a medium-to-large team of developers. As for the quote: what you got before was 40 days at $10K? That's $1K per 4 days, or $250 a day, which is barely enough to pay a single developer. And 40 days seems a little short for all the documentation, testing and implementation that a project like this would actually require.

As for the one man who built an image search engine: which search engine, does it work, is it legal? Does it actually have many images in it?

Oh and also the link http://www.website.com/directory/picture.jpg is dead.

P.S. This is crazy and, really, you should just drop this insane idea.

smeezekitty 10-13-2009 07:12 PM

Quote:

* Oh, and don't forget that while it's doing so, your CPU will be burning in hell. Imagine that 24/7 on a PC. It won't only blow up your CPU, but also provide you with some extra pain in the form of a monthly electric bill.
Not that much with only one computer.

i92guboj 10-13-2009 07:21 PM

Quote:

Originally Posted by smeezekitty (Post 3718225)
Not that much with only one computer.

Yeah, the same computer that's going to index the whole www. Or maybe he really thinks Geocities is going to lend him all their machines so he can compete with Google :D

This is getting nowhere. Luck.

jefro 10-13-2009 09:10 PM

I might be confused.

What do you want to build, exactly? Do you want to make a web site similar to Google, where a person can go to your site and search for images? Or do you want something for your own images, a place where you can view your own files?

Not sure Google copies every HTML page.

smeezekitty 10-13-2009 09:21 PM

Quote:

Originally Posted by jefro (Post 3718311)
I might be confused.
Not sure Google copies every HTML page.

Most of them; look at the Cached button next to almost every result.

beckettisdogg 10-13-2009 10:08 PM

Something similar happened in the past... when Shawn Fanning first invented P2P (Napster), everyone around him thought it would never be popular, but it grew like wildfire and became one of the most important parts of the internet today. (I know it made a lot of people in Hollywood and in the music and software industries upset, but I am not talking about THAT!! I am talking about the concept itself; let's ignore the legal issues for one minute. I read an article written by an IT advisor once, and it said that sometimes failures should be allowed. If failures are never allowed, we can never come up with anything innovative. And not every concept should be directly associated with profit.)

I am trying to build it not because I want to make a profit, but because I am curious.

If it does not work, then I am ready for that too. I will still have learned a lot along the way, or I will have gained another idea to build something else.

At the University of Texas, the final group project for 4th-year computer science students was to build and configure a search engine. The students were told to stop indexing after 3000 pages.

An image search engine will be easier to build than a general search engine that has to store everything... the database will be much simpler.

Logomachist 10-14-2009 01:04 AM

I'm not going to naysay your idea, because you might have a good reason for building an image search engine for an intranet or as a fun project or something. But when it comes to large-scale Internet search services, a lot of engines have tried the "only index the best sites" thing and, in my opinion, it has failed every time. People say Google's innovation was its PageRank algorithm, which sorted results according to how many quality sites linked to a page, but as far as I'm concerned their innovation was actually indexing the entirety of a webpage and searching that index to generate results. Before Google, search engines only saved snippets of text from every webpage they indexed and associated that text with somewhat random keywords that may never have even appeared on the page. It was incredibly useless: you could drill down through hundreds of results without finding one that contained the phrase you specified in your query, or the information that you knew was out there but couldn't find.
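As a toy illustration of that difference (my own sketch, not how Google actually implements it): a full-text inverted index maps every word on a page to the pages containing it, so a query can be matched against everything on the page rather than against a hand-picked snippet of keywords.

Code:

# inverted_index.py -- toy full-text index; the page bodies are made up.
from collections import defaultdict

pages = {
    "penguins.html": "emperor penguins huddle together in the antarctic cold",
    "recipes.html":  "a simple recipe for cold noodle salad",
}

index = defaultdict(set)                    # word -> pages containing that word
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """Return the pages containing every word of the query."""
    results = set(pages)
    for word in query.split():
        results &= index.get(word, set())
    return results

print(search("antarctic cold"))             # {'penguins.html'}
print(search("cold"))                       # both pages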

The Internet is a giant database of pages, and people write search queries in order to find specific information that is only on a few of those pages. It doesn't do anyone any good if you only search DeviantArt; if the user expected the picture to be on DeviantArt they wouldn't be using a generic search engine, they would be querying dA directly (or querying Google with site:deviantart.com, which is almost the same thing). And you can assume that in almost every case the user is going to know more about where the information can be found than you do, because the user knows what they're looking for and you're writing a GENERIC search engine.

Sigh.

Now, if you have a reason for writing a TARGETED search engine that only searches a few sites, you might be able to save yourself a lot of work and create a customized Google search.

gerryd 10-14-2009 02:13 AM

FYI, imagery isn't an image search engine made by one guy; it seems to be more of an alternate interface to Google itself. Here's a quote off the site:
Quote:

To begin testing type in something in the search box and hit 'Search', related thumbnails will appear soon (per Google Image Search).
The important stuff's in the parentheses. Oh, and this is at the bottom of the main page:
Quote:

"..a fancy-pants alternative front end to Google Image search..
a viable front end for power image searchers.."

r3sistance 10-14-2009 06:04 AM

Quote:

Originally Posted by beckettisdogg (Post 3718346)
Something similar happened in the past... when Shawn Fanning first invented P2P (Napster), everyone around him thought it would never be popular, but it grew like wildfire and became one of the most important parts of the internet today.

You are comparing something that had never been done before with something that has already been done. If you're original to the market, then you don't have loads of multi-million dollar companies to contend with (not that any would really target the P2P market, given its generally illegal usage). There are companies out there with budgets most people will never realistically see that don't have enough money to target image searching, and there are already a couple of search engines that do it, funded by multi-billion dollar companies.

If you were really interested in doing it, you'd make the thing yourself, but you're not on about that; you're on about paying other people to do it, with unrealistic aims and unrealistic resources. If you want to create something, an image search engine seems far, far, far out of your league right now... perhaps try something a little (AKA a heck of a lot) easier... like creating a different front end for a current search engine...

As for those kids who create search engines, I have never seen a good one come from them. I have heard of a lot of search engines over the years developed by 16-year-olds, and they generally sucked very badly, barely worked, or put out data in unreadable forms. University students making search engines are most likely building something that uses existing search engines to find sites and indexes off of them, which isn't really making a proper search engine with spiders and the like.

i92guboj 10-14-2009 12:26 PM

Comparing this with P2P only shows that you lack even a basic understanding of the resources you are going to need. The two things are very different.

P2P networks are *distributed*: on each node of the network you have a program that acts as a client but also acts as a server. In your case you are going to have a single server to rule the whole world. So, unless you are going to program a hypothetical e-finder that everyone will install on their computers, comparing P2P nets with this is not only pointless but completely unfair.

