LinuxQuestions.org > Linux - Newbie (http://www.linuxquestions.org/questions/linux-newbie-8/)
Thread: Possibly the silliest question ever been asked on Linux Questions (http://www.linuxquestions.org/questions/linux-newbie-8/possibly-the-silliest-question-ever-been-asked-on-linux-questions-758633/)

beckettisdogg 09-30-2009 04:28 AM

Possibly the silliest question ever been asked on Linux Questions
 
I want to build an image search engine running on Linux. Something that looks like images.google.com, but mine can be simpler.

You do not have to give me a very detailed walkthrough, but could you give me a brief step-by-step outline?

thanks!!

Maybe looking at the source code of images.google.com would help.

lutusp 09-30-2009 04:45 AM

Quote:

Originally Posted by beckettisdogg (Post 3701594)
I want to build an image search engine running on Linux. Something that looks like images.google.com, but mine can be simpler.

You do not have to give me a very detailed walkthrough, but could you give me a brief step-by-step outline?

thanks!!

Maybe looking at the source code of images.google.com would help.

I think you may be underestimating what Web search requires these days. You have to identify yourself to any number of Web sites as a legitimate crawler robot, you need lots of resources and huge storage capacity, and you must have very fast Internet access.

Web searching isn't like Web browsing.

linuxlover.chaitanya 09-30-2009 05:38 AM

All the infrastructure is one thing. But how would you search? Do you have time to develop the algorithms for a perfect search? And how will anyone trust you?

beckettisdogg 09-30-2009 06:52 AM

You know what, everyone, I think that could be one of my biggest resources: the source code of existing image search engines.

linuxlover.chaitanya 09-30-2009 07:50 AM

No. Source code will not give you anything, not even the slightest idea of what goes on behind the scenes. Do not think designing a search engine is child's play.

dsollen 09-30-2009 09:23 AM

I'm afraid I'm going to have to express the same opinion as the others: I don't think you recognize the inherent difficulty of your request. Creating even a marginally passable image search of the web is not something I would suggest even the most hard-core computer geeks tackle single-handedly.

First, there is the infrastructure. One PC, no matter how powerful, is not going to be sufficient to search the web. Limiting yourself to searching only Google Images instead of the entire web would make this task far easier, but we're still talking about multiple computers clustered together. Then, if you expect to make your image database available to others, there is additional infrastructure needed to receive search requests from users, process them *quickly*, and return the results.

Next there is the difficulty of searching images. Searching regular websites is not an easy task, but images are even harder. How do you know what an image is? A bunch of pixels could be a bunch of cats laughing out loud, an overweight anime fan dressed up as Aries at Comic-Con, or Rick Astley. I had friends pursuing graduate degrees in computer science just to begin to be able to tackle that challenge; it is not the sort of thing you code up one day in your free time. We can't point you to source code you can copy or modify here, because any company willing to put the time and effort into creating logic for searching images is likely to keep it proprietary. However, IMO, if you did want an image database, your best option would be to use image tags: have your users tag images with words or phrases they believe apply, and perform searches based on those tags. That is only viable if you have someone tagging the images, so it's not really useful if you want to get images off the web, but at least it's something a single individual has a chance of implementing.

And finally, there are the issues with crawling the web. Creating a very basic web crawler isn't that hard (I created one in an intermediate Java class), but creating one that can crawl the entire web efficiently with limited memory is much harder. There are other issues associated with web crawling as well, including the high bandwidth and upload speeds needed.

If you still want to attempt an image database, which I really wouldn't suggest, there are two methods depending on what you're actually trying to do. If you want to search a small number of images provided by other users, have them include tags describing the images when they upload them, and have your search utilize those tags. If you want to search the web itself, limit yourself to Google Images: do searches using Google and then apply whatever heuristics you require to the images it finds.

If you were to tell us why you want to do an image search, there's a chance you can get what you want without writing it yourself. Why not make use of Google Images or Photobucket to do whatever you need, rather than trying to create something from scratch?

jonfleck 09-30-2009 10:54 AM

Not to pile on here, but how can you get much simpler than Google? There is a search box and a button; I'm not sure what you could remove to make it simpler.

beckettisdogg 09-30-2009 01:27 PM

I agree. Google probably needs more than a hundred of the latest, fastest server machines.

For example, Google stores a copy of every HTML file (not sure if they store .js and .css files) and every copy of PowerPoint slides, MS Word documents, and Adobe PDF files. Google somehow converts the PowerPoint slides and PDF files to HTML, and each converted file is usually bigger than 100 KB.

Now THAT's going to require a damn big amount of space,

but as I stated,

I am going to save none of that in my image search engine!! Not even the content of the HTML files! I don't have to! I don't need to! (Lastly, I can't afford to.)

All I am going to save in the index is the actual address of the picture, the keywords associated with the picture, and the thumbnail associated with the picture. Thumbnails are stored on the server and will usually be about 5 KB in size.

My index will look like this: hundreds of thousands of lines like

http://www.soandso.com/directory/picture.jpg TAGS: penguin, cold, iceberg THUMBNAIL:abcde12345.jpg WIDTH 500 HEIGHT 400

The width and height will be saved so the user has the option of specifying the image size.
When my image search engine finds a "row" that matches what the user wants, it will return the thumbnail, and clicking on the thumbnail will take the user to the page where the picture is stored.
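
To give an idea of how simple I want the search side to be, something like this one-liner is roughly what I have in mind (assuming the index is just a plain text file called index.txt with one line per image, exactly like the example row above):

Code:

~ $ # find images tagged "penguin" that are at least 500 pixels wide and print their addresses
~ $ awk 'tolower($0) ~ /tags:.*penguin/ { for (i = 1; i <= NF; i++) if ($i == "WIDTH" && $(i+1) >= 500) { print $1; break } }' index.txt
http://www.soandso.com/directory/picture.jpg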

And there are open source spiders around the internet!! IBM developerWorks also shows how to make a spider step by step (but I admit I did not quite understand the steps for building a web spider; IBM developerWorks articles are usually very difficult for me).

Spiders do not need special permission to try to gather images or other types of information around the internet!

A spider just surfs around the internet the way we human surfers do. Of course, some web servers keep images and other files in special directories that require special permission to access, and that's fine if I do not have access to those pictures. I will just gather the pictures I do have access to using a spider.
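
Just to show what I mean, even plain wget can act as a crude image spider; something like this is roughly the idea (the site, the depth and the user-agent string are only placeholders, and wget honours robots.txt by default when it recurses):

Code:

~ $ # politely grab only image files from one site, two levels deep, pausing between requests
~ $ wget -r -l 2 -A jpg,jpeg,png,gif --wait=1 --random-wait -U "my-image-spider/0.1" -P crawl http://www.example.com/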

There is an image search engine named "Imagery". It is gaining popularity nowadays, and it looks like it was built by one man using various GNU-licensed programs. What makes Imagery special is that its interface has tabs. For example, you can search for "Penguin", then create another tab within Imagery and search for "Panda", and both searches will keep running. How does Imagery make a profit? It receives donations from visitors; it has a "tip jar".

I am also trying to pick up some tips from the two papers written by the famous boys Sergey Brin and Lawrence Page: http://infolab.stanford.edu/pub/papers/google.pdf http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

The PDF files above explain Google's ranking algorithm, PageRank. You know a man was destined to design a webpage ranking algorithm when his last name is "Page". When I build one, I will be one of the biggest customers of my own image search engine!
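
(If I'm reading the first paper right, the heart of PageRank is roughly one formula: PR(p) = (1 - d)/N + d * [sum over every page q that links to p of PR(q)/L(q)], where N is the total number of pages, L(q) is the number of outgoing links on page q, and d is a damping factor of about 0.85, iterated until the scores stop changing.)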

i92guboj 09-30-2009 04:01 PM

Quote:

Originally Posted by beckettisdogg (Post 3702154)
I agree. Google probably needs more than a hundred of the latest, fastest server machines.

For example, Google stores a copy of every HTML file (not sure if they store .js and .css files) and every copy of PowerPoint slides, MS Word documents, and Adobe PDF files. Google somehow converts the PowerPoint slides and PDF files to HTML, and each converted file is usually bigger than 100 KB.

Now THAT's going to require a damn big amount of space,

but as I stated,

I am going to save none of that in my image search engine!! Not even the content of the HTML files! I don't have to! I don't need to! (Lastly, I can't afford to.)

All I am going to save in the index is the actual address of the picture, the keywords associated with the picture, and the thumbnail associated with the picture. Thumbnails are stored on the server and will usually be about 5 KB in size.

You still haven't told us how you intend to look inside every single file available on the World Wide Web. If you think you can do this without a dedicated cluster of some thousands of machines (or more), you are wrong. It doesn't matter that you are not storing the files; you still need to download them to parse and catalog them, just as you need to download a web page to view it in your browser. It's not like it magically appears on your screen from the other side of the world. The fact is that you don't have the infrastructure needed to scan the whole world.

Now, if you tell me that you are going to limit the bot (or whatever it is) to one site of medium size, then it starts to be realistic, but even with only one site the project might be out of your reach (think of deviantART, for example).

Quote:

And there are open source spiders around the internet!! IBM developerWorks also shows how to make a spider step by step (but I admit I did not quite understand the steps for building a web spider; IBM developerWorks articles are usually very difficult for me).
Developing a spider is not the problem (provided you have the knowledge and guts to do so). The problem is that no matter how smart you are and how much you know, a PC can't deal with the whole WWW, no matter how powerful it is. You could design a spider that just visits sites and prints their names to a text file, and even that would take years to finish on a PC. By the time it ends, the info in that text file will no longer match reality, and you need to start again. The size of the web is no joke. Some rough estimates:

http://www.worldwidewebsize.com/

Quote:

There is an image search engine named "Imagery". It is gaining popularity nowadays, and it looks like it was built by one man using various GNU-licensed programs. What makes Imagery special is that its interface has tabs. For example, you can search for "Penguin", then create another tab within Imagery and search for "Panda", and both searches will keep running. How does Imagery make a profit? It receives donations from visitors; it has a "tip jar".
I know nothing about Imagery, let alone its licensing model or how it works. But, as I said, the problem is not knowledge; you can learn and do it, even if it takes you your whole life. The problem is that you want to build a skyscraper (a web search engine) using only one brick (your PC). If you put a fence around the problem, as said above, and limit yourself to one site of moderate size that you can use as a test bed, then you will have a better chance of reaching some goal. But your PC can't scan the whole web; if you truly want that, then parse the results of Google or some other engine and use that as a base for your project.

beckettisdogg 09-30-2009 04:30 PM

Actually, that's a great idea. The WWW might be gigantic, but if I limit my crawl to certain URLs, say about 100 quality websites around the WWW, it might return better quality results?

I also read that search engines do this:

they stop generating results once they reach 1000 results.
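
In shell terms, the limiting I have in mind is roughly this (seeds.txt is a hypothetical file holding my ~100 starting URLs, one per line, and the wget call is the same kind of thing as before):

Code:

~ $ # crawl only the sites I picked, one pass per seed
~ $ while read -r site; do wget -r -l 2 -A jpg,jpeg,png --wait=1 -P crawl "$site"; done < seeds.txt
~ $ # and cut any query off at 1000 matches
~ $ grep -i -m 1000 'TAGS:.*penguin' index.txt > results.txt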

i92guboj 09-30-2009 05:11 PM

Gigantic is too soft a word. Jupiter is gigantic; the Milky Way is simply beyond the human scale of measurement (for most humans). You can see Jupiter from your house with a bit of luck, if you know how and where to look, but that's all. To look beyond the near vicinity, very specialized equipment is needed. The WWW is exactly the same. Look, there are two limiting factors, assuming that knowledge is not one of them ;)
  • How fast is your connection? You almost certainly can't download the whole of deviantART in any reasonable amount of time; most image sites are very big because they accept user contributions and because they store, well, images, which are big by nature. Even a single site is probably more than your bandwidth can handle, and nothing your bots do will fix that. If the content can't reach the box where your hypothetical engine lives, then it can't be analyzed, parsed, cataloged or whatever else you want to do with it.
  • Second, even if your connection speed is amazing and you can download a million megabits per second, there's a limit to what your CPU, your RAM, and your storage (yes, even if you don't store the big pictures) are going to be able to handle. For a site like deviantART (which is probably near the top of the list when it comes to images), even the thumbnails for the whole collection would probably require more storage space than you've ever had on a PC. Now, note that there are tens of billions of web pages out there and deviantART is just *one* site, and you can start to realize how crazy the idea is. Back to the CPU: take a collection of, let's say, a hundred photos you have sitting in your folders, then use ImageMagick to resize all of them with the command below and measure the time it takes. Oh, and don't forget that while it's doing so, your CPU will be burning in hell. Imagine that running 24/7 on a PC: it won't only wear out your CPU, it will also provide you with some extra pain in the form of the monthly electric bill.

Code:

~ $ cd tmp
~/tmp $ cp ../wallpapers/blue.jpg .
~/tmp $ for i in $(seq 1 100); do cp blue.jpg $i.jpg; done
~/tmp $ identify blue.jpg
blue.jpg JPEG 1600x1200 1600x1200+0+0 8-bit DirectClass 860kb
$ echo "Starting operation: $(date)" && for i in *.jpg; do convert -resize 256x200 $i ${i%.jpg}_thumb.jpg; done && echo "Ending operation: $(date)"
Starting operation: Wed Sep 30 23:56:56 CEST 2009
Ending operation: Wed Sep 30 23:59:16 CEST 2009

That's well over two minutes for 100 images of what nowadays is a standard size. Sites like deviantART, Flickr and the like have *millions* or even thousands of millions of them, and some are waaaaay bigger than this. Note that as the size of the image grows, the time needed to process it grows quickly, which means that huge images can take a very long time to be resized for thumbnailing, even minutes or more on an average desktop machine (and that's assuming they don't exceed your RAM and the whole thing segfaults, which is very probable unless you impose a restriction on the size of the images to be processed).
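
To put some back-of-the-envelope numbers on it: 100 images in about 140 seconds is roughly 1.4 seconds per image, so 100 million images would need on the order of 140 million seconds of CPU time, which is about four and a half years on a single machine. And even at only 5 KB per thumbnail, those same 100 million images add up to around 500 GB of thumbnails alone.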

And that's just *one* piece of the puzzle: thumbnailing, which is probably the least of your problems.

If you truly want to do this for educational purposes, then that's fine, but choose one site to start with, and one that's not very big. In any case, you are free to try...

chrism01 09-30-2009 06:05 PM

You could start by just indexing friends' websites. That would be manageable and teach you a lot(!).
Just to note that even Google only indexes a fraction of the 'net; it's simply too big, even for them.

beckettisdogg 09-30-2009 11:34 PM

Of course,

I am going to set limits.

I am going to borrow some space on a dedicated server first.

I will not accept images below a resolution of 300x400 (the default Google Images search automatically skips images smaller than 200x300).

I will not accept images smaller than 20 KB.

There is absolutely, fabulously no reason to scan the entire web. I just have to give the users the best images to make them happy.
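
Roughly, I picture enforcing those limits with ImageMagick's identify right after the spider drops the files somewhere (the crawl/ directory is made up, so treat this as a sketch):

Code:

~ $ # delete anything smaller than 300x400 pixels or under 20 KB (jpg shown; same idea for the other types)
~ $ for f in crawl/*.jpg; do
>       read w h <<< "$(identify -format '%w %h' "$f")"
>       size=$(stat -c %s "$f")
>       if [ "$w" -lt 300 ] || [ "$h" -lt 400 ] || [ "$size" -lt 20480 ]; then rm "$f"; fi
> done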

but anyways did I win the award for silliest question ever been asked on Linux Questions?

beckettisdogg 10-01-2009 05:10 PM

If I look at GeoCities (currently owned by Yahoo)

It provides unlimited disk space and bandwidth for about 10 dollars per month (the equivalent of a couple of McDonald's combos).

http://order.sbs.yahoo.com/ds/LearnM...mecta_20090701

And considering what kind of company currently owns GeoCities, it won't go down anytime soon.

beckettisdogg 10-02-2009 09:56 PM

Just wanted to say I appreciate all the help from you gurus on LinuxQuestions. I do want to build an image search engine. I recently got out of school and I am currently working as a teacher. I do want to make a career out of this, but getting hired by companies is very difficult nowadays because of outsourcing. And even if I do get hired, the pay won't be enough to keep me happy.

