LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   make database of website on internet (https://www.linuxquestions.org/questions/linux-newbie-8/make-database-of-website-on-internet-4175422911/)

ac_kumar 08-19-2012 02:49 PM

make database of website on internet
 
Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database, with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.
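
A minimal sketch of what the core of such a script could look like: fetch a page, pull out its meta keywords, write one row to MySQL. This assumes Python 3 and the mysql-connector-python package; the table layout, credentials and URL are illustrative, not anything specified in the thread.

Code:

# Minimal sketch: fetch one page, pull its meta keywords, store one row in MySQL.
# Assumes Python 3, the mysql-connector-python package, and an existing table such as
#   CREATE TABLE pages (url VARCHAR(255) PRIMARY KEY, keywords TEXT);
# Database name, credentials and URL below are illustrative.
from html.parser import HTMLParser
from urllib.request import urlopen

import mysql.connector


class MetaKeywordParser(HTMLParser):
    """Collects the content of <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords" and attrs.get("content"):
                self.keywords.append(attrs["content"])


def index_page(url, cursor):
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = MetaKeywordParser()
    parser.feed(html)
    # REPLACE so that re-crawling a URL simply overwrites the old row.
    cursor.execute(
        "REPLACE INTO pages (url, keywords) VALUES (%s, %s)",
        (url, ", ".join(parser.keywords)),
    )


if __name__ == "__main__":
    conn = mysql.connector.connect(
        host="localhost", user="crawler", password="secret", database="webindex"
    )
    cur = conn.cursor()
    index_page("http://www.example.com/", cur)
    conn.commit()
    conn.close()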

TobiSGD 08-19-2012 03:22 PM

According to estimates the web currently contains about 7 billion webpages (http://www.worldwidewebsize.com/). If you assume an average data volume of 200 bytes per page (address + meta tags), you would need about 1.4 terabytes of disk space just for that index; keep any of the actual page content and you are quickly into the petabyte range. So the first thing you should do is buy a huge number of hard disks and servers to host your database.
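
For reference, the arithmetic behind that estimate, plus what happens once actual page content is stored as well; the 200 KB average full-page size is an assumed figure, not something measured here.

Code:

# Back-of-envelope storage estimate; the 200 KB full-page figure is an assumption.
pages = 7_000_000_000          # ~7 billion pages (worldwidewebsize.com estimate)
meta_only = pages * 200        # ~200 bytes per page for URL + meta tags
full_pages = pages * 200_000   # ~200 KB per page if the content is kept too

print(f"meta only: {meta_only / 1e12:.1f} TB")    # ~1.4 TB
print(f"full pages: {full_pages / 1e15:.1f} PB")  # ~1.4 PB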

devnull10 08-19-2012 05:10 PM

Also be prepared to wait a little while...

Code:

tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01--  http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"

    [ <=>                                                      ] 13,143      --.-K/s  in 0.03s 

2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]


real    0m0.116s
user    0m0.005s
sys    0m0.000s

and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc. Even with multiple threads, you're looking at quite a number of YEARS of processing.
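
To put a rough number on that, here is the back-of-envelope calculation; the per-page latency and thread count are assumptions, chosen to be on the optimistic side.

Code:

# Rough crawl-time estimate; per-page latency and thread count are assumed values.
pages = 7_000_000_000     # ~7 billion pages
seconds_per_page = 0.1    # optimistic: the wget run above took ~0.1 s end to end
threads = 10              # modest parallelism for a single machine and connection

seconds = pages * seconds_per_page / threads
years = seconds / (60 * 60 * 24 * 365)
print(f"about {years:.1f} years")  # ~2.2 years, before parsing and database writes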

TB0ne 08-19-2012 05:12 PM

Quote:

Originally Posted by ac_kumar (Post 4758457)
Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database, with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.

Ahh...that's already been done. It's called "Google".

earthnet 08-20-2012 10:30 AM

Quote:

Originally Posted by ac_kumar (Post 4758457)
Any help is appreciated.

Help with what? You didn't ask any questions.

ac_kumar 08-20-2012 11:46 AM

Quote:

Originally Posted by TB0ne (Post 4758523)
Ahh...that's already been done. It's called "Google".

Do you think I don't know about Google?
If you were advising Linus Torvalds, you would have said "why make Linux, we have already invented DOS".
I tell you one thing: re-inventing has made technology go further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.

ac_kumar 08-20-2012 12:00 PM

Quote:

Originally Posted by devnull10 (Post 4758519)
Also be prepared to wait a little while...

Code:

tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01--  http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"

    [ <=>                                                      ] 13,143      --.-K/s  in 0.03s 

2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]


real    0m0.116s
user    0m0.005s
sys    0m0.000s

and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc. Even with multiple threads, you're looking at quite a number of YEARS of processing.

Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.

TB0ne 08-20-2012 01:42 PM

Quote:

Originally Posted by ac_kumar
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.

Did you read/look at the man pages for time and wget? They will explain what the commands do and what the options mean. The time command shows how long the given command takes to run. In this case, wget with -O - writes the page to standard output, which is then redirected to /dev/null.
Quote:

Originally Posted by ac_kumar (Post 4759326)
Do you think I don't know about Google?

Apparently not, since you're asking how to re-create it.
Quote:

If you were advising Linus Torvalds, you would have said "why make Linux, we have already invented DOS".
No, since the only way Linux is like DOS is that they both have a command line. There is no duplication of functionality.
Quote:

I tell you one thing: re-inventing has made technology go further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.
Yes, each is better in some way than what came before. What, exactly, is going to be different and better about what you're doing? There are MANY web crawlers you can find easily, written in pretty much every programming language. What language do you want to write this in, and what problems are you having now? You've essentially asked a very open-ended question with MANY different answers.

guyonearth 08-20-2012 03:00 PM

The technical problems of doing what you asked about have been explained. In short, your idea makes no sense given that it has already been done many times by search engines like Google. Given the nature of your question it wouldn't appear you are going to be inventing a better mousetrap any time soon.

devnull10 08-20-2012 03:49 PM

Quote:

Originally Posted by ac_kumar (Post 4759344)
Could you please explain what this command is doing. as far as I know wget get pages from internet.

Yes, I was merely illustrating the time it takes for a moderately powered PC on a fairly fast internet connection to return a single small, fast webpage. Scale that up and account for slower responses and you're looking at years and years of processing. Sure, you can have a "vision", but what we are trying to tell you is that in all reality it's pretty much asking for the impossible - Google does a good job of it, but it is by no means perfect.
How do you intend to traverse sites? By developing a robot which recursively parses the links on each site it finds? Then you have to check whether you have already visited every link, or else you'll end up with cycles (which could be massively long and wasteful) - a sketch of that visited-set approach follows this post.

This is a serious comment - if you have several hundred thousand pounds to simply start this project off, never mind fund it, and are able to pay a full-time team of analysts, developers, etc., then you might get a small way into doing it.
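
A minimal sketch of the kind of robot being described here: a breadth-first crawl that keeps a visited set so no link is followed twice and cycles cannot trap it. It uses only the Python standard library; the function name and the max_pages cap are illustrative.

Code:

# Sketch of a breadth-first crawler with a visited set; minimal error handling.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, max_pages=100):
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue                       # skip pages that fail or time out
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links against the page
            if absolute.startswith("http") and absolute not in visited:
                queue.append(absolute)
    return visited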

ac_kumar 08-21-2012 02:00 PM

Thank you all for the very helpful answers.
See, I am very fascinated by how Google works, and yes, sometimes I don't find it useful.
So I was thinking that I could make a scaled-down web search engine to experiment with.
As for the storage problem, I can manage: add a few websites to the database, then work from there.
I just want to do this project for fun.
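
For a small experiment like that, the sketches earlier in the thread could be combined: seed the crawl with a handful of sites and cap the page count. crawl() and index_page() are the hypothetical helpers from those sketches, and cur/conn are the open MySQL cursor and connection from the first one.

Code:

# Small-scale experiment: a few seed sites, a low page cap.
# crawl(), index_page(), cur and conn are the hypothetical names sketched above.
seeds = ["http://www.example.com/", "http://www.example.org/"]

for seed in seeds:
    for url in crawl(seed, max_pages=50):  # stay small while experimenting
        index_page(url, cur)
conn.commit()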

TobiSGD 08-21-2012 02:38 PM

If you are doing it just for fun, have a look here: http://www.udacity.com/overview/Course/cs101
It is a series of video tutorials in which they build a search engine using Python to teach the basics of computer science.

