08-19-2012, 03:49 PM | #1
Member | Registered: Aug 2011 | Distribution: Ubuntu, Fedora | Posts: 175
Make a database of websites on the internet
Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database, with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.

08-19-2012, 04:22 PM | #2
Moderator | Registered: Dec 2009 | Location: Germany | Distribution: Whatever fits the task best | Posts: 17,148
According to estimates the web currently contains about 7 billion webpages (http://www.worldwidewebsize.com/). If you assume an average data volume of 200 bytes per page (address + meta tags), that alone is about 1.4 terabytes of disk space, and storing any actual page content on top of that quickly pushes you into the petabyte range. So the first thing you should do is buy a huge number of harddisks and servers to host your database.
Last edited by TobiSGD; 08-19-2012 at 06:12 PM.
Reason: fixed typo
1 member found this post helpful.
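For reference, a quick back-of-envelope calculation of that estimate in Python; the 200-byte figure is the one from the post above, and the 200 KB full-page figure is just an assumed round number for comparison:
Code:
# Storage back-of-envelope for ~7 billion pages (all figures are rough assumptions).
pages = 7_000_000_000

meta_only = pages * 200          # ~200 bytes per page: address + meta keywords
full_page = pages * 200_000      # ~200 KB per page if the whole HTML were kept

print(f"Address + meta tags only: {meta_only / 1e12:.1f} TB")   # ~1.4 TB
print(f"Full page content:        {full_page / 1e15:.1f} PB")   # ~1.4 PB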

08-19-2012, 06:10 PM | #3
Member | Registered: Jan 2010 | Location: Lancashire | Distribution: Slackware Stable | Posts: 572
Also be prepared to wait a little while...
Code:
tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01-- http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"
[ <=> ] 13,143 --.-K/s in 0.03s
2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]
real 0m0.116s
user 0m0.005s
sys 0m0.000s
and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc. Even with multiple threads, you're looking at quite a number of YEARS of processing.
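To put a rough number on that, here is a small Python sketch; the page count is the estimate quoted earlier in the thread, the per-page time is the single wget run above, and the thread count is an arbitrary assumption:
Code:
# Crawl-time back-of-envelope (all inputs are rough assumptions from this thread).
pages = 7_000_000_000        # ~7 billion pages (worldwidewebsize.com estimate)
seconds_per_page = 0.116     # the single-page wget timing shown above
threads = 8                  # assumed number of parallel fetchers

total_seconds = pages * seconds_per_page / threads
years = total_seconds / (365 * 24 * 3600)
print(f"Roughly {years:.1f} years with {threads} parallel fetchers")   # ~3.2 years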

08-19-2012, 06:12 PM | #4
LQ Guru | Registered: Jul 2003 | Location: Birmingham, Alabama | Distribution: SuSE, RedHat, Slack, CentOS | Posts: 27,428
Quote:
Originally Posted by ac_kumar
Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database, with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.
Ahh...that's already been done. It's called "Google".
2 members found this post helpful.

08-20-2012, 11:30 AM | #5
Member | Registered: Jul 2012 | Distribution: OpenSUSE | Posts: 36
Quote:
Originally Posted by ac_kumar
Any help is appreciated.
Help with what? You didn't ask any questions.

08-20-2012, 12:46 PM | #6
Member | Registered: Aug 2011 | Distribution: Ubuntu, Fedora | Posts: 175 | Original Poster
Quote:
Originally Posted by TB0ne
Ahh...that's already been done. It's called "Google".
Do you think I don't know about Google?
If you were advising Linus Torvalds, you would have said: why make Linux when we have already invented DOS?
I'll tell you one thing: re-inventing has pushed technology further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.

08-20-2012, 01:00 PM | #7
Member | Registered: Aug 2011 | Distribution: Ubuntu, Fedora | Posts: 175 | Original Poster
Quote:
Originally Posted by devnull10
Also be prepared to wait a little while...
Code:
tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01-- http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"
[ <=> ] 13,143 --.-K/s in 0.03s
2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]
real 0m0.116s
user 0m0.005s
sys 0m0.000s
and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc. Even with multiple threads, you're looking at quite a number of YEARS of processing.
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.

08-20-2012, 02:42 PM | #8
LQ Guru | Registered: Jul 2003 | Location: Birmingham, Alabama | Distribution: SuSE, RedHat, Slack, CentOS | Posts: 27,428
Quote:
Originally Posted by ac_kumar
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.
Did you read/look at the man pages for time and wget? They will explain what the commands do and what the options mean. The time command shows how long the given command takes to run. In this case, wget with -O - writes the page to standard output, which is then redirected to /dev/null.
Quote:
Originally Posted by ac_kumar
Do you think I don't know about Google?
Apparently not, since you're asking how to re-create it.
Quote:
If you were advising Linus Torvalds, you would have said: why make Linux when we have already invented DOS?
No, since the only way Linux is like DOS is that they both have a command line. There is no duplication of functionality.
Quote:
I'll tell you one thing: re-inventing has pushed technology further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.
Yes, each is better in some way than what came before. What, exactly, is going to be different and better about what you're doing? There are MANY web-crawlers you can find easily, written in pretty much every programming language. What language are you wanting to write this in, and what problems are you having now? You've essentially asked a very open-ended question with MANY different answers.
Last edited by TB0ne; 08-20-2012 at 02:46 PM.
1 member found this post helpful.

08-20-2012, 04:00 PM | #9
Member | Registered: Jun 2012 | Location: USA | Distribution: Ubuntu | Posts: 424
The technical problems of doing what you asked about have been explained. In short, your idea makes no sense given that it has already been done many times by search engines like Google. Given the nature of your question it wouldn't appear you are going to be inventing a better mousetrap any time soon.
Last edited by guyonearth; 08-20-2012 at 05:11 PM.
1 member found this post helpful.

08-20-2012, 04:49 PM | #10
Member | Registered: Jan 2010 | Location: Lancashire | Distribution: Slackware Stable | Posts: 572
Quote:
Originally Posted by ac_kumar
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.
Yes, I was merely illustrating the time it takes for a moderately powered PC on a fairly fast internet connection to return a single small, fast webpage. Scale that up and account for slower responses and you're looking at years and years of processing. Sure, you can have a "vision", but what we are trying to tell you is that in reality it's pretty much asking the impossible - Google does a good job of it, but it's not perfect by any means.
How do you intend to traverse sites? By developing a robot which recursively parses the links on each site it finds? Then you have to check whether you have already visited each link, otherwise you'll end up with cycles (which could be massively long and wasteful).
This is a serious comment - if you have several hundred thousand pounds just to start this project off, never mind fund it, and are able to fund a full-time team of analysts, developers and so on, then you might get a small way into doing it.
2 members found this post helpful.
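To make the "robot with a visited set" idea concrete, here is a minimal Python sketch of that traversal. It is deliberately stripped down: standard library only, no robots.txt handling, no rate limiting, no error recovery beyond skipping failed pages, and the seed URL is just a placeholder.
Code:
# Minimal breadth-first crawler sketch: follow links, avoid cycles with a visited set,
# and pull out the <meta name="keywords"> content for each page.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects outgoing links and the content of <meta name="keywords">."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.keywords = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""

def crawl(seed, max_pages=50):
    visited = set()
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue                              # already seen: avoids cycles
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                              # skip pages that fail or time out
        parser = PageParser()
        parser.feed(html)
        print(url, "->", parser.keywords)         # here you would insert into the database
        for link in parser.links:
            queue.append(urljoin(url, link))      # resolve relative links against the page

if __name__ == "__main__":
    crawl("http://www.example.com/")              # placeholder seed URL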

08-21-2012, 03:00 PM | #11
Member | Registered: Aug 2011 | Distribution: Ubuntu, Fedora | Posts: 175 | Original Poster
Thank you all for the very helpful answers.
See, I am very fascinated by how Google works, and yes, sometimes I don't find it useful.
So I was thinking that I could make a scaled-down web search engine to experiment with.
As for the storage problem, I can manage by adding a few websites to the database first and then working onwards from there.
I just want to do this project for fun.
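For the "add a few websites to the database first" step, one possible minimal layout is a single table of URL plus keywords. The sketch below uses Python's built-in sqlite3 purely so it runs without a MySQL server; the table and column names are made up, and the same idea carries over to MySQL with only small SQL changes.
Code:
# Minimal "URL + keywords" store; sqlite3 is used here as a stand-in for MySQL.
# Table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect("pages.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url      TEXT PRIMARY KEY,
        keywords TEXT
    )
""")

def save_page(url, keywords):
    # One row per URL; in MySQL you would use INSERT ... ON DUPLICATE KEY UPDATE.
    conn.execute("INSERT OR REPLACE INTO pages (url, keywords) VALUES (?, ?)",
                 (url, keywords))
    conn.commit()

save_page("http://www.example.com/", "example, demo, placeholder")
for row in conn.execute("SELECT url, keywords FROM pages"):
    print(row)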

08-21-2012, 03:38 PM | #12
Moderator | Registered: Dec 2009 | Location: Germany | Distribution: Whatever fits the task best | Posts: 17,148
If you are doing it just for fun, have a look here: http://www.udacity.com/overview/Course/cs101
It is a series of video tutorials in which they build a search engine in Python to teach the basics of computer science.
2 members found this post helpful.
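For a sense of where that kind of course ends up: once pages and their keywords have been collected, the heart of a toy search engine is an inverted index that maps each keyword back to the URLs that carry it. A minimal Python version, with made-up data, looks like this:
Code:
# Toy inverted index: keyword -> set of URLs. The data is made up for illustration.
from collections import defaultdict

index = defaultdict(set)

def add_page(url, keywords):
    # keywords is a comma-separated string, e.g. from a <meta name="keywords"> tag
    for word in keywords.lower().split(","):
        index[word.strip()].add(url)

def search(word):
    return sorted(index.get(word.strip().lower(), set()))

add_page("http://www.example.com/", "linux, tutorials, shell")
add_page("http://www.example.org/", "linux, kernel")
print(search("linux"))   # ['http://www.example.com/', 'http://www.example.org/']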