LinuxQuestions.org
Linux - Newbie This Linux forum is for members who are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!

Old 08-19-2012, 03:49 PM   #1
ac_kumar
Member
 
Registered: Aug 2011
Distribution: Ubuntu, Fedora
Posts: 175

Rep: Reputation: 9
make a database of websites on the internet


Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.
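A minimal sketch of the pipeline being described: extract the meta keywords from a page's HTML and store them per URL. Everything here is illustrative — SQLite stands in for MySQL so the snippet is self-contained, and the sites table is hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    # Collects the content attribute of <meta name="keywords" ...> tags.
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "keywords" and d.get("content"):
                self.keywords.append(d["content"])

def index_page(conn, url, html):
    # Parse the meta keywords out of html and store them against url.
    parser = MetaKeywordParser()
    parser.feed(html)
    conn.execute("INSERT INTO sites (url, keywords) VALUES (?, ?)",
                 (url, ", ".join(parser.keywords)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (url TEXT, keywords TEXT)")
page = '<html><head><meta name="keywords" content="linux, forum"></head></html>'
index_page(conn, "http://example.com/", page)
print(conn.execute("SELECT url, keywords FROM sites").fetchall())
# → [('http://example.com/', 'linux, forum')]
```

Swapping in MySQL would change only the connection setup and the placeholder style (%s instead of ?).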
 
Old 08-19-2012, 04:22 PM   #2
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Germany
Distribution: Whatever fits the task best
Posts: 17,130
Blog Entries: 2

Rep: Reputation: 4825
According to estimates the web currently contains about 7 billion webpages (http://www.worldwidewebsize.com/). If you assume an average data volume of 200 bytes per page (address + meta tags) you would need about 1.4 terabytes of disk space for the metadata alone, and you still have to fetch and parse every page to get those tags; storing page content for indexing quickly runs into the petabyte range. So the first thing you should do is to buy a huge number of harddisks and servers to host your database.
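The same estimate as back-of-envelope arithmetic (all figures are assumptions, not measurements):

```python
pages = 7_000_000_000        # estimated pages on the web (worldwidewebsize.com)
bytes_per_entry = 200        # assumed: URL + meta keywords only
index_tb = pages * bytes_per_entry / 1e12
print(f"metadata index: ~{index_tb:.1f} TB")       # → metadata index: ~1.4 TB

avg_page_bytes = 200_000     # assumed average full-page size to fetch and parse
content_pb = pages * avg_page_bytes / 1e15
print(f"full page content: ~{content_pb:.1f} PB")  # → full page content: ~1.4 PB
```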

Last edited by TobiSGD; 08-19-2012 at 06:12 PM. Reason: fixed typo
 
1 member found this post helpful.
Old 08-19-2012, 06:10 PM   #3
devnull10
Member
 
Registered: Jan 2010
Location: Lancashire
Distribution: Slackware Stable
Posts: 548

Rep: Reputation: 116
Also be prepared to wait a little while...

Code:
 tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01--  http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"

    [ <=>                                                       ] 13,143      --.-K/s   in 0.03s   

2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]


real    0m0.116s
user    0m0.005s
sys     0m0.000s
and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc. Even with multiple threads, you're looking at quite a number of YEARS of processing.
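Extrapolating that single fetch gives a feel for the scale (assumed figures: roughly 0.1 s per page, 7 billion pages, a hypothetical 100 concurrent fetchers):

```python
pages = 7_000_000_000
seconds_per_page = 0.1        # roughly the wget timing shown above
workers = 100                 # hypothetical number of concurrent fetchers

sequential_years = pages * seconds_per_page / (3600 * 24 * 365)
print(f"sequential: ~{sequential_years:.0f} years")        # → sequential: ~22 years
print(f"{workers} workers: ~{sequential_years / workers:.2f} years")
```

Even the parallel figure ignores parsing, database writes, and slow or unreachable hosts.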
 
Old 08-19-2012, 06:12 PM   #4
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 17,926

Rep: Reputation: 3690
Quote:
Originally Posted by ac_kumar
Hi, I want to make a script which searches for all the websites on the internet and then adds them to a MySQL database with their meta tags as keywords.
Any help is appreciated.
Thanks in advance.
Ahh...that's already been done. It's called "Google".
 
2 members found this post helpful.
Old 08-20-2012, 11:30 AM   #5
earthnet
Member
 
Registered: Jul 2012
Distribution: OpenSUSE
Posts: 36

Rep: Reputation: Disabled
Quote:
Originally Posted by ac_kumar
Any help is appreciated.
Help with what? You didn't ask any questions.
 
Old 08-20-2012, 12:46 PM   #6
ac_kumar
Member
 
Registered: Aug 2011
Distribution: Ubuntu, Fedora
Posts: 175

Original Poster
Rep: Reputation: 9
Quote:
Originally Posted by TB0ne
Ahh...that's already been done. It's called "Google".
Do you think I don't know about Google?
If you were advising Linus Torvalds, you would have said: why make Linux when we have already invented DOS?
I tell you one thing: re-inventing has made technology go further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.
 
Old 08-20-2012, 01:00 PM   #7
ac_kumar
Member
 
Registered: Aug 2011
Distribution: Ubuntu, Fedora
Posts: 175

Original Poster
Rep: Reputation: 9
Quote:
Originally Posted by devnull10
Also be prepared to wait a little while...

Code:
 tmp $ time wget -O - www.google.co.uk > /dev/null
--2012-08-19 23:11:01--  http://www.google.co.uk/
Resolving www.google.co.uk (www.google.co.uk)... 173.194.67.94, 2a00:1450:4007:803::101f
Connecting to www.google.co.uk (www.google.co.uk)|173.194.67.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "STDOUT"

    [ <=>                                                       ] 13,143      --.-K/s   in 0.03s   

2012-08-19 23:11:01 (504 KB/s) - written to stdout [13143]


real    0m0.116s
user    0m0.005s
sys     0m0.000s
and that's with a pretty fast site and a relatively small page. Add to that the time taken to parse the data, write to disk, wait for MySQL to release its resources, etc. Even with multiple threads, you're looking at quite a number of YEARS of processing.
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.
 
Old 08-20-2012, 02:42 PM   #8
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 17,926

Rep: Reputation: 3690
Quote:
Originally Posted by ac_kumar
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.
Did you read the man pages for time and wget? They explain what the commands do and what the options mean. The time command shows you how long the given command takes. In this case, wget's "-O -" writes the downloaded page to standard output, which is then redirected to /dev/null.
Quote:
Originally Posted by ac_kumar
Do you think I don't know about Google?
Apparently not, since you're asking how to re-create it.
Quote:
If you were advising Linus Torvalds, you would have said: why make Linux when we have already invented DOS?
No, since the only way Linux is like DOS is that they both have a command line. There is no duplication of functionality.
Quote:
I tell you one thing: re-inventing has made technology go further, e.g. light bulbs to LED bulbs, stone wheels to rubber wheels.
Yes, each is better in some way than what came before. What, exactly, is going to be different and better about what you're doing? There are MANY web crawlers you can find easily, written in pretty much every programming language. What language do you want to write this in, and what problems are you having now? You've essentially asked a very open-ended question, with MANY different answers.

Last edited by TB0ne; 08-20-2012 at 02:46 PM.
 
1 member found this post helpful.
Old 08-20-2012, 04:00 PM   #9
guyonearth
Member
 
Registered: Jun 2012
Location: USA
Distribution: Mint
Posts: 410

Rep: Reputation: 82
The technical problems of doing what you asked about have been explained. In short, your idea makes no sense given that it has already been done many times by search engines like Google. Given the nature of your question it wouldn't appear you are going to be inventing a better mousetrap any time soon.

Last edited by guyonearth; 08-20-2012 at 05:11 PM.
 
1 member found this post helpful.
Old 08-20-2012, 04:49 PM   #10
devnull10
Member
 
Registered: Jan 2010
Location: Lancashire
Distribution: Slackware Stable
Posts: 548

Rep: Reputation: 116
Quote:
Originally Posted by ac_kumar
Could you please explain what this command is doing? As far as I know, wget gets pages from the internet.
Yes, I was merely illustrating the time it takes for a moderately powered PC on a fairly fast internet connection to retrieve a single small, fast webpage. Scale that up and account for slower responses and you're looking at years and years of processing. Sure, you can have a "vision", but what we are trying to tell you is that in all reality it's pretty much asking the impossible. Google does a good job of it, but is by no means perfect.
How do you intend to traverse sites? By developing a robot which recursively parses the links on each site it finds? Then you have to check whether you have already visited each link, or else you'll end up in cycles (which could be massively long and wasteful).
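The traversal just described is essentially breadth-first search over links with a visited set; a sketch of the cycle-avoidance logic (the fetch_links function is injected, so no real network code is assumed):

```python
from collections import deque

def crawl(start_url, fetch_links, limit=1000):
    # Breadth-first traversal; fetch_links(url) returns the links found on a page.
    seen = {start_url}            # visited set: this is what breaks link cycles
    queue = deque([start_url])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip pages we have already queued
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "web" with a cycle (a <-> b) to show the crawl still terminates.
site = {"a": ["b", "c"], "b": ["a"], "c": []}
print(crawl("a", lambda url: site.get(url, [])))   # → ['a', 'b', 'c']
```

Without the seen check, the a <-> b cycle would loop forever; with it, each page is fetched exactly once.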

This is a serious comment: if you have several hundred thousand pounds simply to start this project off, never mind fund it, and are able to fund a full-time team of analysts, developers, etc., then you might get a small way into doing it.
 
2 members found this post helpful.
Old 08-21-2012, 03:00 PM   #11
ac_kumar
Member
 
Registered: Aug 2011
Distribution: Ubuntu, Fedora
Posts: 175

Original Poster
Rep: Reputation: 9
Thank you all for the very helpful answers.
See, I am very fascinated by how Google works, and yes, sometimes I don't find it useful.
So I was thinking that I could make a scaled-down web search engine as an experiment.
As for the storage problem, I can manage: add a few websites to the database first, then work onward from there.
I just want to do this project for fun.
 
Old 08-21-2012, 03:38 PM   #12
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Germany
Distribution: Whatever fits the task best
Posts: 17,130
Blog Entries: 2

Rep: Reputation: 4825
If you are doing it just for fun, have a look here: http://www.udacity.com/overview/Course/cs101
It is a series of video tutorials in which they build a search engine in Python to teach the basics of computer science.
 
2 members found this post helpful.
  

