What is this Hadoop? Why is it getting so popular? What does Hadoop do?
What is this Hadoop and why is it getting so popular?
Hadoop is Elephant.
Elephants are symbol of Wisdom.
Everybody want Wisdom.
Quote:
Originally Posted by cyberdome
What is this Big Data 2.0?
It is a buzzword.
For Consultant use in Buzzword Bingo.
Clients eyes glaze over you Win!!!
Quote:
Originally Posted by cyberdome
What is it that Hadoop does?
Distributed file system, distributed processing of chunks of data, framework to glue functionality together.
*BTW, the vendor's page isn't what you listed; it's at http://hadoop.apache.org/, so it's best to start there or with a "HOWTO" tutorial. (Or, yes, just try Cloudera.)
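To make "distributed processing of chunks of data" a bit more concrete, here is a toy, single-process sketch of the map/shuffle/reduce pattern that Hadoop's MapReduce framework is built around. It is not Hadoop code and does not use any Hadoop API: the function names (split_into_chunks, map_chunk, shuffle, reduce_counts, word_count) and the sample input are made up for illustration, and everything runs on one machine instead of a cluster.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    """Split the input into fixed-size chunks, one per (imaginary) worker."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # On a real cluster each chunk would be mapped on a different machine;
    # here the "workers" are just iterations of a list comprehension.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["elephants are wise", "hadoop is an elephant", "elephants never forget"]
    print(word_count(sample))
```

The point of the pattern is that map_chunk only ever looks at its own chunk, so the chunks can be handed to as many machines as you have; only the shuffle step needs to move data between them.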
As the well-written vendor page implies, Hadoop is one of several tools for implementing "massively parallel operations" across clusters of, perhaps, thousands of machines ... without having a "single point of failure." In fact, these technologies are built to assume that machines will fail, that power-cords will from time to time be accidentally kicked by someone who's working behind the frame in which all these machines (they're really "blades" ... circuit-boards ...) are mounted. Any machine might fail at any time, you don't know which one(s), and the processing needs to reliably continue.
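The "assume machines will fail" idea can be sketched in a few lines. This is only an illustration of rescheduling a task on another machine, not how Hadoop's scheduler is actually implemented: WorkerFailed, run_on_worker, run_with_retries, and the worker IDs are all hypothetical names invented for the example.

```python
class WorkerFailed(Exception):
    """Raised when a simulated worker is down."""


def run_on_worker(worker_id, task, failed_workers):
    """Run task() on the given worker, or raise if that worker is down."""
    if worker_id in failed_workers:
        raise WorkerFailed(f"worker {worker_id} is down")
    return task()


def run_with_retries(task, workers, failed_workers):
    """Try each worker in turn until one of them completes the task."""
    for worker_id in workers:
        try:
            return worker_id, run_on_worker(worker_id, task, failed_workers)
        except WorkerFailed:
            continue  # someone kicked the power cord; try the next machine
    raise RuntimeError("no healthy worker left")


if __name__ == "__main__":
    workers = [0, 1, 2, 3]
    failed = {0, 1}  # we do not know in advance which machines are dead
    print(run_with_retries(lambda: sum(range(10)), workers, failed))
```

The caller never needs to know which machines died; it only sees that the task eventually finished somewhere.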
The processing that is being done is of such a nature that each CPU in the cluster has work to do, and the ability to do it more-or-less autonomously, such that the various CPUs are constantly sharing data among themselves. Several CPUs might be working on the same thing, or parts of the same thing. Yes, we are "throwing silicon at it," because, "chips are cheap (now)."
"Big Data" refers to what these machines are actually doing: trying to transform:
Quote:
Originally Posted by Be Afraid, be Very Afraid:
a minute-by-minute log of where every individual in New York City (and every other city in the world) was standing, plus-or-minus seven feet, dutifully collected by the "apps" on their cell-phone and transmitted ... with neither their consent nor their knowledge ... to a "Big Data" data-center ... somewhere ... where this most-intimate data is available to "someone." A person who might be wearing a clean "white" hat, or ... most-decidedly "not."
I put that in quotes to emphasize it, of course, but also as a lead-in to the observation that "this sort of thing, IMHO, is not likely to continue for too much longer." Big Data is big business right now, but be aware that a lot of what's being done today might soon become illegal. All of this stuff is so new (literally, "to human history") that the jurisprudence doesn't exist yet. But, it is coming.
Another equally-interesting (to me) observation that is beginning to come to light is: "does 'all of this computing' really give us a business advantage in selling pretzels?" It's an axiom of psychology that the presence of the experimenter affects the experiment. (And that "the mouse will do as he damn well pleases ...") People are dimly becoming aware that their every move is being sliced-and-diced, that they start getting junk-mail from funeral homes when they mention in a text-message to someone else that a friend has just died, and so forth. They're throwing-away, unread, more than 95% of the e-mails that they receive, and two-thirds of the "targeted" letters that come in the mail. The more ubiquitous "big data" is, the more noticeable it is, and the less effective it is with regard to actual human populations.
Last edited by sundialsvcs; 05-07-2015 at 07:08 AM.
From your postings, it's easy to tell that you're fairly passionate on this topic.
However, making thousands of computers work together on crunching numbers isn't in itself a bad thing. For example, there's a distributed computing network (actually a few) that works on simulating the folding of proteins to give scientists ideas to try to cure cancer. Other uses for such a system include a render farm (if you're a company like Disney), Bitcoin mining (though the power costs would probably not be worth it), or just seeing how many FPS you can get on a game (though network latency will become a problem).
So having a large amount of computing power at your fingertips isn't in itself a bad thing. And it's not as if the computing cluster itself is collecting GPS data about you.
Very well said. So, if I set up an Ubuntu LAMP server with Hadoop, then for large data I don't need a SQL database, an Oracle database, or any other type of database? I can save a large 5-terabyte file inside a Hadoop directory?
Correct me if I am wrong. So, Hadoop is basically saving extremely large files in a directory? Then how does a normal user retrieve information from Hadoop?
No, Hadoop is not "merely an enormous file-system." It is a massively-parallel, fault-tolerant, workload management system. It is designed to process large amounts of data and to do so using large clusters of computing engines. You wouldn't set up one server with Hadoop. You'd set up hundreds, or thousands.
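On the 5-terabyte question: Hadoop's file system (HDFS) does not keep a big file in one place. It splits the file into blocks and stores several copies of each block on different machines, which is also why a single server makes little sense. The sketch below only illustrates that idea. The 128 MB block size and three replicas are assumed values (common HDFS defaults, but check your own version's configuration), the round-robin placement is a simplification and not HDFS's real placement policy, and place_blocks, readable_after_failure, and the node names are hypothetical.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]} using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Each replica of a block lands on a different node.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, dead_nodes):
    """The file stays readable if every block has at least one live replica."""
    return all(
        any(node not in dead_nodes for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    five_tb = 5 * 1024**4          # assumed file size from the question above
    block = 128 * 1024**2          # assumed block size (128 MB)
    nodes = ["n1", "n2", "n3", "n4"]  # hypothetical node names
    placement = place_blocks(five_tb, block, nodes)
    print(len(placement), "blocks")
    print(readable_after_failure(placement, {"n1", "n2"}))
```

With these assumed numbers the file becomes 40960 blocks, and losing any two of the four nodes still leaves at least one copy of every block.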