LinuxQuestions.org - What is this Hadoop? Why is it getting so popular? What does Hadoop do?

Quote:

Originally Posted by cyberdome (Post 5357478)

What is this Hadoop and why is it getting so popular?

Hadoop is Elephant.
Elephants are symbol of Wisdom.
Everybody want Wisdom.

Quote:

Originally Posted by cyberdome (Post 5357478)

What is this Big Data 2.0?

It is a buzzword.
For Consultant use in Buzzword Bingo.
Clients eyes glaze over you Win!!!

Quote:

Originally Posted by cyberdome (Post 5357478)

What is it that Hadoop does?

Distributed file system, distributed processing of chunks of data, framework to glue functionality together.
*BTW, the vendor's page isn't the one you listed; it's at http://hadoop.apache.org/, so it's best to start there or with a "HOWTO" tutorial. (Or, yes, just try Cloudera.)

As the well-written vendor page implies, Hadoop is one of several tools for implementing "massively parallel operations" across clusters of, perhaps, thousands of machines ... without having a "single point of failure." In fact, these technologies are built to assume that machines will fail, that power-cords will from time to time be accidentally kicked by someone who's working behind the frame in which all these machines (they're really "blades" ... circuit-boards ...) are mounted. Any machine might fail at any time, and you don't know which one(s), yet the processing needs to continue reliably.
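To make the "assume machines will fail" idea concrete, here is a toy sketch in plain Python. It is not Hadoop's actual scheduler; the worker names, the `failed` set, and both helper functions are made up for illustration. The only point is that a task whose machine has died gets handed to another machine, and the overall job still finishes.

```python
def run_on_worker(worker_id, task, failed_workers):
    """Pretend to run a task on one machine; raise if that machine is down."""
    if worker_id in failed_workers:
        raise ConnectionError(f"worker {worker_id} is unreachable")
    return sum(task)  # the "work" here is just summing a chunk of numbers


def run_with_retries(task, workers, failed_workers):
    """Try each worker in turn until one of them finishes the task."""
    for worker_id in workers:
        try:
            return run_on_worker(worker_id, task, failed_workers)
        except ConnectionError:
            continue  # that machine is gone; reschedule the task elsewhere
    raise RuntimeError("no healthy worker left for this task")


workers = ["node-1", "node-2", "node-3"]  # hypothetical machine names
failed = {"node-1"}                       # someone kicked the power cord
chunks = [[1, 2, 3], [4, 5], [6]]

results = [run_with_retries(chunk, workers, failed) for chunk in chunks]
total = sum(results)
print(results, total)
```

Even with node-1 dead, every chunk gets processed by a surviving node. A real cluster adds heartbeats, timeouts, and data replication on top, but the rescheduling idea is the same.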

The processing that is being done is of such a nature that each CPU in the cluster has work to do, and the ability to do it more-or-less autonomously, such that the various CPUs are constantly sharing data among themselves. Several CPUs might be working on the same thing, or parts of the same thing. Yes, we are "throwing silicon at it," because, "chips are cheap (now)."
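A rough picture of "each CPU has its own piece of the work" is the classic word count, sketched here in plain Python rather than with Hadoop's own API. The function names and the sample chunks are invented for the example, and everything runs in one process; on a cluster each chunk would go to a different machine.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


chunks = [
    "elephants are wise",
    "hadoop is an elephant",
    "elephants never forget",
]

# On a real cluster, each map_chunk call could run on a separate machine.
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce(merge_counts, partials, Counter())
print(totals)
```

Because no map call needs to know about any other chunk, you can add machines to go faster, and a lost chunk can simply be recomputed.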

"Big Data" refers to what these machines are actually doing: trying to transform:

Quote:

Originally Posted by Be Afraid, be Very Afraid:

a minute-by-minute log of where every individual in New York City (and every other city in the world) was standing, plus-or-minus seven feet, dutifully collected by the "apps" on their cell-phone and transmitted ... with neither their consent nor their knowledge ... to a "Big Data" data-center ... somewhere ... where this most-intimate data is available to "someone." A person who might be wearing a clean "white" hat, or ... most-decidedly "not."

I put that in quotes to emphasize it, of course, but also as a lead-in to the observation that "this sort of thing, IMHO, is not likely to continue for too much longer." Big Data is big business right now, but be aware that a lot of what's being done today might soon become illegal. All of this stuff is so new (literally, "to human history") that the jurisprudence doesn't exist yet. But, it is coming.

Another equally-interesting (to me) observation that is beginning to come to light is: "does 'all of this computing' really give us business advantage in selling pretzels?" It's an axiom of psychology that the presence of the experimenter affects the experiment. (And that "the mouse will do as he damn well pleases ...") People are dimly becoming aware that their every move is being sliced-and-diced, that they start getting junk-mail from funeral homes when they mention in a text-message to someone else that a friend has just died, and so forth. They're throwing away, unread, more than 95% of the e-mails that they receive, and two-thirds of the "targeted" letters that come in the mail. The more ubiquitous "big data" is, the more noticeable it is, and the less effective it is with regard to actual human populations.

Quote:

Originally Posted by sundialsvcs (Post 5359081)

Quote:

Originally Posted by Be Afraid, be Very Afraid

a minute-by-minute log of where every individual in New York City (and every other city in the world) was standing, plus-or-minus seven feet, dutifully collected by the "apps" on their cell-phone and transmitted ... with neither their consent nor their knowledge ... to a "Big Data" data-center ... somewhere ... where this most-intimate data is available to "someone." A person who might be wearing a clean "white" hat, or ... most-decidedly "not."

From your postings, it's easy to tell that you're fairly passionate on this topic.
However, making thousands of computers all work on crunching numbers together isn't in itself a bad thing. For example, there's a distributed computing network (actually a few) that works on simulating the folding of proteins to give scientists ideas to try in curing cancer. Other uses for such a system include a render farm (if you're a company like Disney), Bitcoin mining (though the power costs would probably not be worth it), or just seeing how many FPS you can get in a game (though network latency will become a problem).

So having a large amount of computing power at your fingertips isn't in itself a bad thing. And it's not as if the computing cluster itself is collecting GPS data about you.

Of course. Of course. Of course.

(Hey, I'm not that paranoid! Really. No, really.) ;)

I am concerned ... profoundly concerned ... about the activities that I see "data-mining" being most-commonly used for.

Quote:

"We have thrown our societies headlong from the cliff of what is now possible, straight into the pit of the unforeseen."

Quote:

"Elephants are symbol of Wisdom."

Very well said. So, if I set up an Ubuntu LAMP server with Hadoop, then for large data I don't need a SQL database, an Oracle database, or any other type of database? I can save a large 5-terabyte file inside a Hadoop directory?
Correct me if I am wrong. So, Hadoop is basically saving extremely large files in a directory? Then how does a normal user retrieve information from Hadoop?

No, Hadoop is not "merely an enormous file-system." It is a massively-parallel, fault-tolerant, workload management system. It is designed to process large amounts of data and to do so using large clusters of computing engines. You wouldn't set up one server with Hadoop. You'd set up hundreds, or thousands.
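Some back-of-the-envelope arithmetic shows why one server doesn't make sense for the 5-terabyte file asked about above. HDFS cuts a file into blocks and stores several copies of each block on different machines. The sketch below assumes the stock defaults of 128 MB blocks and 3 replicas (both are configurable, via `dfs.blocksize` and `dfs.replication`) and treats "5 terabytes" as 5 TiB. The `hdfs_footprint` helper is a name made up for this example, not part of Hadoop.

```python
BLOCK_SIZE = 128 * 1024**2  # assumed: 128 MiB, the default in Hadoop 2 and later
REPLICATION = 3             # assumed: the default replication factor


def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = -(-file_bytes // block_size)  # ceiling division
    replicas = blocks * replication
    raw_bytes = file_bytes * replication
    return blocks, replicas, raw_bytes


five_tib = 5 * 1024**4
blocks, replicas, raw_bytes = hdfs_footprint(five_tib)
print(blocks, replicas, raw_bytes // 1024**4)  # 40960 122880 15
```

So under those assumptions the file becomes 40,960 blocks, stored as 122,880 block copies taking about 15 TiB of raw disk. Spreading those copies over many machines is what lets the cluster survive failures and work on many blocks at once.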