Understanding Hadoop



Everyone says Hadoop is a new technology that deals with Big Data, and that it contains many frameworks and applications, such as:



HADOOP HDFS,
HIVE, HBASE, PIG
MAHOUT, ZOOKEEPER, 
OOZIE, FLUME, SQOOP


Confusing??

Technology is more powerful today than ever and is leading us towards greater innovation. Everyone in this world uses technology to minimize personal effort, to stay connected, to communicate, and the list goes on.

Day by day technology is changing, and people need to be more innovative in terms of growing their business, saving time, saving costs, etc.

Let's understand what this Big Data is all about.

Big Data: put 'big' aside and let's talk about 'data' first.

What is data?

Data is raw, unorganized facts that need to be processed. Basically, it is useless until it is organized properly.

Universal example: A Log File.

In a log file you typically see some of the following:

A series of events or actions, each prefixed with a timestamp.
A set of exceptions/errors that occurred when something abnormal happened on the system.

Other examples of data:
CSV extract files, flat files, table extracts, Excel files, etc.

These files basically contain data that needs to be processed to get meaningful information out of it.

A traditional database management system can process these files at high speed, but only up to a certain data size. When the size of the data increases, the processing time also increases due to read/write overheads.

A traditional database management system is also not suitable for unstructured data when the data volume increases,
for example,
JSON documents,
large volumes of videos,
pictures, etc.

The cost of a traditional database management system also increases when it comes to scalability: if we want our application distributed onto multiple servers, scaling an RDBMS is chaos.

As all the above examples show, traditional database management systems such as Oracle, SQL Server, etc. are suitable only when the data is limited in size and structured.

What happens when the data size increases?

When this question arose, a new term was coined to deal with data of bigger sizes: Big Data.

Let us look at some facts:

1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Facebook hosts approximately 15 billion photos, comprising about one petabyte of storage.
3. Twitter feeds generate about 10 terabytes of data per day (or roughly 100 MB per second).
4. Google processed around 35 petabytes of data per day as of August 2013.

Now let's get into the theory.

Big Data:

A term for a collection of data sets so large and complex that it becomes difficult to process them using traditional database management systems.




To solve the problems related to Big Data, a new framework called "Hadoop" was developed by Apache.


What is Hadoop?

"Hadoop an open source software framwork for storage and processing of larger data sets on clusters of commodity hardware."

Can we store files in Hadoop?

Yes. We can store files in Hadoop, and we can also store replicas of those files.
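
To make this concrete, here is a minimal sketch (with made-up paths and a made-up replication factor of 3) of how a file can be copied into HDFS using the Hadoop Java FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutFileIntoHdfs {
        public static void main(String[] args) throws Exception {
            // Picks up cluster settings (e.g. fs.defaultFS) from core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths: copy a local log file into HDFS
            Path local = new Path("/tmp/app.log");
            Path remote = new Path("/user/demo/logs/app.log");
            fs.copyFromLocalFile(local, remote);

            // Ask HDFS to keep three replicas of this file
            fs.setReplication(remote, (short) 3);

            fs.close();
        }
    }

HDFS replicates each block of the file across different machines in the cluster, which is what gives Hadoop its fault tolerance.
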
How can we process files in Hadoop?

When you place a file in HDFS, you can process that file in multiple ways.

Ex: shell scripts, Oozie workflows, MapReduce jobs, Hive jobs, etc.
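
For instance, here is a small sketch (again with a hypothetical file path) that opens a file already stored in HDFS from a plain Java program and counts its lines:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CountHdfsLines {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file that was placed in HDFS earlier
            Path file = new Path("/user/demo/logs/app.log");

            long lines = 0;
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                while (reader.readLine() != null) {
                    lines++;    // count every line in the file
                }
            }
            System.out.println("Lines in " + file + ": " + lines);
            fs.close();
        }
    }

Simple reads like this are fine for small files; for really big data sets you would hand the processing over to MapReduce or Hive so the work runs in parallel across the cluster.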

Can we connect a Database to Hadoop?

Yes. We can connect to a database using the Sqoop connector.

Sqoop input: a database URL with credentials, database info, and table info.

Sqoop output: the table extract is stored as a file in HDFS.
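
As an example, a typical Sqoop import command (with placeholder connection details and table name) looks roughly like this:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username dbuser \
      --password dbpass \
      --table orders \
      --target-dir /user/demo/orders

Sqoop reads the table over JDBC and writes the rows out as files under the given HDFS directory.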

Can we use any programming language while processing data in Hadoop?

Yes. Java is widely used for working with Hadoop.

What is MapReduce in Java?

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. 
A MapReduce program is composed of a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation.
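
The classic illustration is word count: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums those counts per word. The sketch below follows the shape of the standard Hadoop tutorial example; class names and details are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: split each input line into words and emit (word, 1)
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum all the counts emitted for the same word
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

When this runs on a cluster, the framework splits the input across many map tasks, groups all the values emitted for the same word, and feeds each group to a reduce task, so the counting happens in parallel across machines.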

Continued in the next post.
