About me: I am Ashish Rastogi, an enthusiastic techie with 5+ years of IT experience. I started working on Hadoop in early 2015, and it changed my career track to AI. Believe me, Hadoop is as simple as a cup of tea; one only needs practice, practice and practice!!
So let's learn about Big Data and Hadoop. Let's start...
What is Big Data?
Big Data is a term used for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
It is characterized by 5 V’s.
VOLUME: Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace.
VELOCITY: Velocity is defined as the pace at which different sources generate the data every day. This flow of data is massive and continuous.
VARIETY: As there are many sources which are contributing to Big Data, the type of data they are generating is different. It can be structured, semi-structured or unstructured.
VALUE: It is all well and good to have access to big data, but unless we can turn it into value it is useless. The goal is to find insights in the data and benefit from them.
VERACITY: Veracity refers to doubt or uncertainty about the available data, caused by data inconsistency and incompleteness.
What is Hadoop & its architecture?
Hadoop is an open-source framework that stores Big Data in a distributed fashion across a cluster (HDFS) and processes it in parallel (YARN and MapReduce).

[Figure: Hadoop Architecture]
The main components of HDFS are NameNode and DataNode.
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live. It keeps a record of all the blocks in HDFS and in which nodes these blocks are stored.
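To make this a bit more concrete, here is a minimal sketch of my own (not from any official tutorial) that asks for exactly the kind of metadata the NameNode keeps, using the standard HDFS Java API. The path /data/sample.txt is a hypothetical placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/sample.txt");      // hypothetical file path

        FileStatus status = fs.getFileStatus(path);    // metadata answered by the NameNode
        System.out.println("Size: " + status.getLen());
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Permissions: " + status.getPermission());

        // Which DataNodes hold each block of this file
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Every call above is served from the NameNode's metadata; no file contents are actually read.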
DataNode
These are slave daemons which run on each slave machine. The actual data is stored on DataNodes. They are responsible for serving read and write requests from the clients. They are also responsible for creating blocks, deleting blocks and replicating them based on the decisions taken by the NameNode.
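Likewise, a minimal write sketch: the client asks the NameNode where to write, then streams the bytes to DataNodes, which replicate each block. The path and the replication factor of 3 below are just illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/notes.txt");  // hypothetical path

        // Create the file with a 3x replication factor and a 128 MB block size.
        // The bytes go to the DataNodes; the NameNode only records the metadata.
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, 128 * 1024 * 1024L);
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}
```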
For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.
ResourceManager
It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules the applications running on top of YARN.
NodeManager
It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring the resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the ResourceManager to remain up to date.
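For a feel of the ResourceManager's cluster-wide view, here is a small sketch using the YarnClient API to list the live NodeManagers and what each one is running. It assumes a reachable cluster configured via yarn-site.xml on the classpath:

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());  // points at the ResourceManager via yarn-site.xml
        client.start();

        // Each NodeReport is the ResourceManager's view of one NodeManager
        List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  containers=" + node.getNumContainers()
                    + "  used=" + node.getUsed()
                    + "  capacity=" + node.getCapability());
        }
        client.stop();
    }
}
```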
So, you can perform parallel processing on HDFS using MapReduce.
MapReduce
It is the core component of processing in the Hadoop ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. A MapReduce program has two functions, Map() and Reduce(). The Map function performs actions like filtering, grouping and sorting, while the Reduce function aggregates and summarizes the result produced by the Map function. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
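The classic example is word count. Below is a minimal sketch of it against the Hadoop MapReduce Java API: the Map function tokenizes each line and emits (word, 1) pairs, and the Reduce function sums the counts for each word. Input and output paths come from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);  // (K, V) pair handed to the Reduce phase
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a jar and submit it with the hadoop jar command, e.g. hadoop jar wordcount.jar WordCount /input /output (the paths here are placeholders).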
Some useful articles I personally recommend:
1) Hadoop Basics
2) Prerequisites for learning Hadoop

