Monday, September 3, 2018

Spark vs Hadoop - The Differences

I am happy to share my knowledge on Apache Spark and Hadoop. It is a well-known argument that Spark is ideal for real-time processing whereas Hadoop is preferred for batch processing. The best part of Spark is its compatibility with Hadoop.

SPARK VS HADOOP
Spark performs better than Hadoop when:
  1. data size ranges from GBs to PBs
  2. the algorithmic complexity varies, from ETL to SQL to machine learning
  3. jobs range from low-latency streaming to long-running batch
  4. processing data regardless of storage medium, be it disks, SSDs, or memory
Outside these scenarios, Hadoop can outperform Spark.
For example, when the data is small (~100 MB), or when the data is already sorted, Hadoop MapReduce can sometimes be faster because the mapping happens directly on the data nodes.
Hadoop is used for batch processing, whereas Spark can be used for both batch and real-time processing. Where only batch processing is required, Hadoop users can run MapReduce jobs. In theory, Spark can perform everything that Hadoop can, and more, so the choice between Hadoop and Spark often comes down to familiarity and comfort.
SIMILARITIES BETWEEN SPARK AND HADOOP
Let us look at how using both together can be better than siding with any one technology.
Figure: Components of Spark Hadoop

Hadoop components can be used alongside Spark in the following ways:
  1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
  2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
  3. YARN: Spark applications can be made to run on YARN (Hadoop NextGen).
  4. Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.
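To make this concrete, here is a minimal PySpark sketch showing points 1 and 3 from the list in action, assuming a cluster where YARN is configured; the NameNode address and file path below are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Run Spark on YARN and read input directly from HDFS.
# The HDFS URI and file path are made-up examples.
spark = (
    SparkSession.builder
    .appName("spark-on-hadoop-demo")
    .master("yarn")  # let YARN (Hadoop NextGen) allocate the executors
    .getOrCreate()
)

# Spark reads the file straight out of HDFS's replicated storage.
lines = spark.read.text("hdfs://namenode:8020/data/sample.txt")
print(lines.count())

spark.stop()
```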

Now let us look at the features of Hadoop which make it valuable to use with Spark.
Figure: Features of Hadoop

Given the above features of Hadoop, it makes sense to use it alongside Spark, as it provides an excellent storage system through HDFS and scales to whatever extent we require.
Now let us look at how exactly Spark and Hadoop work together.

ARCHITECTURE OF SPARK HADOOP SYSTEM
We can see how Spark uses the best parts of Hadoop: HDFS for reading and storing data, MapReduce for optional processing, and YARN for resource allocation.
So, we know that Hadoop can be used with Spark. But the big question is whether to use Hadoop at all, given the claim that Spark is up to 100 times faster than Hadoop at processing. Right?
To understand this, let us look at the below bar chart.
This chart depicts the performance of Spark vs Hadoop. We can see that Spark (in red) is clearly faster than Hadoop (in blue). But the green bar takes the least time: that is the case where Apache Spark is used alongside controlled partitioning of the data in Hadoop.
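As a rough illustration of what controlled partitioning looks like on the Spark side, the sketch below hash-partitions a pair RDD by key so that all records for a key land in the same partition and later aggregations avoid an extra shuffle; the data and partition count are made up for the example:

```python
from pyspark import SparkContext

sc = SparkContext(appName="partitioning-demo")

# Hypothetical (key, value) records; in practice these would come from HDFS.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

# Controlled partitioning: hash each key into one of 4 partitions and keep
# that layout cached, so key-based operations reuse it without re-shuffling.
partitioned = pairs.partitionBy(4).cache()

print(partitioned.reduceByKey(lambda x, y: x + y).collect())
sc.stop()
```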
To conclude, a combined Spark-Hadoop setup takes the best parts of Hadoop, namely YARN for resource management and HDFS for storage, and pairs them with Spark's processing engine, making things easy for everyone already familiar with Hadoop. Spark is not considered a replacement for Hadoop but an extension to it; indeed, Hadoop MapReduce can sometimes process data faster than Spark.

Wednesday, August 29, 2018

Hadoop Basics

About me: I am Ashish Rastogi, an enthusiastic techie with 5+ years of IT experience. I started working on Hadoop in early 2015, and it changed my career track to AI. Believe me, Hadoop is as simple as a cup of tea; one only needs practice, practice and practice!

So let's learn about Big Data and Hadoop! Let's start...

What is Big Data?

Big Data is a term used for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.

It is characterized by 5 V’s.

VOLUME: Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace.

VELOCITY: Velocity is defined as the pace at which different sources generate the data every day. This flow of data is massive and continuous.

VARIETY: As there are many sources which are contributing to Big Data, the type of data they are generating is different. It can be structured, semi-structured or unstructured.

VALUE: It is all well and good to have access to big data, but unless we can turn it into value, it is useless. We need to find insights in the data and derive benefit from them.

VERACITY: Veracity refers to doubt or uncertainty about the data, arising from inconsistency and incompleteness.


What is Hadoop & its architecture?

Hadoop is an open-source framework that stores very large data sets across clusters of commodity machines through HDFS, its distributed file system, and processes them in parallel. Its two main layers are HDFS for storage and YARN with MapReduce for processing.

Hadoop Architecture 

The main components of HDFS are NameNode and DataNode.

NameNode

It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place to the file system metadata.

For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live. It keeps a record of all the blocks in HDFS and in which nodes these blocks are stored.
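This block metadata can be inspected from the command line. Below is a small sketch, assuming a Hadoop client is installed on the machine; the file path is a placeholder:

```python
import subprocess

# 'hdfs fsck' reports, for a file, the blocks it is split into and the
# DataNodes holding each replica: the metadata the NameNode maintains.
report = subprocess.run(
    ["hdfs", "fsck", "/data/sample.txt", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)
```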

DataNode

These are slave daemons which run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from the clients. They are also responsible for creating blocks, deleting blocks and replicating them based on the decisions taken by the NameNode.

For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.

ResourceManager

It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.

NodeManager

It is a node level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with ResourceManager to remain up-to-date.
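To see how an application interacts with these daemons, here is a hedged PySpark sketch: the application asks the ResourceManager for a set of executor containers, and each NodeManager launches and monitors the containers on its own node. The resource numbers are arbitrary:

```python
from pyspark.sql import SparkSession

# The ResourceManager schedules these containers across the cluster;
# each NodeManager then launches and monitors them on its node.
spark = (
    SparkSession.builder
    .appName("yarn-resources-demo")
    .master("yarn")
    .config("spark.executor.instances", "4")  # number of containers requested
    .config("spark.executor.memory", "2g")    # memory per container
    .config("spark.executor.cores", "2")      # vcores per container
    .getOrCreate()
)
```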

So, you can perform parallel processing on HDFS using MapReduce.

MapReduce

It is the core component of processing in the Hadoop ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. In a MapReduce program, Map() and Reduce() are two functions. The Map function performs actions like filtering, grouping and sorting, while the Reduce function aggregates and summarizes the result produced by the Map function. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.
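To make the (K, V) flow concrete, here is a small self-contained Python sketch of the classic word count; it simulates the map, shuffle and reduce phases in a single process, which is not how Hadoop runs them (they are distributed across nodes), but the logic is the same:

```python
from itertools import groupby
from operator import itemgetter

lines = ["deer bear river", "car car river", "deer car bear"]

# Map phase: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: bring all pairs with the same key together
# (in Hadoop, the framework does this between the Map and Reduce tasks).
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate the values for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```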



Some useful articles I personally recommend:

1) Hadoop Basics

2) Pre-Requisites for learning hadoop