Monday, September 3, 2018

Spark vs Hadoop - The Differences

I am happy to share my knowledge of Apache Spark and Hadoop. It is a well-known argument that Spark is ideal for real-time processing, whereas Hadoop is preferred for batch processing. The best part of Spark is its compatibility with Hadoop.

SPARK VS HADOOP
Spark performs better than Hadoop when:
  1. the data size ranges from GBs to PBs
  2. the algorithmic complexity varies, from ETL to SQL to machine learning
  3. the workload ranges from low-latency streaming jobs to long batch jobs
  4. data must be processed regardless of storage medium, be it disks, SSDs, or memory
Outside these cases, Hadoop can outperform Spark. For example, when the data set is small (~100 MB), or when the data is already sorted, MapReduce can sometimes be faster because the mapping happens directly on the data nodes.
Hadoop is used for batch processing, whereas Spark can be used for both batch and real-time processing. Hadoop users can run MapReduce jobs where batch processing is required. In theory, Spark can do everything Hadoop can and more, so the choice between them often comes down to familiarity.
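The batch model that MapReduce implements can be illustrated without a cluster. Below is a minimal pure-Python sketch of the map, shuffle, and reduce phases, using the canonical word-count example; the function names are illustrative only and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word, as a Hadoop mapper would.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group values by key, simulating the move of all records
    # for one key onto the same reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark and hadoop", "hadoop batch processing", "spark streaming"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
print(counts["spark"])   # 2
```

In a real Hadoop job the three phases run across many data nodes; the structure of the computation, however, is exactly this pipeline.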
SIMILARITIES BETWEEN SPARK AND HADOOP
Let us look at how using both together can be better than siding with any one technology.
Figure: Components of Spark Hadoop

Hadoop components can be used alongside Spark in the following ways:
  1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
  2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
  3. YARN: Spark applications can be made to run on YARN (Hadoop NextGen).
  4. Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.
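As a concrete illustration of points 1 and 3 above, here is a hedged PySpark sketch of an application that runs on YARN and reads from HDFS. It is a configuration sketch, not a runnable local example: it assumes an existing Hadoop cluster, and the queue name and HDFS path are placeholders, not values from this article.

```python
from pyspark.sql import SparkSession

# Sketch: a Spark application submitted to YARN, reading from HDFS.
# "default" and the hdfs:// path are placeholders for your cluster's values.
spark = (
    SparkSession.builder
    .appName("spark-on-hadoop-sketch")
    .master("yarn")                        # YARN handles resource allocation
    .config("spark.yarn.queue", "default")
    .getOrCreate()
)

# HDFS provides the distributed, replicated storage layer.
df = spark.read.text("hdfs:///data/input/logs.txt")
print(df.count())

spark.stop()
```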

Now let us look at the features of Hadoop which make it valuable to use with Spark.
Figure: Features of Hadoop

Given the above features of Hadoop, it makes sense to use it alongside Spark: it provides an excellent storage system through HDFS and scales to whatever extent we require.
Now let us look at how exactly Spark and Hadoop work together.

ARCHITECTURE OF SPARK HADOOP SYSTEM
We can see how Spark uses the best parts of Hadoop: HDFS for reading and storing data, MapReduce for optional processing, and YARN for resource allocation.
So we know that Hadoop can be used with Spark. But the big question is whether to use Hadoop at all, given that Spark is reported to be up to 100 times faster than Hadoop for in-memory processing. Right?
To understand this, let us look at the bar chart below.
The chart compares the performance of Spark and Hadoop. Spark alone is clearly much faster than Hadoop MapReduce, but the shortest running time belongs to a third configuration: Apache Spark used alongside controlled partitioning in Hadoop.
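The idea behind controlled partitioning can be sketched in plain Python: hashing each key to a fixed partition keeps all records for that key together, so a per-key aggregation needs no cross-partition shuffle. The `hash_partition` helper below is illustrative only, not a Spark or Hadoop API.

```python
def hash_partition(records, num_partitions):
    # Assign each (key, value) record to a partition by hashing its key,
    # so every record with the same key lands in the same partition --
    # the core idea of controlled partitioning.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

records = [("user1", 10), ("user2", 5), ("user1", 7), ("user3", 2)]
partitions = hash_partition(records, num_partitions=4)

# All of user1's records share one partition, so summing user1's values
# touches a single partition and requires no data movement.
user1_parts = {i for i, part in enumerate(partitions)
               for key, _ in part if key == "user1"}
print(len(user1_parts))  # 1
```

When the partitioning of the stored data already matches the keys a job aggregates on, Spark can skip the expensive shuffle step entirely, which is why the combined configuration wins in the chart.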
To conclude, a combined Spark-Hadoop setup can take the best parts of Hadoop, YARN for resource management and HDFS for storage, and pair them with Spark's processing, making things easy for everyone already familiar with Hadoop. Spark is best seen not as a replacement for Hadoop but as an extension of it; indeed, Hadoop MapReduce can sometimes process data faster than Spark.
