5.6 Hadoop vs. Spark

Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks.

For many years, Hadoop was the leading open-source Big Data framework, but recently the newer and more advanced Spark has become the more popular of the two Apache Software Foundation tools.

However, they do not perform exactly the same tasks, and they are not mutually exclusive, as they are able to work together. Although Spark is reported to work up to 100 times faster than Hadoop in certain circumstances, it does not provide its own distributed storage system.

Distributed storage is fundamental to many of today’s Big Data projects, as it allows vast multi-petabyte datasets to be stored across a very large number of everyday computer hard drives, rather than on hugely costly custom machinery that would hold everything on one device. These systems are scalable, meaning that more drives can be added to the network as the dataset grows in size.
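To make the idea of spreading a dataset across many drives more concrete, here is a minimal sketch in plain Python. It is not how HDFS or any real system places data: the function name `place_blocks`, the drive names, the block size and the round-robin rule are all assumptions chosen for illustration.

```python
def place_blocks(total_size, block_size, drives, replicas=2):
    """Split a dataset into fixed-size blocks and assign each block to
    `replicas` distinct drives in round-robin order (illustrative only)."""
    if replicas > len(drives):
        raise ValueError("need at least as many drives as replicas")
    block_count = -(-total_size // block_size)  # ceiling division
    placement = {}
    for block in range(block_count):
        # Consecutive offsets guarantee the copies land on different drives.
        placement[block] = [
            drives[(block + offset) % len(drives)] for offset in range(replicas)
        ]
    return placement


# Hypothetical cluster of three drives holding a 1000-unit dataset.
drives = ["drive-a", "drive-b", "drive-c"]
placement = place_blocks(total_size=1000, block_size=256, drives=drives)

# Scaling out: adding a drive simply gives the placement rule more targets.
bigger_placement = place_blocks(1000, 256, drives + ["drive-d"])
```

Because every block is held on more than one drive, losing a single drive in this toy model does not lose any block, and adding a drive needs no change to the rule itself.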

As I mentioned, Spark does not include its own system for organizing files in a distributed way (the file system), so it requires one provided by a third party. For this reason, many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

What really gives Spark the edge over Hadoop is speed. Spark handles most of its operations “in memory”, copying the data from the distributed physical storage into far faster RAM. This reduces the amount of time-consuming writing to and reading from slow, clunky mechanical hard drives that needs to be done under Hadoop’s MapReduce system.
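The difference can be sketched in plain Python. This is only a toy model, not Spark or MapReduce code: the function names and the three example stages are assumptions, and a temporary JSON file stands in for the distributed disk storage. Both functions produce the same answer, but one makes a disk round trip after every stage.

```python
import json
import os
import tempfile


def run_with_disk_round_trips(records, stages):
    """Apply each stage, then write the result to disk and read it back
    before the next stage (loosely mimicking the MapReduce pattern)."""
    data = list(records)
    round_trips = 0
    with tempfile.TemporaryDirectory() as workdir:
        for index, stage in enumerate(stages):
            data = [stage(item) for item in data]
            path = os.path.join(workdir, f"stage_{index}.json")
            with open(path, "w") as handle:
                json.dump(data, handle)      # persist the intermediate result
            with open(path) as handle:
                data = json.load(handle)     # reload it for the next stage
            round_trips += 1
    return data, round_trips


def run_in_memory(records, stages):
    """Apply every stage in turn, keeping intermediate results in RAM."""
    data = list(records)
    for stage in stages:
        data = [stage(item) for item in data]
    return data, 0


# Three hypothetical processing stages applied to the numbers 0..4.
stages = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
disk_result, disk_trips = run_with_disk_round_trips(range(5), stages)
memory_result, memory_trips = run_in_memory(range(5), stages)
```

With three stages, the disk-based version performs three write-and-read cycles where the in-memory version performs none. On a real cluster those cycles involve far larger volumes of data, which is where the time goes.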

MapReduce writes all of the data back to the physical storage medium after each operation. This was originally done to ensure a full recovery could be made in case something went wrong, as data held electronically in RAM is more volatile than data stored magnetically on disks. However, Spark arranges data in what are known as Resilient Distributed Datasets (RDDs), which keep a record of how they were derived so that lost data can be rebuilt following a failure.
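The recovery idea can be illustrated with a small sketch: a dataset that remembers the steps used to build it can recompute a lost piece from the original input instead of relying on a saved copy. The class below is a hypothetical toy written in plain Python, not Spark's RDD API, and the names `LineageDataset` and `lose_partition` are invented for this example.

```python
class LineageDataset:
    """Toy stand-in for a resilient dataset: it records how it was derived."""

    def __init__(self, source_partitions, transformations=()):
        self._source = source_partitions          # durable input, e.g. files on disk
        self._transformations = tuple(transformations)
        self._cache = {}                          # partition index -> results held in RAM

    def map(self, func):
        # Record the step instead of running it; this list is the lineage.
        return LineageDataset(self._source, self._transformations + (func,))

    def compute_partition(self, index):
        # Rebuild from the source whenever the in-memory copy is missing.
        if index not in self._cache:
            data = list(self._source[index])
            for func in self._transformations:
                data = [func(item) for item in data]
            self._cache[index] = data
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a machine failure wiping one partition from memory.
        self._cache.pop(index, None)

    def collect(self):
        result = []
        for index in range(len(self._source)):
            result.extend(self.compute_partition(index))
        return result


source = [[1, 2], [3, 4], [5, 6]]                 # three partitions of input data
dataset = LineageDataset(source).map(lambda x: x * x).map(lambda x: x + 1)

before_failure = dataset.collect()
dataset.lose_partition(1)                         # the middle partition is lost
after_failure = dataset.collect()                 # recomputed from the lineage
```

Only the lost partition is recomputed here; the other two are served from memory, which is why this approach does not need to write every intermediate result to disk.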

Spark’s functionality for handling advanced data processing tasks such as real-time stream processing and machine learning is well ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard so that action can be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example, recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.
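As a rough illustration of the machinery-monitoring case, the sketch below processes each reading the moment it arrives and flags an alert when a rolling average crosses a limit. It is plain Python rather than Spark's streaming API, and the function name `monitor`, the window size, the threshold and the sample readings are all made up for the example.

```python
from collections import deque


def monitor(readings, window_size=3, threshold=80.0):
    """Yield (reading, rolling_average, alert) as each reading arrives."""
    window = deque(maxlen=window_size)   # keeps only the most recent readings
    for reading in readings:
        window.append(reading)
        average = sum(window) / len(window)
        yield reading, average, average > threshold


# Hypothetical temperature readings from one machine, in arrival order.
readings = [70.0, 75.0, 90.0, 95.0, 85.0]
results = list(monitor(readings))
alerts = [reading for reading, average, alert in results if alert]
```

Because `monitor` is a generator, each result is available as soon as its reading has been seen, which is the property a live dashboard depends on.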

Spark includes its own machine learning library, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example, Apache Mahout.

Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so they will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data consists merely of a huge amount of very structured data (e.g. customer names and addresses), you may not need the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and its security and support infrastructure is not as advanced.