Big Data Analytics is Confusing

Over the past 12 months I have spent a lot of time with the CIOs of the Global 2000 discussing Big Data Analytics and how to choose the right platform and vendor.  Aside from spending most of my time with these CIOs in education mode, their top two concerns are that there are too many vendors and solutions, and that the technology is moving too quickly to keep up with.

As an example, about 6 months ago, I convinced the CIO of a major multinational enterprise to consider migrating to Hadoop.  This past week, she called to say that her team of “experts” had informed her that they would like to consider Spark instead, and that she was confused.  I told her that her “experts” were partially correct: Spark is not really a replacement for Hadoop but rather an enhancement framework that allows for much quicker, real-time processing.  Unfortunately, my answer confused her even more!

Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark is designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, supporting real-time stream processing and fast interactive queries that finish within seconds.  Therefore, although it is confusing to CIOs and everyone else, the enterprise should consider Hadoop a general-purpose framework that supports multiple models, with Spark as a specialized enhancement to Hadoop rather than a replacement for it.
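The difference between the two models can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the actual Hadoop or Spark APIs: the batch function stands in for a long-running map/reduce job whose results arrive only at the end, while the streaming class stands in for Spark-style processing where results are available after every record.

```python
from collections import Counter

# Batch model (classic Hadoop MapReduce): one job over the full dataset;
# results are available only when the whole job finishes.
def batch_word_count(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Streaming model (closer to Spark's approach): state is updated per record,
# so a current answer is available within seconds of each arrival.
class StreamingWordCount:
    def __init__(self):
        self.counts = Counter()

    def on_record(self, line):
        self.counts.update(line.split())
        return dict(self.counts)  # a fresh snapshot after every record

stream = StreamingWordCount()
for line in ["big data", "big analytics"]:
    snapshot = stream.on_record(line)

# Over the same data, both models converge on the same answer; the difference
# is when that answer becomes available.
assert batch_word_count(["big data", "big analytics"]) == stream.counts
```

The point of the sketch: Spark does not change *what* is computed, only *when* intermediate answers are usable, which is why it complements rather than replaces Hadoop.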

To emphasize the confusion among Big Data Analytics platforms even more, Raj Kumar Maurya compiled and published an overview of the Top 15 Big Data Analysis Platforms on September 15, 2015 on the PCQuest website.

Following is Raj’s list:

Today, organizations face a lot of difficulty managing data because of the sheer size of their datasets. Data is coming from so many different sources, be it social media, sensors, e-mail, etc. These are all termed unstructured data and therefore cannot be managed by traditional database systems. In order to create, manipulate, and manage such ‘Big Data’, you need specialized tools. Here we present some of the key tools that are easily available today.

1.    Hadoop: This is Apache’s distributed data processing framework. It enables distributed processing of large data sets across a cluster of servers using simple programming models. It can scale from a single server to thousands, with each one offering local computation and storage and a high degree of fault tolerance. The Hadoop software library is designed to detect and handle failures at the application layer rather than rely on hardware. The base Apache Hadoop framework is composed of Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce.


2.     Hive: Hive was initially developed by Facebook but is now used and developed by other companies such as Netflix and Amazon. It is very similar to a database with SQL access, but it is built on top of Hadoop and MapReduce to provide data summarization, query, and analysis operations, with several key differences. The first is that you can expect very high latency, which means it is not appropriate for applications that need very fast response times; the second is that Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.


3.     Hortonworks Data Platform: This provides an enterprise-ready data platform that enables organizations to adopt a Modern Data Architecture. With YARN as its architectural center, it provides a data platform for multi-workload data processing across an array of processing methods. HDP includes rich data security, governance, and operations functionality that works across component technologies and integrates with pre-existing EDW, RDBMS and MPP systems.


4.     Lumify: It’s a project to create a big data fusion, analysis, and visualization platform. Its intuitive web-based interface helps users discover connections and explore relationships in their data via a suite of analytic options, including 2D and 3D graph visualizations, full-text faceted search, dynamic histograms, interactive geographic maps, and collaborative workspaces shared in real-time. It’s a way to aggregate your data, organize it, and extract useful insights. Analyze relationships, automatically discover paths between entities and establish new links in 2D or 3D.


5.     Zookeeper:  An open-source service providing configuration management, synchronization, and a naming registry for large distributed systems. ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a tree data structure. Clients can read from and write to the nodes. ZooKeeper provides an infrastructure for cross-node synchronization and can be used by applications to ensure that tasks across the Hadoop cluster are serialized or synchronized.  A ZooKeeper server is a machine that keeps a copy of the state of the entire system and preserves this information in local log files.
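The hierarchical name space described above can be pictured with a toy in-memory tree. This is a hypothetical pure-Python sketch, not ZooKeeper’s real client API: each “znode” holds a small payload and a set of named children, and paths look like file-system paths.

```python
# Toy sketch of ZooKeeper's hierarchical znode namespace (illustrative only).
class ZNode:
    def __init__(self, data=b""):
        self.data = data       # znodes carry a small data payload
        self.children = {}     # named children, like directory entries

class ZNodeTree:
    def __init__(self):
        self.root = ZNode()

    def create(self, path, data=b""):
        # As in ZooKeeper, the parent path must already exist.
        parts = [p for p in path.split("/") if p]
        node = self.root
        for part in parts[:-1]:
            node = node.children[part]
        node.children[parts[-1]] = ZNode(data)

    def get(self, path):
        node = self.root
        for part in (p for p in path.split("/") if p):
            node = node.children[part]
        return node.data

tree = ZNodeTree()
tree.create("/app")
tree.create("/app/config", b"max_workers=4")
print(tree.get("/app/config"))  # b'max_workers=4'
```

Real ZooKeeper adds what this sketch omits: replication across servers, ordered updates, and watches that notify clients when a node changes, which is what makes it usable for cluster-wide coordination.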


6.     Sqoop: A connectivity tool for easily importing data from non-Hadoop data stores, such as relational databases and data warehouses, into your Hadoop cluster and related Hadoop systems (such as Hive and HBase).  It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.  It’s a command-line interface tool, meaning you have to know the commands and type them directly into the system, rather than click on them with a mouse.


7.    Cloudera: Cloudera makes a commercial version of Hadoop. Although Hadoop is a free and open-source project for storing large amounts of data on inexpensive computer servers, the free version of Hadoop is not easy to use. Several companies have created friendlier versions of Hadoop, and Cloudera is arguably the most popular one. CDH includes the core elements of Apache Hadoop plus several additional key open source projects that, when coupled with customer support, management, and governance through a Cloudera Enterprise subscription, can deliver an enterprise data hub.


8.    Pig: Initially developed at Yahoo! to allow people using Apache Hadoop to focus more on analyzing large data sets and spend less time writing mapper and reducer programs. The Pig programming language is designed to handle any kind of data. It is made up of two components: the first is the language itself, called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.


9.    MapReduce: Originally developed by Google, MapReduce is a software framework for distributed processing of large data sets on computing clusters of commodity hardware. MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
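The map and reduce phases described above can be demonstrated in a single-process Python sketch. This is a conceptual illustration of the programming model, not Hadoop itself; a real framework runs the mappers and reducers in parallel across a cluster and handles the shuffle between them.

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce model (not Hadoop's API).
def map_phase(records, mapper):
    # The mapper emits (key, value) tuples for each input record.
    return [kv for record in records for kv in mapper(record)]

def shuffle(pairs):
    # Between map and reduce, the framework groups all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # The reducer combines each key's values into a smaller result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count expressed in this model.
records = ["big data", "big analytics"]
mapped = map_phase(records, lambda line: [(w, 1) for w in line.split()])
result = reduce_phase(shuffle(mapped), lambda key, values: sum(values))
print(result)  # {'big': 2, 'data': 1, 'analytics': 1}
```

Note the sequencing the name implies: reduce cannot start producing final counts until the shuffle has gathered every value for a key, which is why map/reduce jobs are inherently batch-oriented.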


10. GridGain: A Java-based tool for real-time big data processing that offers an alternative to Hadoop’s MapReduce and is compatible with the Hadoop Distributed File System. It offers in-memory processing for fast analysis of real-time data. GridGain Big Data is generally integrated between a business intelligence (BI) solution and relational database management system (RDBMS) software. GridGain Big Data is designed to work in computing-intensive environments, where computations are performed on a series of distributed computers or servers. The open source version can be freely downloaded.


11. Cassandra: Developed at Facebook and based on the designs of Amazon Dynamo and Google BigTable, Cassandra is designed to handle large amounts of data across many commodity servers while providing highly available service with no single point of failure.  Cassandra offers continuous availability, linear-scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones.


12. HBase: A non-relational, distributed database written in Java. It is a NoSQL database that runs on top of Hadoop. HBase is known for providing strong data consistency on reads and writes, which distinguishes it from other NoSQL databases. It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce.


13. MongoDB: A cross-platform document-oriented database that supports dynamic schema design, allowing the documents in a collection to have different fields and structures. It’s a NoSQL database with document-oriented storage, full index support, replication and high availability, and more. MongoDB can be used as a file system, taking advantage of load balancing and data replication features over multiple machines for storing files.
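Dynamic schema design is easy to show with plain Python dicts standing in for documents. This is a hypothetical sketch, not the MongoDB driver API: two documents in the same “collection” carry different fields, and a simple query still works across both.

```python
# Illustrative sketch of a document collection with a dynamic schema
# (plain dicts stand in for MongoDB documents; not the pymongo API).
collection = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Bo", "tags": ["admin"], "age": 41},  # different fields
]

def find(collection, **criteria):
    # Return documents whose fields equal all the given criteria;
    # documents missing a queried field simply don't match.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

admins = find(collection, name="Bo")
```

In a relational table, adding the `tags` and `age` fields to one row would require a schema migration; in a document store, each document simply carries whatever fields it needs.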


14. CouchDB: It’s a database that completely embraces the web. It is a document-oriented NoSQL database that uses JSON to store data, JavaScript (with MapReduce) as its query language, and HTTP for its API. CouchDB implements a form of Multi-Version Concurrency Control (MVCC) to avoid the need to lock the database file during writes.  You can access your documents and query your indexes with your web browser, via HTTP.  CouchDB works well with modern web and mobile apps and offers distributed scaling with fault-tolerant storage.
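The MVCC idea mentioned above can be sketched in a few lines: instead of locking the file, each write must cite the revision it read, and a stale revision is rejected as a conflict. This is an illustrative toy, not CouchDB’s actual interface (class and method names here are invented).

```python
# Toy sketch of CouchDB-style optimistic concurrency via document revisions.
class RevStore:
    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc, rev=None):
        current = self.docs.get(doc_id)
        if current is not None and current["_rev"] != rev:
            # Another writer got here first; caller must re-read and retry.
            raise ValueError("conflict: stale revision")
        new_rev = 1 if current is None else current["_rev"] + 1
        self.docs[doc_id] = dict(doc, _rev=new_rev)
        return new_rev

store = RevStore()
rev1 = store.put("doc1", {"title": "draft"})            # create: revision 1
rev2 = store.put("doc1", {"title": "final"}, rev=rev1)  # update with current rev
try:
    store.put("doc1", {"title": "stale"}, rev=rev1)     # stale rev is rejected
    conflict = False
except ValueError:
    conflict = True
```

No lock is ever held: readers always see a consistent revision, and concurrent writers are sorted out at commit time rather than serialized up front.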


15. Hypertable: Based on a design developed by Google to meet its scalability requirements, Hypertable aims to solve the scale problem better than other NoSQL solutions. It was designed for the express purpose of solving the scalability problem, a problem that is not handled well by a traditional RDBMS.  While it is possible to design a distributed RDBMS system by breaking the dataset into shards, that approach requires an enormous amount of engineering effort because the core database engine was not designed for scalability.

About Charles Skamser
Charles Skamser is an internationally recognized technology sales, marketing and product management leader with over 25 years of experience in Information Governance, eDiscovery, Machine Learning, Computer Assisted Analytics, Cloud Computing, Big Data Analytics, IT Automation and ITOA. Charles is the founder and Senior Analyst for eDiscovery Solutions Group, a global provider of information management consulting, market intelligence and advisory services specializing in information governance, eDiscovery, Big Data analytics and cloud computing solutions. Previously, Charles served in various executive roles with disruptive technology start-ups and well-known industry technology providers. Charles is a prolific author and a regular speaker on the technology that the Global 2000 require to manage the accelerating increase in Electronically Stored Information (ESI). Charles holds a BA in Political Science and Economics from Macalester College.