What’s the Big Deal about NoSQL?

Just when we were all getting comfortable with our data (documents for you eDiscovery folks) being stored in nice neat rows, along comes Big Data and messes it all up. Well, “messing it all up” may not be the right description of exactly what Big Data is doing to the world of traditional relational database storage. It is probably more accurate to use terms like “stressing it to the point of breaking”, “causing simple queries to take 2 hours to run” and “requiring very large and very expensive servers to work”.

I could go on about how traditional data storage technology just isn’t economical, or may not even work, in the face of Big Data. However, some of my best and longest-term friends are traditional data storage folks, and therefore I don’t want to rub it in. And that’s not really the point of my article.

The point of my article is to identify the newest and hottest alternative to traditional data storage called “NoSQL” and describe why it has become such a Big Deal in support of Big Data.

First of all, I am not sure where the term “NoSQL” came from. I guess that I will have to do some research and find out how this term came about. The reason that I am even mentioning this question of origin is that the term “NoSQL” doesn’t really describe what “NoSQL” databases do. The absence of SQL is not the point. There is a long laundry list of characteristics that are much more important, such as:

  1. Schema-free
  2. Easy replication support
  3. Object Oriented with Simple API
  4. Eventually consistent / BASE (not ACID)
  5. The Ability to Support a huge amount of data (horizontally scalable)
  6. Much More

At the top of the list for #6: Much More, I would probably add the fact that when you implement #5: The Ability to Support a huge amount of data (horizontally scalable), you can do it with really inexpensive hardware, or in the cloud with a Cloud Service Provider (CSP) such as Amazon Web Services (AWS) with auto scaling, as opposed to running it on very expensive server hardware (Sorry, IBM, HP and Fujitsu).

So, back to the point of my article and why NoSQL is such a big deal. I kinda already tipped my hand in the previous paragraph. However, let’s get a little more technical and say that I think one of the really big deals about NoSQL is “Auto Sharding”. OK, so you thought that this was going to be a really fun and light article and then I start throwing around terms like “Auto Sharding”. Sorry!

According to the folks at MongoDB, because of the way they are structured, relational databases usually scale vertically: a single server has to host the entire database to ensure reliability and continuous availability of data. This gets expensive quickly, places limits on scale, and concentrates risk in a small number of failure points for the database infrastructure. The solution is to scale horizontally, by adding servers instead of concentrating more capacity in a single server.

“Sharding” a database across many server instances can be achieved with SQL databases, but usually is accomplished through SANs and other complex arrangements for making hardware act as a single server. Because the database does not provide this ability natively, development teams take on the work of deploying multiple relational databases across a number of machines. Data is stored in each database instance autonomously. Application code is developed to distribute the data, distribute queries, and aggregate the results of data across all of the database instances. Additional code must be developed to handle resource failures, to perform joins across the different databases, for data rebalancing, replication, and other requirements. Furthermore, many benefits of the relational database, such as transactional integrity, are compromised or eliminated when employing manual sharding.
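To make the manual approach concrete, here is a minimal Python sketch of the kind of routing code an application team ends up writing. Everything in it is an assumption for illustration: the `ManualShardRouter` class, the in-memory dictionaries standing in for separate database instances, and the hash-modulo placement rule are hypothetical, not taken from any particular product.

```python
import hashlib


class ManualShardRouter:
    """Hypothetical application-level sharding: the app picks the database."""

    def __init__(self, shard_count):
        # Each dict stands in for one independent relational database instance.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Hash the key and take it modulo the shard count to pick an instance.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def insert(self, key, record):
        # The application, not the database, distributes the data.
        self.shards[self._shard_for(key)][key] = record

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

    def query_all(self, predicate):
        # The application must also fan the query out to every instance
        # and aggregate the partial results itself.
        results = []
        for shard in self.shards:
            results.extend(r for r in shard.values() if predicate(r))
        return results


router = ManualShardRouter(shard_count=3)
for i in range(9):
    custodian = "smith" if i % 2 == 0 else "jones"
    router.insert(f"doc-{i}", {"id": i, "custodian": custodian})

smith_docs = router.query_all(lambda r: r["custodian"] == "smith")
print(len(smith_docs))  # documents 0, 2, 4, 6 and 8
```

Note what the sketch leaves out: failure handling, cross-shard joins, replication, and moving data when the shard count changes. With the modulo rule above, changing `shard_count` would relocate most keys, which is exactly the rebalancing work the paragraph describes.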

NoSQL databases, on the other hand, usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.
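The idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how MongoDB or any other product actually implements auto-sharding: the `AutoShardedStore` class and its methods are hypothetical, and consistent hashing is simply one common placement technique chosen here because it is short to write. The point is that the application only calls `put` and `get`, while the store decides placement and rebalances when a server joins the pool.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class AutoShardedStore:
    """Hypothetical store that places and rebalances data on its own."""

    def __init__(self, server_names, vnodes=64):
        self.vnodes = vnodes
        self.ring = []   # sorted list of (hash, server_name) ring positions
        self.data = {}   # server_name -> records held by that server
        for name in server_names:
            self.add_server(name)

    def _owner(self, key):
        # The first ring position at or after the key's hash owns the key.
        idx = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

    def add_server(self, name):
        self.data[name] = {}
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{name}#{i}"), name))
        self._rebalance()

    def _rebalance(self):
        # Move only the records whose owner changed; the caller never sees this.
        for name, records in self.data.items():
            for key in [k for k in records if self._owner(k) != name]:
                self.data[self._owner(key)][key] = records.pop(key)

    def put(self, key, record):
        self.data[self._owner(key)][key] = record

    def get(self, key):
        return self.data[self._owner(key)].get(key)


store = AutoShardedStore(["server-a", "server-b"])
for i in range(1000):
    store.put(f"doc-{i}", {"id": i})

before = {f"doc-{i}": store._owner(f"doc-{i}") for i in range(1000)}
store.add_server("server-c")  # scale out; application code is unchanged
moved = [k for k, owner in before.items() if store._owner(k) != owner]

print({name: len(records) for name, records in store.data.items()})
print(f"{len(moved)} of 1000 records moved to the new server")
```

In this sketch, adding a server relocates only the records that now belong to the new server, and every record stays readable through the same `get` call. A real system would also replicate data so that a failed server can be replaced without loss, which this toy model does not attempt.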

Cloud computing makes this significantly easier, with providers such as Amazon Web Services providing virtually unlimited capacity on demand, and taking care of all the necessary database administration tasks. Developers no longer need to construct complex, expensive platforms to support their applications, and can concentrate on writing application code. Commodity servers can provide the same processing and storage capabilities as a single high-end server for a fraction of the price.

It’s a bit more complicated than this, and there are a few more technical things that you need to understand. However, for the most part, I believe the Big Deal with NoSQL is its ability to support Auto Sharding and therefore scale horizontally to enable much faster processing at much lower costs than traditional database technology. So, there you have it! It all comes down to speed and cost!

For eDiscovery and Information Governance users who need to load, index, cull, analyze and review massive amounts of data, NoSQL (especially when run in the Cloud with a CSP like AWS) will be huge, as it will enable increases in processing of 10x or 15x versus legacy solutions and also support the flexibility to scale up and down on a case-by-case basis. And the really Big Deal for eDiscovery and Information Governance is the dramatic reduction in processing and review costs. I predict that NoSQL and Auto Sharding are the beginning of the end for many legacy eDiscovery and Information Governance vendors that don’t or can’t evolve. What’s ironic is that I would bet you couldn’t find more than one or two of the legacy vendor CEOs and/or CTOs who could have even told you what Auto Sharding was 5 years ago. I guess when you don’t know what you don’t know, it’s hard to get ready for the future.

 

About Charles Skamser
Charles Skamser is an internationally recognized technology sales, marketing and product management leader with over 25 years of experience in Information Governance, eDiscovery, Machine Learning, Computer Assisted Analytics, Cloud Computing, Big Data Analytics, IT Automation and ITOA. Charles is the founder and Senior Analyst for eDiscovery Solutions Group, a global provider of information management consulting, market intelligence and advisory services specializing in information governance, eDiscovery, Big Data analytics and cloud computing solutions. Previously, Charles served in various executive roles with disruptive technology startups and well-known industry technology providers. Charles is a prolific author and a regular speaker on the technology that the Global 2000 require to manage the accelerating increase in Electronically Stored Information (ESI). Charles holds a BA in Political Science and Economics from Macalester College.