To SQL or Not to SQL: that is the Question!

Contributed by: Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!


This article is based on our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.

Relational database systems (RDBMS) pay a lot of attention to data consistency and compliance with a formal database schema. New data or modifications to existing data are not accepted unless they satisfy constraints represented in this schema in terms of data types, referential integrity etc. The way in which RDBMS coordinate their transactions guarantees that the entire database is consistent at all times, the well-known ACID properties: atomicity, consistency, isolation and durability. Consistency is usually a desirable property; one normally wouldn’t want for erroneous data to enter the system, nor for e.g., a money transfer to be aborted halfway, with only one of the two accounts updated.

Yet, sometimes this focus on consistency may become a burden, because it induces (sometimes unnecessary) overhead and hampers scalability and flexibility. RDBMS are at their best when performing intensive read/write operations on small or medium-sized data sets or when executing larger batch processes, but with only a limited number of simultaneous transactions. As the data volumes or the number of parallel transactions increase, capacity can be increased by vertical scaling(also called scaling up), i.e. by extending storage capacity and/or CPU power of the database server. However, obviously, there are hardware induced limitations to vertical scaling.

Therefore, further capacity increases need to be realized by horizontal scaling (also known as scaling out), with multiple DBMS servers being arranged in a cluster. The respective nodes in the cluster can balance workloads among one another and scaling is achieved by adding nodes to the cluster, rather than extending the capacity of individual nodes. Such a clustered architecture is an essential prerequisite to cope with the enormous demands of recent evolutions such as big data (analytics), cloud computing and all kinds of responsive web applications. It provides the necessary performance, which cannot be realized by a single server, but also guarantees availability, with data being replicated over multiple nodes and other nodes taking over their neighbour’s workload if one node fails.

However, RDBMS are not good at extensive horizontal scaling. Their approach towards transaction management and their urge to keep data consistent at all times induces a large coordination overhead as the number of nodes increases. In addition, the rich querying functionality may be overkill in many big data settings, where applications merely need a high capacity to ‘put’ and ‘get’ data items, with no demand for complex data interrelationships nor selection criteria. Also, big data settings often focus on semi-structured data or on data with a very volatile structure (think for instance about sensor data, images, audio data, and so on), where the rigid database schemas of RDBMS are a source of inflexibility.

None of this means that relational databases will become obsolete soon. However, the ‘one size fits all’ era, where RDBMS were used in nearly any data and processing context, seems to have come to an end. RDBMS are still the way to go when storing up to medium-sized volumes of highly structured data, with a strong emphasis on consistency and extensive querying facilities. Where massive volumes, flexible data structures, scalability and availability are more important, other systems may be called for. This need resulted in the emergence of NoSQL databases.

The Emergence of the NoSQL Movement

The term “NoSQL” has become overloaded throughout the past decade, so the moniker now relates to many meanings and systems. The modern NoSQL movement describes databases that store and manipulate data in other formats than tabular relations, i.e. non-relational databases. The movement should have more appropriately been called NoREL, especially since some of these non-relational databases actually provide query language facilities close to SQL. Because of such reasons, people have changed the original meaning of the NoSQL movement to stand for “not only SQL” or “not relational” instead of “not SQL”.

What makes NoSQL databases different from other, legacy, non-relational systems which have existed since as early as the 1970s? The renewed interest in non-relational database systems stems from Web 2.0 companies in the early 2000s. Around this period, up-and-coming web companies, such as Facebook, Google, and Amazon were increasingly being confronted with huge amounts of data to be processed, often under time-sensitive constraints. For example, think about an instantaneous Google search query, or thousands of users accessing Amazon product pages or Facebook profiles simultaneously.

Often rooted in the open source community, the characteristics of the systems developed to deal with these requirements are very diverse. However, their common ground is that they try to avoid, at least to some extent, the shortcomings of RDBMS in this respect. Many aim at near linear horizontal scalability, which is achieved by distributing data over a cluster of database nodes for the sake of performance (parallelism and load balancing) and availability (replication and failover management). A certain measure of data consistency is often sacrificed in return. A term frequently used in this respect is eventual consistency; the data, and respective replicas of the same data item, will become consistent in time after each transaction, but continuous consistency is not guaranteed.

The relational data model is cast aside for other modelling paradigms, which are typically less rigid and better able to cope with quickly evolving data structures. Often, the API (Application Programming Interface) and/or query mechanism are much simpler than in a relational setting. The Comparison Box provides a more detailed comparison of the typical characteristics of NoSQL databases against those of relational systems. Note that different categories of NoSQL databases exist and that even the members of a single category can be very diverse. No single NoSQL system will exhibit all of these properties.

Relational Databases NoSQL Databases
Data paradigm Relational tables Key-value (tuple) based
Document based
Column based
Graph based
XML, object based
Others: time series, probabilistic, etc.
Distribution Single-node and distributed Mainly distributed
Scalability Vertical scaling, harder to scale horizontally Easy to scale horizontally, easy data replication
Openness Closed and open source Mainly open source
Schema role Schema-driven Mainly schema-free or flexible schema
Query language SQL as query language No or simple querying facilities, or special-purpose languages
Transaction mechanism ACID: Atomicity, Consistency, Isolation, Durability BASE: Basically available, Soft state, Eventually consistent
Feature set Many features (triggers, views, stored procedures, etc.) Simple API
Data volume Capable of handling normal-sized data sets Capable of handling huge amounts of data and/or very high frequencies of read/write requests

Characteristics of NoSQL databases versus relational databases.

We note, however, that the explosion of popularity of NoSQL data storage layers should be put in perspective considering their limitations. Most NoSQL implementations have yet to prove their true worth in the field (most are very young and in development). Most implementations sacrifice ACID concerns in favour of being eventually consistent, and the lack of relational support makes expressing some queries or aggregations particularly difficult, with map-reduce interfaces being offered as a possible, but harder to learn and use, alternative.

Combined with the fact that RDBMS do provide strong support for transactionality, durability and manageability, quite a few early adopters of NoSQL were confronted with some sour lessons.  See for instance the FreeBSD maintainers speaking out against MongoDB’s lack of on-disk consistency support [1], Digg struggling with the NoSQL Cassandra database after switching from MySQL [2] and Twitter facing similar issues as well (which also ended up sticking with a MySQL cluster for a while longer) [3], or the fiasco of HealthCare.gov, where the IT team also went with a badly-suited NoSQL database [4].  It would be an over-simplification to reduce the choice between RDBMSs and NoSQL databases to a choice between consistency and integrity on the one hand, and scalability and flexibility on the other. The market of NoSQL systems is far too diverse for that. Still, this tradeoff will often come into play when deciding on taking the NoSQL route. We see many NoSQL vendors focusing again on robustness and durability. We also observe traditional RDBMS vendors implementing features that let you build schema-free, scalable data stores inside a traditional RDBMS, capable to store nested, semi-structured documents, as this seems to remain the true selling point of most NoSQL databases, especially those in the document store category.  Some vendors have already adopted “NewSQL” as a new term to describe modern relational database management systems that aim to blend the scalable performance and flexibility of NoSQL systems with the robustness guarantees of a traditional DBMS.

Expect the future trend to continue towards adoption of such “blended systems”, except for use cases that require specialized, niche database management systems. In these settings, the NoSQL movement has rightly taught users that the one size fits all mentality of relational systems is no longer applicable and should be replaced by finding the right tool for the job. For instance, graph databases arise as being “hyper-relational” databases, which makes relations first-class citizens next to records themselves rather than doing away with them altogether. Graph databases express complicated queries in a straightforward way, especially where one must deal with many, nested, or hierarchical relations between objects.  The below table concludes this article by summarizing the differences between traditional RDBMS, NoSQL DBMS and NewSQL DBMS.

Traditional SQL RDBMS NoSQL databases Blended systems, “NewSQL”
Relational Yes No Yes
SQL Yes No, though can come with own query languages Yes
Column stores No Yes Yes
Scalability Limited Yes Yes
Consistency model Strong Eventually consistent, though some efforts to enforce stronger consistency Strongly consistent for the most part
BASE (Basically Available, Soft state and Eventually consistent) No Yes No
Handles large (big) volumes of data No Yes Yes
Schema-less No Yes No, though can store and query free-structured fields

For more information, we are happy to refer to our book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.

References