By: Bart Baesens, Seppe vanden Broucke
This QA first appeared in Data Science Briefings, the DataMiningApps newsletter as a “Free Tweet Consulting Experience” — where we answer a data science or analytics question of 140 characters maximum. Also want to submit your question? Just Tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to data on data science news, or follow us @DataMiningApps.
You asked: Can you provide an overview of SQL technologies on top of an Hadoop stack?
Our answer:
SQL and Hadoop have certainly had a complicated relationship so far. It all started around 2008 with the first release of Apache Hive, the original SQL-on-Hadoop solution, which has now become one of the de-facto tools included with almost all Hadoop installations. The reason for its succes was simple: Hive converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server. It also offers a command line client, Java APIs and JDBC drivers, which made the project wildly successful and quickly adapted by all organisations which were quickly beginning to realize that they’d taken a step back from their traditional data warehouse setups in their desire to switch to Hadoop as soon as possible.
However, SQL-on-Hadoop technologies are not perfect implementations of relational database management systems, and hence frequently have to sacrifice on features such as speed, SQL language compatibility, support for complex joins, and so on. For Hive, the main draw back was its lack of speed. Because of the overhead incurred by translating each query into a series of map-reduce jobs, even the simplest of queries can consume a large amount of time. However, companies such as Hortonworks are still pushing behind the development of Hive, mainly by putting efforts behind Apache Tez, which provides a new backend for Hive, no longer based on the map-reduce paradigm but on directed-acyclic-graph pipelines.
In 2012, Cloudera, another well-known Hadoop vendor, introduced their own SQL-on-Hadoop technology as part of their “Impala” stack. Cloudera opted to forego map-reduce completely, but instead uses its own set of execution daemons, which have to be installed along Hive-compatible datanodes. It offers SQL-92 syntax support, a command line client, and ODBC drivers. It is much faster than a standard Hive installation, allowing for immediate feedback after queries, hence making them more interactive.
It didn’t take long for other vendors to take notice of the need for SQL-on-Hadoop, and in the following years, we saw almost every vendor joining the bandwagon and offering their own query engines (IBM’s BigSQL platform or Oracle’s Big Data SQL, for instance). However, we’ve also witnessed the rising popularity of Apache Spark, which focuses on real-time, in-memory, parallelized processing of Hadoop data, and also comes with its own SQL engine, Spark SQL, that builds on top of it to allow SQL queries to be written against data, and which has become very popular, especially in data mining/science circles.
The field continuous to change and expand at a rapid pace, however. On interesting, new project is Apache Drill. Whereas most existing SQL-on-Hadoop engines assumed Hive-formatted tables in HDFS or tables stored in HBase (HBase is a native Hadoop data store) as the underlying storage mechanism, Drill features a extensible engine that can already query: Hive datastores, HBase datastores, but also flat files, JSON data, and MongoDB data, and it can do so no matter whether the data is stored on-disk, in HDFS, or even cloud storage systems. Think of it as SQL-on-everything rather than SQL-on-Hadoop, but again, this extra flexibility comes with performance drawbacks.
To conclude, this leaves us today with the following taxonomy:
- “Older”, Hive-with-map-reduce based stacks, which are considered slower, but still widely in use (Teradata, IBM, Oracle, for instance)
- “Hive-without-map-reduce”, companies such as Hortonworks wish to move Hive forward by moving away from a map-reduce backend, but opt to use Apache Tez
- “Spark-followers”, vendors such as Databricks (the commercial entity behind Apache Spark) and Cloudera, who’re pushing Hive-on-Spark (also bypassing the map-reduce backend)
- “New technologies”, such as Apache Drill, which are still young at the moment but might become more important, especially with the rise of NoSQL and graph based databases.