A Look Into the Latest DBA Discussions – Distributed SQL vs. NewSQL Databases

In this series on hot DBA discussion topics, we compare distributed SQL DBs with NewSQL DBs to understand the significant differences. However, before we jump into the newer NewSQL category, it is vital to understand why NoSQL DBs like MongoDB and Cassandra started gaining popularity in the last decade. They entered the DBMS playground as innovative alternatives to relational SQL databases but fell short of that objective.

From experience, we now know that the relational SQL databases of that time were monolithic, and the distributed nature of NoSQL databases was attractive to applications that needed more scalability. Since most NoSQL systems focused on key-value (single-row) data models and could not handle the multi-row or relational structures of conventional SQL, they could not be labeled “SQL” DBs. That is how they came to be called NoSQL.

In fact, NoSQL originally meant “No support for SQL” but was later reinterpreted as “Not Only SQL”, on the realization that NoSQL databases may have to coexist with SQL databases but cannot replace them completely. The need for conventional SQL databases persisted, since the relational model supports both single-row and ACID multi-row transactions. As time passed, NoSQL databases proved architecturally unfit to serve consistency-first application needs.

The advent of NewSQL

Because of these limitations, large-scale OLTP workloads in which both scalability and data correctness are critical continued to suffer even with NoSQL. This opened the era of NewSQL DBs, which started showing up in the early 2010s to address the issue. Matthew Aslett of 451 Research coined the term ‘NewSQL’ in 2011 to categorize this new set of “scalable” DBs. Today, NewSQL DBs come in two flavors.

One flavor of NewSQL DBs offers an automated sharding layer over multiple independent instances of a monolithic SQL database. For example, Vitess does this for MySQL, whereas Citus does it for PostgreSQL. Each instance, taken independently, is still the same old monolithic database, so challenges like native failover and distributed ACID transactions remain hard to handle. Above all, developers lose the agility that comes from interacting with a single logical SQL DB.

The second flavor covers DBs like VoltDB, NuoDB, and Clustrix, which are built as distributed storage systems with the objective of keeping the concept of a single logical SQL database in place.

Next, let us evaluate some of the NewSQL databases against distributed SQL

Vitess

Vitess provides automated sharding for MySQL. Each MySQL instance acts as a shard. Vitess uses a strongly consistent key-value store such as etcd to hold the metadata about shard location, that is, which shard is located on which instance. Vitess also uses VTGate as a set of coordinator nodes. These accept the queries of client applications and route them to the corresponding shard based on the mapping stored in etcd. Each instance uses standard MySQL master-slave replication.
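The routing step described above can be illustrated with a minimal sketch. This is not the Vitess or VTGate API: the shard map, the instance names, and the functions `key_hash`, `instance_for_hash`, and `route` are all hypothetical, and the in-memory list merely stands in for the shard-location metadata that a consistent key-value store would hold.

```python
# Hypothetical sketch of coordinator-style shard routing; not the Vitess API.
from bisect import bisect_right
import hashlib

# Assumed shard map: (exclusive upper bound of the hash range, instance name).
# In a real deployment this metadata would live in the topology store.
SHARD_MAP = [
    (0x80, "mysql-instance-a"),   # hash values 0x00..0x7f
    (0x100, "mysql-instance-b"),  # hash values 0x80..0xff
]

def key_hash(sharding_key: str) -> int:
    """Map a sharding key to a value in 0..255 (illustrative hash only)."""
    return hashlib.sha256(sharding_key.encode("utf-8")).digest()[0]

def instance_for_hash(h: int) -> str:
    """Find the instance whose hash range contains h."""
    bounds = [upper for upper, _ in SHARD_MAP]
    return SHARD_MAP[bisect_right(bounds, h)][1]

def route(sharding_key: str) -> str:
    """Route a single-key query to the instance that owns the key."""
    return instance_for_hash(key_hash(sharding_key))

def instances_touched(sharding_keys) -> set:
    """A multi-row query must visit every instance that owns one of its keys."""
    return {route(k) for k in sharding_keys}
```

A single-key lookup resolves to exactly one instance, while a query over many keys may fan out to several. That fan-out is one way to see why operations spanning shards are more expensive than single-shard ones.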

However, as per RemoteDBA.com experts, SQL features that access data spread across multiple rows and multiple shards are strongly discouraged in Vitess. Examples of such discouraged features are global secondary indexes and cross-shard JOINs. All of this reiterates the point that a Vitess cluster lacks the notion of a single logical SQL DB in a real application environment. Developers must be aware of the sharding and account for this shortfall while designing their schema and executing their queries.

Citus

Basically, Citus is the PostgreSQL counterpart of Vitess. Running as an extension of PostgreSQL, Citus adds horizontal write scalability to PostgreSQL deployments through transparent sharding. An installation begins with n PostgreSQL nodes, each with the Citus extension installed. One of those n nodes then becomes the Citus coordinator node, and the remaining nodes act as worker nodes.

Applications interact only with the coordinator node and are not aware that the worker nodes exist. The replication architecture, which provides availability during failures, is still master-slave, following the Postgres standard. This single-coordinator design can introduce availability and performance bottlenecks. Any slowdown of the coordinator node may ultimately slow down the whole cluster even when the worker nodes are functioning normally. Similarly, a coordinator node outage may bring the whole cluster down. Since worker nodes cannot interact with client applications directly, there is no way to make client drivers smarter by caching the shard metadata.

VoltDB

VoltDB is built on an auto-sharded distributed database architecture. It uses a proprietary SQL dialect that has no foreign key support. Intra-cluster replication is based on the K-safety algorithm, in which K denotes the number of extra copies kept of each shard's data. For example, a configuration of K=2 corresponds to a Replication Factor of 3, the default in distributed SQL databases such as YugabyteDB and Google Spanner.
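The K-safety arithmetic above is simple enough to write down. The sketch below is plain arithmetic, not a VoltDB API, and the function name `copies_for_k_safety` is hypothetical.

```python
# Illustrative arithmetic only; not a VoltDB configuration API.

def copies_for_k_safety(k: int) -> int:
    """Total copies of each shard: the original plus K extra copies."""
    if k < 0:
        raise ValueError("K must be non-negative")
    return k + 1

# K=2 means two extra copies, i.e. three copies in total,
# which is what a replication factor of 3 expresses.
replication_factor = copies_for_k_safety(2)
```

With K=0 there is a single copy and no redundancy, and each increment of K adds one more copy of every shard.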

In the case of VoltDB, all replicas of a given shard are updated synchronously by the client application. In contrast, distributed consensus protocols such as Paxos and Raft send each write to every replica but commit it as soon as a majority of the replicas acknowledge the request. In practice, waiting for responses from all the replicas is not necessary, as consensus can be established with a majority. Also, VoltDB may not be able to detect network partitions by itself and requires the add-on network fault protection to be enabled. When even a single node in the cluster is partitioned, fault protection mode gets activated, which may adversely impact cluster performance by increasing the recovery time before the cluster can accept writes again.
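The difference between waiting for every replica and waiting for a majority can be made concrete with a small sketch. It models only the acknowledgement count and is not an implementation of Paxos, Raft, or VoltDB replication; the names `majority` and `can_commit` are hypothetical.

```python
# Simplified model of the commit rule only; not a consensus implementation.

def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1

def can_commit(acks: int, replicas: int, require_all: bool) -> bool:
    """Decide whether a write may commit given the acknowledgements received.

    require_all=True models fully synchronous replication (every replica
    must acknowledge); require_all=False models a majority quorum.
    """
    needed = replicas if require_all else majority(replicas)
    return acks >= needed

# With 3 replicas and one of them slow or unreachable, only 2 acks arrive.
quorum_commits = can_commit(acks=2, replicas=3, require_all=False)
all_replica_commits = can_commit(acks=2, replicas=3, require_all=True)
```

Under these assumptions, the majority rule lets the write commit with two of three acknowledgements, while the all-replica rule keeps it waiting on the third.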

Other examples are NuoDB (a proprietary NewSQL DB), ClustrixDB (a scale-out SQL DB), and so on. In fact, NewSQL in the cloud is still in its infancy, and distributed SQL DBs like Google Spanner are slowly building up to take advantage of cloud elasticity and to work even on inherently unreliable infrastructure.
