SQL Will Inevitably Come To NoSQL Databases

The name might be short for Not Only SQL, but to be a proper database that can be used by normal enterprises and not just by hyperscalers with their fleets of PhDs, any database, whether it is a relational or NoSQL, has to be able to support the Structured Query Language that has been associated with relational databases from day one.

The NoSQL databases are important because they scale much further than relational databases in terms of the raw size of datasets, but to do so they store information in a format that is often generated by applications and their underlying systems, usually a derivative of the JavaScript Object Notation, or JSON. Pulling this information into a relational database is too cumbersome and costly, and on certain dataset sizes, queries don’t even complete, as MapR explained to The Next Platform when launching a commercial variant of the Apache Drill query tool a few weeks ago.

The kinds of telemetry and clickstream data that is often stored in a document database like Couchbase or MongoDB, just to name two, is important to every industry. To offer the kinds of personalized experiences that end users expect these days requires that applications are both context aware and location aware, and to become the general purpose database that sits under modern applications requires that this information be stored in a native format, and yet be queried by a general purpose language that is familiar to many, like SQL.

This is not a new dichotomy, incidentally. Most relational databases had their own native database access methods, which offered superior performance compared to SQL, but over time the performance of SQL has improved and it has become ubiquitous as the means for end users and applications to extract data from and to run transactions against relational databases.

NoSQL and NewSQL datastores have their own methods of extracting data, including the MapReduce batch algorithm used in Hadoop. All of the major Hadoop distributions have their own SQL add-ons: Impala from Cloudera, HAWQ from Pivotal (and now used by Hortonworks), Drill from MapR Technologies, and Big SQL from IBM. Adding SQL processing for semi-structured data is a bit tricky, though, and according to researchers at the University of California at San Diego, what the industry needs is a superset of SQL language that will let it handle these new data types in sufficiently flexible ways to be useful but still be enough like the SQL that companies know that they can deploy these tools without thinking about it too much.

Yannis Papakonstantinou, a professor of computer science at UCSD, has just published a paper proposing the creation of the SQL++ standard, which will loosen up SQL just enough to be useful for semi-structured data without sacrificing compliance with the ANSI standards for how SQL tickles relational databases. With funding from the National Science Foundation, Papakonstantinou and his team of database researchers looked at eleven different query languages and found them all lacking in some measure when it came to supporting SQL query constructs as they hook into semi-structured data such as the JSON documents in a Couchbase database or the BSON documents in a MongoDB database. (Interestingly, under a project called Forward, the UCSD team created at middleware query processor that used SQL++ to issue distributed queries over a variety of relational and NoSQL data sources as well as on the SQL layers for Hadoop distributions.)

Independent of the SQL++ project at UCSD, document database supplier Couchbase has been working on a project called N1QL (pronounced “nickel”) that is a SQL query overlay for its JSON document stores. The N1QL query language will be part of the Couchbase Server 4.0 release coming later this year, which will also include the ForestDB storage engine, which is tuned for multicore processors and flash storage, to hold secondary indexes, and multi-dimensional scaling, which will allow for query, index, and data services in the Couchbase cluster to be independently scaled rather than as a monolithic unit as in the past. (The code is in beta now, and you can download it here.)

“With N1QL, we are SQL 92 compliant, which means if you write a query, you would not know if the query is executing on a relational database or on a NoSQL database,” explains Ravi Mayuram, vice president of products and engineering at Couchbase, to The Next Platform. “We have taken that level of care to make the SQL be that equivalent. So it has JOIN, GROUP BY, subqueries, COUNT, arithmetic functions – the level of expressiveness that typical SQL has. From a SQL grammar perspective, N1QL is 100 percent SQL 92 compliant, but there are extensions because certain things have to be done differently in a document database rather than with information sitting in tables. The differences mainly have to do with how you look inside of nested objects in JSON.”

Incidentally, N1QL is in the process of being certified as compliant with the SQL++ standard under development at UCSD. The AsterixDB database under development at the University of California at Irvine will support most of the SQL++ extensions, and Papakonstantinou says in the SQL++ paper that Apache Drill is in the process of aligning with SQL++ as well. As for MongoDB and DataStax, the commercializer of the Cassandra NoSQL datastore developed by Facebook after its frustrations with the limitations of MySQL relational databases, neither of which have SQL layers of their own. The N1QL layer is open source, so it can in theory be adapted to other databases and datastores, but Bob Wiederhold, president and CEO at Couchbase, doesn’t expect for N1QL to be adopted as a standard for other document databases but does think that SQL is in the future of all such datastores.

“We certainly encourage MongoDB and DataStax and all NoSQL vendors to move to SQL languages to help expand the NoSQL space,” says Wiederhold. “There is a huge base of experienced SQL users, and that is why all of the Hadoop vendors have added this capability.”

The future Couchbase Server will have JDBC and ODBC drivers as well, allowing it to hook into relational databases and for applications to hook into it through these methods. Couchbase says that it will continue to support the JavaScript programming interface that is the native access method for Couchbase Server, much as relational databases continue to support their own native access methods. But over time, Mayuram expects for Couchbase shops to move to the N1Ql access method because a lot of the logic that is necessary to do queries currently resides in the application tier and will be pushed down into the database tier by the use of N1QL.

N1QL Turns On A Dime

Comparing performance between older versions of Couchbase Server and the forthcoming one with N1QL and multi-dimensional scaling will be a bit tricky, says Mayuram. All of that data manipulation that is going on inside of the application tier and that is coded by companies will be more standardized as they switch to SQL++ query methods. This will make the job easier for programmers, and that is a kind of performance boost, too.

“The performance question is very difficult to answer,” says Wiederhold. “Our objective is that the performance using N1QL is no different from the performance of the current methods. But it is going to vary depending on the types of indexes that you are building and the types of queries that you are running. Our desire is that N1QL is going to be just as high performance, and that will depend on how good the query processing is with N1QL and how the multi-dimensional scaling comes into play. Some queries will run a bit slower than if you did it the old way, but we are trying to build this capability so you don’t pay a performance penalty on most queries, and if you do, it is very small.”

The situation is somewhat analogous to the HBase database layer on top of Hadoop, which allowed querying of information by converting queries into MapReduce algorithms and then submitting batch jobs to the Hadoop Distributed File System. The HBase layer made Hadoop accessible to a whole new class of users who were not Java programmers and who were not going to write their own MapReduce algorithms.

Couchbase has more than 500 customers using the Enterprise Edition of Couchbase Server in production, and Wiederhold estimates that there are tens of thousands of open source installations of Couchbase running in the world. The company has been adding about a hundred production customers a year, and the typical use case is to use the NoSQL database as the element of a web and mobile application, perhaps as a user authentication or user profile datastore.

At this point in the NoSQL adoption curve, the use cases are specific and importantly they are growing as companies get more experience and familiarity running these alternatives to relational databases. Couchbase Server scales to something on the order of 80 to 90 nodes in a single instance, and the company is working to scale up commercial-grade support on systems with upwards of 150 nodes in future releases. The software has been tested in the labs running on up to 200 nodes (this is without data sharding and without counting shards as nodes, Wiederhold points out). The largest Couchbase users have 300 to 400 servers running the document database and storing petabytes of data, but they tend to not have individual clusters that are larger than around 80 to 90 nodes. Couchbase is counting on N1QL to cause an inflection point in NoSQL adoption, which should mean the need for larger clusters and a larger pool of customers. In the meantime, Couchbase is helping to fund further development of SQL++ at UCSD and will be ensuring that N1QL aligns to this specification that could, someday, become a standard that all NoSQL and NewSQL database vendors align to.

Sign up to our Newsletter

Featuring highlights, analysis, and stories from the week directly from us to your inbox with nothing in between.
Subscribe now

2 Comments

  1. Here’s the thing. On one side, the relational engine guys like Oracle and SQL Server have failed to get their engines to scale much above the terabyte level. These guys are not dumb, though they might be a tad lazy. On the other side the marketing groups in this and that company think they can take a petabyte of data, stuff it into a “NoSQL” database, and suddenly query it in ways that have stumped Oracle and Microsoft. All you can say is LOL.

    I think as usual there will be a convergence from both sides. Oracle and Microsoft will slowly increase the scalability of their relational engines – Oracle is already doing so with their Exadata platform, Microsoft has just dallied in the area so far. And yes, SQL engines will be stuck onto “NoSQL” engines and will work the same problems from the other side.

    I went to a bigdata confab a while ago and just laughed at all the stories, “Doh! We ran out of temp space when we were processing our queries against joined hundred terabyte tables, we didn’t expect that!” Anyone who doesn’t see that coming is really just stumbling around in the design space.

Leave a Reply

Your email address will not be published.


*


This site uses Akismet to reduce spam. Learn how your comment data is processed.