Microsoft Research Pens Quill for Data Intensive Analysis
November 8, 2016 Ben Cotton
Collecting data is only useful to the extent that the data is analyzed. These days, human Internet usage is generating more data (particularly for advertising purposes) and Internet of Things devices are providing data about our homes, our cars, and our bodies.
Analyzing that data can become a challenge at scale. Streaming platforms work well with incoming data but aren’t designed for post hoc analysis. Traditional database management systems can perform complex queries against stored data, but cannot be put to real-time usage.
One proposal to address these challenges, called Quill, was developed by Badrish Chandramouli and colleagues at Microsoft Research. It is a distributed platform for analyzing large datasets. Building on Microsoft’s Trill project, Quill is designed with several key features in mind: support for both streaming and post hoc analysis, rich data movement plans, temporal support, and scaling from a single core to many cores in the cloud.
Quill can read data from files on disk or in memory. However, it can also accept a real-time stream constructor as input. This allows the same logic to work against both real-time and stored data with no additional modification.
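To make the idea concrete, here is a minimal Python sketch of one query function consuming either stored or live input. It is not Quill's API (Quill is a .NET library); the names `clicks_per_ad`, `stored`, and `live_source` are hypothetical, and the generator merely stands in for a real-time source such as a socket or queue.

```python
from typing import Dict, Iterable, Iterator

def clicks_per_ad(events: Iterable[Dict[str, str]]) -> Iterator[Dict[str, int]]:
    """Yield a snapshot of running click counts per ad after each click event."""
    totals: Dict[str, int] = {}
    for event in events:
        if event["type"] == "click":
            totals[event["ad"]] = totals.get(event["ad"], 0) + 1
            yield dict(totals)

# Stored data: a list already in memory (it could equally be read from a file).
stored = [
    {"ad": "a", "type": "click"},
    {"ad": "b", "type": "impression"},
    {"ad": "a", "type": "click"},
    {"ad": "b", "type": "click"},
]

def live_source() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a real-time feed; yields events one at a time."""
    for event in stored:
        yield event

# The same query logic runs unchanged over both kinds of input.
final_from_stored = list(clicks_per_ad(stored))[-1]
final_from_live = list(clicks_per_ad(live_source()))[-1]
```

Because the query only iterates over its input, it does not need to know whether the events are already on disk or are still arriving.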
Datasets in Quill are represented as a set of streamable shards. Shards can be distributed in a variety of ways, including round-robin, multicast, and broadcast. The multicast operation runs a user-provided lambda over the data and allows for complex operations like theta joins and matrix operations. The size of the worker pool can be expanded or reduced by using data movement actions to replicate the data to a differently-sized pool.
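The three distribution strategies named above can be sketched in a few lines of Python. This is an illustration of the general concepts only, not Quill's implementation or API; the function names and the `route` callback are assumptions made for the example.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def round_robin(records: Iterable[T], n_shards: int) -> List[List[T]]:
    """Send record i to shard i mod n_shards, spreading the load evenly."""
    shards: List[List[T]] = [[] for _ in range(n_shards)]
    for i, rec in enumerate(records):
        shards[i % n_shards].append(rec)
    return shards

def broadcast(records: Iterable[T], n_shards: int) -> List[List[T]]:
    """Send a full copy of every record to every shard."""
    data = list(records)
    return [list(data) for _ in range(n_shards)]

def multicast(
    records: Iterable[T],
    n_shards: int,
    route: Callable[[T], Iterable[int]],
) -> List[List[T]]:
    """Send each record to the subset of shards chosen by a user-supplied function."""
    shards: List[List[T]] = [[] for _ in range(n_shards)]
    for rec in records:
        for target in route(rec):
            shards[target % n_shards].append(rec)
    return shards

# Example: each record goes to its own shard and to the next one over.
overlapping = multicast(range(4), 3, lambda r: [r, r + 1])
```

Multicast is the general case: a `route` function returning a single shard gives a plain partitioning, and one returning every shard gives a broadcast. Sending a record to several shards is what makes operations such as theta joins possible, since each worker can be given all of the rows it needs to compare.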
Because Quill is built on Trill, it can take advantage of the latter’s temporal support. The paper uses an advertising platform data set as the running example, which is an excellent fit for temporal analysis. The native temporal support sets Quill apart from many other distributed analytics platforms such as MapReduce, Hive, and Spark.
The cloud support may be the most compelling aspect of Quill. Creation and destruction of clusters within a Quill workflow are single statements. Not only is Quill distributed, but it is decentralized by design. With a masterless architecture, Quill is robust against failure. The client can disconnect and active queries will continue to execute on worker nodes. Quill uses Microsoft Azure features such as Azure Table storage and Azure Queues to reduce failure points and improve scalability. Benchmark tests showed nearly linear throughput improvement up to the largest cluster size of 40 nodes. The same benchmark run on identical cloud instances with Spark showed near-constant performance regardless of cluster size. For the 40-node cluster, the query benchmark was six times faster with Quill than with Spark.
Performance improvements are even more pronounced when performing temporal queries. Using a hopping window query, Quill had approximately 12 times better throughput than Spark using a window slide of 1 day and a window size of 1 week. At 1 hour/1 week, Quill’s performance was approximately the same, but Spark’s had decreased, resulting in 120 times better throughput for Quill. Quill’s performance halved moving to a 1 second/1 week window, but Spark was not able to complete that benchmark test.
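For readers unfamiliar with hopping windows, the following Python sketch shows what the slide and size parameters mean. It is a deliberately naive illustration, not how Quill, Trill, or Spark evaluates such queries; the function names are hypothetical, and it assumes windows start at multiples of the slide and cover the half-open interval [start, start + size).

```python
from collections import Counter
from typing import Dict, Iterable, List

HOUR = 3600          # seconds
DAY = 24 * HOUR
WEEK = 7 * DAY

def hopping_window_starts(ts: int, size: int, slide: int) -> List[int]:
    """Return the start times of every window [start, start + size) containing ts."""
    starts = []
    start = (ts // slide) * slide      # latest window that has already opened
    while start > ts - size:           # walk back while the window still covers ts
        starts.append(start)
        start -= slide
    return starts[::-1]

def hopping_counts(timestamps: Iterable[int], size: int, slide: int) -> Dict[int, int]:
    """Count events per window by assigning each event to every window it falls in."""
    counts: Counter = Counter()
    for ts in timestamps:
        for start in hopping_window_starts(ts, size, slide):
            counts[start] += 1
    return dict(counts)

# Number of overlapping windows each event belongs to, for the three settings above.
windows_per_event = {
    "1 day / 1 week": WEEK // DAY,
    "1 hour / 1 week": WEEK // HOUR,
    "1 second / 1 week": WEEK // 1,
}
```

With a one-week window, each event falls into 7 windows at a one-day slide, 168 at a one-hour slide, and 604,800 at a one-second slide. The sketch only illustrates this arithmetic; it makes no claim about why either system produced the benchmark numbers reported.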
Quill appears to still be in the early stages of development. The authors note in the paper that fault tolerance for real-time queries is left to future work. Quill is currently only a .NET library, but providing libraries for other languages used in analytics (for example, Python and R) would add considerable value. The cloud features are not well described, but it appears that production users would benefit from additional flexibility in configuration. And of course, if the cloud management grows to include Amazon Web Services in addition to Microsoft Azure, that will open up this platform to a broader audience.
Despite these few shortcomings, it is clear that Quill represents a potentially valuable approach to analyzing data in a variety of settings. As the big data field continues to mature, Quill can become a major player in the space if a community can develop around it.