In distributed computing, there are two choices: move the data to the computation or move the computation to the data. Public and off-site private clouds present a third option: move them both. In any case, something is moving somewhere. The “right” choice depends on a variety of factors – including performance, quantity, and cost – but data seems to have more inertia in many cases, especially when the volume approaches the scale of terabytes.
For the modern cloud, adding compute power is trivial. Moving the data to that compute power is less so. With a 10 gigabit-per-second connection, the best-case scenario for a 10 terabyte upload is just over two and a half hours. When dealing with these very large volumes of data, moving the data to the compute is not an effective strategy for getting results quickly.
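The arithmetic behind a transfer estimate like this is easy to check. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name `transfer_hours` and the 95% link efficiency are assumptions made for illustration. At raw line rate with decimal terabytes the floor is a little over two hours; counting in binary terabytes and allowing a few percent for protocol overhead pushes the figure past two and a half.

```python
def transfer_hours(data_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link.

    efficiency is the fraction of the nominal link rate that carries
    payload (1.0 means a perfect link with no protocol overhead).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


TEN_GBPS = 10 * 10**9        # 10 gigabits per second
TEN_TB = 10 * 10**12         # 10 terabytes, decimal units
TEN_TIB = 10 * 2**40         # 10 tebibytes, binary units

# Theoretical floor: decimal units, every bit on the wire is payload.
floor_hours = transfer_hours(TEN_TB, TEN_GBPS)            # about 2.22

# Assumed 95% efficiency (hypothetical) with binary units.
padded_hours = transfer_hours(TEN_TIB, TEN_GBPS, 0.95)    # about 2.57

print(f"line-rate floor:       {floor_hours:.2f} hours")
print(f"with overhead assumed: {padded_hours:.2f} hours")
```

Either way, the result is measured in hours before any computation can start, and a shared or slower link only makes it worse.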
This is one of the primary concerns addressed by AstroCloud, an astronomy-specific cloud service in China. Because astronomical data is often astronomical in size (many terabytes up to petabytes), AstroCloud sites its datacenters near the major research telescopes. By providing storage and computation together, it spares researchers from having to manage large data movements. AstroCloud provides on-demand virtual machines with common astronomy software pre-loaded so that all interactive work can be done remotely. They claim in a paper that most operations can be done on a thin client or mobile device.
While AstroCloud’s approach is reasonable, it has its limitations. Pinning the data to a single location removes much of the cloudiness from the service. If compute capacity is unavailable then either the data has to be transferred to a location that has available capacity or the researcher has to wait. If the entire site is offline, then of course no work can be done. This is not an AstroCloud-specific problem – any provider that serves very-large-data communities will face this limitation.
One approach to solving it is a model like the one proposed for the Large Synoptic Survey Telescope (LSST), currently under construction in Chile. Expected to produce 15 terabytes of observational imagery per night when in operation, LSST data will be stored and processed in a central archive at the National Center for Supercomputing Applications in Champaign, Illinois. The data will be replicated to a network of “data access centers” in a tiered structure based on capacity and response. This model, in turn, is similar to that used by the Compact Muon Solenoid (CMS) project. It is not clear if AstroCloud already has or is considering such a model.
Another challenge for purpose-built scientific clouds is balancing demand. General purpose clouds can rely on a broad mix of utilization patterns to balance out demands on shared resources like network bandwidth. Compute- or data-intensive workloads are intermixed with small web servers. Demand from ecommerce and streaming video peaks during waking hours, while the overnight hours are filled with market risk modeling and other batch processing. The net effect is that while utilization still ebbs and flows, it is smoothed by the heterogeneity. This is less likely to be the case on a cloud dedicated to a particular field of science.
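The smoothing effect can be illustrated with a toy model. Everything in the sketch below is hypothetical: the two load curves are made-up cosine profiles (a daytime workload peaking at 14:00 and a smaller overnight batch workload peaking at 02:00), not data from any real provider. It compares the peak-to-mean ratio of the daytime workload alone with that of the mix.

```python
import math

HOURS = range(24)


def daytime_load(hour: int) -> float:
    """Hypothetical interactive demand, peaking at 14:00."""
    return 1 + math.cos(2 * math.pi * (hour - 14) / 24)


def overnight_load(hour: int) -> float:
    """Hypothetical batch demand, half the size, peaking at 02:00."""
    return 0.5 * (1 + math.cos(2 * math.pi * (hour - 2) / 24))


def peak_to_mean(profile: list[float]) -> float:
    """Ratio of the busiest hour to the average hour (1.0 = perfectly flat)."""
    return max(profile) / (sum(profile) / len(profile))


day_only = [daytime_load(h) for h in HOURS]
mixed = [daytime_load(h) + overnight_load(h) for h in HOURS]

print(f"daytime workload alone: {peak_to_mean(day_only):.2f}")
print(f"daytime plus overnight: {peak_to_mean(mixed):.2f}")
```

In this toy case the mixed profile still has a daily cycle, but its peak sits much closer to its average, so less capacity sits idle off-peak. A cloud whose users all follow the same schedule gets no such offset.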
AstroCloud has solved this problem by not limiting itself to providing data and computation resources. It also hosts administrative services like telescope access proposal management. By providing lower-impact services, it can mitigate the problems of homogeneous demand somewhat. The telescope access proposal system and the fact that AstroCloud is funded by the National Development and Reform Commission and the Chinese Academy of Sciences suggest the possibility of an allocation system for data and compute as well. This would be similar to how the National Science Foundation’s XSEDE project works in the United States. It’s worth noting that even if the core of AstroCloud is based on an allocation model, it does include some services to the general public.
Apart from the technical demands of handling data and smoothing resource demand, the major challenge for purpose-built scientific clouds is economic. Rackspace – a strong player, but not among the giants – had annualized expenses of nearly $2 billion as of its last quarterly filing before going private last summer. A scientific cloud can probably spend less on sales and marketing than a general purpose offering, and there’s no reason that a purpose-built science cloud service has to be even as large as Rackspace. Nonetheless, building a cloud service, particularly for datasets on the order of petabytes, is an expensive proposition.
When the U.S. Department of Energy funded the Magellan project from 2009 to 2011, public cloud offerings were less suited to large scientific applications. The Magellan findings specifically called out a lack of InfiniBand (which Microsoft Azure now offers), a lack of processors tuned for performance (public cloud providers are now getting new CPUs and GPUs ahead of the general market in some cases), and a lack of high-capacity data archives (AWS, Google Cloud, and Azure all offer cold storage). This leaves the lack of pre-installed applications and parallel filesystems as the major remaining drawbacks of public cloud, both of which can be handled by trained personnel.
The cloud paradigm is becoming more attractive to science applications, just as it appeals to startups and enterprises. While purpose-built scientific clouds help solve domain-specific problems, it seems to us that the best approach is to make them a software and services layer on top of existing cloud infrastructure, and not to build new resources from the ground up.