There is no workload in the datacenter that can’t, in theory and in practice, be supplied as a service from a public cloud. Big Data as a Service, or BDaaS for short, is an emerging category of services that delivers data processing for analytics in the cloud, and it is getting a lot of buzz these days – and for good reason. These BDaaS products vary in features, functions, and target use cases, but all address the same basic problem: big data and data warehousing in the cloud are deceptively challenging, and customers want to abstract away the complexity.
Data analytics in the cloud is especially tough for companies with extensive datacenter investments hoping to create a hybrid architecture. Enterprises rarely contemplate a wholesale migration of their IT infrastructure to the cloud, regardless of what Amazon Web Services would have you believe. It is just not feasible for most existing companies. Instead, a common goal is to create hybrid architectures, which leverage the on-premises systems and processes that are working well, but augment infrastructure with new cloud resources and analytic capabilities.
The business drivers for hybrid data architectures include better enabling data science, freeing up capacity on costly data warehouses, adding new data lakes or data pipelines, sharing and monetizing data, or collecting new data sources, especially high-volume cloud sources like social, mobile, or sensor data. But, it turns out that hybrid big data architectures are much easier on PowerPoint slides than in the real world. Even new cloud databases or data warehousing as a service resources (for example, Amazon Redshift and its equivalents) have been challenging for many enterprises to integrate with on-premises infrastructures – let alone more complex big data technologies.
While the components for cloud data processing are readily available, many companies lack the time and skills needed for integration, implementation and operations. According to Gartner analyst Adam Ronthal, “The ‘some assembly required’ approach for effectively integrating a range of data management and analytics-related services in large cloud service provider (CSP) ecosystems can be daunting to new cloud adopters.”
The irony is that the cloud promises the biggest impact for older enterprises that are often the most challenged by integration, architecture and lack of cloud skills. Newer companies had the advantage of architecting and staffing for the cloud, or at least with cloud capabilities solidly in mind. And, there is the well-covered “consumerization” trend that leaves workers wondering why it is so hard for their enterprise employers to deliver them the type of integrated cloud capabilities that they get from Apple. It is a frustrating situation for business and IT leaders alike, articulated colorfully by one CIO:
“It took us about eight months to create a new data warehousing environment in the cloud. We hired someone, sent a few people to training and brought in consultants. Now it is live, but it is like accessing a space station. Getting data up there and using it requires a significant effort – a major mission every time. So, cloud is not part of our normal processes and it is not saving us money yet. We’re barely using it.”
This is a common conversation in data circles, which is why several vendors have developed BDaaS in hopes of addressing these challenges.
What BDaaS Is And Is Not
BDaaS offerings vary greatly today. As the category matures, there will likely be more consistency in service functions, but for now, many analysts are painting it in broad brushstrokes. According to Gartner, “Vendors are combining components of analytic platforms in the cloud with multiple processing engines, hybrid on-premises integration, and secure data movement.” And Forrester reports, “Big data as a service technology provides capture management and operations capability delivered as-a-service in the public or hybrid cloud. Uses generally include SQL analytics (data warehouse or data mart), data lake, machine learning, and operational analytics application support.” (From Big Data Tech Radar, Q1 2016.)
There is general agreement on a few key requirements: BDaaS services are always in the cloud. They provide data processing and analytic execution, using data processing technologies such as massively parallel SQL, Hadoop, or Spark. And, BDaaS vendors provide cloud operations and maintenance. But BDaaS may look very different across vendors and picking a supplier requires careful evaluation.
Ultimately, BDaaS is about enabling analytics. Some services are targeted to the data scientist, some more to the data engineer or data warehouse professional supporting business intelligence or analytics programs. BDaaS may simply replace an existing data warehouse or data mart, and analysts may not even know (or need to know or care) that the underlying platform has changed.
BDaaS is new enough that there are some common misconceptions related to the moniker. BDaaS is not the same as “data as a service” and vendors generally do not sell datasets. Another area of confusion is the availability of canned analytics or reports, especially in industries like retail, which are more accustomed to analytics outsourcing. Most BDaaS providers are focused on the processing platform, and do not prescribe which analytics to run or what questions to ask. To put it simply, a company’s analysts or strategic partners are still coming up with the questions and queries, and BDaaS makes it easy for them to get results quickly.
BDaaS Technical Characteristics
A technical review of BDaaS offerings reveals further similarities and differences.
All are cloud-based, though with variations. Some leverage the public cloud infrastructure of Microsoft Azure, Amazon Web Services, Google Cloud Platform, or others. Others run in the BDaaS providers’ own clouds. Some are single tenant, running on dedicated servers or only sharing physical infrastructure. Many services with the BDaaS label are multi-tenant, where several customers share server infrastructure, a model that may reduce costs, but increases security and compliance concerns for regulated industries. Some BDaaS vendors support multiple public cloud platforms; some allow companies to move workloads between different clouds or on-premises platforms.
The core function of BDaaS is data processing and analytic execution. Some BDaaS providers are (or were) also labeled “Hadoop as a Service” or “Spark as a Service,” with automation that makes adopting those new technologies easier. But BDaaS is definitely not all about Hadoop.
More BDaaS vendors are offering multiple processing engines now, such as Hadoop, Spark, massively parallel SQL, or others. This gives enterprises more choices for matching the data technology to the workload. For example, some big data use cases call for handling large volumes of data with batch processing, which can be handled well via Hadoop or Spark. Other cases, such as ad hoc business intelligence, require less storage, but more compute power, and are better delivered by massively parallel SQL processing.
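The matching of engine to workload described above can be sketched as a simple rule of thumb. This is only an illustration, not any vendor's actual selection logic: the `Workload` fields, the engine labels, and the 10 TB cutoff for "large" are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "large volumes of data"; the article gives no number.
LARGE_VOLUME_TB = 10.0


@dataclass(frozen=True)
class Workload:
    name: str
    data_volume_tb: float  # approximate data scanned per run
    pattern: str           # "batch" or "ad_hoc" (illustrative labels)


def suggest_engine(workload: Workload) -> str:
    """Return a rough engine suggestion following the article's two examples."""
    if workload.pattern == "ad_hoc":
        # Ad hoc BI: less storage, more compute, so massively parallel SQL.
        return "mpp_sql"
    if workload.pattern == "batch" and workload.data_volume_tb >= LARGE_VOLUME_TB:
        # Large-volume batch processing: Hadoop or Spark.
        return "hadoop_or_spark"
    # The article does not cover other combinations; leave them for evaluation.
    return "needs_evaluation"


if __name__ == "__main__":
    jobs = [
        Workload("nightly_clickstream_rollup", 50.0, "batch"),
        Workload("sales_dashboard_queries", 0.5, "ad_hoc"),
        Workload("small_weekly_export", 0.2, "batch"),
    ]
    for job in jobs:
        print(f"{job.name}: {suggest_engine(job)}")
```

In practice the decision also depends on cost, concurrency, and how easily a given provider moves datasets between engines, so a rule like this is a starting point for evaluation rather than a substitute for it.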
In all cases, exactly how the technology is provisioned and configured has a huge impact on cost. Provisioning and optimization are very challenging due to frequent changes and rapid innovation in both the cloud and data technology markets. Some BDaaS providers make it easy to move datasets between different engines; others require building your own integrations. Some BDaaS vendors have their own analytics interfaces; others support industry-standard visualization tools (Tableau, Spotfire, and so forth) or analytic languages like R and Python. These approaches differ enough that each should be carefully evaluated.
The key thing about any cloud-based service is that it reduces operation and support costs, even if the infrastructure can be more expensive (by some measures) than on-premises gear. Perhaps the most obvious, yet under-appreciated, characteristic of BDaaS providers is their expertise in cloud operations, security, maintenance and upgrades. BDaaS enables significant economies of scale, meaning a highly skilled BDaaS team can handle cloud operations, so you don’t have to. That often includes patching, upgrades, operating system updates and the like. Some enterprises may also require specialized encryption, security, audit, or compliance controls from their BDaaS provider. While those operational activities may sound like no big deal, they are also the ones most often requiring troubleshooting, special skills and late nights, particularly for security issues. These features are, in essence, what makes any software enterprise-grade.
These production and operations requirements have caught many companies by surprise, particularly on Hadoop-related projects. While learning and installing a new technology like Hadoop or Spark might be a fun side project, troubleshooting a cloud operating system upgrade during a rapidly closing maintenance window is not so glamorous. BDaaS providers have very different enterprise SLAs, monitoring, and management capabilities, so enterprises must closely review what will meet their requirements.
BDaaS has been transformational for many adopters. It is fast to implement, which means that IT can use BDaaS for quick wins, like delivering new analytic capabilities, freeing data warehouse capacity or implementing a new data lake in days, with no special skills required. For data scientists and analysts, BDaaS instantly enables cloud-scale compute and storage, along with expanded analytic capabilities.
In the long term, BDaaS radically simplifies data access for all enterprise workers, partners and customers – and enables faster integration and use of new data. BDaaS also future-proofs companies from the constant change inherent to the data and cloud markets, by providing regular upgrades and enhancements to leverage new technologies and best practices. Compared with on-premises or DIY cloud projects, BDaaS is significantly cheaper and has low maintenance overhead, thanks to its “as a service” delivery model.
All of this is why BDaaS is quickly gaining traction in enterprises. There are a variety of services with the label, but as you dig in, you will find differences in architecture, capabilities, and costs that will make it easy to shortlist the best options.
Prat Moghe is founder and CEO of Cazena, a big data as a service vendor that is backed by Andreessen Horowitz, Formation 8, and North Bridge Partners. Moghe has more than 18 years of experience founding next-generation technology products and was most recently senior vice president of strategy, products, and marketing at IBM’s Netezza division, where he led a 400-person team that launched the latest Netezza data warehouse appliance in the wake of Netezza’s sale to IBM for $1.7 billion in 2010.