It’s Time to Start Paying Attention to Vector Databases

The concepts underpinning vector databases are decades old, but only relatively recently have they become the underlying “secret weapon” of the largest webscale companies that provide services like search and near real-time recommendations.

Like all good clandestine competitive tools, the vector databases that support these large companies are all purpose-built in-house, optimized for the types of similarity search operations native to their business (content, physical products, etc.).

These custom-tailored vector databases are the “unsung hero of big machine learning,” says Edo Liberty, who built tools like this at Yahoo Research during its scalable machine learning platform journey. He carried some of this over to AWS, where he ran Amazon AI labs and helped cobble together standards like AWS SageMaker, all the while learning how vector databases could integrate with other platforms and connect with the cloud.

“Vector databases are a core piece of infrastructure that fuels every big machine learning deployment in industry. There was never a way to do this directly, everyone just had to build their own in-house,” he tells The Next Platform. The funny thing is, he was working on high dimensional geometry during his PhD days; the AI/ML renaissance just happened to perfectly intersect with exactly that type of work.

“In ML, suddenly everything was being represented as these high-dimensional vectors, that quickly became a huge source of data, so if you want to search, rank, or give recommendations, the object in your actual database wasn’t a document or an image; it was this mathematical representation of the machine learning model.” In short, this quickly became important for a lot of companies.
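To make the idea of searching and ranking over vectors concrete, here is a minimal sketch in plain Python. The item names and four-dimensional vectors are made up for illustration (real models typically emit hundreds or thousands of dimensions), and cosine similarity is just one common choice of similarity measure, not necessarily what any particular vendor uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: in practice these come out of an ML model.
catalog = {
    "red sneaker": [0.9, 0.1, 0.0, 0.2],
    "blue sneaker": [0.8, 0.2, 0.1, 0.3],
    "coffee mug": [0.0, 0.9, 0.7, 0.1],
}

# Hypothetical embedding of whatever the user is looking at right now.
query = [1.0, 0.0, 0.0, 0.2]

# Rank catalog items from most to least similar to the query vector.
ranked = sorted(
    catalog,
    key=lambda name: cosine_similarity(query, catalog[name]),
    reverse=True,
)
print(ranked)
```

The point of the sketch is that the thing being compared is never the product or the image itself, only its vector.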

It is this opportunity that pushed him to build one of the only companies creating a scalable, cloud-native vector database. The result, Pinecone ($10 million in funding so far), thinks that the time is right to give more companies that underlying “secret weapon.” The idea is to let them take the traditional data warehouses, data lakes, and on-prem systems that hold all of this un-vectorized data and quickly spin up machine learning applications, skipping the ugly step of data conversion, which is truly difficult, if not almost impossible, for mid-sized companies. In fact, this is one of those things that almost has to be a fully managed service, and for several reasons.

Most important, for the average company looking to provide recommendation services like the hyperscalers do, converting data from warehouses and lakes into a high-dimensional format is not simple, and it is vastly expensive computationally, especially if one is trying to do all of that in near real time for recommendation, classification, or fraud detection. Accuracy and speed will suffer.

This is not just about data conversion, either. This is complex from both a distributed computing and an algorithmic point of view. “One of the hardest things you can do is make massive data actionable in real time for really large applications. It’s not just about converting data into that format, that’s just the first step. It needs to be put into a real-time engine and that’s difficult as well,” Liberty says when describing how the non-hyperscalers are looking at the problem.

And by the way, it’s not just the really large companies that want to follow in hyperscale search and recommendation footsteps. Liberty says small, few-person startups with huge data and ML operations are especially strapped without something that can handle vectorization at scale. He points to a startup using Pinecone for image search; they index millions of images to find similar ones in real time. That’s all done with convolutional nets that convert the images into high-dimensional vector representations, which can be indexed by the Pinecone database and searched in real time.
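The index-then-search pattern in that image example can be sketched in a few lines of Python. To be clear, this is not Pinecone's API or implementation: `TinyVectorIndex`, its `upsert` and `query` methods, the file names, and the three-dimensional vectors are all hypothetical stand-ins. It also scans every stored vector on each query, which is fine for a toy but is exactly the part that gets hard at millions of images, where production systems generally lean on approximate nearest-neighbor indexes instead.

```python
import heapq
import math


class TinyVectorIndex:
    """Brute-force in-memory index (hypothetical; not any vendor's API)."""

    def __init__(self):
        self._vectors = {}  # item id -> unit-length vector

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vector]

    def upsert(self, item_id, vector):
        """Insert the vector, or overwrite the one stored under item_id."""
        self._vectors[item_id] = self._normalize(vector)

    def query(self, vector, top_k=3):
        """Return the top_k (item_id, score) pairs by cosine similarity."""
        q = self._normalize(vector)
        # On unit-length vectors the dot product equals cosine similarity.
        scored = (
            (item_id, sum(a * b for a, b in zip(q, stored)))
            for item_id, stored in self._vectors.items()
        )
        return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


index = TinyVectorIndex()
# Stand-ins for embeddings a convolutional network might produce for images.
index.upsert("cat_1.jpg", [0.9, 0.1, 0.0])
index.upsert("cat_2.jpg", [0.8, 0.3, 0.1])
index.upsert("truck.jpg", [0.0, 0.2, 0.9])

# Embedding of a new query image (also made up).
matches = index.query([1.0, 0.1, 0.0], top_k=2)
print([item_id for item_id, _ in matches])
```

Normalizing at write time is a design choice for the sketch: it keeps each query down to one dot product per stored vector.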

Startups that handle everything from anomaly detection to time series analysis and even deduplication of data should all take note: there will likely be a mini-wave of vector database startups, although they might not use that term. Liberty agrees “vector database” is in a bit of a no man’s land in terms of where it fits in the ML ecosystem. It gets lumped in with ML Ops, data warehousing, traditional databases, and cloud. And the other weird thing is that it sounds like something that has been a standard enterprise IT tool for decades. Vectors? Databases? What could be new? Well, quite a bit.

The real trick with something that has always been built in-house and not widely discussed is getting it to work with variable infrastructure, which is why this is cloud-native. Liberty says it took a lot of footwork to figure out how to take this very internal tool and make it broadly applicable. This involved a fair bit of research on what users might need in terms of scale and accuracy, making sure resource allocation, capacity, and so on could be designed for cloud use, then building custom containerization, among other hooks. “The entire ETL we have doesn’t look like a traditional database, it looks like a pipeline of ML models that are configurable by users and all containerized,” he adds.

Overall, here’s why vector databases, especially and perhaps only as managed services, are going to have their day in the sun:

Aside from the hyperscalers who had this figured out long ago and built their own tooling, and also aside from the tiny startups with massive data and ML projects, the average company is in a tough place when it comes to full-scale integration of ML into a majority of their workflows. There are more complicated answers for why that’s difficult but in terms of the data itself there are a few trends to note. First, we are still watching a lot of companies finally make a slow jump to the cloud. At the same time, their R&D and data science folks are finding that while they can do some ML work with traditional data warehouse and lake-stored data, none of that is centralized or ready to be piped into their applications. And besides, they just spent a whole lotta money on infrastructure to do all of this.

What they are going to need is more sophisticated data pipelining to do all the fancy ML projects their R&D teams showcase. And data engineering is really tough. Even if they get that right, they need to index all that data into vector formats, since ML applications will need that and, unfortunately, those traditional databases, data lakes, and warehouses just cannot do that. Pinecone (and we point to them because they are it, as far as we see) can be a layer that sits on top of all that data warehousing investment and converts all the useful data for use in ML applications.
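As a rough picture of what "converting" warehouse data into vectors means, here is a small Python sketch. It swaps in the hashing trick (each word is hashed into one of a fixed number of buckets and counted) as a cheap stand-in for the learned ML models a real pipeline would use. The row layout, the `hashed_vector` helper, and the 16-dimension setting are all assumptions made for illustration.

```python
import hashlib

DIMENSIONS = 16  # assumption: kept tiny so the output is readable


def hashed_vector(text, dimensions=DIMENSIONS):
    """Map text to a fixed-length count vector using the hashing trick."""
    vector = [0.0] * dimensions
    for token in text.lower().split():
        # A stable hash, so the same token lands in the same bucket on every run.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vector[bucket] += 1.0
    return vector


# Hypothetical rows as they might come out of a warehouse table.
rows = [
    {"id": 101, "description": "wireless noise cancelling headphones"},
    {"id": 102, "description": "stainless steel water bottle"},
]

# One fixed-length vector per row, keyed by the row's id, ready to be indexed.
vectors = {row["id"]: hashed_vector(row["description"]) for row in rows}
print(len(vectors[101]), sum(vectors[101]))
```

Even this toy version shows why the step is costly at scale: every row has to be pushed through the conversion, and a learned model is far more expensive per row than a hash.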

Even though we don’t hear about vector databases, we encounter them with almost every online service we use regularly, from a Google query to being served ads by the best networks, to being recommended perfect-fit products. “Most companies are waking up to the idea that tech giants are all using this secret weapon no one has. We want to give that to everyone,” Liberty says.

Amith Indurthi says:

March 13, 2021 at 9:43 pm

Nice Article with Lots of Insights and Information.

james says:

April 22, 2022 at 1:11 am

don’t think pinecone is a real database, take a look at vespa or milvus

Pep says:

September 22, 2022 at 3:13 am

Nice article, thanks so much. There is another open source vector database called Nuclia DB

Barry Smith says:

July 16, 2023 at 8:24 pm

Nice article (I found through a link). Was wondering if there are any specific updates on how things have progressed since the writing.
