Keeping Large Language Models From Running Off The Rails

The heady, exciting days of ChatGPT and other generative AI and large language models (LLMs) are beginning to give way to the understanding that enterprises will need to get a tight grasp on how these models are being used in their operations or they will risk privacy, security, legal, and other problems down the road.

Given how aggressively Microsoft is pushing OpenAI’s various offerings into its own portfolio of enterprise and consumer products and services, and the accelerated pace at which Google and others are developing such technologies, organizations will need to get ahead of the problem or find themselves stamping out potentially dangerous fires.

They’re now getting some help to do this.

CalypsoAI, a five-year-old startup launched by people who worked at such places as NASA, DARPA, and the Department of Defense and that develops tools to help companies and governments more safely use AI in their environments, last week unveiled its Moderator product to help organizations get control of the LLMs they’re bringing into their environments.

The aim is to help enterprises get the benefits that LLMs offer while managing the risks, including data exfiltration, bad information – also known as “hallucinations” – malware, spyware, and a lack of auditability. Mismanaged, generative AI tools can expose a company’s data or let the bad guys in.

“Our goal is to provide a solution that enables organizations to harness the full potential of LLMs, while also protecting against potential exposure and security threats,” CalypsoAI CEO Neil Serebryany said in a statement.

Now comes Nvidia, a longtime vocal proponent and enabler of AI in the enterprise, which has rolled out NeMo Guardrails, an open source toolkit that essentially monitors the conversation – or chat – between the user and the AI application to ensure that it adheres to rules created by the organization and doesn’t veer into areas that could expose sensitive information, create avenues for cyberattacks, or violate company norms.

The Python-based dialog engine can prevent LLMs from executing malicious code or making calls to an outside application that poses a security risk. It can also be used to help ensure that responses from the LLM are factual and based on credible sources, making it more difficult for hackers to get into the AI systems and put in false, inappropriate, or biased information.

“The concept of AI safety is obviously getting a lot of attention these days, so things like hallucinations, generation of toxic content, misinformation, all these are cases where you might want to guide a conversation away from or towards certain kinds of responses,” Jonathan Cohen, vice president of applied research at Nvidia, said during a virtual press and analyst briefing. “The concept of security is becoming more and more important as large language models are allowed to connect to third-party APIs and applications. This can become a very attractive surface for cybersecurity threats. Whenever you allow a language model to actually execute some action in the world, you want to monitor what requests are being sent to that language model and what that language model is doing in response and provide a place to implement all of these sorts of checks that would indicate different kinds of attack and security threats.”

Guardrails, Cohen said, “monitors the conversation in both directions. Under the hood, it is actually a sophisticated contextual dialog engine, so it tracks the state of the conversation – who said what, what are we talking about now, what were we talking about before – and it provides a programmable way for developers to implement guardrails of all these different types.”

An ongoing worry in enterprises is that a developer may expose confidential information when developing AI apps or that an LLM-based service – such as one used by HR to answer employee questions about benefits – may do the same during a conversation with a worker.

Nvidia Guardrails – like those that line highways – is designed to keep the chat from venturing into those dangerous areas. Guardrails is part of Nvidia’s NeMo framework, which itself is part of the Nvidia AI platform and is a cloud-native tool for building, customizing, and deploying generative AI models that can have billions of parameters.

Using Guardrails, developers can create programmable rules for interactions between the user and the AI app. It supports LangChain, a collection of toolkits that include templates and patterns that tie together LLMs, APIs, and other software, and adds another layer of security and trustworthiness on top of it. Guardrails sits between the user and the AI app or between the user and LangChain (which in turn sits between Guardrails and the app).

Nvidia’s tool is built on Colang, a modeling language for conversational AI that comes with a runtime and is designed to use natural language to define the behavior of chatbots. Developers can create guardrails by defining flows in a Colang file. The guardrails include canonical forms (to determine the topic of the conversation and match it to the rules), messages (for classifying user intent), and flows (the messages and actions between the user and AI app).
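To make the three building blocks concrete, here is a minimal Python sketch of how example messages, canonical forms, and flows could relate to one another. It is not Colang and not Nvidia's implementation: the names `USER_MESSAGES`, `FLOWS`, `BOT_MESSAGES`, `canonical_form`, and `respond` are hypothetical, the HR-style examples are invented for illustration, and the simple word-overlap matching is an assumption standing in for whatever matching the real runtime performs.

```python
import re

# Hypothetical example utterances ("messages"), grouped by canonical form.
USER_MESSAGES = {
    "express greeting": ["hello", "hi there", "good morning"],
    "ask about benefits": ["what is my vacation allowance",
                           "how do health benefits work"],
    "ask about salaries": ["how much does my manager earn",
                           "what is my coworker's salary"],
}

# Hypothetical flows: which bot canonical form follows each user canonical form.
FLOWS = {
    "express greeting": "express greeting",
    "ask about benefits": "answer from benefits handbook",
    "ask about salaries": "refuse to discuss salaries",
}

# Hypothetical bot messages for each bot canonical form.
BOT_MESSAGES = {
    "express greeting": "Hello! How can I help you today?",
    "answer from benefits handbook": "Let me look that up in the handbook.",
    "refuse to discuss salaries": "Sorry, I can't discuss other people's pay.",
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def canonical_form(utterance):
    """Return the canonical form whose examples best overlap the utterance.

    Uses Jaccard word overlap as a stand-in similarity measure (an assumption
    made for this sketch). Returns None when nothing overlaps at all.
    """
    words = tokenize(utterance)
    best_form, best_score = None, 0.0
    for form, examples in USER_MESSAGES.items():
        for example in examples:
            example_words = tokenize(example)
            union = words | example_words
            score = len(words & example_words) / len(union) if union else 0.0
            if score > best_score:
                best_form, best_score = form, score
    return best_form


def respond(utterance):
    """Follow the matching flow, or return None if no rail applies."""
    form = canonical_form(utterance)
    if form is None:
        return None  # in a real system the request would go on to the LLM
    return BOT_MESSAGES[FLOWS[form]]
```

In this sketch, `respond("How much does my manager earn?")` follows the salary flow and returns the refusal message, while an utterance that matches no example returns `None`, the point at which a real deployment would hand the request to the underlying model.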

Guardrails are categorized under three areas. Topical guardrails aim to keep the conversation on topic, safety guardrails protect against misinformation and inappropriate content, and security guardrails make sure the LLM doesn’t run malicious code.
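The three categories can be pictured as separate checks wrapped around a model call. The sketch below is only an illustration under stated assumptions: the function names, the keyword lists, and the `requested_action` parameter are all hypothetical, the model is any callable that maps a prompt string to a reply string, and none of this reflects how NeMo Guardrails is actually implemented. In a real system the action request would come from the model's own output rather than from the caller.

```python
import re
from typing import Callable, Optional

# Hypothetical policy data for an HR benefits assistant.
ALLOWED_TOPICS = {"benefits", "vacation", "insurance"}
BLOCKED_PHRASES = {"idiot", "stupid"}
ALLOWED_ACTIONS = {"lookup_benefits"}


def topical_rail(user_text: str) -> Optional[str]:
    """Topical check: refuse when no allowed topic word appears in the input."""
    words = set(re.findall(r"[a-z]+", user_text.lower()))
    if not words & ALLOWED_TOPICS:
        return "I can only help with questions about employee benefits."
    return None


def security_rail(action_name: str) -> Optional[str]:
    """Security check: deny any action that is not on the allowlist."""
    if action_name not in ALLOWED_ACTIONS:
        return f"Action '{action_name}' is not permitted."
    return None


def safety_rail(bot_text: str) -> Optional[str]:
    """Safety check: replace a reply that contains a blocked phrase."""
    lowered = bot_text.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return "I'm sorry, I can't share that response."
    return None


def guarded_chat(user_text: str,
                 model: Callable[[str], str],
                 requested_action: Optional[str] = None) -> str:
    """Run input checks, call the model, then check its output."""
    refusal = topical_rail(user_text)
    if refusal:
        return refusal
    if requested_action is not None:
        denial = security_rail(requested_action)
        if denial:
            return denial
    reply = model(user_text)
    return safety_rail(reply) or reply
```

Because the checks run both before and after the model call, an off-topic question never reaches the model at all, while an inappropriate reply is caught on the way back to the user.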

The toolkit uses LLMs itself.

“The simplest kind of fact check is you just ask another large language model,” Cohen said. “This language model produced this result based on this data. Is it factually accurate? The reason why that is a good idea is you could actually have a language model that you have customized and improved specifically to be a fact checker. There are very general-purpose language models. There is also a lot of value in training the language model with a lot of data in a very specific task. We have a lot of evidence and the community has a lot of evidence that when you fine-tune these models with lots of examples, it actually can perform much better. That’s the concept here. Rather than forcing someone via, let’s say, prompt engineering in a typical way you think about it, to prompt the language model to fact check and stay on rails and avoid certain topics and whatever, you can actually have another system that might call language models.”

Having a specialized model also can be more efficient. The Guardrails engine monitoring the conversation is relatively computationally inexpensive, but it does rely on using LLMs for checks. A model designed just for fact checking could be less expensive than a general LLM, he said.
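The "ask another model" pattern Cohen describes can be sketched in a few lines of Python. This is a hedged illustration, not the toolkit's code: `fact_checked_answer`, `FACT_CHECK_PROMPT`, the prompt wording, and the yes/no convention are all assumptions, and both models are represented as plain callables that take a prompt string and return a string, so a smaller specialized checker could be swapped in without changing the surrounding logic.

```python
from typing import Callable

# A model is assumed to be any callable mapping a prompt to a text reply.
Model = Callable[[str], str]

# Hypothetical prompt template for the checking model.
FACT_CHECK_PROMPT = (
    "Evidence:\n{evidence}\n\n"
    "Claim:\n{claim}\n\n"
    "Is the claim supported by the evidence? Answer yes or no."
)


def fact_checked_answer(question: str,
                        evidence: str,
                        generator: Model,
                        checker: Model,
                        fallback: str = "I'm not sure about that.") -> str:
    """Generate an answer, then ask a second model whether it is supported.

    Returns the generated answer only if the checker's verdict starts with
    "yes"; otherwise returns the fallback message.
    """
    answer = generator(
        f"Using only this evidence:\n{evidence}\n\nQuestion: {question}"
    )
    verdict = checker(FACT_CHECK_PROMPT.format(evidence=evidence, claim=answer))
    if verdict.strip().lower().startswith("yes"):
        return answer
    return fallback
```

Keeping the checker behind its own callable mirrors the cost argument above: the orchestration code is cheap, and the expense depends on which model is plugged in as `checker`.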

As far as hardware, the Colang runtime runs on a CPU and then connects to LLMs for various disparate cases. The hardware an enterprise will need for Guardrails depends on the service they’re calling, he said. Essentially, whatever LangChain supports, Nvidia works with.

Cohen was asked why Nvidia is adding a separate layer like Guardrails rather than simply building the various conversational constraints into the training of the LLM itself. He acknowledged that this method – called “instruction following” or “human alignment” – can work.

“However, if you’re going to use the language model in practice, you really want multiple layers of reliability,” he said. “The guardrail system sits on top of the language model and is another check that you can write a rule, ‘If someone says something insulting, do this.’ By layering all these systems on top of each other, you can create a bit more reliability. Training a language model to respond in a certain way is important, but there’s a lot of value in having a programmable system where you can explicitly write the rules that you want. You can change them dynamically.”

Guardrails is now available in NeMo, which itself is available on GitHub, and also as part of Nvidia’s AI Foundations family of enterprise-level generative AI cloud services.
