
Modernizing The Turing Test For 21st Century AI

To understand how far natural language processing (NLP) has progressed in the past decade and how fast it is evolving now, we need to update Alan Turing’s thought experiment on how to test an AI for conversational intelligence to a 21st Century context and methodology.

Many organizations are working to create conversational AIs that seem human across a wide range of conversational styles and levels of complexity. A key competitive and social challenge will be to create a test to measure the “humanness” of each conversational AI.

We don’t know yet what makes human intelligence and cognition work or how to effectively and reliably test people for intelligence. So, we are not going to test conversational AI for intelligence. Our goal is to test conversational AI for effectiveness – how effectively can a conversational AI convince people to do something or to change their minds about a topic?

Turing Test Limitations

Alan Turing first proposed his eponymous Turing Test in 1950 as a thought experiment, well before the modern era of computers. Turing designed the experiment to assess whether a computer could hold a conversation well enough to be mistaken for a human.

In a nutshell, the Turing Test is a blind test where:
  1. A human judge holds text conversations with two hidden participants, one human and one machine.
  2. The machine tries to respond as a human would, while the judge tries to determine which participant is the machine.
  3. If the judge cannot reliably tell the machine from the human, the machine is said to have passed.

Turing phrased his thought experiment very generally, leaving many practical experimental gaps unaddressed.

Defining The Uncanny Valley for NLP

We propose that an AI trying to pass a general conversational effectiveness test (implicitly or explicitly) must first pass through the uncanny valley.

The uncanny valley is a term typically used in visual effects to describe how humans become less tolerant of mistakes as a visual simulation more closely approaches the appearance and physical behavior of a real human. The effect is related to suspension of disbelief in entertainment: people easily tolerate major violations of physics and physiology when watching cartoons, but as simulated actors become more realistic, people expect visual physics and physiology to behave correctly.

As a simulation more accurately depicts human looks and physiology, it can initially be mistaken for human. If it is just slightly wrong, a human viewer will eventually figure that out as the sum of the small anomalies becomes disturbing. Crossing the uncanny valley means that the errors become so small that people cannot detect that the visual is not a real person. The word “deepfake” was recently coined to describe such hyper-realistic simulations of humans; it also captures the sense of betrayal that people feel upon discovering they can no longer tell the difference between real and simulated humans.

For NLP, we define the uncanny valley as the point at which a human listener or reader realizes that they had started to anthropomorphize the AI, but then something in the AI’s response went wrong and betrayed its non-humanness. The listener may not be able to pinpoint what is wrong, but they perceive that the conversation or story has gone sideways and that the AI is no longer making complete sense or responding appropriately. In a blind test, a human listener might simply conclude that the AI is an erratic or unstable human.

IBM Project Debater Suggests Using A Debate Format

We like the idea of using a debate format to assess a binary choice, because it cleanly separates the first-person conversation between debaters from third-party assessment of how each debater performs. IBM and Intelligence Squared modified Intelligence Squared’s standard debate format for IBM’s Project Debater live debate on February 11, 2019. The format used for Project Debater takes about 25 minutes per test: each debater delivers four minutes of opening argument, four minutes of rebuttal, and two minutes of closing summary, for ten minutes per debater and 20 minutes of speaking in total. We believe this can likely be shortened considerably with additional research.

To be clear, IBM explicitly did not set out to create an alternative to the Turing Test. But we like IBM’s style in selecting a debate format to train and launch its conversational AI, so we’re co-opting the idea.

There are many ways to subjectively grade a debate, such as asking whether a Tester preferred one speaking style over another. We chose to measure whether a Tester changed their opinion on the debate topic because it is a simple binary answer that measures the effectiveness of both debaters’ arguments on each Tester.

For a debate format general conversational effectiveness test, there is a set-up phase and a delivery phase.

Set Up

  1. A set of debate topics is created, covering a range of contexts and levels of conversational complexity. Variation in the complexity of debate topics is likely a rich research topic all by itself, both for studying human conversational complexity and for this test.
  2. Each debate is recorded in private.
  3. For each topic, three debate permutations are recorded:
    1. Two humans debate the topic (control group 1)
    2. Two versions of the same AI that’s being tested debate the topic (control group 2)
    3. One human and the test AI debate the topic (test group)
  4. A transcript is created for each debate permutation. Each transcript may be translated into multiple languages, dialects, and accents for playback during the delivery phase (a minimal data-model sketch of this set-up follows this list).
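
To make the set-up concrete, here is a minimal sketch in Python of how the debate topics and their three recorded permutations might be represented. The class and field names are our illustrative assumptions, not part of the test definition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class DebaterKind(Enum):
    HUMAN = "human"
    AI = "ai"


class Permutation(Enum):
    HUMAN_VS_HUMAN = "control_group_1"   # two humans debate the topic
    AI_VS_AI = "control_group_2"         # two instances of the test AI debate the topic
    HUMAN_VS_AI = "test_group"           # one human debates the test AI


@dataclass
class RecordedDebate:
    topic: str
    permutation: Permutation
    side_a: DebaterKind
    side_b: DebaterKind
    transcript: str = ""
    # Optional translations of the transcript, keyed by language/dialect tag (e.g. "en-GB")
    translations: Dict[str, str] = field(default_factory=dict)


def build_permutations(topic: str) -> List[RecordedDebate]:
    """Create the three recordings required for one debate topic."""
    return [
        RecordedDebate(topic, Permutation.HUMAN_VS_HUMAN, DebaterKind.HUMAN, DebaterKind.HUMAN),
        RecordedDebate(topic, Permutation.AI_VS_AI, DebaterKind.AI, DebaterKind.AI),
        RecordedDebate(topic, Permutation.HUMAN_VS_AI, DebaterKind.HUMAN, DebaterKind.AI),
    ]
```

Running build_permutations over the full topic set yields the recording plan that feeds the delivery phase.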

Delivery

  1. A recorded debate will be played to each Tester individually, over the Internet.
  2. Optional: demographic and psychographic data can be collected for each Tester prior to the start of each test.
  3. For each test, apparent twins face off to debate each other (same face, voice, clothes, and so on). These twins (the Debaters) are video simulations, designed to be understandable to each individual Tester. Using synthetic video and voice twin Debaters eliminates many simple but subtle human biases that might affect each Tester’s perception of the credibility and persuasiveness of either Debater. To each Tester, it looks like one person is arguing both sides of a debate topic.
  4. Before the start of playback, each Tester will be asked their position on the debate topic.
  5. After the end of playback, each Tester will be asked if they changed their position on the debate topic, as is typical for grading debates. This is the core of our objective measurement of AI conversational effectiveness.
  6. Each Tester will then be asked to identify which of the Debaters represented a human and which represented an AI, including the explicit possibility that both or neither of the Debaters might have represented an AI. This is the core of our objective measurement of human ability to conversationally identify an AI (a scoring sketch follows this list).
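
To show how the two core measurements could be scored from Tester responses, here is a minimal Python sketch. The TesterResponse layout and field names are illustrative assumptions on our part, not part of the test definition.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class TesterResponse:
    tester_id: str
    position_before: str        # stance before playback, e.g. "for" or "against"
    position_after: str         # stance after playback
    ai_guess: FrozenSet[str]    # sides the Tester believes were AI: {"A"}, {"B"}, {"A", "B"}, or empty
    ai_truth: FrozenSet[str]    # sides that actually represented the AI in this recording


def persuasion_rate(responses: List[TesterResponse]) -> float:
    """Fraction of Testers who changed their position on the debate topic."""
    if not responses:
        return 0.0
    changed = sum(r.position_before != r.position_after for r in responses)
    return changed / len(responses)


def identification_accuracy(responses: List[TesterResponse]) -> float:
    """Fraction of Testers who exactly identified which Debaters represented an AI,
    including the 'both' and 'neither' cases."""
    if not responses:
        return 0.0
    correct = sum(r.ai_guess == r.ai_truth for r in responses)
    return correct / len(responses)
```

Computed per debate permutation, persuasion_rate captures the opinion-change measurement in step 5, and identification_accuracy captures the AI-identification measurement in step 6.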

At the end of a series of test debates, the results can be aggregated along both measures for the specific group of people who took the test.

A general conversational effectiveness test is loaded with observational context, because everyone has a subjective point of view. Evaluating test results will have to account for variability in audience demographics and psychographics, because objectively measuring whether a conversational AI can pass for human depends on the subjective state of each Tester observing a debate.
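
One way to account for that variability, assuming each Tester record carries a simple demographic or psychographic segment label (an assumption for illustration, not a requirement of the test), is to break the two measurements out by segment:

```python
from collections import defaultdict
from typing import Dict, List


def rates_by_segment(results: List[dict]) -> Dict[str, dict]:
    """Break persuasion and identification rates out by Tester segment.

    Each result record is assumed to look like:
      {"segment": "18-29", "changed_position": True, "identified_ai": False}
    """
    totals = defaultdict(lambda: {"n": 0, "changed": 0, "identified": 0})
    for record in results:
        bucket = totals[record["segment"]]
        bucket["n"] += 1
        bucket["changed"] += int(record["changed_position"])
        bucket["identified"] += int(record["identified_ai"])
    return {
        segment: {
            "persuasion_rate": bucket["changed"] / bucket["n"],
            "identification_rate": bucket["identified"] / bucket["n"],
        }
        for segment, bucket in totals.items()
    }
```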

Our methodology does not depend on any Tester asking insightful questions – remember that we are not testing for intelligence, either on the part of the Tester or an AI Debater. Our methodology depends only upon each Tester’s listening skills and whether each Tester has formed an opinion of the debate topic before the start of each test.

In a modern social media context, this type of test will scale well, enabling simultaneous testing of large numbers of people, many debate topics, and many levels of debate complexity. General conversational effectiveness testing can also be gamified and performed over large social media networks.

Also, testing two different AIs against each other in another test group is likely to provide valuable technical and competitive data about the relative performance of the tested AIs. This may be useful in comparing NLP techniques from different companies, but will likely see more use in comparing competing NLP alternatives within an organization or between research groups.
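
As a rough sketch of that AI-versus-AI comparison, assuming each Tester response records which AI argued which side (again, an illustrative record layout of our own), one simple relative measure is to credit whichever AI argued the side a persuaded Tester moved toward:

```python
from typing import Dict, List


def head_to_head_wins(responses: List[dict]) -> Dict[str, int]:
    """Tally which AI argued the side each persuaded Tester moved toward.

    Each response record is assumed to look like:
      {"before": "against", "after": "for", "side_for": "AI-X", "side_against": "AI-Y"}
    """
    tally: Dict[str, int] = {"no_change": 0}
    for record in responses:
        if record["before"] == record["after"]:
            tally["no_change"] += 1
            continue
        winner = record["side_for"] if record["after"] == "for" else record["side_against"]
        tally[winner] = tally.get(winner, 0) + 1
    return tally
```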

Challenges And Future Work

We tried to identify pros and cons of our recommended methodology in the text above, but there are a couple of meta-considerations worth mentioning.

This style of test is a two-way test: it will measure individual humans’ conversational comprehension skills and positions on a wide range of tested topics. A series of tests will measure each Tester’s comprehension across an aggregate of different debates. The series will also provide data on each Tester’s attitude toward the debated topics, plus each Tester’s likelihood of changing beliefs in response to persuasive Debaters.

There are obvious privacy considerations to fielding tests within and across geopolitical boundaries. There are also ethical considerations to selecting debate topics, because careful selection of topics can be used to collect significant psychographic data describing the political and social inclinations of the individuals taking the test.

This is not a real-time conversational test. If people don’t know they are listening to an AI, how long does it take them to realize they are not talking to a human? Today a person might simply believe they are talking to another somewhat broken person: not very well informed, emotionally stunted, or perhaps of low intelligence. If this seems absurd, consider that many people already interact with automated first-level call center support and never realize they are talking with conversational software.

There are already ethical considerations for conversational AI systems that do not identify themselves as such. At some point, developers may be required to implant easily testable patterns and responses into conversational AI systems, so that most people can easily discover they are talking to an AI. An opposing path might lead us to Blade Runner’s Deckard, requiring sophisticated software and interrogation techniques to tell the difference between human and AI.

Changing The Conversation On AI

If machine cognition is possible, then our methodology should be able to measure the resulting impact on AI conversational skills. Many people, including some AI researchers, believe that AI will never be capable of human-level cognition (however that may be defined), and therefore that AI will never be able to pass a general conversational effectiveness test. Our test tries not to make any assumptions about the possibility or impossibility of machine cognition, the pace of AI technology development, or the pace of human development.

It may be that AI can only fool some of the people some of the time, and it stops there. It may turn out that AI eventually becomes capable of fooling all the people all the time. We hope test methodologies such as this will help humans figure out if or when AI passes that point.
