
Modernizing The Turing Test For 21st Century AI

To understand how far natural language processing (NLP) has progressed in the past decade and how fast it is evolving now, we need to update Alan Turing’s thought experiment on how to test an AI for conversational intelligence to a 21st Century context and methodology.

Many organizations are working to create conversational AIs that seem human across a wide range of conversational styles and levels of complexity. A key competitive and social challenge will be to create a test to measure the “humanness” of each conversational AI.

We don’t know yet what makes human intelligence and cognition work or how to effectively and reliably test people for intelligence. So, we are not going to test conversational AI for intelligence. Our goal is to test conversational AI for effectiveness – how effectively can a conversational AI convince people to do something or to change their minds about a topic?

Turing Test Limitations

Alan Turing first proposed his eponymous Turing Test in 1950 as a thought experiment, well before the modern era of computers. Turing designed the experiment to assess whether a computer could hold a conversation well enough to be mistaken for a human.

In a nutshell, the Turing Test is a blind test where:
  1. A human judge holds text conversations with two hidden participants, one human and one machine.
  2. The machine tries to respond as a human would, while the judge tries to determine which participant is the machine.
  3. If the judge cannot reliably tell the machine from the human, the machine is said to have passed.

Turing phrased his thought experiment very generally, leaving many practical experimental gaps unaddressed.

Defining The Uncanny Valley for NLP

We propose that an AI trying to pass a general conversational effectiveness test (implicitly or explicitly) must first pass through the uncanny valley.

The uncanny valley is a term typically used in visual effects to describe how humans become less tolerant of mistakes as a visual simulation more closely approaches the appearance and physical behavior of a real human. The effect is related to suspension of disbelief in entertainment: people easily tolerate major violations of physics and physiology when watching cartoons, but as simulated actors become more realistic, people expect visual physics and physiology to behave correctly.

As a simulation more accurately depicts human looks and physiology, it can initially be mistaken for human. If it is just slightly wrong, a human viewer will eventually figure that out as the sum of the small anomalies becomes disturbing. Crossing the uncanny valley means that the errors become so small that people cannot detect that the visual is not a real person. The word “deepfake” was recently coined to describe such hyper-realistic simulations of humans; it also captures the sense of betrayal that people feel upon discovering they can no longer tell the difference between real and simulated humans.

For NLP, we define the uncanny valley as the point at which a human listener or reader realizes that they had started to anthropomorphize the AI, but then something in the AI’s response went wrong and betrayed its non-humanness. The listener may not be able to pinpoint what is wrong, but they perceive that the conversation or story has gone sideways and that the AI is no longer making complete sense or responding appropriately. In a blind test, a human listener might simply conclude that the AI is an erratic or unstable human.

IBM Project Debater Suggests Using A Debate Format

We like the idea of using a debate format to assess a binary choice, because it cleanly separates the first-person conversation between debaters from third-party assessment of how each debater performs. IBM and Intelligence Squared modified Intelligence Squared’s standard debate format for IBM’s Project Debater live debate on February 11, 2019. The format used for Project Debater takes about 25 minutes per test: each debater delivers four minutes of opening argument, four minutes of rebuttal, and two minutes of closing summary, for ten minutes per debater and 20 minutes of speaking in total. We believe this can likely be shortened considerably with additional research.

To be clear, IBM explicitly did not set out to create an alternative to the Turing Test. But we like IBM’s style in selecting a debate format to train and launch its conversational AI, so we’re co-opting the idea.

There are many ways to subjectively grade a debate, such as asking whether a Tester preferred one speaking style over another. We chose to measure whether a Tester changed their opinion on the debate topic because it is a simple binary answer that measures the effectiveness of both debaters’ arguments on each Tester.

For a debate format general conversational effectiveness test, there is a set-up phase and a delivery phase.

Set Up

  1. A set of debate topics is created, covering a range of contexts and levels of conversational complexity. Variation in the complexity of debate topics is likely a rich research topic all by itself, both for studying human conversational complexity and for this test.
  2. Each debate is recorded in private.
  3. For each topic, three debate permutations are recorded:
    1. Two humans debate the topic (control group 1)
    2. Two versions of the same AI that’s being tested debate the topic (control group 2)
    3. One human and the test AI debate the topic (test group)
  4. A transcript is created for each debate permutation. Each transcript may be translated into multiple languages, dialects, and accents for playback during the delivery phase (a minimal data-model sketch of this set-up follows this list).
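
To make the set-up concrete, here is a minimal sketch in Python of how the debate topics and their three recorded permutations might be represented. The class and field names are our illustrative assumptions, not part of the test definition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class DebaterKind(Enum):
    HUMAN = "human"
    AI = "ai"


class Permutation(Enum):
    HUMAN_VS_HUMAN = "control_group_1"   # two humans debate the topic
    AI_VS_AI = "control_group_2"         # two instances of the test AI debate the topic
    HUMAN_VS_AI = "test_group"           # one human debates the test AI


@dataclass
class RecordedDebate:
    topic: str
    permutation: Permutation
    side_a: DebaterKind
    side_b: DebaterKind
    transcript: str = ""
    # Optional translations of the transcript, keyed by language/dialect tag (e.g. "en-GB")
    translations: Dict[str, str] = field(default_factory=dict)


def build_permutations(topic: str) -> List[RecordedDebate]:
    """Create the three recordings required for one debate topic."""
    return [
        RecordedDebate(topic, Permutation.HUMAN_VS_HUMAN, DebaterKind.HUMAN, DebaterKind.HUMAN),
        RecordedDebate(topic, Permutation.AI_VS_AI, DebaterKind.AI, DebaterKind.AI),
        RecordedDebate(topic, Permutation.HUMAN_VS_AI, DebaterKind.HUMAN, DebaterKind.AI),
    ]
```

Running build_permutations over the full topic set yields the recording plan that feeds the delivery phase.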

Delivery

  1. A recorded debate will be played to each Tester individually, over the Internet.
  2. Optional: demographic and psychographic data can be collected for each Tester prior to the start of each test.
  3. For each test, apparent twins face off to debate each other (same face, voice, clothes, and so on). These twins (the Debaters) are video simulations, designed to be understandable to each individual Tester. Using synthetic video and voice twin Debaters eliminates many simple but subtle human biases that might affect each Tester’s perception of the credibility and persuasiveness of either Debater. To each Tester, it looks like one person is arguing both sides of a debate topic.
  4. Before the start of playback, each Tester will be asked their position on the debate topic.
  5. After the end of playback, each Tester will be asked if they changed their position on the debate topic, as is typical for grading debates. This is the core of our objective measurement of AI conversational effectiveness.
  6. Each Tester will then be asked to identify which of the Debaters represented a human and which represented an AI, including the explicit possibility that both or neither of the Debaters might have represented an AI. This is the core of our objective measurement of human ability to conversationally identify an AI (a scoring sketch follows this list).
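
To show how the two core measurements could be scored from Tester responses, here is a minimal Python sketch. The TesterResponse layout and field names are illustrative assumptions on our part, not part of the test definition.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class TesterResponse:
    tester_id: str
    position_before: str        # stance before playback, e.g. "for" or "against"
    position_after: str         # stance after playback
    ai_guess: FrozenSet[str]    # sides the Tester believes were AI: {"A"}, {"B"}, {"A", "B"}, or empty
    ai_truth: FrozenSet[str]    # sides that actually represented the AI in this recording


def persuasion_rate(responses: List[TesterResponse]) -> float:
    """Fraction of Testers who changed their position on the debate topic."""
    if not responses:
        return 0.0
    changed = sum(r.position_before != r.position_after for r in responses)
    return changed / len(responses)


def identification_accuracy(responses: List[TesterResponse]) -> float:
    """Fraction of Testers who exactly identified which Debaters represented an AI,
    including the 'both' and 'neither' cases."""
    if not responses:
        return 0.0
    correct = sum(r.ai_guess == r.ai_truth for r in responses)
    return correct / len(responses)
```

Computed per debate permutation, persuasion_rate captures the opinion-change measurement in step 5, and identification_accuracy captures the AI-identification measurement in step 6.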

At the end of a series of test debates, the results can be aggregated along both measures for the specific group of people who took the test.

A general conversational effectiveness test is loaded with observational context, because everyone has a subjective point of view. Evaluating test results will have to account for variability in audience demographics and psychographics, because objectively measuring whether a conversational AI can pass for human depends on the subjective state of each Tester observing a debate.
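
One way to account for that variability, assuming each Tester record carries a simple demographic or psychographic segment label (an assumption for illustration, not a requirement of the test), is to break the two measurements out by segment:

```python
from collections import defaultdict
from typing import Dict, List


def rates_by_segment(results: List[dict]) -> Dict[str, dict]:
    """Break persuasion and identification rates out by Tester segment.

    Each result record is assumed to look like:
      {"segment": "18-29", "changed_position": True, "identified_ai": False}
    """
    totals = defaultdict(lambda: {"n": 0, "changed": 0, "identified": 0})
    for record in results:
        bucket = totals[record["segment"]]
        bucket["n"] += 1
        bucket["changed"] += int(record["changed_position"])
        bucket["identified"] += int(record["identified_ai"])
    return {
        segment: {
            "persuasion_rate": bucket["changed"] / bucket["n"],
            "identification_rate": bucket["identified"] / bucket["n"],
        }
        for segment, bucket in totals.items()
    }
```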

Our methodology does not depend on any Tester asking insightful questions – remember that we are not testing for intelligence, either on the part of the Tester or an AI Debater. Our methodology depends only upon each Tester’s listening skills and whether each Tester has formed an opinion of the debate topic before the start of each test.

In a modern social media context, this type of test will scale well, enabling simultaneous testing of large numbers of people, many debate topics, and many levels of debate complexity. General conversational effectiveness testing can also be gamified and performed over large social media networks.

Also, testing two different AIs against each other in another test group is likely to provide valuable technical and competitive data about the relative performance of the tested AIs. This may be useful in comparing NLP techniques from different companies, but will likely see more use in comparing competing NLP alternatives within an organization or between research groups.
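
As a rough sketch of that AI-versus-AI comparison, assuming each Tester response records which AI argued which side (again, an illustrative record layout of our own), one simple relative measure is to credit whichever AI argued the side a persuaded Tester moved toward:

```python
from typing import Dict, List


def head_to_head_wins(responses: List[dict]) -> Dict[str, int]:
    """Tally which AI argued the side each persuaded Tester moved toward.

    Each response record is assumed to look like:
      {"before": "against", "after": "for", "side_for": "AI-X", "side_against": "AI-Y"}
    """
    tally: Dict[str, int] = {"no_change": 0}
    for record in responses:
        if record["before"] == record["after"]:
            tally["no_change"] += 1
            continue
        winner = record["side_for"] if record["after"] == "for" else record["side_against"]
        tally[winner] = tally.get(winner, 0) + 1
    return tally
```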

Challenges And Future Work

We tried to identify pros and cons of our recommended methodology in the text above, but there are a couple of meta-considerations worth mentioning.

This style of test is a two-way test: it will measure individual humans’ conversational comprehension skills and positions on a wide range of tested topics. A series of tests will measure each Tester’s comprehension across an aggregate of different debates. The series will also provide data on each Tester’s attitude toward the debated topics, plus each Tester’s likelihood of changing beliefs in response to persuasive Debaters.

There are obvious privacy considerations to fielding tests within and across geopolitical boundaries. There are also ethical considerations to selecting debate topics, because careful selection of topics can be used to collect significant psychographic data describing the political and social inclinations of the individuals taking the test.

This is not a real-time conversational test. If people don’t know they are listening to an AI, how long does it take them to realize they are not talking to a human? Today a person might simply believe they are talking to another somewhat broken person: not very well informed, emotionally stunted, or perhaps of low intelligence. If this seems absurd, consider that many people already interact with automated first-level call center support and never realize they are talking with conversational software.

There are already ethical considerations for conversational AI systems that do not identify themselves as such. At some point, developers may be required to implant easily testable patterns and responses into conversational AI systems, so that most people can easily discover they are talking to an AI. An opposing path might lead us to Blade Runner’s Deckard, requiring sophisticated software and interrogation techniques to tell the difference between human and AI.

Changing The Conversation On AI

If machine cognition is possible, then our methodology should be able to measure the resulting impact on AI conversational skills. Many people, including some AI researchers, believe that AI will never be capable of human-level cognition (however that may be defined), and therefore that AI will never be able to pass a general conversational effectiveness test. Our test tries not to make any assumptions about the possibility or impossibility of machine cognition, the pace of AI technology development, or the pace of human development.

It may be that AI can only fool some of the people some of the time, and it stops there. It may turn out that AI eventually becomes capable of fooling all the people all the time. We hope test methodologies such as this will help humans figure out if or when AI passes that point.
