Researchers at Carnegie Mellon University are investigating how humans respond to artificial intelligence agents that sound physically present in the same room, work that could shape the future of audio-only AI systems used in smart glasses, accessibility tools and other screen-free technologies.
A team from CMU’s School of Computer Science worked with experts in the Department of Psychology and other universities to develop an interface between humans and chatbots that relies only on audio cues. They aimed to more fully engage the user by making the chatbot seem as if it were physically present.

“The question becomes, ‘If I had an AI assistant, what would happen if I made the audio component more like an actual human?’” said David Lindlbauer, an assistant professor in the Human-Computer Interaction Institute (HCII). In the end, the answer would surprise even the researchers.
Humans rely heavily on vision to communicate and interact with the environment, so a considerable amount of research has understandably focused on interfaces people can see, such as avatars or robots, said HCII Ph.D. student Yi Fei Cheng. But the necessary equipment might not always be available or suitable, so an audio-only interface could be essential in some situations, such as when using smart glasses that include microphones and cameras but no displays.
The research team used spatialization and Foley effects to create an audio-only interface. Spatialization helps place an AI agent in the room as it speaks, moves around, completes tasks or makes other noises. Foley effects are the sound effects typically added to movies and television shows in post-production. For this research, the Foley effects included typing on a laptop, riffling through papers and pouring a glass of water.
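The article does not describe how the team implemented spatialization, but the general idea can be sketched in a few lines of Python. The sketch below is a minimal, illustrative approximation: it pans a mono signal to stereo using an interaural level difference (constant-power panning) and an interaural time difference (the Woodworth approximation), which is a crude stand-in for the full head-related transfer function (HRTF) rendering that spatial-audio systems typically use. The function name and constants are illustrative assumptions, not from the study.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.09      # approximate human head radius, metres

def spatialize(samples, azimuth_deg, sample_rate=44100):
    """Pan a mono signal to stereo so it seems to come from azimuth_deg
    (-90 = hard left, 0 = front, +90 = hard right)."""
    az = math.radians(azimuth_deg)
    # Interaural time difference (Woodworth approximation, seconds)
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (az + math.sin(az))
    delay = abs(int(round(itd * sample_rate)))  # delay in whole samples
    # Constant-power panning for the interaural level difference
    pan = (az / math.pi) + 0.5  # 0 = hard left, 1 = hard right
    left_gain = math.cos(pan * math.pi / 2)
    right_gain = math.sin(pan * math.pi / 2)
    left = [s * left_gain for s in samples]
    right = [s * right_gain for s in samples]
    # The ear farther from the source hears the sound slightly later
    if azimuth_deg > 0:  # source on the right: delay the left ear
        left = [0.0] * delay + left
        right = right + [0.0] * delay
    else:                # source on the left (or centred): delay the right ear
        right = [0.0] * delay + right
        left = left + [0.0] * delay
    return left, right
```

Rendering a Foley clip (say, typing sounds) through such a function at a slowly changing azimuth is one simple way to give the impression of an agent moving around the room.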

“When a movie star sits down on a bar stool in his leather jacket, you expect a leathery rustle and a squeaky bar stool and the sounds of his hands hitting the bar,” said Laurie Heller, a psychology professor who studies auditory perception and cognition. “These sounds happen in real life, and if they aren’t part of the movie soundtrack, it doesn’t seem realistic. It doesn’t immerse you.”
To test the audio-only interface, study participants spoke to AI agents that used different combinations of spatialized and Foley effects. The participants were told to acquaint themselves with the room, such as by locating a laptop, blocks, a whiteboard, books and other items that accompanied audio effects. Then they were seated in the center of the room and told to converse with the AI agent, which proceeded to give the impression of moving about the room, typing, flipping through a book or drinking water as it chatted. Afterward, they completed questionnaires and structured interviews to share their impressions of the experience.
“We found that, yes, the audio interface made the AI assistant seem more humanlike,” Lindlbauer said. “We have statistically clear results demonstrating that adding spatial and Foley effects increases your engagement.”
But the perception of the interface as humanlike had an unexpected side effect. The users also expected this seemingly human interface to follow human social norms.
“As soon as the participants felt like their agent was engaged in something else, such as if the agent was talking and typing at the same time, or rustling papers, the participant actually felt like ‘This is not cool, my agent is not paying attention to me. My agent is distracted.’ They considered this rude,” Lindlbauer said. “To me, this seemed like a remarkably odd characterization of a computational system.”
In the study, many of the Foley effects were automated and not tied directly to the conversation between the agent and the participant. Designing the audio cues to be more aware of conversations might reduce this sense of distraction, Cheng suggested.
Though the experiment included an interface that was specific to the office environment, Lindlbauer said a final system might not need to be so specialized.
“My gut feeling is that I could design a good number of audio effects that are independent of the space, which do not require a lot of knowledge of the space, and I could still get this boost in engagement,” he said.
A participant might look toward the voice as the agent spoke, or glance at the laptop upon hearing typing, and see no one there. But this mismatch between what their eyes and ears told them didn’t seem to ruin the overall effect.
“Based on the data from this study, the sounds still had an effect on people consistent with another human being there,” Heller said.
The researchers will present their findings at the upcoming Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI 2026) in Barcelona. In addition to Lindlbauer, Heller, Cheng and Bloch, the authors include Alexander Wang, a Ph.D. student in the HCII; and colleagues at South Korea’s KAIST; the University of Sydney, Australia; and the University of Michigan.

