Enhancing Expressiveness in Vocal Conversational Agents through Large Language Model-Generated Speech Synthesis Markup Language

Authors

Salisbury, J.

DOI:

https://doi.org/10.32473/flairs.38.1.138814

Keywords:

Expressive Speech Synthesis, Large Language Models, Human-Robot Interaction, Behavior Generation, Generative AI

Abstract

Advancements in speech synthesis have enabled more natural and engaging conversational agents, including neural text-to-speech models that can adjust speech inflections to produce distinct vocal styles. For example, Azure Neural Voices can adjust speech using Speech Synthesis Markup Language (SSML) style tags, such as “affectionate,” “cheerful,” and “hopeful.” However, determining when to apply these tags in real-time interactions can be challenging and time-consuming. In this paper, we present a prompt-based approach that enables large language models (LLMs) to dynamically stylize their responses with appropriate SSML tags, enhancing synthesized speech expressiveness across 34 different styles. Using targeted probes designed to elicit specific speech styles, we demonstrate that LLM-generated responses are syntactically well-formed and correctly apply style tags to enhance expressiveness. This simple, customizable approach facilitates the rapid development of expressive vocal conversational agents.
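As an illustration of the kind of markup the paper targets, an LLM response stylized with one of the Azure Neural Voices style tags mentioned in the abstract (e.g., "cheerful") might look like the following SSML fragment. The voice name and sentence here are hypothetical examples, not taken from the paper:

```xml
<!-- Illustrative SSML: the response text is wrapped in an
     mstts:express-as tag so the neural voice renders it in a
     "cheerful" style. Voice name and wording are examples only. -->
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      That sounds like a wonderful plan! I can't wait to hear how it goes.
    </mstts:express-as>
  </voice>
</speak>
```

In the approach the abstract describes, the LLM is prompted to emit markup of this form directly, so its responses can be passed to the speech synthesizer without a separate style-selection step.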

Published

14-05-2025

How to Cite

Salisbury, J. (2025). Enhancing Expressiveness in Vocal Conversational Agents through Large Language Model-Generated Speech Synthesis Markup Language. The International FLAIRS Conference Proceedings, 38(1). https://doi.org/10.32473/flairs.38.1.138814

Issue

38(1)

Section

Special Track: Applied Natural Language Processing