Enhancing Expressiveness in Vocal Conversational Agents through Large Language Model-Generated Speech Synthesis Markup Language
DOI:
https://doi.org/10.32473/flairs.38.1.138814
Keywords:
Expressive Speech Synthesis, Large Language Models, Human-Robot Interaction, Behavior Generation, Generative AI
Abstract
Advancements in speech synthesis have enabled more natural and engaging conversational agents, including neural text-to-speech models that can adjust speech inflections to produce distinct vocal styles. For example, Azure Neural Voices can adjust speech using Speech Synthesis Markup Language (SSML) style tags, such as “affectionate,” “cheerful,” and “hopeful.” However, determining when to apply these tags in real-time interactions can be challenging and time-consuming. In this paper, we present a prompt-based approach that enables large language models (LLMs) to dynamically stylize their responses with appropriate SSML tags, enhancing synthesized speech expressiveness across 34 different styles. Using targeted probes designed to elicit specific speech styles, we demonstrate that LLM-generated responses are syntactically well-formed and correctly apply style tags to enhance expressiveness. This simple, customizable approach facilitates the rapid development of expressive vocal conversational agents.
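As a rough illustration of what "syntactically well-formed" and "correctly applied style tags" could mean in practice, the sketch below wraps a response in an Azure-style `mstts:express-as` element and then checks that the markup parses as XML and uses only permitted styles. It is not the authors' prompt or evaluation code. The helper names `wrap_in_ssml` and `check_ssml`, the voice name `en-US-JennyNeural`, and the three-item style list (taken from the examples in the abstract, not the full set of 34) are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespaces used by Azure-flavoured SSML documents.
SYNTHESIS_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"

# Assumption: only the three styles named in the abstract, not all 34.
ALLOWED_STYLES = {"affectionate", "cheerful", "hopeful"}


def wrap_in_ssml(text, style, voice="en-US-JennyNeural"):
    """Hypothetical helper: wrap plain text in a single styled SSML document."""
    return (
        f'<speak version="1.0" xmlns="{SYNTHESIS_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the text cannot break the markup
        f"</mstts:express-as></voice></speak>"
    )


def check_ssml(ssml, allowed_styles=ALLOWED_STYLES):
    """Hypothetical helper: return (ok, styles) for an SSML string.

    ok is True only if the string parses as XML, contains at least one
    express-as element, and every style attribute is in allowed_styles.
    """
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False, []
    styles = [el.get("style") for el in root.iter(f"{{{MSTTS_NS}}}express-as")]
    ok = bool(styles) and all(s in allowed_styles for s in styles)
    return ok, styles


if __name__ == "__main__":
    ssml = wrap_in_ssml("Great to see you & welcome back!", "cheerful")
    print(ssml)
    print(check_ssml(ssml))
```

A check of this kind could run on each LLM response before it is sent to the speech synthesizer, so that malformed markup or an unsupported style falls back to unstyled speech instead of causing a synthesis error.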
License
Copyright (c) 2025 Joseph Salisbury

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.