In addition to upgrades powered by generative AI and the ability to continue conversations without repeating the wake word “Alexa,” Amazon’s voice assistant is gaining more natural-sounding voices. The company today introduced its latest text-to-speech engine, which is more contextually aware of a user’s emotions and tone of voice, allowing Alexa to mirror those emotions in its responses.
The company demonstrated a new voice that makes Alexa sound less robotic and more expressive. It is powered by large transformers trained on a variety of languages and accents, the company noted.
For example, if a customer asks for an update on their favorite sports team and the team has just won, Alexa can respond in a joyful voice; if the team lost, Alexa responds in a more empathetic tone.
“And we are working on a new model — what we call speech-to-speech — that again has a large transformer. Instead of first using speech recognition to turn the customer’s voice request into text, using an LLM to generate text responses or actions, and then using text-to-speech to generate speech, this new model integrates these tasks to create a richer conversational experience,” said Rohit Prasad, SVP of Alexa.
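The two architectures Prasad contrasts can be sketched schematically. The following is a minimal illustrative sketch, not Amazon's implementation: every function name is a hypothetical stand-in, and the "models" are stubs that show only how the cascaded pipeline flattens the signal into plain text between stages, while an integrated model maps speech to speech in one step.

```python
def recognize_speech(audio: bytes) -> str:
    """Stage 1 of the cascaded pipeline: speech recognition (audio -> text).
    Stand-in: pretend the audio decodes to this request."""
    return "how did my team do last night?"

def generate_response(text: str) -> str:
    """Stage 2: an LLM turns the request text into a response text."""
    return "They won 3-1!"

def synthesize_speech(text: str, emotion: str = "neutral") -> str:
    """Stage 3: text-to-speech; here we just tag the output with an emotion."""
    return f"[{emotion}] {text}"

def cascaded_pipeline(audio: bytes) -> str:
    """ASR -> LLM -> TTS: three separate models with text in the middle,
    so any emotional cues must be re-attached explicitly at the TTS stage."""
    request = recognize_speech(audio)
    response = generate_response(request)
    return synthesize_speech(response, emotion="joyful")

def speech_to_speech(audio: bytes) -> str:
    """Stand-in for a single integrated transformer that maps the customer's
    speech directly to expressive speech, without an intermediate text hop."""
    return "[joyful] They won 3-1!"
```

Both stand-ins produce the same tagged output here; the point of the sketch is only the difference in structure, with the integrated model avoiding the intermediate text representation where tone-of-voice information would otherwise be lost.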
Amazon says Alexa can also produce conversational cues that encourage users to keep talking, such as laughter, surprise, and even an “eh.”
This all leverages Amazon’s Large Text-to-Speech (LTTS) and Speech-to-Speech (S2S) technology. The former lets Alexa adapt its responses based on text input, such as a user’s request or the topic being discussed, while the latter layers voice input on top of that text, which Amazon says will let Alexa pick up on vocal cues and create richer conversations.
Correction, as of September 20, 2023, 12:28 PM: The new engine is called “speech-to-speech” instead of “text-to-speech.” The article has been updated to reflect this.