Presented by Justin Uberti on June 07, 2024 at AI Tinkerers Seattle - June 2024
Abstract
Ultravox is a new multimodal LLM that can understand speech directly; unlike current voice AI stacks, it does not require a separate speech recognition stage. This approach makes voice AI applications faster and more robust, and allows them to understand the non-textual parts of speech.
It builds on a Llama 3 backbone, which means it can be trained much faster than a typical foundation model. We've just open-sourced Ultravox at https://ultravox.ai and are working on growing a community around it.
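To make the architecture concrete, here is a minimal, purely illustrative sketch of how a speech LLM of this kind works: a pretrained audio encoder produces frame-level features, and a small trained adapter stacks and projects those frames into the LLM's embedding space, so audio "tokens" flow into the Llama backbone in place of text embeddings. All names, dimensions, and the padding-based projection below are hypothetical stand-ins, not the actual Ultravox code.

```python
# Illustrative sketch of a direct-speech LLM pipeline (hypothetical, not Ultravox's code):
# audio encoder -> adapter (stack + project) -> embeddings fed straight into the LLM.
import random

D_AUDIO = 4   # hypothetical audio-encoder feature size
D_MODEL = 8   # hypothetical LLM embedding size
STACK = 2     # stack adjacent frames to shorten the sequence

def encode_audio(num_frames):
    """Stand-in for a pretrained audio encoder's frame-level features."""
    return [[random.random() for _ in range(D_AUDIO)] for _ in range(num_frames)]

def project(frames):
    """Stand-in for the trained adapter: stack frames, map to LLM dims.

    A real adapter is a learned linear layer or MLP; here we just
    concatenate adjacent frames and pad/truncate to D_MODEL.
    """
    stacked = [sum(frames[i:i + STACK], []) for i in range(0, len(frames), STACK)]
    return [(row + [0.0] * D_MODEL)[:D_MODEL] for row in stacked]

audio = encode_audio(6)      # 6 audio frames from the encoder
embeds = project(audio)      # 3 audio "tokens" the LLM consumes directly
print(len(embeds), len(embeds[0]))  # 3 8
```

Because the LLM consumes these embeddings directly, there is no intermediate transcript, which is what removes the separate speech-recognition stage from the stack.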
Justification
With the announcement of GPT-4o, there has been a spotlight on speech LLMs. Ultravox shows that there's a path to supporting the same sort of functionality with open source models. Accordingly, this talk will be useful for people building voice AI applications or interested in pushing open source AI forward. The talk will include a brief discussion of multimodality, an overview of the Ultravox architecture, a basic API walkthrough, and finally an end-to-end demo.