LIA: The Multilingual 3D Avatar is an innovative project developed by ZAKA's talented AI Certification students, Sary Mallak and Albertino Kreiker, in collaboration with the ZAKA AI Development team, showcasing our expertise in real-time multilingual interaction. This project redefines what is possible with virtual assistants. With its ability to seamlessly handle Arabic, English, and French, LIA is not just a technological marvel but a practical tool poised to transform industries like customer service, education, and tourism. Stay tuned as we continue to push the boundaries of AI innovation! 🚀
Imagine a digital avatar that speaks to you in your native language. In customer service, this avatar could handle inquiries in multiple languages, delivering responses that feel truly personal. In education, it could serve as a virtual tutor, effortlessly switching languages to reach a broader audience.
Our team set out to make this a reality by developing a 3D avatar capable of engaging in real-time multilingual communication, with seamless speech-to-text, text generation, and lip synchronization across languages. By harnessing advanced language models and speech synthesis tools, this project aims to break down language barriers in real-time and enhance user experiences across industries.
Problem Statement
Our main challenge was designing a 3D avatar with real-time speech recognition, text generation, and synchronized lip movements in Arabic, English, and French. Traditional speech-to-text and text-to-speech systems focus on individual languages, making it difficult to build a fluid, multilingual experience.
Arabic, in particular, introduced complexities due to its unique script and linguistic structure, requiring highly accurate transcription and lip sync. To create a distinct, recognizable voice for the avatar without cloning techniques, we customized a text-to-speech model to bring the avatar’s persona to life. Finally, we needed to optimize every part of the workflow to ensure low latency for responsive interactions, even with complex language models running simultaneously.
Tools and Technologies
In developing a multilingual 3D avatar capable of real-time speech interaction, we relied on several advanced technologies, each selected for its capabilities in speech processing, text generation, and animation. Here is an overview of these tools and the strengths they bring to such a project.
1. Whisper for Speech Recognition
Whisper is an automatic speech recognition (ASR) model developed by OpenAI, known for its robust multilingual capabilities. Trained on a wide range of languages and dialects, Whisper’s architecture features an encoder-decoder design with attention mechanisms that allow it to handle diverse linguistic inputs with high accuracy.
- Multilingual Support: Whisper can detect and transcribe multiple languages automatically, making it suitable for multilingual applications without the need for manual language selection.
- Noise Robustness: The model is trained on noisy data, enhancing its performance in real-world environments with background noise.
- Multiple Model Sizes: Whisper is released in several sizes, from tiny to large, allowing users to select a checkpoint that balances speed and accuracy based on their needs.
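To make this concrete, here is a minimal sketch of how Whisper can be called from Python using the open-source openai-whisper package; the model size and audio file name are placeholders, not the project's actual configuration.

```python
# pip install openai-whisper  (also requires ffmpeg for audio decoding)
import whisper

# Load a mid-sized checkpoint; smaller models are faster, larger ones more accurate.
model = whisper.load_model("small")

# Transcribe a recorded utterance; Whisper detects the spoken language automatically,
# so no manual language selection is needed.
result = model.transcribe("user_utterance.wav")

print("Detected language:", result["language"])   # e.g. "ar", "en", or "fr"
print("Transcript:", result["text"])
```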
2. LLaMA for Text Generation
LLaMA (Large Language Model Meta AI) is a transformer-based language model designed to generate human-like text responses. Created by Meta AI, LLaMA’s large-scale training across various topics makes it effective in understanding and generating coherent, contextually accurate responses.
- Versatility and Customization: LLaMA can be fine-tuned on domain-specific datasets, allowing users to create custom conversational experiences that align with specialized knowledge or cultural nuances.
- Advanced Text Processing: It handles complex text generation tasks, including answering questions, summarizing content, and holding extended dialogues, making it suitable for a wide range of applications.
- Multiple Language Support: With native support for several languages, including Arabic, LLaMA is designed to accommodate multilingual projects.
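As a minimal sketch of how a LLaMA-family model might be queried through the Hugging Face transformers library, the snippet below formats a chat prompt and generates a reply. The checkpoint name is illustrative only; the project's fine-tuned weights are not public, and LLaMA checkpoints on the Hub are gated.

```python
# pip install transformers accelerate torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; substitute the fine-tuned model actually used by the project.
model_id = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are LIA, a multilingual assistant. Answer in the user's language."},
    {"role": "user", "content": "ما هي أشهر المعالم السياحية في لبنان؟"},  # "What are Lebanon's most famous landmarks?"
]

# Format the chat with the model's template, then generate a reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```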
3. gTTS (Google Text-to-Speech) for Speech Synthesis
Google Text-to-Speech (gTTS) is a Python library that interfaces with Google Translate's text-to-speech API, providing voice synthesis in multiple languages. Because the synthesis runs on Google's servers, it is a lightweight, widely used solution for applications requiring real-time voice output.
- Multilingual Voice Output: gTTS supports multiple languages and dialects, providing flexibility for diverse linguistic needs.
- Efficiency: Known for its fast processing speeds, gTTS is well-suited for applications that require real-time or near-instantaneous voice responses.
- Natural Voice Quality: Despite its speed, gTTS maintains clear, natural-sounding voice synthesis, which is critical for applications focused on user engagement and immersion.
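For illustration, generating an audio reply with the gTTS package takes only a few lines; the text and output file name below are placeholders.

```python
# pip install gTTS
from gtts import gTTS

reply_text = "مرحباً! كيف يمكنني مساعدتك اليوم؟"   # "Hello! How can I help you today?"

# lang selects the synthesis language; gTTS also supports "en" and "fr".
tts = gTTS(text=reply_text, lang="ar")
tts.save("lia_reply.mp3")   # the avatar front end can then play this file
```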
4. Rhubarb Lip Sync for Phoneme-Viseme Mapping
Rhubarb Lip Sync is an open-source tool for creating lip-sync animations from audio files. It is commonly used to enhance animated characters with synchronized mouth movements based on spoken content.
- Phoneme Detection: Rhubarb processes audio files to identify phonemes, the distinct units of sound in speech, which can then be mapped to corresponding mouth shapes (visemes) for animation.
- Animation Compatibility: The tool is compatible with various 3D animation frameworks, making it ideal for projects that require precise lip-syncing for animated characters.
- Structured Output: Rhubarb exports its timed phoneme/viseme data in simple formats such as JSON and TSV, enabling seamless integration into animation workflows.
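One way to drive Rhubarb from Python is to shell out to its command-line tool and read back the JSON it produces, roughly as sketched below. The flags follow Rhubarb's documented CLI, and the file names are placeholders.

```python
import json
import subprocess

# Run Rhubarb on a synthesized audio file and export timed mouth shapes as JSON.
# Rhubarb expects WAV or Ogg input, so an MP3 from gTTS would first be converted.
subprocess.run(
    ["rhubarb", "-f", "json", "-o", "lia_reply.json", "lia_reply.wav"],
    check=True,
)

with open("lia_reply.json", encoding="utf-8") as f:
    cues = json.load(f)["mouthCues"]

# Each cue gives a start time, an end time, and a mouth shape such as "A"-"H" or "X".
for cue in cues:
    print(f'{cue["start"]:.2f}s - {cue["end"]:.2f}s: {cue["value"]}')
```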
5. Backend and Deployment Technologies
To enable efficient data handling and real-time interaction, a combination of backend technologies and deployment tools was employed (a minimal sketch of the request flow follows the list below):
- Flask: A lightweight and versatile web framework, Flask is commonly used for creating APIs and managing backend services in real-time applications.
- ngrok: ngrok allows developers to expose local servers to the internet through secure tunneling, facilitating remote access and collaboration during development phases.
- React, Ready Player Me, and Mixamo: For animation and front-end design, tools like React, Ready Player Me, and Mixamo provide resources for creating and customizing 3D avatars, adding dynamic visual elements that enhance interactivity and user engagement.
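To show how these pieces could fit together, here is a minimal Flask sketch of the request flow: audio arrives from the front end, is transcribed, answered, synthesized back to speech, and annotated with lip-sync cues. The helper functions are hypothetical stand-ins for the components described above, not the project's actual code.

```python
# pip install flask
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-ins; in the real system these would wrap Whisper,
# the fine-tuned LLaMA model, gTTS, and Rhubarb respectively.
def transcribe(audio_path):
    return "مرحبا", "ar"

def generate_reply(text, language):
    return "أهلاً وسهلاً! كيف يمكنني مساعدتك؟"

def synthesize(reply, language):
    return "/static/lia_reply.mp3"

def lip_sync(audio_path):
    return [{"start": 0.0, "end": 0.2, "value": "A"}]

@app.route("/chat", methods=["POST"])
def chat():
    # 1. Receive the user's recorded audio from the React front end.
    audio_path = "incoming.wav"
    request.files["audio"].save(audio_path)

    # 2.-5. Transcribe, generate a reply, synthesize speech, and compute lip-sync cues.
    text, language = transcribe(audio_path)
    reply = generate_reply(text, language)
    audio_url = synthesize(reply, language)
    mouth_cues = lip_sync(audio_url)

    # 6. Return everything the avatar needs to speak and animate.
    return jsonify({"reply": reply, "audio": audio_url, "mouthCues": mouth_cues})

if __name__ == "__main__":
    app.run(port=5000)  # exposed via ngrok during development
```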
Methodology
Step 1: Speech Recognition
To enable real-time language understanding, we started with Wav2Vec2, a self-supervised speech recognition model developed by Meta AI that learns speech representations directly from raw audio. Wav2Vec2 achieved a word error rate (WER) of 25.81%. To improve accuracy across Arabic, English, and French, we upgraded to the Whisper model, which lowered the WER to 15.1%. Whisper's multilingual capabilities and automatic language detection also streamlined the user experience by eliminating the need for manual language selection.
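For context, word error rate compares a model's transcript against a reference transcript. It can be computed with a small utility such as jiwer, as in the sketch below; the example sentences are placeholders, not the project's test set.

```python
# pip install jiwer
from jiwer import wer

# Ground-truth transcripts versus the ASR model's output (placeholder examples).
references = ["مرحبا كيف حالك", "bonjour tout le monde"]
hypotheses = ["مرحبا كيف حالكم", "bonjour tous le monde"]

# WER = (substitutions + deletions + insertions) / number of reference words.
error_rate = wer(references, hypotheses)
print(f"WER: {error_rate:.2%}")
```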
Step 2: Text Generation
For text generation, we initially used LLaMA 3.1, fine-tuning it with datasets such as Arabica_QA and TyDi QA, as well as custom datasets related to Lebanese culture. When LLaMA 3.2 launched, we transitioned to this updated model for better native Arabic support. Our custom datasets ensured culturally accurate responses, especially when handling Arabic input, while retrieval-augmented generation (RAG) added depth to answers by pulling in relevant information.
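As a rough sketch of the retrieval-augmented generation step, the snippet below embeds a handful of reference passages, retrieves the one closest to the user's question, and prepends it to the prompt passed to the language model. The embedding model and passages are illustrative, not the project's actual knowledge base.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Illustrative knowledge base; the project used datasets about Lebanese culture.
passages = [
    "Baalbek is home to some of the best-preserved Roman temples in the world.",
    "The Jeita Grotto is a system of limestone caves north of Beirut.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
passage_embeddings = embedder.encode(passages, convert_to_tensor=True)

question = "What can I visit in Baalbek?"
question_embedding = embedder.encode(question, convert_to_tensor=True)

# Pick the passage most similar to the question and prepend it to the prompt.
scores = util.cos_sim(question_embedding, passage_embeddings)[0]
best_passage = passages[int(scores.argmax())]

prompt = f"Context: {best_passage}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this augmented prompt is what would be handed to the LLaMA model
```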
Step 3: Text-to-Speech Synthesis
After testing several models, we chose Google Text-to-Speech (gTTS) for its fast, resource-efficient performance. This allowed us to meet real-time demands, an essential component for the fluid, interactive experience with our avatar. Our custom avatar, named LIA (Lebanese Information Assistant), uses a female voice from gTTS to create a friendly, welcoming persona.
Step 4: Lip Synchronization
We integrated Rhubarb to synchronize lip movements with speech. Rhubarb translates audio into phonemes, mapping them to specific mouth shapes (visemes) for realistic, real-time lip syncing. By pairing these animations with LIA’s spoken output, we created an immersive, human-like interaction for users.
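To illustrate how these cues drive the avatar, the sketch below converts Rhubarb's timed mouth shapes into a simple viseme timeline that the React front end could play back alongside the audio. The shape-to-viseme mapping shown is illustrative and depends on the avatar rig's morph targets.

```python
import json

# Illustrative mapping from Rhubarb's mouth shapes (A-H, X) to viseme names
# understood by the avatar's morph targets; the exact names depend on the rig.
SHAPE_TO_VISEME = {
    "A": "viseme_PP", "B": "viseme_kk", "C": "viseme_E", "D": "viseme_aa",
    "E": "viseme_O",  "F": "viseme_U",  "G": "viseme_FF", "H": "viseme_nn",
    "X": "viseme_sil",
}

def cues_to_keyframes(rhubarb_json_path):
    """Turn Rhubarb's mouthCues into (time, viseme) keyframes for the front end."""
    with open(rhubarb_json_path, encoding="utf-8") as f:
        cues = json.load(f)["mouthCues"]
    return [{"time": cue["start"], "viseme": SHAPE_TO_VISEME[cue["value"]]} for cue in cues]

if __name__ == "__main__":
    for frame in cues_to_keyframes("lia_reply.json"):
        print(frame)
```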
Results and Findings
Our final implementation met key project goals in real-time multilingual interaction, delivering marked improvements across speech recognition, text generation, and lip synchronization.
- Speech Recognition: By switching from Buckwalter transliteration to native Arabic script and upgrading to Whisper, we reduced the WER from 25.81% to 15.1%.
- Text Generation: Fine-tuning LLaMA 3.1 and later 3.2 improved the avatar’s cultural understanding, with custom datasets enhancing responses related to Lebanese culture.
- Text-to-Speech: gTTS enabled rapid, real-time responses with an optimized speech synthesis pipeline, delivering a seamless experience without lag.
- Lip Synchronization: Rhubarb allowed precise mapping between phonemes and visemes, resulting in realistic, real-time lip movements synced to spoken words. This feature greatly enhanced user immersion.
Below is a video playlist showcasing the project's final results.
Final Thoughts
This project demonstrates the potential of real-time, multilingual avatars across various fields. Our avatar, LIA, could serve as a virtual assistant in customer service, a multilingual tutor in educational settings, or an interactive guide in tourism. The system’s adaptability enables it to handle personalized interactions across Arabic, English, and French with natural language responses and accurate lip sync. With future enhancements, LIA could be deployed on more powerful GPUs, reducing latency further and allowing for broader applications in high-demand environments.