Akshit Srivastava
School of Computational Science and Artificial Intelligence
VIT Bhopal University, Bhopal
akshit0405@gmail.com
Abstract
This paper introduces A.I.Y.A (Automatic Inferential Yielding Agent), a 3D agent designed for interactive, embodied intelligence in a virtual environment. Unlike conventional agents limited to text-based interaction, A.I.Y.A moves the conversational agent into a 3-dimensional simulation, integrating perception and reasoning to provide a seamless 3D companion. A.I.Y.A combines a modular pipeline that enables real-time spoken interaction with users, demonstrating the capability to listen, understand, and respond in natural language while operating in a simulation. The paper describes the architecture of A.I.Y.A, evaluates its capabilities across interactive tasks, and discusses its potential for further tasks such as navigation, object manipulation, and real-time assistance.
Keywords:
Artificial Intelligence
Embodied AI
3D Conversational Agents
AI companion
Introduction
Embodied AI agents are systems instantiated in visual, virtual, or physical form, which allows them to learn from and interact with not just the user but also their physical or digital surroundings, perceiving and acting within the environment in a meaningful way (fung2025embodied). Conversational agents have a long history of assisting human tasks, evolving from call-center agents to assistants such as Siri, Google Assistant, and Alexa, and further to chatbots, with ChatGPT identified as the first LLM (Large Language Model) fine-tuned on conversational data with the ability to carry out efficient conversations with users while managing intuitive multi-tasking capabilities [1], [2].
Research has identified significant developments in multi-modal interfaces, computer graphics, and autonomous agents that have contributed to the development of virtual humans over the past few years. Previous studies have also described the architectural requirements for face-to-face conversation, which include components such as multi-modal input and output, a real-time system for tracking feedback and requests, and a conversational function model [3].
A.I.Y.A (Automatic Inferential Yielding Agent) introduces a transformative step in the domain of embodied AI agents: an effective conversational agent in a 3-dimensional virtual environment. By integrating a modular architecture that enables real-time spoken interaction, it transcends the boundaries of traditional conversational agents, moving beyond mere command toward communication and connection.
Related Works
Recent developments have seen significant interest in Embodied Conversational Agents (ECAs) that integrate multi-modal interaction and advanced capabilities within 3D simulated environments. Early ECA work introduced an architecture that automates interactive conversation with synchronized speech, gestures, and facial expressions, laying the blueprint for multi-modal system design with advancements in animation, dialogue management, and persona modeling [3]. Previous studies have also introduced a multilevel architecture for Embodied Conversational Agents with interactive virtual humans that behave like humans in simulation, where the avatars show advanced capabilities such as speech, gestures, emotions, and persona-specific behaviors, and adapt on the basis of the application; that architecture comprises three levels: a cognitive (thinking) level, a sensor-fusion level, and an environmental-simulator level [4].
Grok Ani is a notable example of a next-generation embodied AI agent, which stands out for its capacity to integrate real-time perception and a reasoning model with a 3D avatar that acts as a real-time companion, capable of exceptional conversational skill that blends an intuitive decision-making process with embodied social cues [5].
Methodology
The proposed system introduces a modular pipeline that integrates three modules: speech recognition, a language model, and 3D rendering with real-time control. The architecture adopts a client-server paradigm in which the back-end handles the voice-assistant tasks (speech recognition, the conversational module, and text-to-speech synthesis), while an intuitive Three.js-based front-end manages real-time rendering, animation, and user interaction.
The workflow begins with audio input from the user: the speech recognition module captures the audio and performs real-time transcription. The transcription is then processed by the Large Language Model (LLM) module to generate a context-based, coherent response, which is further synthesized into speech based on the prosodic and acoustic features extracted from a reference audio clip. The generated speech is transmitted through a low-latency WebSocket channel to the front-end, where the 3D avatar is animated in synchrony with volume-based information from the audio output. (Figure 1)
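The server-to-client hand-off described above can be sketched as a simple message-framing helper. The exact wire format (a 'type' field carrying tts_start / tts_audio / tts_end, with Base64-encoded audio payloads) is an illustrative assumption rather than A.I.Y.A's actual protocol:

```python
import base64
import json


def make_tts_messages(wav_bytes: bytes, chunk_size: int = 4096) -> list[str]:
    """Frame synthesized WAV audio as a sequence of JSON WebSocket messages.

    Emits a tts_start control message, one tts_audio message per chunk
    (Base64-encoded so it survives text transport), and a final tts_end
    message that the client uses to return the avatar to its idle state.
    """
    messages = [json.dumps({"type": "tts_start"})]
    for i in range(0, len(wav_bytes), chunk_size):
        chunk = wav_bytes[i:i + chunk_size]
        messages.append(json.dumps({
            "type": "tts_audio",
            "data": base64.b64encode(chunk).decode("ascii"),
        }))
    messages.append(json.dumps({"type": "tts_end"}))
    return messages
```

On the receiving side, the client would decode each tts_audio payload and feed it to its audio pipeline, switching animation states on the start/end signals.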
As shown in Fig. 1, the application can be expressed as a client-server architecture with the following components:
1. SERVER
1.1 Speech Recognition Module
The Speech Recognition Module is implemented using the Faster-Whisper model, an optimized version of OpenAI's Whisper, a transformer-based sequence-to-sequence (Seq2Seq) model designed for Automatic Speech Recognition (ASR) and related tasks. The Whisper model processes input audio at a sample rate of 16 kHz and converts it into log-Mel spectrograms, which are passed through a small convolutional stem to compress the temporal dimension. The encoder block processes the representation using multiple transformer blocks, and the decoder module generates text tokens conditioned on both the audio encoding and task-specific control tokens [6]. Faster-Whisper, as implemented in the proposed architecture, optimizes the Whisper model by utilizing the CTranslate2 engine, which provides highly efficient execution of transformer models. The optimization strategy involves mixed-precision computation, operation fusion, efficient memory allocation, and batched decoding strategies, which significantly reduce both the computational cost and the memory footprint.
The proposed module continuously receives an audio stream from the user, with each recording session initiated through a push-to-talk mechanism: the module uses the 'V' key as the push-to-talk key (PTT key) within the client interface, ensuring efficient input capture with minimal noise and preventing unintended speech from being processed. (Fig. 2)
The captured audio is then segmented and processed by the Faster-Whisper model to produce an efficient transcription with minimal latency. To maintain robustness in the multi-turn architecture, the transcription is temporarily stored in a buffer, which allows reliable retrieval and reduces the risk of data loss between consecutive inputs. Once a push-to-talk event concludes, the buffered transcription is forwarded to the Natural Language Understanding (NLU) module, powered by the system's Large Language Model (LLM) component, ensuring that each user input is processed as one coherent unit and thereby providing a foundation for subsequent conversation. (Fig. 2)
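The buffering behaviour around a push-to-talk session can be sketched as a small stdlib-only class. The class and method names here are illustrative assumptions, not identifiers from the A.I.Y.A code base:

```python
from collections import deque


class TranscriptionBuffer:
    """Accumulate partial transcriptions during one push-to-talk session.

    Segments arriving from the ASR model while the PTT key is held are
    buffered; releasing the key flushes them as a single coherent
    utterance for the NLU module, so nothing is lost between segments.
    """

    def __init__(self) -> None:
        self._segments: deque[str] = deque()

    def on_segment(self, text: str) -> None:
        # Called for each transcribed segment while the PTT key is held.
        text = text.strip()
        if text:
            self._segments.append(text)

    def on_ptt_release(self) -> str:
        # Concluding the PTT event forwards the joined transcription
        # downstream and clears the buffer for the next turn.
        utterance = " ".join(self._segments)
        self._segments.clear()
        return utterance
```

A usage pattern would be to call `on_segment` from the ASR callback and hand the return value of `on_ptt_release` to the ChatEngine.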
1.2 Large Language Model (LLM) Module
The Natural Language Understanding (NLU) Module utilizes a Large Language Model (LLM) as the core ChatEngine of the proposed architecture, providing context-aware interpretation of user input and generating coherent, human-like responses. The module uses OpenAI's GPT-OSS-20B model, deployed via the HuggingFace Inference API for efficient and scalable execution.
gpt-oss is a family of advanced open-weight reasoning models available under the Apache 2.0 license, with the ability to adjust the reasoning effort for tasks that do not require complex reasoning, while providing a customizable, full chain-of-thought and support for structured outputs [7]. The models are autoregressive Mixture-of-Experts (MoE) transformers built upon the architecture of GPT-2 and GPT-3; they use a residual stream dimension of 2880 and apply root-mean-square normalization before each attention and MoE block, gaining the benefits of Pre-LN placement. Each MoE block holds a fixed number of experts with a router that selects the top-4 experts per token and outputs a weighted combination using a softmax over the selected experts. The blocks also employ a gated SwiGLU activation, while the attention layers alternate between banded-window and fully dense attention patterns; each attention block uses 64 query heads grouped into 8 key-value heads, with rotary position embeddings that allow the models to process approximately up to 131k context tokens. Quantization is applied to the MoE weights to enable efficient deployment on standard hardware, which allows the models to operate efficiently despite their large parameter counts [7].
The NLU module introduces a ChatEngine that structures interaction as a multi-turn dialogue. The engine integrates the gpt-oss-20b model from HuggingFace through the langchain_huggingface interface and constructs a retrieval-augmented generation (RAG) pipeline that combines the HuggingFaceEndpoint for large-language-model inference with a Chroma vector store backed by "sentence-transformers/all-MiniLM-L6-v2" embeddings (wang2020minilm). This configuration allows the system to retrieve semantically relevant context from the persistent knowledge base "rag_store" and condition the model's output on both the retrieved documents and the conversational history. The context window for dialogue continuity is maintained through a persistent history buffer, which is serialized to a JSON file and reloaded at each session to preserve context across interactions. Transcriptions from the Speech Recognition Module are appended to the history, formatted as structured messages (System, User, Assistant), and passed to the model via the HuggingFace API. The generated response is then appended back to the history buffer to ensure that the agent maintains dialogue coherence while supporting low-latency, context-aware conversational responses. (Fig. 3)
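The persistent history buffer described above can be sketched with the standard library alone. The JSON file layout and class interface are assumptions for illustration; the actual ChatEngine manages its history through LangChain:

```python
import json
from pathlib import Path


class HistoryBuffer:
    """Persist structured chat messages (System, User, Assistant) as JSON.

    On construction the buffer reloads any existing history file so that
    context survives across sessions; each appended turn is immediately
    serialized back to disk.
    """

    def __init__(self, path: str, system_prompt: str) -> None:
        self.path = Path(path)
        if self.path.exists():
            self.messages = json.loads(self.path.read_text())
        else:
            self.messages = [{"role": "system", "content": system_prompt}]

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        # Serialize after every turn so a crash or restart loses nothing.
        self.path.write_text(json.dumps(self.messages, indent=2))
```

In the pipeline, the transcription would be appended as a "user" message before inference and the model output as an "assistant" message afterwards.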
1.3 Voice Synthesis Module
The Voice Synthesis Module transforms the textual response from the Natural Language Understanding (NLU) unit into high-fidelity speech. The module is implemented using the GPT-SoVITS model, an open-source framework released in February 2024 that applies few-shot learning to provide high-quality text-to-speech (TTS) and voice cloning with minimal data requirements.
1.3.1 Text Processing:
The text processing module converts raw text into a meaningful phonemic representation by transforming the text encoding into its phonemic encoding, where a BERT model encodes this phonemic information and processes it into the desired output tensor. [8]
1.3.2 Audio Processing:
The Audio Processing tool handles the reference audio required for voice cloning. The tool utilizes the CN_HuBERT encoder model to extract information from the reference audio, capturing the acoustic characteristics required to maintain the identity of the speaker while preserving audio quality. [8]
1.3.3 VALL-E component:
The VALL-E component, developed by Microsoft Research, is a sequence-to-sequence component that acts as a neural codec language model trained on discrete codes derived from an off-the-shelf neural audio codec model. VALL-E treats text-to-speech as a conditional language-modelling task rather than continuous signal regression, enabling in-context learning capabilities to synthesize high-quality personalized speech while preserving the acoustic environment of the prompt in the synthesis. [8], [9]
1.3.4 VITS-decoder:
VITS, or Variational Inference with adversarial learning for end-to-end Text-To-Speech, is a model widely recognized for speech synthesis that simplifies traditional TTS systems by utilizing deep-learning techniques such as GANs, Variational Autoencoders (VAEs), and normalizing flows. The model reconstructs the speech waveform from the acoustic tokens using the SoftVC-VITS decoder and is further optimized to produce high-fidelity audio in real time with low latency. [8], [10]
The Voice Synthesis module implements a sovitsTTS engine that acts as a wrapper for the GPT-SoVITS inference server and provides a text-to-speech interface for the interactive model, focusing on low-latency inference, in-memory audio handling, and automatic silence trimming. The engine implements a 'generate_audio_in_memory()' function that acts as the primary interface: it packages the input text from the ChatEngine along with configuration parameters such as language and the reference audio path, and pushes it to the GPT-SoVITS inference server through a RESTful API call using 'requests.post', where the server returns raw 16 kHz WAV data corresponding to the synthesized speech. The byte stream received from the server is automatically passed to an internal method 'trim_leading_silence_in_memory()', which employs pydub to detect and remove the leading silence from the audio stream using a configurable threshold and a chunk size of 19 ms. The processed audio is then re-encoded as WAV bytes and returned to the pipeline, enabling the downstream modules (WebSockets) to immediately stream the audio to the client.
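The chunk-wise silence-trimming step can be illustrated with a stdlib-only stand-in for the pydub-based method described above, operating directly on 16-bit mono PCM samples. The threshold and 10 ms chunk size here are illustrative values, not the module's actual configuration:

```python
import struct


def trim_leading_silence(pcm: bytes, threshold: int = 500,
                         chunk_ms: int = 10,
                         sample_rate: int = 16000) -> bytes:
    """Drop leading silent chunks from 16-bit little-endian mono PCM.

    Scans the stream chunk by chunk; a chunk counts as silent when no
    sample reaches the amplitude threshold. Trimming stops at the first
    non-silent chunk, mirroring pydub's leading-silence detection.
    """
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes/sample
    offset = 0
    while offset < len(pcm):
        chunk = pcm[offset:offset + bytes_per_chunk]
        n = len(chunk) // 2
        samples = struct.unpack(f"<{n}h", chunk[:n * 2])
        if samples and max(abs(s) for s in samples) >= threshold:
            break  # first audible chunk: keep everything from here on
        offset += bytes_per_chunk
    return pcm[offset:]
```

In the real module the trimmed bytes would then be re-wrapped in a WAV header before being streamed to the client.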
2. CLIENT
The client side of the application provides an interactive, real-time interface that allows end users to experience the embodied conversational agent. The client is a web-based application that loads the 3D avatar and its animations and integrates them with the low-latency communication pipeline offered by the server. The application ensures that the facial expressions and the speech generated by the server are visually synchronized and perceptible to the user.
2.1 3D Avatars: Modeling and Integrating the 3D model
The primary version of the 3D avatar was developed using VRoid Studio, a software application specifically designed for the procedural development of 3D avatars with a hierarchical bone structure. The tool allows end users to expedite the complex process from 3D modeling to rigging by splitting model generation into three key stages:
1. Procedural Generation: VRoid Studio allows end users to define the foundational facial features, hair geometry, and body proportions using various parametric controls.
2. Material and Texture: The studio helps users define custom texture maps for various components of the model, including irises, skin, and clothing; these texture maps are applied to the model's material channels to achieve an appealing visual style.
3. Asset Export: The final character is exported as a single .VRM file, which encapsulates all the necessary geometric data, materials, and the pre-rigged humanoid skeleton.
VRM (Virtual Reality Model) is an open-standard file format used for humanoids in virtual-reality and real-time contexts. The format offers an efficient and easy-to-implement process for 3D avatar modelling by packaging the mesh structures, texture packs, and humanoid bone hierarchy within a single file, streamlining asset integration and ensuring compatibility with animation re-targeting systems. The exported .VRM model is post-processed and optimized using Blender, an open-source 3D computer graphics application. The post-processing ensures compatibility with real-time web rendering through custom rigging adjustments that improve rendering performance without compromising visual fidelity. Character motion for both the idle and talking states is incorporated from the Adobe Mixamo library, which provides a large repository of pre-rigged motion-capture data. These animations were re-targeted and baked into the VRM skeleton within Blender to enable smooth playback and transitions between states during real-time rendering in the Three.js environment. (Fig. 5)
2.2 Web Based Client Architecture
The client is implemented as a modular, event-driven system that supports low-latency interaction while maintaining high rendering performance. It is a single-page application developed using HTML5, CSS3, and JavaScript (ES6 modules), with rendering built upon the Three.js library, a high-level API that abstracts the complexities of the browser's WebGL API. [11]
The architecture is mainly composed of five functional sub-processes:
2.2.1 Three.js Rendering Engine
The Three.js rendering engine handles the creation and management of the 3D scene, including the 'Scene', 'PerspectiveCamera', and 'WebGLRenderer'. The engine configures the global illumination, environment lighting, and user camera controls via 'OrbitControls'.
2.2.2 VRM Model Integration
The '@pixiv/three-vrm' library was utilized to load and integrate the .VRM model into the scene. The library effectively interprets the avatar's bone-rig structure and provides direct access to the avatar's blend shapes, allowing effective facial-expression and lip-sync manipulation.
2.2.3 WebSocket Communication
A persistent WebSocket connection is initialized and maintained with the server to enable real-time data exchange; audio data, state signals (tts_start and tts_end), and control commands are transmitted through this channel to drive speech playback and the animations.
2.2.4 Animation and Re-targeting System
The THREE.AnimationMixer manages the playback of the character animations, and a client-side re-targeting module remaps the animation data from the .fbx format onto VRM's standardized bone structure. (Fig. 6b)
2.2.5 Real-Time Audio Processing
The client utilizes the native Web Audio API to decode and play the audio received from the server through the WebSocket. Furthermore, the API's AnalyserNode processes the audio volume in real time to drive the avatar's lip-sync animation.
2.3 Real-Time Synchronization and Animation Control
The proposed interactive agent incorporates efficient lip-synchronization and body animations corresponding to the generated speech. An event-driven pipeline connects the WebSocket interface with the animation and rendering subsystems. (Fig. 6a)
1. Signal and Data Reception: The client's WebSocket listener receives the Base64-encoded audio packets and the control signals (tts_start and tts_end).
2. Audio Processing and Playback: The audio received over the WebSocket is decoded by Web Audio's decodeAudioData method into an AudioBuffer and scheduled for playback through an 'AudioBufferSourceNode'.
3. Real-time Viseme Generation: During playback, the AnalyserNode calculates the audio signal's amplitude at each rendered frame. The value is normalized and mapped to the VRM model's "aa" blendshape, enabling real-time mouth movements that correspond to the speech volume.
4. State-Based Body Animation: The control signals trigger animation-state changes: the tts_start event initiates a transition from the idle animation to a talking animation, whereas tts_end prompts a smooth return to the idle state, ensuring that the avatar's body language reflects its speaking status.
Figure 6: (a) Real-time audio processing and lip-synchronization workflow for the client-side application. (b) Animation re-targeting workflow which demonstrates the mapping of FBX animation joints to VRM humanoid rig through the lookup dictionary.
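The amplitude-to-blendshape mapping in step 3 can be sketched language-agnostically. The RMS measure, the gain value, and the clamping to [0, 1] are illustrative assumptions that mirror what AnalyserNode-driven client code would compute each frame:

```python
import math


def viseme_weight(samples: list[float], gain: float = 8.0) -> float:
    """Map one frame of audio samples (floats in [-1, 1]) to a weight
    in [0, 1] for the VRM 'aa' blendshape.

    Computes the frame's RMS amplitude, scales it by a fixed gain, and
    clamps the result so the mouth opening tracks speech volume.
    """
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return max(0.0, min(1.0, rms * gain))
```

Each rendered frame, the client would apply this weight to the blendshape before updating the Three.js scene, producing volume-driven mouth movement.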
Results and Discussions
The proposed conversational agent was implemented and evaluated on a mid-to-high-performance mobile workstation to ensure real-time operation and efficient user interaction. The experimentation and evaluation were executed on a machine running Ubuntu 24.04.3 LTS (64-bit) with the following hardware configuration:
Processor: AMD Ryzen™ 7 7735HS (8 physical cores / 16 threads, up to 4.83 GHz with frequency boost enabled)
GPU: NVIDIA GeForce RTX™ 4050 Laptop GPU, 6GB dedicated VRAM
Architecture: x86_64, 48-bit physical/virtual addressing
This configuration provides a balanced computational environment with the benefits of a multi-threaded CPU and GPU-accelerated rendering, which helps handle the simultaneous tasks of speech recognition, LLM inference, and high-fidelity voice synthesis along with real-time rendering of the 3D avatar.
5.1 Pipeline Latency Analysis
Temporal profiling was applied to the three core modules (Speech Recognition, Natural Language Understanding, and Voice Synthesis) that form the backend of the proposed system, along with the overall round trip from user input to generated audio output.
The results in Table 1 present the performance metrics for each module. The Speech Recognition Module dominates the total latency, accounting for 46.9% of the overall processing time, which aligns with the expected cost of a model configured for high transcription accuracy. The results also show competitive performance for the LLM inference, which delivers a contextually coherent response in roughly 9 s, whereas the GPT-SoVITS voice-synthesis model generates natural speech in ~12 s with minimal training data.
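The per-module latency shares quoted above follow directly from the Table 1 timings; a small helper makes the computation explicit (the metric labels are illustrative, and the shares here are computed over the three module times, which sum slightly below the reported end-to-end total):

```python
def latency_shares(timings: dict[str, float]) -> dict[str, float]:
    """Compute each module's share of the summed latency, in percent.

    Given per-module mean processing times in seconds, returns the
    fraction of the pipeline each module accounts for, which makes the
    dominant bottleneck easy to identify.
    """
    total = sum(timings.values())
    return {name: 100.0 * t / total for name, t in timings.items()}
```

Applied to the Table 1 values (STT 18.88 s, LLM 9.21 s, TTS 12.06 s), the STT share comes out at roughly 47%, matching the bottleneck analysis.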
5.2 Client-Side Performance
The web-based 3D avatar was rendered using Three.js at a stable average frame rate of 144 FPS, measured during continuous interaction with the agent. This high refresh rate yields smooth character motion and facial animations. WebSocket performance testing revealed that the end-to-end round-trip latency varied between 1-3 ms, confirming the negligible overhead of network communication between the client and the server.
5.3 The System Efficiency
The server performance (Table 1) and the client-side performance of the proposed system yield several key findings:
1. Real-Time Feasibility: The system consistently delivers the end-to-end response within ~40 s, which is not instantaneous but acceptable for semi-real-time conversation settings, where high-fidelity voice synthesis and efficient avatar animation have been prioritized over ultra-low latency.
2. Hardware Utilization: The AMD Ryzen™ 7 7735HS processor, with its 8-core/16-thread architecture and 4.83 GHz boost frequency, supplied sufficient parallelism to execute speech recognition, language understanding, and TTS concurrently. The NVIDIA GeForce RTX 4050 GPU contributed significantly to the performance of Faster-Whisper and GPT-SoVITS and further accelerated the 3D rendering and animation, enabling 144 FPS client-side performance.
3. Component Bottleneck: The speech-to-text module was identified as the dominant bottleneck, contributing almost 47% of the total end-to-end time. Future work may explore streaming-based automatic speech recognition (ASR) models and different optimization strategies to reduce the transcription time and thereby improve overall responsiveness.
4. Scalability and Deployment: The WebSocket communication allowed low-latency (1-3 ms) full-duplex communication, while the lightweight VRM avatar gives the agent high compatibility across various devices. Efficient hardware scaling can be leveraged for further optimization of both client and server responsibilities to achieve better real-time interactivity.
Table 1
Average processing time of each module (STT, LLM and TTS) and total round-trip latency measured on the evaluation system.
| Metric | Mean Processing Time (s) |
|---|---|
| Speech-To-Text (STT) | 18.88 |
| LLM Inference | 9.21 |
| Text-To-Speech (TTS) | 12.06 |
| Total End-to-End Time | 40.18 |
Conclusion
The proposed 3-dimensional conversational agent demonstrates high-fidelity, multi-modal interaction on a mid-to-high-performance mobile workstation, effectively utilizing CPU parallelism and GPU acceleration. The system integrates three core functionalities (speech recognition, Large Language Model (LLM) inference, and high-quality voice synthesis) with real-time 3D avatar rendering, allowing a fluid and engaging user experience, and the proposed methodology maintains stable client-side performance even under computationally intensive tasks. The speech-to-text function was observed to contribute significantly to the processing time (~47%); however, the overall architecture shows promising performance for semi-real-time conversational settings, balancing responsiveness with high-quality visual and audio output. The WebSocket communication and the lightweight humanoid avatar highlight the scalable and compatible nature of the architecture across different devices, suggesting that further optimizations can support increasingly complex and immersive human-computer interactions.
Future Works
Future improvements to the proposed 3-dimensional agent will focus on strategies that enhance performance, adaptability, and the user experience. Performance can be improved significantly by optimizing the speech-to-text functionality through algorithmic strategies, streaming ASR models, GPU-accelerated inference, and other techniques that reduce overall latency and enable faster, more responsive interactions. Future studies will also focus on extending the model with vision capabilities, enabling it to interact and react based on the user's environment, gestures, and objects, thereby providing better context for dialogue and interaction. Such enhanced environmental awareness is expected to let the system adapt the avatar's behavior and response strategies to different physical or virtual spaces. Finally, improvements will focus on developing an effective user-recognition module that builds on the vision capabilities to enhance personalization through identification of individual users, engagement tracking, and personalized responses, offering highly immersive and adaptive interactions.