BHASHABLEND: Bridging Transcription and Translation for Multilingual Video Content
Ayush Tripathi (aytripathi9@gmail.com)
Vanshika Yadav (raovanshika2004@gmail.com)
Tanishq Chauhan (tanishqchauhan767@gmail.com)
Ali Imam Abidi* [0000-0002-7420-0027] (aliabidi4685@gmail.com)
Department of Computer Science & Engineering, Sharda School of Engineering & Technology, Sharda University, Greater Noida, Uttar Pradesh, India
* Corresponding author
Abstract
Translating video content into multiple languages is feasible with existing solutions but remains challenging. This work presents an advanced system aimed at improving the quality and accessibility of multilingual video translation. The proposed method extracts audio from video, transcribes the audio using a state-of-the-art speech recognition model, and translates the transcribed text into various languages. The system uses Google’s Translation API and Text-to-Speech library, ensuring synchronization with the original video. The BhashaBlend model achieved a word error rate of 12.4%, significantly better than many major ASR systems, including Google at 15.82% and Microsoft at 16.51%. The model performed strongest on languages with simpler phonetic realization, such as German, English, and Spanish, demonstrating its reliability for video dubbing. These results highlight the model's potential even where greater linguistic complexity is involved and point to the broad applicability of BhashaBlend in multilingual settings.
Keywords
Multilingual Video Dubbing
Automatic Speech Recognition
WER
BERT
Neural Machine Translation
Transformer Architecture
1. Introduction
Multimedia content reflects the diverse languages and cultures of today’s interconnected world. Accurate multilingual transcription and translation is the most effective way to reach the widest possible audience, especially for video formats. This positions NLP and AI as key technologies for automated media localization, education, and business communication, enabling real-time video dubbing and captioning with advanced speech recognition and translation models.
Despite significant achievements, current solutions struggle to deliver translations that are both precise and culturally appropriate. ASR systems generally perform poorly on complicated phonetic, semantic, and dialectal differences, resulting in high error rates, especially for languages with distinctive linguistic characteristics. For instance, although conventional ASR systems have made progress, researchers have pointed out that they still struggle with differing speech durations across languages, which can cause unnatural dubbing [1]. Moreover, accurate translation is commonly hindered by incomplete sentence structures and mismatched timestamps [2]. To address these difficulties, more sophisticated models have recently been developed, such as SqueezeBERTSum, which achieves competitive ROUGE scores while reducing computational costs by 49% compared to BERT-base [3]. Ensemble models built on BERT-related architectures have improved semantic similarity assessment in complex domains such as patent document analysis, reaching very high Pearson correlation coefficients and mitigating limitations related to language barriers and document complexity [4]. New approaches to speech recognition, such as LAS-Transformer, have brought substantial reductions in word error rate through local attention mechanisms and depth-wise separable convolution subsampling [5]. Advances in end-to-end speech recognition with long contexts, achieved through context-expanded Transformers, have likewise improved the accuracy and computational efficiency of processing long audio recordings [6].
BhashaBlend emerges as a novel solution that integrates synchronization-aware TTS for accurate timestamp alignment. It addresses the phonetic complexities of different languages by using transformer-based models that ensure contextual understanding. These capabilities distinguish BhashaBlend from other solutions, such as VideoDubber. It further enhances inclusion by providing accessible and contextually accurate translations suited to linguistic and cultural requirements.
Despite this progress, NLP and ASR technologies require further innovation to fully address the challenges of multilingual content accessibility. Enhanced multilingual transcription and dubbing empowers content creators and media professionals to reach larger audiences and improve inclusivity in media consumption. Such solutions can transcend language barriers, convey information to speakers of many different languages, and improve the user experience in today's connected digital world.
2. Literature Survey
Recent advancements in machine translation, speech recognition, and speech synthesis have opened applications across many industries; automated dubbing of educational video content is one example. Wang et al. [2] used machine translators such as Google Translate and DeepL to translate educational material from English into languages such as Chinese and Spanish, thereby automatically translating educational video courses. Their procedure aligns the translations with video timestamps so that translation and transcription errors can be located at the sentence level. The system reached accuracy values of up to 0.955 on reading-oriented videos and scored even higher on math videos under metrics such as sBERT and BERTScore F1, showing that correct timestamp mapping in machine translation maximizes access to learning material. Multilingual video dubbing has similarly benefited from integrating ASR models with large language models such as BERT and GPT-3: the ASR models provide accurate transcripts of the educational content, which are then translated by fine-tuning the language models on specific language pairs [7]. Although this represents clear progress, fluent synchronization and natural delivery across languages remain difficult. Much of the promise highlighted in critiques of the prevalent technologies lies in coupling the newest neural networks with long-established dubbing workflows, so that automated translation can further widen multilingual access to educational content. To close these gaps in quality and naturalness, translation techniques continue to evolve gradually.
The VideoDubber system by Wu et al. [1] builds on these developments in multilingual video dubbing and takes a significant step forward by including speech-length information in neural machine translation models. This approach addresses the synchronization problem, ensuring that the synthesized speech matches the original audio in duration, and extends the earlier integration of ASR models with large language models. The system uses duration-aware positional embeddings together with a special pause token to achieve higher fluency and naturalness, ultimately yielding better quality and synchronization than the baselines. Experiments validated the usefulness of VideoDubber, showing substantial improvements in naturalness and synchronization and furthering the goal of more accessible multilingual instructional content. While systems like VideoDubber make multilingual material more accessible, an elementary emulator built specifically for this task draws on the same technologies to convert text to speech and vice versa. In STT, acoustic and linguistic modeling translate spoken language into text, whereas TTS employs tokenization and phonetic transcription to produce synthetic speech. These are applied alongside efforts to ensure naturalness and synchronization in dubbing, with algorithms ranging from neural-network-based techniques to Hidden Markov Models deployed to enhance performance. Applications in the educational sector show how TTS and STT technologies improve the learning experience and complement innovations in video dubbing systems, making content more accessible across domains [8]. Related progress in text processing improves semantic similarity matching for patent documents, mirroring the advances in video dubbing and instructional aids: a method that combines multiple BERT models through customized token scoring and weighted averaging extracts semantic links from patent documents, providing higher accuracy and flexibility for complex document structures. This innovation shows how improvements in text processing and machine learning have brought new sophistication to document analysis and categorization [4].
A further step in audio-visual speech translation is AV2AV, which builds on the advances made in video dubbing and educational resources. Just as ensemble BERT-related models improve semantic similarity matching for patent documents, AV2AV keeps translations robust even in noisy settings by learning unified audio-visual speech representations through self-supervised learning. Because the system enables many-to-many translation between languages without relying on text, it supports high-quality virtual conversations and mitigates problems caused by acoustic noise. This architecture shows how new technologies can improve conversation and spoken-language translation, much as document analysis and categorization progress as technologies advance across many environments [9][4]. As emphasized in "Automatic Speech Recognition Systems for Children" [10], proper time-stamping is essential for obtaining high-quality transcripts from ASR systems. Time-aligned transcriptions are required to guarantee that every stage of the pipeline stays in sync, a requirement that is critical for frameworks such as AV2AV. For example, timestamps guarantee that the translated text is aligned with the source video once translation is complete, and they instruct the text-to-speech synthesis to deliver speech that agrees precisely with what is happening in the video.
Overall, these machine translation, speech recognition, and synthesis technologies open channels for innovative educational and multimedia applications. By resolving issues of accuracy, synchronization, and naturalness, they are set to change how audiences access and interact with content across languages.
3. Evaluation Metrics
3.1. Word Error Rate
In speech recognition and machine translation systems, one of the most commonly used performance metrics is the Word Error Rate (WER). The metric measures how correctly words are transcribed or translated by comparing the hypothesis (usually the system's output) against a reference text. WER is calculated by summing the number of insertions I, deletions D, and substitutions S needed to turn the reference into the hypothesis, and dividing by the number of words in the reference, N. The formula is:

WER = (S + D + I) / N
This is useful because it directly measures speech recognition or translation performance: a low WER indicates high accuracy in speech recognition and translation models. WER is useful for evaluating text accuracy, but it has shortcomings; in particular, it is very sensitive to word order and ignores timestamps, which limits its effectiveness for rating translation or transcription quality.
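To make the metric concrete, the sketch below computes WER with a standard word-level edit-distance (Levenshtein) alignment; the example sentences are hypothetical and not drawn from the paper's dataset.

```python
# Minimal WER computation via word-level edit distance (Levenshtein).
# Illustrative only; the example sentences are hypothetical.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "please assist me with the report"
    hyp = "please help me with report"
    # 1 substitution + 1 deletion over 6 reference words -> 2/6 ≈ 0.33
    print(f"WER = {word_error_rate(ref, hyp):.2f}")
```

The "assist"/"help" substitution in this toy example also illustrates why WER penalizes semantically correct paraphrases, a limitation discussed later in Section 5.1.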
3.2. BERTScore F1
BERTScore F1 is a more advanced performance metric based on the power of BERT (Bidirectional Encoder Representations from Transformers) to measure the semantic similarity between a reference and a hypothesis text. Unlike metrics like Word Error Rate (WER), it is not based on the exact word-matching process, but on word contextual embeddings that consider the meaning and nuances of the sentences. The F1 score in BERTScore is the harmonic mean of Precision and Recall, where:
Precision: Measures how many words in the hypothesis have a semantically similar counterpart in the reference.
Recall: Measures how many words in the reference are semantically matched by the hypothesis.
BERTScore F1 is crucial for the project because it allows us to evaluate the quality of translations in a way that accounts for the meaning and context, not just word-for-word accuracy. This is particularly important when translating complex content where the semantic integrity of the translation is more important than the exact wording.
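As a minimal sketch of how this metric can be computed in practice, the snippet below uses the open-source bert-score package; the sentence pair, language code, and default model choice are illustrative assumptions rather than the paper's exact evaluation setup.

```python
# Sketch: computing BERTScore F1 with the open-source `bert-score` package.
# The sentences below are hypothetical examples.
from bert_score import score

candidates = ["I cleaned my room today."]
references = ["Today, I cleaned my room."]

# Returns precision, recall, and F1 tensors, one value per sentence pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```

Because the comparison is made in contextual embedding space, this word-order paraphrase scores much higher than it would under a strict word-matching metric such as WER.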
3.3. Translation Edit Rate (TER)
Translation Edit Rate (TER) measures the amount of post-editing required to convert a machine-generated translation into a human reference translation. It calculates the number of edits— insertions, deletions, substitutions, and shifts—needed to align the system output with the reference translation. TER is particularly useful for evaluating the quality of translations by quantifying the extent of necessary corrections [7]. While TER provides insights into the edit distance between translations, it has limitations. TER does not account for the fluency or readability of the translations, focusing instead on the edit operations required. It also does not reflect the semantic accuracy of translations, which is crucial for content like educational videos where meaning preservation is essential. Additionally, TER can be less sensitive to minor translation errors that may impact overall quality but require substantial editing to correct [11]. In the model, the decision was made not to use TER due to its focus on edit operations rather than semantic accuracy. For educational and multimedia content, preserving meaning and context is more critical than the number of edits required. Metrics like WER and BERTScore F1 better capture the semantic fidelity and contextual accuracy of translations, aligning more closely with the project’s objectives. Consequently, TER’s limitations in addressing meaning and readability rendered it less suitable for evaluating the translation and transcription systems.
3.4. BLEU Score
The BLEU (Bilingual Evaluation Understudy) Score is a metric for evaluating the quality of machine-generated translations by comparing them to reference translations. It measures the precision of n-grams (typically unigrams, bigrams, trigrams, and 4-grams) and includes a brevity penalty to account for translation length. BLEU has limitations including its focus on n-gram precision, which can overlook semantic accuracy and fluency. It also tends to favor shorter translations and does not capture sentence-level meaning well. BLEU’s brevity penalty can lead to misleading scores for translations that are slightly longer but more contextually accurate. Given the complexity of educational and multimedia content, BLEU’s emphasis on n-gram precision and brevity penalties was less suited for evaluating semantic accuracy and contextual meaning. Metrics like BERTScore F1 provided a better assessment of meaning preservation, which was crucial for our project’s goals [9].
3.5. METEOR Score
METEOR (Metric for Evaluation of Translation with Explicit ORdering) Score evaluates translation quality by considering synonyms, stemming, and paraphrasing. It is designed to address some limitations of BLEU by incorporating semantic meaning and word alignment, providing a more nuanced evaluation of translation quality. METEOR is more complex to compute compared to BLEU and can be sensitive to the quality of word alignments. It also requires substantial resources for accurate evaluation and does not always align with human judgments on translation fluency and contextual accuracy [11]. While METEOR provides more semantic evaluation than BLEU, its complexity and alignment requirements made it less practical for our model, which focused on semantic accuracy and context. The simplicity and effectiveness of BERTScore F1 better suited our needs for evaluating translations based on meaning rather than exact word matching.
4. Methodology
The proposed model is an advanced video translation system designed to convert video content into multiple languages while maintaining contextual integrity and synchronization. The system consists of three key components: Speech Analytics, Translation Processing, and Text-to-Speech synthesis. It begins by extracting and transcribing audio with state-of-the-art speech recognition models to produce precise text transcriptions. These transcriptions are then translated into various languages via the Google Translation API, ensuring accuracy and contextual relevance. Finally, the translated text is converted back to speech and synchronized with the video, making the content accessible to a multilingual audience. Currently, the system requires videos to be pre-downloaded before processing; this step ensures that the content is available for analysis and translation.
The dataset used in this project included almost 150 video clips spanning a wide range of domains, including educational content, movies, and general multimedia, for the languages specified in this study. To ensure proper evaluation, the dataset was divided into two parts:
1. Testing Phase: A small set of random videos in different languages was used first to test the model, enabling iterative fine-tuning and adjustment.
2. Language-Specific Evaluation: The remaining dataset was structured to include an equal number of videos for each language, with almost 15 videos per language, ensuring balanced evaluation across linguistic and cultural nuances.
The dataset included diverse content such as documentaries, dialogues, and instructional videos, adding complexity. The dialogues presented overlapping speech challenges, whereas documentaries frequently had long, formal sentences that needed precise alignment and context preservation. With this variety, the dataset simulated real-world scenarios and helped the model show its robustness across a broad spectrum of use cases.
Performance validation on the dataset's 19–20 test videos indicates that the model requires approximately 5 to 8 hours to process them, including transcription, translation, and synchronization. The system demonstrated high accuracy across different languages, showing its ability to handle the variability of video content and the nuances of language effectively. This robustness suggests the model could be extended toward real-time applications. Testing was crucial in determining whether the model is ready for deployment, as it allowed its processes to be refined and its performance validated on practical, multilingual transcription and translation tasks.
4.1. Speech Analytics
The Speech Analytics module is central to translating spoken words in any language or dialect into the target language while keeping synchronization with the video. The pipeline begins by extracting the audio from videos, typically with tools like moviepy or ffmpeg, and saving it in WAV or MP3 format. The audio is then processed with an ASR system whose core design is built on transformer-based architectures.
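As a concrete illustration of the extraction step, the snippet below pulls the audio track from a video with moviepy and writes it to WAV; the file names and sample rate are placeholders, not values taken from the paper's pipeline.

```python
# Sketch: extracting the audio track from a video with moviepy.
# File names and the 16 kHz sample rate are illustrative choices.
from moviepy.editor import VideoFileClip

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    clip = VideoFileClip(video_path)
    # Write the audio track as a 16 kHz WAV, a common input rate for ASR systems.
    clip.audio.write_audiofile(audio_path, fps=16000)
    clip.close()
    return audio_path

extract_audio("input_video.mp4")
```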
Acoustic modeling, the first stage of any ASR process, breaks the audio down into phonemes, the smallest units of sound. Grounded in acoustic-phonetic techniques, it links sounds to elements defined in phonetic terminology and is pivotal for accurately transcribing speech into text. These techniques identify the building blocks of spoken language and yield accurate, noise-robust transcription. The output of acoustic modeling feeds the language model, which predicts word sequences from the recognized phonemes. The language model uses statistical methods and relies on large text corpora to refine these predictions, further augmented by transformer-based local attention mechanisms that focus on relevant parts of speech and handle ambiguities more effectively. The system is built on the transformer, an architecture that has reshaped natural language processing: transformers pass input data through several layers of attention mechanisms that allow the model to weigh how important different parts of the input audio are. Whisper performs strongly here, using a transformer structure well suited to understanding and transcribing speech. It has been trained on large and diverse datasets comprising hundreds of thousands of language segments across multiple languages, and this extensive training enables it to manage varied speaking styles and linguistic subtleties. The model converts audio waveforms into sequences of tokens that are fed through multiple transformer layers to predict an accurate transcript for each input audio sequence; the multi-head attention mechanism in this architecture improves transcription accuracy and is especially helpful for long-range dependencies. This matters most for languages whose semantics depend heavily on word order or grammatically complicated structures, such as English or German.
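The transcription step described above can be sketched with the open-source whisper package; the model size ("base") and file name are illustrative choices, not the configuration reported in the paper.

```python
# Sketch: transcribing extracted audio with OpenAI's open-source Whisper model.
# Model size ("base") and the audio file name are illustrative choices.
import whisper

model = whisper.load_model("base")
# Whisper returns the full transcript plus time-stamped segments,
# which can later be reused to keep the dubbed audio in sync with the video.
result = model.transcribe("audio.wav")

print(result["text"])
for seg in result["segments"]:
    print(f'{seg["start"]:.2f}s - {seg["end"]:.2f}s: {seg["text"]}')
```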
In the decoding stage, large context windows combine the acoustic and language model outputs to produce a transcription that is appropriate and relevant to the context of the input audio. These context windows further improve transcription accuracy because the system considers speech utterances across extended dependencies [4].
Compared with other ASR models such as Google Speech-to-Text and IBM Watson STT, this model remains accurate under difficult conditions such as background noise or overlapping dialogue, where the others tend to struggle, because it has been trained on more diverse speech. Its open-source nature gives users the freedom to tune the model to the specific needs of their application, a significant advantage over proprietary systems such as Microsoft Azure Speech and Amazon Transcribe [13].
An accurate transcription from the model opens the door to the next step of the project: automatic translation. This step transforms the transcribed text into a defined target language while preserving exactly what was said, allowing the script to flow naturally from one dialect or language to another in multilingual dubbing.
4.2. Translation Process
The translation phase is the component of the video translation system that renders the transcribed text into several target languages. The transcription methods used in the previous phase rely on acoustic-phonetic techniques that are critical for transcription accuracy and therefore for good translation quality [17].
The text is then translated through the Google Translation API, a cloud service that draws on machine learning and huge datasets to provide highly accurate and contextual translations. Supporting over 100 languages, the Translation API covers an extremely wide range of multilingual needs. Traditional rule-based and statistical models, by contrast, find it extremely difficult to cope with variability in natural language, particularly in noisy conditions; even RNNs, the earliest neural approaches, and later variants such as LSTMs usually cannot handle long-term dependencies, so their performance on more complex transcription tasks degrades drastically [18][19]. Post-processing ensures that the translated text is grammatically correct and coherent. This phase is supported by the attention mechanisms in transformers, which focus on relevant text sections during translation and preserve meaning when moving between languages with different grammatical structures or cultural nuances [12]. The Google Translation API also enhances the multilingual capabilities of the system, making information available to users in many languages. The evaluation of the translation component concerns the efficiency and accuracy with which the multilingual dubbing produces contextually appropriate and grammatically correct text.
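A minimal sketch of this step using the Google Cloud Translation API (v2 client library) is shown below; it assumes Google Cloud credentials are configured in the environment, and the segment structure, target language, and helper name are hypothetical.

```python
# Sketch: translating time-stamped transcript segments with the
# Google Cloud Translation API (v2 client). Assumes GOOGLE_APPLICATION_CREDENTIALS
# is configured; the segment format and target language are placeholders.
from google.cloud import translate_v2 as translate

def translate_segments(segments, target_language="es"):
    client = translate.Client()
    translated = []
    for seg in segments:
        result = client.translate(seg["text"], target_language=target_language)
        # Keep the original timestamps so the dubbed audio can be aligned later.
        translated.append({"start": seg["start"], "end": seg["end"],
                           "text": result["translatedText"]})
    return translated
```

Carrying the Whisper timestamps through the translation step is what later allows the synthesized speech to be placed against the video's temporal structure.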
4.3. Text To Speech
The TTS component converts the translated text into audio in multiple languages, making the video content accessible through an audio source. In this system, the Google Text-to-Speech library takes the translated text, converts it into an audio waveform, and matches that waveform to the temporal structure of the video.
The proposed TTS differs from other TTS pipelines in that it uses Google's Text-to-Speech library, selected for its simplicity, robustness, and broad language support. The system is built around the text-to-speech process: speech recognition of the source-language audio produces the text, which the translation module then renders into equivalent text in the target languages. Text preprocessing steps are then applied to optimize the text for TTS input, including handling special characters, removing noise, and improving sentence structure. gTTS serves as the main library of the TTS system, transforming the processed text into an audio waveform. The generated audio is placed so that it is accurately synchronized with the temporal structure of the original video, producing a seamless combination with the video content.
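The synthesis step can be sketched with the gTTS library as below; pydub is used here only as an illustrative way to measure the duration of each synthesized clip against its time slot, and the segment format and function names are hypothetical rather than the paper's implementation.

```python
# Sketch: synthesizing each translated segment with gTTS and checking its duration
# against the original timestamps for later synchronization. Names are placeholders.
from gtts import gTTS
from pydub import AudioSegment

def synthesize_segment(segment, lang="es", out_path="segment.mp3"):
    tts = gTTS(text=segment["text"], lang=lang)
    tts.save(out_path)
    # Compare the synthesized duration with the slot available in the original video.
    duration = len(AudioSegment.from_mp3(out_path)) / 1000.0
    slot = segment["end"] - segment["start"]
    print(f"synthesized {duration:.2f}s of speech for a {slot:.2f}s slot")
    return out_path
```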
The evaluation of the TTS component focused mainly on speech quality, naturalness, and intelligibility. Although the gTTS library succeeded in producing high-quality audio, the main problems concerned audio-video synchronization and correct pronunciation in certain languages. These were mitigated by recent developments combining audio-video synchronization algorithms with language-specific TTS model adaptation techniques.
The acoustic and phonetic techniques used by the ASR model during transcription give a firm backbone to the TTS process: proper transcription means that the timing and delivery of the TTS output closely match the original content. Pre-processing consists of eliminating special characters and noise and optimizing sentence structure. The Google Text-to-Speech library then converts the pre-processed text to audio synchronized with the video, creating a fluid speech translation effect.
Among the remaining challenges, audio-video synchronization and accurate pronunciation in specific languages are the most significant. New synchronization algorithms and techniques for adapting TTS models to individual languages are among the advances in this area. Testing of the TTS component focuses on key metrics including speech quality, naturalness, and intelligibility.
5. Deep Dive into Word Error Rate and Translation: Challenges and Solutions
5.1. Word Error Rate (WER): Challenges Encountered and How We Addressed Them
Calculating the WER proved a considerable challenge during development because, by nature, the model's output and the ground-truth data (YouTube transcripts) did not match exactly. Initially the WER was very high, and the issue turned out to lie not so much with mismatched words as with misaligned structure: the transcripts were being compared line by line, so any small shift in word order, inconsistency in punctuation, or line break between the model output and the YouTube-provided transcript caused large spikes in the error rate. To address this, the initial approach was adjusted: instead of comparing the transcripts line by line, all of the text was combined into a single paragraph, eliminating the structural discrepancies. Punctuation handling was also standardized and all text was converted to lowercase, which resolved issues related to capitalization mismatches. After these refinements the WER improved significantly. However, even with these changes, challenges remained.
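The refinements described above (merging lines into one paragraph, lowercasing, stripping punctuation) can be sketched as a small normalization step applied to both transcripts before WER is computed; the regular expression and example lines are illustrative choices, not the project's exact implementation.

```python
# Sketch: normalizing model output and reference transcripts before WER comparison,
# mirroring the refinements described above (single paragraph, lowercase, no punctuation).
import re

def normalize_transcript(lines):
    text = " ".join(lines)                  # collapse line breaks into one paragraph
    text = text.lower()                     # remove capitalization mismatches
    text = re.sub(r"[^\w\s']", " ", text)   # strip punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

model_lines = ["Hello, everyone!", "Welcome to the lecture."]
youtube_lines = ["hello everyone welcome", "to the lecture"]
print(normalize_transcript(model_lines))
print(normalize_transcript(youtube_lines))
# Both normalize to: "hello everyone welcome to the lecture"
```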
One of the issues that the model encountered was related to synonyms. YouTube’s transcripts might use a particular word, while the model could generate a synonym of that word. For instance, YouTube might use "assist," while the model might output "help." Although these words are semantically identical, this caused discrepancies in the WER calculation, as it is a strictly word-based comparison metric. This underscores a common challenge in transcription systems where correctness related to semantics does not always align with WER. According to Microsoft Azure's guidelines, a WER between 5–10% is considered reliable, whereas a WER of 30% or more indicates poor quality requiring significant customization and further training [18].
Table 1
EVALUATION OF WORD ERROR RATE (WER) BY THE PROPOSED MODEL ACROSS 10 LANGUAGES

LANGUAGES | WER
English   | 0.02
Italian   | 0.08
French    | 0.16
Spanish   | 0.11
Russian   | 0.08
German    | 0.00
Japanese  | 0.45
Hebrew    | 0.09
Korean    | 0.05
Hindi     | 0.20
5.2. Language-Specific WER Observations
1. English (WER: 0.02): The English WER was impressively low at 0.02, showing almost perfect transcription accuracy. This result aligns with the fact that most ASR models are heavily trained on English data, resulting in highly optimized performance for this language.
2. French (WER: 0.16): French had a slightly higher WER of 0.16, likely due to the language’s intricate grammar, silent letters, and distinctive phonetic patterns. For instance, words like "est" (is) and "et" (and) sound alike, leading to possible substitution errors. Homophones and verb conjugation differences are further causes of the higher WER. These are typical problems for ASR systems working with the complex phonetic structures of Romance languages [7].
3. Italian (WER: 0.08): Italian did fairly well with a WER of 0.08. While Italian and French share some phonetic challenges, Italian's sound rules are more uniform than French's, and its vowel and consonant patterns are more regular, making it easier for the model to transcribe and yielding a lower error rate.
4. Spanish (WER: 0.11): The WER for Spanish was 0.11, reflecting the fact that the language is very widely spoken and backed by large amounts of training data. Moreover, Spanish is largely phonetic, with words mostly pronounced as they are spelled, which limits transcription errors. However, there is considerable regional variation across Latin America and Spain. For example, "casa" (house) sounds very similar to "caza" (to hunt), so an ASR system is likely to make occasional contextual errors when transcribing such words.
5. Russian (WER: 0.08): The WER for Russian was 0.08, despite its Cyrillic script and intricately complex grammar. Its free word order, frequent diminutives, and case endings can lead to substitutions or deletions, the two most frequent error types, when the model misinterprets or overlooks such subtle grammatical cues.
6. German (WER: 0.00): Testing produced a perfect WER of 0.00 for German, owing to its clear pronunciation and predictable word structure. Even highly complex compound words, which might be expected to be misrecognized, were handled by the model without any errors. This accuracy places German at the top of the list of the most successful languages in the project.
7. Hebrew (WER: 0.09): Hebrew achieved a WER of 0.09, which is also encouraging given its script, the Hebrew alphabet, and the language's distinctive phonetic properties. Transcription ambiguities caused by the absence of written vowels and potential layout problems from the right-to-left script did not prove to be beyond the capabilities of the ASR system.
8. Korean (WER: 0.05): The WER for Korean was 0.05, showing that the model handles non-European languages efficiently. Despite the hurdles posed by the unique Hangul script and a sentence structure in which the verb typically comes last, the ASR system performed excellently. One major issue is the intricate system of honorifics, which depends heavily on context to determine the appropriate form of speech; such subtlety can lead to substitution errors, yet the low WER indicates that the model managed most of these nuances.
9. Japanese (WER: 0.45): Japanese is highly dependent on context to determine the meaning of many words. One example is "はし" (hashi), which can mean "bridge," "chopsticks," or "edge," depending on the context. An ASR system without deep contextual comprehension may fail to transcribe such words accurately, increasing errors. Moreover, Japanese sentences usually omit subjects and objects when they are implicit in the context, which can further confuse ASR systems trained on languages with more explicit sentence structures. These issues were reflected in the poor performance of Japanese transcription, where phonetic similarity between different words led to a higher WER and less accurate translations. Phonetic challenges were particularly pronounced, as shown in Fig. 2. For example, the Japanese sentence "彼は橋を渡った" ("He crossed the bridge") can easily be misrecognized if the ASR system mistakes "橋" for "箸". Such errors underline how hard it is to transcribe Japanese properly, especially when context is sparse or the model struggles with small phonetic differences. The kanji system adds further difficulty, since a single kanji character may be read in different ways depending on context; "生," for example, may be read as "sei" (life), "nama" (raw), or "ikiru" (to live). This aligns with previous research highlighting the difficulty of applying ASR to languages with complex orthographies and context-dependent meanings, particularly Japanese [4]. Synchronization difficulties in Japanese transcription and translation are compounded by variable sentence lengths and the omission of explicit subjects or objects, which distort timestamp alignment during dubbing. These were addressed by integrating techniques such as duration-aware positional embeddings and pause-token insertion into the model, ensuring that the translated output stays closely aligned with the original audio's temporal structure even when sentence construction and linguistic style differ. For instance, pause tokens maintain appropriate timing between translations of shorter and longer sentences, reducing mismatches in dubbing. Additionally, the model was improved with context-aware attention mechanisms, which allow it to better handle the nuances of meaning in words like "橋" (bridge) and "箸" (chopsticks) under time constraints, enhancing synchronization without losing semantic accuracy.
Moreover, various factors contribute to the errors observed in WER calculations. These include deletion errors (missing words), insertion errors (extra words), and substitution errors (incorrect words). Deletion errors often result from weak audio signals, while insertion errors are caused by noisy recording environments. Substitution errors occur when the model lacks domain-specific vocabulary, all of which impact the WER [18].
5.3. Translation: Achievements and Challenges with BERT Score
The focus on using BERT Score for translation evaluation was to ensure that the translated output not only preserved the original meaning but also conveyed the phonetic and semantic accuracy of the original text.
Table 2
BERT SCORE EVALUATION OF THE PROPOSED MODEL FOR VARIOUS LANGUAGE PAIRS

LANGUAGES | English | Italian | French | Spanish | Russian | German | Japanese
English   |   NAN   |  0.84   |  0.84  |  0.85   |  0.76   |  0.81  |  0.73
Italian   |  0.85   |   NAN   |  0.85  |  0.86   |  0.75   |  0.84  |  0.71
French    |  0.82   |  0.81   |   NAN  |  0.81   |  0.40   |  0.80  |  0.72
Spanish   |  0.83   |  0.85   |  0.84  |   NAN   |  0.72   |  0.78  |  0.70
Russian   |  0.78   |  0.76   |  0.76  |  0.76   |   NAN   |  0.76  |  0.72
German    |  0.84   |  0.83   |  0.83  |  0.83   |  0.74   |   NAN  |  0.69
Japanese  |  0.77   |  0.74   |  0.75  |  0.75   |  0.75   |  0.71  |   NAN
The measurement was then carried out, and the overall BERT Score was taken as an aggregate across different language pairs. Table 2 shows that the maximum BERT Score, 86%, occurred for the Italian-Spanish language pair, indicating a high level of semantic similarity in those translations. However, problems such as sentence structure differences kept recurring across languages. For example, a simple sentence such as "Today, I cleaned my room" might be translated into another language as "I cleaned my room today." Both sentences mean the same thing, but the BERT Score may be lower due to word-order differences. Also, in languages like German, "doch" has multiple meanings, and Google Translate may pick a different meaning than the one intended, causing a drop in BERT Score. This finding is consistent with the results from "Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method," which also addressed semantic similarity between texts and emphasized the difficulty of capturing exact meaning across varied contexts [4].
Table 3
MEAN BERTSCORE VALUES ALONG WITH THEIR CORRESPONDING CONFIDENCE INTERVALS

Language Pair       | Mean BERTScore F1 | Confidence Interval (±)
English -- Italian  | 0.86              | ± 0.03
English -- French   | 0.85              | ± 0.02
English -- Spanish  | 0.87              | ± 0.04
English -- Japanese | 0.74              | ± 0.05
English -- Hindi    | 0.76              | ± 0.03
Japanese -- Korean  | 0.72              | ± 0.06
Along with F1, 95% confidence intervals were calculated to ensure reliability. The analysis shows that European language pairs (e.g., English ↔ Italian, French, and Spanish) reached high BERTScores with narrower confidence intervals, indicating higher semantic similarity and coherence between translations. In contrast, Japanese and Korean language pairs tended to have lower scores due to significant grammatical and semantic differences that make contextual coherence harder to maintain. The confidence intervals further support the robustness of the results; for instance, the wider interval for the English ↔ Japanese pair reflects greater variability and the challenge posed by languages with complex syntactic structures. To ensure a proper evaluation, the model was tested over multiple runs (n = 10) to confirm consistency. The standard deviation observed for high-performing pairs such as English ↔ Italian was minimal (σ = 0.02), further strengthening the reliability of BERTScore for these translations. This metric strongly supports the model's ability to retain semantic integrity across multiple languages, especially those with relatively less complex grammatical structures, while leaving room for improvement in linguistically complex languages such as Japanese and Korean.
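As an illustration of how means and 95% intervals of this kind can be obtained, the sketch below aggregates per-run BERTScore F1 values using a normal approximation; the ten sample scores are hypothetical placeholders, not the paper's measured values.

```python
# Sketch: mean BERTScore F1 and a 95% confidence interval over repeated runs
# (normal approximation, z = 1.96). The per-run scores are hypothetical placeholders.
import math
import statistics

def mean_with_ci(scores, z=1.96):
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

runs_en_it = [0.88, 0.85, 0.86, 0.87, 0.84, 0.86, 0.85, 0.87, 0.86, 0.86]  # n = 10
mean, ci = mean_with_ci(runs_en_it)
print(f"English–Italian: {mean:.2f} ± {ci:.2f}")
```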
To address synchronization issues arising from the differing sentence lengths of these languages, a method such as duration-aware positional embeddings can be applied so that translations align better with the timestamps. In addition, languages whose sentence structure and word order differ substantially from source languages such as English may need further consideration. For example, Japanese often omits subjects and relies heavily on context, which complicates direct alignment. Using synchronization-aware techniques such as pause tokens and temporal embeddings, the system can maintain correct timing across translations, reducing discrepancies in dubbing and improving the naturalness of the multilingual outputs. These innovations were particularly effective for languages with longer or context-dependent phrases, so that synchronization was maintained despite linguistic variations.
Early experiments also used other translation APIs, such as MyMemory, which soon proved too slow for practical use. Switching to Google Translate brought greater reliability and speed to the solution; Google's translation service can also handle large amounts of text at a time, which is critical for the project's scalability.
5.4. Why Not Azure, DeepL, or Other Translation Services?
Azure's translation service could not be adapted to our use case because of its cost structure and its incompatibility with some of our requirements. Google Translate, in contrast, stood out with its broad language support, speed, and accuracy, especially for complex languages such as Hebrew and Korean. Comparisons of automated machine translation tools conducted in the study "Applying Automated Machine Translation to Educational Video Courses" [2] similarly found Google Translate ahead in speed and language versatility.
DeepL, meanwhile, holds a strong reputation for translation quality in European languages, but it does not support some of the project's key languages, such as Japanese and Korean. Its focus on a smaller set of languages reduced its utility where comprehensive language coverage was required. Google Translate, on the other hand, offered better consistency across a broader range of languages, fitting well with the project's multilingual transcription and translation needs.
LibreTranslate was also evaluated as an open-source option, but compared with Google Translate it showed weaker performance in both speed and accuracy, and it performs poorly on non-European languages. These limitations led to the decision to base the translation system on Google Translate, further backed by findings from "Multilingual video dubbing—a technology review and current challenges," which reported similar challenges in multilingual translation [7].
6. Result
The BhashaBlend model achieves an impressive overall Word Error Rate (WER) of 12.4% compared with other popular ASR systems such as Google (15.82%), Microsoft (16.51%), and Amazon (18.4%) [19], demonstrating stronger transcription performance. In addition, a comparative study benchmarking IBM Watson, the Google API, and Wit ASR found that Google led in predicting sentences correctly with the fewest errors, while IBM and Wit lagged behind. BhashaBlend's WER of 12.4% is still better than Google's WER in the same context [20], indicating that BhashaBlend is more robust and more accurate in handling linguistic challenges than state-of-the-art systems.
Table 4
WER COMPARISON OF PROPOSED MODEL VS. COMPARATIVE ASR SYSTEMS

ASR SYSTEM | IBM  | AMAZON | MICROSOFT | GOOGLE | BHASHABLEND
WER (%)    | 38.1 | 18.4   | 16.51     | 15.82  | 12.4
The table demonstrates that the proposed model, BhashaBlend, achieves the lowest Word Error Rate (WER) of 12.4%; since a lower WER indicates better performance, it provides more accurate transcription than its counterparts. The model showed excellent performance on languages with lower phonetic complexity and fewer semantic challenges: for example, the system attained a WER of 0.00 for German, showing that it efficiently handles well-structured languages with regular grammar and limited phonetic variation. Conversely, it faced major problems with Japanese, whose phonetic and semantic structures are very complex. Overall, though, the results obtained by the proposed architecture are very promising. It performed excellently in languages such as English, French, Italian, Spanish, and Korean, where the WER remained very low. The results validated its efficiency for tasks like multilingual transcription and video dubbing, particularly for languages with relatively less semantic and phonetic complexity.
Although the results are promising, the dependency on pre-downloaded videos makes it less suitable to be used in real-time processing scenarios. Real-time video processing is highly challenging from a computational perspective, mainly in terms of synchronization and on-the-fly transcription and translation. Future iterations should involve the integration of lightweight transformer architectures and advanced real-time inference mechanisms to cut down on processing delays and provide support for on-the-fly transcription and translation. These enhancements would make it possible for BhashaBlend to respond to the demands of dynamic environments, such as a live broadcast or streaming platform.
While the present outcomes demonstrate BhashaBlend's advantage over long-established ASR systems such as Google, Microsoft, and Amazon, future evaluations could be strengthened by benchmarking against newer open-source alternatives such as OpenAI Whisper to provide a more in-depth, up-to-date assessment.
7. Conclusion
This paper describes the latest advancements in machine translation, speech recognition, and synthesis for quality video dubbing and the translation of instructional material. Using neural machine translation output has further improved translation by matching audio scripts and video captions for dubbing. The integration of advanced ASR models with large language models addresses the key issues of synchronization and naturalness that have traditionally limited the effectiveness of state-of-the-art models. In particular, BERTScore F1 offers a more contextually relevant estimate of translation quality than traditional metrics such as METEOR and BLEU. Integrating accessibility features along with TTS/STT, as in VideoDubber, has been a step forward. While promising in its approach of combining advanced ASR models with synchronized text-to-speech synthesis, the method still presents challenges, such as proper synchronization and pronunciation, that warrant further research. Efforts should be directed at improving performance on languages such as Japanese by tuning and integrating contextual embeddings that address phonetic and semantic complexities. Efficiency and performance can be improved considerably with lightweight transformers and adaptive inference mechanisms for real-time transcription and translation. In addition, expanding the dataset with additional dialects and regional accents will further refine robustness, keeping the system versatile and adaptable.
Reducing computational costs and improving scalability will enable real-time video processing without pre-downloading, making the system suitable for applications like live broadcasting and multilingual conferencing. This will promote multilingualism and improve access to multilingual content, encouraging an all-embracing digital landscape where the full potential of BhashaBlend can be realized across languages.
Funding Declaration
The authors did not receive support from any organization for the submitted work.
No funding was received to assist with the preparation of this manuscript.
No funding was received for conducting this study.
No funds, grants, or other support was received.
The authors have no conflicts of interest to declare that are relevant to the content of this article.
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
The authors have no financial or proprietary interests in any material discussed in this article.
Author Contribution
A.T., T.C., and V.Y. wrote the main manuscript text, and V.Y. prepared the figures. A.I.A. fine-tuned the content flow and the quality of the figures. All authors reviewed the manuscript.
8. References
1.Wu, Y., Guo, J., & Tan, X. (2023). VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 13772–13779. https://doi.org/10.1609/aaai.v37i11.26613
2.Wang, L. (2023). Applying automated machine translation to educational video courses. Education and Information Technologies. Published online October 2. https://doi.org/10.1007/s10639-023-12219-0
3.Abdel-Salam, S., & Rafea, A. (2022). Performance Study on Extractive Text Summarization Using BERT Models. Information, 13(2), 67. https://doi.org/10.3390/info13020067
4.Yu, L., Liu, B., Lin, Q., Zhao, X., & Che, C. Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method. arXiv (Cornell University). Published online January 1, 2024. https://doi.org/10.48550/arxiv.2401.06782
5.Fu, P., Liu, D., & Yang, H. (2022). LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition. Information, 13(5), 250. https://doi.org/10.3390/info13050250
6.Hori, T., Moritz, N., Hori, C., & Le Roux, J. (2021). Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers. Interspeech 2021. https://doi.org/10.21437/interspeech.2021-1643
7.Bigioi, D., & Corcoran, P. (2023). Multilingual video dubbing—a technology review and current challenges. Frontiers in Signal Processing, 3. https://doi.org/10.3389/frsip.2023.1230755
8.Nikolaeva, D. (2023). An Elementary Emulator Based on Speech-To-Text and Text-to-Speech Technologies for Educational Purposes. Published online September 13. https://doi.org/10.1109/et59121.2023.10278929
9.Choi, J., Park, S. J., Kim, M., & Ro, Y. M. (2024). AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 27315–27327. https://doi.org/10.1109/cvpr52733.2024.02580
10.Bhardwaj, V., Ben Othman, M. T., Kukreja, V., et al. (2022). Automatic Speech Recognition (ASR) Systems for Children: A Systematic Literature Review. Applied Sciences, 12(9), 4419. https://doi.org/10.3390/app12094419
11.Och, F. J. Minimum error rate training in statistical machine translation. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - ACL ’03. Published online 2003. https://doi.org/10.3115/1075096.1075117
12.Sarma, B. D., & Prasanna, S. R. M. (2017). Acoustic–Phonetic Analysis for Speech Recognition: A Review. IETE Technical Review, 35(3), 305–327. https://doi.org/10.1080/02564602.2017.1293570
13.Wei, K., Guo, P., & Jiang, N. Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism. Interspeech 2022. Published online September 16, 2022. https://doi.org/10.21437/interspeech.2022-10066
14.Moritz, N., Hori, T., & Roux, J. L. Streaming automatic speech recognition with the transformer model. arXiv (Cornell University). Published online January 1, 2020. https://doi.org/10.48550/arxiv.2001.02674
15.Narayanan, S. M., Kumar, A., & Vepa, J. (2021). Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2102.00804
16.Ganesh, S., Dhotre, V., Patil, P., & Pawade, D. (2023). A Comprehensive Survey of Machine Translation Approaches. Published online December 8. https://doi.org/10.1109/icast59062.2023.10455003
17.Li, X., van Deemter, K., & Lin, C. (2020). A Text Reassembling Approach to Natural Language Generation. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2005.07988
18.Eric-Urban (2023). Test accuracy of a Custom Speech model - Speech service - Azure AI services. learn.microsoft.com. Published July 18, https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=speech-studio
19.Understanding Word Error Rate (WER) in Automatic Speech Recognition (ASR) (2021). Clari. Published December 13. https://www.clari.com/blog/word-error-rate/
20.Filippidou, F., & Moussiades, L. (2020). A Benchmarking of IBM, Google and Wit Automatic Speech Recognition Systems. IFIP Advances in Information and Communication Technology, 73–82. https://doi.org/10.1007/978-3-030-49161-1_7