Optimizing T5 for Lightweight Tibetan-English Translation
Jacob Moore
Lawrence Technological University, Southfield, MI, USA
jmoore10@ltu.edu
Paula Lauren
Lawrence Technological University, Southfield, MI, USA
plauren@ltu.edu
Abstract
We present the first lightweight Tibetan-English machine translation models optimized for low-resource settings and edge deployment. Our approach combines (1) a custom tokenizer trained on Tibetan script, (2) continued pretraining on Tibetan-English corpora, and (3) supervised fine-tuning on domain-specific translation pairs. Through ablation studies, we quantify each component’s contribution to translation quality. Results show that both the tokenizer and pretraining significantly improve performance, especially at small data scales. This work establishes the first strong baseline results for Tibetan-English translation with compact models and offers a practical framework for other underrepresented, non-Latin-script languages.
Keywords:
Tibetan Language
Machine Translation
Low Resource Language
Small Language Models
This work includes excerpts from Advice on Bending Mind Toward the Good by Khenchen Ngawang Palzang (translated by Joseph McClellan, 2024) and from Protection from All Fears: A Prayer to Ārya Tārā by Sera Khandro (translated by Adam Pearcey, 2025), licensed under Creative Commons Attribution-NonCommercial 4.0 International. These excerpts appear in Section 5.1.2.
1 Introduction
Machine translation (MT) for the Tibetan language remains critically underdeveloped, particularly in the Tibetan-English direction. While Tibetan-Chinese MT has progressed from phrase-based approaches to neural models (Chen et al., 2019; Zhou, 2024), Tibetan-English MT has received minimal attention. To our knowledge, only two recent studies have addressed this task, Shu et al. (2024) and Nehrdich and Keutzer (2025), and both rely on large-scale models that are impractical for real-time applications or edge-device deployment. Additionally, both of these studies address multiple languages rather than focusing solely on Tibetan. Shu et al. (2024) report BLEU (Bilingual Evaluation Understudy) scores (Papineni et al., 2002) of 0.00 for LLaMA 3.1 405B and GPT-4o, with only marginal improvements using retrieval-augmented generation. In contrast, Nehrdich and Keutzer (2025) achieve reasonably high GEMBA (GPT Estimation Metric Based Assessment) (Kocmi & Federmann, 2023) scores for large, unfinetuned models like Gemma 3 27B (Gemma Team, 2025), and improve these further with finetuning, but these results still depend on significant compute resources. While these contributions demonstrate the potential of large-scale approaches, there remains a critical need for dedicated lightweight frameworks, which we show can produce performance comparable to, or better than, large models while remaining resource-efficient.
Several linguistic characteristics of Tibetan contribute to this challenge. Tibetan is a morphologically rich and syntactically distinctive language (Tournadre & Dorje, 2003), featuring complex ergative constructions, auxiliary verb systems, and a syllabic script with no spaces between words. Tokenization is particularly problematic: commonly used tokenizers sometimes cannot handle Tibetan at all, and when they do, they often fragment meaningful units (Thupten et al., 2021), likely as a result of the complex orthography of the written script (Tournadre & Dorje, 2003). Tibetan Buddhist texts, a common application of Tibetan-English translation in academic and commercial literature, are especially difficult due to their philosophical vocabulary, sometimes archaic syntax, and ritual features such as untranslated Sanskrit mantras (Wilson, 1998). Understanding these linguistic complexities is essential for developing effective MT systems and serves as crucial background for researchers entering this domain.
Efforts in other low-resource settings suggest a path forward. Custom tokenization has been shown to improve translation performance by reducing input length (Dewangan et al., 2025) and increasing semantic cohesion, as in Sennrich and Zhang (2019) for German-English and Miyagawa (2023) for Japanese-Ainu. Continued pretraining on domain-specific corpora has likewise improved results in automatic speech recognition (ASR) and MT tasks in other languages (DeHaven & Billa, 2022; Liu, 2025; Pang et al., 2024). These findings underscore the importance of tailored data and pretraining strategies in low-resource settings, providing methodological foundations that can be adapted for Tibetan-English translation.
This study offers the first detailed baseline results for Tibetan-English translation using compact, transformer-based models, with an emphasis on low-resource conditions and edge compatibility. Unlike prior efforts in lightweight MT research that compress large models (Jiao et al., 2020; Sanh et al., 2019; Wu et al., 2020), our approach begins with an already lightweight architecture, the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020), specifically the Small (60M parameters) and Base (220M parameters) variants. We compensate for this small model size through targeted interventions. We develop the first neural MT system dedicated exclusively to Tibetan-English translation, allowing for language-specific optimizations not possible in multilingual frameworks.
We conduct ablation experiments across multiple dataset sizes to isolate the effect of each intervention. We evaluate using BLEU, chrF (Character-level F-score) (Popović, 2015), TER (Translation Edit Rate) (Snover et al., 2006), and GEMBA (Kocmi & Federmann, 2023), as well as with a qualitative human analysis of complex translation samples, demonstrating that our combined approach of custom tokenization, continued pretraining, and domain-restricted data produces the best results. These findings challenge the assumption that translation performance scales primarily with dataset size (Gordon et al., 2021; Brants et al., 2007), and highlight the importance of dataset composition and training strategy.
Beyond reporting results, this study aims to provide a comprehensive methodological foundation and detailed linguistic analysis to support future research in Tibetan MT. Our work contributes a practical and efficient framework for Tibetan-English MT, with open-source implementations designed to serve as starting points for subsequent research. By combining lightweight architectures with linguistically-informed interventions and providing extensive evaluation details, we demonstrate that strong translation results can be achieved without large models or massive datasets, while establishing reference methodologies for researchers approaching this challenging translation task. The implications extend to other low-resource and morphologically or orthographically complex languages, offering a replicable framework for similar translation challenges.
2 Related Work
2.1 Tibetan Machine Translation and Custom Tokenization
Published efforts in Tibetan MT focus almost exclusively on translation from Tibetan to Chinese. Chen et al. (2019) introduced Long Short-Term Memory (LSTM)-based neural machine translation (NMT) with word segmentation, demonstrating effective translations with limited training data. Zhou (2024) further improved accuracy by integrating statistical alignment into NMT. Thupten et al. (2021) optimized byte-pair encoding (BPE) (Shibata et al., 1999) segmentation for Tibetan, highlighting the critical role of tokenization in low-resource languages.
Beyond Tibetan, similar tokenization interventions have proven valuable. Sennrich and Zhang (2019) found that reducing BPE vocabulary size in German-English NMT improved BLEU by nearly five points. Additionally, Dewangan et al. (2025) find that custom tokenization dramatically reduces encoded input lengths and improves performance across several languages and tasks. Miyagawa (2023) carefully separated individual morphemes in their Ainu-language training data for Japanese-Ainu translation, achieving promising results despite scarce training data. Across these cases, language-specific intervention consistently led to improved outcomes.
Tibetan-English translation is an under-researched area. Only two studies, to our knowledge, have reported results on this task, both of which use much larger models than the present study. Shu et al. (2024) reported BLEU scores of 0.00 for both Llama 3.1 405B (Dubey et al., 2024) and GPT-4o (Hurst et al., 2024). They increased this score to 0.108 for GPT-4o by implementing retrieval-augmented generation. This poor performance is particularly striking in light of much stronger results for the translation of other low-resource languages by GPT-4o reported by Nguyen et al. (2024). Nehrdich and Keutzer (2025), contrastingly, find relatively high GEMBA scores for several large, unfinetuned models. The best performing model, Gemma 3 27B, achieved a score over 50. They improve this performance to over 70 through finetuning of Gemma 3 7B. However, these models, being quite large, are unsuitable for edge devices or for real-time translation.
2.2 Dataset Scope and Continued Pretraining
Domain-specific datasets and continued pretraining have emerged as key strategies for improving MT performance in low-resource and domain-specific contexts. Chu and Wang (2018) survey domain adaptation methods in NMT, concluding that fine-tuning on in-domain data yields improved translation results. Freitag and Al-Onaizan (2016) similarly demonstrate that adapting general-purpose NMT models using small, domain-specific datasets can lead to significant gains in BLEU scores. However, both studies assume access to large general-domain corpora as a pretraining foundation. Gordon et al. (2021) provides evidence that BLEU scores scale predictably with dataset size, following a power-law distribution, an insight reinforcing Brants et al. (2007), suggesting that translation performance improves primarily through the accumulation of more data.
Continued pretraining has shown success using the Transformer architecture (Vaswani et al., 2017) as seen in Pang et al. (2024), as well as mBART (Liu, 2020) and LLaMA (Alves et al., 2024). Recent work on large language models (LLMs) offers further support for this approach. Zheng et al. (2024) demonstrate that general-domain pretraining does not transfer well to domain-specific MT tasks. Their proposed framework, DragFT, incorporates dictionary-augmented prompts, retrieval-augmented few-shot examples, and targeted fine-tuning. DragFT significantly outperforms strong baselines such as GPT-3.5 and GPT-4o, highlighting the importance of integrating structured domain knowledge during adaptation. Similarly, Verma et al. (2022) analyze zero-shot, cross-lingual domain adaptation, showing that both linguistic typology and domain alignment affect transfer effectiveness. Their work underscores the necessity of well-defined domain boundaries and curated training sets. These constraints closely mirror the Tibetan-English case, where corpora are often small, highly specialized, and require precise linguistic handling, and point to potential avenues for future work.
These themes carry over to related modalities in underrepresented languages. DeHaven and Billa (2022) find that continued pretraining on unlabeled in-language audio improves ASR performance in low-resource settings, outperforming standard semi-supervised methods while remaining computationally efficient. Liu (2025) adapts BERT for Xhosa-English translation, incorporating linguistic preprocessing and adaptive optimization to handle the unique structure of the language.
2.3 Lightweight Translation Models
A growing body of research addresses the need for lightweight MT systems suitable for deployment on edge devices or consumer hardware. These approaches typically reduce model size and inference time through techniques such as model distillation, architectural modifications, and pruning. For example, Kim and Rush (2016) introduced sequence-level knowledge distillation for NMT, showing that, in student-teacher knowledge distillation approaches, student models could achieve near-teacher performance while being significantly smaller. Jiao et al. (2020) and Sanh et al. (2019) applied distillation to BERT models, yielding compact variants like TinyBERT and DistilBERT. Wu et al. (2020) proposed the Lite Transformer architecture, which reduces the number of operations and latency without substantial loss in BLEU scores.
Additional research demonstrates the feasibility of edge-based MT in real-world settings. Watt et al. (2023) deployed a recurrent neural network (RNN) English-Hausa translation model on a Raspberry Pi and achieved a BLEU score of 73.5, demonstrating strong performance for edge device deployment. Tan et al. (2022) proposed dynamic multi-branch layers for on-device NMT, improving BLEU scores by up to 1.8 points while maintaining efficiency and reducing inference time. Le (2025) developed a privacy-preserving Vietnamese-English translation app for iOS using a quantized version of TinyLlama 1.1B (Zhang et al., 2024), highlighting the value of lightweight transformer models for real-time translation. Similarly, Gan et al. (2023) achieved real-time sign language recognition and translation on edge devices by combining shallow graph convolutional networks with structural reparameterization.
Outside of MT, small language model architectures are also showing tremendous promise in their ability to perform comparably to much larger models. Deshwal and Chawla (2024) achieve state-of-the-art results with a 3.8 billion parameter Phi-3 model, not only in providing evaluations of machine translations but also in providing feedback in other tasks. Magister et al. (2023) demonstrate that small language models, specifically T5, can be trained to produce chain-of-thought reasoning which improves their performance on arithmetic, commonsense, and symbolic reasoning tasks, an approach previously restricted to much larger architectures.
3 Background
3.1 The Tibetan Language
Fig. 1
Tibetan ergative construction example
This section examines Tibetan's unique linguistic features that challenge MT systems. Key characteristics include ergative syntax and word order that differ greatly from English, morphological structures relying on auxiliary verbs and particles for grammatical meaning, and a complex orthographic system with multi-component syllables and no word spacing. Tibetan's archaic spelling and continuous literary tradition from the 12th century potentially require models to handle both historical and contemporary texts. These complexities necessitate custom tokenization and suggest that encoder-decoder architectures like T5, which process entire sequences bidirectionally, may outperform autoregressive models for Tibetan translation.
3.1.1 Syntax and Word Order
Tibetan syntax presents several challenges for MT, particularly for models trained primarily on English or other nominative-accusative languages. Tibetan exhibits ergative alignment, marking the agent of a transitive verb rather than the object. For example, the English sentence “Lobzang drank the tea” appears in (phoneticised) Tibetan as lobsang-ki cha tung-song, which translates more literally as “By Lobzang tea drank” (Tournadre & Dorje, p. 23). While grammatically active in Tibetan, such constructions may be misinterpreted as passive by translation models not attuned to these ergative structures.
Tibetan also generally follows a Subject-Object-Verb (SOV) word order, placing the main verb at the end of the clause (p. 23). In contrast, English uses a Subject-Verb-Object (SVO) word order. This difference in structure may create difficulties for auto-regressive language models, which generate output word-by-word (or token n-gram by token n-gram) and rely heavily on recent context to predict the next token, or tokens. Since Tibetan defers the verb, a key piece of the sentence’s meaning, until the end, early predictions in a left-to-right generation process may lack crucial semantic information. This could result in unstable or incoherent translations, especially when source and target languages differ significantly in word order.
Furthermore, Tibetan verbs do not inflect for gender, number, or person, with only limited exceptions for the first person (p. 23). Instead, grammatical roles and speaker intentions are encoded through clitics, particles, and contextual cues (Tournadre, 2010). Tibetan also makes a systematic distinction between intentional and unintentional actions, a feature absent in English.
These structural features collectively suggest that encoder-decoder architectures like T5 are likely better suited to Tibetan translation. Unlike auto-regressive, decoder-only architectures, T5 models encode the entire input sequence bidirectionally before decoding, allowing them to model long-range dependencies, such as late-arriving Tibetan verbs or distributed syntactic cues, with greater accuracy. This global view of the source sentence mitigates the limitations of left-to-right generation and likely better accommodates Tibetan’s grammar and syntax.
3.1.2 Morphology and Phonetics
Tibetan is a morphologically rich language, a language which conveys multiple levels of meaning at the level of individual words (Tsarfaty et al., 2013). It is characterized by a complex network of auxiliary verbs and a phonetically intricate syllable structure (Tournadre & Dorje, p. 23). The language encodes fine-grained grammatical distinctions, including tense, aspect, evidentiality, and speaker intention, through a system of auxiliary verbs and particles.
This reliance on auxiliaries and function particles introduces challenges for MT, as much of the sentence's grammatical and pragmatic meaning is distributed non-locally and often encoded in short, frequent morphemes that are easily mistranslated or omitted. These functional elements can be polysemous and context-sensitive, requiring models to integrate information across long spans of text to disambiguate meaning accurately.
These factors underscore the importance of both domain-specific tokenization and context-aware modeling in Tibetan MT. Models like T5, which use learned subword vocabularies and global encoding of input sequences, are likely better equipped to capture these morphosyntactic cues than models relying on rigid token boundaries or local decoding. In particular, the use of a custom tokenizer trained on Tibetan corpora allows for more faithful syllable segmentation, reducing tokenized sequence lengths and improving both computational efficiency and translation fidelity.
3.1.3 Script and Orthography
Tibetan’s written syllable structure further complicates both tokenization and translation. Words are not separated by spaces, and spelling remains highly conservative and archaic (Tournadre & Dorje, p. 23). The script consists of 30 base consonants, which can be modified through stacking and affixes to form syllables, the primary unit of written Tibetan. A syllable may contain up to seven components: prefix, superscript, radical, subscript, vowel diacritic, and one or two suffixes (pp. 46–47). Despite this complexity, many valid syllables consist of just a radical and vowel (p. 47). Additionally, Tibetan includes marginal, reversed letters for transcribing Sanskrit retroflexes, especially in mantras (p. 64) (see Section 3.2.4). Punctuation in literary texts includes three main signs: a simple bar (།, rkyang-shad) similar to a comma or full stop; a double bar (།།, nyis-shad) marking section boundaries; and a start-of-text marker (༄, yig-mgo) (p. 67).
These features introduce considerable noise for standard tokenization algorithms, particularly under BPE schemes trained on mixed-language corpora without Tibetan-specific optimization. As a result, model perplexity increases for unseen syllable structures, and subword splits often misalign with meaningful morphological boundaries, likely reducing translation quality.
3.1.4 Language Family and Literary Tradition
Tibetan belongs to the Tibeto-Burman branch of the Sino-Tibetan language family (Tournadre & Dorje, p. 25). However, it is syntactically, lexically, and orthographically distinct from most other regional languages (including Chinese), making cross-lingual transfer from better-resourced Sino-Tibetan languages unlikely to be effective. The exception is Burmese, with which Tibetan shares some phonological and syntactic structures (p. 25), though these similarities are limited in practical human translation contexts.
A key feature of Tibetan is its deep and continuous literary tradition. Classical Literary Tibetan developed by the 12th century and remains largely readable to speakers of Modern Literary Tibetan, which, despite introducing some modern vocabulary, preserves much of the grammar of its classical ancestor (Tournadre & Dorje, p. 27). This high degree of diachronic mutual intelligibility is unusual and potentially enables access to a vast body of premodern texts using a unified orthographic standard.
This historical continuity has several implications for MT. First, it necessitates models that can handle archaic and formal registers, which often include complex syntax, rare vocabulary, and long-distance dependencies. Second, it offers an opportunity: because Classical and Modern Tibetan remain closely aligned in grammar and lexicon (Tournadre & Dorje, p. 27), domain adaptation techniques, such as continued pretraining on literary corpora, can enable general-purpose MT models to effectively translate both historical and more contemporary texts from within relevantly similar domains (i.e. Buddhist philosophy).
Additionally, the lack of divergence across dialects in written form contrasts with high spoken dialect diversity, making Tibetan a prime candidate for text-focused MT pipelines that prioritize orthographic stability over spoken variation.
3.2 Tibetan Buddhist Texts
This section explores the specialized domain of Tibetan Buddhist texts, which present unique translation challenges beyond general linguistic complexity. These texts operate within a sophisticated philosophical and ritual framework that generates highly technical vocabulary with context-dependent meanings. Single terms may denote different concepts across epistemological, ritual, or meditative contexts. The corpus holds particular historical significance, as many canonical Buddhist works survive only in Tibetan, making accurate translation crucial since alternative source texts for verification no longer exist. The formal register of this corpus employs specialized terminology that requires deep understanding of Buddhist philosophical frameworks rather than simple lexical knowledge. Additionally, the preservation of Sanskrit mantras and technical terms for phonemic and doctrinal precision creates a translation paradigm where some elements must remain untranslated (or translated into non-English vocabulary), complicating standard computational approaches that assume complete source-to-target conversion.
3.2.1 Philosophical and Ritual Context
Tibetan Buddhism, while superficially polytheistic, rejects the existence of a creator deity and emphasizes impermanence, karma, and rebirth (Tournadre & Dorje, p. 300). Its practices incorporate not only philosophical reflection and meditation, but also ritual elements such as mantra (spiritual formulae recited in Sanskrit), mudrā (symbolic gestures), and psychophysical yogic techniques (p. 301). These practices generate specialized vocabulary that reflects complex conceptual distinctions (i.e. stages of meditative absorption). The integration of philosophical analysis with ritual practice creates texts that simultaneously function as doctrinal exposition and liturgical instruction, requiring translators to preserve both intellectual precision and devotional register. This dual function shapes syntactic choices, with texts employing elevated honorific language for enlightened beings while maintaining technical accuracy for philosophical concepts. Such contextual complexity often necessitates domain-specific translation strategies that standard computational approaches trained on general corpora may not adequately capture.
3.2.2 Textual History and Transmission
Many canonical Buddhist texts survive today only in Tibetan. Their original Indian or Chinese versions have been lost (Tournadre & Dorje, p. 22), making the Tibetan corpus a unique and indispensable resource. This situation creates particular pressure for translation accuracy, as there may be no parallel texts in other languages for verification or comparison. The absence of source texts also means that Tibetan versions often represent the sole authoritative record of specific teachings, commentaries, and philosophical treatises. This textual isolation reinforces the importance of preserving and accurately translating Tibetan-language Buddhist materials, as errors or misinterpretations cannot be corrected through reference to alternative versions in other languages.
3.2.3 Register and Terminology
The term “ཆོས་སྐད་” ("book language", or “doctrinal language”) refers not to a separate grammar, but to a register of Literary Tibetan that employs specialized philosophical and ritual vocabulary (Tournadre & Dorje, p. 81). Classical Literary Tibetan serves as a liturgical language even outside Tibetan-speaking regions, including Mongolia, Buryatia, Tuva, Kalmykia, and parts of Nepal (p. 81). This register includes doctrinal terminology used in epistemology, ritual, psychology, and meditation (p. 81).
Crucially, Tibetan Buddhist texts often assign multiple technical senses to a single lexical item depending on context (Tournadre & Dorje, p. 81), creating challenges for disambiguation and semantic parsing. For instance, terms like "དབང་" can denote "empowerment" in ritual contexts, "control" in philosophical discourse, or "power" in political texts. This polysemy extends to fundamental concepts: "རིག་པ་" may refer to ordinary cognition in epistemological treatises, but designates the highly technical notion of “primordial awareness” in other texts. Such contextual variation means that successful translation requires not merely lexical knowledge, but deep understanding of doctrinal frameworks and textual genres. This presents a significant challenge to MT, which may struggle with this level of semantic complexity where meaning depends on philosophical context, rather than syntactic environment.
3.2.4 Mantric Language and Sanskrit Terminology
Tibetan Buddhist philosophy treats sound as semantically potent in itself. Beyond syntactic meaning, mantras are believed to operate at a phonemic level, where sound carries ritual and metaphysical efficacy (Wilson, p. 85). This principle underlies the non-translation of mantras from Sanskrit.
Beyond strictly mantric contexts, Sanskrit preservation is sometimes preferred by translators for doctrinal precision. Terms like "paṇḍita" (scholar) or "siddha" (accomplished practitioner) retain specific connotations that generic translations might obscure. The phonemic dimension creates a translation paradigm where some lexical items function as semantic units requiring preservation rather than conversion, complicating standard computational approaches that assume all source language elements should be rendered in the target language. This principle further complicates modeling Buddhist texts with standard tokenization or semantic methods, as translation models must distinguish between terms requiring phonemic preservation and those amenable to translation.
3.3 T5 for Machine Translation
The T5 model (Raffel et al., 2020) introduces a unified architecture that frames all natural language processing (NLP) tasks, including translation, as text-sequence-to-text-sequence problems. This general-purpose formulation simplifies model adaptation across tasks without requiring architectural changes. Additionally, its open-source implementations allow for full control over tokenizer customization and fine-tuning workflows.
3.3.1 The T5 Architecture and Text-to-Text Framework
T5 simplifies task-specific adaptation and contributes to improved accuracy, fluency, and cultural relevance in translation outputs when properly tuned for the target domain (Zaki, 2024). T5 thus supports experimentation with language-specific preprocessing strategies and lightweight deployment for under-resourced languages like Tibetan. This provides fertile ground for future work in other languages, or for more complex interventions.
The T5 model comes in a variety of sizes (see Table 1), spanning both the small language model and large language model categories. T5 is also available in updated variants, namely T5 Version 1.1 and T5-Efficient, which incorporate architectural and training refinements for improved performance. T5 Version 1.1 replaces ReLU with GEGLU activations (Shazeer, 2020), removes parameter sharing between the embedding and classifier layers, disables dropout during pre-training, and trains exclusively on the C4 corpus without multi-task mixing. T5-Efficient applies a “deep-narrow” design strategy, prioritizing increased model depth over width, to improve efficiency (Tay et al., 2021).
For this preliminary study, only the Small and Base sizes of the original T5 architecture are used. Future work might explore how the architectural adjustments of T5 Version 1.1 or T5-Efficient impact performance for this use-case.
Table 1
T5 model sizes

Model | Feed-Forward Dimension | #Params
Tiny | 1024 | 16M
Mini | 1536 | 31M
Small | 2048 | 60M
Base | 3072 | 220M
Large | 4096 | 738M
XL | 16384 | 3B
XXL | 65536 | 11B
3.3.2 Applications in Specialized and Low-Resource Translation Tasks
T5 has been applied to domain-specific and low-resource settings such as translating historical Japanese to modern Japanese: Usui and Komiya (2023) found that, despite limited parallel corpora, T5 outperformed earlier neural and statistical methods when contextual information (e.g., source text title) was prepended to the input. This insight is directly applicable to Tibetan-English translation, where historical and religious documents vary in register and period. Incorporating metadata or structural cues (e.g., section headers or text genre) into input sequences may similarly improve performance in future studies by helping the model disambiguate language variety and domain-specific vocabulary.
3.3.3 Comparative Performance and Model Evaluation
T5 has shown competitive performance in comparative evaluations. Akshat et al. (2024) benchmarked T5 against an LSTM architecture and ChatGPT-3.5-turbo on English-Hindi translation, highlighting T5 as a strong baseline. This comparison underscores T5’s balance between performance and efficiency, making it a suitable choice for Tibetan-English translation, where strong baselines have not been established and where training data is limited.
3.3.4 Decoding Strategies and Translation Quality
Fig. 2
Data clustering pipeline
Decoding strategy is a critical factor in T5’s translation output quality. Suryakusuma et al. (2023) found that greedy inference (Sennrich et al., 2015) consistently yielded the highest BLEU and BERTScore (Zhang et al., 2019) results for English-German translation. For Tibetan-English translation, decoding with greedy inference also offers advantages: it produces fast, deterministic outputs suitable for real-time or edge-device settings. Conveniently, T5 utilizes greedy inference by default, and it is the decoding strategy used in this study.
4 Methodology
4.1 Dataset Construction and Preprocessing
The dataset for this study is a corpus of ~ 860,000 Tibetan-English translation pairs from Buddhist texts compiled from materials generously provided by Open Pecha, an organization focused on Tibetan language technologies, and the University of Virginia’s Tibetan and Himalayan Library. The dataset reflects a range of literary and philosophical styles typical of Tibetan Buddhist literature, offering a diverse foundation for translation modeling, even within the limited domain of Buddhist texts. While non-Buddhist material existed in the original dataset, one of the primary academic and commercial use-cases for translation from Tibetan into English is for the translation of Buddhist material, so in this preliminary study, we restrict ourselves to that domain. We leave the exploration of curriculum training or domain adaptation, such as in Verma et al. (2022), for future work.
Prior to this study, the dataset was cleaned for mistranslations and misalignment, and de-duplicated by Open Pecha. To identify and extract Buddhist material from the larger corpus of translation pairs, the data was first clustered and labeled by topic. This was done using an adaptation of the Text Clustering pipeline developed by Hugging Face (Allal & von Werra, 2024) as shown in Fig. 2.
The English portions of each translation pair were first embedded into vectors using the MiniLM embedding model (Wang et al., 2020) and projected into two dimensions using the UMAP algorithm (McInnes et al., 2018). The projected embeddings were then clustered using HDBSCAN (McInnes et al., 2017). The hyperparameters for HDBSCAN were automatically optimized using Optuna (Akiba et al., 2019). Finally, a random sample of ten examples was taken from each cluster and summarized using Mixtral-8x7B-Instruct-v0.1 (Mistral 2023) with the prompt: "Use three words total (comma separated) to describe general topics in above texts. Under no circumstances use enumeration. Example format: Tree, Cat, Fireman". The generated summaries were then plotted for visual inspection. See Fig. 3 for an example plot of a subset of the data. Note that overlapping labels are a known issue in the pipeline.
After visual inspection, category labels and their content were more carefully examined and manually flagged as either pertaining to Buddhism or not.
The customized implementation of the pipeline used for this study is available on the Python Package Index at: https://pypi.org/project/easy-text-clustering/
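For illustration, the core embed-project-cluster steps can also be reproduced directly with the underlying libraries. The sketch below is a minimal, hypothetical version assuming the sentence-transformers, umap-learn, and hdbscan packages; the embedding model name and clustering hyperparameters are illustrative rather than the exact values used, and the Optuna tuning and LLM labeling steps are omitted.

```python
# Minimal sketch of the embed -> project -> cluster steps described above.
# Embedding model name and clustering hyperparameters are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

def cluster_english_sides(english_sentences):
    # 1. Embed the English half of each translation pair with a MiniLM model.
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = embedder.encode(english_sentences, show_progress_bar=True)

    # 2. Project the embeddings into two dimensions with UMAP.
    projection = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)

    # 3. Cluster the projected points with HDBSCAN (hyperparameters were tuned
    #    with Optuna in the actual pipeline; fixed values are used here).
    clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=5)
    labels = clusterer.fit_predict(projection)

    return projection, labels
```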
The data was then split into training, development, and test sets consisting of 85%, 5%, and 10% of the total data respectively. These percentages allow for a large training set, while preserving a small, but reasonably sized, set for evaluation during training.
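As a sketch, assuming the corpus is loaded as a Hugging Face datasets.Dataset (the variable name and random seed below are placeholders), the 85/5/10 split can be produced as follows:

```python
# Hypothetical 85/5/10 split; `pairs` is a placeholder for the full dataset.
from datasets import Dataset

splits = pairs.train_test_split(test_size=0.15, seed=42)             # 85% train / 15% held out
dev_test = splits["test"].train_test_split(test_size=2 / 3, seed=42)  # 5% dev / 10% test
train_set, dev_set, test_set = splits["train"], dev_test["train"], dev_test["test"]
```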
Fig. 3
Example of plotted clusters with labels
4.2 Getok: Custom Tokenizer Development
Standard subword tokenizers perform poorly on Tibetan, frequently treating entire sentences as unknown tokens. A minimal fix involves expanding the vocabulary to include all Tibetan Unicode characters, enabling basic compatibility during tokenization and decoding. However, this character-level approach yields excessively long input sequences, limiting model efficiency. To address this, a custom tokenizer was trained on Tibetan text to learn subword units optimized for this script, as well as on English translations of those texts to accommodate the unique vocabulary of the training corpus. The dataset used to train this tokenizer was the training split of the dataset described in the previous section. This ensures optimal data hygiene in the evaluation portion of this study.
The tokenizer was trained using the Byte Pair Encoding algorithm, with a vocabulary size of 32,000, matching that of the standard T5 tokenizer. This tokenizer is publicly available at: https://huggingface.co/billingsmoore/getok-v0. The tokenizer’s name, “Getok”, is Tibetan for “virtuous thoughts”.
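A minimal sketch of training such a tokenizer with the Hugging Face tokenizers library is shown below; the file paths, special tokens, and pre-tokenization scheme are illustrative assumptions rather than the exact Getok training configuration.

```python
# Sketch of BPE tokenizer training on the Tibetan and English training split.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # pre-tokenization scheme is an assumption

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # matches the standard T5 vocabulary size
    special_tokens=["<pad>", "</s>", "<unk>"],
)

# Placeholder file names for the Tibetan and English sides of the training split.
tokenizer.train(["train.bo.txt", "train.en.txt"], trainer)
tokenizer.save("getok.json")
```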
4.3 MLotsawa: Model Architecture and Training
The models in this study were trained using a two-stage process consisting of continued pretraining followed by supervised fine-tuning (see Fig. 4).
This study introduces two open-source transformer-based models for Tibetan-English translation, both using a finetuned T5 architecture. The first model uses the T5-small variant (~ 60 million parameters), and the second uses T5-base (~ 220 million parameters). The finetuned T5-small model is available at: https://huggingface.co/billingsmoore/mlotsawa-ground-small. The finetuned T5-base model is available at: https://huggingface.co/billingsmoore/mlotsawa-ground-base.
These models were chosen to balance performance with computational efficiency, making them suitable for low-resource settings and feasible to deploy or further finetune on commercially available hardware. The models are named MLotsawa-Ground, from ‘ML’ for machine learning, and “lotsawa”, the Tibetan word for “translator”. The “ground” portion of their names is to reflect their position as a starting point for future research.
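As a usage sketch, the released checkpoints can be loaded with the transformers library. Whether a task prefix is required is not specified here, so the example passes the raw Tibetan text (an excerpt quoted in Section 5.1.2) directly; this illustrates the interface rather than prescribing exact preprocessing.

```python
# Minimal inference sketch for the released MLotsawa-Ground checkpoints.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("billingsmoore/mlotsawa-ground-small")
model = AutoModelForSeq2SeqLM.from_pretrained("billingsmoore/mlotsawa-ground-small")

inputs = tokenizer("ཀ་དག་སྤྲོས་བྲལ་འོད་གསལ་རིག་པའི་དབྱིངས༔", return_tensors="pt")
# generate() performs greedy decoding by default (num_beams=1, do_sample=False),
# matching the decoding strategy discussed in Section 3.3.4.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```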
4.3.1 Continued Pretraining
Prior to fine-tuning, the models were subjected to one epoch of unsupervised continued pretraining on the Tibetan-English corpus with a learning rate of 3e-4, the learning rate used in Raffel et al. (2020), and a batch size of 4 on an RTX 4070 GPU. The pretraining objective was T5’s original span corruption denoising task (Raffel et al., 2020), in which random spans of input tokens are masked and the model learns to reconstruct the missing content. This phase served two purposes: it allowed the model to adapt to the custom tokenizer trained on Tibetan script, and it facilitated initial exposure to the linguistic and structural properties of Tibetan.
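To make the objective concrete, the sketch below shows a simplified, hypothetical version of span corruption: masked spans are replaced in the encoder input with sentinel tokens and reproduced in the target. The corruption rate and span length follow the defaults in Raffel et al. (2020); the actual preprocessing code used in this study may differ.

```python
# Simplified illustration of T5-style span corruption (not the exact implementation).
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Return (input_tokens, target_tokens) with masked spans replaced by sentinels."""
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span_len)):
            masked.add(i)

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            # Replace the whole masked span with one sentinel in the input,
            # and emit the sentinel followed by the original span in the target.
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```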
Fig. 4
Training procedure
4.3.2 Fine-Tuning
Following pretraining, the small model was fine-tuned for 50 epochs on the full set of aligned translation pairs. At 50 epochs, performance continued to improve, but very slowly. The base model was finetuned for 44 epochs, after which performance ceased to improve. The optimizer used was Adafactor (Shazeer & Stern, 2018), with an initial learning rate of 3e-4; both of these hyperparameters follow Raffel et al. (2020). The batch size used was 8, and training was performed with an RTX 4070 GPU.
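A minimal fine-tuning sketch with the Hugging Face transformers Trainer is given below; the pretrained checkpoint path and dataset variables are placeholders, while the batch size, learning rate, and Adafactor optimizer match the values reported above.

```python
# Sketch of the supervised fine-tuning stage; dataset variables are placeholders.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("billingsmoore/getok-v0")
model = AutoModelForSeq2SeqLM.from_pretrained("pretrained-checkpoint")  # output of continued pretraining (placeholder path)

args = Seq2SeqTrainingArguments(
    output_dir="mlotsawa-ground-small",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=50,
    optim="adafactor",  # Adafactor, following Raffel et al. (2020)
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # tokenized translation pairs (placeholder)
    eval_dataset=tokenized_dev,     # development split (placeholder)
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```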
4.4 Experiments
4.4.1 Effects of Custom Tokenizer on Tokenized Input Lengths
The models presented in this study rely on a custom tokenizer to produce efficient and effective encodings of input text. In this experiment, we investigate whether, and to what extent, a custom tokenizer reduces the length of tokenized inputs. We compare tokenized sequence lengths using: a “baseline” tokenizer, the default tokenizer of T5 with added Tibetan Unicode characters (U+0F00–U+0FFF); and a “custom” tokenizer, the custom tokenizer described in Section 4.2.
Both tokenizers are applied to the Tibetan sentences from the test split of the dataset described in Section 4.1. We then compute and compare the mean tokenized lengths.
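A sketch of this comparison is shown below; the baseline construction (adding the Tibetan Unicode block to the default T5 tokenizer) is simplified, and the single example sentence (taken from Section 5.1.2) stands in for the Tibetan side of the full test split.

```python
# Sketch of the tokenized-length comparison between baseline and custom tokenizers.
import numpy as np
from transformers import AutoTokenizer

baseline = AutoTokenizer.from_pretrained("t5-small")
baseline.add_tokens([chr(cp) for cp in range(0x0F00, 0x1000)])  # add the Tibetan Unicode block
custom = AutoTokenizer.from_pretrained("billingsmoore/getok-v0")

# Placeholder: in the actual experiment this is the Tibetan side of the test split.
tibetan_test_sentences = ["གྲུབ་བརྒྱའི་སྤྱི་མེས་པཎ་ཆེན་བི་མ་ལ།"]

def mean_tokenized_length(tok, sentences):
    return np.mean([len(tok(s).input_ids) for s in sentences])

print("baseline:", mean_tokenized_length(baseline, tibetan_test_sentences))
print("custom:  ", mean_tokenized_length(custom, tibetan_test_sentences))
```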
4.4.2 Ablation Study
The MLotsawa-Ground models presented in this study utilize Getok, a custom tokenizer, and continued pretraining prior to finetuning. We perform an ablation study to determine the impact of each of these steps.
A T5-small model was finetuned for three epochs for each of the following combinations: no custom tokenizer (Tibetan script characters were simply added to the vocabulary of the default T5 tokenizer), and no continued pretraining (NT + NP); no custom tokenizer with pretraining (NT + P); custom tokenizer, but no pretraining (T + NP); and custom tokenizer and pretraining (T + P).
We perform this study at two scales: 100,000 training samples and 400,000 training samples, in order to evaluate the relative performance of these combinations at different scales of training data. This is of particular interest to low-resource language research, where extremely small sets of training data are common. Each set of training samples is a random subset of the training data described in Section 4.1. The models were evaluated on the test data described in that same section. The custom tokenizer used is the Getok tokenizer, described in Section 4.2.
4.4.3 Domain-Specific vs. General Data Experiment
The MLotsawa-Ground models presented in this study were trained only on Buddhist material. Related work (described in Section 2.2) suggests that this may harm performance. This experiment investigates the impact of the restriction of training data in this way.
There are two datasets used for this experiment. The “restricted” dataset is the dataset described in Section 4.1 of this paper. The “unrestricted” dataset contains the restricted dataset, and additional data from the following sources: non-Buddhist material provided by Open Pecha, Tatoeba (Tiedemann, 2020), TED2020 (Reimers & Gurevych, 2020), and No Language Left Behind (Fan et al., 2021; Schwenk et al., 2021), all of which, with the exception of the Open Pecha data, are available through OPUS (Tiedemann, 2012). This produced 1,531,208 training samples, which is significantly larger than the 861,417 samples in the restricted dataset.
The restricted dataset was used to train the T5-small MLotsawa-Ground model presented in this study, whose three-epoch checkpoint serves as our baseline (“restricted”) model for this experiment. The unrestricted dataset was used to train a T5-small (“unrestricted”) model using an identical training procedure: the data was used to produce a custom tokenizer; the model was pretrained on the data for one epoch using the custom tokenizer, then finetuned on the translation pairs. The finetuning was performed for 3 epochs.
4.5 Evaluation Framework
The automated metrics used in this study are BLEU (Bilingual Evaluation Understudy), chrF (Character-level F-score), TER (Translation Edit Rate), and GEMBA (GPT Estimation Metric Based Assessment). BLEU, chrF, and TER all rely on the surface-level features of the output string itself (i.e. word order, character errors). In contrast, GEMBA relies on the use of an LLM to give feedback.
For this preliminary study, GEMBA scores were only collected for 500 random samples from the test set for the main MLotsawa-Ground models presented in this paper. This was not performed for models in the other portions of this study. Because this sample size is relatively small, a 95% confidence interval is provided. Additional statistical analysis is performed in Section 5.1.3.
Additionally, we provide a rudimentary qualitative analysis of example translations for a more intuitive understanding of the models’ output quality.
4.5.1 Automatic Metrics
BLEU (Bilingual Evaluation Understudy), introduced by Papineni et al. (2002), is a widely adopted automated metric for evaluating MT systems through n-gram overlap analysis between candidate and reference translations. The algorithm calculates modified n-gram precision scores that limit each n-gram count to its maximum frequency across reference translations, preventing artificially inflated scores from repetitive content. BLEU incorporates a brevity penalty

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}$

to discourage overly short outputs, where $r$ and $c$ represent reference and candidate lengths respectively. The final score combines the geometric mean of n-gram precisions with the brevity penalty:

$\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$

where $p_n$ denotes the precision of n-grams of length $n$, producing values from 0 to 1. Despite its computational efficiency, BLEU's reliance on exact token matching limits its ability to capture semantic equivalence through paraphrasing, and its precision-only focus may undervalue valid alternative translations.
chrF (Character n-gram F-score), introduced by Popović (2015), is an automated evaluation metric that measures character-level similarity between candidate and reference texts, addressing some limitations of BLEU. chrF calculates character n-gram precision and recall between candidate and reference sequences. The metric computes the F-score as the harmonic mean of precision and recall:

$\text{chrF}_\beta = (1 + \beta^2) \cdot \frac{\text{chrP} \cdot \text{chrR}}{\beta^2 \cdot \text{chrP} + \text{chrR}}$

where $\text{chrP}$ and $\text{chrR}$ represent character n-gram precision and recall respectively, and $\beta$ controls the relative importance of recall versus precision. Unlike BLEU's precision-only approach, chrF's inclusion of recall makes it more sensitive to content coverage.
TER (Translation Edit Rate) is an automated evaluation metric that measures the minimum number of edits required to transform a machine translation into a reference translation, expressed as a percentage of the reference length. Developed by Snover et al. (2006), TER calculates the edit distance using four basic operations: insertion, deletion, substitution, and phrasal shifts (movement of contiguous word sequences). The metric is computed as:

$\text{TER} = \frac{\text{number of edits}}{\text{average number of reference words}} \times 100$

TER ranges from 0 (perfect match) to potentially over 100 (more edits than reference words), making lower scores preferable. The inclusion of phrasal shifts distinguishes TER from simpler edit distance metrics, as it can capture reordering phenomena common in translation. However, TER's reliance on exact string matching limits its ability to recognize semantically equivalent variations, and its edit-based approach may penalize valid stylistic differences.
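In practice, all three surface-level metrics can be computed with the sacrebleu package, as in the sketch below. The toy hypothesis and reference strings are illustrative placeholders; this shows the metric calls rather than the exact evaluation script used in this study.

```python
# Sketch of computing BLEU, chrF, and TER with sacrebleu.
import sacrebleu

# Toy example; in the actual evaluation these are the model outputs and
# reference translations for the full test split.
hypotheses = ["To you I pray and increase our lifespan and merit."]
references = ["To you I pray! And increase my lifespan and merit!"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}  TER: {ter.score:.2f}")
```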
GEMBA (GPT Estimation Metric-Based Assessment), introduced by Kocmi and Federmann (2023), is an LLM-based evaluation metric that leverages large language models to assess translation quality through natural language prompting, rather than mathematical computation. The metric uses an LLM to evaluate translation quality based on criteria such as accuracy, fluency, and adequacy, producing either numerical scores or qualitative assessments.
GEMBA attempts to capture semantic equivalence, contextual appropriateness, and translation quality aspects that require linguistic understanding. The metric's effectiveness depends importantly on model scale, and represents a paradigmatic shift from algorithmic computation to leveraging the emergent evaluation capabilities of large language models trained on diverse multilingual data.
Here, Claude 4 Sonnet (Anthropic, 2025) is prompted to provide a score for a given translation. Following Kocmi and Federmann (2023), we use the prompt that they find best correlates with human judgments:
“Score the following translation from {source_lang} to {target_lang} with respect to the human reference on a continuous scale from 0 to 100, where a score of zero means "no meaning preserved" and a score of one hundred means "perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: {reference_seg}
{target_lang} translation: "{target_seg}"
Score:”
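A minimal sketch of this scoring loop with the Anthropic Python SDK is shown below; the model identifier and the numeric parsing of the response are assumptions, not the exact evaluation code used here.

```python
# Sketch of GEMBA scoring with Claude; model id and response parsing are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GEMBA_PROMPT = """Score the following translation from {source_lang} to {target_lang} with respect to the human reference on a continuous scale from 0 to 100, where a score of zero means "no meaning preserved" and a score of one hundred means "perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: {reference_seg}
{target_lang} translation: "{target_seg}"
Score:"""

def gemba_score(source_seg, reference_seg, target_seg):
    prompt = GEMBA_PROMPT.format(
        source_lang="Tibetan",
        target_lang="English",
        source_seg=source_seg,
        reference_seg=reference_seg,
        target_seg=target_seg,
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed identifier for Claude 4 Sonnet
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.content[0].text.strip())
```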
4.5.2 Qualitative Assessment Protocol
Given the limitations of automated metrics in capturing translation quality for specialized religious texts, for this preliminary study, we developed a rudimentary qualitative assessment framework for Tibetan Buddhist literature.
We selected two representative excerpts based on: (1) domain-specific technical vocabulary, (2) complex sentence structures typical of classical Tibetan literature, (3) theological concepts requiring cultural knowledge, and (4) devotional register requiring linguistic elevation. The texts included a devotional prayer (Advice on Bending Mind Toward the Good by Khenchen Ngawang Palzang) and a liturgical invocation (Protection from All Fears: A Prayer to Ārya Tārā by Sera Khandro), representing two major genres within Tibetan Buddhist literature. Neither of these two texts is present in the dataset described in Section 4.1 and thus, has not been seen by the model previously.
Our qualitative assessment examined four primary dimensions:
1. Terminological Precision: Models' handling of specialized vocabulary, as well as distinguishing between terms requiring preservation versus translation.
2. Doctrinal Accuracy: Preservation of theological and philosophical concepts, examining whether translations maintained specialized meanings rather than literal renderings.
3. Register Appropriateness: Whether translations preserved the register of the source text through appropriate vocabulary and tone.
4. Cultural Competency: Models' ability to recognize when direct translation would result in semantic loss and substitute culturally appropriate English equivalents established within Buddhist scholarly discourse.
Each translation output was compared against a human reference translation, focusing on semantic preservation rather than strict adherence to reference text wording. Our qualitative assessment acknowledges that specialized religious texts often permit multiple valid translations. We therefore focused on identifying patterns of competency and consistent handling of domain-specific challenges, rather than establishing absolute quality rankings.
For this preliminary study, assessment was conducted by the author and independently reviewed by translation experts with specific expertise in Tibetan-English translation who are not affiliated with this research. This analysis, given its small scale, should be taken as suggestive, but not definitive. Future work should utilize a more robust framework, such as the Multidimensional Quality Metric framework (Lommel, 2013), and involve a meaningfully large number of human evaluators.
5 Evaluation and Results
5.1 MLotsawa-Ground Performance
5.1.1 Small vs. Base Results
Table 2
Automatic evaluation scores on held-out test data

Model | BLEU | chrF | TER | GEMBA
Small | 3.54 | 19.89 | 87.58 | 76.43 ± 1.24
Base | 4.08 | 20.85 | 86.74 | 76.00 ± 1.24
Table 2 reports automatic evaluation scores for the Small and Base MLotsawa-Ground models on the held-out test set. The Base model achieved slightly higher BLEU (4.08 vs. 3.54) and chrF scores (20.85 vs. 19.89) compared to the Small model, and a lower TER (86.74 vs. 87.58). GEMBA scores were nearly identical, with Small at 76.43 ± 1.24 and Base at 76.00 ± 1.24.
5.1.2 Qualitative Translation Analysis
Excerpt 1: Advice on Bending Mind Toward the Good
by Khenchen Ngawang Palzang | Human translation: Joseph McClellan (2024) [Licensed under CC BY-NC 4.0]

Original:
གྲུབ་བརྒྱའི་སྤྱི་མེས་པཎ་ཆེན་བི་མ་ལ། །བསམ་བཞིན་སྤྲུལ་པའི་ཟློས་གར་ཉེར་བཟུང་བ། །རྒྱལ་བའི་དབང་པོ་ཀློང་ཆེན་རབ་འབྱམས་པ། །འདི་ཙམ་མ་ཡིན་ཚེ་རབས་གཏན་གྱི་སྐྱབས། །

Human Translation:
Grandsire of a hundred siddhas—great scholar, Vimalamitra,
And you who fully embraced the spectacle of intentional emanation,
Lord of conquerors, Longchen Rabjam—You are my unfailing refuge;
not just now, but in the concatenation of my lives.

Small Machine Translation:
Great paṇḍita Vimalamitra, forefather of hundreds of siddhas,
Manifesting in the form of a play, Lord of the victorious ones, Longchen Rabjam,
Not just this but the constant refuge throughout all my lives,

Base Machine Translation:
Great pạṇdita Vimalamitra, forefather of a hundred siddhas,
Intentionally you took on the dance of your emanation,
Lord of the victorious ones, Longchen Rabjam,
Not just this but all our lives to come
Excerpt 2: Protection from All Fears: A Prayer to Ārya Tārā
by Sera Khandro | Human translation: Adam Pearcey (2025) [Licensed under CC BY-NC 4.0]

Original:
ཀ་དག་སྤྲོས་བྲལ་འོད་གསལ་རིག་པའི་དབྱིངས༔ལྷུན་གྲུབ་སྣང་ཆ་མ་འགགས་སྒྱུ་འཕྲུལ་གར༔ཐུགས་རྗེ་རྒྱལ་བ་ཀུན་གྱི་ཡུམ་གཅིག་མ༔རྗེ་བཙུན་ཨཱརྱ་ཏཱ་རེ་ཚེ་སྦྱིན་དཔལ༔གསོལ་བ་འདེབས་སོ་རླུང་སེམས་དབང་བསྡུས་ནས༔ཚེ་དང་བསོད་ནམས་འཕེལ་བར་མཛད་དུ་གསོལ༔

Human Translation:
Out of the primordially pure unelaborate space of luminous awareness,
As the magical manifestation of unobstructed spontaneous presence,
Arises the compassionate one, the one and only mother of all victorious ones,
Noble Lady Ārya Tārā, glorious bestower of longevity,
To you I pray!
Take control of my vital winds and mind,
And increase my lifespan and merit!

Small Machine Translation:
Within the space of awareness—primordial purity free of elaboration—
Illusory dance of spontaneously present appearances unceasing
Only mother of all the buddhas of compassion Noble Ārya Tārā glorious Tārā
To you I pray: bringing the vāyu-mind under control
And increase our lifespan and merit.

Base Machine Translation:
The clear light of primordial purity simplicity, the space of awareness,
Dance-like display of unceasing spontaneously Present Appearances,
Only mother of all the victorious ones of compassion Noble Ārya Tārā glorious Amitāyus
To you I pray: gather the empowerments of wind-energy and mind
Increase our lifespan and merit!
In both excerpts, it is worth noting that basic sentence structures are correctly rendered in English, despite the difference in standard word order between Tibetan and English. However, there are also notable instances of grammatical incoherence. This is especially notable given the relatively complex sentence structures that are found in these excerpts.
In the first excerpt, from Advice on Bending Mind Toward the Good, both models correctly handle specialized terminology: "paṇḍita", the Sanskrit term for a highly educated scholar, retains Sanskrit diacriticals; "གྲུབ་པ" renders as "siddhas" (not generic "accomplished ones"), and "རྒྱལ་བའི་དབང་པོ" translates as "Lord of the victorious ones" rather than literal "king's lord." The models distinguish between terms requiring preservation (Sanskrit "paṇḍita") versus translation (Tibetan "སྤྲུལ་པ" as "emanation"). Both also maintain a devotional register through elevated vocabulary (e.g., "Great," "Lord").
Critical theological concepts show accurate rendering: "སྤྲུལ་པའི་ཟློས་གར་" becomes "dance of emanation"/"spectacle of intentional emanation," capturing the Buddhist concept of an enlightened person’s deliberate manifestation in the world. The temporal phrase "ཚེ་རབས་གཏན་གྱི་སྐྱབས་" yields "throughout all my lives"/"all our lives to come," both preserving the notion of multi-lifetime devotion, important within the context of the Buddhist doctrine of reincarnation.
With respect to the second excerpt, from Protection from All Fears: A Prayer to Ārya Tārā, the models display sophisticated handling of technical terminology and religious concepts that standard metrics like BLEU, chrF, and TER, would likely undervalue.
The phrase "ཀ་དག་སྤྲོས་བྲལ་འོད་གསལ་རིག་པའི་དབྱིངས་" contains four distinct technical terms. The Small model renders this as "space of awareness—primordial purity free of elaboration," correctly identifying "རིག་པའི་དབྱིངས" as "space of awareness" (not the literal, and somewhat generic, "consciousness space") and "ཀ་དག་" as "primordial purity", not simply "pure", better aligning with the norm among human translators. The Base model translates "འོད་གསལ་" as "clear light," the precise English equivalent for this term, while maintaining "primordial purity" and "space of awareness." Both preserve the technical meaning that more literal translation would miss.
For "རླུང་སེམས་དབང་བསྡུས་", the Small model uses "vāyu-mind under control" while the Base uses "empowerments of wind-energy and mind." Both recognize "རླུང་" as the technical term for subtle, metaphysical energy (not merely the literal meaning of "wind"), with the Small model's "vāyu" preserving the Sanskrit terminology sometimes preferred by human translators, and the Base model's "wind-energy" clarifying the concept more explicitly. The term "དབང་བསྡུས་" shows divergent but defensible interpretations: "under control" (Small) versus "gather the empowerments" (Base), both capturing the sense of mastery over subtle energies typical of yogic practice.
The Base model's substitution of the name "Amitāyus" for the name of the goddess "Tārā" in the final line, while incorrect, is an interesting mistake which relies on a basic conception of Tibetan Buddhist iconography: both deities are associated with longevity-focused practices. The Small model correctly maintains "Tārā" throughout. The specialized terminology "སྒྱུ་འཕྲུལ་གར" renders as "illusory dance" (Small) and "dance-like display" (Base), both accurately conveying the concept of appearance as an illusion.
5.1.3 GEMBA Score Analysis
Fig. 5
Comparison of GEMBA score distributions
Both models were evaluated on 500 samples from the test split of the dataset using Claude 4 Sonnet. For the Small model, the mean GEMBA score was 76.43 with a standard deviation of 14.15. At the 95% confidence level, the margin of error was calculated to be ± 1.24 points, yielding a confidence interval of [75.19, 77.67] for the true mean. For the base model the sample mean GEMBA score was 76.00, with a standard deviation of 14.16. The margin of error is ± 1.24 points, resulting in a 95% confidence interval of [74.76, 77.24] for the true mean.
Fig. 6
Correlation of GEMBA scores
As shown in Fig. 5, GEMBA scores for both models are left-skewed, indicating that while most translations are of reasonably high quality, a subset of inputs result in poor performance. The Base model appears to handle these difficult cases slightly better than the Small model, yielding higher scores on the low-performing outliers. At the same time, the Small model appears to produce more reliable results on the majority of easier inputs.
As shown in Fig. 6, the models' predictions are strongly correlated (r = 0.908), indicating that they agree on which material is most difficult.
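This agreement can be measured directly from the paired per-sample scores. A brief sketch, assuming the two score lists are aligned by test sample (scipy is used here for the Pearson coefficient; variable names are illustrative):

from scipy.stats import pearsonr

def score_agreement(small_scores, base_scores):
    """Pearson correlation between the two models' per-sample GEMBA scores."""
    r, p_value = pearsonr(small_scores, base_scores)
    return r, p_value

# r close to 1 (0.908 in our evaluation) means both models struggle on the same inputs.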
5.2 Effects of Custom Tokenizer on Tokenized Input Lengths
The mean length of tokenized inputs for the baseline tokenizer was 175.888 tokens; for the custom tokenizer it was 29.711 tokens (see Fig. 7). The custom tokenizer thus reduced the mean tokenized length by 83.11%.
Fig. 7
Tokenized input mean length comparison
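The comparison can be reproduced by tokenizing the same source texts with both tokenizers and averaging the sequence lengths. A minimal sketch assuming the Hugging Face transformers API; the custom tokenizer path and the example sentence are placeholders rather than published artifacts:

from transformers import AutoTokenizer

def mean_token_length(tokenizer, texts):
    """Average number of token IDs produced per input text."""
    return sum(len(tokenizer(t)["input_ids"]) for t in texts) / len(texts)

baseline = AutoTokenizer.from_pretrained("t5-small")       # standard T5 tokenizer
custom = AutoTokenizer.from_pretrained("path/to/getok")    # hypothetical local path

texts = ["བཀྲ་ཤིས་བདེ་ལེགས།"]                                 # Tibetan-side test sentences
reduction = 1 - mean_token_length(custom, texts) / mean_token_length(baseline, texts)
print(f"Mean tokenized length reduced by {reduction:.2%}")  # reported: 83.11%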
5.3 Ablation Study
The BLEU, chrF, and TER scores on the test data for both scales of the ablation study are shown in Tables 3 and 4.
Table 3
Ablation study results, 100k samples

Model     BLEU    chrF    TER
NT + NP   0.072   5.260   96.934
NT + P    0.101   6.111   96.126
T + NP    0.489   8.033   101.035
T + P     0.474   8.100   100.661
Table 4
Ablation study results, 400k samples

Model     BLEU    chrF    TER
NT + NP   0.230   7.190   95.246
NT + P    0.832   11.892  92.703
T + NP    1.056   9.159   99.268
T + P     1.142   9.526   98.692
At the 100,000 training sample scale, the model with neither a custom tokenizer nor continued pretraining (NT + NP) performed worst on two of the three metrics, achieving a BLEU score of 0.072, a chrF of 5.260, and a TER of 96.934. Adding continued pretraining alone (NT + P) led to modest gains, with BLEU rising to 0.101, chrF to 6.111, and TER improving slightly to 96.126. Introducing a custom tokenizer without pretraining (T + NP) produced a larger improvement, yielding a BLEU of 0.489 and chrF of 8.033, although TER worsened to 101.035. The combination of both custom tokenization and pretraining (T + P) produced similar BLEU (0.474) and chrF (8.100) scores to T + NP, with a marginally better TER of 100.661.
Fig. 8
Ablation study results
At the larger scale of 400,000 training samples, the NT + NP model remained the weakest configuration, with the lowest BLEU (0.230) and chrF (7.190) and a TER of 95.246, while NT + P showed substantial improvements (BLEU 0.832, chrF 11.892, TER 92.703), achieving the highest chrF and lowest TER of any configuration. The T + NP model performed better than NT + NP, but worse than NT + P on most metrics (BLEU 1.056, chrF 9.159, TER 99.268). The T + P model achieved the highest BLEU (1.142), with a chrF of 9.526 and a TER of 98.692.
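For reference, all three surface metrics can be computed at the corpus level with the sacrebleu library; the snippet below is a sketch of such an evaluation, with the hypothesis and reference strings shown only as placeholders rather than actual system outputs:

from sacrebleu.metrics import BLEU, CHRF, TER

def evaluate(hypotheses, references):
    """Corpus-level BLEU, chrF, and TER against one reference per segment."""
    refs = [references]  # sacrebleu expects a list of reference streams
    return {
        "BLEU": BLEU().corpus_score(hypotheses, refs).score,
        "chrF": CHRF().corpus_score(hypotheses, refs).score,
        "TER": TER().corpus_score(hypotheses, refs).score,
    }

print(evaluate(["the lama bestows blessings"], ["the lama gives blessings"]))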
5.4 Domain-Specific vs. General Data Experiment
Table 5
Results of domain-specific data experiment

Model          BLEU   chrF    TER
Unrestricted   1.29   10.35   99.30
Restricted     2.12   14.26   92.80
Fig. 9
Domain-specific data experiment results
The model trained on the domain-specific restricted dataset outperformed the model trained on the broader unrestricted dataset across all evaluation metrics. The restricted model achieved a BLEU score of 2.12, compared to 1.29 for the unrestricted model. Similarly, the restricted model achieved a chrF score of 14.26, surpassing the 10.35 scored by the unrestricted model. The restricted model also performed better with respect to TER, with a lower score of 92.80 compared to 99.30 for the unrestricted model.
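The restriction itself is a straightforward filtering step. A sketch under the assumption that each parallel sample carries a field identifying its source collection (the field name, dataset path, and label values below are hypothetical):

from datasets import load_from_disk

dataset = load_from_disk("path/to/parallel_corpus")            # illustrative path
buddhist_sources = {"buddhist_canon", "prayers", "treatises"}  # hypothetical labels

restricted = dataset.filter(lambda row: row["source"] in buddhist_sources)
print(len(dataset), "->", len(restricted), "samples after domain restriction")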
6 Discussion
6.1 Overall Performance and Evaluation Challenges
Our models demonstrate mixed performance across different evaluation metrics, highlighting fundamental challenges in assessing MT quality for specialized, low-resource domains. While BLEU, chrF, and TER scores remain modest, both models surpass the BLEU score of 0.108 reported by Shu et al. (2024) for GPT-4o with retrieval-augmented generation. More notably, the GEMBA scores appear remarkably high, comparable to those reported by Nehrdich and Keutzer (2025) for a 7-billion-parameter model that is more than 30 times larger than our Base model and over 100 times larger than our Small model.
This stark divergence between automated metric scores raises critical questions about metric reliability for Tibetan-English translation. The disconnect suggests that traditional n-gram based metrics may inadequately capture translation quality in this domain, potentially missing semantic accuracy and cultural nuance that human evaluators or LLM-based metrics like GEMBA can better assess. Qualitative inspection of sample outputs reveals that our models produce relatively semantically accurate and coherent translations of Tibetan Buddhist texts, supporting the hypothesis that automated metrics provide an incomplete picture of actual performance.
These evaluation challenges underscore the need for more comprehensive assessment frameworks in specialized translation domains. The apparent contradiction between metric types indicates that future work must carefully consider which evaluation approaches most accurately reflect end-user needs and translation utility.
6.2 Model Efficiency and the Limits of Scale
Perhaps our most striking finding is that model size does not consistently translate to improved performance in this specialized domain. The Small and Base models achieve comparable GEMBA scores with overlapping confidence intervals, suggesting that careful training and domain adaptation may be more critical than raw parameter count for Tibetan translation.
Analysis of GEMBA score distributions reveals nuanced performance patterns across both models. While scores are generally left-skewed, indicating that most translations achieve reasonable quality, both models struggle with a subset of difficult inputs. Interestingly, the Base model handles these challenging cases slightly better, while the Small model produces more consistent results on typical inputs. This suggests that model size may primarily benefit edge cases, rather than general translation quality.
These findings challenge the common assumption that larger models will yield better specialized translation performance. The lack of significant GEMBA improvement despite higher BLEU scores for the Base model suggests that size-related gains may reflect token-level changes that do not meaningfully impact comprehension or user satisfaction. This has important implications for deployment in resource-constrained environments, where smaller, efficiently trained models may offer optimal cost-performance trade-offs.
6.3 Technical Interventions: Tokenization and Continued Pretraining
Our ablation study reveals that both custom tokenization and continued pretraining contribute to translation performance, though their relative importance varies with dataset size and training conditions. The custom tokenizer dramatically reduces tokenized input lengths compared to baseline tokenization, enabling the model to process longer source texts within fixed input windows while preserving crucial contextual information.
At smaller data scales (100k samples), custom tokenization alone produces greater performance gains than continued pretraining, suggesting that improving token-level representation becomes especially critical when training data is limited. This finding highlights the importance of domain-appropriate preprocessing in low-resource settings. Continued pretraining provides modest improvements even without custom tokenization, indicating that exposure to unsupervised Tibetan text helps models adapt to script and domain characteristics despite suboptimal input representations.
As training data volume increases to 400k samples, the benefits of continued pretraining become more pronounced. The combination of both interventions (T + P) yields the best or near-best BLEU at both scales, though no single configuration dominates every metric, indicating that the two techniques are complementary rather than interchangeable. These results suggest a practical strategy for low-resource translation: prioritize custom tokenization for immediate gains at small scales, then add continued pretraining as more data becomes available to maximize performance improvements.
The differential impact across metrics further emphasizes evaluation complexity. While BLEU scales predictably with increased data volume, chrF shows diminishing returns, and TER improves only marginally, reinforcing concerns about metric selection and the importance of reporting multiple evaluation measures.
6.4 Domain-Specific vs. General Data
Our comparison between domain-specific and mixed-domain training data yields clear evidence favoring specialization in low-resource contexts. The model trained exclusively on Buddhist texts significantly outperforms the mixed-domain model, achieving 64% higher BLEU scores, 38% higher chrF scores, and 6.5% lower TER. These substantial improvements demonstrate that targeted data curation can be more beneficial than dataset size alone.
This finding contrasts with research emphasizing the benefits of large-scale, diverse training data, but aligns with domain adaptation literature in specialized fields. For Tibetan-English translation, where training data remains scarce and texts often contain domain-specific terminology, concepts, and cultural references, focused curation appears essential. The specialized model likely develops better representations of Buddhist terminology and discourse patterns, leading to more accurate and contextually appropriate translations.
These results have practical implications for translation systems development in specialized domains. Rather than pursuing broad coverage through diverse but potentially noisy data, development should prioritize high-quality, domain-relevant datasets that match their target use cases.
6.5 Limitations and Future Directions
6.5.1 Evaluation Challenges and Standardization Needs
Reliable evaluation constitutes a major challenge for Tibetan MT, particularly in specialized domains such as Buddhist texts. The datasets used in this and other studies (Shu et al., 2024; Nehrdich & Keutzer, 2025) are distinct, making direct numerical comparisons difficult and constraining our ability to benchmark progress in the field. Future work should prioritize developing standardized, domain-representative evaluation benchmarks that enable meaningful cross-system comparisons while accounting for the linguistic and cultural specificities of Tibetan texts.
BLEU, chrF, and TER, while providing useful rough indicators, appear inadequate for capturing the nuanced semantic and stylistic adequacy required for high-quality translation in this specialized context. The stark divergence between these metrics and GEMBA scores in our study underscores this limitation. Future evaluation frameworks should incorporate human evaluation, reference-free quality estimation methods, and domain-specific assessment schemes that better reflect end-user needs and translation utility.
Because our preliminary qualitative analysis is simple and limited to two examples, the generalizability of our findings is constrained. Expanding the set of test samples and systematically exploring translation quality across diverse text types, genres, and difficulty levels will help identify model strengths, weaknesses, and areas for targeted improvement. Such comprehensive evaluation would provide clearer guidance for model selection and deployment decisions.
6.5.2 Advanced Translation Methodologies
Several sophisticated approaches remain unexplored for Tibetan-English translation. Prior work has demonstrated that contextual information such as metadata or structural cues can significantly improve translation performance in low-resource, domain-specific settings. Given the diversity of registers and genres within Tibetan historical and religious texts, metadata including text genre, section headers, or period of composition could help models disambiguate terminology and adapt stylistic choices appropriately.
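One lightweight way to expose such metadata is to prepend tags to the source side of the input before translation. The following is a sketch only; the tag format and the task prefix are assumptions for illustration, not the configuration used in this study:

def build_input(tibetan_text, genre=None, period=None):
    """Prefix the source text with optional metadata tags before translation."""
    tags = []
    if genre:
        tags.append(f"<genre={genre}>")
    if period:
        tags.append(f"<period={period}>")
    return " ".join(tags + ["translate Tibetan to English:", tibetan_text])

print(build_input("བཀྲ་ཤིས་བདེ་ལེགས།", genre="prayer"))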
Targeted domain adaptation presents another promising avenue. Though this study focused only on Buddhist material, this corpus still comprises an exceptionally diverse set of texts, from philosophical treatises to ritual manuals. Approaches such as curriculum learning, multi-stage fine-tuning, or domain-specific pretraining may yield substantial gains when applied to narrower textual contexts or specific Buddhist schools and traditions.
6.5.3 Hybrid Architectures and Selective Model Routing
Our finding that smaller and larger models perform comparably on most inputs while differing on edge cases suggests opportunities for hybrid deployment strategies. A selective routing approach, like that presented in Khalili et al. (2022), could direct challenging inputs to larger models while processing routine translations with smaller, more efficient models. This strategy could maintain high overall translation quality while minimizing computational overhead, which is particularly valuable for organizations with limited resources.
The strong correlation between model predictions (r = 0.908) supports this approach's feasibility. A translation produced by the smaller model could be automatically evaluated, and inputs falling below quality thresholds could be re-processed by the larger model.
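A minimal sketch of this routing logic, assuming a reference-free quality estimator (for example, an LLM-based scorer in the spirit of GEMBA) and model objects exposing a translate method; all names and the threshold value are illustrative:

def route_translation(source, small_model, base_model, quality_estimate, threshold=60.0):
    """Translate with the Small model first; fall back to the Base model
    only when the estimated quality drops below the threshold."""
    draft = small_model.translate(source)
    if quality_estimate(source, draft) >= threshold:
        return draft                      # cheap path: Small model is good enough
    return base_model.translate(source)   # expensive path: re-run with larger model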
6.5.4 Architectural Exploration and Scaling Studies
While this study evaluated a single T5 architecture variant at just two sizes, systematic investigation of other architecture variants (e.g., T5 Version 1.1, T5-Efficient) and model sizes (e.g., T5-Large, T5-XL, and T5-XXL) could provide additional quality improvements or clarify scaling behavior for Tibetan-English translation.
Additionally, comparative studies with decoder-only architectures could determine whether encoder-decoder models offer systematic advantages for this translation task; for example, comparably sized T5 and Gemma models could be compared directly.
7 Conclusion
7.1 Summary of Contributions
This study provides the first comprehensive investigation of Tibetan–English MT using compact transformer architectures, showing that careful design choices can achieve competitive results without large-scale computational resources. We evaluated three interventions (custom tokenization, continued pretraining, and domain-specific data curation) and found that each significantly improves translation quality in this specialized, low-resource setting.
Getok, our custom tokenizer trained on in-domain Tibetan data, reduces input sequence length by more than 80% relative to baseline tokenization, improving token alignment and allowing longer inputs. Continued pretraining, even for a single epoch, yields substantial downstream gains, especially with larger datasets, suggesting that unsupervised pretraining complements supervised finetuning in scalable ways. Domain-specific training data proved surprisingly impactful. Models trained exclusively on Buddhist texts achieved 64% higher BLEU, 38% higher chrF, and 6.5% lower TER than mixed-domain counterparts.
Finally, our results challenge the assumption that larger models are essential for high-quality translation. Small and Base models achieved comparable GEMBA scores, underscoring that training strategy and domain adaptation can outweigh raw parameter count, an important consideration for resource-constrained deployments.
7.2 Applicability and Impact
The proposed framework is broadly applicable to low-resource, morphologically rich languages, and it requires no large models, high-resource corpora, or specialized hardware. Custom tokenization is particularly beneficial for languages with rich morphology, non-Latin scripts, or agglutinative structures, where standard tokenizers produce inefficient segmentations.
Domain-specific data curation offers another transferable principle: in specialized contexts such as legal, medical, religious, or technical translation, targeted datasets can outperform larger heterogeneous collections. Likewise, continued pretraining is viable wherever monolingual data is abundant, even with minimal parallel corpora, offering a practical path for performance gains.
By prioritizing data quality, domain alignment, and efficient architectures, this approach supports the democratization of translation technology, enabling underserved communities to preserve and access rich textual traditions without prohibitive computational costs.
Author Contribution
J.M. wrote the main manuscript text and conducted the analysis. P.L. provided supervision, guidance, and critical revisions to the manuscript. Both authors reviewed the final manuscript.
Acknowledgement
We wish to thank the following people: Jamie Gordon Creek, In-House Translator & Catalog Curator at Khyentse Vision Project, for providing independent validation of our translation quality assessments, as well as invaluable corrections to Tibetan spellings; Ngawang Trinley and Open Pecha for the use of their data and guidance in the early stages of this project; Dr. David Germano and the University of Virginia’s Tibetan and Himalayan Library for the use of their data and their encouragement of this work; Dr. Andres Montano, Director of Tibet House Guatemala, for suggesting the name “Getok” for the custom tokenizer presented in this study; and Sebastian Nehrdich, at the Berkeley AI Research Lab, for his support and encouragement in this study. Any remaining errors are solely our responsibility.
References
Allal, L., & von Werra, L. (2024). Huggingface/text–clustering: Easily embed, cluster and semantically label text datasets. GitHub. Retrieved June 25, 2025 from https://github.com/huggingface/text-clustering
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. https://doi.org/10.1145/3292500.3330701
Akshat, A., Tripathi, K., Raj, G., Sar, A., Choudhury, T., Saraf, S., & Dewangan, B. K. (2024, June). A comparative study between chat GPT, T5 and LSTM for machine language translation. In 2024 OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 4.0 (pp. 1–6). IEEE.
Alves, D. M., Pombal, J., Guerreiro, N. M., Martins, P. H., Alves, J., Farajian, A., Peters, B., Rei, R., Fernandes, P., Agrawal, S., & Colombo, P. (2024). Tower: An open multilingual large language model for translation–related tasks. CoRR. https://doi.org/10.48550/arXiv.2402.17733. arXiv:2402.17733.
Anthropic (2025, May). System card: Claude Opus 4 & Claude Sonnet 4. https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
Brants, T., Popat, A., Xu, P., Och, F. J., & Dean, J. (2007, June). Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 858–867).
Chen, X., Wang, H., & Xiang, W. (2019). Implementation of Tibetan–Chinese translation platform based on LSTM algorithm. Proceedings of the ACM Turing Celebration Conference – China (ACM TURC ’19) (Article 142), 1–5. https://doi.org/10.1145/3321408.3326670
Chu, C., & Wang, R. (2018). A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1304–1319). Association for Computational Linguistics. https://doi.org/10.48550/arXiv.1806.00258
DeHaven, M., & Billa, J. (2022). Improving low-resource speech recognition with pretrained speech models: Continued pretraining vs. semi-supervised training. arXiv preprint arXiv:2207.00659.
Deshwal, M., & Chawla, A. (2024). PHUDGE: Phi-3 as Scalable Judge. arXiv e-prints, arXiv-2405.
Dewangan, V., Suri, G., Raj, S., & Sonavane, R. (2025). When every token counts: Optimal segmentation for low–resource language models. Proceedings of the First Workshop on Language Models for Low–Resource Languages, 294–308. Association for Computational Linguistics.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., & Ganapathy, R. (2024). The llama 3 herd of models. arXiv e-prints, arXiv-2407.
Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El–Kishky, A., Goyal, S., & Ott, M. (2021). Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107), 1–48. https://doi.org/10.48550/arXiv.2010.11125
Freitag, M., & Al-Onaizan, Y. (2016). Fast Domain Adaptation for Neural Machine Translation. arXiv e-prints, arXiv-1612.
Gan, S., Yin, Y., Jiang, Z., Xie, L., & Lu, S. (2023, October). Towards real-time sign language recognition and translation on edge devices. In Proceedings of the 31st ACM International Conference on Multimedia (pp. 4502–4512).
Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Rouillard, L., Mesnard, T., & Cideron, G. (2025). Gemma 3 Technical Report. arXiv e-prints, https://doi.org/10.48550/arXiv.2503.19786
Gordon, M. A., Duh, K., & Kaplan, J. (2021). Data and parameter scaling laws for neural machine translation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 5915–5922. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.478
Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., & Kivlichan, I. (2024). GPT-4o System Card. arXiv e-prints, arXiv-2410.
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., & Liu, Q. (2020). TinyBERT: Distilling BERT for natural language understanding. Findings of EMNLP 2020, 4163–4174. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.372
Kim, Y., & Rush, A. M. (2016). Sequence-level knowledge distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1317–1327. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1139
Khalili, L., You, Y., & Bohannon, J. (2022). BabyBear: Cheap inference triage for expensive language models. arXiv e-prints, arXiv-2205.
Khandro, S. (2025). Prayer to Ārya Tārā (A. Pearcey, Trans.). Lotsawa House. Retrieved June 23, 2025, from https://www.lotsawahouse.org/tibetan-masters/sera-khandro/tara-prayer-protect-all-fears [Licensed under CC BY-NC 4.0].
Kocmi, T., & Federmann, C. (2023). Large Language Models Are State-of-the-Art Evaluators of Translation Quality. Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 193–203.
Le, C. (2025). Privacy-Preserving Real-Time Vietnamese-English Translation on iOS using Edge AI. arXiv preprint arXiv:2505.07583.
Liu, Y. (2025). Improving machine translation accuracy for underrepresented languages in linguistic research using transformer models. Journal of Computational Methods in Sciences and Engineering, 14727978251337995.
Lommel, A. R., Burchardt, A., & Uszkoreit, H. (2013). Multidimensional quality metrics: a flexible system for assessing translation quality. In Proceedings of Translating and the Computer 35.
Magister, L. C., Mallinson, J., Adamek, J., Malmi, E., & Severyn, A. (2023). Teaching Small Language Models to Reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 1773–1781).
McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205.
McInnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 861. https://doi.org/10.48550/arXiv.1802.03426
Mistral AI. (2023). Mixtral-8x7B-Instruct-v0.1.
Miyagawa, S. (2023). Machine translation for highly low-resource language: A case study of ainu, a critically endangered indigenous language in northern Japan. Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages (pp. 120–124).
Nehrdich, S., & Keutzer, K. (2025). MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan. Unpublished manuscript.
Nguyen, X. P., Aljunied, M., Joty, S., & Bing, L. (2024). Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 3501–3516. https://doi.org/10.18653/v1/2024.acl-long.192
Palzang, K. N. (2024). Bending mind toward good (J. McClellan, Trans.). Lotsawa House. Retrieved June 23, 2025, from https://www.lotsawahouse.org/tibetan-masters/khenchen-ngawang-palzang/advice-bending-mind-to-good [Licensed under CC BY-NC 4.0].
Pang, J., Yang, B., Wong, D. F., Wan, Y., Liu, D., Chao, L. S., & Xie, J. (2024). Rethinking the exploitation of monolingual data for low–resource neural machine translation. Computational Linguistics, 50(1), 25–47. https://doi.org/10.1162/coli_a_00496
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311–318).
Popović, M. (2015). chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the tenth workshop on statistical machine translation (pp. 392–395).
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text–to–text transformer. Journal of Machine Learning Research, 21(140), 1–67. https://doi.org/10.48550/arXiv.1910.10683
Reimers, N., & Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the Fifth Conference on Machine Translation, 1174–1182. https://doi.org/10.48550/arXiv.2004.09813
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv e-prints, arXiv-1910.
Schwenk, H., Wenzek, G., Edunov, S., Grave, É., Joulin, A., & Fan, A. (2021). CCMatrix: Mining billions of high-quality parallel sentences on the web. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 1, 6490–6500. https://doi.org/10.18653/v1/2021.acl-long.507
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units. ArXiv, abs/1508.07909.
Sennrich, R., & Zhang, B. (2019). Revisiting low–resource neural machine translation: A case study. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 211–221. https://doi.org/10.18653/v1/P19-1021
Shazeer, N., & Stern, M. (2018). Adafactor: Adaptive learning rates with sublinear memory cost. Proceedings of the International Conference on Machine Learning, 4596–4604. https://doi.org/10.48550/arXiv.1804.04235
Shazeer, N. (2020). Glu variants improve transformer. arXiv preprint arXiv:2002.05202.
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., & Arikawa, S. (1999). Byte pair encoding: A text compression scheme that accelerates pattern matching.
Shu, P., Chen, J., Liu, Z., Wang, H., Wu, Z., Zhong, T., Li, Y., Zhao, H., Jiang, H., Pan, Y., & Zhou, Y. (2024). Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation. arXiv e-prints, arXiv-2411.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers (pp. 223–231).
Suryakusuma, M. R., Shiddiq, M. F. A., Lucky, H., & Iswanto, I. A. (2023, November). Investigating T5 Generation Neural Machine Translation Performance on English to German. In 2023 International Conference on Informatics, Multimedia, Cyber and Informations System (ICIMCIS) (pp. 12–15). IEEE.
Tan, Z., Yang, Z., Zhang, M., Liu, Q., Sun, M., & Liu, Y. (2022). Dynamic multi-branch layers for on-device neural machine translation. IEEE/ACM Transactions on Audio Speech and Language Processing, 30, 958–967.
Tay, Y., Dehghani, M., Rao, J., Fedus, W., Abnar, S., Chung, H. W., & Metzler, D. (2021). Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. arXiv e-prints, arXiv-2109.
Thupten, T., Rinchen, D., Nyima, T., Yu, Y., & Deng, Q. (2021). Research on Chinese–Tibetan machine translation model based on improved byte pair encoding. Journal of the University of Electronic Science and Technology of China, 50(2), 249–255. https://doi.org/10.12178/1001-0548.2020218
Tiedemann, J. (2012, May). Parallel data, tools and interfaces in OPUS. In Proceedings of LREC 2012 (pp. 2214–2218).
Tiedemann, J. (2020, November). The Tatoeba Translation Challenge–Realistic Data Sets for Low Resource and Multilingual MT. In Proceedings of the Fifth Conference on Machine Translation (pp. 1174–1182).
Tournadre, N. (2010). The Classical Tibetan cases and their transcategoriality: From sacred grammar to modern linguistics. Himalayan Linguistics, 9(2).
Tournadre, N., & Dorje, S. (2003). Manual of Standard Tibetan. Snow Lion.
Tsarfaty, R., Seddah, D., Kübler, S., & Nivre, J. (2013). Parsing morphologically rich languages: Introduction to the special issue. Computational linguistics, 39(1), 15–22.
Usui, H., & Komiya, K. (2023, December). Translation from Historical to Contemporary Japanese Using Japanese T5. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages (pp. 27–35).
Verma, N., Murray, K., & Duh, K. (2022). Strategies for adapting multilingual pre–training for domain–specific machine translation. Proceedings of the 15th Conference of the Association for Machine Translation in the Americas, 1, 31–44.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008. https://doi.org/10.48550/arXiv.1706.03762
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). MiniLM: Deep self–attention distillation for task–agnostic compression of pre–trained transformers. NeurIPS, 33, 5776–5788. https://doi.org/10.48550/arXiv.2002.10957
Watt, T., Chrysoulas, C., & Gkatzia, D. (2023). October. Edge NLP for Efficient Machine Translation in Low Connectivity Areas. In 2023 IEEE 9th World Forum on Internet of Things (WF-IoT) (pp. 1–6). IEEE.
Wilson, J. B. (1998). Translating Buddhism from Tibetan. Snow Lion.
Wu, Z., Liu, Z., Lin, J., & Han, S. (2020). Lite Transformer with long–short range attention. Proceedings of the International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2004.11886
Zaki, M. Z. (2024). Revolutionising translation technology: A comparative study of variant transformer models–BERT, GPT and T5. Computer Science and Engineering–An International Journal, 14(3), 15–27.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv e-prints, arXiv-1904.
Zhang, P., Zeng, G., Wang, T., & Lu, W. (2024). Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Zheng, J., Hong, H., Liu, F., Wang, X., Su, J., Liang, Y., & Wu, S. (2024). Fine-tuning large language models for domain-specific machine translation. arXiv preprint arXiv:2402.15061.
Zhou, M. (2024). Research on Tibetan-Chinese neural machine translation integrating statistical method. In Proceedings of the 2023 6th International Conference on Machine Learning and Natural Language Processing, 126–129. https://doi.org/10.1145/3639479.3639506