Abstract
Cross-Lingual Machine Translation (CLMT) remains challenging because of linguistic diversity, the scarcity of parallel corpora, and the need for effective semantic alignment. ContextXL is a new CLMT framework that combines pre-processing, semantic representation, and feature optimization strategies. The pipeline begins with Named Entity Recognition (NER) and Byte Pair Encoding (BPE) to preserve semantic units and handle rare words. It then enriches language representation with cross-lingual word embeddings (MUSE) and subword-sensitive FastText embeddings. Contextual embeddings from Bidirectional Encoder Representations from Transformers (BERT) and Embeddings from Language Models (ELMo) supply richer features. Feature selection is performed by a new Golden Hawk Search Optimization (GHSO) algorithm, which combines Golden Section Search with Chaotic Harris Hawk Optimization. Transformer-XL serves as the translation engine and models long-range dependencies through segment-level recurrence and memory caching. Experimental analysis shows that ContextXL achieves a translation accuracy of 98.77%, with a precision of 98.97%, a recall of 98.57%, and an F-score of 98.85%. The model also outperforms the state-of-the-art baselines Neural Machine Translation (NMT), BERT, Transformer, and RoBERTa on further evaluation metrics, including Matthews Correlation Coefficient (MCC), sensitivity, specificity, False Negative Rate (FNR), False Positive Rate (FPR), and Negative Predictive Value (NPV). A human evaluation supports these results, with high scores for contextual preservation (4.52/5), fluency (4.47/5), and appropriateness (4.38/5). These findings indicate that ContextXL is well suited to low-resource, morphologically rich, and even informal or code-switched language data.
Keywords: Bidirectional Encoder Representations from Transformers, Byte Pair Encoding, Cross-Lingual Machine Translation, Embeddings from Language Models, Golden Hawk Search Optimization, Named Entity Recognition.
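The evaluation metrics named in the abstract all have standard definitions in terms of confusion-matrix counts. The sketch below computes them under the assumption that outputs are scored as binary correct/incorrect decisions; the abstract does not state how those counts are obtained for translation output, so the function name `confusion_metrics` and the example counts are purely illustrative and are not results from the paper.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Standard metrics from binary confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives,
    true negatives, false negatives (hypothetical inputs).
    """
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # identical to sensitivity
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": _ratio(tn, tn + fp),
        "f_score": _ratio(2 * precision * recall, precision + recall),
        "fnr": _ratio(fn, fn + tp),  # False Negative Rate = 1 - sensitivity
        "fpr": _ratio(fp, fp + tn),  # False Positive Rate = 1 - specificity
        "npv": _ratio(tn, tn + fn),  # Negative Predictive Value
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Illustrative counts only, not taken from the paper.
    for name, value in confusion_metrics(tp=90, fp=10, tn=80, fn=20).items():
        print(f"{name}: {value:.4f}")
```

Under these definitions FNR and FPR are the complements of sensitivity and specificity, so they carry the same information seen from the error side, while MCC summarizes all four counts in a single value between -1 and 1.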