One instruction does not fit all: how well do embeddings align personas and instructions in low-resource Indian languages?

Singh, Mayank

doi:10.48550/arXiv.2601.10205

Source

arXiv

ISSN

2331-8422

Date Issued

2026-01-01

Author(s)

Shah, Arya

Beniwal, Himanshu

Singh, Mayank

DOI

10.48550/arXiv.2601.10205

Abstract

Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work. Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.
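
The Recall@1 figures above can be read as follows: for each persona, rank all candidate instructions by embedding similarity and count how often the correct one comes first. The sketch below illustrates that metric on toy vectors. It is not the authors' evaluation code. The function names, the use of cosine similarity, and the two-dimensional example vectors are assumptions made for illustration; in practice the vectors would come from a frozen embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_vecs, candidate_vecs, gold_indices, k=1):
    """Fraction of queries whose gold candidate appears in the top-k ranking."""
    hits = 0
    for query, gold in zip(query_vecs, gold_indices):
        scores = [cosine(query, cand) for cand in candidate_vecs]
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(candidate_vecs)),
                        key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (hypothetical): three persona embeddings, three instruction
# embeddings, and the index of the matching instruction for each persona.
personas = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
instructions = [[0.9, 0.1], [0.2, 0.8], [-1.0, 0.5]]
gold = [0, 1, 2]

r1 = recall_at_k(personas, instructions, gold, k=1)
r3 = recall_at_k(personas, instructions, gold, k=3)
print(f"Recall@1 = {r1:.3f}, Recall@3 = {r3:.3f}")
```

Reverse retrieval (instruction to persona) would use the same function with the roles of queries and candidates swapped, and cross-lingual retrieval would draw queries and candidates from different languages.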

URI

https://repository.iitgn.ac.in/handle/IITG2025/33981