Do LLMs model human linguistic variation? A case study in Hindi-English verb code-mixing

Choudhury, Monojit

Source

19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

Date Issued

2026-03-24

Author(s)

Choudhary, Mukund

Jindal, Madhu

Aeron, Gaurja

Choudhury, Monojit

Abstract

Do large language models (LLMs) model linguistic variation? We investigate this question through Hindi-English (Hinglish) verb code-mixing, where speakers can use either a Hindi verb or an English verb with the light verb karna ('do'). Both forms are grammatical, but speakers show unexplained variation in language choice for the verb. We compare human preferences on controlled code-mixed minimal pairs to LLM perplexities spanning families, sizes, and training language compositions. We find that current LLMs do not reliably classify verb language preferences to match native speaker judgments. We also see that with specific supervision, some models do predict human preference to an extent. We release native speaker acceptability judgments on 30 verb pairs, perplexity ratios for 4,279 verb pairs across 7 models, and experimental materials.
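The abstract compares human preferences on minimal pairs with LLM perplexities and reports perplexity ratios, but it does not say how the ratio is defined. The sketch below is one plausible reading, not the authors' implementation: it assumes per-token natural-log probabilities are already available for the two variants of a sentence (Hindi verb vs. English verb + karna), and it assumes the ratio is taken as Hindi-variant perplexity over English-variant perplexity. The function names and the log-probability values are hypothetical placeholders.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_ratio(hindi_logprobs: Sequence[float],
                     english_logprobs: Sequence[float]) -> float:
    """Assumed convention: PPL(Hindi-verb variant) / PPL(English-verb + karna variant).

    A ratio above 1 means the model finds the English-verb variant less surprising.
    """
    return perplexity(hindi_logprobs) / perplexity(english_logprobs)


def preferred_variant(hindi_logprobs: Sequence[float],
                      english_logprobs: Sequence[float]) -> str:
    """Turn the ratio into a binary preference label for comparison with human judgments."""
    ratio = perplexity_ratio(hindi_logprobs, english_logprobs)
    return "english+karna" if ratio > 1.0 else "hindi"


# Hypothetical per-token log-probabilities for one minimal pair (illustrative values only).
hindi_variant = [math.log(0.25), math.log(0.25)]
english_variant = [math.log(0.5), math.log(0.5), math.log(0.5)]

label = preferred_variant(hindi_variant, english_variant)
```

In practice the log-probabilities would come from scoring each full sentence with a causal language model; averaging per token, as done here, keeps variants of different lengths comparable.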

URI

https://aclanthology.org/2026.findings-eacl.291/

https://repository.iitgn.ac.in/handle/IITG2025/34908