dc.contributor.author |
Gupta, Rahul |
|
dc.contributor.author |
Srivastava, Vivek |
|
dc.contributor.author |
Singh, Mayank |
|
dc.coverage.spatial |
United States of America |
|
dc.date.accessioned |
2023-03-03T15:40:59Z |
|
dc.date.available |
2023-03-03T15:40:59Z |
|
dc.date.issued |
2023-02 |
|
dc.identifier.citation |
Gupta, Rahul; Srivastava, Vivek and Singh, Mayank, "MUTANT: a multi-sentential code-mixed Hinglish dataset", arXiv, Cornell University Library, DOI: arXiv:2302.11766v1, Feb. 2023. |
en_US |
dc.identifier.uri |
https://arxiv.org/abs/2302.11766v1 |
|
dc.identifier.uri |
https://repository.iitgn.ac.in/handle/123456789/8611 |
|
dc.description.abstract |
The multi-sentential long sequence textual data unfolds several interesting research directions pertaining to natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there is no significant effort in building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset i.e., MUTANT. We propose a token-level language-aware pipeline and extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available. |
|
dc.description.statementofresponsibility |
by Rahul Gupta, Vivek Srivastava and Mayank Singh |
|
dc.language.iso |
en_US |
en_US |
dc.publisher |
Cornell University Library |
en_US |
dc.subject |
MUTANT |
en_US |
dc.subject |
MCT |
en_US |
dc.subject |
Code-mixed languages |
en_US |
dc.subject |
Multi-sentential framework |
en_US |
dc.subject |
Hinglish |
en_US |
dc.title |
MUTANT: a multi-sentential code-mixed Hinglish dataset |
en_US |
dc.type |
Pre-Print Archive |
en_US |
dc.relation.journal |
arXiv |
|