MUTANT: a multi-sentential code-mixed Hinglish dataset


dc.contributor.author Gupta, Rahul
dc.contributor.author Srivastava, Vivek
dc.contributor.author Singh, Mayank
dc.coverage.spatial United States of America
dc.date.accessioned 2023-03-03T15:40:59Z
dc.date.available 2023-03-03T15:40:59Z
dc.date.issued 2023-02
dc.identifier.citation Gupta, Rahul; Srivastava, Vivek and Singh, Mayank, "MUTANT: a multi-sentential code-mixed Hinglish dataset", arXiv, Cornell University Library, DOI: arXiv:2302.11766v1, Feb. 2023. en_US
dc.identifier.uri https://arxiv.org/abs/2302.11766v1
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/8611
dc.description.abstract Multi-sentential long-sequence textual data unfolds several interesting research directions pertaining to natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there is no significant effort in building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, i.e., MUTANT. We propose a token-level language-aware pipeline, extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
dc.description.statementofresponsibility by Rahul Gupta, Vivek Srivastava and Mayank Singh
dc.language.iso en_US en_US
dc.publisher Cornell University Library en_US
dc.subject MUTANT en_US
dc.subject MCT en_US
dc.subject Code-mixed languages en_US
dc.subject Multi-sentential framework en_US
dc.subject Hinglish en_US
dc.title MUTANT: a multi-sentential code-mixed Hinglish dataset en_US
dc.type Pre-Print Archive en_US
dc.relation.journal arXiv