MUTANT: a multi-sentential code-mixed Hinglish dataset

dc.contributor.author	Gupta, Rahul
dc.contributor.author	Srivastava, Vivek
dc.contributor.author	Singh, Mayank
dc.coverage.spatial	United States of America
dc.date.accessioned	2023-03-03T15:40:59Z
dc.date.available	2023-03-03T15:40:59Z
dc.date.issued	2023-02
dc.identifier.citation	Gupta, Rahul; Srivastava, Vivek and Singh, Mayank, "MUTANT: a multi-sentential code-mixed Hinglish dataset", arXiv, Cornell University Library, DOI: arXiv:2302.11766v1, Feb. 2023.	en_US
dc.identifier.uri	https://arxiv.org/abs/2302.11766v1
dc.identifier.uri	https://repository.iitgn.ac.in/handle/123456789/8611
dc.description.abstract	Multi-sentential long-sequence textual data opens up several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other monolingual languages, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixed Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, MUTANT. We propose a token-level language-aware pipeline and extend existing metrics that measure the degree of code-mixing to a multi-sentential framework in order to automatically identify MCT in multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
dc.description.statementofresponsibility	by Rahul Gupta, Vivek Srivastava and Mayank Singh
dc.language.iso	en_US	en_US
dc.publisher	Cornell University Library	en_US
dc.subject	MUTANT	en_US
dc.subject	MCT	en_US
dc.subject	Code-mixed languages	en_US
dc.subject	Multi-sentential framework	en_US
dc.subject	Hinglish	en_US
dc.title	MUTANT: a multi-sentential code-mixed Hinglish dataset	en_US
dc.type	Pre-Print Archive	en_US
dc.relation.journal	arXiv

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following Collection(s)

E-print Articles [38]
