Names: Singh, Mayank; Lodwal, Hitesh
Dates: 2025-09-04; 2025-09-04; 2024-01-01
URI: https://repository.iitgn.ac.in/handle/IITG2025/32011
Physical description: hbk.; 30 cm; xi, 39 p.
Language: en-US
Keywords: Web Scraping; Deduplication-SimHash; Tokenizer-SentencePiece Byte Pair Encoding
Title: Data curation for Indic language
Type: M.Tech Theses
Identifier: 123456789/440