Multi-label imbalanced text handling using ensemble methodology with application to biomedical data classification
Source
Iran Journal of Computer Science
ISSN
2520-8438
Date Issued
2025-01-01
Author(s)
Ghosh, Subhajit
Gupta, Sanidhya
Bhattacharyya, Sourav
Das, Avik Kumar
Nandi, Apurba
Sarkar, Ardhendu
Samanta, Partha Sarathi
Polprasert, Chantri
Abstract
The surge in biomedical literature and clinical reports presents a formidable challenge for automated text analysis, particularly in multi-label classification tasks where severe class imbalance and interdependent labels are common. To address these issues, we propose MITHEM (Multi-label Imbalance-aware Text Classification using Hybrid Ensemble Model), an ensemble framework that combines threshold-guided binning, SMOTE-based oversampling, and a set of diverse classifiers (Support Vector Machines, Decision Trees, and Random Forests) within a meta-classification approach. Unlike traditional techniques, MITHEM not only improves the representation of minority classes but also learns correlations between labels to refine decision-making. We tested the framework on eight standard biomedical text datasets and observed notable improvements in macro F1-score, Hamming loss, and label coverage compared with strong baselines. The empirical results show that MITHEM outperforms competing methods across these datasets, especially in imbalanced scenarios, with higher Recall and Precision scores.
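As a brief illustration of two of the evaluation measures named in the abstract, the sketch below computes Hamming loss and macro F1-score on a small multi-label example using their standard definitions. The toy label matrices and the function names are assumptions made for illustration; they are not taken from the paper and do not reproduce any of its results.

```python
# Toy illustration of Hamming loss and macro F1 for multi-label output.
# The data below is hypothetical and not drawn from the paper's datasets.

def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) cells that are predicted incorrectly."""
    n_cells = len(y_true) * len(y_true[0])
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / n_cells


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a label with no positives and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Four documents, three labels; the third label is rare (one positive).
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
# A classifier that never predicts the rare label.
y_pred = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Macro F1:", macro_f1(y_true, y_pred))
```

In this toy case only one of twelve label cells is wrong, so Hamming loss stays low at 1/12, while macro F1 falls to 2/3 because the missed rare label contributes an F1 of zero. This is why macro-averaged scores are informative in imbalanced settings.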
Keywords
Classification algorithm | Ensemble model | Imbalanced text | Machine learning | Multi-label learning
