OVERSAMPLING HYBRID METHOD FOR HANDLING MULTI-LABEL IMBALANCED DATA

Dara Tursina, Sherly Rosa Anggraeni, Chastine Fatichah, Misbakhul Munir Irfan Subakti

Abstract


The volume of data and information continues to grow with the development of digital technology, and the data that become available are increasingly large and complex. Imbalanced data cause classification errors because the majority class dominates the minority class. The problem is not limited to binary classification: imbalance is also frequently encountered in multi-label data, which have become increasingly important in recent years owing to their wide range of applications. Because class imbalance is characteristic of many complex multi-label datasets, it is the focus of this research, and the handling of imbalanced multi-label data still leaves considerable room for development. Two existing approaches are Synthetic Oversampling of Multi-Label Data Based on Local Label Distribution (MLSOL) and Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-Label Data (UCLSO). UCLSO concentrates on the majority class, which can leave residual imbalance and lead to overfitting; although it is effective at preventing majority-class domination, it cannot compensate for the lack of variation within the minority class. By contrast, MLSOL focuses on the minority classes, introducing variation into the multi-label data and significantly improving classification performance. This research addresses the data-imbalance problem by combining the MLSOL and UCLSO oversampling methods, with the expectation that the combination exploits the strengths and mitigates the weaknesses of each, yielding a significant performance improvement. The experimental results show that the hybrid oversampling method achieves its best result on the biological dataset, with an F1 score of 88%, whereas the single oversampling methods UCLSO and MLSOL reach F1 scores of 67% and 62%, respectively, on the same data.
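To make the combination concrete, the sketch below outlines one way a clustering-based, label-specific oversampling pass (in the spirit of UCLSO) can be chained with neighbourhood-weighted synthetic generation (in the spirit of MLSOL). It is a minimal illustration under simplifying assumptions, not the authors' published implementation; the function hybrid_oversample and its parameters n_clusters and k are hypothetical names introduced here.

# Hypothetical sketch of the hybrid idea, not the paper's code: a UCLSO-style
# pass (unsupervised clustering, per-label oversampling inside each cluster)
# combined with MLSOL-style seed weighting based on the local label
# distribution of the k nearest neighbours.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors


def hybrid_oversample(X, Y, n_clusters=5, k=5, seed=0):
    """X: (n, d) feature matrix; Y: (n, q) binary label matrix.
    Returns (X, Y) with SMOTE-like synthetic minority samples appended."""
    rng = np.random.default_rng(seed)
    X_out, Y_out = [X], [Y]

    # UCLSO-style step: partition the feature space once, then handle each
    # label separately inside every cluster.
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X)
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X))).fit(X)

    for label in range(Y.shape[1]):
        # The minority side of this label is whichever value is rarer.
        minority = Y[:, label] == (1 if Y[:, label].mean() < 0.5 else 0)

        for c in range(n_clusters):
            idx = np.where((clusters == c) & minority)[0]
            if len(idx) < 2:
                continue  # not enough minority seeds to interpolate

            # MLSOL-style step: weight each seed by how hostile its local
            # neighbourhood is, i.e. the share of its k nearest neighbours
            # that carry the opposite value for this label.
            _, nbrs = nn.kneighbors(X[idx])
            hostility = np.array([np.mean(~minority[row[1:]]) for row in nbrs])
            weights = (hostility + 1e-6) / (hostility + 1e-6).sum()

            # Create one synthetic point per minority seed in this cluster by
            # interpolating between a weighted seed and a minority mate.
            for _ in range(len(idx)):
                s = rng.choice(idx, p=weights)
                m = rng.choice(idx[idx != s])
                lam = rng.random()
                X_out.append((X[s] + lam * (X[m] - X[s]))[None, :])
                Y_out.append(Y[s][None, :])  # copy the seed's label vector

    return np.vstack(X_out), np.vstack(Y_out)

Under this sketch, the clustering pass localises generation so that synthetic points stay inside the region of their cluster, while the neighbourhood weighting pushes generation toward minority instances surrounded by majority examples, which is where additional variation is most needed.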



References


Q. Meidianingsih and D. E. W. Meganingtyas, “Analisis Perbandingan Performa Metode Ensemble Dalam Menangani Imbalanced Multi-class Classification,” J. Apl. Stat. Komputasi Stat., vol. 14, no. 2, pp. 13–21, 2022, doi: 10.34123/jurnalasks.v14i2.335.

H. Duan, Y. Wei, P. Liu, and H. Yin, “A novel ensemble framework based on K-means and resampling for imbalanced data,” Appl. Sci., vol. 10, no. 5, 2020, doi: 10.3390/app10051684.

M. A. Tahir, J. Kittler, and A. Bouridane, “Multilabel classification using heterogeneous ensemble of multi-label classifiers,” Pattern Recognit. Lett., vol. 33, no. 5, pp. 513–523, 2012, doi: 10.1016/j.patrec.2011.10.019.

P. Vuttipittayamongkol and E. Elyan, “Neighbourhood-based undersampling approach for handling imbalanced and overlapped data,” Inf. Sci., vol. 509, pp. 47–70, 2020, doi: 10.1016/j.ins.2019.08.062.

M. Błaszczyk and J. Jedrzejowicz, “Framework for imbalanced data classification,” Procedia Comput. Sci., vol. 192, pp. 3477–3486, 2021, doi: 10.1016/j.procs.2021.09.121.

R. Rastogi and S. Mortaza, “Imbalance multi-label data learning with label specific features,” Neurocomputing, vol. 513, pp. 395–408, 2022, doi: 10.1016/j.neucom.2022.09.085.

J. J. Rodríguez, J. F. Díez-Pastor, Á. Arnaiz-González, and L. I. Kuncheva, “Random Balance ensembles for multiclass imbalance learning,” Knowledge-Based Syst., vol. 193, p. 105434, 2020, doi: 10.1016/j.knosys.2019.105434.

A. Fernández, V. López, M. Galar, M. J. Del Jesus, and F. Herrera, “Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches,” Knowledge-Based Syst., vol. 42, pp. 97–110, 2013, doi: 10.1016/j.knosys.2013.01.018.

F. Marbun, A. Baizal, and M. A. Bijaksana, “Perpaduan Combined Sampling Dan Ensemble of Support Vector Machine (Ensvm) Untuk Menangani Kasus Churn Prediction Perusahaan Telekomunikasi,” JUTI: Jurnal Ilmiah Teknologi Informasi, vol. 8, no. 2, p. 43, 2010, doi: 10.12962/j24068535.v8i2.a316.

S. Chen, R. Wang, J. Lu, and X. Wang, “Stable matching-based two-way selection in multi-label active learning with imbalanced data,” Inf. Sci., vol. 610, pp. 281–299, 2022, doi: 10.1016/j.ins.2022.07.182.

B. Liu and G. Tsoumakas, “Synthetic Oversampling of Multi-label Data Based on Local Label Distribution,” Lect. Notes Comput. Sci., vol. 11907 LNAI, pp. 180–193, 2020, doi: 10.1007/978-3-030-46147-8_11.

T. Zhu, C. Luo, Z. Zhang, J. Li, S. Ren, and Y. Zeng, “Minority oversampling for imbalanced time series classification,” Knowledge-Based Syst., vol. 247, p. 108764, 2022, doi: 10.1016/j.knosys.2022.108764.

A. N. Tarekegn, M. Giacobini, and K. Michalak, “A review of methods for imbalanced multi-label classification,” Pattern Recognit., vol. 118, p. 107965, 2021, doi: 10.1016/j.patcog.2021.107965.

M. Koziarski, “Radial-Based Undersampling for imbalanced data classification,” Pattern Recognit., vol. 102, 2020, doi: 10.1016/j.patcog.2020.107262.

L. Cai, H. Wang, F. Jiang, Y. Zhang, and Y. Peng, “A new clustering mining algorithm for multi-source imbalanced location data,” Inf. Sci., vol. 584, pp. 50–64, 2022, doi: 10.1016/j.ins.2021.10.029.

E. K. Y. Yapp, X. Li, W. F. Lu, and P. S. Tan, “Comparison of base classifiers for multi-label learning,” Neurocomputing, vol. 394, pp. 51–60, 2020, doi: 10.1016/j.neucom.2020.01.102.

G. Wei, W. Mu, Y. Song, and J. Dou, “An improved and random synthetic minority oversampling technique for imbalanced data,” Knowledge-Based Syst., vol. 248, p. 108839, 2022, doi: 10.1016/j.knosys.2022.108839.

J. Dou, Z. Gao, G. Wei, Y. Song, and M. Li, “Switching synthesizing-incorporated and cluster-based synthetic oversampling for imbalanced binary classification,” Eng. Appl. Artif. Intell., vol. 123, p. 106193, 2023, doi: 10.1016/j.engappai.2023.106193.

T. G. S., Y. Hariprasad, S. S. Iyengar, N. R. Sunitha, P. Badrinath, and S. Chennupati, “An extension of Synthetic Minority Oversampling Technique based on Kalman filter for imbalanced datasets,” Mach. Learn. with Appl., vol. 8, p. 100267, 2022, doi: 10.1016/j.mlwa.2022.100267.

F. Charte, A. J. Rivera, M. J. del Jesus, and F. Herrera, “Addressing imbalance in multilabel classification: Measures and random resampling algorithms,” Neurocomputing, vol. 163, pp. 3–16, 2015, doi: 10.1016/j.neucom.2014.08.091.

P. Sadhukhan, A. Pakrashi, S. Palit, and B. Mac Namee, “Integrating Unsupervised Clustering and Label-Specific Oversampling to Tackle Imbalanced Multi-Label Data,” pp. 489–498, 2023, doi: 10.5220/0011901200003393.

J. Ren, Y. Wang, M. Mao, and Y.-M. Cheung, “Equalization ensemble for large scale highly imbalanced data classification,” Knowledge-Based Syst., vol. 242, p. 108295, 2022, doi: 10.1016/j.knosys.2022.108295.

D. C. Li, S. Y. Wang, K. C. Huang, and T. I. Tsai, “Learning class-imbalanced data with region-impurity synthetic minority oversampling technique,” Inf. Sci., vol. 607, pp. 1391–1407, 2022, doi: 10.1016/j.ins.2022.06.067.

J. H. J. Einmahl and Y. He, “Pr ep rin ot pe er r Pr ep er ed,” vol. 4, no. 2, pp. 0–3, 2021.

W. Lu, Z. Li, and J. Chu, “Adaptive Ensemble Undersampling-Boost: A novel learning framework for imbalanced data,” J. Syst. Softw., vol. 132, pp. 272–282, 2017, doi: 10.1016/j.jss.2017.07.006.

M. Zheng et al., “UFFDFR: Undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection for imbalanced data classification,” Inf. Sci., vol. 576, pp. 658–680, 2021, doi: 10.1016/j.ins.2021.07.053.




DOI: http://dx.doi.org/10.12962/j24068535.v22i1.a1208
