Based on a previous rule-based text classification model, a hybrid multilabel classifier was developed to assign topic labels to a dataset of rock news headlines, with the aim of exploring this variant of the classification problem and enhancing its accuracy. This repository presents the steps taken to develop the multilabel classification task. Several classifiers were tested, including those following the problem transformation approach and scikit-learn's MultiOutputClassifier. In summary, the results demonstrated that multioutput algorithms outperformed problem transformation algorithms, achieving significantly higher Micro-average F1-scores, with tree-based models and ensemble methods showing inherent robustness to imbalanced datasets.
- The dataset contains 20,000 headlines, and the average number of labels per headline stands at 1.45 (see Table 1).
- 36 predefined labels were derived from the rule-based text classification model (see Table 1).
- The number of labels to which a headline can be assigned ranges from 1 to 7 (see Figure 1).
- Two-thirds of the headlines are assigned to a single topic label, while nearly one-fourth are tagged with two topic labels (see Figure 1).
- The cumulative percentage of headlines assigned to more than three labels is not significant (see Figure 1).
- The text corpus shows high imbalance (see Figure 2).
- Nearly one-third of the headlines (6,397 out of 20,000) are tagged with the class 'diverse', indicating topics other than the 35 predefined labels in this classification task (see Figure 2).
- Core topic labels include: 'announce', 'release', 'album', 'tour', 'song', 'show', 'watch', 'video', 'single', 'death', 'play' and 'cover' (see Figure 2).
- Most topic labels tend to co-occur with exactly one other label, rather than appearing alone or alongside multiple labels (see Figure 2).
- Exceptions to this pattern include 'song', 'death' and 'cover', which tend to appear as single labels, and 'video' and 'single', which are more often associated with multiple labels (see Figure 2).
- Strong correlations are observed among pairs of labels such as ['tour', 'announce'], ['album', 'announce'], ['album', 'release'], ['single', 'release'] and ['video', 'release'] (see Figure 3).
Table 1. Dataset descriptive statistics
Figure 1. Distribution of the number of topic labels
Figure 2. Frequency distribution of topic labels and respective co-occurrence
Figure 3. Co-occurrence of topic labels
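The co-occurrence counts behind Figures 2 and 3 can be derived directly from the binary label matrix. A minimal sketch with pandas, using a toy matrix and three illustrative label names in place of the real 36-label indicator matrix:

```python
import pandas as pd

# Toy label matrix (rows = headlines, columns = topic labels).
# The column names are illustrative stand-ins for the real 36 labels.
labels = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [1, 0, 0]],
    columns=["announce", "tour", "album"],
)

# The Gram matrix of the label indicators: the diagonal holds each label's
# frequency, and off-diagonal cells count pairwise co-occurrences.
co_occurrence = labels.T @ labels
print(co_occurrence)
```

The same matrix, normalized or masked to its off-diagonal entries, is what a co-occurrence heatmap such as Figure 3 visualizes.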
- The multilabel classification task was built upon a rule-based text classification model designed to identify keywords and assign both topic labels and publication type categories. Details about the rule-based text classification model can be found here. The keywords generated by the manual rule-based model served as the foundation for assigning topic labels to headlines: instead of using the derived topic labels directly, the multilabel classifier relies on the identified keywords.
- To ensure a "well-balanced distribution of label relations", an iterative stratification technique was implemented to split the dataset into training and testing sets, as proposed by Szymański and Kajdanowicz (2016). The test size was set at 0.2.
- Despite the potential class imbalance, no re-sampling or re-weighting methods were adopted, as they tend to "result in oversampling of common labels" (Huang et al., 2021).
- Various classifiers and estimators were evaluated using both Problem Transformation and MultiOutputClassifier approaches.
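The two approaches can be compared side by side in scikit-learn. A hedged sketch on synthetic data, with Logistic Regression as an illustrative base estimator (the real pipeline uses the project's own features and estimators):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Synthetic multilabel data standing in for the headline features/labels.
X, y = make_multilabel_classification(n_samples=500, n_classes=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base = LogisticRegression(max_iter=1000)
models = {
    "Binary Relevance (OneVsRest)": OneVsRestClassifier(base),  # problem transformation
    "Classifier Chain": ClassifierChain(base),                  # problem transformation
    "MultiOutputClassifier": MultiOutputClassifier(base),       # multioutput
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te), average="micro")
    print(f"{name}: micro-F1 = {scores[name]:.3f}")
```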
- Emphasis was placed on inherently robust algorithms for imbalanced datasets, particularly tree-based and ensemble methods (Ganganwar, 2012; Mahani, 2022; Mulugeta et al., 2023).
- Hyperparameter optimization was conducted using Grid Search for base models showing high performance.
- To mitigate overfitting during the tuning of Logistic Regression (Jurafsky & Martin, 2024), an initial grid was set for the regularization strength parameter ('C') with values of [0.01, 0.001, 0.0001, 0.00001]. These small values resulted in technical issues. Given this constraint, a new grid with values of [0.1, 0.25, 0.5] was tested, but the technical issues persisted, specifically when Classifier Chain was combined with Logistic Regression. Only when the grid values were set to >= 1 did the technical issues subside. Furthermore, this grid yielded no improvement in model performance for Binary Relevance combined with Logistic Regression.
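A sketch of this grid search for the Binary Relevance + Logistic Regression setting, on synthetic data and with the C >= 1 grid that avoided the issues described above (the exact grid values beyond those quoted in the text are illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the headline dataset.
X, y = make_multilabel_classification(n_samples=300, n_classes=5, random_state=0)

# Binary Relevance wrapper around Logistic Regression; the nested parameter
# is addressed through the wrapper with the "estimator__" prefix.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
param_grid = {"estimator__C": [1, 5, 10]}  # values >= 1, per the findings above

grid = GridSearchCV(clf, param_grid, scoring="f1_micro", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```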
- Several methodologies were tested to optimize the hyperparameters of Gradient Boosting, including Grid search and Randomized search on a balanced subset of around 6,000 records, Random over-sampling, Random under-sampling and Class weighting. Despite these efforts, the results were not satisfactory. The inherent characteristics of the dataset, in which 48% (435 out of 906) of the label combinations consist of a single sample, coupled with the complexity of the Gradient Boosting algorithm and its multitude of parameters (Guan et al., 2023), led to unsuccessful optimization. The limited size of the training dataset often poses a significant challenge in machine learning optimization, suggesting that expanding the text corpus with additional news headlines could provide the model with more diverse examples to learn from.
- In addition, a cost-sensitive learning experiment was carried out by adjusting the "class_weight" parameter of a tree-based classifier. However, no significant impact on the model's performance was observed.
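This kind of cost-sensitive experiment can be reproduced by toggling "class_weight" on the base estimator. A minimal sketch on synthetic data, with a Decision Tree as the tree-based classifier (the actual classifier and features are the project's own):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the headline dataset.
X, y = make_multilabel_classification(n_samples=400, n_classes=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

scores = {}
for weight in (None, "balanced"):  # "balanced" re-weights rare classes inversely to frequency
    clf = MultiOutputClassifier(DecisionTreeClassifier(class_weight=weight, random_state=1))
    clf.fit(X_tr, y_tr)
    scores[weight] = f1_score(y_te, clf.predict(X_te), average="micro")
    print(f"class_weight={weight}: micro-F1 = {scores[weight]:.3f}")
```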
- After optimization, the selected base models were fine-tuned with the following hyperparameters: a) Logistic Regression ("C": 0.5; "penalty": "l2"; "solver": "sag"; "max_iter": 1000); b) Decision Tree ("criterion": "gini"; "max_depth": None; "max_leaf_nodes": None; "min_samples_split": 2); c) Random Forest ("bootstrap": False; "max_depth": None; "max_features": None; "max_leaf_nodes": None; "n_estimators": 50); d) AdaBoost ("algorithm": "SAMME.R"; "learning_rate": 1.04; "n_estimators": 50); e) Extra Trees ("criterion": "gini"; "n_estimators": 20).
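The tuned base models listed above translate directly into scikit-learn estimators, as sketched below. Note that the tuned AdaBoost "algorithm" value, "SAMME.R", is omitted here because it was removed in recent scikit-learn releases:

```python
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

tuned_models = {
    "Logistic Regression": LogisticRegression(
        C=0.5, penalty="l2", solver="sag", max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(
        criterion="gini", max_depth=None, max_leaf_nodes=None, min_samples_split=2),
    "Random Forest": RandomForestClassifier(
        bootstrap=False, max_depth=None, max_features=None,
        max_leaf_nodes=None, n_estimators=50),
    # The tuned algorithm="SAMME.R" is left out: it is removed in scikit-learn >= 1.6.
    "AdaBoost": AdaBoostClassifier(learning_rate=1.04, n_estimators=50),
    "Extra Trees": ExtraTreesClassifier(criterion="gini", n_estimators=20),
}

for name, model in tuned_models.items():
    print(name, type(model).__name__)
```

Each of these base estimators is then wrapped in a multilabel strategy (e.g. MultiOutputClassifier) before fitting.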
- To assess model performance, evaluation metrics that are more informative for imbalanced datasets, such as the Micro-average F1-score, were employed. This metric aggregates the contributions of "all the units together, without taking into consideration possible differences between classes" (Grandini et al., 2020).
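Concretely, micro-averaging pools true positives, false positives and false negatives across all labels before computing a single F1 value. A toy example with scikit-learn's f1_score:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy predictions for 3 headlines over 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Pooled over all cells: TP=3, FP=0, FN=2 -> precision=1.0, recall=0.6.
score = f1_score(y_true, y_pred, average="micro")
print(score)  # → 0.75
```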
- Multioutput algorithms showed significantly higher Micro-average F1-score values compared to problem transformation algorithms (see Table 2).
- In agreement with findings in academic literature (Ganganwar, 2012; Mahani, 2022; Mulugeta et al., 2023), tree-based models (Decision Tree) and ensemble methods (Random Forest, Extra Trees, AdaBoost, Gradient Boosting) demonstrated inherent robustness for imbalanced datasets, outperforming other algorithms as indicated by Micro-average F1-score (see Table 2).
- AdaBoost stood out as the top performer, showing a Micro-average F1-score of 0.989 (see Table 2).
- Following hyperparameter tuning, a very slight performance enhancement was observed for Random Forest (see Figure 4).
Table 2. Evaluation metrics by classifier
Figure 4. Tuned models vs. Base models: performance evaluation
- Ganganwar, V. (2012) An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering. ISSN 2250-2459, Volume 2, Issue 4.
- Grandini, M., Bagli, E., Visani, G. (2020) Metrics for multi-class classification: an overview.
- Guan, H., Xiao, Y., Li, J., Liu, Y., Bai, G. (2023) A Comprehensive Study of Real-World Bugs in Machine Learning Model Optimization.
- Huang, Y., Giledereli, B., Köksal, A., Özgür, A., Ozkirimli, E. (2021) Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution.
- Jurafsky, D., Martin, J. (2024) Speech and Language Processing.
- Mahani, A. (2022) Classification in Multi-Label Datasets in Information Systems Management.
- Mulugeta, G., Zewotir, T., Tegegne, A., Juhar, L., Muleta, M. (2023) Classification of imbalanced data using machine learning algorithms to predict the risk of renal graft failures in Ethiopia. BMC Medical Informatics and Decision Making, Vol. 23, Nr. 98.
- Szymański, P., Kajdanowicz, T. (2016) A scikit-based Python environment for performing multi-label classification. Journal of Machine Learning Research, 1, 1-15.