From 728c7482655f48c8e28d79635a0f6bf1d96844ea Mon Sep 17 00:00:00 2001 From: The Open Journals editorial robot <89919391+editorialbot@users.noreply.github.com> Date: Mon, 18 Mar 2024 12:28:00 +0000 Subject: [PATCH] Creating 10.21105.joss.06310.jats --- joss.06310/10.21105.joss.06310.jats | 748 ++++++++++++++++++++++++++++ 1 file changed, 748 insertions(+) create mode 100644 joss.06310/10.21105.joss.06310.jats diff --git a/joss.06310/10.21105.joss.06310.jats b/joss.06310/10.21105.joss.06310.jats new file mode 100644 index 0000000000..ce837ab966 --- /dev/null +++ b/joss.06310/10.21105.joss.06310.jats @@ -0,0 +1,748 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +6310 +10.21105/joss.06310 + +Imbalance: A comprehensive multi-interface Julia toolbox +to address class imbalance + + + +https://orcid.org/0009-0009-1198-7166 + +Wisam +Essam + + + + +https://orcid.org/0000-0001-6689-886X + +Blaom +Anthony + + + + + +Cairo University, Egypt + + + + +University of Auckland, New Zealand + + + + +17 +10 +2023 + +9 +95 +6310 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +machine learning +classification +class imbalance +resampling +oversampling +undersampling +julia + + + + + + Summary +

Given a set of observations that each belong to a certain class, + supervised classification aims to learn a classification model that + can predict the class of a new, unlabeled observation + (Cunningham + et al., 2008). This modeling process finds extensive + application in real-life scenarios, including but not limited to + medical diagnostics, recommendation systems, credit scoring, and + sentiment analysis.

+

In various real-world scenarios where supervised classification is + employed, such as those pertaining to the detection of particular + conditions like fraud, faults, pollution, or rare diseases, a severe + discrepancy between the number of observations in each class can + occur. This is known as class imbalance. It poses a problem when + assumptions inherent in the classification model cause its + performance to degrade when it is trained on imbalanced data, as is + commonly the case + (Ali + et al., 2015). Two prevalent strategies for mitigating class + imbalance, when it poses a problem to the classification model, + involve either increasing the representation of less frequently + occurring classes through oversampling or reducing instances of more + frequently occurring classes through undersampling. It may also be + possible to achieve even greater performance by combining both + approaches in a sequential pipeline + (Zeng + et al., 2016) or by undersampling the data multiple times and + training the classification model on each resampled dataset to form an + ensemble model that aggregates results from different model instances + (Liu + et al., 2009). Unlike undersampling, oversampling, or + their combination, the ensemble approach can + address class imbalance while making use of the entire dataset and + without generating synthetic data.
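To make the two basic strategies concrete, the following is a minimal sketch in plain Julia (standard library only) of naive random oversampling and random undersampling. The helper names `class_counts`, `naive_oversample`, and `naive_undersample` are hypothetical and introduced only for illustration; they are not functions of any toolbox, and the sketch assumes a matrix with one observation per row.

```julia
using Random

# Count how many observations fall in each class.
function class_counts(y::AbstractVector)
    counts = Dict{eltype(y),Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

# Hypothetical helper: repeat randomly chosen rows of each class (sampling with
# replacement) until every class is as large as the largest one.
function naive_oversample(X::AbstractMatrix, y::AbstractVector; rng=Random.default_rng())
    target = maximum(values(class_counts(y)))
    keep = collect(eachindex(y))                 # start from all original rows
    for label in unique(y)
        idx = findall(==(label), y)
        append!(keep, rand(rng, idx, target - length(idx)))
    end
    return X[keep, :], y[keep]
end

# Hypothetical helper: randomly keep only as many rows of each class as the
# smallest class has (sampling without replacement).
function naive_undersample(X::AbstractMatrix, y::AbstractVector; rng=Random.default_rng())
    target = minimum(values(class_counts(y)))
    keep = Int[]
    for label in unique(y)
        idx = findall(==(label), y)
        append!(keep, shuffle(rng, idx)[1:target])
    end
    return X[keep, :], y[keep]
end

rng = MersenneTwister(0)
X = rand(rng, 12, 2)                             # 12 observations, 2 features
y = vcat(fill("majority", 9), fill("minority", 3))

Xover, yover = naive_oversample(X, y; rng=rng)
Xunder, yunder = naive_undersample(X, y; rng=rng)
println(class_counts(yover))
println(class_counts(yunder))
```

Oversampling keeps every original row and adds repeats, whereas undersampling discards rows; this is the trade-off that the combination and ensemble approaches try to soften.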

+
+ + Statement of Need +

A substantial body of literature in the fields of machine learning + and statistics is devoted to addressing the class imbalance issue. + This predicament has often been aptly labeled the “curse of class + imbalance,” as noted in + (Picek + et al., 2018) and + (Kubát + & Matwin, 1997), a label that reflects the pervasive nature of + the issue across diverse real-world applications and its pronounced + severity: a classifier may incur an extraordinarily large performance + penalty in response to training on imbalanced data.

+

The literature encompasses a myriad of oversampling and + undersampling techniques to approach the class imbalance issue. These + include SMOTE + (Chawla + et al., 2002), which operates by generating synthetic examples + along the lines joining existing ones, and SMOTE-N and SMOTE-NC + (Chawla + et al., 2002), which are variants of SMOTE that can handle + categorical data. The sheer number of SMOTE variants makes them a body + of literature on their own. Notably, the most widely cited variant of + SMOTE is BorderlineSMOTE + (Han + et al., 2005). Other well-established oversampling techniques + include RWO + (Zhang + & Li, 2014) and ROSE + (Menardi + & Torelli, 2012), which operate by estimating probability + densities and sampling from them to generate synthetic points. On the + other hand, the literature also encompasses many undersampling + techniques. Cluster undersampling + (Lin + et al., 2016) and condensed nearest neighbors + (Hart, + 1968) are two prominent examples that attempt to reduce the + number of points while preserving the structure or classification + boundary of the data. Furthermore, methods that combine oversampling + and undersampling, such as SMOTETomek + (Zeng + et al., 2016), are also present. The motivation behind these + methods is that when undersampling is not random, it can filter out + noisy or irrelevant oversampled data. Lastly, resampling with ensemble + learning has also been presented in the literature, with EasyEnsemble + being the most well-known approach of that type + (Liu + et al., 2009).
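The idea of "generating synthetic examples along the lines joining existing ones" can be sketched in a few lines of plain Julia. This is a simplified illustration under stated assumptions (Euclidean distance, brute-force neighbor search, one observation per row), not the implementation used by any toolbox; the name `smote_sketch` and its parameters are hypothetical.

```julia
using Random, LinearAlgebra

# Hypothetical helper: generate `n` synthetic points from the rows of `Xmin`
# (all observations of one minority class) by SMOTE-style interpolation.
function smote_sketch(Xmin::AbstractMatrix{<:Real}, n::Integer; k::Integer=5,
                      rng=Random.default_rng())
    m, d = size(Xmin)
    m >= 2 || throw(ArgumentError("need at least two points to interpolate"))
    k = min(k, m - 1)                       # cannot use more neighbors than other points
    Xnew = zeros(Float64, n, d)
    for i in 1:n
        a = rand(rng, 1:m)                  # pick a random existing point
        dists = [norm(Xmin[a, :] - Xmin[j, :]) for j in 1:m]
        # skip the first entry, which is normally the point itself (distance 0)
        neighbors = sortperm(dists)[2:k+1]
        b = rand(rng, neighbors)            # pick one of its k nearest neighbors
        t = rand(rng)                       # position along the segment, in [0, 1)
        Xnew[i, :] = Xmin[a, :] + t * (Xmin[b, :] - Xmin[a, :])
    end
    return Xnew
end

rng = MersenneTwister(1)
Xmin = [0.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 1.0]    # four minority points on the unit square
Xnew = smote_sketch(Xmin, 10; k=2, rng=rng)
println(size(Xnew))
```

Because each synthetic point lies on a segment between two existing points, every generated point stays inside the convex hull of the minority class, here the unit square.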

+

The existence of a toolbox with techniques that harness this wealth + of research is imperative to the development of novel approaches to + the class imbalance problem and to machine learning research broadly. + Aside from addressing class imbalance in a general machine learning + research setting, such a toolbox can help in class imbalance research + settings by making it possible to juxtapose different methods, compose + them together, or form variants of them without having to reimplement + them from scratch. For prevalent programming languages such as Python, + a variety of such toolboxes already exist, including imbalanced-learn + (Lemaître + et al., 2016) and SMOTE-variants + (Kovács, + 2019). Meanwhile, Julia + (Bezanson + et al., 2017), a well-known programming language with over 40M + downloads + (Tuychiev, + 2023), has been lacking a similar toolbox to address the class + imbalance issue in general multi-class and heterogeneous data + settings. This has served as the primary motivation for the creation + of the Imbalance.jl toolbox, which we introduce + in the subsequent section.

+
+ + Imbalance.jl +

In this work, we present Imbalance.jl, a + software toolbox implemented in the Julia programming language that + offers over 10 well-established techniques that help address the class + imbalance issue. Additionally, we present a companion package, + MLJBalancing.jl, which: (i) facilitates the + inclusion of resampling methods in pipelines with classification + models via the BalancedModel construct; and + (ii) implements a general version of the EasyEnsemble algorithm + presented in + (Liu + et al., 2009).
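The ensemble idea (undersample several times, train one model per balanced subset, then aggregate) can be illustrated with the simplified sketch below. It is not the EasyEnsemble algorithm of Liu et al. nor the MLJBalancing.jl implementation: the base learner here is a toy nearest-centroid classifier chosen only to keep the example self-contained, aggregation is by plain majority vote, and all names (`CentroidModel`, `fit_ensemble`, `predict_ensemble`, and so on) are hypothetical.

```julia
using Random, Statistics

# Toy base classifier (an assumption for illustration): nearest class centroid.
struct CentroidModel{L}
    labels::Vector{L}
    centroids::Matrix{Float64}      # one row per label
end

function fit_centroids(X::AbstractMatrix, y::AbstractVector)
    labels = unique(y)
    rows = [mean(X[findall(==(l), y), :], dims=1) for l in labels]
    return CentroidModel(labels, Matrix{Float64}(reduce(vcat, rows)))
end

function predict_one(model::CentroidModel, x::AbstractVector)
    dists = [sum(abs2, x .- model.centroids[i, :]) for i in eachindex(model.labels)]
    return model.labels[argmin(dists)]
end

# Balanced random undersample: keep as many rows of each class as the smallest class has.
function balanced_subset(y::AbstractVector, rng)
    target = minimum(count(==(l), y) for l in unique(y))
    return reduce(vcat, [shuffle(rng, findall(==(l), y))[1:target] for l in unique(y)])
end

# Hypothetical helper: train `T` models, each on a different balanced subset.
function fit_ensemble(X, y; T=5, rng=Random.default_rng())
    return map(1:T) do _
        idx = balanced_subset(y, rng)
        fit_centroids(X[idx, :], y[idx])
    end
end

# Aggregate the members' predictions by majority vote.
function predict_ensemble(models, x)
    votes = [predict_one(m, x) for m in models]
    labels = unique(votes)
    tallies = [count(==(l), votes) for l in labels]
    return labels[argmax(tallies)]
end

rng = MersenneTwister(3)
X = vcat(randn(rng, 90, 2), randn(rng, 10, 2) .+ 5.0)   # majority near (0, 0), minority near (5, 5)
y = vcat(fill("common", 90), fill("rare", 10))
models = fit_ensemble(X, y; T=5, rng=rng)
println(predict_ensemble(models, [5.0, 5.0]))
```

Each member sees only a balanced subset, but across the `T` members most majority observations are likely to be used at least once, which is how this family of methods makes use of the whole dataset without synthesizing points.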

+

The toolbox offers a pure functional interface for each method + implemented. For example, SMOTE can be used in + the following fashion:

+ Xover, yover = smote(X, y) +

Here Xover, yover are + X, y after oversampling.
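For context, a slightly fuller call might look like the sketch below. It assumes that Imbalance.jl is installed, that a matrix input stores one observation per row, and that `smote` accepts the keyword arguments `k` and `rng` alongside `ratios`, as recalled from the package documentation. These details are not stated in this paper and should be checked against the documentation before use.

```julia
using Imbalance          # assumes the package is installed, e.g. via `] add Imbalance`
using Random

rng = MersenneTwister(42)
X = rand(rng, 100, 3)                        # 100 observations, 3 continuous features
y = vcat(fill("no", 80), fill("yes", 20))    # "yes" is the minority class

# Keyword names `k` and `rng` are assumptions recalled from the package docs.
Xover, yover = smote(X, y; k=5, ratios=1.0, rng=42)

println(size(Xover, 1), " observations after oversampling")
```

The original observations are expected to be retained, with the synthetic ones added to them, so the output should contain at least as many rows as the input.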

+

A ratios hyperparameter or similar is always + present to control the degree of oversampling or undersampling to be + done for each class. All hyperparameters for a resampling method have + default values that can be overridden.
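As a rough illustration of how a per-class ratio could translate into the number of points to generate, the sketch below assumes that a ratio is interpreted relative to the size of the largest class and that oversampling never removes points. This interpretation is an assumption made here for illustration, and the helper `points_to_add` is hypothetical.

```julia
# Hypothetical helper: number of points to add so that class `c` reaches
# `ratios[c]` times the size of the largest class (never a negative amount).
function points_to_add(counts::Dict, ratios::Dict)
    majority = maximum(values(counts))
    return Dict(c => max(0, round(Int, ratios[c] * majority) - n) for (c, n) in counts)
end

counts = Dict("a" => 100, "b" => 40, "c" => 10)     # current class sizes
ratios = Dict("a" => 1.0, "b" => 0.9, "c" => 0.8)   # desired size relative to the majority
extra = points_to_add(counts, ratios)
println(extra)
```

Under these assumptions, class "b" would grow from 40 to 90 observations and class "c" from 10 to 80, while the majority class "a" is left unchanged.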

+

The resampling techniques implemented in either + Imbalance.jl or + MLJBalancing.jl are shown in the table below. + Note that although no combination resampling techniques are explicitly + presented, they are easy to form using the + BalancedModel wrapper found in + MLJBalancing.jl, which can wrap an arbitrary + number of resamplers in sequence.

+ + +

Resampling techniques implemented in + Imbalance.jl and + MLJBalancing.jl.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Technique | Type | Supported Data Types
BalancedBaggingClassifier | Ensemble | Continuous and/or nominal
Borderline SMOTE1 | Oversampling | Continuous
Cluster Undersampler | Undersampling | Continuous
Edited Nearest Neighbors Undersampler | Undersampling | Continuous
Random Oversampler | Oversampling | Continuous and/or nominal
Random Undersampler | Undersampling | Continuous and/or nominal
Random Walk Oversampler | Oversampling | Continuous and/or nominal
ROSE | Oversampling | Continuous
SMOTE | Oversampling | Continuous
SMOTE-N | Oversampling | Nominal
SMOTE-NC | Oversampling | Continuous and nominal
Tomek Links Undersampler | Undersampling | Continuous
+
+ + Imbalance.jl Design Principles +

The toolbox implementation follows a specific set of design + principles in terms of the implemented techniques, interface + support, developer experience and testing, and user experience.

+ + Implemented Techniques + + +

Should support all four major types of resampling + approaches (oversampling, undersampling, combination, + ensemble)

+
+ +

Should be generally compatible with multi-class + settings

+
+ +

Should offer solutions to heterogeneous data settings + (continuous and nominal data)

+
+ +

When possible, preference should be given to techniques + that are more common in the literature or industry

+
+
+

Methods implemented in the Imbalance.jl + toolbox meet all of the aforementioned design principles for the + implemented techniques. The one-vs-rest scheme proposed in + (Fernández + et al., 2013) was used to generalize binary techniques to + multi-class settings when needed.

+
+ + Interface Support + + +

Should support both matrix and table type inputs

+
+ +

Target variable may or may not be given as a separate + column

+
+ +

Should expose a pure functional implementation, but also + support popular Julia machine learning interfaces

+
+ +

Should be possible to wrap an arbitrary number of resampler + models with a classification model to behave as a unified + model

+
+
+

Methods implemented in the Imbalance.jl + toolbox meet all the interface design principles above. In + particular, the toolbox implements the MLJ + (Blaom + et al., 2020) and TableTransforms + interfaces for each method. BalancedModel + from MLJBalancing.jl also allows fusing an + arbitrary number of resampling models and a classifier together to + behave as one unified model.

+
+ + Developer Experience and Testing + + +

There should exist a developer guide to encourage and guide + contribution

+
+ +

Functions should be implemented in smaller units to aid in + testing

+
+ +

Testing coverage should be maximized; even the most basic + functions should be tested

+
+ +

Features commonly used by multiple resampling techniques + should be implemented in a single function and reused

+
+ +

Should document all functions, including internal ones

+
+ +

Comments should be included to justify or simplify written + implementations when needed

+
+
+

This set of design principles is also satisfied by + Imbalance.jl. Implemented techniques are + tested by testing smaller units that form them. Aside from that, + end-to-end tests are performed for each technique by testing + properties and characteristics of the technique or by using the + imbalanced-learn toolbox + (Lemaître + et al., 2016) from Python and comparing outputs.

+
+ + User Experience + + +

Functional documentation should be comprehensive and + clear

+
+ +

Examples (with shown output) that work after copy-pasting + should accompany each method

+
+ +

An illustrative visual example that presents a plot or + animation should preferably accompany each method

+
+ +

A practical example that uses the method with real data + should preferably accompany each method

+
+ +

If an implemented method lacks an online explanation, an + article that explains the method should preferably be + written after it is implemented

+
+
+

The Imbalance.jl documentation indeed + satisfies this set of design principles. Methods are each + associated with an example that can be copy-pasted, a visual + example that demonstrates the operation of the technique, and + possibly, an example that utilizes it with a real-world dataset to + improve the performance of a classification model.

+
+
+ + Author Contributions +

Design: E. Wisam, A. Blaom. Implementation, tests and + documentation: E. Wisam. Code and documentation review: A. Blaom. + The authors would like to acknowledge the financial support provided + by the Google Summer of Code program, which made this project + possible.

+
+
+ Bezanson, Jeff; Edelman, Alan; Karpinski, Stefan; Shah, Viral B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65-98. Society for Industrial & Applied Mathematics (SIAM). https://doi.org/10.1137%2F141000671. DOI: 10.1137/141000671
+ Cunningham, Pádraig; Cord, Matthieu; Delany, Sarah Jane (2008). Supervised learning. In Cord, Matthieu; Cunningham, Pádraig (Eds.), Machine learning techniques for multimedia: Case studies on organization and retrieval, 21-49. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-540-75171-7. https://doi.org/10.1007/978-3-540-75171-7_2. DOI: 10.1007/978-3-540-75171-7_2
+ Ali, Aida; Shamsuddin, Siti Mariyam Hj.; Ralescu, Anca L. (2015). Classification with class imbalance problem: A review. Soft computing models in industrial and environmental applications. https://api.semanticscholar.org/CorpusID:26644563
+ Zeng, Min; Zou, Beiji; Wei, Faran; Liu, Xiyao; Wang, Lei (2016). Effective prediction of three common diseases by combining SMOTE with tomek links technique for imbalanced medical data. 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), 225-228. https://api.semanticscholar.org/CorpusID:25184489. DOI: 10.1109/ICOACS.2016.7563084
+ Liu, Xu-Ying; Wu, Jianxin; Zhou, Zhi-Hua (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39, 539-550. https://api.semanticscholar.org/CorpusID:62808464. DOI: 10.1109/TSMCB.2008.2007853
+ Picek, Stjepan; Heuser, Annelie; Jović, Alan; Bhasin, Shivam; Regazzoni, Francesco (2018). The curse of class imbalance and conflicting metrics with machine learning for side-channel evaluations. IACR Trans. Cryptogr. Hardw. Embed. Syst., 2019, 209-237. https://api.semanticscholar.org/CorpusID:44136202. DOI: 10.13154/tches.v2019.i1.209-237
+ Kubát, Miroslav; Matwin, Stan (1997). Addressing the curse of imbalanced training sets: One-sided selection. International conference on machine learning. https://api.semanticscholar.org/CorpusID:18370956
+ Chawla, N.; Bowyer, K.; Hall, Lawrence O.; Kegelmeyer, W. Philip (2002). SMOTE: Synthetic minority over-sampling technique. ArXiv, abs/1106.1813. https://api.semanticscholar.org/CorpusID:1554582. DOI: 10.1613/jair.953
+ Han, Hui; Wang, Wenyuan; Mao, Binghuan (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. International conference on intelligent computing. https://api.semanticscholar.org/CorpusID:12126950. DOI: 10.1007/11538059_91
+ Zhang, Huaxiang; Li, Mingfang (2014). RWO-sampling: A random walk over-sampling approach to imbalanced data classification. Inf. Fusion, 20, 99-116. https://api.semanticscholar.org/CorpusID:205432428. DOI: 10.1016/j.inffus.2013.12.003
+ Menardi, Giovanna; Torelli, Nicola (2012). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28, 92-122. https://api.semanticscholar.org/CorpusID:18164904. DOI: 10.1007/s10618-012-0295-5
+ Lin, Wei-Chao; Tsai, Chih-Fong; Hu, Ya-Han; Jhang, Jing-Shang (2016). Clustering-based undersampling in class-imbalanced data. Inf. Sci., 409, 17-26. https://api.semanticscholar.org/CorpusID:424467. DOI: 10.1016/j.ins.2017.05.008
+ Hart, Peter E. (1968). The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory, 14, 515-516. https://api.semanticscholar.org/CorpusID:206729609. DOI: 10.1109/TIT.1968.1054155
+ Lemaître, Guillaume; Nogueira, Fernando; Aridas, Christos K. (2016). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. ArXiv, abs/1609.06570. https://api.semanticscholar.org/CorpusID:1426815
+ Kovács, György (2019). Smote-variants: A Python implementation of 85 minority oversampling techniques. Neurocomputing, 366, 352-354. ISSN 0925-2312. https://www.sciencedirect.com/science/article/pii/S0925231219311622. DOI: 10.1016/j.neucom.2019.06.100
+ Tuychiev, Bekhruz (2023). The rise of Julia. https://www.datacamp.com/blog/the-rise-of-julia-is-it-worth-learning-in-2022
+ Fernández, Alberto; López, Victoria; Galar, Mikel; Jesús, María José del; Herrera, Francisco (2013). Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl. Based Syst., 42, 97-110. https://api.semanticscholar.org/CorpusID:131286. DOI: 10.1016/J.KNOSYS.2013.01.018
+ Blaom, Anthony D.; Király, Franz J.; Lienart, Thibaut; Simillides, Yiannis; Arenas, Diego; Vollmer, Sebastian J. (2020). MLJ: A julia package for composable machine learning. J. Open Source Softw., 5, 2704. https://api.semanticscholar.org/CorpusID:220768685. DOI: 10.21105/joss.02704