diff --git a/joss.05873/10.21105.joss.05873.jats b/joss.05873/10.21105.joss.05873.jats new file mode 100644 index 0000000000..a01992574e --- /dev/null +++ b/joss.05873/10.21105.joss.05873.jats @@ -0,0 +1,871 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +5873 +10.21105/joss.05873 + +pudu: A Python library for agnostic feature selection and +explainability of Machine Learning spectroscopic +problems + + + +https://orcid.org/0000-0002-8357-5824 + +Grau-Luque +Enric + + + + +https://orcid.org/0000-0002-7087-6097 + +Becerril-Romero +Ignacio + + + + +https://orcid.org/0000-0002-3634-1355 + +Perez-Rodriguez +Alejandro + + + + + +https://orcid.org/0000-0002-2072-9566 + +Guc +Maxim + + + + +https://orcid.org/0000-0002-5502-3133 + +Izquierdo-Roca +Victor + + + + + +Catalonia Institute for Energy Research (IREC), Jardins de +les Dones de Negre 1, 08930 Sant Adrià de Besòs, Spain. + + + + +Departament d’Enginyeria Electrònica i Biomèdica, IN2UB, +Universitat de Barcelona, C/ Martí i Franqués 1, 08028 Barcelona, +Spain. + + + + +30 +6 +2023 + +8 +92 +5873 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Python +Spectroscopy +Machine Learning +Explainability and interpretability +Classification and regression + + + + + + Statement of need +

Spectroscopic techniques (e.g. Raman, photoluminescence, reflectance, transmittance, X-ray fluorescence) are an important and widely used resource in many fields of science, such as photovoltaics (Fonoll-Rubio et al., 2022; Grau-Luque et al., 2021), cancer research (Bellisola & Sorio, 2012), superconductors (Fischer et al., 2007), polymers (Easton et al., 2020), corrosion (Haruna et al., 2023), forensics (P. V. Bhatt & Rawtani, 2023), and environmental sciences (Estefany et al., 2023), to name just a few. This is because they are versatile, non-destructive, and fast to acquire, and they give access to a wide range of material properties, such as composition, morphology, molecular structure, and optical and electronic properties. Machine learning (ML) has therefore been used for several years to analyze spectral data, helping to untangle their complexity and to uncover further information contained within them (Goodacre, 2003; Luo et al., 2022). Unfortunately, most of these ML analyses do not interpret the derived results further, owing to the complex nature of such algorithms. Interpreting the results of ML algorithms has thus become an increasingly important topic, as concerns about the lack of interpretability of these models have grown (Burkart & Huber, 2021). In the natural sciences (such as materials science, physics, and chemistry), this concern has gained special interest as ML becomes more common: results obtained from ML analyses may lack scientific value if they cannot be properly interpreted, which can affect scientific consistency and strongly diminish the significance of, and confidence in, the results (Roscher et al., 2020).

+

Even though methods and libraries exist for explaining different types of algorithms, such as SHAP (Lundberg et al., 2017), LIME (Ribeiro et al., 2016), and GradCAM (Selvaraju et al., 2017), they can be difficult to interpret or understand even for data scientists, leading to problems such as misinterpretation, misuse, and over-trust (Kaur et al., n.d.). Combined with other human-related issues (Krishna et al., 2022), this means that researchers with expertise in the natural sciences but little or no data science background may face further difficulties when using such methodologies (Zhong et al., 2022). Furthermore, these libraries normally target problems composed of image, text, or tabular data, which cannot be mapped onto spectroscopic data in a straightforward way. Time series (TS) data, on the other hand, share similarities with spectroscopic data, and although they have their own specific needs and differences, TS-dedicated tools can be a better fit. Unfortunately, although several libraries that explain models for TS could potentially be applied to spectroscopic data, they are mostly designed for a specialized audience, and many are model-specific (Rojat et al., 2021). Moreover, spectral data normally manifest as an array of peaks that typically overlap and can be distinguished by their shape, intensity, and position. Minor shifts in these patterns can indicate significant alterations in the fundamental properties of the material under study. Conversely, pronounced variations may in other cases indicate only negligible differences. Comprehending such alterations and their implications is therefore paramount, and this remains true in ML-based spectroscopic analysis, where spectral variations are still of primary concern. 
In this context, a tool that is easy to understand, friendly to beginners and non-specialists, and offers spectroscopy-oriented functionality for targeting specific patterns, regions, and variations is of high interest. Such a tool can help the different stakeholders to better understand the ML models that they employ and considerably increase the transparency, comprehensibility, and scientific impact of ML results (U. Bhatt et al., 2020; Belle & Papantonis, 2021).

+
+ + Overview +

pudu is a Python library that quantifies the effect of changes in spectral features on the predictions of ML models and on the target instances. In other words, it perturbs the features in a predictable and deliberate way and evaluates them according to how the final prediction changes. Four main methods are included and defined for this purpose. Importance quantifies the relevance of the features according to the changes in the prediction; it is therefore measured as a difference in probability for classification problems or in target value for regression problems. Speed quantifies how fast a prediction changes with perturbations in the features: the importance is calculated at different perturbation levels, a line is fitted to the obtained values, and its slope, i.e. the rate of change of importance, is taken as the speed. Synergy indicates how features complement each other in terms of prediction change after perturbations. Finally, re-activations counts the units in a Convolutional Neural Network (CNN) whose value, after perturbation, rises above the original activation criterion. This last method is applicable only to CNNs, whereas the others can be applied to any ML problem, including CNNs. For more detail on how these techniques work, please refer to the definitions in the documentation.
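The importance and speed definitions above can be illustrated with a minimal sketch. This is not pudu's API: the function names `importance` and `speed`, the multiplicative perturbation, the perturbation levels, and the toy linear model are all assumptions made for illustration only.

```python
import numpy as np

def importance(predict, x, start, stop, delta=0.1):
    """Prediction change when x[start:stop] is scaled by (1 + delta).

    Hypothetical helper: a multiplicative perturbation is assumed here.
    """
    perturbed = x.copy()
    perturbed[start:stop] *= 1.0 + delta
    return predict(perturbed) - predict(x)

def speed(predict, x, start, stop, deltas=(0.05, 0.10, 0.15, 0.20)):
    """Slope of a line fitted to importance versus perturbation level."""
    values = [importance(predict, x, start, stop, d) for d in deltas]
    slope, _intercept = np.polyfit(deltas, values, 1)
    return slope

# Toy regression "model" (an assumption): only channels 10..19 contribute.
weights = np.zeros(50)
weights[10:20] = 1.0

def predict(spectrum):
    return float(weights @ spectrum)

spectrum = np.ones(50)
imp_inside = importance(predict, spectrum, 10, 20)   # region the model uses
imp_outside = importance(predict, spectrum, 30, 40)  # region the model ignores
spd_inside = speed(predict, spectrum, 10, 20)
```

In this toy setting, the region the model relies on yields a non-zero importance and a non-zero speed, while the ignored region yields zero; for a classifier, `predict` would return a class probability instead of a target value.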

+

pudu is versatile: it can analyze classification and regression algorithms for both 1- and 2-dimensional problems, offers plenty of flexibility through its parameters, and can provide localized explanations by selecting specific areas of interest. To illustrate this, Figure 1 shows two analyses that use the same importance method with different parameters. Its other functionalities are shown in examples in the documentation that use scikit-learn (Pedregosa et al., 2011), keras (Chollet et al., 2018), and localreg (Marholm, 2022), along with XAI methods including LIME and GradCAM.
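The difference between a sequential analysis over the whole spectrum and a localized one over selected areas can be sketched as follows. The helpers `sequential_windows` and `importance_map`, the multiplicative perturbation, and the toy model are hypothetical and are not taken from pudu.

```python
import numpy as np

def sequential_windows(n_features, window):
    """Consecutive (start, stop) index pairs covering range(n_features)."""
    return [(s, min(s + window, n_features))
            for s in range(0, n_features, window)]

def importance_map(predict, x, ranges, delta=0.1):
    """Prediction change for each (start, stop) range, perturbed one at a time."""
    base = predict(x)
    changes = []
    for start, stop in ranges:
        perturbed = x.copy()
        perturbed[start:stop] *= 1.0 + delta  # assumed perturbation scheme
        changes.append(predict(perturbed) - base)
    return np.array(changes)

# Toy regression "model" (an assumption): only channels 50..74 contribute.
weights = np.zeros(100)
weights[50:75] = 2.0

def predict(spectrum):
    return float(weights @ spectrum)

x = np.ones(100)

# Global view: perturb every window of 25 channels in sequence.
windows = sequential_windows(100, 25)
global_map = importance_map(predict, x, windows)

# Localized view: perturb only hand-picked areas of interest.
local_map = importance_map(predict, x, [(50, 60), (80, 90)])
```

Passing an explicit list of ranges instead of the sequential windows restricts the analysis to the selected areas, which is useful when one dominant region would otherwise mask the contribution of the others.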

+

pudu is built in Python 3 (Van Rossum & Drake, 2009) and uses third-party packages including numpy (Harris et al., 2020), matplotlib (Caswell et al., 2021), and keras. It is available on both PyPI and conda, and comes with complete documentation, including a quick start, examples, and contribution guidelines. Source code and documentation are available at https://github.com/pudu-py/pudu.

+ +

Two ways of using the same importance method: A) using a sequential change pattern over all the spectral features and B) selecting peaks of interest. These spectra were measured on thin-film photovoltaic samples and are correlated with their performance using ML, as explained in (Fonoll-Rubio et al., 2022). In A), the vector was divided into windows of 25 pixels that were perturbed individually. The impact of the peak in the 1200-1400 range overshadows the impact of the rest. In contrast, in B) specific ranges are defined, so that only the first four main peaks are analyzed and their impact on the prediction is better visualized.

+ +
+
+ + Acknowledgements +

Co-funded by the European Union (GA Nº 101058459 Platform-ZERO). + Views and opinions expressed are however those of the authors only and + do not necessarily reflect those of the European Union (EU) or + European Health and Digital Executive Agency (HADEA). Neither the EU + nor the granting authority can be held responsible for them. This + project has received funding from the EU’s Horizon 2020 research and + innovation programme under Marie Skłodowska-Curie GA Nº 801342 + (Tecniospring INDUSTRY) and the Government of Catalonia’s Agency for + Business Competitiveness (ACCIÓ). This work has received funding from + the EU’s Horizon 2020 Research and Innovation Programme under GA Nº + 958243 (SUNRISE project). Authors from IREC belong to the MNT-Solar + Consolidated Research Group of the “Generalitat de Catalunya” (ref. + 2021 SGR 01286) and are grateful to European Regional Development + Funds (ERDF, FEDER Programa Competitivitat de Catalunya + 2007–2013).

+
+ + Author contributions with + <ext-link ext-link-type="uri" xlink:href="https://credit.niso.org/">CRediT</ext-link> + + +

Enric Grau-Luque: Conceptualization, Data curation, Software, + Writing – original draft

+
+ +

Ignacio Becerril-Romero: Investigation, Methodology, Writing – review & editing

+
+ +

Alejandro Perez-Rodriguez: Funding acquisition, Project + administration, Resources, Supervision

+
+ +

Maxim Guc: Formal analysis, Validation, Methodology, Writing – review & editing

+
+ +

Victor Izquierdo-Roca: Funding acquisition, Project + administration, Supervision

+
+
+
+ + + + + + + GoodacreRoyston + + Explanatory analysis of spectroscopic data using machine learning of simple, interpretable rules + Vibrational Spectroscopy + Elsevier + 200308 + 32 + 1 + 0924-2031 + 10.1016/S0924-2031(03)00045-6 + 33 + 45 + + + + + + LuoRuihao + PoppJuergen + BocklitzThomas + + Deep Learning for Raman Spectroscopy: A Review + Analytica + 2022 + 3 + 3 + 10.3390/analytica3030020 + 287 + 301 + + + + + + EastonChristopher D. + KinnearCalum + McArthurSally L. + GengenbachThomas R. + + Practical guides for x-ray photoelectron spectroscopy: Analysis of polymers + Journal of Vacuum Science & Technology A: Vacuum, Surfaces, and Films + American Vacuum Society + 202003 + 38 + 2 + 0734-2101 + https://pubs.aip.org/avs/jva/article-abstract/38/2/023207/247679/Practical-guides-for-x-ray-photoelectron?redirectedFrom=fulltext + 10.1116/1.5140587 + + + + + + HarunaK. + ObotI. B. + SalehT. A. + + Infrared Spectroscopy in Corrosion Research + Corrosion Science + Apple Academic Press + New York + 202310 + 9781003328513 + https://www.taylorfrancis.com/chapters/edit/10.1201/9781003328513-9/infrared-spectroscopy-corrosion-research-haruna-obot-saleh + 10.1201/9781003328513-9 + 261 + 289 + + + + + + EstefanyCedeño + SunZhenli + HongZijin + DuJingjing + + Raman spectroscopy for profiling physical and chemical properties of atmospheric aerosol particles: A review + Ecotoxicology and Environmental Safety + Academic Press + 202301 + 249 + 0147-6513 + 10.1016/J.ECOENV.2022.114405 + 36508807 + 114405 + + + + + + + BhattPayal V. 
+ RawtaniDeepak + + Spectroscopic Analysis Techniques in Forensic Science + Modern Forensic Tools and Devices: Trends in Criminal Investigation + John Wiley & Sons, Ltd + 202301 + 9781119763406 + https://onlinelibrary.wiley.com/doi/full/10.1002/9781119763406.ch8 https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119763406.ch8 https://onlinelibrary.wiley.com/doi/10.1002/9781119763406.ch8 + 10.1002/9781119763406.CH8 + 149 + 197 + + + + + + FischerØystein + KuglerMartin + Maggio-AprileIvan + BerthodChristophe + RennerChristoph + + Scanning tunneling spectroscopy of high-temperature superconductors + Reviews of Modern Physics + American Physical Society + 200703 + 79 + 1 + https://journals.aps.org/rmp/abstract/10.1103/RevModPhys.79.353 + 10.1103/REVMODPHYS.79.353 + 353 + 419 + + + + + + BellisolaGiuseppe + SorioClaudio + + Infrared spectroscopy and microscopy in cancer research and diagnosis + American Journal of Cancer Research + e-Century Publishing Corporation + 2012 + 2 + 1 + /pmc/articles/PMC3236568/ /pmc/articles/PMC3236568/?report=abstract https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3236568/ + 22206042 + 1 + + + + + + + ZhongXiaoting + GallagherBrian + LiuShusen + KailkhuraBhavya + HiszpanskiAnna + HanT. Yong Jin + + Explainable machine learning in materials science + npj Computational Materials 2022 8:1 + Nature Publishing Group + 202209 + 8 + 1 + 2057-3960 + https://www.nature.com/articles/s41524-022-00884-7 + 10.1038/s41524-022-00884-7 + 1 + 19 + + + + + + RibeiroMarco Tulio + SinghSameer + GuestrinCarlos + + "Why Should I Trust You?": Explaining the Predictions of Any Classifier + NAACL-HLT 2016 - 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Demonstrations Session + Association for Computational Linguistics (ACL) + 201602 + 9781450342322 + https://arxiv.org/abs/1602.04938v3 + 10.48550/arxiv.1602.04938 + 97 + 101 + + + + + + BurkartNadia + HuberMarco F. 
+ + A Survey on the Explainability of Supervised Machine Learning + Journal of Artificial Intelligence Research + AI Access Foundation + 202101 + 70 + 1076-9757 + https://www.jair.org/index.php/jair/article/view/12228 + 10.1613/JAIR.1.12228 + 245 + 317 + + + + + + RoscherRibana + BohnBastian + DuarteMarco F. + GarckeJochen + + Explainable Machine Learning for Scientific Insights and Discoveries + IEEE Access + Institute of Electrical and Electronics Engineers Inc. + 2020 + 8 + https://arxiv.org/abs/1905.08883 + 10.1109/ACCESS.2020.2976199 + 42200 + 42216 + + + + + + BelleVaishak + PapantonisIoannis + + Principles and Practice of Explainable Machine Learning + Frontiers in Big Data + Frontiers Media S.A. + 202107 + 4 + https://arxiv.org/abs/2009.11698 + 10.3389/FDATA.2021.688969 + 34278297 + 39 + + + + + + + LundbergScott M + LeeSu-In + + A Unified Approach to Interpreting Model Predictions + Advances in Neural Information Processing Systems + 2017 + 30 + https://github.com/slundberg/shap + + + + + + SelvarajuRamprasaath R. + CogswellMichael + DasAbhishek + VedantamRamakrishna + ParikhDevi + BatraDhruv + + Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization + Proceedings of the IEEE International Conference on Computer Vision + 2017 + http://gradcam.cloudcv.org + 10.1109/iccv.2017.74 + 618 + 626 + + + + + + PedregosaFabian + VaroquauxGaël + GramfortAlexandre + MichelVincent + ThirionBertrand + GriselOlivier + BlondelMathieu + PrettenhoferPeter + WeissRon + DubourgVincent + VanderplasJake + PassosAlexandre + CournapeauDavid + BrucherMatthieu + PerrotMatthieu + DuchesnayÉdouard + + Scikit-learn: Machine Learning in Python + 2011 + 12 + http://scikit-learn.sourceforge.net. 
+ 2825 + 2830 + + + + + + CholletFrançois + Others + + Keras: The Python Deep Learning library + Astrophysics source code library + 2018 + https://ui.adsabs.harvard.edu/abs/2018ascl.soft06022C/abstract + ascl:1806.022 + + + + + + + MarholmSigvald + + sigvaldm/localreg: Multivariate RBF output + 202203 + https://zenodo.org/record/6344451 + 10.5281/ZENODO.6344451 + + + + + + HarrisCharles R. + MillmanK. Jarrod + WaltStéfan J. van der + GommersRalf + VirtanenPauli + CournapeauDavid + WieserEric + TaylorJulian + BergSebastian + SmithNathaniel J. + KernRobert + PicusMatti + HoyerStephan + KerkwijkMarten H. van + BrettMatthew + HaldaneAllan + RíoJaime Fernández del + WiebeMark + PetersonPearu + Gérard-MarchantPierre + SheppardKevin + ReddyTyler + WeckesserWarren + AbbasiHameer + GohlkeChristoph + OliphantTravis E. + + Array programming with NumPy + Nature Research + 202009 + 585 + https://doi.org/10.1038/s41586-020-2649-2 + 10.1038/s41586-020-2649-2 + 32939066 + 357 + 362 + + + + + + KaurHarmanpreet + NoriHarsha + JenkinsSamuel + CaruanaRich + WallachHanna + Wortman VaughanJennifer + + Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning + 9781450367080 + http://dx.doi.org/10.1145/3313831.3376219 + 10.1145/3313831.3376219 + + + + + + KrishnaSatyapriya + HanTessa + GuAlex + PombraJavin + JabbariShahin + WuZhiwei Steven + LakkarajuHimabindu + + The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective + 202202 + https://arxiv.org/abs/2202.01602v3 + + + + + + RojatThomas + PugetRaphaël + FilliatDavid + Del SerJavier + GelinRodolphe + Díaz-RodríguezNatalia + + Explainable Artificial Intelligence (XAI) on TimeSeries Data: A Survey + 202104 + https://arxiv.org/abs/2104.00950v1 + + + + + + BhattUmang + XiangAlice + SharmaShubham + WellerAdrian + TalyAnkur + JiaYunhan + GhoshJoydeep + PuriRuchir + MouraJosé M F + EckersleyPeter + + Explainable Machine 
Learning in Deployment + 2020 + 9781450369367 + https://doi.org/10.1145/3351095.3375624 + 10.1145/3351095.3375624 + + + + + + Fonoll-RubioRobert + PaetelStefan + Grau-LuqueEnric + Becerril-RomeroIgnacio + MayerRafael + Pérez-RodríguezAlejandro + GucMaxim + Izquierdo-RocaVictor + + Insights into the Effects of RbF-Post-Deposition Treatments on the Absorber Surface of High Efficiency Cu(In,Ga)Se2 Solar Cells and Development of Analytical and Machine Learning Process Monitoring Methodologies Based on Combinatorial Analysis + Advanced Energy Materials + John Wiley & Sons, Ltd + 202201 + 1614-6840 + https://onlinelibrary.wiley.com/doi/full/10.1002/aenm.202103163 https://onlinelibrary.wiley.com/doi/abs/10.1002/aenm.202103163 https://onlinelibrary.wiley.com/doi/10.1002/aenm.202103163 + 10.1002/AENM.202103163 + 2103163 + + + + + + + Grau-LuqueEnric + AnefnafIkram + BenhaddouNada + Fonoll-RubioRobert + Becerril-RomeroIgnacio + AazouSafae + SaucedoEdgardo + SekkatZouheir + Perez-RodriguezAlejandro + Izquierdo-RocaVictor + GucMaxim + + Combinatorial and machine learning approaches for the analysis of Cu2ZnGeSe4: influence of the off-stoichiometry on defect formation and solar cell performance + Journal of Materials Chemistry A + Royal Society of Chemistry + 202104 + 9 + 16 + https://pubs.rsc.org/en/content/articlehtml/2021/ta/d1ta01299a https://pubs.rsc.org/en/content/articlelanding/2021/ta/d1ta01299a + 10.1039/d1ta01299a + 10466 + 10476 + + + + + + Van RossumG + DrakeF L + + Python 3 Reference Manual; CreateSpace + Scotts Valley, CA + 2009 + 978-1-4414-1269-0 + https://www.python.org/ + 242 + + + + + + + CaswellThomas A + DroettboomMichael + LeeAntony + AndradeElliott Sales de + HunterJohn + HoffmannTim + FiringEric + KlymakJody + StansbyDavid + VaroquauxNelle + NielsenJens Hedegaard + RootBenjamin + MayRyan + ElsonPhil + SeppänenJouni K. + DaleDarren + LeeJae-Joon + McDougallDamon + StrawAndrew + HobsonPaul + GohlkeChristoph + Hannah + YuTony S + MaEric + VincentAdrien F. 
+ SilvesterSteven + MoadCharlie + KniazevNikita + ErnestElan + IvanovPaul + + matplotlib/matplotlib: REL: v3.4.2 + 202105 + https://zenodo.org/record/4743323 + 10.5281/ZENODO.4743323 + + + + +