+ Statement of need
+ Spectroscopic techniques (e.g. Raman, photoluminescence,
+ reflectance, transmittance, X-ray fluorescence) are an important and
+ widely used resource in different fields of science, such as
+ photovoltaics
+ (Fonoll-Rubio
+ et al., 2022)
+ (Grau-Luque
+ et al., 2021), cancer
+ (Bellisola
+ & Sorio, 2012), superconductors
+ (Fischer
+ et al., 2007), polymers
+ (Easton
+ et al., 2020), corrosion
+ (Haruna
+ et al., 2023), forensics
+ (P.
+ V. Bhatt & Rawtani, 2023), and environmental sciences
+ (Estefany
+ et al., 2023), to name just a few. This is due to their
+ versatile, non-destructive, and fast-acquisition nature, which gives
+ access to a wide range of material properties, such as composition,
+ morphology, molecular structure, and optical and electronic properties.
+ As such,
+ machine learning (ML) has been used to analyze spectral data for
+ several years, elucidating their vast complexity and uncovering
+ further potential in the information contained within them
+ (Goodacre,
+ 2003)
+ (Luo et
+ al., 2022). Unfortunately, most of these ML analyses lack
+ further interpretation of the derived results due to the complex
+ nature of such algorithms. In this regard, interpreting the results of
+ ML algorithms has become an increasingly important topic, as concerns
+ about the lack of interpretability of these models have grown
+ (Burkart
+ & Huber, 2021). In the natural sciences (such as materials science,
+ physics, and chemistry), as ML becomes more common, this concern
+ has gained special interest, since results obtained from ML analyses
+ may lack scientific value if they cannot be properly interpreted,
+ which can affect scientific consistency and strongly diminish the
+ significance of, and confidence in, the results, particularly when
+ tackling scientific problems
+ (Roscher
+ et al., 2020).
+ Even though there are methods and libraries available for
+ explaining different types of algorithms such as SHAP
+ (Lundberg
+ et al., 2017), LIME
+ (Ribeiro
+ et al., 2016), or GradCAM
+ (Selvaraju
+ et al., 2017), they can be difficult to interpret or understand
+ even for data scientists, leading to problems such as
+ misinterpretation, misuse, and over-trust
+ (Kaur
+ et al., n.d.). Combined with other human-related issues
+ (Krishna
+ et al., 2022), this means that researchers with expertise in the
+ natural sciences but little or no data science background may face
+ further issues when using such methodologies
+ (Zhong
+ et al., 2022). Furthermore, these types of libraries normally
+ target problems involving image, text, or tabular data, which
+ do not map in a straightforward way onto spectroscopic data.
+ Time series (TS) data, on the other hand, shares similarities with
+ spectroscopic data, and although spectroscopy still has specific needs
+ and differences, TS-dedicated tools can be a better approach.
+ Unfortunately, despite
+ the existence of several libraries that aim to explain models for TS
+ with the potential to be applied to spectroscopic data, they are
+ mostly designed for a specialized audience, and many are
+ model-specific
+ (Rojat
+ et al., 2021). Moreover, spectral data normally manifests as an
+ array of peaks that typically overlap and can be distinguished
+ by their shape, intensity, and position. Minor shifts in these
+ patterns can indicate significant alterations in the fundamental
+ properties of the subject material. Conversely, pronounced variations
+ in other cases might indicate only negligible differences.
+ Therefore, comprehending such alterations and their implications is
+ paramount. This remains true in ML-based spectroscopic analysis, where
+ spectral variations are still of primary concern. In this context, a
+ tool with an easy and understandable approach, offering
+ spectroscopy-oriented functionalities that can target specific
+ patterns, areas, and variations, and that is friendly to beginners and
+ non-specialists, is of high interest. This can help the
+ different stakeholders to better understand the ML models that they
+ employ and considerably increase the transparency, comprehensibility,
+ and scientific impact of ML results
+ (U.
+ Bhatt et al., 2020)
+ (Belle
+ & Papantonis, 2021).
+
+
+ Overview
+ pudu is a Python library that quantifies the effect of
+ changes in spectral features on the predictions of ML models and
+ their effect on the target instances. In other words, it perturbs
+ the features in a predictable and deliberate way and evaluates the
+ features based on how the final prediction changes. For this, four
+ main methods are included and defined. Importance
+ quantifies the relevance of the features according to the changes in
+ the prediction. It is therefore measured as a difference in
+ probability or in target value for classification or regression
+ problems, respectively.
+ Speed quantifies how fast a prediction changes
+ according to perturbations in the features. For this, the
+ importance is calculated at different perturbation
+ levels and a line is fitted to the obtained values; its slope, that
+ is, the rate of change of importance, is taken as the
+ speed. Synergy indicates how
+ features complement each other in terms of prediction change after
+ perturbations. Finally, re-activations counts
+ the number of units in a Convolutional Neural Network (CNN)
+ whose activation value, after perturbation, goes above the original
+ activation criterion. The latter applies only to CNNs, whereas the
+ other three can be applied to any ML problem, including CNNs. For more
+ detail on how these techniques work, please refer to the
+ definitions
+ in the documentation.
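The importance and speed definitions above can be illustrated with a minimal sketch. This is not pudu's API or implementation: the names `predict`, `perturb`, `importance`, and `speed` are hypothetical, the model is a toy linear regressor, and the multiplicative perturbation is an assumption made only for illustration.

```python
import numpy as np


def predict(x):
    """Hypothetical stand-in for a trained regression model (toy linear model)."""
    weights = np.linspace(0.0, 1.0, x.shape[-1])
    return float(np.dot(x, weights))


def perturb(x, start, stop, delta):
    """Return a copy of x with features [start, stop) scaled by (1 + delta)."""
    out = x.copy()
    out[start:stop] *= 1.0 + delta
    return out


def importance(x, start, stop, delta=0.1):
    """Change in the predicted target value after perturbing one feature range."""
    return predict(perturb(x, start, stop, delta)) - predict(x)


def speed(x, start, stop, deltas=(0.05, 0.1, 0.15, 0.2)):
    """Slope of a line fitted to importance versus perturbation level."""
    values = [importance(x, start, stop, d) for d in deltas]
    slope, _intercept = np.polyfit(deltas, values, 1)
    return float(slope)


# Assumed toy input: a flat "spectrum" with 100 features.
spectrum = np.ones(100)
imp = importance(spectrum, 50, 75)
spd = speed(spectrum, 50, 75)
```

For a classifier, the same idea would apply with `predict` returning the probability of the class of interest, so that importance becomes a probability difference.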
+ pudu is versatile as it can analyze classification and
+ regression algorithms for both 1- and 2-dimensional problems, offering
+ plenty of flexibility with parameters, and the ability to provide
+ localized explanations by selecting specific areas of interest. To
+ illustrate this,
+ [fig:figure1]
+ shows two analysis instances using the same
+ importance method but with different parameters.
+ Additionally, examples of its other functionalities using
+ scikit-learn
+ (Pedregosa
+ et al., 2011), keras
+ (Chollet
+ et al., 2018), and localreg
+ (Marholm,
+ 2022) can be found in the documentation, along with XAI methods
+ including LIME and GradCAM.
+ pudu is built in Python 3
+ (Van
+ Rossum & Drake, 2009) and uses third-party packages
+ including numpy
+ (Harris
+ et al., 2020), matplotlib
+ (Caswell
+ et al., 2021), and keras. It is available on both PyPI and
+ conda, and comes with complete documentation, including quick start,
+ examples, and contribution guidelines. Source code and documentation
+ are available at https://github.com/pudu-py/pudu.
+
+ Two ways of using the same
+ importance method: A) using a sequential change pattern
+ over all the spectral features and B) selecting peaks of interest.
+ These spectra were measured from thin-film photovoltaic samples and
+ are correlated to their performance using ML, as explained in
+ (Fonoll-Rubio
+ et al., 2022). In A), the vector was divided into windows
+ of 25 pixels, which were perturbed individually. The impact of the
+ peak in the range of 1200-1400 overshadows the impact of the rest. In
+ contrast, in B) specific ranges are defined, so only the first four
+ main peaks are selected for analysis, which better visualizes their
+ impact on the prediction.
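The difference between the two modes in the caption can be sketched as follows. This is only an illustration under stated assumptions, not pudu's implementation: `sequential_windows`, `window_importance`, and `toy_predict` are hypothetical names, and the feature ranges are invented for the example rather than taken from the figure.

```python
import numpy as np


def sequential_windows(n_features, window):
    """Split feature indices into consecutive windows of a fixed size."""
    return [(s, min(s + window, n_features)) for s in range(0, n_features, window)]


def window_importance(predict, x, windows, delta=0.1):
    """Prediction change when each window is perturbed on its own."""
    base = predict(x)
    scores = []
    for start, stop in windows:
        perturbed = x.copy()
        perturbed[start:stop] *= 1.0 + delta
        scores.append(predict(perturbed) - base)
    return np.array(scores)


def toy_predict(x):
    """Hypothetical model that responds mostly to features 50 to 74."""
    return float(x[50:75].sum() + 0.01 * x.sum())


spectrum = np.ones(100)

# A) Sequential pattern: every window of 25 features is perturbed in turn.
all_windows = sequential_windows(spectrum.size, 25)
scores_a = window_importance(toy_predict, spectrum, all_windows)

# B) Selected ranges only (hypothetical positions of peaks of interest).
selected = [(0, 25), (25, 50)]
scores_b = window_importance(toy_predict, spectrum, selected)
```

In this toy case the window covering features 50 to 74 dominates `scores_a`, while restricting the analysis to `selected` leaves only the weaker regions, so their relative impact is easier to compare.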
+
+
+
+