Adding Logistic Regression class implementation & utils #41
Conversation
Pull Request Test Coverage Report for Build 3931573895 (Coveralls)
@gonuke This PR should be reviewed and merged first. Coveralls says that coverage dropped because of the SSML functionality.
This looks correct enough, but a few suggestions for maintainability.
Some good discussion points for a S/W meeting, too.
Co-authored-by: Paul Wilson <[email protected]>
@gonuke I have addressed some of the comments from our S/W review. Here are a few lingering comments, pending a re-review from you.
scripts/utils.py
Outdated
    if stratified:
        cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                             shuffle=shuffle)
There is currently no test for stratified KFold, but the only difference between this and standard KFold (which is tested) is a different scikit-learn class. I could devise a second test dataset with a few extra instances of one class (which would be balanced by StratifiedKFold), or we could ignore testing this portion. Thoughts?
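A second test dataset along those lines could be quite small. This is a minimal sketch (the data and counts here are illustrative, not from the PR's test suite): an imbalanced toy set where StratifiedKFold preserves the class ratio in every fold, which standard KFold does not guarantee.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical toy dataset: 8 instances of class 0, 4 of class 1
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # each test fold keeps the 2:1 class ratio (4 of class 0, 2 of class 1)
    _, counts = np.unique(y[test_idx], return_counts=True)
    assert list(counts) == [4, 2]
```

A test like this would exercise the stratified branch without needing a realistic dataset.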
In light of Coveralls' support for # pragma: no cover, I am adding it to this if/else since we are not testing StratifiedKFold.
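For reference, the pragma is placed on the line that opens the excluded block; coverage.py (which Coveralls reports on) then drops that branch from the coverage denominator. A minimal sketch, with illustrative names rather than the PR's exact code:

```python
from sklearn.model_selection import KFold, StratifiedKFold

def make_cv(n_splits=5, stratified=False, shuffle=True, random_state=0):
    # the pragma excludes the untested stratified branch from coverage
    if stratified:  # pragma: no cover
        return StratifiedKFold(n_splits=n_splits, shuffle=shuffle,
                               random_state=random_state)
    return KFold(n_splits=n_splits, shuffle=shuffle,
                 random_state=random_state)
```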
scripts/utils.py
Outdated
    Inputs:
        Lx: labeled feature data.
        Ux: unlabeled feature data.
        n: number of singular values to include in PCA analysis.
    '''

    pcadata = np.append(Lx, Ux, axis=0)
    normalizer = StandardScaler()
    x = normalizer.fit_transform(pcadata)
    logging.info('%s %s', np.mean(pcadata), np.std(pcadata))
    logging.info('%s %s', np.mean(x), np.std(x))

    pca = PCA(n_components=n)
    principalComponents = pca.fit_transform(x)
    logging.info(pca.explained_variance_ratio_)
    logging.info(pca.singular_values_)
    logging.info(pca.components_)

    return principalComponents


def _plot_pca_data(principalComponents, Ly, Uy, ax):
    '''
    Helper function for plot_pca that plots data for a given axis.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        ax: matplotlib axis to plot on.
    '''

    # only saving colors for binary classification with unlabeled instances
    col_dict = {-1: 'tab:gray', 0: 'tab:orange', 1: 'tab:blue'}

    for idx, color in col_dict.items():
        indices = np.where(np.append(Ly, Uy, axis=0) == idx)[0]
        ax.scatter(principalComponents[indices, 0],
                   principalComponents[indices, 1],
                   c=color,
                   label='class '+str(idx))
    return ax


def plot_pca(principalComponents, Ly, Uy, filename, n=2):
    '''
    A function for plotting n-dimensional PCA results.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        filename: filename for saved plot.
            The file is saved with extension .png.
            Added to filename if not included as input.
        n: number of singular values included in the PCA analysis.
    '''

    plt.rcParams.update({'font.size': 20})

    alph = ["A", "B", "C", "D", "E", "F", "G", "H",
            "I", "J", "K", "L", "M", "N", "O", "P",
            "Q", "R", "S", "T", "U", "V", "W", "X",
            "Y", "Z"]
    jobs = alph[:n]

    # only one plot is needed for n=2
    if n == 2:
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.set_xlabel('PC '+jobs[0], fontsize=15)
        ax.set_ylabel('PC '+jobs[1], fontsize=15)
        ax = _plot_pca_data(principalComponents, Ly, Uy, ax)
        ax.grid()
        ax.legend()
    else:
        fig, axes = plt.subplots(n, n, figsize=(15, 15))
        for row in range(axes.shape[0]):
            for col in range(axes.shape[1]):
                ax = axes[row, col]
                # blank label plot
                if row == col:
                    ax.tick_params(
                        axis='both', which='both',
                        bottom=False, top=False,
                        labelbottom=False,
                        left=False, right=False,
                        labelleft=False
                    )
                    ax.text(0.5, 0.5, jobs[row], horizontalalignment='center')
                # PCA results
                else:
                    ax = _plot_pca_data(principalComponents, Ly, Uy, ax)

    if filename[-4:] != '.png':
        filename += '.png'
    fig.tight_layout()
    fig.savefig(filename)


def plot_cf(testy, predy, title, filename):
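The standardize-then-PCA step in the snippet above can be exercised on toy data. This is a self-contained sketch of just that computation (the random inputs are illustrative, not the PR's data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
Lx = rng.normal(size=(20, 5))   # labeled feature data (toy)
Ux = rng.normal(size=(10, 5))   # unlabeled feature data (toy)

# combine labeled and unlabeled features, standardize, then project
pcadata = np.append(Lx, Ux, axis=0)
x = StandardScaler().fit_transform(pcadata)
principalComponents = PCA(n_components=2).fit_transform(x)

# one 2-D point per input sample
assert principalComponents.shape == (30, 2)
```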
@gonuke, per our conversations from S/W, do you still concur that we can ignore testing these PCA/plotting functions?
Yes - but is there a way to exclude them from the denominator of the coverage testing?
Turns out, Coveralls supports adding # pragma: no cover to blocks of code, and it will exclude them from coverage. Adding it to the plotting functions above.
    # since the test data used here is synthetic/toy data (i.e. uninteresting),
    # the trained model should be at least better than a 50-50 guess;
    # if it were worse, something would be wrong with the ML class
    assert acc > 0.5
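The surrounding test isn't shown in full, but the "better than a coin flip" idea can be sketched end to end. The data and use of scikit-learn's LogisticRegression directly (rather than the PR's LogReg wrapper) are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic, linearly separable toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trainx, testx, trainy, testy = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(trainx, trainy)
acc = model.score(testx, testy)

# a trained model on toy data must beat a 50-50 guess
assert acc > 0.5
```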
I added some clarification to explain our "good-enough" testing of trained ML models.
Still some questions about what the right test is for parameters passed into the LogReg initializer.
models/LogReg.py
Outdated
        random_state=self.random_state
    )
else:
    if all(key in params.keys() for key in keys):
I'm not sure if this is the most correct/robust way to do this. The LogisticRegression model has defaults for these parameters, so it may be OK if some are missing. You just need to make sure they exist if you want to pass them along. Right now, you only allow 0 parameters or all 3 parameters, but maybe it's OK for just 1 or 2? One way to manage this is with the **kwargs object that you can pass through, perhaps?
This is my first time using **kwargs, but I saw a recommendation to use kwargs.pop('key', default_value) to pull optional args from the input. This system should support any combination of input parameters, including catching ones that are not supported. I have updated the __init__ and its relevant unit test. Let me know if you have feedback!
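For concreteness, a minimal sketch of the kwargs.pop pattern described here. The parameter names and defaults mirror common LogisticRegression options and are assumptions; the PR's actual __init__ may differ:

```python
from sklearn.linear_model import LogisticRegression

class LogReg:
    def __init__(self, **kwargs):
        # each option falls back to a default when not supplied,
        # so any subset of the parameters is accepted
        self.max_iter = kwargs.pop('max_iter', 100)
        self.C = kwargs.pop('C', 1.0)
        self.random_state = kwargs.pop('random_state', None)
        if kwargs:
            # anything left over was not a supported parameter
            raise TypeError(f'unsupported parameters: {sorted(kwargs)}')
        self.model = LogisticRegression(max_iter=self.max_iter, C=self.C,
                                        random_state=self.random_state)
```

With this layout, LogReg(), LogReg(C=0.5), and LogReg(C=0.5, max_iter=200) all work, while a typo such as LogReg(c=0.5) is rejected rather than silently ignored.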
@gonuke I addressed your comments, thanks!
Looks good!
This introduces an ML class, LogReg, that can be used for supervised logistic regression. It includes typical scikit-learn-esque methods like train and predict, as well as methods for hyperparameter optimization and saving the model to file.
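The intended workflow can be sketched with scikit-learn directly. The method names in the comments (train, predict, save) echo the description above, but the LogReg class's actual signatures are not shown in this conversation, so this is a stand-in rather than the PR's API:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny illustrative dataset
X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)   # train
preds = model.predict(X)                 # predict
joblib.dump(model, 'logreg.joblib')      # save the model to file
```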