
Adding Logistic Regression class implementation & utils #41

Merged
merged 16 commits into from
Jan 16, 2023

Conversation

stompsjo
Collaborator

This introduces an ML class, LogReg, that can be used for supervised logistic regression. This includes typical scikit-learn-esque methods like train and predict as well as methods for hyperparameter optimization and saving the model to file.
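As a rough illustration of the interface described above, a minimal sketch (the `train`/`predict` method names follow the description, but the bodies here are assumptions, not the PR's actual `models/LogReg.py` implementation):

```python
# Minimal sketch of the LogReg interface described above; bodies are
# assumptions, not the PR's actual implementation in models/LogReg.py.
import numpy as np
from sklearn.linear_model import LogisticRegression

class LogReg:
    def __init__(self, **kwargs):
        # pass hyperparameters straight through to scikit-learn
        self.model = LogisticRegression(**kwargs)

    def train(self, trainx, trainy):
        self.model.fit(trainx, trainy)

    def predict(self, testx):
        return self.model.predict(testx)

# toy, linearly separable data
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
clf = LogReg(max_iter=200)
clf.train(X, y)
acc = float((clf.predict(X) == y).mean())
```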

@stompsjo stompsjo requested a review from gonuke October 31, 2022 18:30
@stompsjo stompsjo self-assigned this Oct 31, 2022
@coveralls

coveralls commented Oct 31, 2022

Pull Request Test Coverage Report for Build 3931573895

  • 85 of 103 (82.52%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-5.6%) to 94.444%

Changes missing coverage:
  • scripts/utils.py: 46 of 64 changed/added lines covered (71.88%)

Totals (change from base Build 3348444582: -5.6%):
  • Covered Lines: 306
  • Relevant Lines: 324

💛 - Coveralls

@stompsjo stompsjo added the good first issue Good for newcomers label Oct 31, 2022
@stompsjo
Collaborator Author

@gonuke This PR should be reviewed and merged first because it includes utils.py which will be used for all subsequent ML PRs (the rest can be done in parallel). This will also be the "most difficult" to review, since it includes both the utils and one ML class (LogReg) that reflects the common structure for ML models.

Coveralls says that coverage dropped because the SSML functionality of utils.cross_validation is not tested (since this PR does not include any SSML models). A test is already implemented as part of #45.

@stompsjo stompsjo changed the title from "Adding Logistic Regression class implementation" to "Adding Logistic Regression class implementation & utils" Oct 31, 2022
Member

@gonuke gonuke left a comment


This looks correct enough, but a few suggestions for maintainability.

Some good discussion points for a S/W meeting, too.

Review threads (resolved) on: models/LogReg.py, scripts/utils.py (4, outdated), tests/test_models.py (5, outdated)
Collaborator Author

@stompsjo stompsjo left a comment


@gonuke I have addressed some of the comments from our S/W review. Here are a few lingering comments, pending a re-review from you.

scripts/utils.py (outdated)
Comment on lines 150 to 152
if stratified:
    cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                         shuffle=shuffle)
Collaborator Author


There is currently no test for stratified KFold, but the only difference between this and standard KFold (which is tested) is a different scikit-learn class. I could devise a second test dataset with a few extra instances of one class (which would be balanced by StratifiedKFold) or we could ignore testing this portion. Thoughts?

Collaborator Author


Since Coveralls supports # pragma: no cover, I am adding it to this if...else block, given that we are not testing StratifiedKFold.
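For concreteness, the pattern described above might be sketched like this (the function name and signature here are assumptions for illustration, not the PR's actual utils.cross_validation code):

```python
# Sketch of excluding the untested StratifiedKFold branch from coverage.
# make_cv and its signature are illustrative, not the PR's actual code.
from sklearn.model_selection import KFold, StratifiedKFold

def make_cv(stratified, n_splits=5, random_state=None, shuffle=False):
    if stratified:  # pragma: no cover
        # untested branch: excluded from the coverage denominator
        cv = StratifiedKFold(n_splits=n_splits, random_state=random_state,
                             shuffle=shuffle)
    else:
        cv = KFold(n_splits=n_splits, random_state=random_state,
                   shuffle=shuffle)
    return cv
```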

scripts/utils.py Outdated
Comment on lines 194 to 296
    Inputs:
        Lx: labeled feature data.
        Ux: unlabeled feature data.
        n: number of singular values to include in PCA analysis.
    '''

    pcadata = np.append(Lx, Ux, axis=0)
    normalizer = StandardScaler()
    x = normalizer.fit_transform(pcadata)
    logging.info('mean=%s std=%s', np.mean(pcadata), np.std(pcadata))
    logging.info('mean=%s std=%s', np.mean(x), np.std(x))

    pca = PCA(n_components=n)
    # fit once and reuse the transformed data
    principalComponents = pca.fit_transform(x)
    logging.info(pca.explained_variance_ratio_)
    logging.info(pca.singular_values_)
    logging.info(pca.components_)

    return principalComponents


def _plot_pca_data(principalComponents, Ly, Uy, ax):
    '''
    Helper function for plot_pca that plots data for a given axis.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        ax: matplotlib axis to plot on.
    '''

    # only saving colors for binary classification with unlabeled instances
    col_dict = {-1: 'tab:gray', 0: 'tab:orange', 1: 'tab:blue'}

    for idx, color in col_dict.items():
        indices = np.where(np.append(Ly, Uy, axis=0) == idx)[0]
        ax.scatter(principalComponents[indices, 0],
                   principalComponents[indices, 1],
                   c=color,
                   label='class '+str(idx))
    return ax


def plot_pca(principalComponents, Ly, Uy, filename, n=2):
    '''
    A function for computing and plotting n-dimensional PCA.
    Inputs:
        principalComponents: ndarray of shape (n_samples, n_components).
        Ly: class labels for labeled data.
        Uy: labels for unlabeled data (all labels should be -1).
        filename: filename for saved plot.
            The file is saved with extension .png,
            added to filename if not included as input.
        n: number of singular values to include in PCA analysis.
    '''

    plt.rcParams.update({'font.size': 20})

    alph = ["A", "B", "C", "D", "E", "F", "G", "H",
            "I", "J", "K", "L", "M", "N", "O", "P",
            "Q", "R", "S", "T", "U", "V", "W", "X",
            "Y", "Z"]
    jobs = alph[:n]

    # only one plot is needed for n=2
    if n == 2:
        fig, ax = plt.subplots(figsize=(10, 8))
        ax.set_xlabel('PC '+jobs[0], fontsize=15)
        ax.set_ylabel('PC '+jobs[1], fontsize=15)
        ax = _plot_pca_data(principalComponents, Ly, Uy, ax)
        ax.grid()
        ax.legend()
    else:
        fig, axes = plt.subplots(n, n, figsize=(15, 15))
        for row in range(axes.shape[0]):
            for col in range(axes.shape[1]):
                ax = axes[row, col]
                # blank label plot
                if row == col:
                    ax.tick_params(
                        axis='both', which='both',
                        bottom=False, top=False,
                        labelbottom=False,
                        left=False, right=False,
                        labelleft=False
                    )
                    ax.text(0.5, 0.5, jobs[row],
                            horizontalalignment='center')
                # PCA results
                else:
                    ax = _plot_pca_data(principalComponents, Ly, Uy, ax)

    if filename[-4:] != '.png':
        filename += '.png'
    fig.tight_layout()
    fig.savefig(filename)


def plot_cf(testy, predy, title, filename):
Collaborator Author


@gonuke, per our conversations from S/W, do you still concur that we can ignore testing these PCA/plotting functions?

Member


Yes - but is there a way to exclude them from the denominator of the coverage testing?

Collaborator Author


Turns out, Coveralls supports adding # pragma: no cover to blocks of code, and it will exclude them from coverage. I am adding it to the plotting functions above.

Comment on lines +162 to +165
# since the test data used here is synthetic/toy data (i.e. uninteresting),
# the trained model should be at least better than a 50-50 guess
# if it was worse, something would be wrong with the ML class
assert acc > 0.5
Collaborator Author


I added some clarification to explain our "good-enough" testing of trained ML models.
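A sanity test in this spirit might look like the following sketch; the dataset and model here are stand-ins, not the PR's actual test in tests/test_models.py:

```python
# Hedged sketch of a "better-than-chance" sanity test like the one quoted
# above; the dataset and model here are stand-ins for the PR's actual test.
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_logreg_beats_chance():
    rng = np.random.default_rng(0)
    # two well-separated Gaussian blobs -> easy synthetic/toy data
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    acc = LogisticRegression().fit(X, y).score(X, y)
    # anything trained should beat a 50-50 guess on this data;
    # if not, something is wrong with the ML class
    assert acc > 0.5
```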

Member

@gonuke gonuke left a comment


Still some questions about what the right test is for parameters passed into the LogReg initializer....

models/LogReg.py (outdated)
                random_state=self.random_state
            )
        else:
            if all(key in params.keys() for key in keys):
Member


I'm not sure if this is the most correct/robust way to do this. The LogisticRegression model has defaults for these parameters, so it may be OK if some are missing. You just need to make sure they exist if you want to pass them along. Right now, you only allow 0 parameters or all 3 parameters, but maybe it's OK for just 1 or 2?

One way to manage this is with the **kwargs object that you can pass through, perhaps?

Collaborator Author


This is my first time using **kwargs, but I saw a recommendation to use kwargs.pop('key', default_value) to pull optional args from the input. This system should support any combination of input parameters, including catching ones that are not supported. I have updated the __init__ and its relevant unit test. Let me know if you have feedback!
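The kwargs.pop pattern in isolation might look like the following sketch; the parameter names here are illustrative, not necessarily the ones accepted by LogReg:

```python
# Isolated illustration of kwargs.pop with defaults; parameter names are
# illustrative and not necessarily the ones accepted by LogReg.
class Model:
    def __init__(self, **kwargs):
        # each hyperparameter falls back to a default if not supplied,
        # so any subset of the supported parameters is accepted
        self.max_iter = kwargs.pop('max_iter', 100)
        self.tol = kwargs.pop('tol', 1e-4)
        self.C = kwargs.pop('C', 1.0)
        # anything left over was not a supported parameter
        if kwargs:
            raise ValueError('unsupported parameters: %s' % sorted(kwargs))

m = Model(tol=1e-3)
```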

@stompsjo
Collaborator Author

@gonuke I addressed your comments, thanks!

Member

@gonuke gonuke left a comment


Looks good!

@gonuke gonuke merged commit d727d1e into cnerg:main Jan 16, 2023
stompsjo pushed a commit to stompsjo/RadClass that referenced this pull request Jan 16, 2023