Evaluation of Unsupervised Machine Learning Methods on Scientific Datasets in High Performance Computing
Scientists working in High Performance Computing (HPC) environments often apply traditional Machine Learning (ML) methods, such as Principal Component Analysis (PCA) and K-Means clustering, to their data. ML has recently attracted considerable interest in science because of the flexibility it offers. In many scientific applications the ground truth is not known, which motivates the use of unsupervised ML algorithms.
PCA and K-Means have traditionally been widely used, but maintaining accuracy with them becomes challenging on datasets with many features and strong non-linearity. This research compares four methods: PCA, K-Means, a Convolutional Autoencoder (CAE), and a Convolutional Variational Autoencoder (CVAE). The two case studies used for testing are the SARS-MERS-COVID dataset and a High Entropy Alloy (HEA) dataset. We find that the CAE and CVAE achieve greater accuracy than PCA and K-Means on these large datasets, at a greater computational cost.
We used images of pathogen strains of SARS, MERS, and COVID to train the different models. The aim is to classify each strain accurately. This dataset consists of 60000 samples, and each sample is a 24x24x1 pixel image.
The second dataset we used is the High Entropy Alloy dataset, which consists of lattice structures of different alloys. This dataset contains 40000 samples, each a 40x40x40 volume.
We use PCA, K-Means, a Convolutional Autoencoder, and a Convolutional Variational Autoencoder as the primary methods for evaluating these datasets. Each method is elaborated below.
PCA is a dimensionality reduction method that separates the most important features from the least important ones by transforming the data into ranked principal components. Using this method, it is possible to extract the most important features from the dataset and feed only those into the models. This is especially useful when there are many features and some of them are correlated with each other or hurt model performance. Shown below is a flow chart representing the PCA algorithm.
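As an illustration, the sketch below shows how such a reduction could look with scikit-learn's PCA on flattened images; the component count and the random stand-in data are assumptions, not the study's exact configuration.

```python
# Minimal PCA sketch using scikit-learn. The random array stands in for the
# flattened 24x24x1 images; n_components=50 is an illustrative choice.
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(1000, 24, 24, 1)      # stand-in for the image data
X = images.reshape(len(images), -1)           # flatten to 576-dim vectors

pca = PCA(n_components=50)                    # keep the 50 strongest components
X_reduced = pca.fit_transform(X)              # project onto the principal axes

# Fraction of the total variance captured by the retained components.
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```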
K-Means is a clustering method that groups data points into categories based on centroids and the distances of points to those centroids. It is an iterative procedure that starts with randomly placed centroids, assigns every point to its closest centroid, and then recomputes each centroid as the mean of its assigned points. The algorithm eventually converges, revealing the final clusters. Shown below is a flow chart of the algorithm.
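A minimal sketch of this procedure with scikit-learn is shown below; using three clusters to match the three pathogen classes, and random stand-in features, are assumptions made for illustration.

```python
# Minimal K-Means sketch using scikit-learn on stand-in feature vectors.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 576)               # stand-in for flattened images

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)              # assign points, update centroids,
                                            # repeat until convergence
print(kmeans.cluster_centers_.shape)        # final centroid positions: (3, 576)
```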
Autoencoders compress an image into a smaller representation and then expand it back to the original size, learning intrinsic properties of the image in the process. By repeatedly compressing and reconstructing training images, the model becomes good at characterizing images based on how similar they are to the learned features. A convolutional autoencoder passes a kernel over the pixel values of the image and produces a smaller feature map, effectively shrinking the image dimensions. The CNN model is shown below, as well as a representation of how it works.
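The sketch below outlines a convolutional autoencoder in Keras for 24x24x1 inputs; the layer sizes are illustrative assumptions rather than the exact architecture used in this work.

```python
# Sketch of a convolutional autoencoder for 24x24x1 images.
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(24, 24, 1))
# Encoder: convolutions + downsampling shrink the image to a small code.
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)                      # 12x12
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D(2)(x)                # 6x6 latent feature map

# Decoder: upsampling + convolutions expand the code back to image size.
x = layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
x = layers.UpSampling2D(2)(x)                      # 12x12
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                      # 24x24
outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruct the input
```

Calling autoencoder.fit(x_train, x_train, ...) would then train the network to reproduce its own inputs, which is what drives the representation learning described above.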
A CVAE works much like a CAE but introduces a probabilistic description of the features in latent space. Instead of a single value per latent feature, the encoder outputs probability distributions, modelling uncertainty more explicitly. We then randomly sample from each distribution to produce the decoder input, and the decoder attempts to reconstruct the image from that sample. Given enough training data, this leads to richer learning and makes the CVAE more generalizable to a variety of test data.
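The sketch below shows only the probabilistic bottleneck that distinguishes a CVAE from a CAE: the encoder produces a mean and log-variance per latent dimension, and the decoder input is a random sample drawn from that distribution (the reparameterization trick). Layer sizes and the latent dimension are assumptions.

```python
# Sketch of the probabilistic bottleneck of a convolutional VAE.
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 16

inputs = layers.Input(shape=(24, 24, 1))
x = layers.Conv2D(16, 3, strides=2, activation="relu", padding="same")(inputs)
x = layers.Conv2D(32, 3, strides=2, activation="relu", padding="same")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(latent_dim)(x)      # distribution mean per latent unit
z_log_var = layers.Dense(latent_dim)(x)   # distribution spread per latent unit

def sample(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))          # random draw
    return mean + tf.exp(0.5 * log_var) * eps       # sample from N(mean, var)

z = layers.Lambda(sample)([z_mean, z_log_var])      # input to the decoder
encoder = models.Model(inputs, [z_mean, z_log_var, z])
```

The training loss would combine the reconstruction error with a KL-divergence term that keeps the learned distributions close to a standard normal.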
In a Gaussian mixture, each cluster is modelled as a Gaussian. The Expectation Maximisation (EM) algorithm is used to maximize the marginal likelihood of the input variables given the parameters. We estimate the posterior distribution conditioned on the weights, means, and covariances; once the posterior distribution has been estimated, we update the parameters of each Gaussian and evaluate the log likelihood.
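A minimal sketch of this Gaussian-mixture fit is shown below using scikit-learn's GaussianMixture, which runs EM internally; the component count and the stand-in features are assumptions.

```python
# Sketch of a Gaussian-mixture fit via EM using scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(1000, 50)            # stand-in for reduced feature vectors

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                              # EM: alternate posterior estimation
                                        # (E-step) and parameter updates (M-step)
posteriors = gmm.predict_proba(X)       # per-sample responsibility of each Gaussian
print(gmm.score(X))                     # average per-sample log likelihood
```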
GANs are an effective way to train neural networks for classification. A GAN is a two-part neural network consisting of a generator and a discriminator. The generator learns from the discriminator's loss and tries to fool the discriminator into accepting its outputs as real, while the discriminator has to distinguish between real and fake inputs. This creates an adversarial situation. In the common case the discriminator performs well while the generator lags behind; the worst case is when the generator performs well and the discriminator does not, because the network can then no longer discriminate between real and fake inputs.
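The sketch below illustrates this adversarial setup in Keras with a toy generator and discriminator; the network sizes, noise dimension, and single training step are illustrative assumptions.

```python
# Sketch of an adversarial training step: the generator tries to fool the
# discriminator, while the discriminator separates real from fake inputs.
import tensorflow as tf
from tensorflow.keras import layers, models

generator = models.Sequential([
    layers.Input(shape=(32,)),                  # noise vector
    layers.Dense(128, activation="relu"),
    layers.Dense(24 * 24, activation="sigmoid"),
    layers.Reshape((24, 24, 1)),
])
discriminator = models.Sequential([
    layers.Input(shape=(24, 24, 1)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # probability the input is real
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):
    noise = tf.random.normal((tf.shape(real_images)[0], 32))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = generator(noise, training=True)
        real_score = discriminator(real_images, training=True)
        fake_score = discriminator(fakes, training=True)
        # Discriminator: label real inputs 1 and fake inputs 0.
        d_loss = (bce(tf.ones_like(real_score), real_score)
                  + bce(tf.zeros_like(fake_score), fake_score))
        # Generator: try to make the discriminator call the fakes real.
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    d_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    g_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
```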
We found that the different algorithms produced noticeably different classification results. For example, K-Means performed poorly on the HEA dataset, while the CAE and CVAE performed well. On the SARS-MERS-COVID dataset, a neural network trained on PCA-reduced features performed well, achieving 74% accuracy, whereas K-Means again did poorly, reaching only 56.87% accuracy. The results are shown below.
We were also able to train these models across multiple GPUs on the Summit supercomputer using the Horovod Python library. Horovod allows us to split either the data or the model and train in parallel. We implemented data parallelism, giving each GPU a separate shard of the data and a full copy of the model. Processing time decreases from 1 to 6 GPUs for both datasets. The SARS-MERS-COVID dataset showed the greatest decrease in time, whereas the HEA dataset showed a smaller decrease due to its complexity.
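A minimal sketch of this data-parallel setup with Horovod's Keras API is shown below; the stand-in model and data are assumptions, while the Horovod calls (hvd.init, DistributedOptimizer, BroadcastGlobalVariablesCallback) follow its documented interface. Launched with, for example, horovodrun -np 6 python train.py, each process trains a full copy of the model on its own data shard.

```python
# Sketch of Horovod data-parallel training: one process per GPU, each with a
# shard of the data and a full copy of the model.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                           # one process per GPU

# Pin each process to its own local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Stand-in data: each worker would normally load only its own shard.
x_shard = np.random.rand(1000, 24, 24, 1).astype("float32")

# Small stand-in model (the actual runs would use the CAE/CVAE above).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(24, 24, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(24 * 24, activation="sigmoid"),
    tf.keras.layers.Reshape((24, 24, 1)),
])

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across all GPUs every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync weights
model.fit(x_shard, x_shard, epochs=5, batch_size=64,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
```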