DSGD

Tabular interpretable classifier based on Dempster-Shafer Theory and Gradient Descent

Description

This repository contains 3 implementations of the classifier:

DSClassifier for binary classification problems
DSClassifierMulti for multi-class classification problems
DSClassifierMultiQ for multi-class classification problems; it also includes the commonality transformation improvement, which makes computations faster

We always recommend using DSClassifierMultiQ since it is the most stable and fastest implementation. Multi-class implementations can handle binary problems as well.

Installation

pip install git+https://github.com/Sergio-P/DSGD.git

Usage

Import the module

from dsgd import DSClassifierMultiQ

Read the data

The data can be read using pandas, numpy or other libraries

import pandas as pd
data = pd.read_csv("my_data.csv")

After that, separate the data into feature vectors and their corresponding classes

y = data["class"].values
X = data.drop(columns="class").values  # drop the label column, not a row labelled "class"

Ensure that the feature vectors (X) are a numpy matrix and their classes (y) are a numpy array. (The example uses the property DataFrame.values to convert pandas objects to numpy.) Also ensure that the classes are integers from 0 to num_classes - 1; strings are not permitted as class values.
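If the class column holds strings, it has to be mapped to integers first. The following is a minimal sketch of one way to do that; the helper name encode_labels and the sample labels are assumptions made for illustration and are not part of DSGD.

```python
def encode_labels(labels):
    """Map arbitrary class labels to integers 0 .. num_classes - 1.

    Hypothetical helper, not part of DSGD. Returns the encoded labels and
    the sorted list of original class names (index == encoded value).
    """
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


raw = ["setosa", "virginica", "setosa", "versicolor"]
y_list, classes = encode_labels(raw)
print(y_list)   # [0, 2, 0, 1]
print(classes)  # ['setosa', 'versicolor', 'virginica']
```

The encoded list can then be converted with numpy.asarray before training, and the classes list kept around to translate predicted integers back to names.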

Create the model

DSC = DSClassifierMultiQ(3, max_iter=150, debug_mode=True, 
                        lossfn="MSE", min_dloss=0.0001, lr=0.005,
                        precompute_rules=True)

In this step we create the model and set its configuration. The only required parameter is the first one, which indicates the number of classes in the problem (3 in our case). The remaining parameters are optional:

lr : Initial learning rate
min_iter : Minimum number of epochs to perform in the training phase
max_iter : Maximum number of epochs to perform in the training phase
min_dloss : Minimum variation of loss to consider convergence
optim : ( adam | sgd ) Optimization Method
lossfn : ( CE | MSE ) Loss function
debug_mode : Enables debug in training (prints and outputs metrics)
batch_size : For large datasets, the number of records to be processed together (batch)
precompute_rules : Whether to store the result of the rules computations for each record instead of computing every time. It speeds up the training but requires more memory.
force_precompute : Speeds up the training process but uses more memory, so use it carefully.
device : ( cpu | cuda | mps ) Device to use for computations. cuda and mps use the GPU and are usually faster than cpu. Using cuda requires a compatible GPU and an installed CUDA.
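The interplay of min_iter, max_iter and min_dloss can be pictured with a small stopping rule. This is only a sketch of the idea described in the list above, not DSGD's actual training loop: the function should_stop and the simulated loss curve are assumptions made for illustration.

```python
def should_stop(epoch, losses, min_iter, max_iter, min_dloss):
    """Decide whether training should stop after `epoch` completed epochs.

    Illustrative only: stops at max_iter, never before min_iter, and
    otherwise when the loss changed by less than min_dloss.
    """
    if epoch >= max_iter:
        return True
    if epoch < min_iter or len(losses) < 2:
        return False
    return abs(losses[-2] - losses[-1]) < min_dloss


# Simulated loss that halves every epoch (a stand-in for real training).
losses = []
stopped_at = None
for epoch in range(1, 151):
    losses.append(0.5 ** epoch)
    if should_stop(epoch, losses, min_iter=2, max_iter=150, min_dloss=0.0001):
        stopped_at = epoch
        break

print(stopped_at)  # 14
```

With this simulated curve the change in loss first drops below 0.0001 at epoch 14, well before max_iter is reached.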

Rule definition

After the model is defined, we need to define the rules. There are 2 ways to define rules: manually and automatically.

Define a rule manually

from dsgd import DSRule
DSC.model.add_rule(DSRule(lambda x: x[0] > 18, "Patient is adult"))

In this case we use the method add_rule of the model, which accepts a DSRule as its argument. A DSRule can be created directly with its constructor. The first argument is a lambda function that receives a feature vector x and returns whether the rule is satisfied (a boolean True or False). The second argument is optional and provides a meaningful description of the rule. In the example, if the first column of the feature vector holds the age of a patient, the lambda x: x[0] > 18 is satisfied when the patient is an adult, which matches the description given as the second argument.
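Because a rule is just a predicate over a feature vector, it can be checked on a sample row before it is handed to the model. The sketch below does this in plain Python; the column layout (age in x[0], systolic blood pressure in x[1]) and the extra rules are hypothetical.

```python
# Hypothetical column layout: x[0] = age in years, x[1] = systolic pressure.
rules = [
    (lambda x: x[0] > 18, "Patient is adult"),
    (lambda x: x[1] >= 140, "High systolic pressure"),
    (lambda x: x[0] > 18 and x[1] >= 140, "Adult with high systolic pressure"),
]


def satisfied_rules(x):
    """Return the descriptions of all rules that hold for feature vector x."""
    return [description for predicate, description in rules if predicate(x)]


adult = satisfied_rules([42, 150])
child = satisfied_rules([12, 150])
print(adult)  # all three descriptions
print(child)  # ['High systolic pressure']
```

Each (predicate, description) pair has the same shape as the two arguments of DSRule shown above, so a checked pair can be passed to DSC.model.add_rule(DSRule(predicate, description)).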

Define rules automatically

The model provides methods to generate rules automatically based on given parameters and statistics. The two main methods are explained below.

DSC.model.generate_statistic_single_rules(X, breaks=3, 
                             column_names=names)

Given a sample of feature vectors (usually the same one used for training) and a number of breaks n, the model generates simple one-attribute rules that separate each variable into n+1 groups with an equal number of records. Column names are optional and are only used to generate the rule descriptions.
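To make the idea of n breaks and n+1 equal-number groups concrete, here is a sketch for a single column using quantile cut points from the Python standard library. It illustrates the concept only: the helper single_column_rules is hypothetical, its predicates take a single value rather than a full feature vector, and the cut points DSGD computes internally may differ.

```python
from statistics import quantiles


def single_column_rules(values, breaks, name):
    """Split one column into breaks + 1 groups of (roughly) equal size.

    Hypothetical helper: returns the cut points and a list of
    (predicate, description) pairs, one per group.
    """
    cuts = quantiles(values, n=breaks + 1)  # yields exactly `breaks` cut points
    rules = [(lambda v, hi=cuts[0]: v < hi, f"{name} < {cuts[0]:.2f}")]
    for lo, hi in zip(cuts, cuts[1:]):
        rules.append((lambda v, lo=lo, hi=hi: lo <= v < hi,
                      f"{lo:.2f} <= {name} < {hi:.2f}"))
    rules.append((lambda v, lo=cuts[-1]: v >= lo, f"{name} >= {cuts[-1]:.2f}"))
    return cuts, rules


ages = list(range(1, 101))
cuts, rules = single_column_rules(ages, breaks=3, name="age")
counts = [sum(predicate(a) for a in ages) for predicate, _ in rules]
print(cuts)    # [25.25, 50.5, 75.75]
print(counts)  # [25, 25, 25, 25]
```

With breaks=3 the column is cut at three points, which gives four rules that each cover a quarter of the sample.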

DSC.model.generate_mult_pair_rules(X, column_names=names)

Given a sample of feature vectors (usually the same one used for training), this method creates rules for each pair of attributes indicating whether both are below their means, both are above their means, or one is above and the other below.
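The pairwise rules can be pictured with the following plain-Python sketch. It is one reading of the description above, not DSGD's implementation: the helper pair_rules, the three predicates per pair, the treatment of values equal to the mean, and the sample rows are all assumptions.

```python
from itertools import combinations
from statistics import mean


def pair_rules(rows, column_names):
    """Build (predicate, description) pairs for every pair of columns."""
    means = [mean(column) for column in zip(*rows)]
    rules = []
    for i, j in combinations(range(len(means)), 2):
        mi, mj = means[i], means[j]
        pair = f"{column_names[i]}, {column_names[j]}"
        # Values equal to the mean are counted as "at or below" here.
        rules.append((lambda x, i=i, j=j, mi=mi, mj=mj: x[i] <= mi and x[j] <= mj,
                      f"{pair}: both at or below their means"))
        rules.append((lambda x, i=i, j=j, mi=mi, mj=mj: x[i] > mi and x[j] > mj,
                      f"{pair}: both above their means"))
        rules.append((lambda x, i=i, j=j, mi=mi, mj=mj: (x[i] > mi) != (x[j] > mj),
                      f"{pair}: one above, one at or below"))
    return rules


rows = [[5.1, 3.5, 1.4], [7.0, 3.2, 4.7], [6.3, 3.3, 6.0], [4.9, 3.0, 1.4]]
names = ["sepal length", "sepal width", "petal length"]
rules = pair_rules(rows, names)
matches_per_row = [sum(predicate(r) for predicate, _ in rules) for r in rows]
print(len(rules))        # 9 (3 column pairs x 3 rules)
print(matches_per_row)   # [3, 3, 3, 3]
```

Every row satisfies exactly one rule per column pair, so the number of rules grows quadratically with the number of columns.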

Training

DSC.fit(X,y)

Given a set of feature vectors X and their corresponding classes y, the method fit trains the model according to the configuration and the rules defined. When this method finishes, the model is trained to predict new instances as accurately as possible.

Training performs many computations, so this method can take several minutes to finish.

When debug_mode is True, this method also prints its progress (e.g. the loss in each iteration) and measures and outputs the time taken in every step.

Predicting

y_pred = DSC.predict(X_new)

To predict a set of new feature vectors X_new, the model provides the method predict, which returns an array with the predicted class for each feature vector (in the same notation used in the fit method).

y_score = DSC.predict_proba(X_new)

The model also provides the method predict_proba which, instead of returning a single value for each feature vector (the predicted class), returns the estimated probability of belonging to each class.
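The relationship between the two outputs can be sketched as follows, assuming (as is typical, but not confirmed here for DSGD) that the predicted class is the one with the highest estimated probability. The scores below are made-up values, and argmax_rows is a hypothetical helper.

```python
def argmax_rows(scores):
    """Return, for each row of class probabilities, the index of the largest one."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Made-up probabilities for two records and three classes.
y_score = [[0.7, 0.2, 0.1],
           [0.1, 0.3, 0.6]]
labels = argmax_rows(y_score)
print(labels)  # [0, 2]
```

Keeping the full probability rows is useful when a decision threshold or a ranking is needed instead of a hard class label.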

Interpretability

DSC.model.print_most_important_rules()

The model can explain the decisions it makes. After training, the model can show which of the defined rules are most important for the prediction of each class. The method print_most_important_rules prints a summary of these findings, and the method find_most_important_rules returns this information in a structured way.

Save and Load trained models

As explained before, training is a very costly operation, so it is not desirable to retrain the model for every new experiment if it has already been trained. To handle this, the model provides methods to save trained models to disk and load them back.

DSC.model.save_rules_bin("my_trained_model.dsb")
# ...
DSC.model.load_rules_bin("my_trained_model.dsb")

Currently the model only saves the rules (lambdas and adjusted values); the rest of the configuration must be set every time. Note that the model must already have been created when load_rules_bin is invoked, so its configuration has already been defined at that point.
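A common way to use these two methods is a train-or-load pattern: train and save on the first run, and load on later runs. The sketch below uses only fit, save_rules_bin and load_rules_bin as shown in this document; the helper fit_or_load, its define_rules callback, and the stub classes (which stand in for a configured classifier so the sketch runs without DSGD installed) are assumptions for illustration.

```python
import os
import tempfile


def fit_or_load(dsc, X, y, path, define_rules):
    """Load saved rules if `path` exists; otherwise define rules, train and save."""
    if os.path.exists(path):
        dsc.model.load_rules_bin(path)
        return "loaded"
    define_rules(dsc)  # e.g. calls dsc.model.generate_statistic_single_rules(...)
    dsc.fit(X, y)
    dsc.model.save_rules_bin(path)
    return "trained"


# Stand-ins for a configured classifier, used only to make the sketch runnable.
class _StubModel:
    def save_rules_bin(self, path):
        with open(path, "wb") as handle:
            handle.write(b"rules")

    def load_rules_bin(self, path):
        with open(path, "rb") as handle:
            handle.read()


class _StubClassifier:
    def __init__(self):
        self.model = _StubModel()

    def fit(self, X, y):
        pass


path = os.path.join(tempfile.mkdtemp(), "my_trained_model.dsb")
first = fit_or_load(_StubClassifier(), [[0.0]], [0], path, lambda dsc: None)
second = fit_or_load(_StubClassifier(), [[0.0]], [0], path, lambda dsc: None)
print(first, second)  # trained loaded
```

In real use the first argument would be a classifier created with the same configuration on every run, since only the rules are stored in the file.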

Full example

For a full and simple example please refer to the Iris example. Uncomment and comment lines to see other features.
