Datasets with high dimensionality pose a challenge to existing learning methods. The presence of irrelevant and redundant features in a dataset can degrade the performance of the models inferred from it. In large datasets, managing features manually tends to be impractical. This framework removes redundant and irrelevant features from supervised datasets.
To install the requirements and compile on a Debian-based platform, execute the script "setup.sh":
$ ./setup.sh
To install on another Linux distribution, install the following packages:
- g++
- libboost-python-dev
- python-dev
- python-numpy
- python-pandas
- python-sklearn
- python-matplotlib
Then execute in the terminal:
$ make
$ make wrapper
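When the packages are installed by hand, a quick check that the Python dependencies are importable can save a failed run later. The sketch below is not part of the project. It assumes the distribution packages listed above provide the modules numpy, pandas, sklearn and matplotlib, and the function name missing_modules is hypothetical.

```python
import importlib.util

# Module names assumed to correspond to the python-* packages listed above.
REQUIRED_MODULES = ["numpy", "pandas", "sklearn", "matplotlib"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules were found.")
```

The check only confirms that each module can be located. It does not verify versions or the compiled wrapper.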
MICTools identifies correlations between variables in a dataset. It can be compiled as a standalone application and used independently.
MICSelect performs the feature selection; it requires MICTools.
Experiments were executed on Ubuntu 18.04 using Python 3.6.
- Compile MICTools
- Create the folder "datasets-test" in the root folder of this project
- Create the folder "s10" inside the folder "datasets-test"
- Create the folder "x20" inside the folder "datasets-test"
- Download the datasets from https://u.pcloud.link/publink/show?code=XZT2pOkZgHaO7WBaWzVmGRmMdkdjLY39hK2V
- Run MICSelect on every dataset in "datasets-input" with the parameters (-y target -s 10)
- Move every output in "datasets-output" to the folder "datasets-test/s10"
- Run MICSelect on every dataset in "datasets-input" with the parameters (-y target -x 20)
- Move every output in "datasets-output" to the folder "datasets-test/x20"
- Run result.py
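The folder handling in the steps above can be scripted. The following is a minimal sketch, not part of the project. It assumes "datasets-output" is a folder containing the files produced by MICSelect. The function names prepare_folders and collect_outputs are hypothetical. Running MICSelect itself is left out because its exact command line is not described here.

```python
import shutil
from pathlib import Path

# Folder names taken from the steps above.
TEST_DIR = "datasets-test"
RUN_FOLDERS = ("s10", "x20")  # s10: run with -s 10, x20: run with -x 20


def prepare_folders(root):
    """Create datasets-test/s10 and datasets-test/x20 under `root`."""
    created = []
    for name in RUN_FOLDERS:
        folder = Path(root) / TEST_DIR / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def collect_outputs(output_dir, dest_dir):
    """Move every file in `output_dir` into `dest_dir`; return the moved names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file():
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved
```

For example, after running MICSelect with (-y target -s 10) on every dataset, collect_outputs("datasets-output", "datasets-test/s10") moves the results into place before the second round of runs.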