Pklm implementation #157
… the projections. Creation of the associated tests
Still have problems with random_state usage.
Also add a notebook to show how it works
nb_permutation: int = 30,
nb_trees_per_proj: int = 200,
exact_p_value: bool = False,
encoder: Union[None, OneHotEncoder] = None,  # We could define more encoders.
We could add more encoders; which ones to offer is still to be decided. For now only the OneHotEncoder is available.
self.exact_p_value = exact_p_value
self.encoder = encoder

if self.exact_p_value:
The implementation of the "exact" computation of this p_value does not seem to be correct (see the notebook).
I don't know where it goes wrong, maybe here.
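For reference while debugging this branch, here is a minimal sketch of the two usual ways to turn permutation statistics into a p-value. `permutation_p_value` and `add_one` are hypothetical names, not the code in this PR, and whether `exact_p_value` is meant to select the add-one variant is an assumption.

```python
import numpy as np


def permutation_p_value(observed: float, permuted, add_one: bool = False) -> float:
    """Share of permuted statistics at least as large as the observed one.

    Hypothetical helper, not the PR's implementation. With add_one=True the
    observed statistic is counted as one extra permutation, so the result
    can never be exactly 0.
    """
    permuted = np.asarray(permuted, dtype=float)
    # Number of permuted statistics that are at least as extreme as the observed one.
    n_extreme = int(np.sum(permuted >= observed))
    if add_one:
        return (n_extreme + 1) / (permuted.size + 1)
    return n_extreme / permuted.size


stats = [1.0, 2.0, 3.0, 0.5]
p_plain = permutation_p_value(2.0, stats)
p_add_one = permutation_p_value(2.0, stats, add_one=True)
```

Comparing both variants on the same permuted statistics in the notebook may help narrow down whether the issue is in this branch or upstream of it.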
TypeNotHandled
If any column has a data type that is not numeric, string, or boolean.
"""
allowed_types = [
Currently we keep the numeric types and encode the following types:
- str
- bool
We could also add categories. Dates are an open question as well.
What remains to be done:

# * ``nb_projections``: Number of projections on which the test statistic is calculated. This
# parameter has the greatest influence on test calculation time. Its default value is
# ``nb_projections=100``.
# Should we give useful orders of magnitude? I had done some of this work.
I did a small probability analysis to estimate a suitable number of projections for a given dataset size. Is that useful?
def _encode_dataframe(self, df: pd.DataFrame) -> np.ndarray:
"""
Encodes the DataFrame by converting numeric columns to a numpy array
and applying one-hot encoding to non-numeric columns.
Simply remove "one-hot"! It may not always be one-hot encoding.
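To make the point concrete, here is a minimal sketch of an encoding step that accepts any encoder object with a `fit_transform` method and only falls back to `OneHotEncoder` by default. `encode_dataframe` is a hypothetical standalone function, not the method in this PR, and it assumes the non-numeric columns contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def encode_dataframe(df: pd.DataFrame, encoder=None) -> np.ndarray:
    """Keep numeric columns as floats and encode the remaining columns.

    Hypothetical sketch: assumes str/bool columns have no missing values.
    """
    numeric = df.select_dtypes(include="number")  # bool columns are not selected here
    other = df.drop(columns=numeric.columns)
    if other.shape[1] == 0:
        return numeric.to_numpy(dtype=float)
    if encoder is None:
        encoder = OneHotEncoder()  # default choice, any fit_transform encoder works
    encoded = encoder.fit_transform(other)
    if hasattr(encoded, "toarray"):  # OneHotEncoder returns a sparse matrix by default
        encoded = encoded.toarray()
    return np.hstack([numeric.to_numpy(dtype=float), encoded])


df_demo = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": ["x", "y", "x"], "c": [True, False, True]}
)
encoded_demo = encode_dataframe(df_demo)
```

Missing values in the numeric column stay as NaN in the output, which is what a missingness test needs to see.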
def _check_draw(df: np.ndarray, features_idx: np.ndarray, target_idx: int) -> np.bool_:
"""
Checks if the drawn features and target are valid.
# TODO : Need to develop ?
Developing this part would require going into the details of the paper. I'm not sure it is worth it.
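If the docstring is developed, a short example might be enough. The sketch below shows one plausible reading of "valid", assumed here and not taken from the PR or checked against the paper: among the rows fully observed on the drawn features, the target column must contain both missing and observed entries. `check_draw` is a hypothetical standalone version that reuses the signature above.

```python
import numpy as np


def check_draw(df: np.ndarray, features_idx: np.ndarray, target_idx: int) -> np.bool_:
    """Assumed validity rule, not confirmed by the PR.

    A draw is treated as valid when, among the rows with no missing value
    on the drawn features, the target column has at least one missing and
    at least one observed entry.
    """
    # Rows that are fully observed on the drawn feature columns.
    complete_rows = ~np.isnan(df[:, features_idx]).any(axis=1)
    # Missingness indicator of the target column on those rows.
    target_missing = np.isnan(df[complete_rows, target_idx])
    return np.logical_and(target_missing.any(), (~target_missing).any())


X_demo = np.array(
    [
        [1.0, 2.0, np.nan],
        [2.0, 3.0, 1.0],
        [np.nan, 1.0, 5.0],
    ]
)
valid_draw = check_draw(X_demo, np.array([0, 1]), 2)
invalid_draw = check_draw(X_demo, np.array([0]), 1)
```

In the first call the two complete rows give a target with one missing and one observed value; in the second the target has no missing value on the complete rows, so there is nothing to classify.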
Also add the p_value_validity notebook for the meeting with Jeffrey NAF.
examples/tutorials/plot_tuto_mcar.py
Outdated
# This notebook shows how the Little's test performs on a simplistic case and its limitations. We
# instanciate a test object with a random state for reproducibility.
# This notebook shows how the Little and PKLM tests perform on a simplistic case and their
# limitations. We instanciate a test object with a random state for reproducibility.
instantiate
+------------+------------+----------------------+
|   100000   |     15     |        3'06"         |
+------------+------------+----------------------+
"""
Why does the time decrease when the number of columns increases, for the same number of rows? The difference is more noticeable the more rows there are.
I think that's because the .fit() of the classifier takes less time when there are more features.