From 05a1eb835e1366feae5f0596da6f9e4490e8065b Mon Sep 17 00:00:00 2001 From: Oumaima Fisaoui <48260689+Oumaimafisaoui@users.noreply.github.com> Date: Tue, 1 Oct 2024 09:33:24 +0100 Subject: [PATCH 1/5] Chore(AI): Fix piscine structure --- subjects/ai/classification/README.md | 43 +++++++++------- subjects/ai/classification/audit/README.md | 2 +- subjects/ai/data-wrangling/README.md | 30 +++++++---- subjects/ai/data-wrangling/audit/README.md | 2 +- subjects/ai/keras-2/README.md | 29 +++++++---- subjects/ai/keras-2/audit/README.md | 2 +- subjects/ai/keras/README.md | 26 +++++++--- subjects/ai/keras/audit/README.md | 2 +- subjects/ai/linear-regression/README.md | 51 +++++++++++-------- subjects/ai/linear-regression/audit/README.md | 4 +- subjects/ai/model-selection/README.md | 30 +++++++---- subjects/ai/model-selection/audit/README.md | 2 +- subjects/ai/neural-networks/README.md | 38 ++++++++++---- subjects/ai/neural-networks/audit/README.md | 2 +- subjects/ai/nlp-spacy/README.md | 28 ++++++---- subjects/ai/nlp-spacy/audit/README.md | 2 +- subjects/ai/nlp/README.md | 29 +++++++---- subjects/ai/nlp/audit/README.md | 2 +- subjects/ai/numpy/README.md | 36 ++++++++----- subjects/ai/numpy/audit/README.md | 2 +- subjects/ai/pandas/README.md | 26 +++++++--- subjects/ai/pandas/audit/README.md | 2 +- subjects/ai/pipeline/README.md | 32 +++++++----- subjects/ai/pipeline/audit/README.md | 2 +- subjects/ai/time-series/README.md | 22 +++++--- subjects/ai/time-series/audit/README.md | 2 +- subjects/ai/training/README.md | 47 ++++++++++------- subjects/ai/training/audit/README.md | 2 +- subjects/ai/visualizations/README.md | 28 ++++++---- subjects/ai/visualizations/audit/README.md | 2 +- 30 files changed, 336 insertions(+), 191 deletions(-) diff --git a/subjects/ai/classification/README.md b/subjects/ai/classification/README.md index 3042412969..3aa01fd1c6 100644 --- a/subjects/ai/classification/README.md +++ b/subjects/ai/classification/README.md @@ -1,7 +1,15 @@ -# Classification +## Classification + +### Overview The goal of this day is to understand practical classification with Scikit Learn. +### Role play + +Imagine you're a data scientist working for a cutting-edge medical research company. Your team has been tasked with developing a machine learning model to assist doctors in diagnosing breast cancer. You'll be using logistic regression to classify tumors as benign or malignant based on various features. + +### Learning Objectives + Today we will learn a different approach in Machine Learning: the classification which is a large domain in the field of statistics and machine learning. Generally, it can be broken down in two areas: - **Binary classification**, where we wish to group an outcome into one of two groups. @@ -45,15 +53,15 @@ The **logloss** or **cross entropy** is the loss used for classification. Simila _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest to use the most recent one. Scikit Learn 1.0 is finally available after ... 14 years. -### **Resources** +### Resources -### Logistic regression +#### Logistic regression - https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102 -### Logloss +#### Logloss -- https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451 +- https://www.datacamp.com/tutorial/the-cross-entropy-loss-function-in-machine-learning - https://medium.com/swlh/what-is-logistic-regression-62807de62efa @@ -61,7 +69,7 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. 
I suggest to use the --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -73,13 +81,13 @@ I recommend to use: - the virtual environment you're the most confortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recents versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. --- --- -# Exercise 1: Logistic regression in Scikit-learn +### Exercise 1: Logistic regression in Scikit-learn The goal of this exercise is to learn to use Scikit-learn to classify data. @@ -98,7 +106,7 @@ y = [0,0,0,1,1,1,0] --- -# Exercise 2: Sigmoid +### Exercise 2: Sigmoid The goal of this exercise is to learn to compute and plot the sigmoid function. @@ -120,11 +128,11 @@ The plot should look like this: --- -# Exercise 3: Decision boundary +### Exercise 3: Decision boundary The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separated the data from the different classes. -## 1 dimension +#### 1 dimension First, we will start as usual with features data in 1 dimension. Use `make classification` from Scikit-learn to generate 100 data points: @@ -191,7 +199,7 @@ def predict_probability(coefs, X): [ex3q6]: ./w2_day2_ex3_q5.png "Scatter plot + Logistic regression + predictions" -## 2 dimensions +#### 2 dimensions Now, let us repeat this process on 2-dimensional data. The goal is to focus on the decision boundary and to understand how the Logistic Regression create a line that separates the data. The code to plot the decision boundary is provided, however it is important to understand the way it works. @@ -247,7 +255,7 @@ The plot should look like this: --- -# Exercise 4: Train test split +### Exercise 4: Train test split The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set but there's one important detail specific to the classification: the proportion of each class in the train set and test set. @@ -271,7 +279,7 @@ y[70:] = 1 --- -# Exercise 5: Breast Cancer prediction +### Exercise 5: Breast Cancer prediction The goal of this exercise is to use Logistic Regression to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest to add manually the column names in the DataFrame. @@ -299,7 +307,7 @@ Preliminary: --- -# Exercise 6: Multi-class (Optional) +### Exercise 6: Multi-class (Optional) The goal of this exercise is to learn to train a classification algorithm on a multi-class labelled data. Some algorithms as SVM or Logistic Regression do not natively support multi-class (more than 2 classes). There are some approaches that allow to use these algorithms on multi-class data. @@ -310,7 +318,7 @@ Let's assume we work with 3 classes: A, B and C. 
More details: -- https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/ +- https://medium.com/@agrawalsam1997/multiclass-classification-onevsrest-and-onevsone-classification-strategy-2c293a91571a Let's implement the One-vs-Rest approach from `LogisticRegression`. @@ -353,7 +361,8 @@ def predict_one_vs_all(X, clf0, clf1, clf2 ): #TODO return classes ``` +Resources : -- https://randerson112358.medium.com/python-logistic-regression-program-5e1b32f964db +- https://www.kaggle.com/code/rahulrajpandey31/logistic-regression-from-scratch-iris-data-set - https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a diff --git a/subjects/ai/classification/audit/README.md b/subjects/ai/classification/audit/README.md index 9ed24a48d7..ead114de82 100644 --- a/subjects/ai/classification/audit/README.md +++ b/subjects/ai/classification/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version` -###### Does it print `Python 3.x`? x >= 8? +###### Does it print `Python 3.x`? x >= 9? ###### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error? diff --git a/subjects/ai/data-wrangling/README.md b/subjects/ai/data-wrangling/README.md index b0181a7638..96a47b65dd 100644 --- a/subjects/ai/data-wrangling/README.md +++ b/subjects/ai/data-wrangling/README.md @@ -1,13 +1,21 @@ -# Data wrangling +## Data wrangling -Data wrangling is one of the crucial tasks in data science and analysis which includes operations like: +### Overview + +Data wrangling is one of the crucial tasks in data science and analysis + +### Role Play + +You are a newly hired data analyst at a major e-commerce company. Your first assignment is to clean and prepare various datasets for analysis. The company's data comes from multiple sources and in different formats. Your manager has tasked you with combining these datasets, dealing with missing or inconsistent data, and preparing summary reports. You'll need to use your data wrangling skills to transform raw data into a format suitable for analysis and visualization. + +### Learning Objectives - Data Sorting: To rearrange values in ascending or descending order. - Data Filtration: To create a subset of available data. - Data Reduction: To eliminate or replace unwanted values. - Data Access: To read or write data files. - Data Processing: To perform aggregation, statistical, and similar operations on specific values. - Ax explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling. + As explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling. ### Exercises of the day @@ -45,7 +53,7 @@ I suggest to use the most recent one. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -57,13 +65,13 @@ I recommend to use: - the virtual environment you're the most confortable with. `virtualenv` and `conda` are the most used in Data Science. 
- one of the most recents versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` ,`tabulate` and `jupyter`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy` ,`tabulate` and `jupyter`. --- --- -# Exercise 1: Concatenate +### Exercise 1: Concatenate The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series. @@ -82,7 +90,7 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]], --- -# Exercise 2: Merge +### Exercise 2: Merge The goal of this exercise is to learn to merge DataFrames The logic of merging DataFrames in Pandas is quite similar as the one used in SQL. @@ -132,7 +140,7 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2']) --- -# Exercise 3: Merge MultiIndex +### Exercise 3: Merge MultiIndex The goal of this exercise is to learn to merge DataFrames with MultiIndex. Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reasons the Data Engineer lost the last 15 days of alternative data. @@ -171,7 +179,7 @@ Use the code below to generate the DataFrames. `market_data` contains fake marke --- -# Exercise 4: Groupby Apply +### Exercise 4: Groupby Apply The goal of this exercise is to learn to group the data and apply a function on the groups. The use case we will work on is computing @@ -241,7 +249,7 @@ Here is what the function should output: --- -# Exercise 5: Groupby Agg +### Exercise 5: Groupby Agg The goal of this exercise is to learn to compute different type of aggregations on the groups. This small DataFrame contains products and prices. @@ -269,7 +277,7 @@ Note: The columns don't have to be MultiIndex --- -# Exercise 6: Unstack +### Exercise 6: Unstack The goal of this exercise is to learn to unstack a MultiIndex Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ... diff --git a/subjects/ai/data-wrangling/audit/README.md b/subjects/ai/data-wrangling/audit/README.md index b7812e4162..bcdd3258b9 100644 --- a/subjects/ai/data-wrangling/audit/README.md +++ b/subjects/ai/data-wrangling/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Does `import jupyter`, `import numpy` and `import pandas` run without any error? diff --git a/subjects/ai/keras-2/README.md b/subjects/ai/keras-2/README.md index 777cf1bc10..0a944d93be 100644 --- a/subjects/ai/keras-2/README.md +++ b/subjects/ai/keras-2/README.md @@ -1,4 +1,15 @@ -# Keras 2 +## Keras 2 + +### Overview + +This exercise set focuses on advanced applications of Keras for building and training neural networks. You'll work on both regression and multi-class classification problems, using real-world datasets like the Auto MPG and Iris datasets. + + +### Role Play + +You're a data scientist at a biotech company developing AI-powered systems for various applications. Your current project involves creating neural networks for both regression and multi-class classification tasks. 
You'll be working on predicting car fuel efficiency and classifying flower species, showcasing the versatility of neural networks in different domains. + +### Learning Objectives The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets. This helps to understand the specifics of networks for classification and regression. @@ -28,7 +39,7 @@ The audit will provide the code and output because it is not straightforward to _Version of Keras I used to do the exercises: 2.4.3_. I suggest to use the most recent one. -### **Resources** +### Resources - https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ @@ -36,7 +47,7 @@ I suggest to use the most recent one. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -48,13 +59,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`. +1. Create a virtual environment named with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`. --- --- -# Exercise 1: Regression - Optimize +### Exercise 1: Regression - Optimize The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in that exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network: @@ -88,7 +99,7 @@ https://keras.io/api/metrics/regression_metrics/ --- -# Exercise 2: Regression example +### Exercise 2: Regression example The goal of this exercise is to learn to train a neural network to perform a regression on a data set. The data set is [Auto MPG Dataset](auto-mpg.csv) and the go is to build a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight. @@ -109,7 +120,7 @@ https://www.tensorflow.org/tutorials/keras/regression --- -# Exercise 3: Multi classification - Softmax +### Exercise 3: Multi classification - Softmax The goal of this exercise is to learn to a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses as output layer a **softmax** layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as much neurons as classes in the multi-classification problem. This article explains in detail how it works. https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax @@ -126,7 +137,7 @@ Let us assume we want to classify images and we know they contain either apples, --- -# Exercise 4: Multi classification - Optimize +### Exercise 4: Multi classification - Optimize The goal of this exercise is to learn to optimize a multi-classification neural network. 
As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `binary_crossentropy`. There's no code to run in that exercise. @@ -142,7 +153,7 @@ model.compile(loss='',#TODO1 --- -# Exercise 5 Multi classification example +### Exercise 5 Multi classification example The goal of this exercise is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set which allows to classify flower given basic features as flower's measurement. diff --git a/subjects/ai/keras-2/audit/README.md b/subjects/ai/keras-2/audit/README.md index 31675b7cfb..0ce90efd58 100644 --- a/subjects/ai/keras-2/audit/README.md +++ b/subjects/ai/keras-2/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy`, `import pandas` and `import keras` run without any error? diff --git a/subjects/ai/keras/README.md b/subjects/ai/keras/README.md index cde386c22e..68853690a5 100644 --- a/subjects/ai/keras/README.md +++ b/subjects/ai/keras/README.md @@ -1,4 +1,14 @@ -# Keras +## Keras + +### Overview + +This exercise focuses on using Keras to build and train neural networks. Keras is a high-level deep learning API that runs on top of TensorFlow, designed for fast experimentation with deep neural networks. You'll learn to create sequential models, add dense layers, design network architectures, and optimize your models. + +### Role Play + +You are a machine learning engineer at a cutting-edge AI startup. Your team has been tasked with developing a neural network model to predict breast cancer diagnoses. The company wants to leverage the power of deep learning to improve early detection rates. Your job is to build, train, and optimize a neural network using Keras and TensorFlow. You'll need to demonstrate your understanding of neural network architectures, the Keras API, and best practices in deep learning model development. + +### Learning Objectives The goal of this day is to learn to use Keras to build Neural Networks. As explained on Keras website, Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research. And, TensorFlow was created by the Google Brain team, TensorFlow is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural networking) models and algorithms and makes them useful by way of a common metaphor. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++. @@ -28,7 +38,7 @@ The audit will provide the code and output because it is not straightforward to _Version of Keras I used to do the exercises: 2.4.3_. I suggest to use the most recent one. -### **Resources** +### Resources - https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ @@ -36,7 +46,7 @@ I suggest to use the most recent one. 
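Before starting the exercises, here is a minimal sketch of the `Sequential` and `Dense` workflow they build up to. The layer sizes and activations below are illustrative choices, not the ones required by the exercises:

```python
from keras.models import Sequential
from keras.layers import Dense

# a tiny network: 5 input features, one hidden layer, one output neuron
model = Sequential()
model.add(Dense(8, input_dim=5, activation='sigmoid'))  # hidden layer
model.add(Dense(1, activation='sigmoid'))               # output layer

# binary classification setup: log loss and a gradient-based optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```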
--- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -48,13 +58,13 @@ I recommend to use: - the virtual environment you're the most confortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recents versions of the libraries required -1. Create a virtual environment named with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, and `keras`. +1. Create a virtual environment named with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, and `keras`. --- --- -# Exercise 1: Sequential +### Exercise 1: Sequential The goal of this exercise is to learn to call the object `Sequential`. @@ -64,7 +74,7 @@ The goal of this exercise is to learn to call the object `Sequential`. --- -# Exercise 2: Dense +### Exercise 2: Dense The goal of this exercise is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks build in these exercises do not require custom layers. `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer** that specifies the number of inputs (features) is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer as any hidden layer can be created using `Dense`, the only difference is that the output layer contains one single neuron. @@ -90,7 +100,7 @@ The goal of this exercise is to learn to create layers of neurons. Keras propose --- -# Exercise 3: Architecture +### Exercise 3: Architecture The goal of this exercise is to combine the layers and to create a neural network. @@ -105,7 +115,7 @@ The goal of this exercise is to combine the layers and to create a neural networ --- -# Exercise 4: Optimize +### Exercise 4: Optimize The goal of this exercise is to learn to train the neural network. Once the architecture of the neural network is set there are two steps to train the neural network: diff --git a/subjects/ai/keras/audit/README.md b/subjects/ai/keras/audit/README.md index 1d4a573c1c..9e45c665b8 100644 --- a/subjects/ai/keras/audit/README.md +++ b/subjects/ai/keras/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version` -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Does `import jupyter`, `import numpy`, `import pandas`, and `import keras` run without any error? diff --git a/subjects/ai/linear-regression/README.md b/subjects/ai/linear-regression/README.md index 274cb86da9..737e81c4e9 100644 --- a/subjects/ai/linear-regression/README.md +++ b/subjects/ai/linear-regression/README.md @@ -1,6 +1,8 @@ ![Alt Text](w2_day01_linear_regression_video.gif) -# Linear regression +## Linear regression + +### Overview The goal of this day is to understand practical Linear regression and supervised learning with Scikit Learn. @@ -9,6 +11,13 @@ studied the size of individuals within a progeny. He was trying to understand wh large individuals in a population appeared to have smaller children, more close to the average population size; hence the introduction of the term "regression". +### Role play + +Hey there, future data detective! Ready to crack the case of predicting outcomes? 
You're in for a treat! This module is all about mastering the art of Linear Regression - your trusty magnifying glass in the world of data analysis. +Imagine being able to draw the perfect line through a cloud of data points, revealing hidden patterns and making predictions that'll make your colleagues go "Wow!" That's the power of Linear Regression, and you're about to become an expert! + +### Learning Objective + Today we will learn a basic algorithm used in **supervised learning** : **The Linear Regression**. We will be using **Scikit-learn** which is a machine learning library. It is designed to interoperate with the Python libraries NumPy and Pandas. We will also learn progressively the Machine Learning methodology for supervised learning - today we will focus on evaluating a machine learning model by splitting the data set in a train set and a test set. @@ -33,31 +42,29 @@ We will also learn progressively the Machine Learning methodology for supervised _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years. -### **Resources** +### Resources -### To start with Scikit-learn +#### To start with Scikit-learn -- https://scikit-learn.org/stable/tutorial/basic/tutorial.html +- https://scikit-learn.org/stable/getting_started.html - https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html - https://scikit-learn.org/stable/modules/linear_model.html -### Machine learning methodology and algorithms +#### Machine learning methodology and algorithms -- This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Andrew Ng is a star in the Machine Learning community. I recommend spending some time during the projects to focus on some algorithms. However, Python is not the language used for the course. https://www.coursera.org/learn/machine-learning +- This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Andrew Ng is a star in the Machine Learning community. I recommend spending some time during the projects to focus on some algorithms. However, Python is not the language used for the course. https://www.youtube.com/playlist?list=PLWD7QtH5pagQevEwjEOCQi1Cgqe3zKf2s - https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet -- https://scikit-learn.org/stable/tutorial/index.html - -### Linear Regression +#### Linear Regression -- https://towardsdatascience.com/laymans-introduction-to-linear-regression-8b334a3dab09 +- https://onlinestatbook.com/2/regression/intro.html -- https://towardsdatascience.com/linear-regression-the-actually-complete-introduction-67152323fcf2 +- https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/ -### Train test split +#### Train test split - https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/ @@ -67,7 +74,7 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -79,13 +86,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. 
Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. --- --- -# Exercise 1: Scikit-learn estimator +### Exercise 1: Scikit-learn estimator The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict. @@ -101,7 +108,7 @@ X, y = [[1],[2.1],[3]], [[1],[2],[3]] --- -# Exercise 2: Linear regression in 1D +### Exercise 2: Linear regression in 1D The goal of this exercise is to understand how the linear regression works in one dimension. To do so, we will generate a data in one dimension. Using `make regression` from Scikit-learn, generate a data set with 100 observations: @@ -149,7 +156,7 @@ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_e --- -# Exercise 3: Train test split +### Exercise 3: Train test split The goal of this exercise is to learn to split a data set. It is important to understand why we split the data in two sets. To put it in a nutshell: the Machine Learning model learns on the training data and evaluates on the data the model hasn't seen before: the testing data. @@ -170,11 +177,11 @@ https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_ --- -# Exercise 4: Forecast diabetes progression +### Exercise 4: Forecast diabetes progression The goal of this exercise is to use Linear Regression to forecast the progression of diabetes. It will not always be precised, you should **ALWAYS** start doing an exploratory data analysis in order to have a good understanding of the data you model. As a reminder here an introduction to EDA: -- https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9 +- https://medium.com/octave-john-keells-group/a-complete-guide-to-exploratory-data-analysis-on-structured-data-112c082892 The data set used is described in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes. @@ -200,7 +207,7 @@ https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset --- -# Exercise 5: Gradient Descent (Optional) +### Exercise 5: Gradient Descent (Optional) The goal of this exercise is to understand how the Linear Regression algorithm finds the optimal coefficients. @@ -238,7 +245,7 @@ y_pred1 = a*x1 + b\ y_pred2 = a*x2 + b\ y_pred3 = a\*x3 + b -### Greedy approach +#### Greedy approach 2. Create a function `compute_mse`. Compute mse for `a = 1` and `b = 2`. **Warning**: `X.shape` is `(100, 1)` and `y.shape` is `(100, )`. Make sure that `y_preds` and `y` have the same shape before to compute `y_preds-y`. @@ -310,7 +317,7 @@ The expected output is: In this example we computed 160 000 times the MSE. It is frequent to deal with 50 features, which requires 51 parameters to fit the Linear Regression. If we try this approach with 50 features we would need to compute **5.07e+132** MSE. Even if we reduce the scope and try only 5 values per coefficients we would have to compute the MSE **4.4409e+35** times. This approach is not scalable and that is why is not used to find optimal coefficients for Linear Regression. 
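To make this blow-up concrete, here is a minimal sketch of the brute-force search described above. The generated data set and the 400-point grid per coefficient are illustrative assumptions; with two coefficients this already means 400 * 400 = 160 000 MSE evaluations:

```python
import numpy as np
from sklearn.datasets import make_regression

# illustrative data; the exercise generates its own X and y
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

def compute_mse(a, b):
    y_pred = a * X.flatten() + b       # predictions of the 1D linear model
    return np.mean((y_pred - y) ** 2)  # mean squared error

grid = np.linspace(-200, 200, 400)     # 400 candidate values per coefficient
best_a, best_b = min(((a, b) for a in grid for b in grid),
                     key=lambda p: compute_mse(*p))
print(best_a, best_b)
```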
-### Gradient Descent +#### Gradient Descent In a nutshell, Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters (a and b) of our model. Parameters refer to the coefficients used in Linear Regression. Before to start implementing the questions, take the time to read [this article](https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-773ba7cd3dfe). It explains the gradient descent and how to implement it. The "tricky" part is the computation of the derivative of the mse. You can admit the formulas of the derivatives to implement the gradient descent (`d_theta_0` and `d_theta_1` in the article). diff --git a/subjects/ai/linear-regression/audit/README.md b/subjects/ai/linear-regression/audit/README.md index 68db6949a9..5971442f13 100644 --- a/subjects/ai/linear-regression/audit/README.md +++ b/subjects/ai/linear-regression/audit/README.md @@ -1,4 +1,4 @@ -#### Linear regression with Scikit Learn +#### Linear regression #### Exercise 0: Environment and libraries @@ -8,7 +8,7 @@ ##### Run `python --version` -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error? diff --git a/subjects/ai/model-selection/README.md b/subjects/ai/model-selection/README.md index 57bfa8caa7..85740dd158 100644 --- a/subjects/ai/model-selection/README.md +++ b/subjects/ai/model-selection/README.md @@ -1,4 +1,14 @@ -# Model selection +## Model selection + +### Overview + +This exercise set focuses on advanced model selection techniques in machine learning. You'll work with cross-validation, grid search, and performance evaluation tools. + +### Role Play + +You're a machine learning engineer at a tech company. Your team is working on improving model selection and evaluation processes for various projects. Your task is to implement and analyze different model selection techniques to ensure the most robust and reliable models are chosen for production. + +### Learning Objectives If you finished yesterday's exercises you should be able to train several Machine Learning algorithms and to choose one returned by GridSearchCV. GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set. @@ -27,17 +37,17 @@ We will answer these questions today ! The topics we will cover are the one of t _Version of Pandas I used to do the exercises: 1.0.1_. I suggest to use the most recent one. -### **Resources** +### Resources **Must read before to start the exercises** -### Biais-Variance trade off, aka Underfitting/Overfitting: +#### Biais-Variance trade off, aka Underfitting/Overfitting: - [Bias-Variance Trade-Off in Machine Learning](https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/) - [Hyperparameters and Model Validation](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html) -### Cross-validation +#### Cross-validation - [Train/Test Split and Cross Validation](https://algotrading101.com/learn/train-test-split/) @@ -45,7 +55,7 @@ I suggest to use the most recent one. 
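To fix the vocabulary before the exercises, here is a minimal, illustrative sketch of cross-validation in Scikit-learn. The data set and the estimator are placeholders; the exercises below use their own:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# 5-fold cross-validation: the model is fitted 5 times, each time
# evaluated on a fold it has never seen during fitting
cv_results = cross_validate(LinearRegression(), X, y, cv=5,
                            scoring='neg_mean_squared_error',
                            return_train_score=True)
print(cv_results['test_score'])  # one (negated) MSE per validation fold
```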
--- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -57,13 +67,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. --- --- -# Exercise 1: K-Fold +### Exercise 1: K-Fold The goal of this exercise is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data because this function is used by others as `cross_val_score` or `cross_validate` or `GridSearchCV` ... . But, this allows to understand the splitting and to create a custom one if needed. @@ -95,7 +105,7 @@ y = np.array(np.arange(1,11)) --- -# Exercise 2: Cross validation (k-fold) +### Exercise 2: Cross validation (k-fold) The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will firstly focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar but the `cross_validate` calculates one or more scores and timings for each CV split. @@ -153,7 +163,7 @@ Standard deviation of scores on validation sets: --- -# Exercise 3: GridsearchCV +### Exercise 3: GridsearchCV The goal here is to utilize GridSearchCV for running a grid search, making predictions, and scoring on a test set. @@ -204,7 +214,7 @@ _Hint_: The name of the metric to put in the parameter `scoring` is `neg_mean_sq --- -# Exercise 4: Validation curve and Learning curve +### Exercise 4: Validation curve and Learning curve The goal of this exercise is to learn how to analyze the model's performance with two tools: diff --git a/subjects/ai/model-selection/audit/README.md b/subjects/ai/model-selection/audit/README.md index dea69286d0..da34faa9cb 100644 --- a/subjects/ai/model-selection/audit/README.md +++ b/subjects/ai/model-selection/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error? diff --git a/subjects/ai/neural-networks/README.md b/subjects/ai/neural-networks/README.md index e58021747d..b3c441a3dc 100644 --- a/subjects/ai/neural-networks/README.md +++ b/subjects/ai/neural-networks/README.md @@ -1,4 +1,6 @@ -# Neural Networks +## Neural Networks + +### Overview Last week you learnt about some Machine Learning algorithms as Random Forest or Gradient Boosting. Neural Networks are another type of Machine Learning algorithms that are intensively used because of their efficiency. Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. 
The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Different types of neural networks exist and are specific to some use-cases. For example CNN for images, RNN or LSTMs for time-series or text, etc ... @@ -6,6 +8,24 @@ Today we will focus on Artificial Neural Networks. The goal is to understand how However the exercises won't cover architectures as RNN, LSTM - used on sequences as time series or text, CNN - used a lot on images processing. One of the projects will require to know how to use the special architectures. To do so, I suggest that you go through this lesson: https://fr.coursera.org/specializations/deep-learning. +### Role play + +Imagine you're a newly hired AI researcher at "NeuroTech Innovations," a cutting-edge startup developing AI solutions for healthcare. Your first major project is to create a neural network that can predict patient outcomes based on various medical parameters. + +Your team lead has tasked you with building the foundational components of this AI system. You'll start by implementing a single neuron, then combine multiple neurons into a small network, and finally adapt this network for both classification and regression tasks. + +### Learning Objectives + +By the end of this quest, you will be able to: + +- Implement a single artificial neuron and understand its components (weights, bias, activation function) +- Combine multiple neurons to create a simple neural network +- Implement and understand the importance of loss functions, particularly log loss for classification tasks +- Perform forward propagation in a neural network +- Adapt a neural network for regression tasks by modifying the output layer +- Evaluate the performance of your neural network using appropriate metrics (log loss for classification, MSE for regression) +- Gain intuition about how neural networks learn from data and make predictions + ### Exercises of the day - Exercise 0: Environment and libraries @@ -25,7 +45,7 @@ However the exercises won't cover architectures as RNN, LSTM - used on sequences _Version of NumPy I used to do the exercises: 1.18.1_. I suggest to use the most recent one. -### **Resources** +### Resources - https://victorzhou.com/blog/intro-to-neural-networks/ @@ -37,7 +57,7 @@ I suggest to use the most recent one. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -49,13 +69,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment with a version of Python >= `3.8`, with the following libraries: `numpy` and `jupyter`. +1. Create a virtual environment with a version of Python >= `3.9`, with the following libraries: `numpy` and `jupyter`. --- --- -# Exercise 1: The neuron +### Exercise 1: The neuron The goal of this exercise is to understand the role of a neuron and to implement a neuron. @@ -114,7 +134,7 @@ https://victorzhou.com/blog/intro-to-neural-networks/ --- -# Exercise 2: Neural network +### Exercise 2: Neural network The goal of this exercise is to understand how to combine three neurons to form a neural network. A neural network is nothing else than neurons connected together. 
As shown in the figure the neural network is composed of **layers**: @@ -165,7 +185,7 @@ Now, we add two more neurons: --- -# Exercise 3: Log loss +### Exercise 3: Log loss The objective of this exercise is to implement the Log Loss function, which serves as a **loss function** in classification problems. This function quantifies the difference between predicted and actual categorical outcomes, producing lower values for accurate predictions. @@ -191,7 +211,7 @@ This equation calculates Log Loss across all predictions in a dataset, penalizin --- -# Exercise 4: Forward propagation +### Exercise 4: Forward propagation The goal of this exercise is to compute the log loss on the output of the forward propagation. The data used is the tiny data set below. @@ -218,7 +238,7 @@ The goal if the network is to predict the success at the exam given math and che --- -# Exercise 5: Regression +### Exercise 5: Regression The goal of this exercise is to learn to adapt the output layer to regression. As a reminder, one of reasons for which the sigmoid is used in classification is because it contracts the output between 0 and 1 which is the expected output range for a probability (W2D2: Logistic regression). However, the output of the regression is not a probability. diff --git a/subjects/ai/neural-networks/audit/README.md b/subjects/ai/neural-networks/audit/README.md index dabc96a642..63c8986458 100644 --- a/subjects/ai/neural-networks/audit/README.md +++ b/subjects/ai/neural-networks/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter` and `import numpy` run without any error? diff --git a/subjects/ai/nlp-spacy/README.md b/subjects/ai/nlp-spacy/README.md index 7a5f193c21..693caec0aa 100644 --- a/subjects/ai/nlp-spacy/README.md +++ b/subjects/ai/nlp-spacy/README.md @@ -1,7 +1,15 @@ -# Natural Language processing with Spacy +## NLP with Spacy + +### Overview `spaCy` is a natural language processing (NLP) library for Python designed to have fast performance, and with word embedding models built in, it’s perfect for a quick and easy start. I don't need to detail what spaCy does, it is perfectly summarized by spaCy in this article: **spaCy 101: Everything you need to know**. +### Role Play + +You are a senior NLP engineer at a leading e-commerce company. Your team has been tasked with developing an advanced language understanding system to improve various aspects of the company's operations, including product recommendations, customer service automation, and market analysis. + +### Learning Objectives + Today, we will learn to use a pre-trained embedding to convert a text into a vector to compute similarity between words or sentences. Remember, embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. Word embeddings is a technique where individual words of a domain or language are represented as real-valued vectors in a lower dimensional space. The BoW representation's dimension depends on the size of the vocabulary. But it can easily reach 10k words. We will also learn to use NER and Part-of-speech. NER allows to identify and segment the named entities and classify or categorize them under various predefined classes. Part-of-speech is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case etc. 
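As a quick preview, here is a minimal sketch of the pieces described above (tokenization, part-of-speech tags, named entities). It assumes the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

print([token.text for token in doc])                 # tokenization
print([(tok.text, tok.pos_) for tok in doc])         # part-of-speech tags
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities
```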
@@ -26,7 +34,7 @@ Word embeddings is a technique where individual words of a domain or language ar I suggest using the most recent libraries. -### **Resources** +### Resources - https://spacy.io/usage/spacy-101 - https://spacy.io/api/doc @@ -37,7 +45,7 @@ I suggest using the most recent libraries. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -49,13 +57,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter`, `spaCy 3.4.0`, `sklearn`, `matplotlib`. +1. Create a virtual environment named with a version of Python >= `3.9`, with the following libraries: `pandas`, `jupyter`, `spaCy 3.4.0`, `sklearn`, `matplotlib`. --- --- -# Exercise 1: Embedding 1 +### Exercise 1: Embedding 1 The goal of this exercise is to learn to load an embedding on `spaCy`. @@ -70,7 +78,7 @@ The goal of this exercise is to learn to load an embedding on `spaCy`. --- -# Exercise 2: Tokenization +### Exercise 2: Tokenization The goal of this exercise is to learn to tokenize a document using `spaCy`. We did this using NLTK yesterday. @@ -85,7 +93,7 @@ The goal of this exercise is to learn to tokenize a document using `spaCy`. We d --- -# Exercise 3: Embeddings 2 +### Exercise 3: Embeddings 2 The goal of this exercise is to learn to use `spaCy` embedding on a document. @@ -107,7 +115,7 @@ https://medium.com/datadriveninvestor/cosine-similarity-cosine-distance-6571387f --- -# Exercise 4: Sentences' similarity +### Exercise 4: Sentences' similarity The goal of this exerice is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention as **buy shoes**, then we can detect this intention and use it to propose shoes advertisement for customers. The language model used in this exercise is `en_core_web_sm`. @@ -124,7 +132,7 @@ The goal of this exerice is to learn to compute the similarity between two sente --- -# Exercise 5: NER +### Exercise 5: NER The goal of this exercise is to learn to use a Named entity recognition algorithm to detect entities. @@ -147,7 +155,7 @@ https://en.wikipedia.org/wiki/Named-entity_recognition --- -# Exercise 6: Part-of-speech tags +### Exercise 6: Part-of-speech tags The goal of this exercise is to learn to use the Part-of-speech tags (**POS TAG**) using `spaCy`. As explained on Wikipedia, the POS TAG is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. diff --git a/subjects/ai/nlp-spacy/audit/README.md b/subjects/ai/nlp-spacy/audit/README.md index 709da8969b..90ee5b0686 100644 --- a/subjects/ai/nlp-spacy/audit/README.md +++ b/subjects/ai/nlp-spacy/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ##### Do `import jupyter`, `import pandas` and `import spacy` run without any error? 
diff --git a/subjects/ai/nlp/README.md b/subjects/ai/nlp/README.md index 49ce33bb1b..b096d94f36 100644 --- a/subjects/ai/nlp/README.md +++ b/subjects/ai/nlp/README.md @@ -1,7 +1,14 @@ -# NLP +## NLP + +### Overview “NLP makes it possible for humans to talk to machines:” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical in effectively analyzing massive quantities of unstructured, text-heavy data. +### Role Play +You're a Natural Language Processing (NLP) specialist at a tech startup developing a sentiment analysis tool for social media posts. Your task is to build the preprocessing pipeline and create a bag-of-words representation for tweet analysis. + +### Learning Objectives + Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in an unordered bucket. This approach is called a bag of words model or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost. This is useful to train usual machine learning models on text data. Other types of models as RNNs or LSTMs take as input a complete and ordered sequence. Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. The article **Your Guide to Natural Language Processing (NLP)** gives a very good introduction to NLP. @@ -30,7 +37,7 @@ Today, we we will learn to preprocess text data and to create a bag of word repr I suggest to use the most recent libraries. -### **Resources** +### Resources - https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1 @@ -40,7 +47,7 @@ I suggest to use the most recent libraries. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -52,13 +59,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter`, `nltk` and `scikit-learn`. +1. Create a virtual environment named with a version of Python >= `3.9`, with the following libraries: `pandas`, `jupyter`, `nltk` and `scikit-learn`. --- --- -# Exercise 1: Lowercase +### Exercise 1: Lowercase The goal of this exercise is to learn to lowercase text data in Python. Note that if the volume of data is low the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures as dictionaries or list are more adapted. @@ -77,7 +84,7 @@ series_data = pd.Series(list_, name='text') --- -# Exercise 2: Punctuation +### Exercise 2: Punctuation The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches as Bag of Words model the text as an unordered combination of words. In that case the punctuation is not always useful as it doesn't add information to the model. That is why is removed. 
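For instance, a minimal illustration (not necessarily the required solution for this exercise) using only the standard library:

```python
import string

def remove_punctuation(text):
    # str.maketrans maps every punctuation character to None
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuation("Hello, world! Isn't NLP fun?"))  # Hello world Isnt NLP fun
```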
@@ -93,7 +100,7 @@ The goal of this exercise is to learn to deal with punctuation. In Natural Langu --- -# Exercise 3: Tokenization +### Exercise 3: Tokenization The goal of this exercise is to learn [to tokenize](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) as text. This step is important because it splits the text into token. A token could be a sentence or a word. @@ -110,7 +117,7 @@ text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or g --- -# Exercise 4: Stop words +### Exercise 4: Stop words The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence. @@ -126,7 +133,7 @@ The goal of this exercise is to learn to remove stop words with NLTK. Stop word --- -# Exercise 5: Stemming +### Exercise 5: Stemming The goal of this exercise is to learn to use stemming using NLTK. As explained in details in the article, stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language. @@ -144,7 +151,7 @@ The interviewer interviews the president in an interview --- -# Exercise 6: Text preprocessing +### Exercise 6: Text preprocessing The goal of this exercise is to learn to create a function to prepocess and clean a text using NLTK. @@ -171,7 +178,7 @@ _Ressources: https://towardsdatascience.com/nlp-preprocessing-with-nltk-3c04ee00 --- -# Exercise 7: Bag of Word representation +### Exercise 7: Bag of Word representation The goal of this exercise is to understand the creation of a Bag of Word (BoW) model for a corpus of texts and create a labeled dataset from textual data using a word count matrix. diff --git a/subjects/ai/nlp/audit/README.md b/subjects/ai/nlp/audit/README.md index 4c8efe8094..52844a9cac 100644 --- a/subjects/ai/nlp/audit/README.md +++ b/subjects/ai/nlp/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import pandas`, `import nltk` and `import sklearn` run without any error? diff --git a/subjects/ai/numpy/README.md b/subjects/ai/numpy/README.md index 3fd2a1e690..8713639ec8 100644 --- a/subjects/ai/numpy/README.md +++ b/subjects/ai/numpy/README.md @@ -1,7 +1,17 @@ ## NumPy +### Overview + The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way. +### Role Play + +Hey there, future data wizard! Ready to dive into the magical world of NumPy? You're in for a treat! This module is all about mastering the mystical arts of numerical computing in Python. With NumPy as your trusty wand, you'll be slicing through arrays, conjuring up random numbers, and reshaping data faster than you can say "abracadabra"! + +### Learning Objectives + +Master fundamental NumPy operations and techniques to efficiently manipulate, analyze, and extract insights from numerical data in Python. 
+ ### Virtual Environment - Python 3.x @@ -13,7 +23,7 @@ I suggest to use the most recent one. ### Resources -- [Why Should We Use NumPy](https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9) +- [What Is It and Why Does It Matter?](https://www.nvidia.com/en-us/glossary/numpy/) - [NumPy Documentation](https://numpy.org/doc/) - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) @@ -21,7 +31,7 @@ I suggest to use the most recent one. --- -## Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow to write and test code within seconds. However, it really easy to implement instable and not reproducible code using notebooks. Keep the notebook and the underlying code clean. Notebook can be used for most of the exercises of the piscine as the goal is to experiment a lot. But no worries, you'll be asked to build a more robust structure for all the projects. @@ -41,7 +51,7 @@ I suggest utilizing: 4. Execute `print("Buy the dip ?")` in the second cell to display the message. -### Resources: +Resources : - [python](https://www.python.org/) - [Conda Documentation](https://docs.conda.io/) @@ -54,7 +64,7 @@ I suggest utilizing: --- -## Exercise 1: Your first NumPy array +### Exercise 1: Your first NumPy array The objective of this exercise is to familiarize yourself with incorporating various Python data types into **NumPy** arrays. **NumPy** arrays play a vital role in both **NumPy** and **Pandas**, offering flexibility and optimized functionalities. @@ -69,7 +79,7 @@ for i in your_np_array: --- -## Exercise 2: Zeros +### Exercise 2: Zeros The goal of this exercise is to learn to create a NumPy array with 0s. @@ -80,7 +90,7 @@ The goal of this exercise is to learn to create a NumPy array with 0s. --- -## Exercise 3: Slicing +### Exercise 3: Slicing The goal of this exercise is to learn NumPy indexing/slicing. It allows to access values of the NumPy array efficiently and without a for loop. @@ -117,7 +127,7 @@ The goal of this exercise is to learn NumPy indexing/slicing. It allows to acces --- -## Exercise 4: Random +### Exercise 4: Random The goal of this exercise is to learn to generate random data. In Data Science it is extremely useful to generate random data for many reasons: @@ -139,7 +149,7 @@ NumPy proposes a lot of options to generate random data. In statistics, assumpti --- -## Exercise 5: Split, concatenate, reshape arrays +### Exercise 5: Split, concatenate, reshape arrays The goal of this exercise is to learn to concatenate and reshape arrays. @@ -165,7 +175,7 @@ Print what you've created in the previous steps. --- -## Exercise 6: Broadcasting and Slicing +### Exercise 6: Broadcasting and Slicing The goal of this exercise is to learn to access values of n-dimensional arrays efficiently. @@ -223,7 +233,7 @@ Expected output: [ 5 10 15]] ``` -### Resources +Resources : [Computation on Arrays: Broadcasting](https://jakevdp.github.io/PythonDataScienceHandbook/) @@ -231,7 +241,7 @@ Expected output: --- -## Exercise 7: NaN +### Exercise 7: NaN The goal of this exercise is to handle missing data in NumPy and manipulate arrays effectively. @@ -286,7 +296,7 @@ Expected output: --- -## Exercise 8: Wine +### Exercise 8: Wine The goal of this exercise is to perform fundamental data analysis on real data using NumPy. 
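Loading a delimited text file into a NumPy array typically looks like the sketch below; the file name, the semicolon delimiter and the header row are assumptions matching the data set described next:

```python
import numpy as np

# skip_header drops the column-name row; the result is a 2D float array
data = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
print(data.shape, data.dtype)
```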
@@ -314,7 +324,7 @@ The dataset chosen for this task was the [red wine dataset](./data/winequality-r --- -## Exercise 9: Football tournament +### Exercise 9: Football tournament This exercise focuses on utilizing permutations and complex computations. diff --git a/subjects/ai/numpy/audit/README.md b/subjects/ai/numpy/audit/README.md index ea12e50f07..052dc2f455 100644 --- a/subjects/ai/numpy/audit/README.md +++ b/subjects/ai/numpy/audit/README.md @@ -8,7 +8,7 @@ ##### Run `python --version` -###### Does it print `Python 3.8.x`? x could be any number from 0 to 9 +###### Does it print `Python 3.x`? x >= 9 ###### Does `import jupyter` and `import numpy` run without any error? diff --git a/subjects/ai/pandas/README.md b/subjects/ai/pandas/README.md index 6911c090cd..1fb1c1a3a4 100644 --- a/subjects/ai/pandas/README.md +++ b/subjects/ai/pandas/README.md @@ -1,4 +1,16 @@ -# Pandas +## Pandas + +### Overview + +This set of exercises focuses on using Pandas, a powerful library for data manipulation and analysis in Python. You'll learn to create and manipulate DataFrames, work with real-world datasets, handle missing values, and perform various data operations. The exercises cover key Pandas functionalities including data loading, cleaning, transformation, and basic analysis. + +### Role Play + +You are a data analyst at a multinational energy company. Your team has been tasked with analyzing various datasets to improve operational efficiency and customer service. + +Your manager emphasizes the importance of clean, efficient code and clear explanations of your methods and findings. You'll need to present your results to both technical team members and non-technical executives, so focus on creating clear visualizations and concise summaries of your insights. + +### Learning Objectives The goal of this day is to understand practical usage of **Pandas**. As **Pandas** in intensively used in Data Science, other days of the piscine will be dedicated to it. @@ -51,7 +63,7 @@ It contains ALL you need to know about Pandas. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -63,13 +75,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy` and `jupyter`. --- --- -# Exercise 1: Your first DataFrame +### Exercise 1: Your first DataFrame The goal of this exercise is to learn to create basic Pandas objects. @@ -92,7 +104,7 @@ The goal of this exercise is to learn to create basic Pandas objects. --- -# Exercise 2: Electric power consumption +### Exercise 2: Electric power consumption The goal of this exercise is to learn to manipulate real data with Pandas. @@ -121,7 +133,7 @@ The data set used is [**Individual household electric power consumption**](https --- -# Exercise 3: E-commerce purchases +### Exercise 3: E-commerce purchases The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since the exercise 2 should have given you a nice introduction. 
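As a reminder of the patterns introduced in exercise 2, here is a small hypothetical sketch of the kind of selection and aggregation these questions call for. The column names and values are invented for illustration:

```python
import pandas as pd

# Invented mini-frame standing in for the real e-commerce data.
df = pd.DataFrame({
    "Purchase Price": [98.14, 70.73, 0.95, 78.04],
    "Language": ["el", "fr", "de", "el"],
})

print(df["Purchase Price"].mean())        # aggregate over one column
print((df["Language"] == "el").sum())     # count rows matching a condition
print(df.loc[df["Purchase Price"] > 50])  # filter rows with a boolean mask
```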
@@ -146,7 +158,7 @@ Questions: --- -# Exercise 4: Handling missing values +### Exercise 4: Handling missing values The goal of this exercise is to learn to handle missing values. In the previous exercise we used the first techniques: filter out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small. diff --git a/subjects/ai/pandas/audit/README.md b/subjects/ai/pandas/audit/README.md index 8c43921c29..2f3c33f773 100644 --- a/subjects/ai/pandas/audit/README.md +++ b/subjects/ai/pandas/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy` and `import pandas` run without any error? diff --git a/subjects/ai/pipeline/README.md b/subjects/ai/pipeline/README.md index c0cb8c5ddb..67013a04de 100644 --- a/subjects/ai/pipeline/README.md +++ b/subjects/ai/pipeline/README.md @@ -1,7 +1,15 @@ -# Pipeline +## Pipeline + +### Overview Today we will focus on the data preprocessing and discover the Pipeline object from scikit learn. +### Role play + +You are a data scientist working for a large e-commerce company. The marketing team has provided you with a dataset containing customer information and purchase history. However, the data is messy - it contains categorical variables, missing values, and features on different scales. Your task is to preprocess this data and prepare it for a machine learning model that will predict customer lifetime value. + +### Learning Objective + 1. Manage categorical variables with Integer encoding and One Hot Encoding 2. Impute the missing values 3. Reduce the dimension of the data @@ -42,15 +50,15 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the ### **Resources** -### Step 3 +#### Step 3 - https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e -### Step 4 +#### Step 4 - https://medium.com/@societyofai/simplest-way-for-feature-scaling-in-gradient-descent-ae0aaa383039#:~:text=Feature%20scaling%20is%20an%20idea,of%20convergence%20of%20gradient%20descent. -### Pipeline +#### Pipeline - https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html @@ -58,7 +66,7 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -70,13 +78,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. --- --- -# Exercise 1: Imputer 1 +### Exercise 1: Imputer 1 The goal of this exercise is to learn how to use an `Imputer` to fill missing values on basic example. @@ -102,7 +110,7 @@ test_data = [[np.nan, 1, 2], --- -# Exercise 2: Scaler +### Exercise 2: Scaler The goal of this exercise is to learn to scale a data set. 
There are various scaling techniques, we will focus on `StandardScaler` from scikit learn. @@ -137,7 +145,7 @@ Resources: --- -# Exercise 3: One hot Encoder +### Exercise 3: One hot Encoder The goal of this exercise is to learn how to deal with Categorical variables using the `OneHot` Encoder. @@ -177,7 +185,7 @@ The expected output is: --- -# Exercise 4: Ordinal Encoder +### Exercise 4: Ordinal Encoder The goal of this exercise is to learn how to deal with Categorical variables using the Ordinal Encoder. @@ -201,7 +209,7 @@ _Note: In the version 0.22 of Scikit-learn, the Ordinal Encoder doesn't handle n --- -# Exercise 5: Categorical variables +### Exercise 5: Categorical variables The goal of this exercise is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and One Hot Encoder. For this exercise I strongly suggest using a recent version of `sklearn >= 0.24.1` to avoid issues with the Ordinal Encoder. @@ -334,7 +342,7 @@ Resources: --- -# Exercise 6: Pipeline +### Exercise 6: Pipeline The goal of this exercise is to learn to use the Scikit-learn object: Pipeline. The data set: used for this exercise is the `iris` data set. diff --git a/subjects/ai/pipeline/audit/README.md b/subjects/ai/pipeline/audit/README.md index fcdb12c016..fccbdba240 100644 --- a/subjects/ai/pipeline/audit/README.md +++ b/subjects/ai/pipeline/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error? diff --git a/subjects/ai/time-series/README.md b/subjects/ai/time-series/README.md index 99f56aa34d..e2068dc223 100644 --- a/subjects/ai/time-series/README.md +++ b/subjects/ai/time-series/README.md @@ -1,4 +1,6 @@ -# Time Series +## Time Series + +### Overview Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn for instance: @@ -6,6 +8,12 @@ Time series data are data that are indexed by a sequence of dates or times. Toda - to calculate rolling and cumulative values for times series - to build a backtest +### Role Play + +You are a quantitative analyst at a prominent hedge fund. Your team is responsible for developing and testing trading strategies using historical financial data. Your manager has assigned you a project to analyze time series data, particularly focusing on Apple stock, and to backtest a simple trading strategy. + +### Learning Objectives + Time series a used A LOT in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that Python is vectorized. That's why some questions constraint you to not use a for loop ;-). ### Exercises of the day @@ -43,7 +51,7 @@ I suggest to use the most recent one. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -55,13 +63,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`. +1. 
Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy` and `jupyter`. --- --- -# Exercise 1: Series +### Exercise 1: Series The goal of this exercise is to learn to manipulate time series in Pandas. @@ -73,7 +81,7 @@ The goal of this exercise is to learn to manipulate time series in Pandas. --- -# Exercise 2: Financial data +### Exercise 2: Financial data This exercise aims to familiarize you with handling financial data using Pandas, particularly focusing on time series analysis and computations related to stock prices. @@ -97,7 +105,7 @@ There are two recommended methods: utilizing the `pct_change` function and imple --- -# Exercise 3: Multi asset returns +### Exercise 3: Multi asset returns The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets). @@ -136,7 +144,7 @@ Note: The data is generated randomly, the values you may have lead to a differen --- -# Exercise 4: Backtest +### Exercise 4: Backtest The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy. diff --git a/subjects/ai/time-series/audit/README.md b/subjects/ai/time-series/audit/README.md index d68c355ac9..4dd7c0ff22 100644 --- a/subjects/ai/time-series/audit/README.md +++ b/subjects/ai/time-series/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy` and `import pandas` run without any error? diff --git a/subjects/ai/training/README.md b/subjects/ai/training/README.md index 3323116293..c75dc3cf61 100644 --- a/subjects/ai/training/README.md +++ b/subjects/ai/training/README.md @@ -1,8 +1,17 @@ -# Training +## Training + +### Overview Today we will learn how to train and evaluate a machine learning model. You'll learn how to choose the right Machine Learning metric depending on the problem you are solving and to compute it. A metric gives an idea of how good the model performs. Depending on working on a classification problem or a regression problem the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth. -We will focus on the most important metrics: +### Role Play + +You are a machine learning engineer at a tech startup that specializes in predictive analytics for various industries. Your team has been tasked with developing and evaluating machine learning models for two key projects: + +1. A real estate price prediction tool for the California housing market. +2. A medical diagnostics system for breast cancer detection. + +### Learning Objectives - Regression: - **R2**, **Mean Square Error**, **Mean Absolute Error** @@ -40,23 +49,23 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. 
I suggest to use the ### Resources -### Metrics +#### Metrics -- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html +- https://medium.com/analytics-vidhya/different-metrics-to-evaluate-the-performance-of-a-machine-learning-model-90acec9e8726 -### Imbalance datasets +#### Imbalance datasets - https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset -### Gridsearch +#### Gridsearch -- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a +- https://www.dremio.com/wiki/grid-search/ --- --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -68,13 +77,13 @@ I recommend to use: - the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recent versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. +1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. --- --- -# Exercise 1: MSE Scikit-learn +### Exercise 1: MSE Scikit-learn The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE). @@ -89,7 +98,7 @@ y_pred = [90, 48, 2, 2, -4] --- -# Exercise 2: Accuracy Scikit-learn +### Exercise 2: Accuracy Scikit-learn The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy. @@ -104,7 +113,7 @@ y_true = [0, 0, 1, 1, 1, 1, 0] --- -# Exercise 3: Regression +### Exercise 3: Regression The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics. @@ -146,7 +155,7 @@ pipe.fit(X_train, y_train) --- -# Exercise 4: Classification +### Exercise 4: Classification The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics. @@ -181,13 +190,13 @@ classifier.fit(X_train_scaled, y_train) [logo_ex4]: ./w2_day4_ex4_q3.png "ROC AUC " -- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html +- https://scikit-learn.org/1.1/modules/generated/sklearn.metrics.plot_roc_curve.html --- --- -# Exercise 5: Machine Learning models +### Exercise 5: Machine Learning models The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn. We will focus on: @@ -199,7 +208,7 @@ We will focus on: All these algorithms exist in two versions: regression and classification. Even if the logic is similar in both classification and regression, the loss function is specific to each case. -It is really easy to get lost among all the existing algorithms. This article is very useful to have a clear overview of the models and to understand which algorithm use and when. https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9 +It is really easy to get lost among all the existing algorithms. This article is very useful to have a clear overview of the models and to understand which algorithm use and when. 
https://www.geeksforgeeks.org/choosing-a-suitable-machine-learning-algorithm/ Preliminary: @@ -247,7 +256,7 @@ Take time to have basic understanding of the role of the basic hyperparameter an --- -# Exercise 6: Grid Search +### Exercise 6: Grid Search The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameter which are the parameters of the model impact the performance of the model. @@ -293,8 +302,8 @@ Ressources: - https://stackoverflow.com/questions/38555650/try-multiple-estimator-in-one-grid-search -- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a +- https://www.dremio.com/wiki/grid-search/ - https://elutins.medium.com/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596 -- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html +- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html \ No newline at end of file diff --git a/subjects/ai/training/audit/README.md b/subjects/ai/training/audit/README.md index e654feac1f..0a242941b1 100644 --- a/subjects/ai/training/audit/README.md +++ b/subjects/ai/training/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version`. -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error? diff --git a/subjects/ai/visualizations/README.md b/subjects/ai/visualizations/README.md index cc3885c6cb..45cfe5db19 100644 --- a/subjects/ai/visualizations/README.md +++ b/subjects/ai/visualizations/README.md @@ -1,4 +1,6 @@ -# Visualizations +## Visualizations + +### Overview While working on a dataset it is important to check the distribution of the data. Obviously, for most of humans it is difficult to visualize the data in more than 3 dimensions @@ -8,6 +10,12 @@ While working on a dataset it is important to check the distribution of the data - Matplotlib - Plotly +### Role play + +You are a data visualization specialist at a leading tech company. Your team has been tasked with creating an interactive dashboard to present key insights from various company datasets. Your manager has emphasized the importance of using a variety of visualization techniques to effectively communicate complex data to both technical and non-technical stakeholders. + +### Learning Objectives + The goal is to understand the basics of those libraries. You'll have time during the project to master one (or the three) of them. You may wonder why using one library is not enough. The reason is simple: it depends on the usage. For example if you want to check the data quickly you may want to use Pandas viz module or Matplotlib. @@ -48,7 +56,7 @@ I suggest to use the most recent version of the packages. --- -# Exercise 0: Environment and libraries +### Exercise 0: Environment and libraries The goal of this exercise is to set up the Python work environment with the required libraries. @@ -60,13 +68,13 @@ I recommend to use: - the virtual environment you're the most confortable with. `virtualenv` and `conda` are the most used in Data Science. - one of the most recents versions of the libraries required -1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `plotly`. +1. 
Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `plotly`. --- --- -# Exercise 1: Pandas plot 1 +### Exercise 1: Pandas plot 1 The goal of this exercise is to learn to create plots with use Pandas. Panda's `.plot()` is a wrapper for `matplotlib.pyplot.plot()`. @@ -99,7 +107,7 @@ The plot has to contain: --- -# Exercise 2: Pandas plot 2 +### Exercise 2: Pandas plot 2 The goal of this exercise is to learn to create plots with use Pandas. Panda's `.plot()` is a wrapper for `matplotlib.pyplot.plot()`. @@ -130,7 +138,7 @@ The plot has to contain: --- -# Exercise 3: Matplotlib 1 +### Exercise 3: Matplotlib 1 The goal of this plot is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. Howerver, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`. @@ -153,7 +161,7 @@ The plot has to contain: --- -# Exercise 4: Matplotlib 2 +### Exercise 4: Matplotlib 2 The goal of this plot is to learn to use Matplotlib to plot different lines in the same plot on different axis using `twinx`. This very useful to compare variables in different ranges. @@ -183,7 +191,7 @@ The plot has to contain: --- -# Exercise 5: Matplotlib subplots +### Exercise 5: Matplotlib subplots The goal of this exercise is to learn to use Matplotlib to create subplots. @@ -206,7 +214,7 @@ The plot has to contain: --- -# Exercise 6: Plotly 1 +### Exercise 6: Plotly 1 Plotly has evolved a lot in the previous years. It is important to **always check the documentation**. @@ -245,7 +253,7 @@ https://plotly.com/python/time-series/ --- -# Exercise 7: Plotly Box plots +### Exercise 7: Plotly Box plots The goal of this exercise is to learn to use Plotly to plot Box Plots. It is a method for graphically depicting groups of numerical data through their quartiles and values as min, max. It allows comparing quickly some variables. diff --git a/subjects/ai/visualizations/audit/README.md b/subjects/ai/visualizations/audit/README.md index 9bf1717700..38b7c24189 100644 --- a/subjects/ai/visualizations/audit/README.md +++ b/subjects/ai/visualizations/audit/README.md @@ -6,7 +6,7 @@ ##### Run `python --version` -###### Does it print `Python 3.x`? x >= 8 +###### Does it print `Python 3.x`? x >= 9 ###### Do `import jupyter`, `import numpy`, `import pandas`, `matplotlib` and `plotly` run without any errors? 
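For reference while auditing the plotting exercises above, here is a minimal sketch of the `twinx` pattern checked in the Matplotlib exercises. The data is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
fig, ax1 = plt.subplots()
ax1.plot(x, x ** 2, color="blue")    # first series, left y-axis
ax1.set_xlabel("x")

ax2 = ax1.twinx()                    # second y-axis sharing the same x-axis
ax2.plot(x, np.sin(x), color="red")  # second series, different range

plt.title("Two ranges on one plot")
plt.show()
```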
From 5272a6e0d04ba6d01ca3d59b9ed95eda98a221b7 Mon Sep 17 00:00:00 2001 From: Oumaima Fisaoui <48260689+Oumaimafisaoui@users.noreply.github.com> Date: Tue, 1 Oct 2024 09:35:48 +0100 Subject: [PATCH 2/5] Chore(AI): Fix piscine structure --- subjects/ai/classification/README.md | 3 ++- subjects/ai/classification/audit/README.md | 6 +++--- subjects/ai/data-wrangling/audit/README.md | 2 +- subjects/ai/keras-2/README.md | 1 - subjects/ai/keras-2/audit/README.md | 1 - subjects/ai/keras/README.md | 2 +- subjects/ai/linear-regression/README.md | 10 +++++----- subjects/ai/model-selection/README.md | 6 +++--- subjects/ai/neural-networks/README.md | 4 ++-- subjects/ai/nlp-spacy/README.md | 2 +- subjects/ai/nlp-spacy/audit/README.md | 2 +- subjects/ai/nlp/README.md | 1 + subjects/ai/nlp/audit/README.md | 2 +- subjects/ai/pandas/README.md | 2 +- subjects/ai/training/README.md | 2 +- subjects/ai/visualizations/audit/README.md | 5 +++-- 16 files changed, 26 insertions(+), 25 deletions(-) diff --git a/subjects/ai/classification/README.md b/subjects/ai/classification/README.md index 3aa01fd1c6..4b54afcbae 100644 --- a/subjects/ai/classification/README.md +++ b/subjects/ai/classification/README.md @@ -304,8 +304,8 @@ Preliminary: - [Database](data/breast-cancer-wisconsin.data) and [database information](data/breast-cancer-wisconsin.names) --- ---- +--- ### Exercise 6: Multi-class (Optional) @@ -361,6 +361,7 @@ def predict_one_vs_all(X, clf0, clf1, clf2 ): #TODO return classes ``` + Resources : - https://www.kaggle.com/code/rahulrajpandey31/logistic-regression-from-scratch-iris-data-set diff --git a/subjects/ai/classification/audit/README.md b/subjects/ai/classification/audit/README.md index ead114de82..00c38b8069 100644 --- a/subjects/ai/classification/audit/README.md +++ b/subjects/ai/classification/audit/README.md @@ -31,7 +31,6 @@ Score: 0.7142857142857143 ``` - --- --- @@ -73,9 +72,9 @@ Coefficient: [[1.18866075]] ###### For question 4, does `predict_probability` output the same probabilities as `predict_proba`? Note that the values have to match one of the class probabilities, not both. To do so, compare the output with: `clf.predict_proba(X)[:,1]`. The shape of the arrays is not important. -###### Does `predict_class` output the same classes as `cfl.predict(X)` for question 5? The shape of the arrays is not important. +###### Does `predict_class` output the same classes as `cfl.predict(X)` for question 5? The shape of the arrays is not important. -###### Does the plot for question 6 look like the plot below? As mentioned, it is not required to shift the class prediction to make the plot easier to understand. +###### Does the plot for question 6 look like the plot below? As mentioned, it is not required to shift the class prediction to make the plot easier to understand. ![alt text][ex3q6] @@ -193,6 +192,7 @@ As said, for some reasons, the results may be slightly different from mine becau --- #### Bonus + #### Exercise 6: Multi-class (Optional) ##### The exercise is validated if all questions of the exercise are validated diff --git a/subjects/ai/data-wrangling/audit/README.md b/subjects/ai/data-wrangling/audit/README.md index bcdd3258b9..f861fdb5ba 100644 --- a/subjects/ai/data-wrangling/audit/README.md +++ b/subjects/ai/data-wrangling/audit/README.md @@ -52,7 +52,7 @@ | 5 | 6 | nan | nan | O | P | | 6 | 7 | nan | nan | Q | R | | 7 | 8 | nan | nan | S | T | - + Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name. 
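A minimal sketch of the suffix behaviour this note refers to, with invented frames:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "value": ["A", "B"]})
df2 = pd.DataFrame({"id": [2, 3], "value": ["C", "D"]})

# Overlapping column names are disambiguated by the suffixes parameter,
# so there is no need to rename columns by hand.
merged = df1.merge(df2, on="id", how="outer", suffixes=("_1", "_2"))
print(merged)  # columns: id, value_1, value_2
```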
--- diff --git a/subjects/ai/keras-2/README.md b/subjects/ai/keras-2/README.md index 0a944d93be..92ce31874a 100644 --- a/subjects/ai/keras-2/README.md +++ b/subjects/ai/keras-2/README.md @@ -4,7 +4,6 @@ This exercise set focuses on advanced applications of Keras for building and training neural networks. You'll work on both regression and multi-class classification problems, using real-world datasets like the Auto MPG and Iris datasets. - ### Role Play You're a data scientist at a biotech company developing AI-powered systems for various applications. Your current project involves creating neural networks for both regression and multi-class classification tasks. You'll be working on predicting car fuel efficiency and classifying flower species, showcasing the versatility of neural networks in different domains. diff --git a/subjects/ai/keras-2/audit/README.md b/subjects/ai/keras-2/audit/README.md index 0ce90efd58..622bab7712 100644 --- a/subjects/ai/keras-2/audit/README.md +++ b/subjects/ai/keras-2/audit/README.md @@ -131,7 +131,6 @@ model.compile(loss='categorical_crossentropy', --- - #### Exercise 5: Multi classification example ##### The exercise is validated if all questions of the exercise are validated diff --git a/subjects/ai/keras/README.md b/subjects/ai/keras/README.md index 68853690a5..c6712a1bf4 100644 --- a/subjects/ai/keras/README.md +++ b/subjects/ai/keras/README.md @@ -2,7 +2,7 @@ ### Overview -This exercise focuses on using Keras to build and train neural networks. Keras is a high-level deep learning API that runs on top of TensorFlow, designed for fast experimentation with deep neural networks. You'll learn to create sequential models, add dense layers, design network architectures, and optimize your models. +This exercise focuses on using Keras to build and train neural networks. Keras is a high-level deep learning API that runs on top of TensorFlow, designed for fast experimentation with deep neural networks. You'll learn to create sequential models, add dense layers, design network architectures, and optimize your models. ### Role Play diff --git a/subjects/ai/linear-regression/README.md b/subjects/ai/linear-regression/README.md index 737e81c4e9..d9df005661 100644 --- a/subjects/ai/linear-regression/README.md +++ b/subjects/ai/linear-regression/README.md @@ -126,7 +126,7 @@ X, y, coef = make_regression(n_samples=100, ![alt text][q1] -[q1]: ./w2_day1_ex2_q1.png 'Scatter plot' +[q1]: ./w2_day1_ex2_q1.png "Scatter plot" 2. Fit a LinearRegression from Scikit-learn on the generated data and give the equation of the fitted line. The expected output is: `y = coef * x + intercept` @@ -134,7 +134,7 @@ X, y, coef = make_regression(n_samples=100, ![alt text][q3] -[q3]: ./w2_day1_ex2_q3.png 'Scatter plot + fitted line' +[q3]: ./w2_day1_ex2_q3.png "Scatter plot + fitted line" 4. Predict on X. @@ -229,7 +229,7 @@ _Warning: The shape of X is not the same as the shape of y. You may need (for so ![alt text][ex5q1] -[ex5q1]: ./w2_day1_ex5_q1.png 'Scatter plot ' +[ex5q1]: ./w2_day1_ex5_q1.png "Scatter plot " As a reminder, fitting a Linear Regression on this data means finding (a, b) that fits well the data points. @@ -311,7 +311,7 @@ The expected output is: ![alt text][ex5q5] -[ex5q5]: ./w2_day1_ex5_q5.png 'MSE ' +[ex5q5]: ./w2_day1_ex5_q5.png "MSE " 6. From the `losses` list, find the optimal value of a and b and plot the line in the scatter point of question 1. 
@@ -327,6 +327,6 @@ In a nutshell, Gradient descent is an optimization algorithm used to minimize so ![alt text][ex5q8] -[ex5q8]: ./w2_day1_ex5_q8.png 'MSE + Gradient descent' +[ex5q8]: ./w2_day1_ex5_q8.png "MSE + Gradient descent" 9. Use Linear Regression from Scikit-learn. Compare the results. diff --git a/subjects/ai/model-selection/README.md b/subjects/ai/model-selection/README.md index 85740dd158..c6e2ac93bb 100644 --- a/subjects/ai/model-selection/README.md +++ b/subjects/ai/model-selection/README.md @@ -2,7 +2,7 @@ ### Overview -This exercise set focuses on advanced model selection techniques in machine learning. You'll work with cross-validation, grid search, and performance evaluation tools. +This exercise set focuses on advanced model selection techniques in machine learning. You'll work with cross-validation, grid search, and performance evaluation tools. ### Role Play @@ -245,7 +245,7 @@ The plot should look like this: ![alt text][logo_ex5q1] -[logo_ex5q1]: ./w2_day5_ex5_q1.png 'Validation curve ' +[logo_ex5q1]: ./w2_day5_ex5_q1.png "Validation curve " The interpretation is that from max_depth=10, the train score keeps increasing but the test score (or validation score) reaches a plateau. It means that choosing max_depth = 20 may lead to have an over fitted model. @@ -261,7 +261,7 @@ The interpretation is that from max_depth=10, the train score keeps increasing b ![alt text][logo_ex5q2] -[logo_ex5q2]: ./w2_day5_ex5_q2.png 'Learning curve ' +[logo_ex5q2]: ./w2_day5_ex5_q2.png "Learning curve " - **Note Plot Learning Curves**: The learning curves is detailed in the first resource. diff --git a/subjects/ai/neural-networks/README.md b/subjects/ai/neural-networks/README.md index b3c441a3dc..a2fcbf7836 100644 --- a/subjects/ai/neural-networks/README.md +++ b/subjects/ai/neural-networks/README.md @@ -147,7 +147,7 @@ Notice that the neuron **o1** in the output layer takes as input the output of t In exercise 1, you implemented this neuron. ![alt text][neuron] -[neuron]: ./w3_day1_neuron.png 'Plot' +[neuron]: ./w3_day1_neuron.png "Plot" Now, we add two more neurons: @@ -156,7 +156,7 @@ Now, we add two more neurons: ![alt text][nn] -[nn]: ./w3_day1_neural_network.png 'Plot' +[nn]: ./w3_day1_neural_network.png "Plot" 1. Implement the function `feedforward` of the class `OurNeuralNetwork` that takes as input the input data and returns the output y. Return the output for these neurons: diff --git a/subjects/ai/nlp-spacy/README.md b/subjects/ai/nlp-spacy/README.md index 693caec0aa..a68e666061 100644 --- a/subjects/ai/nlp-spacy/README.md +++ b/subjects/ai/nlp-spacy/README.md @@ -107,7 +107,7 @@ The goal of this exercise is to learn to use `spaCy` embedding on a document. 
![alt text][logo] -[logo]: ./w3day05ex1_plot.png 'Plot' +[logo]: ./w3day05ex1_plot.png "Plot" https://medium.com/datadriveninvestor/cosine-similarity-cosine-distance-6571387f9bf8 diff --git a/subjects/ai/nlp-spacy/audit/README.md b/subjects/ai/nlp-spacy/audit/README.md index 90ee5b0686..505743feb9 100644 --- a/subjects/ai/nlp-spacy/audit/README.md +++ b/subjects/ai/nlp-spacy/audit/README.md @@ -58,7 +58,7 @@ ![alt text][logo] -[logo]: ../w3day05ex1_plot.png 'Plot' +[logo]: ../w3day05ex1_plot.png "Plot" --- diff --git a/subjects/ai/nlp/README.md b/subjects/ai/nlp/README.md index b096d94f36..b88d218239 100644 --- a/subjects/ai/nlp/README.md +++ b/subjects/ai/nlp/README.md @@ -5,6 +5,7 @@ “NLP makes it possible for humans to talk to machines:” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical in effectively analyzing massive quantities of unstructured, text-heavy data. ### Role Play + You're a Natural Language Processing (NLP) specialist at a tech startup developing a sentiment analysis tool for social media posts. Your task is to build the preprocessing pipeline and create a bag-of-words representation for tweet analysis. ### Learning Objectives diff --git a/subjects/ai/nlp/audit/README.md b/subjects/ai/nlp/audit/README.md index 52844a9cac..dd45b53a93 100644 --- a/subjects/ai/nlp/audit/README.md +++ b/subjects/ai/nlp/audit/README.md @@ -40,7 +40,7 @@ Name: text, dtype: object #### Exercise 2: Punctuation -###### For question 1, is validated if the ouptut doesn't contain punctuation `` !"#$%&'()*+,-./:;<=>?@[]^_`{|}~ ``. Is the previous statement true? Do not take into account the spaces in the output. The output should be as: +###### For question 1, is validated if the ouptut doesn't contain punctuation ``!"#$%&'()*+,-./:;<=>?@[]^_`{|}~``. Is the previous statement true? Do not take into account the spaces in the output. The output should be as: ``` Remove this from the sentence diff --git a/subjects/ai/pandas/README.md b/subjects/ai/pandas/README.md index 1fb1c1a3a4..ac4447a1a8 100644 --- a/subjects/ai/pandas/README.md +++ b/subjects/ai/pandas/README.md @@ -6,7 +6,7 @@ This set of exercises focuses on using Pandas, a powerful library for data manip ### Role Play -You are a data analyst at a multinational energy company. Your team has been tasked with analyzing various datasets to improve operational efficiency and customer service. +You are a data analyst at a multinational energy company. Your team has been tasked with analyzing various datasets to improve operational efficiency and customer service. Your manager emphasizes the importance of clean, efficient code and clear explanations of your methods and findings. You'll need to present your results to both technical team members and non-technical executives, so focus on creating clear visualizations and concise summaries of your insights. 
diff --git a/subjects/ai/training/README.md b/subjects/ai/training/README.md index c75dc3cf61..32dcc819d9 100644 --- a/subjects/ai/training/README.md +++ b/subjects/ai/training/README.md @@ -306,4 +306,4 @@ Ressources: - https://elutins.medium.com/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596 -- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html \ No newline at end of file +- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html diff --git a/subjects/ai/visualizations/audit/README.md b/subjects/ai/visualizations/audit/README.md index 38b7c24189..a93b731348 100644 --- a/subjects/ai/visualizations/audit/README.md +++ b/subjects/ai/visualizations/audit/README.md @@ -20,11 +20,12 @@ ##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respect those criteria. -###### Does it have a title? +###### Does it have a title? ###### Does it have a name on x-axis? -###### Does it have a legend? +###### Does it have a legend? + ![alt text][logo] [logo]: ../w1day03_ex1_plot1.png "Bar plot ex1" From 0630611b6aa69e18f15bcf6ba094766e32e1371c Mon Sep 17 00:00:00 2001 From: Oumaima Fisaoui <48260689+Oumaimafisaoui@users.noreply.github.com> Date: Wed, 2 Oct 2024 10:13:47 +0100 Subject: [PATCH 3/5] draft --- subjects/ai/backtesting-sp500/README.md | 2 +- subjects/ai/backtesting-sp500/audit/README.md | 2 +- subjects/ai/credit-scoring/readme_data.md | 9 ++++----- subjects/ai/emotions-detector/README.md | 10 ++++------ 4 files changed, 10 insertions(+), 13 deletions(-) diff --git a/subjects/ai/backtesting-sp500/README.md b/subjects/ai/backtesting-sp500/README.md index 54fca3fb91..7360a0618e 100644 --- a/subjects/ai/backtesting-sp500/README.md +++ b/subjects/ai/backtesting-sp500/README.md @@ -135,7 +135,7 @@ A data point (x-axis: date, y-axis: cumulated_return) is: the **cumulated return ![alt text][performance] -[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance' +[performance]: images/w1_weekend_plot_pnl.png "Cumulative Performance" ## 5. Main diff --git a/subjects/ai/backtesting-sp500/audit/README.md b/subjects/ai/backtesting-sp500/audit/README.md index fdf200d933..6f88582bba 100644 --- a/subjects/ai/backtesting-sp500/audit/README.md +++ b/subjects/ai/backtesting-sp500/audit/README.md @@ -107,7 +107,7 @@ Best practice: ![alt text][performance] -[performance]: ../images/w1_weekend_plot_pnl.png 'Cumulative Performance' +[performance]: ../images/w1_weekend_plot_pnl.png "Cumulative Performance" ##### 5. main.py diff --git a/subjects/ai/credit-scoring/readme_data.md b/subjects/ai/credit-scoring/readme_data.md index 12aa9509ca..8682b10e39 100644 --- a/subjects/ai/credit-scoring/readme_data.md +++ b/subjects/ai/credit-scoring/readme_data.md @@ -4,7 +4,7 @@ This file describes the available data for the project. ![alt data description](data_description.png "Credit scoring data description") -## application_{train|test}.csv +## application\_{train|test}.csv This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET). Static data for all applications. One row represents one loan in our data sample. @@ -17,24 +17,23 @@ For every loan in our sample, there are as many rows as number of credits the cl ## bureau_balance.csv Monthly balances of previous credits in Credit Bureau. 
-This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
+This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample \* # of relative previous credits \* # of months where we have some history observable for the previous credits) rows.

## POS_CASH_balance.csv

Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
-This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
+This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample \* # of relative previous credits \* # of months in which we have some history observable for the previous credits) rows.

## credit_card_balance.csv

Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
-This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
+This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample \* # of relative previous credit cards \* # of months where we have some history observable for the previous credit card) rows.

## previous_application.csv

All previous applications for Home Credit loans of clients who have loans in our sample.
There is one row for each previous application related to loans in our data sample.

-
## installments_payments.csv

Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
diff --git a/subjects/ai/emotions-detector/README.md b/subjects/ai/emotions-detector/README.md
index a0bd998395..552c5fd883 100644
--- a/subjects/ai/emotions-detector/README.md
+++ b/subjects/ai/emotions-detector/README.md
@@ -150,12 +150,10 @@ Preprocessing ... 
### Useful resources:

- [Computer vision](https://machinelearningmastery.com/what-is-computer-vision/)

- [Use a pre-trained CNN](https://arxiv.org/pdf/1812.06387.pdf)

- [Hack the CNN](https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196)

- [Pre-Trained Convolutional Neural Network Features for Facial Expression Recognition](https://arxiv.org/pdf/1812.06387.pdf)

From a38bc0828431dae526b059adb0a75b15e4dcbcb2 Mon Sep 17 00:00:00 2001
From: Oumaima Fisaoui <48260689+Oumaimafisaoui@users.noreply.github.com>
Date: Sun, 6 Oct 2024 20:43:31 +0100
Subject: [PATCH 4/5] Chore(AI): change the resources format

---
 subjects/ai/classification/README.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/subjects/ai/classification/README.md b/subjects/ai/classification/README.md
index 4b54afcbae..195aa9fe7a 100644
--- a/subjects/ai/classification/README.md
+++ b/subjects/ai/classification/README.md
@@ -55,15 +55,11 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest to use the
### Resources -- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ +- [Neural network](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/) --- diff --git a/subjects/ai/keras/README.md b/subjects/ai/keras/README.md index c6712a1bf4..2ae37447bf 100644 --- a/subjects/ai/keras/README.md +++ b/subjects/ai/keras/README.md @@ -40,7 +40,7 @@ I suggest to use the most recent one. ### Resources -- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ +- [Neural network](https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/) --- diff --git a/subjects/ai/linear-regression/README.md b/subjects/ai/linear-regression/README.md index d9df005661..2fa3850180 100644 --- a/subjects/ai/linear-regression/README.md +++ b/subjects/ai/linear-regression/README.md @@ -46,11 +46,11 @@ _Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the #### To start with Scikit-learn -- https://scikit-learn.org/stable/getting_started.html +- [Scikit](https://scikit-learn.org/stable/getting_started.html) -- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html +- [Introducing Scikit](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html) -- https://scikit-learn.org/stable/modules/linear_model.html +- [Linear Model](https://scikit-learn.org/stable/modules/linear_model.html) #### Machine learning methodology and algorithms diff --git a/subjects/ai/neural-networks/README.md b/subjects/ai/neural-networks/README.md index a2fcbf7836..a87e182429 100644 --- a/subjects/ai/neural-networks/README.md +++ b/subjects/ai/neural-networks/README.md @@ -47,11 +47,11 @@ I suggest to use the most recent one. ### Resources -- https://victorzhou.com/blog/intro-to-neural-networks/ +- [Neural networks](https://victorzhou.com/blog/intro-to-neural-networks/) -- https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4 +- [Deep Learning](https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4) -- https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9 +- [Machine Learning](https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9) --- diff --git a/subjects/ai/nlp-spacy/README.md b/subjects/ai/nlp-spacy/README.md index a68e666061..22557a720d 100644 --- a/subjects/ai/nlp-spacy/README.md +++ b/subjects/ai/nlp-spacy/README.md @@ -36,10 +36,10 @@ I suggest using the most recent libraries. 
### Resources

- [spaCy 101](https://spacy.io/usage/spacy-101)
- [spaCy Doc API](https://spacy.io/api/doc)
- [Named Entity Recognition with spaCy](https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/)
- [Part-of-speech tagging in spaCy](https://medium.com/mlearning-ai/nlp-04-part-of-speech-tagging-in-spacy-dc3e239c2726)

---

diff --git a/subjects/ai/pipeline/README.md b/subjects/ai/pipeline/README.md
index 67013a04de..a0a22440cb 100644
--- a/subjects/ai/pipeline/README.md
+++ b/subjects/ai/pipeline/README.md
@@ -137,9 +137,9 @@ If the data is split in train and test set, it is extremely important to apply t

Resources:

- [Feature scaling for machine learning](https://medium.com/technofunnel/what-when-why-feature-scaling-for-machine-learning-standard-minmax-scaler-49e64c510422)

- [Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)

---
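To tie the scaling resources above together, here is a minimal sketch of a scaler inside a scikit-learn `Pipeline`, so that the test set is transformed with statistics learned on the train set only. The dataset and split are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

pipe = Pipeline([
    ("scaler", StandardScaler()),               # fitted on X_train only
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)         # the scaler learns train-set statistics
print(pipe.score(X_test, y_test))  # X_test is scaled with those statistics
```

Fitting the scaler inside the pipeline avoids leaking test-set statistics into training, which is the point stressed in the section above.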