diff --git a/book/000_title.md b/book/000_title.md index e8c85a3..27fdac5 100755 --- a/book/000_title.md +++ b/book/000_title.md @@ -11,6 +11,8 @@ header-includes: \definecolor{secundaryowlgreen}{rgb}{0.63,0.83,0.29} \definecolor{secundaryowlgray}{rgb}{0.57,0.56,0.56} \definecolor{secundaryowlmagenta}{rgb}{0.57,0.06,0.33} + \definecolor{yellowcover}{rgb}{1.00,0.80,0.09} + \definecolor{browncover}{rgb}{0.25,0.22,0.14} \usepackage{tcolorbox} \usepackage{tabularx} \usepackage{float} diff --git a/book/020_fundamentals_of_data_science.md b/book/020_fundamentals_of_data_science.md index 1451f7b..e0738ea 100755 --- a/book/020_fundamentals_of_data_science.md +++ b/book/020_fundamentals_of_data_science.md @@ -94,21 +94,24 @@ In data science, SQL is often used in combination with other tools and languages In this section, we will explore the usage of SQL commands with two tables: `iris` and `species`. The `iris` table contains information about flower measurements, while the `species` table provides details about different species of flowers. SQL (Structured Query Language) is a powerful tool for managing and manipulating relational databases. 
+\clearpage +\vfill + **iris table** ``` -| sepal_length | sepal_width | petal_length | petal_width | species | -|--------------|-------------|--------------|-------------|-----------| -| 5.1 | 3.5 | 1.4 | 0.2 | Setosa | -| 4.9 | 3.0 | 1.4 | 0.2 | Setosa | -| 4.7 | 3.2 | 1.3 | 0.2 | Setosa | -| 4.6 | 3.1 | 1.5 | 0.2 | Setosa | -| 5.0 | 3.6 | 1.4 | 0.2 | Setosa | -| 5.4 | 3.9 | 1.7 | 0.4 | Setosa | -| 4.6 | 3.4 | 1.4 | 0.3 | Setosa | -| 5.0 | 3.4 | 1.5 | 0.2 | Setosa | -| 4.4 | 2.9 | 1.4 | 0.2 | Setosa | -| 4.9 | 3.1 | 1.5 | 0.1 | Setosa | +| slength | swidth | plength | pwidth | species | +|---------|--------|---------|--------|-----------| +| 5.1 | 3.5 | 1.4 | 0.2 | Setosa | +| 4.9 | 3.0 | 1.4 | 0.2 | Setosa | +| 4.7 | 3.2 | 1.3 | 0.2 | Setosa | +| 4.6 | 3.1 | 1.5 | 0.2 | Setosa | +| 5.0 | 3.6 | 1.4 | 0.2 | Setosa | +| 5.4 | 3.9 | 1.7 | 0.4 | Setosa | +| 4.6 | 3.4 | 1.4 | 0.3 | Setosa | +| 5.0 | 3.4 | 1.5 | 0.2 | Setosa | +| 4.4 | 2.9 | 1.4 | 0.2 | Setosa | +| 4.9 | 3.1 | 1.5 | 0.1 | Setosa | ``` **species table** @@ -130,6 +133,8 @@ In this section, we will explore the usage of SQL commands with two tables: `iri Using the `iris` and `species` tables as examples, we can perform various SQL operations to extract meaningful insights from the data. 
Some of the commonly used SQL commands with these tables include: +\clearpage +\vfill **Data Retrieval:** @@ -139,8 +144,8 @@ SQL (Structured Query Language) is essential for accessing and retrieving data s | SQL Command | Purpose | Example | |-----------------|-----------------------------------------|-----------------------------------------------------------------| | SELECT | Retrieve data from a table | SELECT * FROM iris | -| WHERE | Filter rows based on a condition | SELECT * FROM iris WHERE sepal_length > 5.0 | -| ORDER BY | Sort the result set | SELECT * FROM iris ORDER BY sepal_width DESC | +| WHERE | Filter rows based on a condition | SELECT * FROM iris WHERE slength > 5.0 | +| ORDER BY | Sort the result set | SELECT * FROM iris ORDER BY swidth DESC | | LIMIT | Limit the number of rows returned | SELECT * FROM iris LIMIT 10 | | JOIN | Combine rows from multiple tables | SELECT * FROM iris JOIN species ON iris.species = species.name | --> @@ -152,14 +157,17 @@ SQL (Structured Query Language) is essential for accessing and retrieving data s \hline\hline \textbf{SQL Command} & \textbf{Purpose} & \textbf{Example} \\ \hline SELECT & Retrieve data from a table & SELECT * FROM iris \\ -WHERE & Filter rows based on a condition & SELECT * FROM iris WHERE sepal\_length > 5.0 \\ -ORDER BY & Sort the result set & SELECT * FROM iris ORDER BY sepal\_width DESC \\ -LIMIT & Limit the number of rows returned & SELECT * FROM iris LIMIT 10 \\ +WHERE & Filter rows based on a condition & SELECT * FROM iris WHERE slength > 5.0 \\ +ORDER BY & Sort the result set & SELECT * FROM iris ORDER BY swidth DESC \\ +LIMIT & Limit the number of rows returned & SELECT * FROM iris LIMIT 10 \\ JOIN & Combine rows from \mbox{multiple} tables & SELECT * FROM iris JOIN species ON iris.species = species.name \\ \hline\hline \end{tabularx} \caption{Common SQL commands for data retrieval.} \end{table} +\clearpage +\vfill + **Data Manipulation:** Data manipulation is a critical aspect of database 
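The retrieval commands in the table above can be tried end-to-end with Python's built-in `sqlite3` module; the sketch below builds a small in-memory `iris` table (the column names follow the renamed schema, and only a few illustrative rows are loaded, not the full dataset).

```python
import sqlite3

# Build a small in-memory database mirroring the iris table above
# (columns follow the renamed schema: slength, swidth, plength, pwidth).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (slength REAL, swidth REAL, "
    "plength REAL, pwidth REAL, species TEXT)"
)
rows = [
    (5.1, 3.5, 1.4, 0.2, "Setosa"),
    (4.9, 3.0, 1.4, 0.2, "Setosa"),
    (5.4, 3.9, 1.7, 0.4, "Setosa"),
]
conn.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", rows)

# WHERE: filter rows based on a condition
tall = conn.execute("SELECT * FROM iris WHERE slength > 5.0").fetchall()

# ORDER BY ... DESC combined with LIMIT: sort, then cap the result set
widest = conn.execute(
    "SELECT * FROM iris ORDER BY swidth DESC LIMIT 2"
).fetchall()

print(len(tall))     # number of rows with slength > 5.0
print(widest[0][1])  # largest swidth comes first
```

The same queries run unchanged on any SQL engine; `sqlite3` is used here only because it ships with Python and needs no server.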
management, allowing users to modify existing data, add new data, or delete unwanted data. The key SQL commands for data manipulation are `INSERT INTO` for adding new records, `UPDATE` for modifying existing records, and `DELETE FROM` for removing records. These commands are powerful tools for maintaining and updating the content within a database, ensuring that the data remains current and accurate. @@ -177,8 +185,8 @@ Data manipulation is a critical aspect of database management, allowing users to \begin{tabularx}{\textwidth}{|>{\hsize=0.5\hsize}X|>{\hsize=0.8\hsize}X|>{\hsize=1.7\hsize}X|} \hline\hline \textbf{SQL Command} & \textbf{Purpose} & \textbf{Example} \\ \hline -INSERT INTO & Insert new records into a table & INSERT INTO iris (sepal\_length, sepal\_width) VALUES (6.3, 2.8) \\ -UPDATE & Update existing records in a table & UPDATE iris SET petal\_length = 1.5 WHERE species = 'Setosa' \\ +INSERT INTO & Insert new records into a table & INSERT INTO iris (slength, swidth) VALUES (6.3, 2.8) \\ +UPDATE & Update existing records in a table & UPDATE iris SET plength = 1.5 WHERE species = 'Setosa' \\ DELETE FROM & Delete records from a \mbox{table} & DELETE FROM iris WHERE species = 'Versicolor' \\ \hline\hline \end{tabularx} \caption{Common SQL commands for modifying and managing data.} @@ -188,6 +196,9 @@ DELETE FROM & Delete records from a \mbox{table} & DELETE FROM i SQL provides robust functionality for aggregating data, which is essential for statistical analysis and generating meaningful insights from large datasets. Commands like `GROUP BY` enable grouping of data based on one or more columns, while `SUM`, `AVG`, `COUNT`, and other aggregation functions allow for the calculation of sums, averages, and counts. The `HAVING` clause can be used in conjunction with `GROUP BY` to filter groups based on specific conditions. These aggregation capabilities are crucial for summarizing data, facilitating complex analyses, and supporting decision-making processes. 
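The manipulation and aggregation commands described above can be combined into one short, runnable sketch, again using an in-memory SQLite database with a handful of illustrative rows (the values are assumptions, not the real iris data).

```python
import sqlite3

# In-memory sketch of INSERT INTO / UPDATE / DELETE FROM and GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (slength REAL, swidth REAL, "
    "plength REAL, pwidth REAL, species TEXT)"
)
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?, ?, ?)",
    [(5.1, 3.5, 1.4, 0.2, "Setosa"),
     (4.9, 3.0, 1.4, 0.2, "Setosa"),
     (7.0, 3.2, 4.7, 1.4, "Versicolor")],
)

# INSERT INTO: add a new record (unlisted columns default to NULL)
conn.execute("INSERT INTO iris (slength, swidth) VALUES (6.3, 2.8)")

# UPDATE: modify existing records matching a condition
conn.execute("UPDATE iris SET plength = 1.5 WHERE species = 'Setosa'")

# DELETE FROM: remove unwanted records
conn.execute("DELETE FROM iris WHERE species = 'Versicolor'")

# GROUP BY with COUNT and AVG, filtered by HAVING
summary = conn.execute(
    "SELECT species, COUNT(*), AVG(slength) "
    "FROM iris GROUP BY species HAVING COUNT(*) > 1"
).fetchall()
print(summary)  # one row per species with more than one record
```

Note that `HAVING` filters after grouping, so the freshly inserted row (whose `species` is NULL and forms a group of one) drops out of the summary.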
+\clearpage +\vfill + - -\begin{table}[H] -\centering - -\begin{tabularx}{\textwidth}{|>{\hsize=0.7\hsize}X|>{\hsize=0.7\hsize}X|>{\hsize=1.9\hsize}X|>{\hsize=0.7\hsize}X|} -\hline\hline -\textbf{Purpose} & \textbf{Library Name} & \textbf{Description} & \textbf{Website} \\ -\hline -Data \mbox{Cleaning} & Pandas (Python) & A powerful data manipulation library for \mbox{cleaning} and preprocessing data. & \href{https://pandas.pydata.org/}{Pandas} \\ -& dplyr (R) & Provides a set of functions for data wrangling and data manipulation tasks. & \href{https://dplyr.tidyverse.org/}{dplyr} \\ -\hline -Normalization & scikit-learn (Python) & Offers various normalization techniques such as Min-Max scaling and Z-score normalization. & \href{https://scikit-learn.org/}{scikit-learn} \\ -& caret (R) & Provides pre-processing functions, including normalization, for building machine learning models. & \href{https://topepo.github.io/caret/}{caret} \\ -\hline -Feature \mbox{Engineering} & Featuretools (Python) & A library for automated feature engineering that can generate new features from existing ones. & \href{https://www.featuretools.com/}{Featuretools} \\ -& recipes (R) & Offers a framework for feature engineering, \mbox{allowing} users to create custom feature \mbox{transformation} pipelines. & \href{https://recipes.tidymodels.org/}{recipes} \\ -\hline -Non-Linearity Handling & TensorFlow (Python) & A deep learning library that supports building and training non-linear models using neural \mbox{networks}. & \href{https://www.tensorflow.org/}{TensorFlow} \\ -& keras (R) & Provides high-level interfaces for building and training neural networks with non-linear \mbox{activation} functions. & \href{https://keras.io/}{keras} \\ -\hline -Outlier Treatment & PyOD (Python) & A comprehensive library for outlier detection and removal using various algorithms and \mbox{models}. 
& \href{https://pyod.readthedocs.io/}{PyOD} \\ -& outliers (R) & Implements various methods for detecting and handling outliers in datasets. & \href{https://cran.r-project.org/web/packages/outliers/index.html}{outliers} \\ -\hline\hline -\end{tabularx} -\caption{Data preprocessing and machine learning libraries.} -\end{table} + * **Outlier Treatment:** Outliers can significantly impact the analysis and model performance. Transformations such as winsorization or logarithmic transformation can help reduce the influence of outliers without losing valuable information. **PyOD** in Python offers a comprehensive suite of tools for detecting and treating outliers using various algorithms and models (details at [PyOD](https://pyod.readthedocs.io/)). \clearpage \vfill diff --git a/book/070_modeling_and_data_validation.md b/book/070_modeling_and_data_validation.md index 6cffcae..3f6c20c 100755 --- a/book/070_modeling_and_data_validation.md +++ b/book/070_modeling_and_data_validation.md @@ -164,7 +164,7 @@ Proper model evaluation helps to identify potential issues such as overfitting o | Recall (Sensitivity) | Measures the proportion of true positive predictions among all actual positive instances in classification tasks. | scikit-learn: `recall_score` | | F1 Score | Combines precision and recall into a single metric, providing a balanced measure of model performance. | scikit-learn: `f1_score` | | ROC AUC | Quantifies the model's ability to distinguish between classes by plotting the true positive rate against the false positive rate. | scikit-learn: `roc_auc_score` | ---> + \begin{table}[H] \centering @@ -186,6 +186,29 @@ ROC AUC & Quantifies the model's ability to distinguish between classes by plott \caption{Common machine learning evaluation metrics and their corresponding libraries.} \end{table} +--> + +In machine learning, evaluation metrics are crucial for assessing model performance. 
The **Mean Squared Error (MSE)** measures the average squared difference between the predicted and actual values in regression tasks. This metric is computed using the `mean_squared_error` function in the `scikit-learn` library. + +Another related metric is the **Root Mean Squared Error (RMSE)**, which represents the square root of the MSE to provide a measure of the average magnitude of the error. It is typically calculated by taking the square root of the MSE value obtained from `scikit-learn`. + +The **Mean Absolute Error (MAE)** computes the average absolute difference between predicted and actual values, also in regression tasks. This metric can be calculated using the `mean_absolute_error` function from `scikit-learn`. + +**R-squared** is used to measure the proportion of the variance in the dependent variable that is predictable from the independent variables. It is a key performance metric for regression models and can be found in the `statsmodels` library. + +For classification tasks, **Accuracy** calculates the ratio of correctly classified instances to the total number of instances. This metric is obtained using the `accuracy_score` function in `scikit-learn`. + +**Precision** represents the proportion of true positive predictions among all positive predictions. It helps determine the accuracy of the positive class predictions and is computed using `precision_score` from `scikit-learn`. + +**Recall**, or Sensitivity, measures the proportion of true positive predictions among all actual positives in classification tasks, using the `recall_score` function from `scikit-learn`. + +The **F1 Score** combines precision and recall into a single metric, providing a balanced measure of model performance. It is calculated using the `f1_score` function in `scikit-learn`. + +Lastly, the **ROC AUC** quantifies a model's ability to distinguish between classes. 
It is the area under the ROC curve, which plots the true positive rate against the false positive rate, and can be calculated using the `roc_auc_score` function from `scikit-learn`. + +These metrics are essential for evaluating the effectiveness of machine learning models, helping developers understand model performance in various tasks. Each metric offers a different perspective on model accuracy and error, allowing for comprehensive performance assessments. + + \clearpage \vfill
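As a quick illustration with toy numbers, the core metrics above reduce to a few lines of arithmetic; these hand-rolled formulas mirror what scikit-learn's `mean_squared_error`, `mean_absolute_error`, `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute on the same inputs.

```python
import math

# Toy regression targets and predictions (illustrative values only)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# MSE: mean of squared errors; RMSE is simply its square root
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)
# MAE: mean of absolute errors
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy binary classification labels and predictions
c_true = [1, 0, 1, 1, 0, 1]
c_pred = [1, 0, 0, 1, 0, 1]

tp = sum(1 for t, p in zip(c_true, c_pred) if t == p == 1)
fp = sum(1 for t, p in zip(c_true, c_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(c_true, c_pred) if t == 1 and p == 0)

accuracy = sum(t == p for t, p in zip(c_true, c_pred)) / len(c_true)
precision = tp / (tp + fp)          # true positives among predicted positives
recall = tp / (tp + fn)             # true positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(mse, rmse, mae)               # 0.375, ~0.612, 0.5
print(accuracy, precision, recall, f1)
```

In practice one would call the library functions directly; spelling the formulas out once makes explicit which quantities each metric trades off.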