⚡📝 update, new content.
imarranz committed May 30, 2024
1 parent f7807c9 commit 4c3d6ec
Showing 5 changed files with 75 additions and 73 deletions.
2 changes: 2 additions & 0 deletions book/000_title.md
@@ -11,6 +11,8 @@ header-includes:
\definecolor{secundaryowlgreen}{rgb}{0.63,0.83,0.29}
\definecolor{secundaryowlgray}{rgb}{0.57,0.56,0.56}
\definecolor{secundaryowlmagenta}{rgb}{0.57,0.06,0.33}
\definecolor{yellowcover}{rgb}{1.00,0.80,0.09}
\definecolor{browncover}{rgb}{0.25,0.22,0.14}
\usepackage{tcolorbox}
\usepackage{tabularx}
\usepackage{float}
53 changes: 32 additions & 21 deletions book/020_fundamentals_of_data_science.md
@@ -94,21 +94,24 @@ In data science, SQL is often used in combination with other tools and languages

In this section, we will explore the usage of SQL commands with two tables: `iris` and `species`. The `iris` table contains information about flower measurements, while the `species` table provides details about different species of flowers. SQL (Structured Query Language) is a powerful tool for managing and manipulating relational databases.

\clearpage
\vfill

**iris table**

```
| sepal_length | sepal_width | petal_length | petal_width | species |
|--------------|-------------|--------------|-------------|-----------|
| 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | Setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | Setosa |
| 5.0 | 3.4 | 1.5 | 0.2 | Setosa |
| 4.4 | 2.9 | 1.4 | 0.2 | Setosa |
| 4.9 | 3.1 | 1.5 | 0.1 | Setosa |
| slength | swidth | plength | pwidth | species |
|---------|--------|---------|--------|-----------|
| 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | Setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | Setosa |
| 5.0 | 3.4 | 1.5 | 0.2 | Setosa |
| 4.4 | 2.9 | 1.4 | 0.2 | Setosa |
| 4.9 | 3.1 | 1.5 | 0.1 | Setosa |
```

**species table**
@@ -130,6 +133,8 @@ In this section, we will explore the usage of SQL commands with two tables: `iri

Using the `iris` and `species` tables as examples, we can perform various SQL operations to extract meaningful insights from the data. Some of the commonly used SQL commands with these tables include:

\clearpage
\vfill

**Data Retrieval:**

@@ -139,8 +144,8 @@ SQL (Structured Query Language) is essential for accessing and retrieving data s
| SQL Command | Purpose | Example |
|-----------------|-----------------------------------------|-----------------------------------------------------------------|
| SELECT | Retrieve data from a table | SELECT * FROM iris |
| WHERE | Filter rows based on a condition | SELECT * FROM iris WHERE sepal_length > 5.0 |
| ORDER BY | Sort the result set | SELECT * FROM iris ORDER BY sepal_width DESC |
| WHERE | Filter rows based on a condition | SELECT * FROM iris WHERE slength > 5.0 |
| ORDER BY | Sort the result set | SELECT * FROM iris ORDER BY swidth DESC |
| LIMIT | Limit the number of rows returned | SELECT * FROM iris LIMIT 10 |
| JOIN | Combine rows from multiple tables | SELECT * FROM iris JOIN species ON iris.species = species.name |
-->
@@ -152,14 +157,17 @@
\hline\hline
\textbf{SQL Command} & \textbf{Purpose} & \textbf{Example} \\ \hline
SELECT & Retrieve data from a table & SELECT * FROM iris \\
WHERE & Filter rows based on a condition & SELECT * FROM iris WHERE sepal\_length > 5.0 \\
ORDER BY & Sort the result set & SELECT * FROM iris ORDER BY sepal\_width DESC \\
LIMIT & Limit the number of rows returned & SELECT * FROM iris LIMIT 10 \\
WHERE & Filter rows based on a condition & SELECT * FROM iris WHERE slength > 5.0 \\
ORDER BY & Sort the result set & SELECT * FROM iris ORDER BY swidth DESC \\
LIMIT & Limit the number of rows returned & SELECT * FROM iris LIMIT 10 \\
JOIN & Combine rows from \mbox{multiple} tables & SELECT * FROM iris JOIN species ON iris.species = species.name \\ \hline\hline
\end{tabularx}
\caption{Common SQL commands for data retrieval.}
\end{table}
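
For readers who want to try these commands directly, the short sketch below runs a few of the retrieval queries from the table above using Python's built-in `sqlite3` module. It assumes the `iris` and `species` tables have already been loaded into a local SQLite file (the file name `iris.db` is illustrative only), with the column names used in this section (`slength`, `swidth`, `plength`, `pwidth`, `species`).

```python
import sqlite3

# Illustrative only: assumes a local SQLite database that already
# contains the `iris` and `species` tables described above.
conn = sqlite3.connect("iris.db")
cur = conn.cursor()

# Filter rows on a condition and sort the result set.
cur.execute("SELECT * FROM iris WHERE slength > 5.0 ORDER BY swidth DESC")
print(cur.fetchall())

# Limit the number of rows returned.
cur.execute("SELECT * FROM iris LIMIT 10")
print(cur.fetchall())

# Combine rows from both tables.
cur.execute("SELECT * FROM iris JOIN species ON iris.species = species.name")
print(cur.fetchall())

conn.close()
```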

\clearpage
\vfill

**Data Manipulation:**

Data manipulation is a critical aspect of database management, allowing users to modify existing data, add new data, or delete unwanted data. The key SQL commands for data manipulation are `INSERT INTO` for adding new records, `UPDATE` for modifying existing records, and `DELETE FROM` for removing records. These commands are powerful tools for maintaining and updating the content within a database, ensuring that the data remains current and accurate.
@@ -177,8 +185,8 @@ Data manipulation is a critical aspect of database management, allowing users to
\begin{tabularx}{\textwidth}{|>{\hsize=0.5\hsize}X|>{\hsize=0.8\hsize}X|>{\hsize=1.7\hsize}X|}
\hline\hline
\textbf{SQL Command} & \textbf{Purpose} & \textbf{Example} \\ \hline
INSERT INTO & Insert new records into a table & INSERT INTO iris (sepal\_length, sepal\_width) VALUES (6.3, 2.8) \\
UPDATE & Update existing records in a table & UPDATE iris SET petal\_length = 1.5 WHERE species = 'Setosa' \\
INSERT INTO & Insert new records into a table & INSERT INTO iris (slength, swidth) VALUES (6.3, 2.8) \\
UPDATE & Update existing records in a table & UPDATE iris SET plength = 1.5 WHERE species = 'Setosa' \\
DELETE FROM & Delete records from a \mbox{table} & DELETE FROM iris WHERE species = 'Versicolor' \\ \hline\hline
\end{tabularx}
\caption{Common SQL commands for modifying and managing data.}
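
As a minimal sketch, the same `sqlite3` connection from the retrieval example can issue the modification commands above; the database file and values are illustrative, and parameter placeholders (`?`) are used for the inserted values.

```python
import sqlite3

# Illustrative only: assumes the same SQLite database as in the
# data retrieval sketch above.
conn = sqlite3.connect("iris.db")
cur = conn.cursor()

# Insert a new record; columns not listed are left as NULL.
cur.execute("INSERT INTO iris (slength, swidth) VALUES (?, ?)", (6.3, 2.8))

# Update existing records that match a condition.
cur.execute("UPDATE iris SET plength = 1.5 WHERE species = 'Setosa'")

# Delete records that match a condition.
cur.execute("DELETE FROM iris WHERE species = 'Versicolor'")

# Persist the changes.
conn.commit()
conn.close()
```
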
@@ -188,6 +196,9 @@ DELETE FROM & Delete records from a \mbox{table} & DELETE FROM i

SQL provides robust functionality for aggregating data, which is essential for statistical analysis and generating meaningful insights from large datasets. Commands like `GROUP BY` enable grouping of data based on one or more columns, while `SUM`, `AVG`, `COUNT`, and other aggregation functions allow for the calculation of sums, averages, and counts. The `HAVING` clause can be used in conjunction with `GROUP BY` to filter groups based on specific conditions. These aggregation capabilities are crucial for summarizing data, facilitating complex analyses, and supporting decision-making processes.

\clearpage
\vfill

<!--
| SQL Command | Purpose | Example |
|-----------------|-----------------------------------------|-------------------------------------------------------------------------|
@@ -204,8 +215,8 @@ SQL provides robust functionality for aggregating data, which is essential for s
\textbf{SQL Command} & \textbf{Purpose} & \textbf{Example} \\ \hline
GROUP BY & Group rows by a \mbox{column(s)} & SELECT species, COUNT(*) FROM iris GROUP BY species \\
HAVING & Filter groups based on a condition & SELECT species, COUNT(*) FROM iris GROUP BY species HAVING COUNT(*) > 5 \\
SUM & Calculate the sum of a column & SELECT species, SUM(petal\_length) FROM iris GROUP BY species \\
AVG & Calculate the average of a column & SELECT species, AVG(sepal\_width) FROM iris GROUP BY species \\ \hline\hline
SUM & Calculate the sum of a column & SELECT species, SUM(plength) FROM iris GROUP BY species \\
AVG & Calculate the average of a column & SELECT species, AVG(swidth) FROM iris GROUP BY species \\ \hline\hline
\end{tabularx}
\caption{Common SQL commands for data aggregation and analysis.}
\end{table}
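
A brief sketch of these aggregation commands, again assuming the illustrative SQLite database used in the previous examples:

```python
import sqlite3

# Illustrative only: assumes the same SQLite database as above.
conn = sqlite3.connect("iris.db")

# Group rows by species and compute per-group aggregates,
# keeping only groups with more than five rows.
query = """
    SELECT species,
           COUNT(*)     AS n_rows,
           AVG(swidth)  AS mean_swidth,
           SUM(plength) AS total_plength
    FROM iris
    GROUP BY species
    HAVING COUNT(*) > 5
"""
for row in conn.execute(query):
    print(row)

conn.close()
```
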
12 changes: 12 additions & 0 deletions book/050_data_adquisition_and_preparation.md
@@ -191,6 +191,18 @@ Data \mbox{Validation} & pandas-schema & A Python library that enables the \mbox
\caption{Essential data preparation steps: From handling missing data to data transformation.}
\end{figure}

**Handling Missing Data**: Dealing with missing values by imputation, deletion, or interpolation methods to avoid biased or erroneous analyses. A brief pandas sketch covering several of these steps follows below.

**Outlier Detection**: Identifying and addressing outliers, which can significantly impact statistical measures and model predictions.

**Data Deduplication**: Identifying and removing duplicate entries to avoid duplication bias and ensure data integrity.

**Standardization and Formatting**: Converting data into a consistent format, ensuring uniformity and compatibility across variables.

**Data Validation and Verification**: Verifying the accuracy, completeness, and consistency of the data through various validation techniques.

**Data Transformation**: Converting data into a suitable format, such as scaling numerical variables or transforming categorical variables.
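
A minimal pandas sketch of several of these preparation steps, using a small hypothetical dataset (the column names and values are invented purely for illustration):

```python
import pandas as pd

# Hypothetical raw data used only to illustrate the preparation steps above.
df = pd.DataFrame({
    "age": [25, None, 31, 31, 120],
    "income": [32000, 45000, None, None, 52000],
    "city": ["Bilbao", " bilbao", "Madrid", "Madrid", "Sevilla"],
})

# Handling missing data: impute numeric columns with the median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Outlier detection: here, a simple rule-based check on plausible ages.
df = df[df["age"].between(0, 100)]

# Standardization and formatting: enforce a consistent text format.
df["city"] = df["city"].str.strip().str.title()

# Data deduplication: remove exact duplicate rows.
df = df.drop_duplicates()

print(df)
```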

\hfill
\clearpage

56 changes: 5 additions & 51 deletions book/060_exploratory_data_analysis.md
@@ -329,61 +329,15 @@ Data transformation is a crucial step in the exploratory data analysis process.

Data transformation plays a vital role in preparing the data for analysis. It helps in achieving the following objectives:

* **Data Cleaning:** Transformation techniques help in handling missing values, outliers, and inconsistent data entries. By addressing these issues, we ensure the accuracy and reliability of our analysis.
* **Data Cleaning:** Transformation techniques help in handling missing values, outliers, and inconsistent data entries. By addressing these issues, we ensure the accuracy and reliability of our analysis. For data cleaning, libraries like **Pandas** in Python provide powerful data manipulation capabilities (more details on [Pandas website](https://pandas.pydata.org/)). In R, the **dplyr** library offers a set of functions tailored for data wrangling and manipulation tasks (learn more at [dplyr](https://dplyr.tidyverse.org/)).

* **Normalization:** Different variables in a dataset may have different scales, units, or ranges. Normalization techniques such as min-max scaling or z-score normalization bring all variables to a common scale, enabling fair comparisons and avoiding bias in subsequent analyses.
* **Normalization:** Different variables in a dataset may have different scales, units, or ranges. Normalization techniques such as min-max scaling or z-score normalization bring all variables to a common scale, enabling fair comparisons and avoiding bias in subsequent analyses. The **scikit-learn** library in Python includes various normalization techniques (see [scikit-learn](https://scikit-learn.org/)), while in R, **caret** provides pre-processing functions including normalization for building machine learning models (details at [caret](https://topepo.github.io/caret/)). A short scikit-learn sketch of these scaling techniques appears after this list.

* **Feature Engineering:** Transformation allows us to create new features or derive meaningful information from existing variables. This process involves extracting relevant information, creating interaction terms, or encoding categorical variables for better representation and predictive power.
* **Feature Engineering:** Transformation allows us to create new features or derive meaningful information from existing variables. This process involves extracting relevant information, creating interaction terms, or encoding categorical variables for better representation and predictive power. In Python, **Featuretools** is a library dedicated to automated feature engineering, enabling the generation of new features from existing data (visit [Featuretools](https://www.featuretools.com/)). For R users, **recipes** offers a framework to design custom feature transformation pipelines (more on [recipes](https://recipes.tidymodels.org/)).

* **Non-linearity Handling:** In some cases, relationships between variables may not be linear. Transforming variables using functions like logarithm, exponential, or power transformations can help capture non-linear patterns and improve model performance.
* **Non-linearity Handling:** In some cases, relationships between variables may not be linear. Transforming variables using functions like logarithm, exponential, or power transformations can help capture non-linear patterns and improve model performance. Python's **TensorFlow** library supports building and training complex non-linear models using neural networks (explore [TensorFlow](https://www.tensorflow.org/)), while **keras** in R provides high-level interfaces for neural networks with non-linear activation functions (find out more at [keras](https://keras.io/)).

* **Outlier Treatment:** Outliers can significantly impact the analysis and model performance. Transformations such as winsorization or logarithmic transformation can help reduce the influence of outliers without losing valuable information.

<!--
| **Purpose** | **Library Name** | **Description** | **Website** |
|-------------------|-----------------|-----------------|--------------|
| **Data Cleaning** | | | |
| | Pandas (Python) | A powerful data manipulation library for cleaning and preprocessing data. | [Pandas](https://pandas.pydata.org/) |
| | dplyr (R) | Provides a set of functions for data wrangling and data manipulation tasks. | [dplyr](https://dplyr.tidyverse.org/) |
| **Normalization** | | | |
| | scikit-learn (Python) | Offers various normalization techniques such as Min-Max scaling and Z-score normalization. | [scikit-learn](https://scikit-learn.org/) |
| | caret (R) | Provides pre-processing functions, including normalization, for building machine learning models. | [caret](https://topepo.github.io/caret/) |
| **Feature Engineering** | | | |
| | Featuretools (Python) | A library for automated feature engineering that can generate new features from existing ones. | [Featuretools](https://www.featuretools.com/) |
| | recipes (R) | Offers a framework for feature engineering, allowing users to create custom feature transformation pipelines. | [recipes](https://recipes.tidymodels.org/) |
| **Non-Linearity Handling** | | | |
| | TensorFlow (Python) | A deep learning library that supports building and training non-linear models using neural networks. | [TensorFlow](https://www.tensorflow.org/) |
| | keras (R) | Provides high-level interfaces for building and training neural networks with non-linear activation functions. | [keras](https://keras.io/) |
| **Outlier Treatment** | | | |
| | PyOD (Python) | A comprehensive library for outlier detection and removal using various algorithms and models. | [PyOD](https://pyod.readthedocs.io/) |
| | outliers (R) | Implements various methods for detecting and handling outliers in datasets. | [outliers](https://cran.r-project.org/web/packages/outliers/index.html) |
-->

\begin{table}[H]
\centering

\begin{tabularx}{\textwidth}{|>{\hsize=0.7\hsize}X|>{\hsize=0.7\hsize}X|>{\hsize=1.9\hsize}X|>{\hsize=0.7\hsize}X|}
\hline\hline
\textbf{Purpose} & \textbf{Library Name} & \textbf{Description} & \textbf{Website} \\
\hline
Data \mbox{Cleaning} & Pandas (Python) & A powerful data manipulation library for \mbox{cleaning} and preprocessing data. & \href{https://pandas.pydata.org/}{Pandas} \\
& dplyr (R) & Provides a set of functions for data wrangling and data manipulation tasks. & \href{https://dplyr.tidyverse.org/}{dplyr} \\
\hline
Normalization & scikit-learn (Python) & Offers various normalization techniques such as Min-Max scaling and Z-score normalization. & \href{https://scikit-learn.org/}{scikit-learn} \\
& caret (R) & Provides pre-processing functions, including normalization, for building machine learning models. & \href{https://topepo.github.io/caret/}{caret} \\
\hline
Feature \mbox{Engineering} & Featuretools (Python) & A library for automated feature engineering that can generate new features from existing ones. & \href{https://www.featuretools.com/}{Featuretools} \\
& recipes (R) & Offers a framework for feature engineering, \mbox{allowing} users to create custom feature \mbox{transformation} pipelines. & \href{https://recipes.tidymodels.org/}{recipes} \\
\hline
Non-Linearity Handling & TensorFlow (Python) & A deep learning library that supports building and training non-linear models using neural \mbox{networks}. & \href{https://www.tensorflow.org/}{TensorFlow} \\
& keras (R) & Provides high-level interfaces for building and training neural networks with non-linear \mbox{activation} functions. & \href{https://keras.io/}{keras} \\
\hline
Outlier Treatment & PyOD (Python) & A comprehensive library for outlier detection and removal using various algorithms and \mbox{models}. & \href{https://pyod.readthedocs.io/}{PyOD} \\
& outliers (R) & Implements various methods for detecting and handling outliers in datasets. & \href{https://cran.r-project.org/web/packages/outliers/index.html}{outliers} \\
\hline\hline
\end{tabularx}
\caption{Data preprocessing and machine learning libraries.}
\end{table}
* **Outlier Treatment:** Outliers can significantly impact the analysis and model performance. Transformations such as winsorization or logarithmic transformation can help reduce the influence of outliers without losing valuable information. **PyOD** in Python offers a comprehensive suite of tools for detecting and treating outliers using various algorithms and models (details at [PyOD](https://pyod.readthedocs.io/)).
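
To make the normalization and transformation ideas concrete, here is a minimal sketch with **scikit-learn** and **NumPy**; the feature matrix is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric feature matrix used only for illustration.
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.3, 2.8],
              [7.0, 3.2]])

# Min-max scaling: rescale each column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score normalization: zero mean and unit variance per column.
X_zscore = StandardScaler().fit_transform(X)

# Logarithmic transformation: a simple option for right-skewed,
# non-linear variables.
X_log = np.log1p(X)

print(X_minmax)
print(X_zscore)
print(X_log)
```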

\clearpage
\vfill