From 607a9e0fd186f0a8b41892a2559e4b82cf920a8e Mon Sep 17 00:00:00 2001 From: WHQWHQWHQ <117834283+WHQWHQWHQ@users.noreply.github.com> Date: Tue, 16 Jan 2024 05:48:37 +0800 Subject: [PATCH 1/7] Delete open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb --- .../data-engineering.ipynb | 506 ------------------ 1 file changed, 506 deletions(-) delete mode 100644 open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb diff --git a/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb b/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb deleted file mode 100644 index 07628dbab..000000000 --- a/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb +++ /dev/null @@ -1,506 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Data engineering\n", - "\n", - "This assignment focuses on techniques for cleaning and transforming the data to handle challenges of missing, inaccurate, or incomplete data. Please refer to [Machine Learning productionization - Data engineering](#data-engineering) to learn more.\n", - "\n", - "Fill `____` pieces of the below implementation in order to pass the assertions.\n", - "\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exploring dataset\n", - "\n", - "> **Learning goal**: By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n", - "\n", - "In order to explore this functionality, we will import the modefined version of Python scikit-learn library's iconic dataset **Iris**." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "from sklearn.datasets import load_iris\n", - "import math\n", - "\n", - "iris_df = pd.read_csv('../../assets/data/modefined_sklearn_iris_dataset.csv', index_col=0)\n", - "iris_df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To start off, print the summary of a DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "iris_df.____" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```{quizdown}\n", - "\n", - "## How many entries the Iris dataset has?\n", - "\n", - "> Please refer to the output of above cell. \n", - "\n", - "- [ ] 50\n", - "- [ ] 100\n", - "- [x] 150\n", - "- [ ] 200\n", - "\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, let's check the actual content of the `DataFrame`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# displying first 5 rows of our iris_df\n", - "iris_df____\n", - "\n", - "# in the first five rows, which one's spepal length is 5.0cm?\n", - "assert iris_df.iloc[____, 0] == 5.0" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Conversely, we can check the last few rows of the DataFrame." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# displying last 5 rows of our `iris_df`.\n", - "iris_df.____\n", - "\n", - "# in the last five rows, which one's spepal width is 2.5cm?\n", - "assert iris_df.iloc[____, 1] == 2.5" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Takeaway**: Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Dealing with missing data\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Missing data can cause inaccuracies as well as weak or biased results. Sometimes these can be resolved by a \"reload\" of the data, filling in the missing values with computation and code like Python, or simply just removing the value and corresponding data. There are numerous reasons for why data may be missing and the actions that are taken to resolve these missing values can be dependent on how and why they went missing in the first place.\n", - "\n", - "> **Learning goal**: By the end of this subsection, you should know how to replace or remove null values from DataFrames.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In pandas, the `isnull()` and `notnull()` methods are your primary methods for detecting null data. Both return Boolean masks over your data. 
We will be using numpy for NaN values:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "iris_isnull_df = iris_df.isnull()\n", - "\n", - "print(iris_isnull_df)\n", - "\n", - "# find one row with missing value\n", - "assert iris_isnull_df.iloc[____, ____] == True\n", - "assert math.isnan(iris_df.iloc[____, ____]) == True" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# get all the rows with missing data\n", - "iris_with_missing_value_df = iris_df____\n", - "\n", - "assert iris_with_missing_value_df.shape[0] == 16" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Dropping null values**: Beyond identifying missing values, pandas provides a convenient means `dropna` to remove null values from Series and DataFrames. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# remove all the rows with missing values\n", - "iris_with_dropna_on_row_df = iris_df.____\n", - "\n", - "assert iris_with_dropna_on_row_df.shape[0] == 134" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# remove all the columns with missing values\n", - "iris_with_dropna_on_column_df = iris_df.____\n", - "\n", - "assert iris_with_dropna_on_column_df.columns.shape[0] == 0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# remove all the rows with 2 missing values\n", - "iris_with_dropna_2_values_on_rows_df = iris_df.____\n", - "\n", - "assert iris_with_dropna_2_values_on_rows_df.shape[0] == 144\n", - "\n", - "# remove all the rows with 1 missing values\n", - "iris_with_dropna_1_values_on_rows_df = iris_df.____\n", - "\n", - "assert iris_with_dropna_1_values_on_rows_df.shape[0] == 147" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Filling null values**: Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the Series or DataFrame with the missing values replaced with one of your choosing. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# fll all the missing values with 0\n", - "iris_with_fillna_df = iris_df.____\n", - "\n", - "# get all the rows with missing data\n", - "iris_with_missing_value_after_fillna_df = iris_with_fillna_df____\n", - "\n", - "assert iris_with_missing_value_after_fillna_df.shape[0] == 0\n", - "assert iris_with_fillna_df.iloc[____, 3] == -1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# forward-fill null values, which is to use the last valid value to fill a null:\n", - "iris_with_fillna_forward_df = iris_df.____\n", - "\n", - "# get all the rows with missing data\n", - "iris_with_missing_value_after_fillna_forward_df = iris_with_fillna_forward_df____\n", - "\n", - "assert iris_with_missing_value_after_fillna_forward_df.shape[0] == 0\n", - "assert float(iris_with_fillna_forward_df.iloc[3, 3]) == 0.2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# back-fill null values, which is to use the next valid value to fill a null:\n", - "iris_with_fillna_back_df = iris_df.____\n", - "\n", - "# get all the rows with missing data\n", - "iris_with_missing_value_after_fillna_back_df = iris_with_fillna_back_df____\n", - "\n", - "assert iris_with_missing_value_after_fillna_back_df.shape[0] == 0\n", - "assert float(iris_with_fillna_back_df.iloc[3, 3]) == 0.1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Removing duplicate data\n", - "\n", - "Data that has more than one occurrence can produce inaccurate results and usually should be removed. This can be a common occurrence when joining two or more datasets together. 
However, there are instances where duplication in joined datasets contain pieces that can provide additional information and may need to be preserved.\n", - "\n", - "> **Learning goal**: By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.\n", - "\n", - "In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Identifying duplicates**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let's create another example DataFrame to see this in action." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "iris_isduplicated_df = iris_df.____\n", - "\n", - "print(iris_isduplicated_df)\n", - "\n", - "# find one row with duplicated value\n", - "assert iris_isduplicated_df.iloc[____, ____] == True" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Dropping duplicates**: `drop_duplicates` simply returns a copy of the data for which all of the duplicated values are False:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# remove all the rows with duplicated values\n", - "iris_with_drop_duplicates_on_df = iris_df.drop_duplicates()\n", - "\n", - "assert iris_with_drop_duplicates_on_df.shape[0] == 143" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Both `duplicated` and `drop_duplicates` default to consider all columns but you can specify that they examine only a subset of columns in your DataFrame:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": 
[ - "# remove all the rows with duplicated values on column 'petal width (cm)'\n", - "iris_with_drop_duplicates_on_column_df = iris_df.____\n", - "\n", - "assert iris_with_drop_duplicates_on_column_df.shape[0] == 27" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Handle inconsistent data\n", - "\n", - "Depending on the source, data can have inconsistencies in how it’s presented. This can cause problems in searching for and representing the value, where it’s seen within the dataset but is not properly represented in visualizations or query results. Common formatting problems involve resolving whitespace, dates, and data types. Resolving formatting issues is typically up to the people who are using the data. For example, standards on how dates and numbers are presented can differ by country.\n", - "\n", - "> **Learning goal**: By the end of this subsection, you should know how to handle the inconsistent data format in the DataFrame." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's cleaning up the **4th** column `petal width (cm)` to make sure there's no data entry inconsistencies in it. Firstly, we will use a convenient method `unique` from pandas to check the unique values of this column\n", - "\n", - "In pandas, the `unique` method is a convenient way to unique values based on a hash table:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "column_to_format = ____\n", - "column_to_format_unique = column_to_format.____\n", - "\n", - "print(column_to_format_unique)\n", - "\n", - "# find one row with duplicated value\n", - "assert column_to_format_unique.shape[0] == 27" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Regardless the `nan` value, you may find the numeric valus are in different precision. More specifically, `1.` or `1.5012` are not in the same precision as other numbers. 
We want to append tailing `0` to numbers like `1.`, and round numbers like `1.5012` to `1.5`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# firstly, let's apply `round`` to the values to make the precision all as .1f\n", - "formatted_column = column_to_format.____\n", - "\n", - "print(formatted_column.unique())\n", - "\n", - "assert formatted_column.unique().shape[0] == 23" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# now, let's add tailing 0 if needed to make numbers like 1. to be 1.0. \n", - "# You may need to filter the nan value while processing.\n", - "formatted_column = formatted_column.____\n", - "\n", - "print(formatted_column.unique())\n", - "\n", - "assert formatted_column.unique().shape[0] == 23" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## At last\n", - "\n", - "Let's apply all the methods above to make the data to be clean." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# remove all rows with missing values\n", - "no_missing_data_df = iris_df.____\n", - "\n", - "# remove all rows with duplicated values\n", - "no_missing_dup_data_df = no_missing_data_df.____\n", - "\n", - "# apply the precision .1f to all the numbers\n", - "cleand_df = no_missing_dup_data_df.____\n", - "\n", - "assert no_missing_data_df.shape[0] == 134\n", - "assert no_missing_dup_data_df.shape[0] == 129\n", - "assert cleand_df[cleand_df.columns[3]].unique().shape[0] == 22" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Also, you could refer to below for more about how to handle data quality.\n", - "\n", - "- missing data - [pandas - Working with missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html)\n", - "- duplicate data - [pandas - Duplicate Labels](https://pandas.pydata.org/docs/user_guide/duplicates.html)\n", - "- outlier\n", - " - [Ways to Detect and Remove the Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)\n", - " - [Outlier!!! The Silent Killer](https://www.kaggle.com/code/nareshbhat/outlier-the-silent-killer)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Acknowledgments\n", - "\n", - "Thanks to Microsoft for creating the open source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It contributes some of the content in this chapter." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3.9.13 64-bit", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - }, - "vscode": { - "interpreter": { - "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} From 60550282e04a033b398f36db30f9b7eca7c723f5 Mon Sep 17 00:00:00 2001 From: WHQWHQWHQ <117834283+WHQWHQWHQ@users.noreply.github.com> Date: Tue, 16 Jan 2024 05:50:51 +0800 Subject: [PATCH 2/7] Delete open-machine-learning-jupyter-book/environment.yml --- .../environment.yml | 36 ------------------- 1 file changed, 36 deletions(-) delete mode 100644 open-machine-learning-jupyter-book/environment.yml diff --git a/open-machine-learning-jupyter-book/environment.yml b/open-machine-learning-jupyter-book/environment.yml deleted file mode 100644 index 7030b0e52..000000000 --- a/open-machine-learning-jupyter-book/environment.yml +++ /dev/null @@ -1,36 +0,0 @@ -name: open-machine-learning-jupyter-book -channels: - - conda-forge -dependencies: - - python=3.9 - - nbmake - - pip - - pip: - - sphinx==5.0 - - pytest==7.2.0 - - pandas==1.5.2 - - numpy==1.24.1 - - jsonschema==2.6.0 - - matplotlib==3.6.2 - - pywaffle==1.1.0 - - scikit-learn==1.2.0 - - scipy==1.10.0 - - seaborn==0.12.2 - - tensorflow==2.11.0 - - jupyter-book==0.13.1 - - notebook==6.5.2 - - xgboost==1.6.2 - - imblearn - - jupyter_contrib_nbextensions==0.7.0 - - sphinxcontrib-mermaid==0.7.1 - - sphinxcontrib-wavedrom==3.0.4 - - sphinxcontrib-plantuml==0.24.1 - - sphinxcontrib-tikz==0.4.16 - - sphinxcontrib-blockdiag==3.0.0 - - sphinxcontrib-drawio==0.0.16 - - git+https://github.com/innovationOUtside/ipython_magic_tikz.git - - 
git+https://github.com/bonartm/sphinxcontrib-quizdown.git - - tqdm - - fastai - - skl2onnx - From 97c04ff26bdb74981c04ab4f5cc60a60c9d25129 Mon Sep 17 00:00:00 2001 From: 296406598 <296406598@qq.com> Date: Mon, 4 Mar 2024 04:04:17 +0800 Subject: [PATCH 3/7] improve rnn.ipynb --- .../deep-learning/rnn.ipynb | 68 +++++++++++-------- 1 file changed, 40 insertions(+), 28 deletions(-) diff --git a/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb b/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb index 120aaebc4..7b77ca566 100644 --- a/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb +++ b/open-machine-learning-jupyter-book/deep-learning/rnn.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "4f92eda8", "metadata": { "tags": [ @@ -298,54 +298,66 @@ "Vocabulary Size: 8630\n", "80-20 Train Test split: 4459 -- 1115\n", "Epoch 1/20\n", - "15/15 [==============================] - 2s 41ms/step - loss: 0.6142 - accuracy: 0.7062 - val_loss: 0.4936 - val_accuracy: 0.8879\n", + "15/15 [==============================] - 4s 107ms/step - loss: 0.5348 - accuracy: 0.7951 - val_loss: 0.4411 - val_accuracy: 0.9025\n", "Epoch 2/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.4652 - accuracy: 0.8623 - val_loss: 0.3963 - val_accuracy: 0.9215\n", + "15/15 [==============================] - 0s 20ms/step - loss: 0.4310 - accuracy: 0.8814 - val_loss: 0.3498 - val_accuracy: 0.9540\n", "Epoch 3/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.3839 - accuracy: 0.8980 - val_loss: 0.3211 - val_accuracy: 0.9361\n", + "15/15 [==============================] - 0s 24ms/step - loss: 0.3542 - accuracy: 0.9277 - val_loss: 0.2880 - val_accuracy: 0.9664\n", "Epoch 4/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.3174 - accuracy: 0.9322 - val_loss: 0.2614 - val_accuracy: 0.9585\n", + "15/15 [==============================] - 1s 45ms/step - 
loss: 0.3058 - accuracy: 0.9445 - val_loss: 0.2443 - val_accuracy: 0.9675\n", "Epoch 5/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.2671 - accuracy: 0.9448 - val_loss: 0.2177 - val_accuracy: 0.9630\n", + "15/15 [==============================] - 1s 45ms/step - loss: 0.2615 - accuracy: 0.9588 - val_loss: 0.2178 - val_accuracy: 0.9630\n", "Epoch 6/20\n", - "15/15 [==============================] - 0s 15ms/step - loss: 0.2230 - accuracy: 0.9619 - val_loss: 0.1832 - val_accuracy: 0.9686\n", + "15/15 [==============================] - 1s 46ms/step - loss: 0.2295 - accuracy: 0.9610 - val_loss: 0.1918 - val_accuracy: 0.9686\n", "Epoch 7/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.1839 - accuracy: 0.9770 - val_loss: 0.1638 - val_accuracy: 0.9641\n", + "15/15 [==============================] - 1s 36ms/step - loss: 0.2096 - accuracy: 0.9703 - val_loss: 0.1826 - val_accuracy: 0.9608\n", "Epoch 8/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.1537 - accuracy: 0.9849 - val_loss: 0.1778 - val_accuracy: 0.9462\n", + "15/15 [==============================] - 0s 17ms/step - loss: 0.1889 - accuracy: 0.9692 - val_loss: 0.1677 - val_accuracy: 0.9652\n", "Epoch 9/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.1351 - accuracy: 0.9885 - val_loss: 0.1349 - val_accuracy: 0.9630\n", + "15/15 [==============================] - 0s 17ms/step - loss: 0.1708 - accuracy: 0.9725 - val_loss: 0.1676 - val_accuracy: 0.9574\n", "Epoch 10/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.1164 - accuracy: 0.9885 - val_loss: 0.1393 - val_accuracy: 0.9540\n", + "15/15 [==============================] - 0s 20ms/step - loss: 0.1553 - accuracy: 0.9779 - val_loss: 0.1753 - val_accuracy: 0.9507\n", "Epoch 11/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.1019 - accuracy: 0.9936 - val_loss: 0.1193 - val_accuracy: 0.9574\n", + "15/15 
[==============================] - 0s 33ms/step - loss: 0.1445 - accuracy: 0.9809 - val_loss: 0.1642 - val_accuracy: 0.9563\n", "Epoch 12/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0907 - accuracy: 0.9924 - val_loss: 0.1223 - val_accuracy: 0.9596\n", + "15/15 [==============================] - 1s 46ms/step - loss: 0.1305 - accuracy: 0.9840 - val_loss: 0.1633 - val_accuracy: 0.9574\n", "Epoch 13/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0807 - accuracy: 0.9947 - val_loss: 0.1254 - val_accuracy: 0.9574\n", + "15/15 [==============================] - 0s 32ms/step - loss: 0.1298 - accuracy: 0.9857 - val_loss: 0.1663 - val_accuracy: 0.9585\n", "Epoch 14/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0712 - accuracy: 0.9952 - val_loss: 0.1198 - val_accuracy: 0.9563\n", + "15/15 [==============================] - 0s 20ms/step - loss: 0.1245 - accuracy: 0.9843 - val_loss: 0.1730 - val_accuracy: 0.9518\n", "Epoch 15/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0657 - accuracy: 0.9952 - val_loss: 0.1182 - val_accuracy: 0.9608\n", + "15/15 [==============================] - 0s 25ms/step - loss: 0.1124 - accuracy: 0.9874 - val_loss: 0.1755 - val_accuracy: 0.9518\n", "Epoch 16/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0618 - accuracy: 0.9961 - val_loss: 0.1213 - val_accuracy: 0.9596\n", + "15/15 [==============================] - 1s 42ms/step - loss: 0.1072 - accuracy: 0.9868 - val_loss: 0.1609 - val_accuracy: 0.9585\n", "Epoch 17/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0541 - accuracy: 0.9972 - val_loss: 0.1225 - val_accuracy: 0.9596\n", + "15/15 [==============================] - 1s 40ms/step - loss: 0.1012 - accuracy: 0.9891 - val_loss: 0.1768 - val_accuracy: 0.9529\n", "Epoch 18/20\n", - "15/15 [==============================] - 0s 15ms/step - loss: 0.0467 - accuracy: 0.9978 - val_loss: 
0.1207 - val_accuracy: 0.9552\n", + "15/15 [==============================] - 1s 42ms/step - loss: 0.0920 - accuracy: 0.9882 - val_loss: 0.1894 - val_accuracy: 0.9496\n", "Epoch 19/20\n", - "15/15 [==============================] - 0s 16ms/step - loss: 0.0455 - accuracy: 0.9972 - val_loss: 0.1106 - val_accuracy: 0.9563\n", + "15/15 [==============================] - 0s 23ms/step - loss: 0.1018 - accuracy: 0.9863 - val_loss: 0.1943 - val_accuracy: 0.9484\n", "Epoch 20/20\n", - "15/15 [==============================] - 0s 17ms/step - loss: 0.0407 - accuracy: 0.9975 - val_loss: 0.1170 - val_accuracy: 0.9552\n" + "15/15 [==============================] - 0s 30ms/step - loss: 0.0914 - accuracy: 0.9893 - val_loss: 0.1985 - val_accuracy: 0.9496\n" ] }, { - "ename": "", - "evalue": "", - "output_type": "error", - "traceback": [ - "\u001b[1;31mThe Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details." 
- ] + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" } ], "source": [ @@ -459,7 +471,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.15" + "version": "3.9.13" } }, "nbformat": 4, From d935049662a8ccca657645b1e02bfa9e0bb7f2ed Mon Sep 17 00:00:00 2001 From: Lola-jo <120191238+Lola-jo@users.noreply.github.com> Date: Thu, 7 Mar 2024 14:21:13 +0800 Subject: [PATCH 4/7] Update _toc.yml From c942ec6cbbdcaedf2de360666faf84e1df1c27f2 Mon Sep 17 00:00:00 2001 From: Lola-jo <120191238+Lola-jo@users.noreply.github.com> Date: Thu, 7 Mar 2024 14:22:07 +0800 Subject: [PATCH 5/7] Update environment.yml --- open-machine-learning-jupyter-book/environment.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/open-machine-learning-jupyter-book/environment.yml b/open-machine-learning-jupyter-book/environment.yml index b86bda3fd..d3be367b8 100644 --- a/open-machine-learning-jupyter-book/environment.yml +++ b/open-machine-learning-jupyter-book/environment.yml @@ -49,4 +49,4 @@ dependencies: - git+https://github.com/innovationOUtside/ipython_magic_tikz.git - git+https://github.com/bonartm/sphinxcontrib-quizdown.git - tqdm - - fastai \ No newline at end of file + - fastai From 0fa86e34c504943c2552ad341fea072f7d2d85f5 Mon Sep 17 00:00:00 2001 From: Lola-jo <120191238+Lola-jo@users.noreply.github.com> Date: Thu, 7 Mar 2024 14:23:58 +0800 Subject: [PATCH 6/7] Update data-engineering.ipynb --- .../data-engineering.ipynb | 4403 ++--------------- 1 file changed, 506 insertions(+), 3897 deletions(-) diff --git a/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb b/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb index 
e910af9ec..ba5a478a8 100644 --- a/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb +++ b/open-machine-learning-jupyter-book/assignments/machine-learning-productionization/data-engineering.ipynb @@ -1,3897 +1,506 @@
+{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data engineering\n", + "\n", + "This assignment focuses on techniques for cleaning and transforming the data to handle challenges of missing, inaccurate, or incomplete data.
Please refer to [Machine Learning productionization - Data engineering](#data-engineering) to learn more.\n", + "\n", + "Fill in the `____` placeholders in the implementation below so that the assertions pass.\n", + "\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exploring dataset\n", + "\n", + "> **Learning goal**: By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n", + "\n", + "In order to explore this functionality, we will import a modified version of the iconic **Iris** dataset from the Python scikit-learn library." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from sklearn.datasets import load_iris\n", + "import math\n", + "\n", + "iris_df = pd.read_csv('../../assets/data/modefined_sklearn_iris_dataset.csv', index_col=0)\n", + "iris_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To start off, print the summary of the DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "iris_df.____" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```{quizdown}\n", + "\n", + "## How many entries does the Iris dataset have?\n", + "\n", + "> Please refer to the output of the cell above.\n", + "\n", + "- [ ] 50\n", + "- [ ] 100\n", + "- [x] 150\n", + "- [ ] 200\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's check the actual content of the `DataFrame`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# displaying the first 5 rows of our iris_df\n", + "iris_df____\n", + "\n", + "# in the first five rows, which row has a sepal length of 5.0 cm?\n", + "assert iris_df.iloc[____, 0] == 5.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Conversely, we can check the last few rows of the DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# displaying the last 5 rows of our `iris_df`.\n", + "iris_df.____\n", + "\n", + "# in the last five rows, which row has a sepal width of 2.5 cm?\n", + "assert iris_df.iloc[____, 1] == 2.5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Takeaway**: Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dealing with missing data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Missing data can cause inaccuracies as well as weak or biased results. Sometimes this can be resolved by a \"reload\" of the data, by filling in the missing values computationally (for example, with Python code), or simply by removing the value and its corresponding data. There are numerous reasons why data may be missing, and the actions taken to resolve missing values can depend on how and why they went missing in the first place.\n", + "\n", + "> **Learning goal**: By the end of this subsection, you should know how to replace or remove null values from DataFrames.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In pandas, the `isnull()` and `notnull()` methods are the primary tools for detecting null data.
Both return Boolean masks over your data. Missing values are represented as `NaN`, which the assertions below check with `math.isnan`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "iris_isnull_df = iris_df.isnull()\n", + "\n", + "print(iris_isnull_df)\n", + "\n", + "# find one cell with a missing value\n", + "assert iris_isnull_df.iloc[____, ____] == True\n", + "assert math.isnan(iris_df.iloc[____, ____]) == True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# get all the rows with missing data\n", + "iris_with_missing_value_df = iris_df____\n", + "\n", + "assert iris_with_missing_value_df.shape[0] == 16" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Dropping null values**: Beyond identifying missing values, pandas provides a convenient method, `dropna`, to remove null values from Series and DataFrames. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.)
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# remove all the rows with missing values\n", + "iris_with_dropna_on_row_df = iris_df.____\n", + "\n", + "assert iris_with_dropna_on_row_df.shape[0] == 134" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# remove all the columns with missing values\n", + "iris_with_dropna_on_column_df = iris_df.____\n", + "\n", + "assert iris_with_dropna_on_column_df.columns.shape[0] == 0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# remove all the rows with 2 missing values\n", + "iris_with_dropna_2_values_on_rows_df = iris_df.____\n", + "\n", + "assert iris_with_dropna_2_values_on_rows_df.shape[0] == 144\n", + "\n", + "# remove all the rows with 1 missing values\n", + "iris_with_dropna_1_values_on_rows_df = iris_df.____\n", + "\n", + "assert iris_with_dropna_1_values_on_rows_df.shape[0] == 147" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Filling null values**: Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the Series or DataFrame with the missing values replaced with one of your choosing. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# fill all the missing values with -1\n", + "iris_with_fillna_df = iris_df.____\n", + "\n", + "# get all the rows with missing data\n", + "iris_with_missing_value_after_fillna_df = iris_with_fillna_df____\n", + "\n", + "assert iris_with_missing_value_after_fillna_df.shape[0] == 0\n", + "assert iris_with_fillna_df.iloc[____, 3] == -1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# forward-fill null values, which is to use the last valid value to fill a null:\n", + "iris_with_fillna_forward_df = iris_df.____\n", + "\n", + "# get all the rows with missing data\n", + "iris_with_missing_value_after_fillna_forward_df = iris_with_fillna_forward_df____\n", + "\n", + "assert iris_with_missing_value_after_fillna_forward_df.shape[0] == 0\n", + "assert float(iris_with_fillna_forward_df.iloc[3, 3]) == 0.2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# back-fill null values, which is to use the next valid value to fill a null:\n", + "iris_with_fillna_back_df = iris_df.____\n", + "\n", + "# get all the rows with missing data\n", + "iris_with_missing_value_after_fillna_back_df = iris_with_fillna_back_df____\n", + "\n", + "assert iris_with_missing_value_after_fillna_back_df.shape[0] == 0\n", + "assert float(iris_with_fillna_back_df.iloc[3, 3]) == 0.1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Removing duplicate data\n", + "\n", + "Data that has more than one occurrence can produce inaccurate results and usually should be removed. This can be a common occurrence when joining two or more datasets together.
However, there are instances where duplicates in joined datasets contain pieces that provide additional information and may need to be preserved.\n", + "\n", + "> **Learning goal**: By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.\n", + "\n", + "In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Identifying duplicates**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let's apply it to `iris_df` to see this in action." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "iris_isduplicated_df = iris_df.____\n", + "\n", + "print(iris_isduplicated_df)\n", + "\n", + "# find one row with duplicated value\n", + "assert iris_isduplicated_df.iloc[____, ____] == True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Dropping duplicates**: `drop_duplicates` simply returns a copy of the data for which `duplicated` is False:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# remove all the rows with duplicated values\n", + "iris_with_drop_duplicates_on_df = iris_df.drop_duplicates()\n", + "\n", + "assert iris_with_drop_duplicates_on_df.shape[0] == 143" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both `duplicated` and `drop_duplicates` consider all columns by default, but you can specify that they examine only a subset of columns in your DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source":
[ + "# remove all the rows with duplicated values on column 'petal width (cm)'\n", + "iris_with_drop_duplicates_on_column_df = iris_df.____\n", + "\n", + "assert iris_with_drop_duplicates_on_column_df.shape[0] == 27" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Handling inconsistent data\n", + "\n", + "Depending on the source, data can have inconsistencies in how it’s presented. This can cause problems in searching for and representing the value, where it’s seen within the dataset but is not properly represented in visualizations or query results. Common formatting problems involve resolving whitespace, dates, and data types. Resolving formatting issues is typically up to the people who are using the data. For example, standards on how dates and numbers are presented can differ by country.\n", + "\n", + "> **Learning goal**: By the end of this subsection, you should know how to handle inconsistent data formats in a DataFrame." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's clean up the **4th** column `petal width (cm)` to make sure there are no data entry inconsistencies in it. First, we will use the convenient pandas method `unique` to check the unique values of this column.\n", + "\n", + "In pandas, the `unique` method returns the unique values of a Series, based on a hash table:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "column_to_format = ____\n", + "column_to_format_unique = column_to_format.____\n", + "\n", + "print(column_to_format_unique)\n", + "\n", + "# check the number of unique values\n", + "assert column_to_format_unique.shape[0] == 27" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Setting aside the `nan` value, you may find that the numeric values have different precision. More specifically, `1.` and `1.5012` do not have the same precision as the other numbers.
We want to append a trailing `0` to numbers like `1.`, and round numbers like `1.5012` to `1.5`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# first, let's apply `round` to the values so that they all have .1f precision\n", + "formatted_column = column_to_format.____\n", + "\n", + "print(formatted_column.unique())\n", + "\n", + "assert formatted_column.unique().shape[0] == 23" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# now, let's add a trailing 0 where needed so that numbers like 1. become 1.0.\n", + "# You may need to filter out the nan value while processing.\n", + "formatted_column = formatted_column.____\n", + "\n", + "print(formatted_column.unique())\n", + "\n", + "assert formatted_column.unique().shape[0] == 23" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Putting it all together\n", + "\n", + "Let's apply all the methods above to clean the data."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# remove all rows with missing values\n", + "no_missing_data_df = iris_df.____\n", + "\n", + "# remove all rows with duplicated values\n", + "no_missing_dup_data_df = no_missing_data_df.____\n", + "\n", + "# apply the precision .1f to all the numbers\n", + "cleand_df = no_missing_dup_data_df.____\n", + "\n", + "assert no_missing_data_df.shape[0] == 134\n", + "assert no_missing_dup_data_df.shape[0] == 129\n", + "assert cleand_df[cleand_df.columns[3]].unique().shape[0] == 22" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also refer to the resources below to learn more about handling data quality issues.\n", + "\n", + "- missing data - [pandas - Working with missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html)\n", + "- duplicate data - [pandas - Duplicate Labels](https://pandas.pydata.org/docs/user_guide/duplicates.html)\n", + "- outliers\n", + " - [Ways to Detect and Remove the Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)\n", + " - [Outlier!!! The Silent Killer](https://www.kaggle.com/code/nareshbhat/outlier-the-silent-killer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Acknowledgments\n", + "\n", + "Thanks to Microsoft for creating the open source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It contributes some of the content in this chapter."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.9.13 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.13 (main, May 24 2022, 21:28:31) \n[Clang 13.1.6 (clang-1316.0.21.2)]" + }, + "vscode": { + "interpreter": { + "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 3060f9c930ffc68dea454530e1fbda14d87f4f3e Mon Sep 17 00:00:00 2001 From: Lola-jo <120191238+Lola-jo@users.noreply.github.com> Date: Thu, 7 Mar 2024 14:24:57 +0800 Subject: [PATCH 7/7] Update data-engineering.ipynb