diff --git a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/advanced-pandas-techniques.ipynb b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/advanced-pandas-techniques.ipynb new file mode 100644 index 0000000000..9540ddf331 --- /dev/null +++ b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/advanced-pandas-techniques.ipynb @@ -0,0 +1,5078 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "73bc6d8a-5f93-4207-a34f-68f68f587837", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "source": [ + "---\n", + "jupytext:\n", + " cell_metadata_filter: -all\n", + " formats: md:myst\n", + " text_representation:\n", + " extension: .md\n", + " format_name: myst\n", + " format_version: 0.13\n", + " jupytext_version: 1.11.5\n", + "kernelspec:\n", + " display_name: Python 3\n", + " language: python\n", + " name: python3" + ] + }, + { + "cell_type": "markdown", + "id": "aa35406e-c73d-49f1-aa84-5cc5ced6c294", + "metadata": {}, + "source": [ + "# Advanced Pandas Techniques" + ] + }, + { + "cell_type": "markdown", + "id": "64ccbdd4-f4c6-4893-80e9-7d87595020d2", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "In this section, we continue with combining datasets using `concat`, `merge` and `join`, and then cover data aggregation and grouping." 
+ ] + }, + { + "cell_type": "markdown", + "id": "87f36a89-87fd-400d-894e-9332245bc9e5", + "metadata": {}, + "source": [ + "Import NumPy and load Pandas into your namespace:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "b221a566-8a04-4689-8eb1-c266ede5a264", + "metadata": {}, + "outputs": [], + "source": [ + "# Install the necessary dependencies\n", + "import os\n", + "import sys\n", + "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "9edbb0f3", + "metadata": {}, + "source": [ + "## Combining datasets: concat, merge and join\n", + "\n", + "### concat\n", + "\n", + "- Concatenate Pandas objects along a particular axis.\n", + "\n", + "- Allows optional set logic along the other axes.\n", + "\n", + "- Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.\n", + "\n", + "For example:\n", + "\n", + "Combine two `Series`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "b08dcc94", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 a\n", + "1 b\n", + "0 c\n", + "1 d\n", + "dtype: object" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1 = pd.Series(['a', 'b'])\n", + "s2 = pd.Series(['c', 'd'])\n", + "pd.concat([s1, s2])" + ] + }, + { + "cell_type": "markdown", + "id": "b1c47e7c", + "metadata": {}, + "source": [ + "Clear the existing index and reset it in the result by setting the `ignore_index` option to `True`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "32049abb", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 a\n", + "1 b\n", + "2 c\n", + "3 d\n", + "dtype: object" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([s1, s2], ignore_index=True)" + ] + }, + { + "cell_type": "markdown", + "id": "31f73f90", + "metadata": {}, + "source": [ + "Add a hierarchical index at the outermost level of the data with the `keys` option." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "d5b95507", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "s1 0 a\n", + " 1 b\n", + "s2 0 c\n", + " 1 d\n", + "dtype: object" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([s1, s2], keys=['s1', 's2'])" + ] + }, + { + "cell_type": "markdown", + "id": "9c618012", + "metadata": {}, + "source": [ + "Label the index keys you create with the `names` option." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "6d54830d", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Series name Row ID\n", + "s1 0 a\n", + " 1 b\n", + "s2 0 c\n", + " 1 d\n", + "dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([s1, s2], keys=['s1', 's2'],\n", + " names=['Series name', 'Row ID'])" + ] + }, + { + "cell_type": "markdown", + "id": "31fac69f", + "metadata": {}, + "source": [ + "Combine two `DataFrame` objects with identical columns." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "fec72294", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumber
0a1
1b2
\n", + "
" + ], + "text/plain": [ + " letter number\n", + "0 a 1\n", + "1 b 2" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 = pd.DataFrame([['a', 1], ['b', 2]],\n", + " columns=['letter', 'number'])\n", + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "80a1f5b0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumber
0c3
1d4
\n", + "
" + ], + "text/plain": [ + " letter number\n", + "0 c 3\n", + "1 d 4" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2 = pd.DataFrame([['c', 3], ['d', 4]],\n", + " columns=['letter', 'number'])\n", + "df2" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "4e9e65f6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumber
0a1
1b2
0c3
1d4
\n", + "
" + ], + "text/plain": [ + " letter number\n", + "0 a 1\n", + "1 b 2\n", + "0 c 3\n", + "1 d 4" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df1, df2])" + ] + }, + { + "cell_type": "markdown", + "id": "49d878b5", + "metadata": {}, + "source": [ + "Combine `DataFrame` objects with overlapping columns and return everything. Columns outside the intersection will be filled with `NaN` values." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "f50e8ede", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumberanimal
0c3cat
1d4dog
\n", + "
" + ], + "text/plain": [ + " letter number animal\n", + "0 c 3 cat\n", + "1 d 4 dog" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],\n", + " columns=['letter', 'number', 'animal'])\n", + "df3" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "9def1cdd", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumberanimal
0a1NaN
1b2NaN
0c3cat
1d4dog
\n", + "
" + ], + "text/plain": [ + " letter number animal\n", + "0 a 1 NaN\n", + "1 b 2 NaN\n", + "0 c 3 cat\n", + "1 d 4 dog" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df1, df3], sort=False)" + ] + }, + { + "cell_type": "markdown", + "id": "6f2fcb0c", + "metadata": {}, + "source": [ + "Combine `DataFrame` objects with overlapping columns and return only the shared columns by passing `inner` to the `join` keyword argument." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "ef69d51c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumber
0a1
1b2
0c3
1d4
\n", + "
" + ], + "text/plain": [ + " letter number\n", + "0 a 1\n", + "1 b 2\n", + "0 c 3\n", + "1 d 4" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df1, df3], join=\"inner\")" + ] + }, + { + "cell_type": "markdown", + "id": "0fda5cf5", + "metadata": {}, + "source": [ + "Combine `DataFrame` objects horizontally, side by side along the column axis, by passing `axis=1`." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "2159161d", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
letternumberanimalname
0a1birdpolly
1b2monkeygeorge
\n", + "
" + ], + "text/plain": [ + " letter number animal name\n", + "0 a 1 bird polly\n", + "1 b 2 monkey george" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],\n", + " columns=['animal', 'name'])\n", + "pd.concat([df1, df4], axis=1)" + ] + }, + { + "cell_type": "markdown", + "id": "adb11ea6", + "metadata": {}, + "source": [ + "Prevent the result from including duplicate index values with the `verify_integrity` option." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "45bea28a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
a1
\n", + "
" + ], + "text/plain": [ + " 0\n", + "a 1" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df5 = pd.DataFrame([1], index=['a'])\n", + "df5" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "db871526", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
a2
\n", + "
" + ], + "text/plain": [ + " 0\n", + "a 2" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df6 = pd.DataFrame([2], index=['a'])\n", + "df6" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "1ab6b3b0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "ValueError", + "evalue": "Indexes have overlapping values: Index(['a'], dtype='object')", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[15], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mconcat\u001b[49m\u001b[43m(\u001b[49m\u001b[43m[\u001b[49m\u001b[43mdf5\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdf6\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mverify_integrity\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\concat.py:393\u001b[0m, in \u001b[0;36mconcat\u001b[1;34m(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)\u001b[0m\n\u001b[0;32m 378\u001b[0m copy \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[0;32m 380\u001b[0m op \u001b[38;5;241m=\u001b[39m _Concatenator(\n\u001b[0;32m 381\u001b[0m objs,\n\u001b[0;32m 382\u001b[0m axis\u001b[38;5;241m=\u001b[39maxis,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 390\u001b[0m sort\u001b[38;5;241m=\u001b[39msort,\n\u001b[0;32m 391\u001b[0m )\n\u001b[1;32m--> 393\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mop\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_result\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\concat.py:667\u001b[0m, in \u001b[0;36m_Concatenator.get_result\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 665\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m obj \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobjs:\n\u001b[0;32m 666\u001b[0m indexers \u001b[38;5;241m=\u001b[39m {}\n\u001b[1;32m--> 667\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m ax, new_labels \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28menumerate\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mnew_axes\u001b[49m):\n\u001b[0;32m 668\u001b[0m \u001b[38;5;66;03m# ::-1 to convert BlockManager ax to DataFrame ax\u001b[39;00m\n\u001b[0;32m 669\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m ax \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbm_axis:\n\u001b[0;32m 670\u001b[0m \u001b[38;5;66;03m# Suppress reindexing on concat axis\u001b[39;00m\n\u001b[0;32m 671\u001b[0m \u001b[38;5;28;01mcontinue\u001b[39;00m\n", + "File \u001b[1;32mproperties.pyx:36\u001b[0m, in \u001b[0;36mpandas._libs.properties.CachedProperty.__get__\u001b[1;34m()\u001b[0m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\concat.py:698\u001b[0m, in \u001b[0;36m_Concatenator.new_axes\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 695\u001b[0m \u001b[38;5;129m@cache_readonly\u001b[39m\n\u001b[0;32m 696\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mnew_axes\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mlist\u001b[39m[Index]:\n\u001b[0;32m 697\u001b[0m ndim \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_result_dim()\n\u001b[1;32m--> 698\u001b[0m 
\u001b[38;5;28;01mreturn\u001b[39;00m [\n\u001b[0;32m 699\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_concat_axis \u001b[38;5;28;01mif\u001b[39;00m i \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbm_axis \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_comb_axis(i)\n\u001b[0;32m 700\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(ndim)\n\u001b[0;32m 701\u001b[0m ]\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\concat.py:699\u001b[0m, in \u001b[0;36m\u001b[1;34m(.0)\u001b[0m\n\u001b[0;32m 695\u001b[0m \u001b[38;5;129m@cache_readonly\u001b[39m\n\u001b[0;32m 696\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mnew_axes\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mlist\u001b[39m[Index]:\n\u001b[0;32m 697\u001b[0m ndim \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_result_dim()\n\u001b[0;32m 698\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m [\n\u001b[1;32m--> 699\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_concat_axis\u001b[49m \u001b[38;5;28;01mif\u001b[39;00m i \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbm_axis \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_comb_axis(i)\n\u001b[0;32m 700\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(ndim)\n\u001b[0;32m 701\u001b[0m ]\n", + "File \u001b[1;32mproperties.pyx:36\u001b[0m, in \u001b[0;36mpandas._libs.properties.CachedProperty.__get__\u001b[1;34m()\u001b[0m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\concat.py:762\u001b[0m, in 
\u001b[0;36m_Concatenator._get_concat_axis\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 757\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 758\u001b[0m concat_axis \u001b[38;5;241m=\u001b[39m _make_concat_multiindex(\n\u001b[0;32m 759\u001b[0m indexes, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mkeys, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlevels, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnames\n\u001b[0;32m 760\u001b[0m )\n\u001b[1;32m--> 762\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_maybe_check_integrity\u001b[49m\u001b[43m(\u001b[49m\u001b[43mconcat_axis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 764\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m concat_axis\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\concat.py:770\u001b[0m, in \u001b[0;36m_Concatenator._maybe_check_integrity\u001b[1;34m(self, concat_index)\u001b[0m\n\u001b[0;32m 768\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m concat_index\u001b[38;5;241m.\u001b[39mis_unique:\n\u001b[0;32m 769\u001b[0m overlap \u001b[38;5;241m=\u001b[39m concat_index[concat_index\u001b[38;5;241m.\u001b[39mduplicated()]\u001b[38;5;241m.\u001b[39munique()\n\u001b[1;32m--> 770\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIndexes have overlapping values: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00moverlap\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", + "\u001b[1;31mValueError\u001b[0m: Indexes have overlapping values: Index(['a'], dtype='object')" + ] + } + ], + "source": [ + "pd.concat([df5, df6], verify_integrity=True)" + ] + }, + { + "cell_type": "markdown", + "id": "90fc36d4", + "metadata": {}, + "source": [ + "Append a single row to the end of a `DataFrame` object." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "007c1ed6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
012
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 2" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])\n", + "df7" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "9dbaddff", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 3\n", + "b 4\n", + "dtype: int64" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_row = pd.Series({'a': 3, 'b': 4})\n", + "new_row" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "ad2d1313", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
012
134
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 2\n", + "1 3 4" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df7, new_row.to_frame().T], ignore_index=True)" + ] + }, + { + "cell_type": "markdown", + "id": "39223d1c", + "metadata": {}, + "source": [ + ":::{note}\n", + "`DataFrame.append()` was deprecated in pandas 1.4.0 and removed in pandas 2.0. Use `concat()` instead.\n", + ":::\n", + "\n", + "### merge\n", + "\n", + "- Merge `DataFrame` or named `Series` objects with a database-style join.\n", + "\n", + "- A named `Series` object is treated as a `DataFrame` with a single named column.\n", + "\n", + "- The join is done on columns or indexes. If joining columns on columns, the `DataFrame` indexes will be ignored. Otherwise, if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross-merge, no column specifications to merge on are allowed." + ] + }, + { + "cell_type": "markdown", + "id": "c1afc536-2209-4fa1-8d63-0b19c18c66c6", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results." + ] + }, + { + "cell_type": "markdown", + "id": "0f2ffec1", + "metadata": {}, + "source": [ + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "e223179b", + "metadata": {}, + "outputs": [], + "source": [ + "df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n", + " 'value': [1, 2, 3, 5]})\n", + "df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n", + " 'value': [5, 6, 7, 8]})" + ] + }, + { + "cell_type": "markdown", + "id": "ee9441ec", + "metadata": {}, + "source": [ + "Merge DataFrames `df1` and `df2` with specified left and right suffixes appended to any overlapping columns." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "e22da8fc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
lkeyvalue_leftrkeyvalue_right
0foo1foo5
1foo1foo8
2foo5foo5
3foo5foo8
4bar2bar6
5baz3baz7
\n", + "
" + ], + "text/plain": [ + " lkey value_left rkey value_right\n", + "0 foo 1 foo 5\n", + "1 foo 1 foo 8\n", + "2 foo 5 foo 5\n", + "3 foo 5 foo 8\n", + "4 bar 2 bar 6\n", + "5 baz 3 baz 7" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "6147bab8-4644-4a23-ba71-205573a1c3f9", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "5112fc3a", + "metadata": {}, + "source": [ + "\n", + "Merge DataFrames `df1` and `df2`, but raise an exception if the DataFrames have any overlapping columns." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "3dea68f6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "ValueError", + "evalue": "columns overlap but no suffix specified: Index(['value'], dtype='object')", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[22], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mdf1\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmerge\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdf2\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mleft_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mlkey\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mright_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mrkey\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43msuffixes\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\frame.py:10490\u001b[0m, in \u001b[0;36mDataFrame.merge\u001b[1;34m(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, 
validate)\u001b[0m\n\u001b[0;32m 10471\u001b[0m \u001b[38;5;129m@Substitution\u001b[39m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 10472\u001b[0m \u001b[38;5;129m@Appender\u001b[39m(_merge_doc, indents\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m2\u001b[39m)\n\u001b[0;32m 10473\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mmerge\u001b[39m(\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 10486\u001b[0m validate: MergeValidate \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[0;32m 10487\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame:\n\u001b[0;32m 10488\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mcore\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mreshape\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmerge\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m merge\n\u001b[1;32m> 10490\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mmerge\u001b[49m\u001b[43m(\u001b[49m\n\u001b[0;32m 10491\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10492\u001b[0m \u001b[43m \u001b[49m\u001b[43mright\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10493\u001b[0m \u001b[43m \u001b[49m\u001b[43mhow\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mhow\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10494\u001b[0m \u001b[43m \u001b[49m\u001b[43mon\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mon\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10495\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mleft_on\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10496\u001b[0m \u001b[43m \u001b[49m\u001b[43mright_on\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mright_on\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 
10497\u001b[0m \u001b[43m \u001b[49m\u001b[43mleft_index\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mleft_index\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10498\u001b[0m \u001b[43m \u001b[49m\u001b[43mright_index\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mright_index\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10499\u001b[0m \u001b[43m \u001b[49m\u001b[43msort\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msort\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10500\u001b[0m \u001b[43m \u001b[49m\u001b[43msuffixes\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43msuffixes\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10501\u001b[0m \u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10502\u001b[0m \u001b[43m \u001b[49m\u001b[43mindicator\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindicator\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10503\u001b[0m \u001b[43m \u001b[49m\u001b[43mvalidate\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mvalidate\u001b[49m\u001b[43m,\u001b[49m\n\u001b[0;32m 10504\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\merge.py:183\u001b[0m, in \u001b[0;36mmerge\u001b[1;34m(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)\u001b[0m\n\u001b[0;32m 168\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 169\u001b[0m op \u001b[38;5;241m=\u001b[39m _MergeOperation(\n\u001b[0;32m 170\u001b[0m left_df,\n\u001b[0;32m 171\u001b[0m right_df,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 181\u001b[0m validate\u001b[38;5;241m=\u001b[39mvalidate,\n\u001b[0;32m 182\u001b[0m )\n\u001b[1;32m--> 183\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mop\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_result\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\merge.py:885\u001b[0m, in \u001b[0;36m_MergeOperation.get_result\u001b[1;34m(self, copy)\u001b[0m\n\u001b[0;32m 881\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mleft, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mright \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_indicator_pre_merge(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mleft, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mright)\n\u001b[0;32m 883\u001b[0m join_index, left_indexer, right_indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_join_info()\n\u001b[1;32m--> 885\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_reindex_and_concat\u001b[49m\u001b[43m(\u001b[49m\n\u001b[0;32m 886\u001b[0m \u001b[43m \u001b[49m\u001b[43mjoin_index\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mleft_indexer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mright_indexer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcopy\u001b[49m\n\u001b[0;32m 887\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 888\u001b[0m result \u001b[38;5;241m=\u001b[39m result\u001b[38;5;241m.\u001b[39m__finalize__(\u001b[38;5;28mself\u001b[39m, method\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_merge_type)\n\u001b[0;32m 890\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindicator:\n", + "File 
\u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\merge.py:837\u001b[0m, in \u001b[0;36m_MergeOperation._reindex_and_concat\u001b[1;34m(self, join_index, left_indexer, right_indexer, copy)\u001b[0m\n\u001b[0;32m 834\u001b[0m left \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mleft[:]\n\u001b[0;32m 835\u001b[0m right \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mright[:]\n\u001b[1;32m--> 837\u001b[0m llabels, rlabels \u001b[38;5;241m=\u001b[39m \u001b[43m_items_overlap_with_suffix\u001b[49m\u001b[43m(\u001b[49m\n\u001b[0;32m 838\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mleft\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_info_axis\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mright\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_info_axis\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43msuffixes\u001b[49m\n\u001b[0;32m 839\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 841\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m left_indexer \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_range_indexer(left_indexer, \u001b[38;5;28mlen\u001b[39m(left)):\n\u001b[0;32m 842\u001b[0m \u001b[38;5;66;03m# Pinning the index here (and in the right code just below) is not\u001b[39;00m\n\u001b[0;32m 843\u001b[0m \u001b[38;5;66;03m# necessary, but makes the `.take` more performant if we have e.g.\u001b[39;00m\n\u001b[0;32m 844\u001b[0m \u001b[38;5;66;03m# a MultiIndex for left.index.\u001b[39;00m\n\u001b[0;32m 845\u001b[0m lmgr \u001b[38;5;241m=\u001b[39m 
left\u001b[38;5;241m.\u001b[39m_mgr\u001b[38;5;241m.\u001b[39mreindex_indexer(\n\u001b[0;32m 846\u001b[0m join_index,\n\u001b[0;32m 847\u001b[0m left_indexer,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 852\u001b[0m use_na_proxy\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m,\n\u001b[0;32m 853\u001b[0m )\n", + "File \u001b[1;32mF:\\anaconda\\envs\\py39\\lib\\site-packages\\pandas\\core\\reshape\\merge.py:2655\u001b[0m, in \u001b[0;36m_items_overlap_with_suffix\u001b[1;34m(left, right, suffixes)\u001b[0m\n\u001b[0;32m 2652\u001b[0m lsuffix, rsuffix \u001b[38;5;241m=\u001b[39m suffixes\n\u001b[0;32m 2654\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m lsuffix \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m rsuffix:\n\u001b[1;32m-> 2655\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcolumns overlap but no suffix specified: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mto_rename\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 2657\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mrenamer\u001b[39m(x, suffix: \u001b[38;5;28mstr\u001b[39m \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m):\n\u001b[0;32m 2658\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 2659\u001b[0m \u001b[38;5;124;03m Rename the left and right indices.\u001b[39;00m\n\u001b[0;32m 2660\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 2671\u001b[0m \u001b[38;5;124;03m x : renamed column name\u001b[39;00m\n\u001b[0;32m 2672\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n", + "\u001b[1;31mValueError\u001b[0m: columns overlap but no suffix specified: Index(['value'], dtype='object')" + ] + } + ], + "source": [ + "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))" + ] + }, + { + "cell_type": "markdown", + "id": 
"86efca65", + "metadata": {}, + "source": [ + "Using `how` parameter decide the type of merge to be performed." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "1026fc27", + "metadata": {}, + "outputs": [], + "source": [ + "df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n", + "df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "b4379cb1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
0foo13
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 foo 1 3" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.merge(df2, how='inner', on='a')" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "90916930-6a8e-40e3-871e-d0043aae93d8", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "2a8bb3d7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
0foo13.0
1bar2NaN
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 foo 1 3.0\n", + "1 bar 2 NaN" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.merge(df2, how='left', on='a')" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "467da7f9-a710-442e-9fcf-afb4990ea3b0", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "8951b7b9", + "metadata": {}, + "outputs": [], + "source": [ + "df1 = pd.DataFrame({'left': ['foo', 'bar']})\n", + "df2 = pd.DataFrame({'right': [7, 8]})" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "93051401", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
leftright
0foo7
1foo8
2bar7
3bar8
\n", + "
" + ], + "text/plain": [ + " left right\n", + "0 foo 7\n", + "1 foo 8\n", + "2 bar 7\n", + "3 bar 8" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.merge(df2, how='cross')" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "bc243059-83f7-485c-bcd0-453d611c3d1f", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b58237c9", + "metadata": {}, + "source": [ + "\n", + "### join\n", + "\n", + "- Join the columns of another DataFrame.\n", + "\n", + "- Join columns with another DataFrame, either on the index or on a key column. To join multiple DataFrame objects by index efficiently, pass them as a list.\n", + "\n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "5ad178d6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],\n", + " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "ff1aa936", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],\n", + " 'B': ['B0', 'B1', 'B2']}) " + ] + }, + { + "cell_type": "markdown", + "id": "3278bb56", + "metadata": {}, + "source": [ + "Join DataFrames using their indexes." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "a2517b83", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
key_callerAkey_otherB
0K0A0K0B0
1K1A1K1B1
2K2A2K2B2
3K3A3NaNNaN
4K4A4NaNNaN
5K5A5NaNNaN
\n", + "
" + ], + "text/plain": [ + " key_caller A key_other B\n", + "0 K0 A0 K0 B0\n", + "1 K1 A1 K1 B1\n", + "2 K2 A2 K2 B2\n", + "3 K3 A3 NaN NaN\n", + "4 K4 A4 NaN NaN\n", + "5 K5 A5 NaN NaN" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.join(other, lsuffix='_caller', rsuffix='_other')" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "81738ab5-bc94-4264-bb43-8c64c041c332", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "59935609", + "metadata": {}, + "source": [ + "\n", + "If we want to join using the `key` columns, we need to set `key` to be the index in both `df` and `other`. The joined DataFrame will have `key` as its index." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "91c6f0f0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
key
K0A0B0
K1A1B1
K2A2B2
K3A3NaN
K4A4NaN
K5A5NaN
\n", + "
" + ], + "text/plain": [ + " A B\n", + "key \n", + "K0 A0 B0\n", + "K1 A1 B1\n", + "K2 A2 B2\n", + "K3 A3 NaN\n", + "K4 A4 NaN\n", + "K5 A5 NaN" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.set_index('key').join(other.set_index('key'))" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "f942120e-c151-473d-aa0a-3ed6b0679204", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "1483f153", + "metadata": {}, + "source": [ + "\n", + "Another option to join using the key columns is to use the `on` parameter. `DataFrame.join` always uses `other`'s index but we can use any column in `df`. This method preserves the original DataFrame's index in the result." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "d8fbb1f7", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3NaN
4K4A4NaN
5K5A5NaN
\n", + "
" + ], + "text/plain": [ + " key A B\n", + "0 K0 A0 B0\n", + "1 K1 A1 B1\n", + "2 K2 A2 B2\n", + "3 K3 A3 NaN\n", + "4 K4 A4 NaN\n", + "5 K5 A5 NaN" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.join(other.set_index('key'), on='key')" + ] + }, + { + "cell_type": "markdown", + "id": "0ed06755", + "metadata": {}, + "source": [ + "Using non-unique key values shows how they are matched." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "b4d1eb0d", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
keyA
0K0A0
1K1A1
2K1A2
3K3A3
4K0A4
5K1A5
\n", + "
" + ], + "text/plain": [ + " key A\n", + "0 K0 A0\n", + "1 K1 A1\n", + "2 K1 A2\n", + "3 K3 A3\n", + "4 K0 A4\n", + "5 K1 A5" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],\n", + " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})\n", + "df " + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "7f6bc83d", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
keyAB
0K0A0B0
1K1A1B1
2K1A2B1
3K3A3NaN
4K0A4B0
5K1A5B1
\n", + "
" + ], + "text/plain": [ + " key A B\n", + "0 K0 A0 B0\n", + "1 K1 A1 B1\n", + "2 K1 A2 B1\n", + "3 K3 A3 NaN\n", + "4 K0 A4 B0\n", + "5 K1 A5 B1" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.join(other.set_index('key'), on='key', validate='m:1')" + ] + }, + { + "cell_type": "markdown", + "id": "61fb9627", + "metadata": {}, + "source": [ + "## Aggregation and grouping\n", + "\n", + "Group `DataFrame` using a mapper or by a `Series` of columns.\n", + "\n", + "A `groupby` operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.\n", + "\n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "38adb2b7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Max Speed
Animal
Falcon375.0
Parrot25.0
\n", + "
" + ], + "text/plain": [ + " Max Speed\n", + "Animal \n", + "Falcon 375.0\n", + "Parrot 25.0" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", + " 'Parrot', 'Parrot'],\n", + " 'Max Speed': [380., 370., 24., 26.]})\n", + "df\n", + "df.groupby(['Animal']).mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "917ba231-1ee4-4f2c-bcb9-4262d7eba119", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "84fe11db", + "metadata": {}, + "source": [ + "\n", + "### Hierarchical Indexes\n", + "\n", + "We can `groupby` different levels of a hierarchical index using the `level` parameter:" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "5e84fd8b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Max Speed
Animal
Falcon370.0
Parrot25.0
\n", + "
" + ], + "text/plain": [ + " Max Speed\n", + "Animal \n", + "Falcon 370.0\n", + "Parrot 25.0" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],\n", + " ['Captive', 'Wild', 'Captive', 'Wild']]\n", + "index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))\n", + "df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},\n", + " index=index)\n", + "df.groupby(level=0).mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "8d6ff678-1c1e-4629-9e06-1874511ecdf0", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "5a7a2d6a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Max Speed
Type
Captive210.0
Wild185.0
\n", + "
" + ], + "text/plain": [ + " Max Speed\n", + "Type \n", + "Captive 210.0\n", + "Wild 185.0" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby(level=\"Type\").mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "31f4c668-6a8b-4dba-a6db-29673e7fbdba", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "fe08b062", + "metadata": {}, + "source": [ + "\n", + "We can also choose whether to include NA values in the group keys by setting the `dropna` parameter, which defaults to `True` (NA keys are dropped)." + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "f27b6536", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ac
b
1.023
2.025
\n", + "
" + ], + "text/plain": [ + " a c\n", + "b \n", + "1.0 2 3\n", + "2.0 2 5" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]\n", + "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", + "df.groupby(by=[\"b\"]).sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "47261c15-1d74-4a39-a7bb-073f6835cbf8", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "815ba4c3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ac
b
1.023
2.025
NaN14
\n", + "
" + ], + "text/plain": [ + " a c\n", + "b \n", + "1.0 2 3\n", + "2.0 2 5\n", + "NaN 1 4" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby(by=[\"b\"], dropna=False).sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "17c93213-8bcf-4ac8-a30d-09df48b9ca71", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "719dc004", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
bc
a
a13.013.0
b12.3123.0
\n", + "
" + ], + "text/plain": [ + " b c\n", + "a \n", + "a 13.0 13.0\n", + "b 12.3 123.0" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "l = [[\"a\", 12, 12], [None, 12.3, 33.], [\"b\", 12.3, 123], [\"a\", 1, 1]]\n", + "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", + "df.groupby(by=\"a\").sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "ba2d22de-ed75-4d52-a6d8-badf4791429f", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "cce87c6a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
bc
a
a13.013.0
b12.3123.0
NaN12.333.0
\n", + "
" + ], + "text/plain": [ + " b c\n", + "a \n", + "a 13.0 13.0\n", + "b 12.3 123.0\n", + "NaN 12.3 33.0" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby(by=\"a\", dropna=False).sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "70cc2217-577e-4b8c-8fc2-ce02f036622b", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "6988f12c", + "metadata": {}, + "source": [ + "\n", + "When using `.apply()`, use `group_keys` to include or exclude the group keys. The `group_keys` argument defaults to `True` (include)." + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "1fa5930a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AnimalMax Speed
Animal
Falcon0Falcon380.0
1Falcon370.0
Parrot2Parrot24.0
3Parrot26.0
\n", + "
" + ], + "text/plain": [ + " Animal Max Speed\n", + "Animal \n", + "Falcon 0 Falcon 380.0\n", + " 1 Falcon 370.0\n", + "Parrot 2 Parrot 24.0\n", + " 3 Parrot 26.0" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", + " 'Parrot', 'Parrot'],\n", + " 'Max Speed': [380., 370., 24., 26.]})\n", + "df.groupby(\"Animal\", group_keys=True).apply(lambda x: x)" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "67e4668e", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AnimalMax Speed
0Falcon380.0
1Falcon370.0
2Parrot24.0
3Parrot26.0
\n", + "
" + ], + "text/plain": [ + " Animal Max Speed\n", + "0 Falcon 380.0\n", + "1 Falcon 370.0\n", + "2 Parrot 24.0\n", + "3 Parrot 26.0" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby(\"Animal\", group_keys=False).apply(lambda x: x)" + ] + }, + { + "cell_type": "markdown", + "id": "c8777695", + "metadata": {}, + "source": [ + "## Pivot table\n", + "\n", + "Create a spreadsheet-style pivot table as a DataFrame.\n", + "\n", + "The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "c8e1b317", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCDE
0fooonesmall12
1fooonelarge24
2fooonelarge25
3footwosmall35
4footwosmall36
5baronelarge46
6baronesmall58
7bartwosmall69
8bartwolarge79
\n", + "
" + ], + "text/plain": [ + " A B C D E\n", + "0 foo one small 1 2\n", + "1 foo one large 2 4\n", + "2 foo one large 2 5\n", + "3 foo two small 3 5\n", + "4 foo two small 3 6\n", + "5 bar one large 4 6\n", + "6 bar one small 5 8\n", + "7 bar two small 6 9\n", + "8 bar two large 7 9" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({\"A\": [\"foo\", \"foo\", \"foo\", \"foo\", \"foo\",\n", + " \"bar\", \"bar\", \"bar\", \"bar\"],\n", + " \"B\": [\"one\", \"one\", \"one\", \"two\", \"two\",\n", + " \"one\", \"one\", \"two\", \"two\"],\n", + " \"C\": [\"small\", \"large\", \"large\", \"small\",\n", + " \"small\", \"large\", \"small\", \"small\",\n", + " \"large\"],\n", + " \"D\": [1, 2, 2, 3, 3, 4, 5, 6, 7],\n", + " \"E\": [2, 4, 5, 5, 6, 6, 8, 9, 9]})\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "ef96918e", + "metadata": {}, + "source": [ + "This first example aggregates values by taking the sum." + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "7206f156", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\87554\\AppData\\Local\\Temp\\ipykernel_26368\\2135498425.py:1: FutureWarning: The provided callable is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"sum\" instead.\n", + " table = pd.pivot_table(df, values='D', index=['A', 'B'],\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Clargesmall
AB
barone4.05.0
two7.06.0
fooone4.01.0
twoNaN6.0
\n", + "
" + ], + "text/plain": [ + "C large small\n", + "A B \n", + "bar one 4.0 5.0\n", + " two 7.0 6.0\n", + "foo one 4.0 1.0\n", + " two NaN 6.0" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", + " columns=['C'], aggfunc=np.sum)\n", + "table" + ] + }, + { + "cell_type": "markdown", + "id": "e0df6460", + "metadata": {}, + "source": [ + "We can also fill in missing values using the `fill_value` parameter." + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "6cfd03f9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\87554\\AppData\\Local\\Temp\\ipykernel_26368\\3213005518.py:1: FutureWarning: The provided callable is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string \"sum\" instead.\n", + " table = pd.pivot_table(df, values='D', index=['A', 'B'],\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Clargesmall
AB
barone45
two76
fooone41
two06
\n", + "
" + ], + "text/plain": [ + "C large small\n", + "A B \n", + "bar one 4 5\n", + " two 7 6\n", + "foo one 4 1\n", + " two 0 6" + ] + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", + " columns=['C'], aggfunc=np.sum, fill_value=0)\n", + "table" + ] + }, + { + "cell_type": "markdown", + "id": "bf713c57", + "metadata": {}, + "source": [ + "The next example aggregates by taking the mean across multiple columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "900dc876", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", + " aggfunc={'D': np.mean,\n", + " 'E': np.mean})\n", + "table" + ] + }, + { + "cell_type": "markdown", + "id": "6a428fdc", + "metadata": {}, + "source": [ + "We can also calculate multiple types of aggregations for any given value column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36ccdfaf", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", + " aggfunc={'D': np.mean,\n", + " 'E': [min, max, np.mean]})\n", + "table" + ] + }, + { + "cell_type": "markdown", + "id": "19eeb851", + "metadata": {}, + "source": [ + "## High-performance Pandas: eval() and query()\n", + "\n", + "### eval()\n", + "\n", + "Evaluate a string describing operations on DataFrame columns.\n", + "\n", + "Operates on columns only, not specific rows or elements. 
This allows `eval` to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.\n", + "\n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "db6fdd36", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
0110
128
236
344
452
\n", + "
" + ], + "text/plain": [ + " A B\n", + "0 1 10\n", + "1 2 8\n", + "2 3 6\n", + "3 4 4\n", + "4 5 2" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "92e71f86", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 11\n", + "1 10\n", + "2 9\n", + "3 8\n", + "4 7\n", + "dtype: int64" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.eval('A + B')" + ] + }, + { + "cell_type": "markdown", + "id": "e5f51480", + "metadata": {}, + "source": [ + "The assignment is allowed though by default the original `DataFrame` is not modified." + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "b6387047", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
011011
12810
2369
3448
4527
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "0 1 10 11\n", + "1 2 8 10\n", + "2 3 6 9\n", + "3 4 4 8\n", + "4 5 2 7" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.eval('C = A + B')" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "a5322c51", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
0110
128
236
344
452
\n", + "
" + ], + "text/plain": [ + " A B\n", + "0 1 10\n", + "1 2 8\n", + "2 3 6\n", + "3 4 4\n", + "4 5 2" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "9a0a5d4d", + "metadata": {}, + "source": [ + "Use `inplace=True` to modify the original DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "13d2dffa", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
011011
12810
2369
3448
4527
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "0 1 10 11\n", + "1 2 8 10\n", + "2 3 6 9\n", + "3 4 4 8\n", + "4 5 2 7" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.eval('C = A + B', inplace=True)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "e9c14654", + "metadata": {}, + "source": [ + "Multiple columns can be assigned using multi-line expressions:" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "8ee5ceea", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
011011-9
12810-6
2369-3
34480
45273
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 1 10 11 -9\n", + "1 2 8 10 -6\n", + "2 3 6 9 -3\n", + "3 4 4 8 0\n", + "4 5 2 7 3" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.eval(\n", + " '''\n", + " C = A + B\n", + " D = A - B\n", + " '''\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "9c052b27", + "metadata": {}, + "source": [ + "### query()\n", + "\n", + "Query the columns of a DataFrame with a boolean expression.\n", + "\n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "d99bb798", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC C
011010
1289
2368
3447
4526
\n", + "
" + ], + "text/plain": [ + " A B C C\n", + "0 1 10 10\n", + "1 2 8 9\n", + "2 3 6 8\n", + "3 4 4 7\n", + "4 5 2 6" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame({\n", + " 'A': range(1, 6),\n", + " 'B': range(10, 0, -2),\n", + " 'C C': range(10, 5, -1)\n", + "})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "c228b08b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC C
4526
\n", + "
" + ], + "text/plain": [ + " A B C C\n", + "4 5 2 6" + ] + }, + "execution_count": 68, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.query('A > B')" + ] + }, + { + "cell_type": "markdown", + "id": "e90ed305", + "metadata": {}, + "source": [ + "The previous expression is equivalent to" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "28a30c04", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC C
4526
\n", + "
" + ], + "text/plain": [ + " A B C C\n", + "4 5 2 6" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.A > df.B]" + ] + }, + { + "cell_type": "markdown", + "id": "454bb2b9", + "metadata": {}, + "source": [ + "For columns with spaces in their name, you can use backtick quoting." + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "4d06bb30", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC C
011010
\n", + "
" + ], + "text/plain": [ + " A B C C\n", + "0 1 10 10" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.query('B == `C C`')" + ] + }, + { + "cell_type": "markdown", + "id": "2ac03c29", + "metadata": {}, + "source": [ + "The previous expression is equivalent to" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "f8dacc1f", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC C
011010
\n", + "
" + ], + "text/plain": [ + " A B C C\n", + "0 1 10 10" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.B == df['C C']]" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "6ec1ded1-6f8a-46ca-b304-25621fe08677", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "bc6c4cd4", + "metadata": {}, + "source": [ + "\n", + "## Your turn! 🚀\n", + "\n", + "### Processing image data\n", + "\n", + "Recently, very powerful AI models have been developed that allow us to understand images. Many tasks can be solved using pre-trained neural networks or cloud services. Some examples include:\n", + "\n", + "- **Image Classification**, which can help you categorize an image into one of several pre-defined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum).\n", + "- **Object Detection**, which detects different objects in an image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) can detect a number of common objects, and you can train a [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum) model to detect specific objects of interest.\n", + "- **Face Detection**, including age, gender and emotion detection. 
This can be done via the [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum).\n", + "\n", + "All of these cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-77958-bethanycheum), and thus can be easily incorporated into your data exploration workflow.\n", + "\n", + "Here are some examples of exploring data from image data sources:\n", + "\n", + "- In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from the pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-77958-bethanycheum) to build an interpretable model.\n", + "- In the [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use the [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum) to extract emotions from people in photographs taken at events, in order to understand what makes people happy.\n", + "\n", + "### Assignment\n", + "\n", + "[Perform a more detailed data study for the challenges above](../../assignments/data-science/data-processing-in-python.md)\n", + "\n", + "## Self study\n", + "\n", + "In this chapter, we've covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. 
To learn more about Pandas, we recommend the following resources:\n", + "\n", + "- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to use small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.\n", + "\n", + "- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do): Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the linked edition is now several years old, and the Pandas package has quite a few new features that it does not cover (newer editions have since been published).\n", + "\n", + "- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming across; more than likely you'll find your answer on a Stack Overflow page.\n", + "\n", + "- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. 
The PyCon tutorials in particular tend to be given by very well-vetted presenters.\n", + "\n", + "With these resources and the walk-through given in this chapter, we hope you'll be poised to use Pandas to tackle any data analysis problem you come across!\n", + "\n", + "## Acknowledgments\n", + "\n", + "Thanks to the [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html), which contributed the majority of the content in this chapter." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "EnvName", + "language": "python", + "name": "envname" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/data-selection.ipynb b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/data-selection.ipynb new file mode 100644 index 0000000000..b12c60d273 --- /dev/null +++ b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/data-selection.ipynb @@ -0,0 +1,4514 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4d6f93c8-aa6b-458d-9d0e-81244eee5808", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "source": [ + "---\n", + "jupytext:\n", + " cell_metadata_filter: -all\n", + " formats: md:myst\n", + " text_representation:\n", + " extension: .md\n", + " format_name: myst\n", + " format_version: 0.13\n", + " jupytext_version: 1.11.5\n", + "kernelspec:\n", + " display_name: Python 3\n", + " language: python\n", + " name: python3" + ] + }, + { + "cell_type": "markdown", + "id": "70c2694f-98d3-4846-a4d2-a88ac4da4a56", + "metadata": {}, + "source": [ + "# Data Selection " + ] + }, + { + "cell_type": "markdown", + "id": "3ca294fb-5274-4c7b-b293-a374877b524b", +
"metadata": {}, + "source": [ + "## Overview\n", + "\n", + "In this section, we'll focus on how to slice, dice, and generally get and set subsets of Pandas objects." + ] + }, + { + "cell_type": "markdown", + "id": "446ccf8d-1e3a-4cec-8bac-400c5c97028a", + "metadata": {}, + "source": [ + "Import NumPy and load Pandas into your namespace:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "f1931205-8c05-40ca-b266-c0f14e26cff3", + "metadata": {}, + "outputs": [], + "source": [ + "# Install the necessary dependencies\n", + "import os\n", + "import sys\n", + "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "281fa7e2", + "metadata": {}, + "source": [ + "## Selection by label" + ] + }, + { + "cell_type": "markdown", + "id": "8cfbc1d9-62b9-4f12-a249-0fb7af77d6f3", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called `chained assignment` and should be avoided." + ] + }, + { + "cell_type": "markdown", + "id": "9dd00162-d4ac-4b84-9da4-9fe7e36cbcb5", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "`.loc` is strict when you present slicers that are not compatible (or convertible) with the index type. For example using integers in a `DatetimeIndex`. These will raise a `TypeError`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "19faf0a0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "dfl = pd.DataFrame(np.random.randn(5, 4),\n", + " columns=list('ABCD'),\n", + " index=pd.date_range('20130101', periods=5))" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "5cd6165e", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "cannot do slice indexing on DatetimeIndex with these indexers [2] of type int", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[3], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m#:tags: [\"raises-exception\"]\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m dfl\u001b[38;5;241m.\u001b[39mloc[\u001b[38;5;241m2\u001b[39m:\u001b[38;5;241m3\u001b[39m]\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1073\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1070\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n\u001b[0;32m 1072\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m com\u001b[38;5;241m.\u001b[39mapply_if_callable(key, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj)\n\u001b[1;32m-> 1073\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_axis(maybe_callable, axis\u001b[38;5;241m=\u001b[39maxis)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1290\u001b[0m, in 
\u001b[0;36m_LocIndexer._getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1288\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(key, \u001b[38;5;28mslice\u001b[39m):\n\u001b[0;32m 1289\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_key(key, axis)\n\u001b[1;32m-> 1290\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_slice_axis(key, axis\u001b[38;5;241m=\u001b[39maxis)\n\u001b[0;32m 1291\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m com\u001b[38;5;241m.\u001b[39mis_bool_indexer(key):\n\u001b[0;32m 1292\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getbool_axis(key, axis\u001b[38;5;241m=\u001b[39maxis)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1324\u001b[0m, in \u001b[0;36m_LocIndexer._get_slice_axis\u001b[1;34m(self, slice_obj, axis)\u001b[0m\n\u001b[0;32m 1321\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m obj\u001b[38;5;241m.\u001b[39mcopy(deep\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[0;32m 1323\u001b[0m labels \u001b[38;5;241m=\u001b[39m obj\u001b[38;5;241m.\u001b[39m_get_axis(axis)\n\u001b[1;32m-> 1324\u001b[0m indexer \u001b[38;5;241m=\u001b[39m labels\u001b[38;5;241m.\u001b[39mslice_indexer(slice_obj\u001b[38;5;241m.\u001b[39mstart, slice_obj\u001b[38;5;241m.\u001b[39mstop, slice_obj\u001b[38;5;241m.\u001b[39mstep)\n\u001b[0;32m 1326\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(indexer, \u001b[38;5;28mslice\u001b[39m):\n\u001b[0;32m 1327\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_slice(indexer, axis\u001b[38;5;241m=\u001b[39maxis)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\datetimes.py:809\u001b[0m, in 
\u001b[0;36mDatetimeIndex.slice_indexer\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 801\u001b[0m \u001b[38;5;66;03m# GH#33146 if start and end are combinations of str and None and Index is not\u001b[39;00m\n\u001b[0;32m 802\u001b[0m \u001b[38;5;66;03m# monotonic, we can not use Index.slice_indexer because it does not honor the\u001b[39;00m\n\u001b[0;32m 803\u001b[0m \u001b[38;5;66;03m# actual elements, is only searching for start and end\u001b[39;00m\n\u001b[0;32m 804\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m (\n\u001b[0;32m 805\u001b[0m check_str_or_none(start)\n\u001b[0;32m 806\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m check_str_or_none(end)\n\u001b[0;32m 807\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mis_monotonic_increasing\n\u001b[0;32m 808\u001b[0m ):\n\u001b[1;32m--> 809\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m Index\u001b[38;5;241m.\u001b[39mslice_indexer(\u001b[38;5;28mself\u001b[39m, start, end, step, kind\u001b[38;5;241m=\u001b[39mkind)\n\u001b[0;32m 811\u001b[0m mask \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39marray(\u001b[38;5;28;01mTrue\u001b[39;00m)\n\u001b[0;32m 812\u001b[0m deprecation_mask \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39marray(\u001b[38;5;28;01mTrue\u001b[39;00m)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:6559\u001b[0m, in \u001b[0;36mIndex.slice_indexer\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 6516\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 6517\u001b[0m \u001b[38;5;124;03mCompute the slice indexer for input labels and step.\u001b[39;00m\n\u001b[0;32m 6518\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 6555\u001b[0m \u001b[38;5;124;03mslice(1, 3, None)\u001b[39;00m\n\u001b[0;32m 6556\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 6557\u001b[0m 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_deprecated_arg(kind, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mkind\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mslice_indexer\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m-> 6559\u001b[0m start_slice, end_slice \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mslice_locs(start, end, step\u001b[38;5;241m=\u001b[39mstep)\n\u001b[0;32m 6561\u001b[0m \u001b[38;5;66;03m# return a slice\u001b[39;00m\n\u001b[0;32m 6562\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_scalar(start_slice):\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:6767\u001b[0m, in \u001b[0;36mIndex.slice_locs\u001b[1;34m(self, start, end, step, kind)\u001b[0m\n\u001b[0;32m 6765\u001b[0m start_slice \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m 6766\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m start \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m-> 6767\u001b[0m start_slice \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mget_slice_bound(start, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mleft\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 6768\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m start_slice \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m 6769\u001b[0m start_slice \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0\u001b[39m\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:6676\u001b[0m, in \u001b[0;36mIndex.get_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 6672\u001b[0m original_label \u001b[38;5;241m=\u001b[39m label\n\u001b[0;32m 6674\u001b[0m \u001b[38;5;66;03m# For datetime indices label may be a string that has to be 
converted\u001b[39;00m\n\u001b[0;32m 6675\u001b[0m \u001b[38;5;66;03m# to datetime boundary according to its resolution.\u001b[39;00m\n\u001b[1;32m-> 6676\u001b[0m label \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_maybe_cast_slice_bound(label, side)\n\u001b[0;32m 6678\u001b[0m \u001b[38;5;66;03m# we need to look up the label\u001b[39;00m\n\u001b[0;32m 6679\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\datetimes.py:767\u001b[0m, in \u001b[0;36mDatetimeIndex._maybe_cast_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 762\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(label, date) \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(label, datetime):\n\u001b[0;32m 763\u001b[0m \u001b[38;5;66;03m# Pandas supports slicing with dates, treated as datetimes at midnight.\u001b[39;00m\n\u001b[0;32m 764\u001b[0m \u001b[38;5;66;03m# https://github.com/pandas-dev/pandas/issues/31501\u001b[39;00m\n\u001b[0;32m 765\u001b[0m label \u001b[38;5;241m=\u001b[39m Timestamp(label)\u001b[38;5;241m.\u001b[39mto_pydatetime()\n\u001b[1;32m--> 767\u001b[0m label \u001b[38;5;241m=\u001b[39m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39m_maybe_cast_slice_bound(label, side, kind\u001b[38;5;241m=\u001b[39mkind)\n\u001b[0;32m 768\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_deprecate_mismatched_indexing(label)\n\u001b[0;32m 769\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_maybe_cast_for_get_loc(label)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\datetimelike.py:320\u001b[0m, in \u001b[0;36mDatetimeIndexOpsMixin._maybe_cast_slice_bound\u001b[1;34m(self, label, side, kind)\u001b[0m\n\u001b[0;32m 318\u001b[0m 
\u001b[38;5;28;01mreturn\u001b[39;00m lower \u001b[38;5;28;01mif\u001b[39;00m side \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mleft\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m upper\n\u001b[0;32m 319\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(label, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_data\u001b[38;5;241m.\u001b[39m_recognized_scalars):\n\u001b[1;32m--> 320\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_invalid_indexer(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mslice\u001b[39m\u001b[38;5;124m\"\u001b[39m, label)\n\u001b[0;32m 322\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m label\n", + "\u001b[1;31mTypeError\u001b[0m: cannot do slice indexing on DatetimeIndex with these indexers [2] of type int" + ] + } + ], + "source": [ + "dfl.loc[2:3]" + ] + }, + { + "cell_type": "markdown", + "id": "f2b699ce-d01f-4afa-8323-c43b9df24b38", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "String-like values in a slice can be converted to the type of the index, which allows natural slicing." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "3f5fb2f0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
2013-01-02-0.5292161.223634-0.783708-1.209286
2013-01-03-1.570743-0.316004-1.132640-0.464328
2013-01-041.390855-0.319271-1.093100-1.090622
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "2013-01-02 -0.529216 1.223634 -0.783708 -1.209286\n", + "2013-01-03 -1.570743 -0.316004 -1.132640 -0.464328\n", + "2013-01-04 1.390855 -0.319271 -1.093100 -1.090622" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfl.loc['20130102':'20130104']" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "abe5968b-ffe5-4302-9918-81a1d97ed568", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "f3c046d5-19dc-47cb-828f-880f008d02d4", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "Pandas raises a `KeyError` when indexing with a list that contains missing labels." + ] + }, + { + "cell_type": "markdown", + "id": "29221dda", + "metadata": {}, + "source": [ + "Pandas provides a suite of methods for **purely label-based indexing**. This is a strict inclusion-based protocol. Every label asked for must be in the index, or a `KeyError` will be raised. When slicing, both the start bound **AND** the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label **and not the position**.\n", + "\n", + "The `.loc` attribute is the primary access method. The following are valid inputs:\n", + "\n", + "- A single label, e.g. `5` or `'a'` (note that `5` is interpreted as a label of the index, not as an integer position along the index).\n", + "\n", + "- A list or array of labels `['a', 'b', 'c']`.\n", + "\n", + "- A slice object with labels `'a':'f'` (note that, contrary to usual Python slices, both the start and the stop are included when present in the index).\n", + "\n", + "- A boolean array.\n", + "\n", + "- A `callable`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "8a174f11", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "c 0.697303\n", + "d -1.412259\n", + "e 0.104600\n", + "f 1.718896\n", + "dtype: float64" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1 = pd.Series(np.random.randn(6), index=list('abcdef'))\n", + "s1\n", + "s1.loc['c':]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "b276bd82-797f-4eb6-8886-51153d771bb0", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "11e56acc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "-0.3374040853531507" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1.loc['b']" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "74a7ae51-b334-4d5f-b9a2-e2080958663f", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "eb2dbf2d-cdd9-42e4-b374-fc7944f1996f", + "metadata": {}, + "source": [ + "Note that setting works as well:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "8fe78c41", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "a -0.985634\n", + "b -0.337404\n", + "c 0.000000\n", + "d 0.000000\n", + "e 0.000000\n", + "f 0.000000\n", + "dtype: float64" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1.loc['c':] = 0\n", + "s1" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "e32f82e4-6b3e-48a7-ab56-c6ea820274e5", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "cfb25d9f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
a1.401532-1.744216-0.212177-1.295240
b0.965335-1.586035-2.2753840.615352
d-0.131692-0.910665-1.2866410.340830
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "a 1.401532 -1.744216 -0.212177 -1.295240\n", + "b 0.965335 -1.586035 -2.275384 0.615352\n", + "d -0.131692 -0.910665 -1.286641 0.340830" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 = pd.DataFrame(np.random.randn(6, 4),\n", + " index=list('abcdef'),\n", + " columns=list('ABCD'))\n", + "df1\n", + "df1.loc[['a', 'b', 'd'], :]" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "de1a7123-2c8e-4910-b435-cdd489baff5b", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "0493a65b-5915-4119-a2a8-00b0b8a728b0", + "metadata": {}, + "source": [ + "Accessing via label slices:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "2934e9e8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
d-0.131692-0.910665-1.286641
e0.238683-0.1697711.322003
f1.511896-0.2472483.169958
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "d -0.131692 -0.910665 -1.286641\n", + "e 0.238683 -0.169771 1.322003\n", + "f 1.511896 -0.247248 3.169958" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc['d':, 'A':'C']" + ] + }, + { + "cell_type": "markdown", + "id": "29f4ac9b-3d4b-4199-a04a-b9ec886b26f6", + "metadata": {}, + "source": [ + "For getting a cross-section using a label (equivalent to `df.xs('a')`):" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "ccbffe12", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "A 1.401532\n", + "B -1.744216\n", + "C -0.212177\n", + "D -1.295240\n", + "Name: a, dtype: float64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc['a']" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "c9570d12-8020-4328-94e8-91266619e666", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "589a2a99", + "metadata": {}, + "source": [ + "For getting values with a boolean array:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "e60fdddf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "A True\n", + "B False\n", + "C False\n", + "D False\n", + "Name: a, dtype: bool" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc['a'] > 0" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "4a9f2648-9f92-4077-a7ec-00836c2f28fd", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "d6226934", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
A
a1.401532
b0.965335
c-0.097299
d-0.131692
e0.238683
f1.511896
\n", + "
" + ], + "text/plain": [ + " A\n", + "a 1.401532\n", + "b 0.965335\n", + "c -0.097299\n", + "d -0.131692\n", + "e 0.238683\n", + "f 1.511896" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc[:, df1.loc['a'] > 0]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "f8ae65cd-dbea-4f40-a464-7b07554b9b11", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "0e52a617", + "metadata": {}, + "source": [ + "NA values in a boolean array propagate as `False`:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "0ca93c29", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "[True, False, True, False, , False]\n", + "Length: 6, dtype: boolean" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mask = pd.array([True, False, True, False, pd.NA, False], dtype=\"boolean\")\n", + "mask" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "fd577bd5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
a1.401532-1.744216-0.212177-1.295240
c-0.097299-0.8344960.188575-0.271869
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "a 1.401532 -1.744216 -0.212177 -1.295240\n", + "c -0.097299 -0.834496 0.188575 -0.271869" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1[mask]" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "4f1b5f67-5c56-4e47-8953-4d6383f283e1", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "2ff30b9c", + "metadata": {}, + "source": [ + "For getting a value explicitly:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "7e425a66", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.4015323563287203" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc['a', 'A'] # this is also equivalent to ``df1.at['a','A']``" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "50e88f3d-07f0-443d-994c-d7fb36c4dc7a", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b29c0cd3", + "metadata": {}, + "source": [ + "## Slicing with labels\n", + "\n", + "When using `.loc` with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "2bd13eab", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3 b\n", + "2 c\n", + "5 d\n", + "dtype: object" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])\n", + "s.loc[3:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "63081450-8216-403c-8b53-04b2cc18e442", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "0a1f8d46", + "metadata": {}, + "source": [ + "If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "a08caf62", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 a\n", + "2 c\n", + "3 b\n", + "4 e\n", + "5 d\n", + "dtype: object" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.sort_index()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "7d665bb1-9bd1-4826-9a0f-f13496d64549", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "a5f5d2ba", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2 c\n", + "3 b\n", + "4 e\n", + "5 d\n", + "dtype: object" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.sort_index().loc[1:6]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "81114a6f-4511-4f2e-990b-c7edd5e4cf86", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "5115e1d2", + "metadata": {}, + "source": [ + "However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed-type indexes). For instance, in the above example, `s.loc[1:6]` would raise `KeyError`." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "318b8e37", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3 b\n", + "2 c\n", + "5 d\n", + "dtype: object" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])\n", + "s.loc[3:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "537dd0b6-b4fc-468b-88a4-5d828eba5ed8", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "ce05682d", + "metadata": {}, + "source": [ + "\n", + "Also, if the index has duplicate labels and either the start or the stop label is duplicated, an error will be raised. For instance, in the above example, `s.loc[2:5]` would raise a `KeyError`.\n", + "\n", + "## Selection by position" + ] + }, + { + "cell_type": "markdown", + "id": "099c8fa7-d8df-4304-a513-1b142c1021d5", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called `chained assignment` and should be avoided." + ] + }, + { + "cell_type": "markdown", + "id": "9c2e1dab", + "metadata": {}, + "source": [ + "Pandas provides a suite of methods for purely integer-based indexing. The semantics closely follow Python and NumPy slicing. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an `IndexError`.\n", + "\n", + "The `.iloc` attribute is the primary access method. The following are valid inputs:\n", + "\n", + "- An integer, e.g. `5`.\n", + "\n", + "- A list or array of integers `[4, 3, 0]`.\n", + "\n", + "- A slice object with ints `1:7`.\n", + "\n", + "- A boolean array.\n", + "\n", + "- A `callable`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "e7b93cb1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.531403\n", + "2 1.164702\n", + "4 -0.384782\n", + "dtype: float64" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))\n", + "s1\n", + "s1.iloc[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "24d4de8c-5c42-484b-89d7-e21ebb0ba7c3", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "fe63cdf3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.21764232439885461" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1.iloc[3]" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "ed15834b-fd14-4000-bbdb-0eb86a214984", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "4ac478c2", + "metadata": {}, + "source": [ + "Note that setting works as well:" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "9c4e8129", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.000000\n", + "2 0.000000\n", + "4 0.000000\n", + "6 0.217642\n", + "8 1.458410\n", + "dtype: float64" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1.iloc[:3] = 0\n", + "s1" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "5b793d9f-5ddb-4121-8218-8a5eda713eab", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "56ced074", + "metadata": {}, + "source": [ + "With a DataFrame, select via integer slicing:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "3d55d682", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0246
0-1.194412-0.1265400.496297-0.194096
2-1.1169511.041856-0.6626330.493678
41.497028-0.2604970.697729-1.092215
\n", + "
" + ], + "text/plain": [ + " 0 2 4 6\n", + "0 -1.194412 -0.126540 0.496297 -0.194096\n", + "2 -1.116951 1.041856 -0.662633 0.493678\n", + "4 1.497028 -0.260497 0.697729 -1.092215" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 = pd.DataFrame(np.random.randn(6, 4),\n", + " index=list(range(0, 12, 2)),\n", + " columns=list(range(0, 8, 2)))\n", + "df1\n", + "df1.iloc[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "172e44bf-8faf-42a1-b9a7-3adab79b97d1", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "b5427ec6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
46
2-0.6626330.493678
40.697729-1.092215
60.7153701.528302
8-0.5482320.081242
\n", + "
" + ], + "text/plain": [ + " 4 6\n", + "2 -0.662633 0.493678\n", + "4 0.697729 -1.092215\n", + "6 0.715370 1.528302\n", + "8 -0.548232 0.081242" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[1:5, 2:4]" + ] + }, + { + "cell_type": "markdown", + "id": "550715ab", + "metadata": {}, + "source": [ + "Select via integer list:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "d86dd6d1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
26
21.0418560.493678
6-0.1775171.528302
100.666088-0.595855
\n", + "
" + ], + "text/plain": [ + " 2 6\n", + "2 1.041856 0.493678\n", + "6 -0.177517 1.528302\n", + "10 0.666088 -0.595855" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[[1, 3, 5], [1, 3]]" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "a5e2a6ba-671b-4aab-b63d-5ab4ee92501f", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "8528cc39", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0246
2-1.1169511.041856-0.6626330.493678
41.497028-0.2604970.697729-1.092215
\n", + "
" + ], + "text/plain": [ + " 0 2 4 6\n", + "2 -1.116951 1.041856 -0.662633 0.493678\n", + "4 1.497028 -0.260497 0.697729 -1.092215" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[1:3, :]" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "178d6f69-464f-464e-ad45-fac857b9a370", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "f9288433", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
24
0-0.1265400.496297
21.041856-0.662633
4-0.2604970.697729
6-0.1775170.715370
8-0.980550-0.548232
100.6660880.114509
\n", + "
" + ], + "text/plain": [ + " 2 4\n", + "0 -0.126540 0.496297\n", + "2 1.041856 -0.662633\n", + "4 -0.260497 0.697729\n", + "6 -0.177517 0.715370\n", + "8 -0.980550 -0.548232\n", + "10 0.666088 0.114509" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[:, 1:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "71859ce4-7ad5-4bea-9df2-f5929c0c2470", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "eb3f25f3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.0418559735628448" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[1, 1] # this is also equivalent to ``df1.iat[1,1]``" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "5dad7d1a-0bf5-40d8-a4ef-2c3e573ae6fc", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "6cb0234e", + "metadata": {}, + "source": [ + "\n", + "To get a cross-section by integer position (equivalent to `df1.xs(2)` here, because the row at position 1 has the label 2):" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "cc95030f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 -1.116951\n", + "2 1.041856\n", + "4 -0.662633\n", + "6 0.493678\n", + "Name: 2, dtype: float64" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[1]" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "bfa6df43-353d-4ba4-94a0-e65c9a659468", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "bc5305b8", + "metadata": {}, + "source": [ + "Out-of-range slice indexes are handled gracefully just as in Python/NumPy." + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "0c635e2f", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['a', 'b', 'c', 'd', 'e', 'f']" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = list('abcdef') # these are allowed in Python/NumPy.\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "bae9b708", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['e', 'f']" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x[4:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "ccb95b2c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x[8:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "fcaaeb73", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 a\n", + "1 b\n", + "2 c\n", + "3 d\n", + "4 e\n", + "5 f\n", + "dtype: object" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s = pd.Series(x)\n", + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "19e7f165", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4 e\n", + "5 f\n", + "dtype: object" + ] + }, + 
"execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.iloc[4:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "3b612356-7774-472e-849e-0f3dc267b578", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "2a25cc5c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Series([], dtype: object)" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.iloc[8:10]" + ] + }, + { + "cell_type": "markdown", + "id": "23aa8371", + "metadata": {}, + "source": [ + "Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned)." + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "f9024d15", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "5837f585", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0
1
2
3
4
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: []\n", + "Index: [0, 1, 2, 3, 4]" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfl.iloc[:, 2:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "4b81ac82-5d47-4410-90b9-040f0dac662b", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "d0e19553", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
B
00.700611
1-0.358047
20.620409
30.953488
4-2.263445
\n", + "
" + ], + "text/plain": [ + " B\n", + "0 0.700611\n", + "1 -0.358047\n", + "2 0.620409\n", + "3 0.953488\n", + "4 -2.263445" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfl.iloc[:, 1:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "39dab713-a3f6-4189-bad9-cba564f56951", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "f91ab868", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
40.374726-2.263445
\n", + "
" + ], + "text/plain": [ + " A B\n", + "4 0.374726 -2.263445" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfl.iloc[4:6]" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "220aa5af-5003-45e9-87cf-c4f5d0ac6d93", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "59c65e15", + "metadata": {}, + "source": [ + "\n", + "A single indexer that is out of bounds will raise an `IndexError`. A list of indexers where any element is out of bounds will raise an `IndexError`." + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "f3496be2", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "IndexError", + "evalue": "positional indexers are out-of-bounds", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mIndexError\u001b[0m Traceback (most recent call last)", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1587\u001b[0m, in \u001b[0;36m_iLocIndexer._get_list_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1586\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m-> 1587\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_take_with_is_copy(key, axis\u001b[38;5;241m=\u001b[39maxis)\n\u001b[0;32m 1588\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[0;32m 1589\u001b[0m \u001b[38;5;66;03m# re-raise with different error message\u001b[39;00m\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\generic.py:3902\u001b[0m, in \u001b[0;36mNDFrame._take_with_is_copy\u001b[1;34m(self, indices, axis)\u001b[0m\n\u001b[0;32m 3895\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 3896\u001b[0m \u001b[38;5;124;03mInternal version of the `take` method that sets the `_is_copy`\u001b[39;00m\n\u001b[0;32m 3897\u001b[0m \u001b[38;5;124;03mattribute to keep track of the 
parent dataframe (using in indexing\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 3900\u001b[0m \u001b[38;5;124;03mSee the docstring of `take` for full explanation of the parameters.\u001b[39;00m\n\u001b[0;32m 3901\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m-> 3902\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_take(indices\u001b[38;5;241m=\u001b[39mindices, axis\u001b[38;5;241m=\u001b[39maxis)\n\u001b[0;32m 3903\u001b[0m \u001b[38;5;66;03m# Maybe set copy if we didn't actually change the index.\u001b[39;00m\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\generic.py:3886\u001b[0m, in \u001b[0;36mNDFrame._take\u001b[1;34m(self, indices, axis, convert_indices)\u001b[0m\n\u001b[0;32m 3884\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_consolidate_inplace()\n\u001b[1;32m-> 3886\u001b[0m new_data \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_mgr\u001b[38;5;241m.\u001b[39mtake(\n\u001b[0;32m 3887\u001b[0m indices,\n\u001b[0;32m 3888\u001b[0m axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_block_manager_axis(axis),\n\u001b[0;32m 3889\u001b[0m verify\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m,\n\u001b[0;32m 3890\u001b[0m convert_indices\u001b[38;5;241m=\u001b[39mconvert_indices,\n\u001b[0;32m 3891\u001b[0m )\n\u001b[0;32m 3892\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_constructor(new_data)\u001b[38;5;241m.\u001b[39m__finalize__(\u001b[38;5;28mself\u001b[39m, method\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtake\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\internals\\managers.py:975\u001b[0m, in \u001b[0;36mBaseBlockManager.take\u001b[1;34m(self, indexer, axis, verify, 
convert_indices)\u001b[0m\n\u001b[0;32m 974\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m convert_indices:\n\u001b[1;32m--> 975\u001b[0m indexer \u001b[38;5;241m=\u001b[39m maybe_convert_indices(indexer, n, verify\u001b[38;5;241m=\u001b[39mverify)\n\u001b[0;32m 977\u001b[0m new_labels \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxes[axis]\u001b[38;5;241m.\u001b[39mtake(indexer)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexers\\utils.py:286\u001b[0m, in \u001b[0;36mmaybe_convert_indices\u001b[1;34m(indices, n, verify)\u001b[0m\n\u001b[0;32m 285\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m mask\u001b[38;5;241m.\u001b[39many():\n\u001b[1;32m--> 286\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mindices are out-of-bounds\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 287\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m indices\n", + "\u001b[1;31mIndexError\u001b[0m: indices are out-of-bounds", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[1;31mIndexError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[67], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m#:tags: [\"raises-exception\"]\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m dfl\u001b[38;5;241m.\u001b[39miloc[[\u001b[38;5;241m4\u001b[39m, \u001b[38;5;241m5\u001b[39m, \u001b[38;5;241m6\u001b[39m]]\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1073\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1070\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n\u001b[0;32m 1072\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m com\u001b[38;5;241m.\u001b[39mapply_if_callable(key, 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj)\n\u001b[1;32m-> 1073\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_axis(maybe_callable, axis\u001b[38;5;241m=\u001b[39maxis)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1616\u001b[0m, in \u001b[0;36m_iLocIndexer._getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1614\u001b[0m \u001b[38;5;66;03m# a list of integers\u001b[39;00m\n\u001b[0;32m 1615\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m is_list_like_indexer(key):\n\u001b[1;32m-> 1616\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_list_axis(key, axis\u001b[38;5;241m=\u001b[39maxis)\n\u001b[0;32m 1618\u001b[0m \u001b[38;5;66;03m# a single integer\u001b[39;00m\n\u001b[0;32m 1619\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 1620\u001b[0m key \u001b[38;5;241m=\u001b[39m item_from_zerodim(key)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1590\u001b[0m, in \u001b[0;36m_iLocIndexer._get_list_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1587\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_take_with_is_copy(key, axis\u001b[38;5;241m=\u001b[39maxis)\n\u001b[0;32m 1588\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[0;32m 1589\u001b[0m \u001b[38;5;66;03m# re-raise with different error message\u001b[39;00m\n\u001b[1;32m-> 1590\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpositional indexers are out-of-bounds\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n", + "\u001b[1;31mIndexError\u001b[0m: 
positional indexers are out-of-bounds" + ] + } + ], + "source": [ + "dfl.iloc[[4, 5, 6]]" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "7b081f89", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "IndexError", + "evalue": "single positional indexer is out-of-bounds", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mIndexError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[68], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m#:tags: [\"raises-exception\"]\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m dfl\u001b[38;5;241m.\u001b[39miloc[:, \u001b[38;5;241m4\u001b[39m]\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1067\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1065\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_is_scalar_access(key):\n\u001b[0;32m 1066\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_value(\u001b[38;5;241m*\u001b[39mkey, takeable\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_takeable)\n\u001b[1;32m-> 1067\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_tuple(key)\n\u001b[0;32m 1068\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 1069\u001b[0m \u001b[38;5;66;03m# we by definition only have the 0th axis\u001b[39;00m\n\u001b[0;32m 1070\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n", + "File 
\u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1563\u001b[0m, in \u001b[0;36m_iLocIndexer._getitem_tuple\u001b[1;34m(self, tup)\u001b[0m\n\u001b[0;32m 1561\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_getitem_tuple\u001b[39m(\u001b[38;5;28mself\u001b[39m, tup: \u001b[38;5;28mtuple\u001b[39m):\n\u001b[1;32m-> 1563\u001b[0m tup \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_tuple_indexer(tup)\n\u001b[0;32m 1564\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m suppress(IndexingError):\n\u001b[0;32m 1565\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_lowerdim(tup)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:873\u001b[0m, in \u001b[0;36m_LocationIndexer._validate_tuple_indexer\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 871\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i, k \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28menumerate\u001b[39m(key):\n\u001b[0;32m 872\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m--> 873\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_key(k, i)\n\u001b[0;32m 874\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[0;32m 875\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[0;32m 876\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mLocation based indexing can only have \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 877\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m[\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_valid_types\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m] types\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 878\u001b[0m ) \u001b[38;5;28;01mfrom\u001b[39;00m 
\u001b[38;5;21;01merr\u001b[39;00m\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1466\u001b[0m, in \u001b[0;36m_iLocIndexer._validate_key\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1464\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n\u001b[0;32m 1465\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m is_integer(key):\n\u001b[1;32m-> 1466\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_integer(key, axis)\n\u001b[0;32m 1467\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(key, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[0;32m 1468\u001b[0m \u001b[38;5;66;03m# a tuple should already have been caught by this point\u001b[39;00m\n\u001b[0;32m 1469\u001b[0m \u001b[38;5;66;03m# so don't treat a tuple as a valid indexer\u001b[39;00m\n\u001b[0;32m 1470\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m IndexingError(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mToo many indexers\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexing.py:1557\u001b[0m, in \u001b[0;36m_iLocIndexer._validate_integer\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1555\u001b[0m len_axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlen\u001b[39m(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_axis(axis))\n\u001b[0;32m 1556\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m key \u001b[38;5;241m>\u001b[39m\u001b[38;5;241m=\u001b[39m len_axis \u001b[38;5;129;01mor\u001b[39;00m key \u001b[38;5;241m<\u001b[39m \u001b[38;5;241m-\u001b[39mlen_axis:\n\u001b[1;32m-> 1557\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124msingle positional indexer is out-of-bounds\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", + "\u001b[1;31mIndexError\u001b[0m: single positional indexer is out-of-bounds" + ] + } + ], + "source": [ + "dfl.iloc[:, 4]" + ] + 
}, + { + "cell_type": "markdown", + "id": "b3fe22e7", + "metadata": {}, + "source": [ + "## Selection by callable\n", + "\n", + "`.loc`, `.iloc`, and `[]` indexing all accept a `callable` as an indexer. The `callable` must be a function of one argument (the calling Series or DataFrame) that returns valid output for indexing." + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "72420538", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
d0.852422-0.452728-0.3842720.370443
e0.263762-0.0533980.1358930.618338
f0.8578980.4567150.556336-1.022002
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "d 0.852422 -0.452728 -0.384272 0.370443\n", + "e 0.263762 -0.053398 0.135893 0.618338\n", + "f 0.857898 0.456715 0.556336 -1.022002" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 = pd.DataFrame(np.random.randn(6, 4),\n", + " index=list('abcdef'),\n", + " columns=list('ABCD'))\n", + "df1\n", + "df1.loc[lambda df: df['A'] > 0, :]" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "7206088f-3aa5-4392-9982-cadec553e616", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "ab18a18f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
a-1.8630491.979038
b-0.336175-1.192626
c-0.3173142.267986
d0.852422-0.452728
e0.263762-0.053398
f0.8578980.456715
\n", + "
" + ], + "text/plain": [ + " A B\n", + "a -1.863049 1.979038\n", + "b -0.336175 -1.192626\n", + "c -0.317314 2.267986\n", + "d 0.852422 -0.452728\n", + "e 0.263762 -0.053398\n", + "f 0.857898 0.456715" + ] + }, + "execution_count": 71, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.loc[:, lambda df: ['A', 'B']]" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "2166496e-975d-4539-a3b6-54cedd012e73", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "aeb4a77e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
a-1.8630491.979038
b-0.336175-1.192626
c-0.3173142.267986
d0.852422-0.452728
e0.263762-0.053398
f0.8578980.456715
\n", + "
" + ], + "text/plain": [ + " A B\n", + "a -1.863049 1.979038\n", + "b -0.336175 -1.192626\n", + "c -0.317314 2.267986\n", + "d 0.852422 -0.452728\n", + "e 0.263762 -0.053398\n", + "f 0.857898 0.456715" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1.iloc[:, lambda df: [0, 1]]" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "e8fe3be5-15de-4036-ab8a-d6483abf265f", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "ec331b54", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "a -1.863049\n", + "b -0.336175\n", + "c -0.317314\n", + "d 0.852422\n", + "e 0.263762\n", + "f 0.857898\n", + "Name: A, dtype: float64" + ] + }, + "execution_count": 75, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1[lambda df: df.columns[0]]" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "31840764-a775-4e5f-8023-6c4762005ff6", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "861f0e5e", + "metadata": {}, + "source": [ + "\n", + "You can use callable indexing in `Series`." + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "d4e60491", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "d 0.852422\n", + "e 0.263762\n", + "f 0.857898\n", + "Name: A, dtype: float64" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['A'].loc[lambda s: s > 0]" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "1d7a46f1-98ce-4d87-924a-288812c6b4ed", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "12d2d96d", + "metadata": {}, + "source": [ + "\n", + "### Combining positional and label-based indexing\n", + "\n", + "If you wish to get the 0th and the 2nd elements from the index in the `'A'` column, you can do:" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "978312bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "a 1\n", + "c 3\n", + "Name: A, dtype: int64" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfd = pd.DataFrame({'A': [1, 2, 3],\n", + " 'B': [4, 5, 6]},\n", + " index=list('abc'))\n", + "dfd\n", + "dfd.loc[dfd.index[[0, 2]], 'A']" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "a8844d1c-fdc5-4c85-923c-092ac6367692", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "11210c0d", + "metadata": {}, + "source": [ + "\n", + "This can also be expressed using `.iloc`, by explicitly getting locations on the indexers, and using positional indexing to select things." + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "2e7e25d2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "a 1\n", + "c 3\n", + "Name: A, dtype: int64" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfd.iloc[[0, 2], dfd.columns.get_loc('A')]" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "48f7feb0-9334-441f-893a-42815523e739", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "d6c36e79", + "metadata": {}, + "source": [ + "\n", + "To get multiple indexers, use `.get_indexer`:" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "7c0b22e6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
A B
a 1 4
c 3 6
\n", + "
" + ], + "text/plain": [ + " A B\n", + "a 1 4\n", + "c 3 6" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "c0924629-67d8-43b6-a435-d91bb8bf6408", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "d97622c3-93dd-44af-94cb-9b4e4401b11b", + "metadata": {}, + "source": [ + "## Acknowledgments\n", + "\n", + "Thanks to the [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html), which contributes the majority of the content in this chapter." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/introduction-and-data-structures.ipynb b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/introduction-and-data-structures.ipynb new file mode 100644 index 0000000000..3ae0658934 --- /dev/null +++ b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/introduction-and-data-structures.ipynb @@ -0,0 +1,7292 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c90e65da-5d8a-4295-8fd2-601a50911cd0", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "source": [ + "---\n", + "jupytext:\n", + " cell_metadata_filter: -all\n", + " formats: md:myst\n", + " text_representation:\n", + " extension: .md\n", + " format_name: myst\n", + " format_version: 0.13\n", + " jupytext_version: 1.11.5\n", + "kernelspec:\n", + " display_name: Python 3\n", + " language: python\n", + " name: python3" + ] + }, + { + "cell_type": "markdown", + "id": "105bf8eb", + "metadata": {}, + "source": [ + "\n", + "# Introduction and Data Structures\n", + " \n", + "Pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the 
Python programming language.\n", + "\n", + "## Introducing Pandas objects\n", + "\n", + "In three sections, we’ll start with a quick, non-comprehensive overview of the fundamental data structures in Pandas to get you started. The fundamental behavior of data types, indexing, axis labeling, and alignment applies across all of the objects. " + ] + }, + { + "cell_type": "markdown", + "id": "2818782f-4106-4c67-9491-5569ffdaaf19", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "In this section, we'll introduce the two main data structures of Pandas and the basic concepts of data indexing and selection." + ] + }, + { + "cell_type": "markdown", + "id": "8e7aa789-fac6-4c4c-833b-e05c8c527491", + "metadata": {}, + "source": [ + "To get started, import NumPy and load Pandas into your namespace:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "c8e7b835", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "# Install the necessary dependencies\n", + "import os\n", + "import sys\n", + "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "bb9af208", + "metadata": {}, + "source": [ + "### Series\n", + "\n", + "`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. 
The basic method to create a `Series` is to call:" + ] + }, + { + "cell_type": "markdown", + "id": "2d6b11bf-93b8-439d-83ed-9ffae399bb1f", + "metadata": { + "attributes": { + "classes": [ + "py" + ], + "id": "" + } + }, + "source": [ + "`s = pd.Series(data, index=index)`" + ] + }, + { + "cell_type": "markdown", + "id": "475acfee", + "metadata": {}, + "source": [ + "Here, `data` can be many different things:\n", + "\n", + "- a Python dict\n", + "- an ndarray\n", + "- a scalar value (like 5)\n", + "\n", + "\n", + "The passed **index** is a list of axis labels. Thus, this separates into a few cases depending on what `data` is:\n", + "\n", + "#### Create a Series\n", + "\n", + "##### From ndarray\n", + "\n", + "If `data` is an ndarray, **index** must be the same length as `data`. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "646c8580", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "s = pd.Series(np.random.randn(5), index=[\"a\", \"b\", \"c\", \"d\", \"e\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "2d2455c1", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 0.142361\n", + "b 0.407910\n", + "c -0.894226\n", + "d 1.311313\n", + "e 0.710528\n", + "dtype: float64" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "20f33329", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['a', 'b', 'c', 'd', 'e'], dtype='object')" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.index" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "5376f720", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.131905\n", + "1 -0.703499\n", + "2 1.530060\n", + "3 -0.073598\n", + "4 -0.892724\n", + "dtype: float64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.Series(np.random.randn(5))" + ] + }, + { + "cell_type": "markdown", + "id": "2f4e73c7", + "metadata": {}, + "source": [ + ":::{note}\n", + "Pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.\n", + ":::\n", + "\n", + "##### From dict\n", + "`Series` can be instantiated from dicts:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "e8095575", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "d = {\"b\": 1, \"a\": 0, \"c\": 2}" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "ba462934", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b 1\n", + "a 0\n", + "c 2\n", + "dtype: int64" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.Series(d)" + ] + }, + { + "cell_type": "markdown", + "id": "c4329868", + "metadata": {}, + "source": [ + "If an index is passed, the values in data corresponding to the labels in the index will be pulled out." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "03488418", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "d = {\"a\": 0.0, \"b\": 1.0, \"c\": 2.0}" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "c35e968c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 0.0\n", + "b 1.0\n", + "c 2.0\n", + "dtype: float64" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.Series(d)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "95eafc4d", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "b 1.0\n", + "c 2.0\n", + "d NaN\n", + "a 0.0\n", + "dtype: float64" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.Series(d, index=[\"b\", \"c\", \"d\", \"a\"])" + ] + }, + { + "cell_type": "markdown", + "id": "1be5c72d", + "metadata": {}, + "source": [ + ":::{note}\n", + "NaN (not a number) is the standard missing data marker used in Pandas.\n", + ":::\n", + "\n", + "##### From scalar value\n", + "\n", + "If `data` is a scalar value, an index must be provided. The value will be repeated to match the length of **index**." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "6f744115", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 5.0\n", + "b 5.0\n", + "c 5.0\n", + "d 5.0\n", + "e 5.0\n", + "dtype: float64" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.Series(5.0, index=[\"a\", \"b\", \"c\", \"d\", \"e\"])" + ] + }, + { + "cell_type": "markdown", + "id": "8060fb92", + "metadata": {}, + "source": [ + "#### Series is ndarray-like\n", + "\n", + "`Series` acts very similarly to an `ndarray` and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "2ca453e9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.14236085563166023" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "4cf8e176", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 0.142361\n", + "b 0.407910\n", + "c -0.894226\n", + "dtype: float64" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "1bab7730", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "d 1.311313\n", + "e 0.710528\n", + "dtype: float64" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[s > s.median()]" + ] + }, + { + "cell_type": "code", + "execution_count": 
25, + "id": "b5e98d89", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "e 0.710528\n", + "d 1.311313\n", + "b 0.407910\n", + "dtype: float64" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[[4, 3, 1]]" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "c98a7190", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 1.152993\n", + "b 1.503672\n", + "c 0.408924\n", + "d 3.711042\n", + "e 2.035066\n", + "dtype: float64" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.exp(s)" + ] + }, + { + "cell_type": "markdown", + "id": "a49ee902", + "metadata": {}, + "source": [ + "Like a NumPy array, a Pandas Series has a single `dtype`." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "b0298996", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "dtype('float64')" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.dtype" + ] + }, + { + "cell_type": "markdown", + "id": "69857db8", + "metadata": {}, + "source": [ + "If you need the actual array backing a `Series`, use `Series.array`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "1989c3a9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "[0.14236085563166023, 0.4079103331875502, -0.8942262958393035,\n", + " 1.3113126514940179, 0.7105280060827549]\n", + "Length: 5, dtype: float64" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.array" + ] + }, + { + "cell_type": "markdown", + "id": "7ed219b0", + "metadata": {}, + "source": [ + "While `Series` is ndarray-like, if you need an actual ndarray, then use `Series.to_numpy()`." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "1cc04172", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0.14236086, 0.40791033, -0.8942263 , 1.31131265, 0.71052801])" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.to_numpy()" + ] + }, + { + "cell_type": "markdown", + "id": "12f01f86", + "metadata": {}, + "source": [ + "Even if the `Series` is backed by an `ExtensionArray`, `Series.to_numpy()` will return a NumPy ndarray.\n", + "\n", + "#### Series is dict-like\n", + "\n", + "A `Series` is also like a fixed-size dict in that you can get and set values by index label:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "bcfe90c9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.14236085563166023" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[\"a\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "00c68766", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], 
+ "source": [ + "s[\"e\"] = 12.0" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "74f58473", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 0.142361\n", + "b 0.407910\n", + "c -0.894226\n", + "d 1.311313\n", + "e 12.000000\n", + "dtype: float64" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "2f822110", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"e\" in s" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "164dcf61", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"f\" in s" + ] + }, + { + "cell_type": "markdown", + "id": "ca979c84", + "metadata": {}, + "source": [ + "If a label is not contained in the index, an exception is raised:" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "40a23c62-9c88-4a6e-9316-60317abe7859", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [ + "raises-exception" + ] + }, + "outputs": [ + { + "ename": "KeyError", + "evalue": "'f'", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:3802\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, 
key, method, tolerance)\u001b[0m\n\u001b[0;32m 3801\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m-> 3802\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[0;32m 3803\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\_libs\\index.pyx:138\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\_libs\\index.pyx:165\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n", + "File \u001b[1;32mpandas\\_libs\\hashtable_class_helper.pxi:5745\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n", + "File \u001b[1;32mpandas\\_libs\\hashtable_class_helper.pxi:5753\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n", + "\u001b[1;31mKeyError\u001b[0m: 'f'", + "\nThe above exception was the direct cause of the following exception:\n", + "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[35], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m s[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\series.py:981\u001b[0m, in \u001b[0;36mSeries.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 978\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_values[key]\n\u001b[0;32m 980\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m key_is_scalar:\n\u001b[1;32m--> 981\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_value(key)\n\u001b[0;32m 983\u001b[0m 
\u001b[38;5;28;01mif\u001b[39;00m is_hashable(key):\n\u001b[0;32m 984\u001b[0m \u001b[38;5;66;03m# Otherwise index.get_value will raise InvalidIndexError\u001b[39;00m\n\u001b[0;32m 985\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m 986\u001b[0m \u001b[38;5;66;03m# For labels that don't resolve as scalars like tuples and frozensets\u001b[39;00m\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\series.py:1089\u001b[0m, in \u001b[0;36mSeries._get_value\u001b[1;34m(self, label, takeable)\u001b[0m\n\u001b[0;32m 1086\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_values[label]\n\u001b[0;32m 1088\u001b[0m \u001b[38;5;66;03m# Similar to Index.get_value, but we do not fall back to positional\u001b[39;00m\n\u001b[1;32m-> 1089\u001b[0m loc \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindex\u001b[38;5;241m.\u001b[39mget_loc(label)\n\u001b[0;32m 1090\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindex\u001b[38;5;241m.\u001b[39m_get_values_for_loc(\u001b[38;5;28mself\u001b[39m, loc, label)\n", + "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\indexes\\base.py:3804\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key, method, tolerance)\u001b[0m\n\u001b[0;32m 3802\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[0;32m 3803\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[1;32m-> 3804\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[0;32m 3805\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[0;32m 
3806\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[0;32m 3807\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[0;32m 3808\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[0;32m 3809\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n", + "\u001b[1;31mKeyError\u001b[0m: 'f'" + ] + } + ], + "source": [ + "s[\"f\"]" + ] + }, + { + "cell_type": "markdown", + "id": "396df6e2", + "metadata": {}, + "source": [ + "With the `Series.get()` method, a missing label returns `None` or the specified default:" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "ad2a67c6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "s.get(\"f\")" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "13c1c13b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "nan" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.get(\"f\", np.nan)" + ] + }, + { + "cell_type": "markdown", + "id": "1b19c44c", + "metadata": {}, + "source": [ + "These labels can also be accessed as attributes.\n", + "\n", + "#### Vectorized operations and label alignment with Series\n", + "\n", + "When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with `Series` in Pandas. `Series` can also be passed into most NumPy methods expecting an ndarray." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "35540134", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 0.284722\n", + "b 0.815821\n", + "c -1.788453\n", + "d 2.622625\n", + "e 24.000000\n", + "dtype: float64" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s + s" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "aea7c1dc", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 0.284722\n", + "b 0.815821\n", + "c -1.788453\n", + "d 2.622625\n", + "e 24.000000\n", + "dtype: float64" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s * 2" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "4dcdc8c4", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 1.152993\n", + "b 1.503672\n", + "c 0.408924\n", + "d 3.711042\n", + "e 162754.791419\n", + "dtype: float64" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.exp(s)" + ] + }, + { + "cell_type": "markdown", + "id": "f8ed10f3", + "metadata": {}, + "source": [ + "A key difference between `Series` and ndarray is that operations between `Series` automatically align the data based on the label. Thus, you can write computations without giving consideration to whether the `Series` involved have the same labels." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "563555a0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a NaN\n", + "b 0.815821\n", + "c -1.788453\n", + "d 2.622625\n", + "e NaN\n", + "dtype: float64" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[1:] + s[:-1]" + ] + }, + { + "cell_type": "markdown", + "id": "7e19643a", + "metadata": {}, + "source": [ + "The result of an operation between unaligned `Series` will have the **union** of the indexes involved. If a label is not found in one `Series` or the other, the result will be marked as missing `NaN`. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the Pandas data structures set Pandas apart from the majority of related tools for working with labeled data.\n", + "\n", + ":::{note}\n", + "In general, we chose to make the default result of operations between differently indexed objects yield the **union** of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. 
You of course have the option of dropping labels with missing data via the `dropna` function.\n", + ":::\n", + "\n", + "#### Name attribute\n", + "\n", + "`Series` also has a `name` attribute:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "3b39834b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "s = pd.Series(np.random.randn(5), name=\"something\")" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "18210d7f", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.836867\n", + "1 -0.187063\n", + "2 0.180988\n", + "3 0.434802\n", + "4 1.175946\n", + "Name: something, dtype: float64" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "06f09ce2", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'something'" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.name" + ] + }, + { + "cell_type": "markdown", + "id": "b35b499b", + "metadata": {}, + "source": [ + "The `Series` `name` can be assigned automatically in many cases, in particular, when selecting a single column from a `DataFrame`, the `name` will be assigned the column label.\n", + "\n", + "You can rename a `Series` with the `pandas.Series.rename()` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "bd079c61", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "s2 = s.rename(\"different\")" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "a1767258", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'different'" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s2.name" + ] + }, + { + "cell_type": "markdown", + "id": "398a679d", + "metadata": {}, + "source": [ + "Note that `s` and `s2` refer to different objects.\n", + "\n", + "### DataFrame\n", + "\n", + "`DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a `dict` of `Series` objects. It is generally the most commonly used Pandas object. Like `Series`, `DataFrame` accepts many different kinds of input:\n", + "\n", + "- Dict of 1D ndarrays, lists, dicts, or `Series`\n", + "- 2-D `numpy.ndarray`\n", + "- Structured or record ndarray\n", + "- A `Series`\n", + "- Another `DataFrame`\n", + "\n", + "Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting `DataFrame`. Thus, a `dict` of `Series` plus a specific index will discard all data not matching up to the passed index.\n", + "\n", + "If axis labels are not passed, they will be constructed from the input data based on common-sense rules.\n", + "\n", + "#### Create a DataFrame\n", + "\n", + "##### From dict of `Series` or dicts\n", + "\n", + "The resulting **index** will be the **union** of the indexes of the various `Series`. If there are any nested dicts, these will first be converted to `Series`. 
If no columns are passed, the columns will be the ordered list of `dict` keys." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "aa7ddc8a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "d = {\n", + " \"one\": pd.Series([1.0, 2.0, 3.0], index=[\"a\", \"b\", \"c\"]),\n", + " \"two\": pd.Series([1.0, 2.0, 3.0, 4.0], index=[\"a\", \"b\", \"c\", \"d\"]),\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "f526badc", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(d)" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "69ddc66c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onetwo
a1.01.0
b2.02.0
c3.03.0
dNaN4.0
\n", + "
" + ], + "text/plain": [ + " one two\n", + "a 1.0 1.0\n", + "b 2.0 2.0\n", + "c 3.0 3.0\n", + "d NaN 4.0" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "1f5e8ccb", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onetwo
dNaN4.0
b2.02.0
a1.01.0
\n", + "
" + ], + "text/plain": [ + " one two\n", + "d NaN 4.0\n", + "b 2.0 2.0\n", + "a 1.0 1.0" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(d, index=[\"d\", \"b\", \"a\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "9940fb65", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
twothree
d4.0NaN
b2.0NaN
a1.0NaN
\n", + "
" + ], + "text/plain": [ + " two three\n", + "d 4.0 NaN\n", + "b 2.0 NaN\n", + "a 1.0 NaN" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(d, index=[\"d\", \"b\", \"a\"], columns=[\"two\", \"three\"])" + ] + }, + { + "cell_type": "markdown", + "id": "93b5a50c", + "metadata": {}, + "source": [ + ":::{note}\n", + "When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.\n", + ":::\n", + "\n", + "The row and column labels can be accessed through the **index** and **columns** attributes, respectively:" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "8a3ba6ae", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['a', 'b', 'c', 'd'], dtype='object')" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.index" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "13684125", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['one', 'two'], dtype='object')" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "markdown", + "id": "49c8bc9a", + "metadata": {}, + "source": [ + "##### From dict of ndarrays / lists\n", + "\n", + "The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be `range(n)`, where `n` is the array length." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "c4789555", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "d = {\"one\": [1.0, 2.0, 3.0, 4.0], \"two\": [4.0, 3.0, 2.0, 1.0]}" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "29098be0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onetwo
01.04.0
12.03.0
23.02.0
34.01.0
\n", + "
" + ], + "text/plain": [ + " one two\n", + "0 1.0 4.0\n", + "1 2.0 3.0\n", + "2 3.0 2.0\n", + "3 4.0 1.0" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(d)" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "5600834a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onetwo
a1.04.0
b2.03.0
c3.02.0
d4.01.0
\n", + "
" + ], + "text/plain": [ + " one two\n", + "a 1.0 4.0\n", + "b 2.0 3.0\n", + "c 3.0 2.0\n", + "d 4.0 1.0" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(d, index=[\"a\", \"b\", \"c\", \"d\"])" + ] + }, + { + "cell_type": "markdown", + "id": "506868de", + "metadata": {}, + "source": [ + "##### From structured or record array\n", + "\n", + "This case is handled identically to a dict of arrays." + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "0b3b5090", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "# \"S10\" is a 10-byte string field; the older \"a10\" alias is not accepted by NumPy 2.x\n", + "data = np.zeros((2,), dtype=[(\"A\", \"i4\"), (\"B\", \"f4\"), (\"C\", \"S10\")])" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "543153a7", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "data[:] = [(1, 2.0, \"Hello\"), (2, 3.0, \"World\")]" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "c5278e68", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
012.0b'Hello'
123.0b'World'
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "0 1 2.0 b'Hello'\n", + "1 2 3.0 b'World'" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(data)" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "fefbfc51", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABC
first12.0b'Hello'
second23.0b'World'
\n", + "
" + ], + "text/plain": [ + " A B C\n", + "first 1 2.0 b'Hello'\n", + "second 2 3.0 b'World'" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(data, index=[\"first\", \"second\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "f76d517a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CAB
0b'Hello'12.0
1b'World'23.0
\n", + "
" + ], + "text/plain": [ + " C A B\n", + "0 b'Hello' 1 2.0\n", + "1 b'World' 2 3.0" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(data, columns=[\"C\", \"A\", \"B\"])" + ] + }, + { + "cell_type": "markdown", + "id": "75f7c017", + "metadata": {}, + "source": [ + ":::{note}\n", + "DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.\n", + ":::\n", + "\n", + "\n", + "##### From a list of dicts" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "a2aa6cb3", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "data2 = [{\"a\": 1, \"b\": 2}, {\"a\": 5, \"b\": 10, \"c\": 20}]" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "1e45ffbc", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
012NaN
151020.0
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "0 1 2 NaN\n", + "1 5 10 20.0" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(data2)" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "8d6db924", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abc
first12NaN
second51020.0
\n", + "
" + ], + "text/plain": [ + " a b c\n", + "first 1 2 NaN\n", + "second 5 10 20.0" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(data2, index=[\"first\", \"second\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "258fa418", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
012
1510
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 1 2\n", + "1 5 10" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(data2, columns=[\"a\", \"b\"])" + ] + }, + { + "cell_type": "markdown", + "id": "dfb77761", + "metadata": {}, + "source": [ + "##### From a dict of tuples\n", + "\n", + "You can automatically create a MultiIndexed frame by passing a dict whose keys are tuples. In the example, both the column keys and the nested row keys are tuples." + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "89af5166", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
bacab
AB1.04.05.08.010.0
C2.03.06.07.0NaN
DNaNNaNNaNNaN9.0
\n", + "
" + ], + "text/plain": [ + " a b \n", + " b a c a b\n", + "A B 1.0 4.0 5.0 8.0 10.0\n", + " C 2.0 3.0 6.0 7.0 NaN\n", + " D NaN NaN NaN NaN 9.0" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(\n", + " {\n", + " (\"a\", \"b\"): {(\"A\", \"B\"): 1, (\"A\", \"C\"): 2},\n", + " (\"a\", \"a\"): {(\"A\", \"C\"): 3, (\"A\", \"B\"): 4},\n", + " (\"a\", \"c\"): {(\"A\", \"B\"): 5, (\"A\", \"C\"): 6},\n", + " (\"b\", \"a\"): {(\"A\", \"C\"): 7, (\"A\", \"B\"): 8},\n", + " (\"b\", \"b\"): {(\"A\", \"D\"): 9, (\"A\", \"B\"): 10},\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e02d86d6", + "metadata": {}, + "source": [ + "##### From a Series\n", + "\n", + "The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided)." + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "77ff8552", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "ser = pd.Series(range(3), index=list(\"abc\"), name=\"ser\")" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "a86d1926", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ser
a0
b1
c2
\n", + "
" + ], + "text/plain": [ + " ser\n", + "a 0\n", + "b 1\n", + "c 2" + ] + }, + "execution_count": 69, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame(ser)" + ] + }, + { + "cell_type": "markdown", + "id": "2f824850", + "metadata": {}, + "source": [ + "##### From a list of namedtuples\n", + "\n", + "The field names of the first `namedtuple` in the list determine the columns of the `DataFrame`. The remaining namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the `DataFrame`. If any of those tuples is shorter than the first `namedtuple`, the later columns in the corresponding row are marked as missing values. If any are longer than the first `namedtuple`, a `ValueError` is raised." + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "67fd765e", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "from collections import namedtuple" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "d4524af3", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "Point = namedtuple(\"Point\", \"x y\")" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "02f0937c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
xy
000
103
223
\n", + "
" + ], + "text/plain": [ + " x y\n", + "0 0 0\n", + "1 0 3\n", + "2 2 3" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "4c81da05", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "Point3D = namedtuple(\"Point3D\", \"x y z\")" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "6731aad6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
xyz
0000.0
1035.0
223NaN
\n", + "
" + ], + "text/plain": [ + " x y z\n", + "0 0 0 0.0\n", + "1 0 3 5.0\n", + "2 2 3 NaN" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])" + ] + }, + { + "cell_type": "markdown", + "id": "8ff0bca2", + "metadata": {}, + "source": [ + "##### From a list of dataclasses\n", + "\n", + "Data classes, as introduced in PEP 557, can be passed into the `DataFrame` constructor. Passing a list of dataclasses is equivalent to passing a list of dictionaries.\n", + "\n", + "Note that every value in the list should be a dataclass; mixing types in the list results in a `TypeError`." + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "5fe92237", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "from dataclasses import make_dataclass" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "e13b27cf", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "Point = make_dataclass(\"Point\", [(\"x\", int), (\"y\", int)])" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "df6b2816", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
xy
000
103
223
\n", + "
" + ], + "text/plain": [ + " x y\n", + "0 0 0\n", + "1 0 3\n", + "2 2 3" + ] + }, + "execution_count": 77, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])" + ] + }, + { + "cell_type": "markdown", + "id": "8e826768", + "metadata": {}, + "source": [ + "#### Column selection, addition, deletion\n", + "\n", + "You can treat a `DataFrame` semantically like a dict of like-indexed `Series` objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "a52d0734", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onetwo
a1.01.0
b2.02.0
c3.03.0
dNaN4.0
\n", + "
" + ], + "text/plain": [ + " one two\n", + "a 1.0 1.0\n", + "b 2.0 2.0\n", + "c 3.0 3.0\n", + "d NaN 4.0" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "804405d6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 1.0\n", + "b 2.0\n", + "c 3.0\n", + "d NaN\n", + "Name: one, dtype: float64" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"one\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "dfa00c9b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df[\"three\"] = df[\"one\"] * df[\"two\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "0f98ffa9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df[\"flag\"] = df[\"one\"] > 2" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "1ef5e1a3", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onetwothreeflag
a1.01.01.0False
b2.02.04.0False
c3.03.09.0True
dNaN4.0NaNFalse
\n", + "
" + ], + "text/plain": [ + " one two three flag\n", + "a 1.0 1.0 1.0 False\n", + "b 2.0 2.0 4.0 False\n", + "c 3.0 3.0 9.0 True\n", + "d NaN 4.0 NaN False" + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "f518cd88", + "metadata": {}, + "source": [ + "Columns can be deleted or popped like with a dict:" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "b418f585", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "del df[\"two\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "209ebb78", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "three = df.pop(\"three\")" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "9aee9b49", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
oneflag
a1.0False
b2.0False
c3.0True
dNaNFalse
\n", + "
" + ], + "text/plain": [ + " one flag\n", + "a 1.0 False\n", + "b 2.0 False\n", + "c 3.0 True\n", + "d NaN False" + ] + }, + "execution_count": 85, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "40b5a135", + "metadata": {}, + "source": [ + "When inserting a scalar value, it will naturally be propagated to fill the column:" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "1bddfbc5", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df[\"foo\"] = \"bar\"" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "e2613bd3", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
oneflagfoo
a1.0Falsebar
b2.0Falsebar
c3.0Truebar
dNaNFalsebar
\n", + "
" + ], + "text/plain": [ + " one flag foo\n", + "a 1.0 False bar\n", + "b 2.0 False bar\n", + "c 3.0 True bar\n", + "d NaN False bar" + ] + }, + "execution_count": 87, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "d93a6895", + "metadata": {}, + "source": [ + "When inserting a `Series` that does not have the same index as the `DataFrame`, it will be conformed to the DataFrame's index:" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "c20564a5", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df[\"one_trunc\"] = df[\"one\"][:2]" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "877b972d-49b8-4225-855e-ec77bd876d8b", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "76026aba", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
oneflagfooone_trunc
a1.0Falsebar1.0
b2.0Falsebar2.0
c3.0TruebarNaN
dNaNFalsebarNaN
\n", + "
" + ], + "text/plain": [ + " one flag foo one_trunc\n", + "a 1.0 False bar 1.0\n", + "b 2.0 False bar 2.0\n", + "c 3.0 True bar NaN\n", + "d NaN False bar NaN" + ] + }, + "execution_count": 90, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "b7c3f5d9", + "metadata": {}, + "source": [ + "You can insert raw ndarrays but their length must match the length of the DataFrame's index.\n", + "\n", + "By default, columns get inserted at the end. `DataFrame.insert()` inserts at a particular location in the columns:" + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "8dbfb773", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df.insert(1, \"bar\", df[\"one\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "27dea852", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
onebarflagfooone_trunc
a1.01.0Falsebar1.0
b2.02.0Falsebar2.0
c3.03.0TruebarNaN
dNaNNaNFalsebarNaN
\n", + "
" + ], + "text/plain": [ + " one bar flag foo one_trunc\n", + "a 1.0 1.0 False bar 1.0\n", + "b 2.0 2.0 False bar 2.0\n", + "c 3.0 3.0 True bar NaN\n", + "d NaN NaN False bar NaN" + ] + }, + "execution_count": 92, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "4786e42f", + "metadata": {}, + "source": [ + "#### Assigning new columns in method chains\n", + "\n", + "DataFrame has an `assign()` method that allows you to easily create new columns that are potentially derived from existing columns." + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "e9e4dead", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "iris = pd.read_csv(\"https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/data-science/working-with-data/pandas/iris.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "38eef1a4", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
\n", + "
" + ], + "text/plain": [ + " SepalLength SepalWidth PetalLength PetalWidth Name\n", + "0 5.1 3.5 1.4 0.2 Iris-setosa\n", + "1 4.9 3.0 1.4 0.2 Iris-setosa\n", + "2 4.7 3.2 1.3 0.2 Iris-setosa\n", + "3 4.6 3.1 1.5 0.2 Iris-setosa\n", + "4 5.0 3.6 1.4 0.2 Iris-setosa" + ] + }, + "execution_count": 96, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "iris.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "ed27d63b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000
\n", + "
" + ], + "text/plain": [ + " SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio\n", + "0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275\n", + "1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245\n", + "2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851\n", + "3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913\n", + "4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000" + ] + }, + "execution_count": 97, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "iris.assign(sepal_ratio=iris[\"SepalWidth\"] / iris[\"SepalLength\"]).head()" + ] + }, + { + "cell_type": "markdown", + "id": "c989dbf7", + "metadata": {}, + "source": [ + "In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to." + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "id": "4f39885a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000
\n", + "
" + ], + "text/plain": [ + " SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio\n", + "0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275\n", + "1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245\n", + "2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851\n", + "3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913\n", + "4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000" + ] + }, + "execution_count": 98, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "iris.assign(sepal_ratio=lambda x: (x[\"SepalWidth\"] / x[\"SepalLength\"])).head()" + ] + }, + { + "cell_type": "markdown", + "id": "abcd0aee", + "metadata": {}, + "source": [ + "`assign()` **always** returns a copy of the data, leaving the original DataFrame untouched.\n", + "\n", + "Passing a callable, as opposed to an actual value to be inserted, is useful when you don't have a reference to the DataFrame at hand. This is common when using `assign()` in a chain of operations. For example, we can limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "0508916b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 99, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAkAAAAGwCAYAAABB4NqyAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAABNgElEQVR4nO3dfVyUdb4//tcAMgg24w2ImoigeBemCIrAwklLzGpTa4+sbdgNrnEqlcz2q6vlTbVsHcsbVtzcY5JtEZVZdg6V1KqAsFnEWCc6LoaKqxA3CgOSoHD9/vDHrMPcMDPMzHVdc72ej8c8aj5zXZ/5XHOB8+Zz8/6oBEEQQERERKQgXmI3gIiIiMjdGAARERGR4jAAIiIiIsVhAERERESKwwCIiIiIFIcBEBERESkOAyAiIiJSHB+xGyBFXV1duHDhAm666SaoVCqxm0NEREQ2EAQBLS0tGDFiBLy8rPfxMAAy48KFCwgJCRG7GUREROSAc+fOYeTIkVaPYQBkxk033QTg+geo0WhEbg0RERHZQq/XIyQkxPA9bg0DIDO6h700Gg0DICIiIpmxZfoKJ0ETERGR4jAAIiIiIsVhAERERESKwwCIiIiIFIcBEBERESkOAyAiIiJSHAZAREREpDgMgIiIiEhxGAARERGR4jAAIiIiIsXhVhhEMlNV34qzF9swekgAwgIDxG4OEZEsMQAikommtg6syNWhsLLeUJYUEYSsxVHQ+vcTsWVERPLDITAimViRq8OxUw1GZcdONWB5brlILSIiki8GQEQyUFXfisLKenQKglF5pyCgsLIepxsui9QyIiJ5YgBEJANnL7ZZff1MIwMgIiJ7MAAikoHQwf5WXx89hJOhiYjswQCISAbCgwYgKSII3iqVUbm3SoWkiCCuBiMishMDICKZyFochYSxgUZlCWMDkbU4SqQWERHJF5fBE8mE1r8f9qXNwOmGyzjTeJl5gIiI+oABEJHMhAUy8CEi6isOgREREZHiMAAiIiIixWEARERERIrDAIiIiIgUhwEQERERKQ4DICIiIlIcLoMnslFVfSvOXmxj/h0iIg8geg9QdnY2wsLC4Ofnh+joaBQVFdl03rFjx+Dj44OpU6calefk5EClUpk8rly54oLWkxI0tXVgyZ7jmP3KUTyy9yvM2nIES/YcR3PbVbGbRkREDhI1AMrLy0NGRgbWrVuH8vJyJCYmYt68eaiurrZ6XnNzM5YsWYLbb7/d7OsajQY1NTVGDz8/P1dcAinAilwdjp1qMCo7dqoBy3PLRWoRERH1lagB0Kuvvoq0tDQsXboUEydOxLZt2xASEoJdu3ZZPe+xxx7DAw88gLi4OLOvq1QqDBs2zOhB5Iiq+lYUVtajUxCMyjsFAYWV9TjdcFmklhERUV+IFgB1dHSgrKwMycnJRuXJyckoKSmxeN7evXvx448/YsOGDRaPaW1tRWhoKEaOHIl77rkH5eXW/1Jvb2+HXq83ehABwNmLbVZfP9PIAIiISI5EC4AaGhrQ2dmJ4OBgo/Lg4GDU1taaPaeyshJr1qzBW2+9BR8f8/O3J0yYgJycHBw8eBC5ubnw8/NDQkICKisrLbYlMzMTWq3W8AgJCXH8wsijhA72t/r66CGcDE1EJEeiT4JWqVRGzwVBMCkDgM7OTjzwwAPYtGkTxo0bZ7G+mTNn4sEHH8SUKVOQmJiId999F+PGjUNWVpbFc9auXYvm5mbD49y5c45fEHmU8KABSIoIgnePn0lvlQpJEUFcDUZEJFOiLYMPDAyEt7e3SW9PXV2dSa8QALS0tODrr79GeXk5nnzySQBAV1cXBEGAj48PDh06hNmzZ5uc5+XlhenTp1vtAVKr1VCr1X28IvJUWYujkPbGV/j67CVDWcLYQGQtjhKxVURE1Bei9QD5+voiOjoaBQUFRuUFBQWIj483OV6j0eC7776DTqczPNLT0zF+/HjodDrExsaafR9BEKDT6TB8+HCXXAd5tqa2Diz
PLTcKfqaPHoSsxVHQ+vcTsWVERNQXoiZCXLVqFVJTUxETE4O4uDjs3r0b1dXVSE9PB3B9aOr8+fPYt28fvLy8EBkZaXT+0KFD4efnZ1S+adMmzJw5ExEREdDr9dixYwd0Oh127tzp1msjz2BuCfw3Z5uwPLcc+9JmiNQqIiLqK1EDoJSUFDQ2NmLz5s2oqalBZGQk8vPzERoaCgCoqanpNSdQT01NTVi2bBlqa2uh1WoRFRWFwsJCzJjBLyuyT/cS+J5uXALPOUBERPKkEoQeCU4Ier0eWq0Wzc3N0Gg0YjeHRHL4ZB0e2fuVxdf3PjIds8YPdWOL7MOtO4hIaez5/uZeYEQWyHUJfFNbB1bk6ox6r5IigjhviYjoBqIvgyeSKrkugefWHUREvWMARGRF1uIoJIwNNCqT8hJ4W7fuqKpvxeGTddzKg4gUi0NgRFZo/fthX9oMnG64jDONlyU/n6a3rTu+P9+MDR99z+ExIlI89gAR2SAsMACzxg+VdPAD9D5v6Y2SMxweIyICAyAij2Jt3tL00YPw1dlL3NmeiAgMgIg8jqV5Sw/Fj7Z6Hne2JyIl4RwgIhdzdz4eS/OWqupbrZ4npWX9zGFERK7GAIjIRcTOxxMWaBw8dA+PHTvVYDQM5q1SIWFsoCQCDbE/MyJSDg6BEbmIFPPxSH1Zvz2fGZfyE1FfsAeIyAWkuo+YlJf12/qZsZeIiJyBPUBELtBbPh6xJxxLcVm/rZ+ZFHvWiEh+GAARuYBc9xETky2fma2ZromIesMAiMgF5LqPmJhs+cyk3rNGRPLBAIjIRaQ+4ViKevvM2LNGRM7CSdBELiLlCcdS1dtnJoel/EQkDypB6DGYTtDr9dBqtWhuboZGoxG7OUR0g+a2q1ieW85VYERkwp7vb/YAEZHTuTKTM3vWiMgZGAARkdO4M0dPz0zXRET24CRoInIa5ughIrlgAERETsEcPUQkJwyAiMgpmKOHiOSEc4CIFMgVk5SZo4eI5IQBEJGCuHKSMnP0EJGccAiMSEFcPUmZ2a+JSC7YA0SkEN2TlHu6cZJyX3tpmKOHiOSCARCRQtgySdlZwQpz9BCR1HEIjEghevtl9/FS9XIEEZHnYA8QkRu4cmsIW3X18vq1Lm4LSETKwQCIyIXcuTVEb7hMnYjoXzgERuRCUtoaonuZurfKeKjLW6VCUkQQ5+wQkaKIHgBlZ2cjLCwMfn5+iI6ORlFRkU3nHTt2DD4+Ppg6darJa/v378ekSZOgVqsxadIkHDhwwMmtJk9SVd+KwyfrnL5VgxS3huAydSKi60QdAsvLy0NGRgays7ORkJCA1157DfPmzUNFRQVGjRpl8bzm5mYsWbIEt99+O3766Sej10pLS5GSkoLnn38eCxcuxIEDB7Bo0SIUFxcjNjbW1ZdEMuLq4Sl3rrqyFZepExFdpxIEQbSZj7GxsZg2bRp27dplKJs4cSIWLFiAzMxMi+f9+te/RkREBLy9vfHhhx9Cp9MZXktJSYFer8cnn3xiKLvzzjsxaNAg5Obm2tQuvV4PrVaL5uZmaDQa+y+MZGHJnuMWsxbvS5vR5/qr6lsx+5WjFl8/vPo2Bh9ERE5kz/e3aENgHR0dKCsrQ3JyslF5cnIySkpKLJ63d+9e/Pjjj9iwYYPZ10tLS03qnDt3rtU629vbodfrjR7k2dwxPMU5N0RE0iVaANTQ0IDOzk4EBwcblQcHB6O2ttbsOZWVlVizZg3eeust+PiYH72rra21q04AyMzMhFarNTxCQkLsvBqSG3ftXM45N0RE0iT6MnhVj7+OBUEwKQOAzs5OPPDAA9i0aRPGjRvnlDq7rV27FqtWrTI81+v1DII8nLuWhHPODRGRNIkWAAUGBsLb29ukZ6aurs6kBwcAWlpa8PXXX6O8vBxPPvkkAKCrqwuCIMDHxweHDh3C7NmzMWzYMJvr7KZWq6FWq51wVSQX3cNTxZX1RgkCXbVzuRhbQ0gh+SIRkVSJFgD
5+voiOjoaBQUFWLhwoaG8oKAA8+fPNzleo9Hgu+++MyrLzs7G3/72N7z//vsICwsDAMTFxaGgoABPPfWU4bhDhw4hPj7eRVdCctTU1oFrXV0m2ZFnhA2W/fCUlJIvEhFJlahDYKtWrUJqaipiYmIQFxeH3bt3o7q6Gunp6QCuD02dP38e+/btg5eXFyIjI43OHzp0KPz8/IzKV65ciaSkJLz00kuYP38+PvroI3z++ecoLi5267WRtK3I1eHLqotGZV4qoJ+3l9uDBGf31FhLvuiM1W1ERJ5A1AAoJSUFjY2N2Lx5M2pqahAZGYn8/HyEhoYCAGpqalBdXW1XnfHx8XjnnXewfv16PPvssxgzZgzy8vKYA4gMuleA9dQlwLACzFwg4uxAxRU9NZau7cbVbRwOIyISOQ+QVDEPkGc7fLIOj+z9yuLrex+Zjlnjhxqeu2pIyRV5iOy9NiIiTyKLPEBEYrF3BZgr9vNyVR4ibnhKRGQbBkCkOPYkKOxroGJpnzFX5SFi8kUiItswACJFsjVBoaOBSlNbB5bsOY7ZrxzFI3u/wqwtR7Bkz3E0t10F4NqeGiZfJCLqneiJEInEYGuCQkcDld5WYnX31FiaA9SXnhomXyQi6h17gEjRwgIDMGv8UIQFBpgdrnJkSMnWYTNX99TceG1ERGSMPUDkMnLJRNzbKq+sxVFYnltu9Lq1QMWWYbOwwAD21BARiYgBEDmd3DIR9zZcZW+gYu+wmRjbZBARKR2HwMjpXLFs3FXsWeVl65BSeNAAxI8ZYva1+DFDGOwQEUkAAyDqk57zZlyV38ZVXLUc3VJ6UaYdJSKSBg6BkUMsDXOlxIy0el73/Bep6O0vAEurvKzNb6qqb0VpVaPZ80qrGiWxHYVc5mcB8morEckHAyByiKVhrraOa1bPk0omYnMB3I0sLUe3ZX6TrZOgxSCn+VlyaisRyQ+HwMhu1oa5vj57CdNHD5J8JmJzAdyNLK3ysmV+U2+ToI+erEeRhcDL1eQ0P0tObSUi+WEARHbrrYfjofjRks5EbCmA6/Zm2gzD6i9bzus5v8lS7qBuOSVnkLrnOKI2H8K5RuufpTPJaX6WnNpKRPLEITCyW289HLeM0GJf2gjJ5rfpLYC71mU+MLJnaMtc7qCeLrVdxb07i1H+XHIvLXYOKQ/N9SSnthKRPLEHiOxma3ZkqWYidnR7C3vO684ddHj1bVg1J8LiOZfarrptOExOO8XLqa1EJE8MgMghct5w09Ed0x05LywwAL2tfP+m+pJd7bdVzxQFctopXk5tJSJ5UgkCM5P0pNfrodVq0dzcDI1GI3ZzJE2qw1yA9eXTzW1XTYaobFlh5Mh5R0/W4aG9X1ms8820GUiMCLL1sqyqqm/F9zV67D76I747rzdpIwCHrlsMjt4jIlIue76/GQCZwQBI3uxZPu1oAGfveVGbD+FS21WT8kH+/ZwyB6i3Zf1eAH4REYR9aTMASDtw7UlObSUicTEA6iMGQPK2ZM9xHDvVYLSCqDuvT3cA4G7nGttw785ioyBokH8/HHziFwgZYn2+iy3MXbM5h1ffxiCCiDyWPd/fXAVGHqV7+XRPNy6fFiMACBnij/LnklFUWY9vqi9h2qhBTh32srba7EZcPUVEdB0DIPIoUl8+nRgR5LTAp1tv13wjrp4iIrqOARB5FCUun+7tmrtNDx3k9OCP+3QRkVwxACKP0r182tIcIE/8krZ0zTca5N8P//XQdKe9J/fpIiK5Yx4gkrWeuW4AeecocpS5a+42ffQgHFk9y6mBibl9uopP1XOfLiKSDa4CM4OrwKTPlh4IJS6f7r5mHy8VrnUJfbp2S8NbVfWtmP3KUYvnHXwiAbeGDHToPYmI+oKrwMjjWeuB6F7qHhaonMCnmzOuubfgsrdJ178/8B3+e0Vin9pARORqHAIj2bG0U3iXABRW1uPbc03iNMxDmAsuj51qMAxv9Tbp+n8v6GW1W7u5YVQi8nzsASL
ZYQ+E69iSRyk8aAAib9bgf2/YaqMnsdMN2IITuYmUjT1AJDue1gMhJbbkUQKAFxdEWj1ODukGeuvpIiLPxgCIZKe7B8Ka7i9qso+teZSmhAxCUkSQyT8gctmt3dIw6o09XUTk2RgAkSx5Qg+EFIUHDcAgC8M/g/z7GQU2WYuj8IseWa3lkm7A1p4uIvJcogdA2dnZCAsLg5+fH6Kjo1FUVGTx2OLiYiQkJGDIkCHo378/JkyYgK1btxodk5OTA5VKZfK4cuWKqy+F3EjuPRBSVVXfanbXegC41HbVqGdE698P+9Jm4PDq27D3kek4vPo27EubIYv5M0rMGE5ExkSdBJ2Xl4eMjAxkZ2cjISEBr732GubNm4eKigqMGjXK5PiAgAA8+eSTuPXWWxEQEIDi4mI89thjCAgIwLJlywzHaTQanDx50uhcPz8/l18PuVfW4igszy03msQqlx4IqXJkLzU5phtQYsZwIjImaiLE2NhYTJs2Dbt27TKUTZw4EQsWLEBmZqZNddx3330ICAjAm2++CeB6D1BGRgaamppsbkd7ezva29sNz/V6PUJCQpgIUSaUmPDQVXpLcnh49W0e8xk3t101CaC5CoxI3mSRCLGjowNlZWVYs2aNUXlycjJKSkpsqqO8vBwlJSV44YUXjMpbW1sRGhqKzs5OTJ06Fc8//zyioiz3CmRmZmLTpk32XwRJghx7IKRKST0j3UN4DKCJlEm0OUANDQ3o7OxEcHCwUXlwcDBqa2utnjty5Eio1WrExMTgiSeewNKlSw2vTZgwATk5OTh48CByc3Ph5+eHhIQEVFZWWqxv7dq1aG5uNjzOnTvXt4sjkjGl7aUWFhiAWeOHMvghUhjREyGqVCqj54IgmJT1VFRUhNbWVvz973/HmjVrMHbsWCxevBgAMHPmTMycOdNwbEJCAqZNm4asrCzs2LHDbH1qtRpqtbqPV0LkGdgzQkRKIFoAFBgYCG9vb5Penrq6OpNeoZ7CwsIAAJMnT8ZPP/2EjRs3GgKgnry8vDB9+nSrPUBEZIpDi0TkyUQbAvP19UV0dDQKCgqMygsKChAfH29zPYIgGE1gNve6TqfD8OHDHW4rEdmH+2sRkdSJOgS2atUqpKamIiYmBnFxcdi9ezeqq6uRnp4O4PrcnPPnz2Pfvn0AgJ07d2LUqFGYMGECgOt5gbZs2YLly5cb6ty0aRNmzpyJiIgI6PV67NixAzqdDjt37nT/BZLHqKpvxdmLbRwO6gX31yIiuRA1AEpJSUFjYyM2b96MmpoaREZGIj8/H6GhoQCAmpoaVFdXG47v6urC2rVrcfr0afj4+GDMmDH44x//iMcee8xwTFNTE5YtW4ba2lpotVpERUWhsLAQM2bMcPv1kfzxC90+1vbX2pfG30Eikg5R8wBJlT15BMizLdlz3OKScH6hG1NSDiEikiZ7vr9F3wqDSKq4YaZ9uL8WEckJAyAiC/iFbh/ur0VEcsIAyM2csTqGK2zcQ8pf6H39GXDFz1B3FmnvHnm8bNmg1tnt4e8IEfVG9ESISuGMybSckOteUtwWoq8/A67+GbJ3g1pnt4e/I0RkK06CNsMVk6CdMZmWE3LdT2obZvb1Z8BdP0O2ZpF2dnv4O0KkbLLYDFVJuifT9nTjZNreehOcUQfZT0rbQvT1Z8CdP0O2ZJF2dnv4O0JE9uAcIDdwxmRaTsgVlxQ2zHTkZ+DGuTBS+xlydnukdn1EJG3sAXIDZ0ymlfKEXLLMmRmk7fkZMDcXZvroQTaf7w7O/pnm7wgR2YM9QG7Ql9UxzqxDqcRYEdTU1oEle45j9itH8cjerzBryxEs2XMczW1XHa6z+2fA3C/tIP9+GOzva3huLiPzN2ebMMi/n2R+hpz9M83fESKyBwMgN8laHIWEsYFGZdZWx7iqDrmzJ5hxRRBiS5uq6lvx4J4vUXzKeD5K95YQfWFp8rX+56uGuq0lcLzUdhXTRg00KhfzZ8jZP9P8HSEiW3E
VmBmu3ArDGZNppTAh190cWd7s6hVB5to0yL8fLvUSYPVlSwhbtps403gZj+z9yuIxex+ZjtFDAiT1M+Tsn2kl/o4QEVeBSZotq2PcUYfc2LvJpjtWBJlrU2/BD3B9Mq6j723LRF9b5sJI7WfI2e2R2vURkfRwCIwkz5E9uVy9IshSm2zRl8m4tgQ3nAtDRNQ7BkAkeY4EM65eEdRbm8zxVqHPAYitwQ3nwhARWcchMJI8R4IZV29j0VubzAlQ+zglALFluwkpJXAkIpIiToI2w5WToJXAmblvujkyodnV21iYa1NvpocOwn89NN0p78/ghojImD3f3wyAzGAA5BhXbkTZl2DGVYFCdeNlzN95zGjis5cK6OrlNyp+zBC8/duZTmtHT64IQF1Njm0mIulhANRHDIAc446NKKXU62Huer0AdNlwbl+Wwlsix53Q5dhmIpIue76/OQmanMKRlVqOEHtPru6kh4X/qDN7vd3Bj8r0VCNfVjU6vW3WUgVIlRzbTESegZOgySlsWakldo9NX5jrqbBm+EA/XGi6YvF1R7pdrQ0TyXEndDm2mYg8BwMgMuHIfAxP34jSXE+FNX+8bzKWvG45G/PM8CE212XLMJEcA1A5tpmIPAcDIAXrGej0ZT6Gq5edi8lST4U53debNG4o4sKHoNTMUFdc+BC7Pg9bsmDLMQCVY5uJyHNwDpACWdok9PG3vunTfAxPTb5nT9LDG6/3zw9GIykiyOj1pIgg/PnBaJvrs3VulSuyP9uz8awjmLGaiMTEHiAFMtejUFxZb3b1kj3zMTw1+V5vPRVvps3AtS7B5HrNfR6CIOCbc5ds/mx6C74qzjcbZX/uLUGiLdy5MstZbSaSA6Z7kBYGQApjaTint6Xb9szH8LSNKHsb3kvs0cvTU1hgAAb593MoqOgt+MopOYO7p4wA4LwA1N6NZ/vCU4Nmohsx3YM0cQhMYRzZwwpwbD6Gq4dQ3ClrcRSmhQ40KrOnp8LR5d7hQQMwffQgi69/dfaSyefbl1QB7kpn0JPY6Q2IXInpHqSJPUAK01uPQs9Mxo5MYva0v3a6r+erM5cMZdNDB9l8PX1d7v1Q/Gij9+7JXO+co13tXJlF5FxM9yBd7AFSGGsTT+PCh+AXY42HcxyZj+Fpf+2Yu55vqptsvh5HdrO/0aTh1rOZ3tg7Z2mCe/MN23VYw5VZRM7V199/ch32ACmQtYmnWv9+fZqP4Wl/7fR2Pe8cr0ZsL8va+xpU2JNioK/zdzw5nQGRGPhHhXQ5HAA1NTVhz549+OGHH6BSqTBx4kSkpaVBq9U6s33kAr1NPO3LJGZPG0Lp7XrWfPAdAOtDfJaCim4bPvq+1+E0W1ZLOSv45MosIufhHxXS5dAQ2Ndff40xY8Zg69atuHjxIhoaGrB161aMGTMG33zzjV11ZWdnIywsDH5+foiOjkZRUZHFY4uLi5GQkIAhQ4agf//+mDBhArZu3Wpy3P79+zFp0iSo1WpMmjQJBw4csPsaPYmlyciumHjqSX/tVNW3orb5Z5uO7W2Iz1yOJFvPBf4VtB5efRv2PjIdh1ffhn1pM4yCJmd1tdvyXkRkO0/NkSZ3DvUAPfXUU7j33nvxl7/8BT4+16u4du0ali5dioyMDBQWFtpUT15eHjIyMpCdnY2EhAS89tprmDdvHioqKjBq1CiT4wMCAvDkk0/i1ltvRUBAAIqLi/HYY48hICAAy5YtAwCUlpYiJSUFzz//PBYuXIgDBw5g0aJFKC4uRmxsrCOXK0tV9a2ouKDHGyVn8NXZf02gdfVkZE/4a8fefb+A3ntZtP79sPHeSZj9ylG7z72Rtd45ZwefnpbOgEgsTPcgTSpBMNMn34v+/fujvLwcEyZMMCqvqKhATEwM2tpsW2odGxuLadOmYdeuXYayiRMnYsGCBcjMzLSpjvvuuw8BAQF48803AQApKSnQ6/X45JNPDMfceeedGDRoEHJzc22qU6/XQ6vVorm5GRq
N9QmoUtPbl3d3IOLsfC43am67ajKEIqdVYEv2HLc4XNWbp+ZE4N4pN5v9x+3wyTo8stfy/mB7H5mOWeOHAnB8FZe5trvjnhMRSYE9398O9QBpNBpUV1ebBEDnzp3DTTfdZFMdHR0dKCsrw5o1a4zKk5OTUVJSYlMd5eXlKCkpwQsvvGAoKy0txVNPPWV03Ny5c7Ft2zaL9bS3t6O9vd3wXK/X2/T+UtTbpp3umIws57927Nn3y5ytBZXYWlBpNuCzpYemrykEOH+HiMg2DgVAKSkpSEtLw5YtWxAfHw+VSoXi4mI888wzWLx4sU11NDQ0oLOzE8HBwUblwcHBqK2ttXruyJEjUV9fj2vXrmHjxo1YunSp4bXa2lq768zMzMSmTZtsareU2fPl7Y7JyHIcQnE0UWRP5lZe2TI82N2D01tdlsg5+CQicieHAqAtW7ZApVJhyZIluHbtGgCgX79++I//+A/88Y9/tKsuVY98NIIgmJT1VFRUhNbWVvz973/HmjVrMHbsWKPAy946165di1WrVhme6/V6hISE2HMZkmDPl7ecJiO7U2+9NLay1NNmrYfGmSkE5Bh8kjxxfyuSK4cCIF9fX2zfvh2ZmZn48ccfIQgCxo4dC39/2788AgMD4e3tbdIzU1dXZ9KD01NYWBgAYPLkyfjpp5+wceNGQwA0bNgwu+tUq9VQq9U2t12qbPnyltNkZDGEBw1AXPgQlFY1Wjym+zPcNP8WHDxxHlsLKi0e27OnzVoPzTfnLGd7NlcXkZg8LeM7KU+fMkH7+/tj8uTJuPXWW+0KfoDrQVR0dDQKCgqMygsKChAfH29zPYIgGM3fiYuLM6nz0KFDdtUpV73tGwUA00IHymI+iJj7iPXSAWnosQkLDMAvbx1h9VhLPW3mUhB4UgoB8nyelvGdlMfmHqD77rsPOTk50Gg0uO+++6we+8EHH9hU56pVq5CamoqYmBjExcVh9+7dqK6uRnp6OoDrQ1Pnz5/Hvn37AAA7d+7EqFGjDJOvi4uLsWXLFixfvtxQ58qVK5GUlISXXnoJ8+fPx0cffYTPP/8cxcXFtl6qrPW2b9Tjs8ZK+q8zsf+qrKpvRcmPlnt/3kybYbT7uzOX/XtCCgFSBk/L+E7KZHMApNVqDfNoNBpNr/N0bJGSkoLGxkZs3rwZNTU1iIyMRH5+PkJDQwEANTU1qK6uNhzf1dWFtWvX4vTp0/Dx8cGYMWPwxz/+EY899pjhmPj4eLzzzjtYv349nn32WYwZMwZ5eXmKyQFkz75RUtTXrRzsYW7uQm/zqK51mS6Nd+bKK2fUxTkZ5GqelvGdlMmhPECeTs55gAD55oKpqm81myiw2+HVtznlH1VrvUyNl9sdboMzV145UpfYvWekHO76XSWylz3f3w7NAZo9ezaamprMvvHs2bMdqZKcSK5p1921a7K1XqbuYSjvHj2c3ioVkiKCrP6j7sytRRypi3MyyF368ntCJBUOrQI7cuQIOjo6TMqvXLlidS8vcg+p5IKxdyjGHZOAbZm74MgwlNjDTlKekyH2Z0OuwaSbJHd2BUDffvut4f8rKiqMlpt3dnbi008/xc033+y81lGfiJULxtGhGHdMArZ17oKtAaRUhp2kOCdDKp8NuYZU/tAicpRdAdDUqVOhUqmgUqnMDnX1798fWVlZTmscyVNfJjK7+q/K3nqZfLz+1aVvSwDpzknb1khxCb1UPhtyLSbdJLmyKwA6ffo0BEFAeHg4jh8/jqCgfy0H9vX1xdChQ+Ht7e30RpJ89HUoxtV/VVrqZeqWuue4SS+FpSEcKQ07SW0JvZQ+GyIic+wKgLqXp3d1dbmkMSR/zhqKceVfleZ6mW7U3UuxY/FUq0M4Uht2ktKcDKl9NkREPTk0CbpbRUUFqqurTSZE33vvvX1qFMmXFIdieuruZSr8Rx2WvP6VyevdvRS/3fc1vjnbZPTajUM4UrtWKc3JkNpnQ0TUk0MBUFVVFRYuXIjvvvs
OKpUK3amEupMjdnZ2Oq+FJCtSG4qxprOXDFjmMmrfOIQj1WuVwpwMqX42RETdHMoDtHLlSoSFheGnn36Cv78/vv/+exQWFiImJgZHjhxxchNJbuSSh6gvO7935ySSy7WKgZ8NEUmZQ5mgAwMD8be//Q233nortFotjh8/jvHjx+Nvf/sbnn76aZSXyzvxmtwzQUuFs4diXJFPxlLW7KhRA/H1Wct7qvXMdCuFYSep4mdDRO5iz/e3Q0NgnZ2dGDBgAIDrwdCFCxcwfvx4hIaG4uTJk45USR7IWUMxrswnY23i8PLccpuHcKQw7CRV/GyISIocCoAiIyPx7bffIjw8HLGxsXj55Zfh6+uL3bt3Izw83NltJIVzZT4ZaxOHpbSqioiInMuhAGj9+vW4fPn6HIgXXngB99xzDxITEzFkyBC88847Tm0gKZu78smY66Vw96oqbhlBROQ+DgVAc+fONfx/eHg4KioqcPHiRQwaNMiwEozIGaSQT8bVQzjcMoKIyP0cWgVmzuDBg1FbW4snn3zSWVUSuTSfTFV9Kw6frMPpBufsMu8o7uJOROR+dvcAVVRU4PDhw+jXrx8WLVqEgQMHoqGhAS+++CL+/Oc/IywszBXtJIVyRT4ZKfW4cMsIIiJx2NUD9N///d+IiorC8uXLkZ6ejpiYGBw+fBgTJ06ETqfDe++9h4qKCle1lRTK2flkpNTjYssQHxEROZ9dPUAvvvgi0tPT8eKLL2L37t1YvXo10tPTsX//fiQlJbmqjaRwzpyMLLUeF24ZQWLixHtxyfXzl2u7e7IrAPrhhx/wxhtvYMCAAVixYgV+97vfYdu2bQx+yC2cMRlZCpOqb8QtI0gMUhoGViK5fv5ybbcldg2B6fV6DBw4EADg4+OD/v37Y9y4ca5oF5FLSLHHhVtGkLtJaRhYieT6+cu13ZY4NAm6trYWACAIAk6ePGnICdTt1ltvdU7riJxMij0uUtrFnTyf1IaBlUaun79c222N3QHQ7bffjhu3D7vnnnsAwLArvEql4m7wJGlSzfDMLSPIHaQ2DKw0cv385dpua+wKgE6fPu2qdhC5DXtcSMmkOAysJHL9/OXabmvsCoBCQ0Nd1Q4it2OPCymRFIeBlUSun79c222NSrhxPMuKb7/91uZK5T4HSK/XQ6vVorm5GRqNRuzmEBE5VXPbVZNhYDmv5pEbuX7+cmi3Pd/fNgdAXl5ehnk+Viv0gDlADICISAk4DCwuuX7+Um63Pd/fNg+Bcf4PkW08JUkYeT4OA4tLrp+/XNvdk80BEOf/EFnnaUnCiIg8md3L4G9UUVGB6upqdHR0GJXfe++9fWoUkRxZSxK2L22GSK0iIiJzHAqAqqqqsHDhQnz33XdG84JUKhUAyH4OEJG9PDFJGBGRJ7NrK4xuK1euRFhYGH766Sf4+/vj+++/R2FhIWJiYnDkyBEnN5H6qqq+FYdP1uF0g+fvLC7WtXJXdyIieXEoACotLcXmzZsRFBQELy8veHl54Re/+AUyMzOxYsUKu+rKzs5GWFgY/Pz8EB0djaKiIovHfvDBB5gzZw6CgoKg0WgQFxeHzz77zOiYnJwcqFQqk8eVK1ccuVRZa2rrwJI9xzH7laN4ZO9XmLXlCJbsOY7mtqtiN83pxL5WT0wSRkTkyRwKgDo7OzFgwAAAQGBgIC5cuADg+kTpkydP2lxPXl4eMjIysG7dOpSXlyMxMRHz5s1DdXW12eMLCwsxZ84c5Ofno6ysDLNmzcIvf/lLlJcbb8Sm0WhQU1Nj9PDz83PkUmXN0zaus0bsa+1OEub9/w8Dd/NWqZAUEcThLyIiiXFoDlBkZCS+/fZbhIeHIzY2Fi+//DJ8fX2xe/duhIeH21zPq6++irS0NCxduhQAsG3bNnz22WfYtWsXMjMzTY7ftm2b0fM//OEP+Oijj/Dxxx8jKupf+zipVCoMGzbM5na0t7ejvb3d8Fyv19t8rlQpaU6KVK5VqnuMERGRKYc
CoPXr1xt2gH/hhRdwzz33IDExEUOGDME777xjUx0dHR0oKyvDmjVrjMqTk5NRUlJiUx1dXV1oaWnB4MGDjcpbW1sRGhqKzs5OTJ06Fc8//7xRgNRTZmYmNm3aZNN7yoUnblxniVSulXuMERHJh0MB0Ny5cw3/Hx4ejoqKCly8eBGDBg0yrATrTUNDAzo7OxEcHGxUHhwcjNraWpvqeOWVV3D58mUsWrTIUDZhwgTk5ORg8uTJ0Ov12L59OxISEnDixAlERESYrWft2rVYtWqV4bler0dISIhNbZAqJc1Jkdq1WksSxiSJRETS4NAcoEcffRQtLS1GZYMHD0ZbWxseffRRu+rqGTAJgmBTEJWbm4uNGzciLy8PQ4cONZTPnDkTDz74IKZMmYLExES8++67GDduHLKysizWpVarodFojB5yp6Q5KXK4VrEnaRMRkTGHAqA33ngDP//8s0n5zz//jH379tlUR2BgILy9vU16e+rq6kx6hXrKy8tDWloa3n33Xdxxxx1Wj/Xy8sL06dNRWVlpU7s8SdbiKCSMDTQq89Q5KVK/VrEnaRMRkTG7hsD0ej0EQYAgCGhpaTFaWdXZ2Yn8/Hyj3hhrfH19ER0djYKCAixcuNBQXlBQgPnz51s8Lzc3F48++ihyc3Nx99139/o+giBAp9Nh8uTJNrXLkyhpToqUr1Uqk7SJiOhf7AqABg4caMirM27cOJPXVSqVXZOJV61ahdTUVMTExCAuLg67d+9GdXU10tPTAVyfm3P+/HlDr1Jubi6WLFmC7du3Y+bMmYbeo/79+0Or1QIANm3ahJkzZyIiIgJ6vR47duyATqfDzp077blUj+IpG9fZQorXKpVJ2kRE9C92BUCHDx+GIAiYPXs29u/fb7T6ytfXF6GhoRgxYoTN9aWkpKCxsRGbN29GTU0NIiMjkZ+fb9h4taamxign0GuvvYZr167hiSeewBNPPGEof+ihh5CTkwMAaGpqwrJly1BbWwutVouoqCgUFhZixgzuxUTikNokbSIiAlRC90Zedjh79ixGjRpl84ovudHr9dBqtWhubvaICdEkviV7juPYqQZ03vDr5q1SIWFsIDdKJSJyEnu+vx2aBB0aGori4mI8+OCDiI+Px/nz5wEAb775JoqLix2pksijSX2SNhGR0jiUB2j//v1ITU3Fb37zG3zzzTeGLMotLS34wx/+gPz8fKc2kkjupDxJm4jcg3nApMWhIbCoqCg89dRTWLJkCW666SacOHEC4eHh0Ol0uPPOO21OZChVHAIjIiJnaWrrwIpcndFq0KSIIGQtjoLWv5+ILfM8Lh8CO3nyJJKSkkzKNRoNmpqaHKmSiIjIIzEPmDQ5FAANHz4cp06dMikvLi62azNUIiIiT9adB6yzx2DLjXnASBwOBUCPPfYYVq5ciS+//BIqlQoXLlzAW2+9hdWrV+Pxxx93dhuJiIhkyZY8YCQOhyZB/+53v4Ner8esWbNw5coVJCUlQa1WY/Xq1XjyySed3UYiIiJZYh4w6bIrAGpra8MzzzyDDz/8EFevXsUvf/lLPP300wCASZMmYcCAAS5pJBERkRx1b9ZsKQ8YV4OJx64AaMOGDcjJycFvfvMb9O/fH2+//Ta6urrw3nvvuap9REREspa1OArLc8uNVoExD5j47FoGP2bMGLz44ov49a9/DQA4fvw4EhIScOXKFXh7e7uske7GZfBERORszAPmevZ8f9vVA3Tu3DkkJiYans+YMQM+Pj64cOECQkJCHGstERGRAkhxs2Yls2sVWGdnJ3x9fY3KfHx8cO3aNac2ioiIiMiV7OoBEgQBDz/8MNRqtaHsypUrSE9PR0DAv6LaDz74wHktJCIiInIyuwKghx56yKTswQcfdFpjiIiIiNzBrgBo7969rmoHERERkds4lAmaiIiISM4YABEREZHiMAAiIiIixXFoLzAiIiI5qKpvxdmLbUw+SCYYABERkcdpauvAilyd0fYTSRFByFo
cBa1/PxFbRlLBITAiIvI4K3J1OHaqwajs2KkGLM8tF6lFJDUMgIiIyKNU1beisLLeaPd1AOgUBBRW1uN0w2WRWuY6VfWtOHyyziOvzVU4BEZERB7l7MU2q6+fabzsMfOBONTnOPYAERGRRwkd7G/19dFDPCP4ATjU1xcMgIiIyKOEBw1AUkQQvFUqo3JvlQpJEUEe0/ujxKE+Z2IAREREHidrcRQSxgYalSWMDUTW4iiRWuR8tgz1kWWcA0RERB5H698P+9Jm4HTDZZxpvOyReYCUNNTnCgyAiIjIY4UFel7g0617qO/YqQajYTBvlQoJYwM99rqdhUNgREREMqWEoT5XYQ8QERGRTClhqM9VRO8Bys7ORlhYGPz8/BAdHY2ioiKLx37wwQeYM2cOgoKCoNFoEBcXh88++8zkuP3792PSpElQq9WYNGkSDhw44MpLICIiEpXQYyUY9U7UACgvLw8ZGRlYt24dysvLkZiYiHnz5qG6utrs8YWFhZgzZw7y8/NRVlaGWbNm4Ze//CXKy/+V76C0tBQpKSlITU3FiRMnkJqaikWLFuHLL79012URERG5RVNbB5bsOY7ZrxzFI3u/wqwtR7Bkz3E0t10Vu2mSpxJEDBtjY2Mxbdo07Nq1y1A2ceJELFiwAJmZmTbVccsttyAlJQXPPfccACAlJQV6vR6ffPKJ4Zg777wTgwYNQm5urk116vV6aLVaNDc3Q6PR2HFFRERE7rNkz3GLk6D3pc0QsWXisOf7W7QeoI6ODpSVlSE5OdmoPDk5GSUlJTbV0dXVhZaWFgwePNhQVlpaalLn3LlzrdbZ3t4OvV5v9CAiIpIyJkLsG9ECoIaGBnR2diI4ONioPDg4GLW1tTbV8corr+Dy5ctYtGiRoay2ttbuOjMzM6HVag2PkJAQO66EiIjI/ZgIsW9EnwSt6pGqXBAEkzJzcnNzsXHjRuTl5WHo0KF9qnPt2rVobm42PM6dO2fHFRAREbkfEyH2jWjL4AMDA+Ht7W3SM1NXV2fSg9NTXl4e0tLS8N577+GOO+4wem3YsGF216lWq6FWq+28AiIiIvEwEWLfiNYD5Ovri+joaBQUFBiVFxQUID4+3uJ5ubm5ePjhh/H222/j7rvvNnk9Li7OpM5Dhw5ZrZOIiEiOmAjRcaImQly1ahVSU1MRExODuLg47N69G9XV1UhPTwdwfWjq/Pnz2LdvH4Drwc+SJUuwfft2zJw509DT079/f2i1WgDAypUrkZSUhJdeegnz58/HRx99hM8//xzFxcXiXCQREZGLMBGi40RdBg9cT4T48ssvo6amBpGRkdi6dSuSkpIAAA8//DDOnDmDI0eOAABuu+02HD161KSOhx56CDk5OYbn77//PtavX4+qqiqMGTMGL774Iu677z6b28Rl8ERERPJjz/e36AGQFDEAIiIikh9Z5AEiIiIiEgsDICIiIlIcBkBERESkOAyAiIiISHEYABEREZHiMAAiIiIixWEARERERIrDAIiIiIgUhwEQERERKY6oe4ERERHdqKq+FWcvtnFPK3I5BkBERCS6prYOrMjVobCy3lCWFBGErMVR0Pr3E7Fl5Kk4BEZERKJbkavDsVMNRmXHTjVgeW65SC0iT8cAiIiIRFVV34rCynp09tibu1MQUFhZj9MNl0VqGXkyBkBERCSqsxfbrL5+ppEBEDkfAyAiIhJV6GB/q6+PHsLJ0OR8DICIiEhU4UEDkBQRBG+VyqjcW6VCUkQQV4ORSzAAIiIi0WUtjkLC2ECjsoSxgchaHCVSi8jTcRk8ERGJTuvfD/vSZuB0w2WcabzskjxAzDFEN2IAREREkhEW6PzghDmGyBwOgRERkUdjjiEyhwEQERF5LOYYIksYABERkcdijiGyhAEQERF5LOYYIksYABERkcdijiGyhAEQERF5NOYYInO4DJ6IiDyaO3IMkfwwACIiIkVwRY4hki8GQEREZIJZk8nTMQAiIiIDZk0mpeAkaCI
iMmDWZFIKBkBERASAWZNJWUQPgLKzsxEWFgY/Pz9ER0ejqKjI4rE1NTV44IEHMH78eHh5eSEjI8PkmJycHKhUKpPHlStXXHgVRETyx6zJpCSiBkB5eXnIyMjAunXrUF5ejsTERMybNw/V1dVmj29vb0dQUBDWrVuHKVOmWKxXo9GgpqbG6OHn5+eqyyAi8gjMmkxKImoA9OqrryItLQ1Lly7FxIkTsW3bNoSEhGDXrl1mjx89ejS2b9+OJUuWQKvVWqxXpVJh2LBhRg8iIrKOWZNJSUQLgDo6OlBWVobk5GSj8uTkZJSUlPSp7tbWVoSGhmLkyJG45557UF5uffJee3s79Hq90YOISImYNZmUQrRl8A0NDejs7ERwcLBReXBwMGprax2ud8KECcjJycHkyZOh1+uxfft2JCQk4MSJE4iIiDB7TmZmJjZt2uTwexIReQpmTSalED0PkKpHV6sgCCZl9pg5cyZmzpxpeJ6QkIBp06YhKysLO3bsMHvO2rVrsWrVKsNzvV6PkJAQh9tARCR3zJpMnk60ACgwMBDe3t4mvT11dXUmvUJ94eXlhenTp6OystLiMWq1Gmq12mnvSURE0sgmLYU2yI1SPjPRAiBfX19ER0ejoKAACxcuNJQXFBRg/vz5TnsfQRCg0+kwefJkp9VJRESWSSGbtBTaIDdK+8xEXQW2atUq/Nd//Rdef/11/PDDD3jqqadQXV2N9PR0ANeHppYsWWJ0jk6ng06nQ2trK+rr66HT6VBRUWF4fdOmTfjss89QVVUFnU6HtLQ06HQ6Q51ERORaUsgmLYU2yI3SPjNR5wClpKSgsbERmzdvRk1NDSIjI5Gfn4/Q0FAA1xMf9swJFBX1r5UIZWVlePvttxEaGoozZ84AAJqamrBs2TLU1tZCq9UiKioKhYWFmDFjhtuui4hIqbqzSfd0YzZpVw+rSKENcqPEz0z0SdCPP/44Hn/8cbOv5eTkmJQJPVK097R161Zs3brVGU0jIiI72ZJN2tVfpFJog9wo8TMTfSsMIiLyHFLIJi2FNsiNEj8zBkBEROQ0UsgmLYU2yI0SPzMGQERE5FR9ySZdVd+KwyfrTHaet1RuydPJ4zBx+E12tcHe9/A0SssCrhJ6m1SjQHq9HlqtFs3NzdBoNGI3h4hIluzJJm1pCfYLC27B+g+/t3lptrl6Im/W4A8LJ+PWkQPtem9PXf7dGzlnAbfn+5sBkBkMgIiI3GvJnuM4dqoBnTd8JXmrVND094H+52sm5QljA7EvzXR1r6V6LB3v6DkkTfZ8f3MIjIiIRNW9BLuzx9/jnYKAS21XzZZ3L822tR5zxzt6DnkGBkBERCSq3pZgW3Km0Tg4sWUpt73vbe4c8gwMgIiISFS9LcG2pOfSbEeWcitx+TddxwCIiIhEZW0J9iD/fjYvzXZkKbcSl3/TdQyAiIhIdJaWYB984hd2Lc12ZCm3Jyz/VvoSfkdwFZgZXAVGRCQOS0uw7V2a7chSbjku/+YSfmNcBt9HDICIiEgOuITfGJfBExEReTgu4e8bBkBEREQyxCX8fcMAiIiISIa4hL9vGAARERHJEJfw9w0DICIiheLSafnzhCX8YvERuwFEROReXDrtObT+/bAvbYYsl/CLjT1AREQKsyJXh2OnGozKjp1qwPLccpFaRH0VFhiAWeOHMvixAwMgIiIF4dJpousYABERKQiXThNdxwCIiEhBuHSa6DoGQERECsKl00TXMQAiIlIYLp0m4jJ4IiLF4dJpIgZARESKFRbIwIeUi0NgREREpDgMgIiIiEhxGAARERGR4jAAIiIiIsVhAERERESKI3oAlJ2djbCwMPj5+SE6OhpFRUUWj62pqcEDDzyA8ePHw8vLCxkZGWaP279/PyZNmgS1Wo1JkybhwIEDLmo9ERERyZGoAVBeXh4yMjKwbt06lJeXIzExEfPmzUN1dbXZ49vb2xEUFIR169ZhypQpZo8pLS1FSkoKUlNTceLECaS
mpmLRokX48ssvXXkpREREJCMqQeixJbAbxcbGYtq0adi1a5ehbOLEiViwYAEyMzOtnnvbbbdh6tSp2LZtm1F5SkoK9Ho9PvnkE0PZnXfeiUGDBiE3N9dsXe3t7Whvbzc81+v1CAkJQXNzMzQajQNXRkRERO6m1+uh1Wpt+v4WrQeoo6MDZWVlSE5ONipPTk5GSUmJw/WWlpaa1Dl37lyrdWZmZkKr1RoeISEhDr8/ERERSZ9oAVBDQwM6OzsRHBxsVB4cHIza2lqH662trbW7zrVr16K5udnwOHfunMPvT0RERNIn+lYYqh47EguCYFLm6jrVajXUanWf3pOIiIjkQ7QeoMDAQHh7e5v0zNTV1Zn04Nhj2LBhTq+TiIiIPItoAZCvry+io6NRUFBgVF5QUID4+HiH642LizOp89ChQ32qk4iIiDyLqENgq1atQmpqKmJiYhAXF4fdu3ejuroa6enpAK7PzTl//jz27dtnOEen0wEAWltbUV9fD51OB19fX0yaNAkAsHLlSiQlJeGll17C/Pnz8dFHH+Hzzz9HcXGx26+PiIiIpEnUACglJQWNjY3YvHkzampqEBkZifz8fISGhgK4nviwZ06gqKgow/+XlZXh7bffRmhoKM6cOQMAiI+PxzvvvIP169fj2WefxZgxY5CXl4fY2Fi3XRcRERFJm6h5gKTKnjwCREREJA2yyANEREREJBYGQERERKQ4DICIiIhIcRgAERERkeIwACIiIiLFYQBEREREisMAiIiIiBSHARAREREpDgMgIiIiUhwGQERERKQ4DICIiIhIcRgAERERkeIwACIiIiLFYQBEREREisMAiIiIiBSHARAREREpjo/YDSAiIiJlqapvxdmLbRg9JABhgQGitIEBEBEREblFU1sHVuTqUFhZbyhLighC1uIoaP37ubUtHAIjIiIit1iRq8OxUw1GZcdONWB5brnb28IAiIiIiFyuqr4VhZX16BQEo/JOQUBhZT1ON1x2a3sYABEREZHLnb3YZvX1M40MgIiIiMjDhA72t/r66CHunQzNAIiIiIhcLjxoAJIiguCtUhmVe6tUSIoIcvtqMAZARERE5BZZi6OQMDbQqCxhbCCyFke5vS1cBk9ERERuofXvh31pM3C64TLONF5mHiAiIiJSjrBA8QKfbhwCIyIiIsVhAERERESKwwCIiIiIFIcBEBERESmO6AFQdnY2wsLC4Ofnh+joaBQVFVk9/ujRo4iOjoafnx/Cw8Px5z//2ej1nJwcqFQqk8eVK1dceRlEREQkI6IGQHl5ecjIyMC6detQXl6OxMREzJs3D9XV1WaPP336NO666y4kJiaivLwcv//977FixQrs37/f6DiNRoOamhqjh5+fnzsuiYiIiGRAJQg9diVzo9jYWEybNg27du0ylE2cOBELFixAZmamyfH/7//9Pxw8eBA//PCDoSw9PR0nTpxAaWkpgOs9QBkZGWhqanK4XXq9HlqtFs3NzdBoNA7XQ0RERO5jz/e3aD1AHR0dKCsrQ3JyslF5cnIySkpKzJ5TWlpqcvzcuXPx9ddf4+rVq4ay1tZWhIaGYuTIkbjnnntQXl5utS3t7e3Q6/VGDyIiIvJcogVADQ0N6OzsRHBwsFF5cHAwamtrzZ5TW1tr9vhr166hoaEBADBhwgTk5OTg4MGDyM3NhZ+fHxISElBZWWmxLZmZmdBqtYZHSEhIH6+OiIiIpEz0SdCqHpuiCYJgUtbb8TeWz5w5Ew8++CCmTJmCxMREvPvuuxg3bhyysrIs1rl27Vo0NzcbHufOnXP0coiIiEgGRNsKIzAwEN7e3ia9PXV1dSa9PN2GDRtm9ngfHx8MGTLE7DleXl6YPn261R4gtVoNtVpteN4dVHEojIiISD66v7dtmd4sWgDk6+uL6OhoFBQUYOHChYbygoICzJ8/3+w5cXFx+Pjjj43KDh06hJiYGPTr18/sOYIgQKfTYfLkyTa3raWlBQA4FEZERCR
DLS0t0Gq1Vo8RdTPUVatWITU1FTExMYiLi8Pu3btRXV2N9PR0ANeHps6fP499+/YBuL7i609/+hNWrVqF3/72tygtLcWePXuQm5trqHPTpk2YOXMmIiIioNfrsWPHDuh0OuzcudPmdo0YMQLnzp3DTTfdZHU4jmyn1+sREhKCc+fOcWWdBPB+SAvvh7TwfkiLPfdDEAS0tLRgxIgRvdYragCUkpKCxsZGbN68GTU1NYiMjER+fj5CQ0MBADU1NUY5gcLCwpCfn4+nnnoKO3fuxIgRI7Bjxw7cf//9hmOampqwbNky1NbWQqvVIioqCoWFhZgxY4bN7fLy8sLIkSOdd6FkoNFo+A+KhPB+SAvvh7TwfkiLrfejt56fbqLmASLlYG4laeH9kBbeD2nh/ZAWV90P0VeBEREREbkbAyByC7VajQ0bNhittiPx8H5IC++HtPB+SIur7geHwIiIiEhx2ANEREREisMAiIiIiBSHARAREREpDgMgIiIiUhwGQOQ02dnZCAsLg5+fH6Kjo1FUVGTTeceOHYOPjw+mTp3q2gYqjD3348iRI1CpVCaP//u//3Njiz2bvb8f7e3tWLduHUJDQ6FWqzFmzBi8/vrrbmqt57Pnfjz88MNmfz9uueUWN7bYs9n7+/HWW29hypQp8Pf3x/Dhw/HII4+gsbHRvjcViJzgnXfeEfr16yf85S9/ESoqKoSVK1cKAQEBwtmzZ62e19TUJISHhwvJycnClClT3NNYBbD3fhw+fFgAIJw8eVKoqakxPK5du+bmlnsmR34/7r33XiE2NlYoKCgQTp8+LXz55ZfCsWPH3Nhqz2Xv/WhqajL6vTh37pwwePBgYcOGDe5tuIey934UFRUJXl5ewvbt24WqqiqhqKhIuOWWW4QFCxbY9b4MgMgpZsyYIaSnpxuVTZgwQVizZo3V81JSUoT169cLGzZsYADkRPbej+4A6NKlS25onfLYez8++eQTQavVCo2Nje5onuI4+u9VtwMHDggqlUo4c+aMK5qnOPbej//8z/8UwsPDjcp27NghjBw50q735RAY9VlHRwfKysqQnJxsVJ6cnIySkhKL5+3duxc//vgjNmzY4OomKoqj9wMAoqKiMHz4cNx+++04fPiwK5upGI7cj4MHDyImJgYvv/wybr75ZowbNw6rV6/Gzz//7I4me7S+/H5027NnD+644w7DvpXkOEfuR3x8PP75z38iPz8fgiDgp59+wvvvv4+7777brvcWdTNU8gwNDQ3o7OxEcHCwUXlwcDBqa2vNnlNZWYk1a9agqKgIPj78MXQmR+7H8OHDsXv3bkRHR6O9vR1vvvkmbr/9dhw5cgRJSUnuaLbHcuR+VFVVobi4GH5+fjhw4AAaGhrw+OOP4+LFi5wH1EeO3I8b1dTU4JNPPsHbb7/tqiYqiiP3Iz4+Hm+99RZSUlJw5coVXLt2Dffeey+ysrLsem9+85DTqFQqo+eCIJiUAUBnZyceeOABbNq0CePGjXNX8xTH1vsBAOPHj8f48eMNz+Pi4nDu3Dls2bKFAZCT2HM/urq6oFKp8NZbbxl2tn711Vfxq1/9Cjt37kT//v1d3l5PZ8/9uFFOTg4GDhyIBQsWuKhlymTP/aioqMCKFSvw3HPPYe7cuaipqcEzzzyD9PR07Nmzx+b3ZABEfRYYGAhvb2+TaL2urs4kqgeAlpYWfP311ygvL8eTTz4J4Po/+IIgwMfHB4cOHcLs2bPd0nZPZO/9sGTmzJn461//6uzmKY4j92P48OG4+eabDcEPAEycOBGCIOCf//wnIiIiXNpmT9aX3w9BEPD6668jNTUVvr6+rmymYjhyPzIzM5GQkIBnnnkGAHDrrbciICAAiYmJeOGFFzB8+HCb3ptzgKjPfH19ER0djYKCAqPygoICxMfHmxyv0Wjw3XffQafTGR7p6ekYP348dDodYmNj3dV0j2Tv/bCkvLzc5n9IyDJH7kdCQgIuXLiA1tZWQ9k//vE
PeHl5YeTIkS5tr6fry+/H0aNHcerUKaSlpbmyiYriyP1oa2uDl5dx+OLt7Q3gepBqM7umTBNZ0L2Mcc+ePUJFRYWQkZEhBAQEGFZJrFmzRkhNTbV4PleBOZe992Pr1q3CgQMHhH/84x/C//7v/wpr1qwRAAj79+8X6xI8ir33o6WlRRg5cqTwq1/9Svj++++Fo0ePChEREcLSpUvFugSP4ui/Vw8++KAQGxvr7uZ6PHvvx969ewUfHx8hOztb+PHHH4Xi4mIhJiZGmDFjhl3vyyEwcoqUlBQ0NjZi8+bNqKmpQWRkJPLz8w2rJGpqalBdXS1yK5XD3vvR0dGB1atX4/z58+jfvz9uueUW/M///A/uuususS7Bo9h7PwYMGICCggIsX74cMTExGDJkCBYtWoQXXnhBrEvwKI78e9Xc3Iz9+/dj+/btYjTZo9l7Px5++GG0tLTgT3/6E55++mkMHDgQs2fPxksvvWTX+6oEwZ7+IiIiIiL54xwgIiIiUhwGQERERKQ4DICIiIhIcRgAERERkeIwACIiIiLFYQBEREREisMAiIiIiBSHARAREREpDgMgIlIUlUqFDz/80C3vtXHjRkydOtUt70VE9mEARERuUVdXh8ceewyjRo2CWq3GsGHDMHfuXJSWloraLpVKZXgMGDAAU6ZMQU5OjkP19AysVq9ejS+++MI5DSUip+JeYETkFvfffz+uXr2KN954A+Hh4fjpp5/wxRdf4OLFi2I3DXv37sWdd96Jy5cvIy8vD4888giGDx+OuXPn9qneAQMGYMCAAU5qJRE5E3uAiMjlmpqaUFxcjJdeegmzZs1CaGgoZsyYgbVr1+Luu+8GcH2zyWXLlmHo0KHQaDSYPXs2Tpw4YaijezjptddeQ0hICPz9/fHv//7vaGpqMhzz1VdfYc6cOQgMDIRWq8W//du/4Ztvvum1fQMHDsSwYcMwZswY/P73v8fgwYNx6NAhm+sdPXo0AGDhwoVQqVSG5z2HwLq6urB582aMHDkSarUaU6dOxaeffurAJ0pEfcUAiIhcrrsn5MMPP0R7e7vJ64Ig4O6770ZtbS3y8/NRVlaGadOm4fbbbzfqITp16hTeffddfPzxx/j000+h0+nwxBNPGF5vaWnBQw89hKKiIvz9739HREQE7rrrLrS0tNjUzs7OTrz77ru4ePEi+vXrZ3O9X331FYDrPUk1NTWG5z1t374dr7zyCrZs2YJvv/0Wc+fOxb333ovKykqb2kdETiQQEbnB+++/LwwaNEjw8/MT4uPjhbVr1wonTpwQBEEQvvjiC0Gj0QhXrlwxOmfMmDHCa6+9JgiCIGzYsEHw9vYWzp07Z3j9k08+Eby8vISamhqz73nt2jXhpptuEj7++GNDGQDhwIEDRs/9/PyEgIAAwdvbWwAgDB48WKisrLR4LbbU293mKVOmGJ6PGDFCePHFF42OmT59uvD4449bfC8icg32ABGRW9x///24cOECDh48iLlz5+LIkSOYNm0acnJyUFZWhtbWVgwZMsTQWzRgwACcPn0aP/74o6GOUaNGYeTIkYbncXFx6OrqwsmTJwFcn2idnp6OcePGQavVQqvVorW1FdXV1VbbtnXrVuh0OhQUFGDq1KnYunUrxo4da3jd0XpvpNfrceHCBSQkJBiVJyQk4IcffrC5HiJyDk6CJiK38fPzw5w5czBnzhw899xzWLp0KTZs2IDHH38cw4cPx5EjR0zOGThwoMX6VCqV0X8ffvhh1NfXY9u2bQgNDYVarUZcXBw6OjqstmvYsGEYO3Ysxo4di/feew9RUVGIiYnBpEmT+lSvtTZ3EwTBpIyIXI89QEQkmkmTJuHy5cuYNm0aamtr4ePjYwhEuh+BgYGG46urq3HhwgXD89LSUnh5eWHcuHEAgKKiIqxYsQJ33XUXbrnlFqjVajQ0NNjVprFjx+L+++/H2rVrDWW21NuvXz90dnZarFej0WDEiBEoLi42Ki8pKcHEiRP
taiMR9R0DICJyucbGRsyePRt//etf8e233+L06dN477338PLLL2P+/Pm44447EBcXhwULFuCzzz7DmTNnUFJSgvXr1+Prr7821OPn54eHHnoIJ06cMAQlixYtwrBhwwBcD17efPNN/PDDD/jyyy/xm9/8Bv3797e7vU8//TQ+/vhjw3vbUu/o0aPxxRdfoLa2FpcuXTJb7zPPPIOXXnoJeXl5OHnyJNasWQOdToeVK1fa3UYi6hsGQETkcgMGDEBsbCy2bt2KpKQkREZG4tlnn8Vvf/tb/OlPf4JKpUJ+fj6SkpLw6KOPYty4cfj1r3+NM2fOIDg42FDP2LFjcd999+Guu+5CcnIyIiMjkZ2dbXj99ddfx6VLlxAVFYXU1FSsWLECQ4cOtbu9kydPxh133IHnnnvO5npfeeUVFBQUICQkBFFRUWbrXbFiBZ5++mk8/fTTmDx5Mj799FMcPHgQERERdreRiPpGJQiCIHYjiIh6s3HjRnz44YfQ6XRiN4WIPAB7gIiIiEhxGAARERGR4nAIjIiIiBSHPUBERESkOAyAiIiISHEYABEREZHiMAAiIiIixWEARERERIrDAIiIiIgUhwEQERERKQ4DICIiIlKc/w8cR5HSxTuO2QAAAABJRU5ErkJggg==", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "(\n", + " iris.query(\"SepalLength > 5\")\n", + " .assign(\n", + " SepalRatio=lambda x: x.SepalWidth / x.SepalLength,\n", + " PetalRatio=lambda x: x.PetalWidth / x.PetalLength,\n", + " )\n", + " .plot(kind=\"scatter\", x=\"SepalRatio\", y=\"PetalRatio\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "7e1e3e3d", + "metadata": {}, + "source": [ + "Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the DataFrame that's been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the ratio calculations. This is an example where we didn't have a reference to the filtered DataFrame available.\n", + "\n", + "The function signature for `assign()` is simply `**kwargs`. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a `Series` or NumPy array), or a function of one argument to be called on the `DataFrame`. A copy of the original `DataFrame` is returned, with the new values inserted.\n", + "\n", + "The order of `**kwargs` is preserved. This allows for dependent assignment, where an expression later in `**kwargs` can refer to a column created earlier in the same `assign()`." + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": "60b7e3c7", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "dfa = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "4c821875", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
01456
12579
236912
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 1 4 5 6\n", + "1 2 5 7 9\n", + "2 3 6 9 12" + ] + }, + "execution_count": 101, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfa.assign(C=lambda x: x[\"A\"] + x[\"B\"], D=lambda x: x[\"A\"] + x[\"C\"])" + ] + }, + { + "cell_type": "markdown", + "id": "822c6838", + "metadata": {}, + "source": [ + "In the second expression, `x['C']` will refer to the newly created column, that's equal to `dfa['A'] + dfa['B']`.\n", + "\n", + "#### Indexing / selection\n", + "\n", + "The basics of indexing are as follows:\n", + "\n", + "|Operation |Syntax |Result |\n", + "|:------- |:----- |:----- |\n", + "|Select column |`df[col]` |Series |\n", + "|Select row by label |`df.loc[label]`|Series |\n", + "|Select row by integer location|`df.iloc[loc]` |Series |\n", + "|Slice rows |`df[5:10] ` |DataFrame|\n", + "|Select rows by boolean vector |`df[bool_vec]` |DataFrame|\n", + "\n", + "Row selection, for example, returns a `Series` whose index is the columns of the `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "82154750", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "one 2.0\n", + "bar 2.0\n", + "flag False\n", + "foo bar\n", + "one_trunc 2.0\n", + "Name: b, dtype: object" + ] + }, + "execution_count": 102, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[\"b\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "743d6893-bbf3-4fbf-a158-a3aaae040b39", + "metadata": { + "tags": [ + "hide-cell" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "display(\n", + " HTML(\n", + " \"\"\"\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "
\n", + "

Let's visualize it! 🎥

\n", + "
\n", + " \n", + "
\n", + "
\n", + "
\n", + "\n", + "\"\"\"\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "2fae006c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "one 3.0\n", + "bar 3.0\n", + "flag True\n", + "foo bar\n", + "one_trunc NaN\n", + "Name: c, dtype: object" + ] + }, + "execution_count": 104, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.iloc[2]" + ] + }, + { + "cell_type": "markdown", + "id": "87fe370b", + "metadata": {}, + "source": [ + "#### Data alignment and arithmetic\n", + "\n", + "Data alignment between `DataFrame` objects automatically aligns on **both the columns and the index (row labels)**. Again, the resulting object will have the union of the column and row labels." + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "id": "a3e29475", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(np.random.randn(10, 4), columns=[\"A\", \"B\", \"C\", \"D\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "c4634479", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df2 = pd.DataFrame(np.random.randn(7, 3), columns=[\"A\", \"B\", \"C\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "id": "09eb77aa", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
0-2.675459-0.1335811.148420NaN
1-1.979453-0.736220-4.194590NaN
2-1.3730360.939902-1.952070NaN
30.813456-0.228460-0.634051NaN
4-0.2877481.054761-2.133658NaN
50.6974261.493623-1.633845NaN
6-0.2493571.4325541.585387NaN
7NaNNaNNaNNaN
8NaNNaNNaNNaN
9NaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 -2.675459 -0.133581 1.148420 NaN\n", + "1 -1.979453 -0.736220 -4.194590 NaN\n", + "2 -1.373036 0.939902 -1.952070 NaN\n", + "3 0.813456 -0.228460 -0.634051 NaN\n", + "4 -0.287748 1.054761 -2.133658 NaN\n", + "5 0.697426 1.493623 -1.633845 NaN\n", + "6 -0.249357 1.432554 1.585387 NaN\n", + "7 NaN NaN NaN NaN\n", + "8 NaN NaN NaN NaN\n", + "9 NaN NaN NaN NaN" + ] + }, + "execution_count": 107, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df + df2" + ] + }, + { + "cell_type": "markdown", + "id": "9062570a", + "metadata": {}, + "source": [ + "When doing an operation between `DataFrame` and `Series`, the default behavior is to align the `Series` **index** on the `DataFrame` **columns**, thus broadcasting row-wise. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "id": "c2a8adda", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
00.0000000.0000000.0000000.000000
11.1766530.958183-3.117564-0.681690
20.9357301.163312-1.466732-1.218909
31.8547240.293515-0.4723880.755568
41.6049463.032983-0.8180880.248440
52.2074101.803085-1.369634-1.490638
64.2312871.8620161.764553-1.377419
71.1820001.136687-0.938919-0.008524
81.5791371.203216-0.808539-1.454299
91.8077120.360691-1.850980-0.663877
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 0.000000 0.000000 0.000000 0.000000\n", + "1 1.176653 0.958183 -3.117564 -0.681690\n", + "2 0.935730 1.163312 -1.466732 -1.218909\n", + "3 1.854724 0.293515 -0.472388 0.755568\n", + "4 1.604946 3.032983 -0.818088 0.248440\n", + "5 2.207410 1.803085 -1.369634 -1.490638\n", + "6 4.231287 1.862016 1.764553 -1.377419\n", + "7 1.182000 1.136687 -0.938919 -0.008524\n", + "8 1.579137 1.203216 -0.808539 -1.454299\n", + "9 1.807712 0.360691 -1.850980 -0.663877" + ] + }, + "execution_count": 108, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df - df.iloc[0]" + ] + }, + { + "cell_type": "markdown", + "id": "cf0c0013", + "metadata": {}, + "source": [ + "Arithmetic operations with scalars operate element-wise:" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "id": "d4cc4904", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
0-7.038225-3.6166622.2496680.911815
1-1.1549611.174252-13.338150-2.496637
2-2.3595732.199900-5.083991-5.182731
32.235394-2.149088-0.1122704.689655
40.98650411.548255-1.8407742.154016
53.9988275.398765-4.598502-6.541375
614.1182115.69341811.072431-5.975282
7-1.1282252.066773-2.4449250.869196
80.8574612.399419-1.793024-6.359682
92.000335-1.813207-7.005231-2.407570
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 -7.038225 -3.616662 2.249668 0.911815\n", + "1 -1.154961 1.174252 -13.338150 -2.496637\n", + "2 -2.359573 2.199900 -5.083991 -5.182731\n", + "3 2.235394 -2.149088 -0.112270 4.689655\n", + "4 0.986504 11.548255 -1.840774 2.154016\n", + "5 3.998827 5.398765 -4.598502 -6.541375\n", + "6 14.118211 5.693418 11.072431 -5.975282\n", + "7 -1.128225 2.066773 -2.444925 0.869196\n", + "8 0.857461 2.399419 -1.793024 -6.359682\n", + "9 2.000335 -1.813207 -7.005231 -2.407570" + ] + }, + "execution_count": 109, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df * 5 + 2" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "id": "131ec689", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
0-0.553206-0.89020820.026571-4.594807
1-1.584805-6.055120-0.325985-1.111942
2-1.14690125.012499-0.705817-0.696114
321.240986-1.205084-2.3671221.858975
4-4.9334190.523656-1.30182132.464111
52.5014671.471123-0.757748-0.585386
60.4126021.3537600.551120-0.626937
7-1.59835074.880783-1.124878-4.421634
8-4.37622012.518198-1.318209-0.598109
914927.997228-1.311232-0.555233-1.134412
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 -0.553206 -0.890208 20.026571 -4.594807\n", + "1 -1.584805 -6.055120 -0.325985 -1.111942\n", + "2 -1.146901 25.012499 -0.705817 -0.696114\n", + "3 21.240986 -1.205084 -2.367122 1.858975\n", + "4 -4.933419 0.523656 -1.301821 32.464111\n", + "5 2.501467 1.471123 -0.757748 -0.585386\n", + "6 0.412602 1.353760 0.551120 -0.626937\n", + "7 -1.598350 74.880783 -1.124878 -4.421634\n", + "8 -4.376220 12.518198 -1.318209 -0.598109\n", + "9 14927.997228 -1.311232 -0.555233 -1.134412" + ] + }, + "execution_count": 110, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "1 / df" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "a2d50c6f", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
01.067708e+011.592330e+000.0000062.243525e-03
11.585244e-017.438906e-0488.5547616.541410e-01
25.779576e-012.554887e-064.0293224.258713e+00
34.912486e-064.741661e-010.0318518.373485e-02
41.688138e-031.329891e+010.3481739.002973e-07
52.553999e-022.135033e-013.0332038.515912e+00
63.450436e+012.977377e-0110.8396376.472977e+00
71.532188e-013.180669e-080.6245652.616188e-03
82.726487e-034.072234e-050.3311797.814101e+00
92.013696e-173.382842e-0110.5220296.038328e-01
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "0 1.067708e+01 1.592330e+00 0.000006 2.243525e-03\n", + "1 1.585244e-01 7.438906e-04 88.554761 6.541410e-01\n", + "2 5.779576e-01 2.554887e-06 4.029322 4.258713e+00\n", + "3 4.912486e-06 4.741661e-01 0.031851 8.373485e-02\n", + "4 1.688138e-03 1.329891e+01 0.348173 9.002973e-07\n", + "5 2.553999e-02 2.135033e-01 3.033203 8.515912e+00\n", + "6 3.450436e+01 2.977377e-01 10.839637 6.472977e+00\n", + "7 1.532188e-01 3.180669e-08 0.624565 2.616188e-03\n", + "8 2.726487e-03 4.072234e-05 0.331179 7.814101e+00\n", + "9 2.013696e-17 3.382842e-01 10.522029 6.038328e-01" + ] + }, + "execution_count": 111, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df ** 4" + ] + }, + { + "cell_type": "markdown", + "id": "ab0cc5cb", + "metadata": {}, + "source": [ + "Boolean operators operate element-wise as well:" + ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "edbec52a", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df1 = pd.DataFrame({\"a\": [1, 0, 1], \"b\": [0, 1, 1]}, dtype=bool)" + ] + }, + { + "cell_type": "code", + "execution_count": 113, + "id": "727cd263", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "df2 = pd.DataFrame({\"a\": [0, 1, 1], \"b\": [1, 1, 0]}, dtype=bool)" + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "id": "523bbe29", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
0FalseFalse
1FalseTrue
2TrueFalse
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 False False\n", + "1 False True\n", + "2 True False" + ] + }, + "execution_count": 114, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 & df2" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "id": "b1a355fc", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
0TrueTrue
1TrueTrue
2TrueTrue
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 True True\n", + "1 True True\n", + "2 True True" + ] + }, + "execution_count": 115, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 | df2" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "id": "e89dc58b", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
0TrueTrue
1TrueFalse
2FalseTrue
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 True True\n", + "1 True False\n", + "2 False True" + ] + }, + "execution_count": 116, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1 ^ df2" + ] + }, + { + "cell_type": "code", + "execution_count": 117, + "id": "9b438ef3", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ab
0FalseTrue
1TrueFalse
2FalseFalse
\n", + "
" + ], + "text/plain": [ + " a b\n", + "0 False True\n", + "1 True False\n", + "2 False False" + ] + }, + "execution_count": 117, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "-df1" + ] + }, + { + "cell_type": "markdown", + "id": "31d38eb7", + "metadata": {}, + "source": [ + "#### Transposing\n", + "\n", + "To transpose, access the `T` attribute or `DataFrame.transpose()`, similar to an ndarray:" + ] + }, + { + "cell_type": "code", + "execution_count": 118, + "id": "84f274b9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234
A-1.807645-0.630992-0.8719150.047079-0.202699
B-1.123332-0.1651500.039980-0.8298181.909651
C0.049934-3.067630-1.416798-0.422454-0.768155
D-0.217637-0.899327-1.4365460.5379310.030803
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3 4\n", + "A -1.807645 -0.630992 -0.871915 0.047079 -0.202699\n", + "B -1.123332 -0.165150 0.039980 -0.829818 1.909651\n", + "C 0.049934 -3.067630 -1.416798 -0.422454 -0.768155\n", + "D -0.217637 -0.899327 -1.436546 0.537931 0.030803" + ] + }, + "execution_count": 118, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[:5].T" + ] + }, + { + "cell_type": "markdown", + "id": "20c81c1c", + "metadata": {}, + "source": [ + "## Data indexing and selection\n", + "\n", + "The axis labeling information in Pandas objects serves many purposes:\n", + "\n", + "- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.\n", + "- Enables automatic and explicit data alignment.\n", + "- Allows intuitive getting and setting of subsets of the data set.\n", + "\n", + "In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of Pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.\n", + "\n", + ":::{note}\n", + "The Python and NumPy indexing operators `[]` and attribute operator `.` provide quick and easy access to Pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there's little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. 
For production code, we recommend that you take advantage of the optimized Pandas data access methods exposed in this chapter.\n", + ":::" + ] + }, + { + "cell_type": "markdown", + "id": "5f5b68a0-0590-48bc-8129-c36c6faf57db", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called `chained assignment` and should be avoided." + ] + }, + { + "cell_type": "markdown", + "id": "cbdec733", + "metadata": {}, + "source": [ + "### Different choices for indexing\n", + "\n", + "Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. Pandas now supports three types of multi-axis indexing.\n", + "\n", + "- `.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found. Allowed inputs are:\n", + " - A single label, e.g. `5` or `'a'` (note that `5` is interpreted as a label of the index, not as an integer position along the index).\n", + " - A list or array of labels `['a', 'b', 'c']`.\n", + " - A slice object with labels `'a':'f'` (note that, contrary to usual Python slices, both the start and the stop are included when present in the index).\n", + " - A boolean array (any `NA` values will be treated as `False`).\n", + " - A `callable` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).\n", + "\n", + "- `.iloc` is primarily integer position based (from `0` to `length-1` of the axis), but may also be used with a boolean array. `.iloc` will raise `IndexError` if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:\n", + " - An integer e.g. 
`5`.\n", + " - A list or array of integers `[4, 3, 0]`.\n", + " - A slice object with ints `1:7`.\n", + " - A boolean array (any `NA` values will be treated as `False`).\n", + " - A `callable` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).\n", + "- `.loc`, `.iloc`, and also `[]` indexing can accept a `callable` as indexer.\n", + "\n", + "Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but the following applies to `.iloc` as well). Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`, e.g. `p.loc['a']` is equivalent to `p.loc['a', :]`.\n", + "\n", + "|**Object Type**|**Indexers** |\n", + "|:-- |:- |\n", + "|Series |`s.loc[indexer]` |\n", + "|DataFrame |`df.loc[row_indexer, column_indexer]`|\n", + "\n", + "### Basics\n", + "\n", + "As mentioned when introducing the data structures in the last section, the primary function of indexing with `[]` (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing Pandas objects with `[]`:\n", + "\n", + "|**Object Type**|**Selection** |**Return Value Type** |\n", + "|:- |:- |:- |\n", + "|Series |`series[label]` |scalar value |\n", + "|DataFrame |`frame[colname]`|`Series` corresponding to colname|\n", + "\n", + "Here we construct a simple time series data set to use for illustrating the indexing functionality:" + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "id": "12d39083", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
2000-01-010.3114382.1227632.206814-1.488590
2000-01-02-0.4591541.6972990.8525622.389648
2000-01-031.408196-0.326012-0.137593-0.287003
2000-01-040.9261761.2133500.0394711.068784
2000-01-05-0.4909480.1083420.5530741.213043
2000-01-06-1.3782640.3746370.776962-0.125644
2000-01-070.628157-1.2399290.761446-0.847097
2000-01-080.2911231.5683501.0514890.526787
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "2000-01-01 0.311438 2.122763 2.206814 -1.488590\n", + "2000-01-02 -0.459154 1.697299 0.852562 2.389648\n", + "2000-01-03 1.408196 -0.326012 -0.137593 -0.287003\n", + "2000-01-04 0.926176 1.213350 0.039471 1.068784\n", + "2000-01-05 -0.490948 0.108342 0.553074 1.213043\n", + "2000-01-06 -1.378264 0.374637 0.776962 -0.125644\n", + "2000-01-07 0.628157 -1.239929 0.761446 -0.847097\n", + "2000-01-08 0.291123 1.568350 1.051489 0.526787" + ] + }, + "execution_count": 119, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dates = pd.date_range('1/1/2000', periods=8)\n", + "df = pd.DataFrame(np.random.randn(8, 4),\n", + " index=dates, columns=['A', 'B', 'C', 'D'])\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "da294328", + "metadata": {}, + "source": [ + ":::{note}\n", + "None of the indexing functionality is time series specific unless specifically stated.\n", + ":::\n", + "\n", + "Thus, as per above, we have the most basic indexing using `[]`:" + ] + }, + { + "cell_type": "code", + "execution_count": 120, + "id": "1eee749c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "-1.3782643792341864" + ] + }, + "execution_count": 120, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s = df['A']\n", + "\n", + "s[dates[5]]" + ] + }, + { + "cell_type": "markdown", + "id": "9c672552", + "metadata": {}, + "source": [ + "You can pass a list of columns to `[]` to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner:" + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "id": "5a18bcbc", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
2000-01-010.3114382.1227632.206814-1.488590
2000-01-02-0.4591541.6972990.8525622.389648
2000-01-031.408196-0.326012-0.137593-0.287003
2000-01-040.9261761.2133500.0394711.068784
2000-01-05-0.4909480.1083420.5530741.213043
2000-01-06-1.3782640.3746370.776962-0.125644
2000-01-070.628157-1.2399290.761446-0.847097
2000-01-080.2911231.5683501.0514890.526787
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "2000-01-01 0.311438 2.122763 2.206814 -1.488590\n", + "2000-01-02 -0.459154 1.697299 0.852562 2.389648\n", + "2000-01-03 1.408196 -0.326012 -0.137593 -0.287003\n", + "2000-01-04 0.926176 1.213350 0.039471 1.068784\n", + "2000-01-05 -0.490948 0.108342 0.553074 1.213043\n", + "2000-01-06 -1.378264 0.374637 0.776962 -0.125644\n", + "2000-01-07 0.628157 -1.239929 0.761446 -0.847097\n", + "2000-01-08 0.291123 1.568350 1.051489 0.526787" + ] + }, + "execution_count": 121, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 122, + "id": "be2e73fe", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
2000-01-012.1227630.3114382.206814-1.488590
2000-01-021.697299-0.4591540.8525622.389648
2000-01-03-0.3260121.408196-0.137593-0.287003
2000-01-041.2133500.9261760.0394711.068784
2000-01-050.108342-0.4909480.5530741.213043
2000-01-060.374637-1.3782640.776962-0.125644
2000-01-07-1.2399290.6281570.761446-0.847097
2000-01-081.5683500.2911231.0514890.526787
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "2000-01-01 2.122763 0.311438 2.206814 -1.488590\n", + "2000-01-02 1.697299 -0.459154 0.852562 2.389648\n", + "2000-01-03 -0.326012 1.408196 -0.137593 -0.287003\n", + "2000-01-04 1.213350 0.926176 0.039471 1.068784\n", + "2000-01-05 0.108342 -0.490948 0.553074 1.213043\n", + "2000-01-06 0.374637 -1.378264 0.776962 -0.125644\n", + "2000-01-07 -1.239929 0.628157 0.761446 -0.847097\n", + "2000-01-08 1.568350 0.291123 1.051489 0.526787" + ] + }, + "execution_count": 122, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[['B', 'A']] = df[['A', 'B']]\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "6e6cd9c9", + "metadata": {}, + "source": [ + "You may find this useful for applying a transform (in-place) to a subset of the columns." + ] + }, + { + "cell_type": "markdown", + "id": "b9d41a7f-5d30-40e2-8508-83b4d08e1ef1", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "Pandas aligns all AXES when setting `Series` and `DataFrame` from `.loc`, and `.iloc`.\n", + "\n", + "This will not modify `df` because the column alignment is before value assignment." + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "id": "4e8a2ee9", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
2000-01-012.1227630.311438
2000-01-021.697299-0.459154
2000-01-03-0.3260121.408196
2000-01-041.2133500.926176
2000-01-050.108342-0.490948
2000-01-060.374637-1.378264
2000-01-07-1.2399290.628157
2000-01-081.5683500.291123
\n", + "
" + ], + "text/plain": [ + " A B\n", + "2000-01-01 2.122763 0.311438\n", + "2000-01-02 1.697299 -0.459154\n", + "2000-01-03 -0.326012 1.408196\n", + "2000-01-04 1.213350 0.926176\n", + "2000-01-05 0.108342 -0.490948\n", + "2000-01-06 0.374637 -1.378264\n", + "2000-01-07 -1.239929 0.628157\n", + "2000-01-08 1.568350 0.291123" + ] + }, + "execution_count": 123, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[['A', 'B']]" + ] + }, + { + "cell_type": "code", + "execution_count": 124, + "id": "cf8c39ef", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
2000-01-012.1227630.311438
2000-01-021.697299-0.459154
2000-01-03-0.3260121.408196
2000-01-041.2133500.926176
2000-01-050.108342-0.490948
2000-01-060.374637-1.378264
2000-01-07-1.2399290.628157
2000-01-081.5683500.291123
\n", + "
" + ], + "text/plain": [ + " A B\n", + "2000-01-01 2.122763 0.311438\n", + "2000-01-02 1.697299 -0.459154\n", + "2000-01-03 -0.326012 1.408196\n", + "2000-01-04 1.213350 0.926176\n", + "2000-01-05 0.108342 -0.490948\n", + "2000-01-06 0.374637 -1.378264\n", + "2000-01-07 -1.239929 0.628157\n", + "2000-01-08 1.568350 0.291123" + ] + }, + "execution_count": 124, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[:, ['B', 'A']] = df[['A', 'B']]\n", + "df[['A', 'B']]" + ] + }, + { + "cell_type": "markdown", + "id": "4ed60d11-3f81-43b5-8274-4d896238b734", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "The correct way to swap column values is by using raw values:" + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "id": "da9754c5", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
AB
2000-01-010.3114382.122763
2000-01-02-0.4591541.697299
2000-01-031.408196-0.326012
2000-01-040.9261761.213350
2000-01-05-0.4909480.108342
2000-01-06-1.3782640.374637
2000-01-070.628157-1.239929
2000-01-080.2911231.568350
\n", + "
" + ], + "text/plain": [ + " A B\n", + "2000-01-01 0.311438 2.122763\n", + "2000-01-02 -0.459154 1.697299\n", + "2000-01-03 1.408196 -0.326012\n", + "2000-01-04 0.926176 1.213350\n", + "2000-01-05 -0.490948 0.108342\n", + "2000-01-06 -1.378264 0.374637\n", + "2000-01-07 0.628157 -1.239929\n", + "2000-01-08 0.291123 1.568350" + ] + }, + "execution_count": 125, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()\n", + "df[['A', 'B']]" + ] + }, + { + "cell_type": "markdown", + "id": "beb7928a", + "metadata": {}, + "source": [ + "### Attribute access\n", + "\n", + "You may access an index on a `Series` or column on a `DataFrame` directly as an attribute:" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "id": "86dec0c0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [], + "source": [ + "sa = pd.Series([1, 2, 3], index=list('abc'))\n", + "dfa = df.copy()" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "69ea1e07", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 127, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sa.b" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "id": "ce9f7637", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2000-01-01 0.311438\n", + "2000-01-02 -0.459154\n", + "2000-01-03 1.408196\n", + "2000-01-04 0.926176\n", + "2000-01-05 -0.490948\n", + "2000-01-06 -1.378264\n", + "2000-01-07 0.628157\n", + "2000-01-08 0.291123\n", + "Freq: D, Name: A, dtype: float64" + ] + }, + "execution_count": 128, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfa.A" + ] + }, + { + "cell_type": "code", + 
"execution_count": 129, + "id": "10cead84", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "a 5\n", + "b 2\n", + "c 3\n", + "dtype: int64" + ] + }, + "execution_count": 129, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sa.a = 5\n", + "sa" + ] + }, + { + "cell_type": "code", + "execution_count": 130, + "id": "6db24b96", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
2000-01-0102.1227632.206814-1.488590
2000-01-0211.6972990.8525622.389648
2000-01-032-0.326012-0.137593-0.287003
2000-01-0431.2133500.0394711.068784
2000-01-0540.1083420.5530741.213043
2000-01-0650.3746370.776962-0.125644
2000-01-076-1.2399290.761446-0.847097
2000-01-0871.5683501.0514890.526787
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "2000-01-01 0 2.122763 2.206814 -1.488590\n", + "2000-01-02 1 1.697299 0.852562 2.389648\n", + "2000-01-03 2 -0.326012 -0.137593 -0.287003\n", + "2000-01-04 3 1.213350 0.039471 1.068784\n", + "2000-01-05 4 0.108342 0.553074 1.213043\n", + "2000-01-06 5 0.374637 0.776962 -0.125644\n", + "2000-01-07 6 -1.239929 0.761446 -0.847097\n", + "2000-01-08 7 1.568350 1.051489 0.526787" + ] + }, + "execution_count": 130, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfa.A = list(range(len(dfa.index))) # ok if A already exists\n", + "dfa" + ] + }, + { + "cell_type": "code", + "execution_count": 131, + "id": "99790bfe", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ABCD
2000-01-0102.1227632.206814-1.488590
2000-01-0211.6972990.8525622.389648
2000-01-032-0.326012-0.137593-0.287003
2000-01-0431.2133500.0394711.068784
2000-01-0540.1083420.5530741.213043
2000-01-0650.3746370.776962-0.125644
2000-01-076-1.2399290.761446-0.847097
2000-01-0871.5683501.0514890.526787
\n", + "
" + ], + "text/plain": [ + " A B C D\n", + "2000-01-01 0 2.122763 2.206814 -1.488590\n", + "2000-01-02 1 1.697299 0.852562 2.389648\n", + "2000-01-03 2 -0.326012 -0.137593 -0.287003\n", + "2000-01-04 3 1.213350 0.039471 1.068784\n", + "2000-01-05 4 0.108342 0.553074 1.213043\n", + "2000-01-06 5 0.374637 0.776962 -0.125644\n", + "2000-01-07 6 -1.239929 0.761446 -0.847097\n", + "2000-01-08 7 1.568350 1.051489 0.526787" + ] + }, + "execution_count": 131, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column\n", + "dfa" + ] + }, + { + "cell_type": "markdown", + "id": "1dac3787-172d-4127-b483-d08921a0e060", + "metadata": { + "attributes": { + "classes": [ + "warning" + ], + "id": "" + } + }, + "source": [ + "- You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.\n", + "\n", + "- The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.\n", + "\n", + "- Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.\n", + "\n", + "- In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column." + ] + }, + { + "cell_type": "markdown", + "id": "ae10e002", + "metadata": {}, + "source": [ + "If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.\n", + "\n", + "You can also assign a `dict` to a row of a `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": 132, + "id": "29d1e1b0", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
xy
013
1999
235
\n", + "
" + ], + "text/plain": [ + " x y\n", + "0 1 3\n", + "1 9 99\n", + "2 3 5" + ] + }, + "execution_count": 132, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})\n", + "x.iloc[1] = {'x': 9, 'y': 99}\n", + "x" + ] + }, + { + "cell_type": "markdown", + "id": "9e1ee914", + "metadata": {}, + "source": [ + "You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a `UserWarning`:" + ] + }, + { + "cell_type": "code", + "execution_count": 144, + "id": "b55c8c4d", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\87554\\AppData\\Local\\Temp\\ipykernel_46616\\269534380.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access\n", + " df.two = [4, 5, 6]\n" + ] + } + ], + "source": [ + "df = pd.DataFrame({'one': [1., 2., 3.]})\n", + "df.two = [4, 5, 6]" + ] + }, + { + "cell_type": "code", + "execution_count": 135, + "id": "e0a12bf3", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
one
01.0
12.0
23.0
\n", + "
" + ], + "text/plain": [ + " one\n", + "0 1.0\n", + "1 2.0\n", + "2 3.0" + ] + }, + "execution_count": 135, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "id": "96013159", + "metadata": {}, + "source": [ + "### Slicing ranges\n", + "\n", + "For now, we explain the semantics of slicing using the [] operator.\n", + "\n", + "With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:" + ] + }, + { + "cell_type": "code", + "execution_count": 136, + "id": "ab285a63", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2000-01-01 0.311438\n", + "2000-01-02 -0.459154\n", + "2000-01-03 1.408196\n", + "2000-01-04 0.926176\n", + "2000-01-05 -0.490948\n", + "2000-01-06 -1.378264\n", + "2000-01-07 0.628157\n", + "2000-01-08 0.291123\n", + "Freq: D, Name: A, dtype: float64" + ] + }, + "execution_count": 136, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s" + ] + }, + { + "cell_type": "code", + "execution_count": 137, + "id": "73654be5", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2000-01-01 0.311438\n", + "2000-01-02 -0.459154\n", + "2000-01-03 1.408196\n", + "2000-01-04 0.926176\n", + "2000-01-05 -0.490948\n", + "Freq: D, Name: A, dtype: float64" + ] + }, + "execution_count": 137, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 138, + "id": "bafda5a6", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2000-01-01 0.311438\n", + "2000-01-03 1.408196\n", + "2000-01-05 -0.490948\n", + "2000-01-07 0.628157\n", + "Freq: 2D, Name: A, dtype: float64" + ] + 
}, + "execution_count": 138, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[::2]" + ] + }, + { + "cell_type": "code", + "execution_count": 139, + "id": "e28c3dc5", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2000-01-08 0.291123\n", + "2000-01-07 0.628157\n", + "2000-01-06 -1.378264\n", + "2000-01-05 -0.490948\n", + "2000-01-04 0.926176\n", + "2000-01-03 1.408196\n", + "2000-01-02 -0.459154\n", + "2000-01-01 0.311438\n", + "Freq: -1D, Name: A, dtype: float64" + ] + }, + "execution_count": 139, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s[::-1]" + ] + }, + { + "cell_type": "markdown", + "id": "13b86bff", + "metadata": {}, + "source": [ + "Note that setting works as well:" + ] + }, + { + "cell_type": "code", + "execution_count": 140, + "id": "46dbb94c", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2000-01-01 0.000000\n", + "2000-01-02 0.000000\n", + "2000-01-03 0.000000\n", + "2000-01-04 0.000000\n", + "2000-01-05 0.000000\n", + "2000-01-06 -1.378264\n", + "2000-01-07 0.628157\n", + "2000-01-08 0.291123\n", + "Freq: D, Name: A, dtype: float64" + ] + }, + "execution_count": 140, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s2 = s.copy()\n", + "s2[:5] = 0\n", + "s2" + ] + }, + { + "cell_type": "markdown", + "id": "89fac206", + "metadata": {}, + "source": [ + "With DataFrame, slicing inside of `[]` slices the rows. This is provided largely as a convenience since it is such a common operation." + ] + }, + { + "cell_type": "code", + "execution_count": 141, + "id": "d2c3c1af", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
one
01.0
12.0
23.0
\n", + "
" + ], + "text/plain": [ + " one\n", + "0 1.0\n", + "1 2.0\n", + "2 3.0" + ] + }, + "execution_count": 141, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 142, + "id": "c46fa1e7", + "metadata": { + "attributes": { + "classes": [ + "code-cell" + ], + "id": "" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
one
23.0
12.0
01.0
\n", + "
" + ], + "text/plain": [ + " one\n", + "2 3.0\n", + "1 2.0\n", + "0 1.0" + ] + }, + "execution_count": 142, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[::-1]" + ] + }, + { + "cell_type": "markdown", + "id": "674b4c29-0e55-4243-a46c-7fbffb51a02a", + "metadata": {}, + "source": [ + "## Acknowledgments\n", + "\n", + "Thanks for [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html). It contributes the majority of the content in this chapter." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt1-introduction-and-data-structures.ipynb b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt1-introduction-and-data-structures.ipynb deleted file mode 100644 index 26843671f8..0000000000 --- a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt1-introduction-and-data-structures.ipynb +++ /dev/null @@ -1,3295 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "c90e65da-5d8a-4295-8fd2-601a50911cd0", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "source": [ - "---\n", - "jupytext:\n", - " cell_metadata_filter: -all\n", - " formats: md:myst\n", - " text_representation:\n", - " extension: .md\n", - " format_name: myst\n", - " format_version: 0.13\n", - " jupytext_version: 1.11.5\n", - "kernelspec:\n", - " display_name: Python 3\n", - " language: python\n", - " name: python3\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "105bf8eb", - "metadata": {}, - "source": 
[ - "\n", - "# Pandas Part1-Introduction and Data Structures\n", - " \n", - "Pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.\n", - "\n", - "## Introducing Pandas objects\n", - "\n", - "In 3 parts, we’ll start with a quick, non-comprehensive overview of the fundamental data structures in Pandas to get you started. The fundamental behavior about data types, indexing, axis labeling, and alignment apply across all of the objects. To get started, import NumPy and load Pandas into your namespace:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "c8e7b835", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "# Install the necessary dependencies\n", - "import os\n", - "import sys\n", - "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython\n", - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "id": "bb9af208", - "metadata": {}, - "source": [ - "### Series\n", - "\n", - "`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. 
The basic method to create a `Series` is to call:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "6b173930", - "metadata": { - "attributes": { - "classes": [ - "py" - ], - "id": "" - } - }, - "outputs": [ - { - "ename": "NameError", - "evalue": "name 'data' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[1;32mIn[2], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m s \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mSeries(data, index\u001b[38;5;241m=\u001b[39mindex)\n", - "\u001b[1;31mNameError\u001b[0m: name 'data' is not defined" - ] - } - ], - "source": [ - "s = pd.Series(data, index=index)" - ] - }, - { - "cell_type": "markdown", - "id": "475acfee", - "metadata": {}, - "source": [ - "Here, `data` can be many different things:\n", - "\n", - "- a Python dict\n", - "- an ndarray\n", - "- a scalar value (like 5)\n", - "\n", - "\n", - "The passed **index** is a list of axis labels. Thus, this separates into a few cases depending on what the **data is**:\n", - "\n", - "#### Create a Series\n", - "\n", - "##### From ndarray\n", - "\n", - "If `data` is an ndarray, **index** must be the same length as the **data**. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "646c8580", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s = pd.Series(np.random.randn(5), index=[\"a\", \"b\", \"c\", \"d\", \"e\"])" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "2d2455c1", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a -1.569040\n", - "b -0.861428\n", - "c -0.584740\n", - "d -0.341168\n", - "e 2.749506\n", - "dtype: float64" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "20f33329", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Index(['a', 'b', 'c', 'd', 'e'], dtype='object')" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s.index" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "5376f720", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0 -0.674126\n", - "1 -1.760735\n", - "2 0.238505\n", - "3 0.548522\n", - "4 0.192064\n", - "dtype: float64" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.Series(np.random.randn(5))" - ] - }, - { - "cell_type": "markdown", - "id": "2f4e73c7", - "metadata": {}, - "source": [ - ":::{note}\n", - "Pandas supports non-unique index values. 
If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.\n", - ":::\n", - "\n", - "##### From dict\n", - "`Series` can be instantiated from dicts:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "e8095575", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "d = {\"b\": 1, \"a\": 0, \"c\": 2}" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "ba462934", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "b 1\n", - "a 0\n", - "c 2\n", - "dtype: int64" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.Series(d)" - ] - }, - { - "cell_type": "markdown", - "id": "c4329868", - "metadata": {}, - "source": [ - "If an index is passed, the values in data corresponding to the labels in the index will be pulled out." 
- ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "03488418", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "d = {\"a\": 0.0, \"b\": 1.0, \"c\": 2.0}" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "c35e968c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a 0.0\n", - "b 1.0\n", - "c 2.0\n", - "dtype: float64" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.Series(d)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "95eafc4d", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "b 1.0\n", - "c 2.0\n", - "d NaN\n", - "a 0.0\n", - "dtype: float64" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.Series(d, index=[\"b\", \"c\", \"d\", \"a\"])" - ] - }, - { - "cell_type": "markdown", - "id": "1be5c72d", - "metadata": {}, - "source": [ - ":::{note}\n", - "NaN (not a number) is the standard missing data marker used in Pandas.\n", - ":::\n", - "\n", - "##### From scalar value\n", - "\n", - "If `data` is a scalar value, an index must be provided. The value will be repeated to match the length of **index**." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "6f744115", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a 5.0\n", - "b 5.0\n", - "c 5.0\n", - "d 5.0\n", - "e 5.0\n", - "dtype: float64" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.Series(5.0, index=[\"a\", \"b\", \"c\", \"d\", \"e\"])" - ] - }, - { - "cell_type": "markdown", - "id": "8060fb92", - "metadata": {}, - "source": [ - "#### Series is ndarray-like\n", - "\n", - "`Series` acts very similarly to a `ndarray` and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "2ca453e9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "-1.5690397882045732" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "4cf8e176", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a -1.569040\n", - "b -0.861428\n", - "c -0.584740\n", - "dtype: float64" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s[:3]" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "1bab7730", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "d -0.341168\n", - "e 2.749506\n", - "dtype: float64" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s[s > s.median()]" - ] - }, - { - "cell_type": "code", - 
"execution_count": 16, - "id": "b5e98d89", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "e 2.749506\n", - "d -0.341168\n", - "b -0.861428\n", - "dtype: float64" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s[[4, 3, 1]]" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "c98a7190", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a 0.208245\n", - "b 0.422558\n", - "c 0.557251\n", - "d 0.710940\n", - "e 15.634906\n", - "dtype: float64" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "np.exp(s)" - ] - }, - { - "cell_type": "markdown", - "id": "a49ee902", - "metadata": {}, - "source": [ - "Like a NumPy array, a Pandas Series has a single `dtype`." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "b0298996", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "dtype('float64')" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s.dtype" - ] - }, - { - "cell_type": "markdown", - "id": "69857db8", - "metadata": {}, - "source": [ - "If you need the actual array backing a `Series`, use `Series.array`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "1989c3a9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "[ -1.5690397882045732, -0.8614280003412315, -0.5847401530974344,\n", - " -0.34116764319496673, 2.7495059540046984]\n", - "Length: 5, dtype: float64" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s.array" - ] - }, - { - "cell_type": "markdown", - "id": "7ed219b0", - "metadata": {}, - "source": [ - "While `Series` is ndarray-like, if you need an actual ndarray, then use `Series.to_numpy()`." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "1cc04172", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "array([-1.56903979, -0.861428 , -0.58474015, -0.34116764, 2.74950595])" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s.to_numpy()" - ] - }, - { - "cell_type": "markdown", - "id": "12f01f86", - "metadata": {}, - "source": [ - "Even if the `Series` is backed by an `ExtensionArray`, `Series.to_numpy()` will return a NumPy ndarray.\n", - "\n", - "#### Series is dict-like\n", - "\n", - "A `Series` is also like a fixed-size dict in that you can get and set values by index label:" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "bcfe90c9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "-1.5690397882045732" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s[\"a\"]" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "00c68766", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - 
"outputs": [], - "source": [ - "s[\"e\"] = 12.0" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "74f58473", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a -1.569040\n", - "b -0.861428\n", - "c -0.584740\n", - "d -0.341168\n", - "e 12.000000\n", - "dtype: float64" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "2f822110", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"e\" in s" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "164dcf61", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "False" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "\"f\" in s" - ] - }, - { - "cell_type": "markdown", - "id": "ca979c84", - "metadata": {}, - "source": [ - "If a label is not contained in the index, an exception is raised:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "40a23c62-9c88-4a6e-9316-60317abe7859", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - }, - "tags": [ - "raises-exception" - ] - }, - "outputs": [ - { - "ename": "NameError", - "evalue": "name 's' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[1;32mIn[1], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m 
s[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n", - "\u001b[1;31mNameError\u001b[0m: name 's' is not defined" - ] - } - ], - "source": [ - "s[\"f\"]" - ] - }, - { - "cell_type": "markdown", - "id": "396df6e2", - "metadata": {}, - "source": [ - "Using the `Series.get()` method, a missing label will return None or specified default:" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "ad2a67c6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s.get(\"f\")" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "13c1c13b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "nan" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s.get(\"f\", np.nan)" - ] - }, - { - "cell_type": "markdown", - "id": "1b19c44c", - "metadata": {}, - "source": [ - "These labels can also be accessed by `attribute`.\n", - "\n", - "#### Vectorized operations and label alignment with Series\n", - "\n", - "When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with `Series` in Pandas. `Series` can also be passed into most NumPy methods expecting an ndarray." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "35540134", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s + s" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aea7c1dc", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s * 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4dcdc8c4", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "np.exp(s)" - ] - }, - { - "cell_type": "markdown", - "id": "f8ed10f3", - "metadata": {}, - "source": [ - "A key difference between `Series` and ndarray is that operations between `Series` automatically align the data based on the label. Thus, you can write computations without giving consideration to whether the `Series` involved have the same labels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "563555a0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s[1:] + s[:-1]" - ] - }, - { - "cell_type": "markdown", - "id": "7e19643a", - "metadata": {}, - "source": [ - "The result of an operation between unaligned `Series` will have the **union** of the indexes involved. If a label is not found in one `Series` or the other, the result will be marked as missing `NaN`. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. 
The integrated data alignment features of the Pandas data structures set Pandas apart from the majority of related tools for working with labeled data.\n", - "\n", - ":::{note}\n", - "In general, we chose to make the default result of operations between differently indexed objects yield the **union** of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the `dropna` function.\n", - ":::\n", - "\n", - "#### Name attribute\n", - "\n", - "`Series` also has a `name` attribute:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b39834b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s = pd.Series(np.random.randn(5), name=\"something\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "18210d7f", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "06f09ce2", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s.name" - ] - }, - { - "cell_type": "markdown", - "id": "b35b499b", - "metadata": {}, - "source": [ - "The `Series` `name` can be assigned automatically in many cases, in particular, when selecting a single column from a `DataFrame`, the `name` will be assigned the column label.\n", - "\n", - "You can rename a `Series` with the `pandas.Series.rename()` method." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bd079c61", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s2 = s.rename(\"different\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a1767258", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s2.name" - ] - }, - { - "cell_type": "markdown", - "id": "398a679d", - "metadata": {}, - "source": [ - "Note that `s` and `s2` refer to different objects.\n", - "\n", - "### DataFrame\n", - "\n", - "`DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a `dict` of `Series` objects. It is generally the most commonly used Pandas object. Like `Series`, `DataFrame` accepts many different kinds of input:\n", - "\n", - "- Dict of 1D ndarrays, lists, dicts, or `Series`\n", - "- 2-D `numpy.ndarray`\n", - "- Structured or record ndarray\n", - "- A `Series`\n", - "- Another `DataFrame`\n", - "\n", - "Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting `DataFrame`. Thus, a `dict` of Series plus a specific index will discard all data not matching up to the passed index.\n", - "\n", - "If axis labels are not passed, they will be constructed from the input data based on common sense rules.\n", - "\n", - "#### Create a Dataframe\n", - "\n", - "##### From dict of `Series` or dicts\n", - "\n", - "The resulting **index** will be the **union** of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of `dict` keys." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aa7ddc8a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "d = {\n", - " \"one\": pd.Series([1.0, 2.0, 3.0], index=[\"a\", \"b\", \"c\"]),\n", - " \"two\": pd.Series([1.0, 2.0, 3.0, 4.0], index=[\"a\", \"b\", \"c\", \"d\"]),\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f526badc", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame(d)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "69ddc66c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1f5e8ccb", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(d, index=[\"d\", \"b\", \"a\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9940fb65", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(d, index=[\"d\", \"b\", \"a\"], columns=[\"two\", \"three\"])" - ] - }, - { - "cell_type": "markdown", - "id": "93b5a50c", - "metadata": {}, - "source": [ - "The row and column labels can be accessed respectively by accessing the **index** and **columns** attributes:\n", - "\n", - ":::{note}\n", - "When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.\n", - ":::" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8a3ba6ae", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.index" - ] - }, - { - "cell_type": "code", - "execution_count": null, - 
"id": "13684125", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.columns" - ] - }, - { - "cell_type": "markdown", - "id": "49c8bc9a", - "metadata": {}, - "source": [ - "##### From dict of ndarrays / lists\n", - "\n", - "The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be `range(n)`, where `n` is the array length." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c4789555", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "d = {\"one\": [1.0, 2.0, 3.0, 4.0], \"two\": [4.0, 3.0, 2.0, 1.0]}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "29098be0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(d)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5600834a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(d, index=[\"a\", \"b\", \"c\", \"d\"])" - ] - }, - { - "cell_type": "markdown", - "id": "506868de", - "metadata": {}, - "source": [ - "##### From structured or record array\n", - "\n", - "This case is handled identically to a dict of arrays." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0b3b5090", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "data = np.zeros((2,), dtype=[(\"A\", \"i4\"), (\"B\", \"f4\"), (\"C\", \"S10\")])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "543153a7", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "data[:] = [(1, 2.0, \"Hello\"), (2, 3.0, \"World\")]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c5278e68", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fefbfc51", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(data, index=[\"first\", \"second\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f76d517a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(data, columns=[\"C\", \"A\", \"B\"])" - ] - }, - { - "cell_type": "markdown", - "id": "75f7c017", - "metadata": {}, - "source": [ - ":::{note}\n", - "DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.\n", - ":::\n", - "\n", - "\n", - "##### From a list of dicts" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2aa6cb3", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "data2 = [{\"a\": 1, \"b\": 2}, {\"a\": 5, \"b\": 10, \"c\": 20}]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1e45ffbc", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - 
"source": [ - "pd.DataFrame(data2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d6db924", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(data2, index=[\"first\", \"second\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "258fa418", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(data2, columns=[\"a\", \"b\"])" - ] - }, - { - "cell_type": "markdown", - "id": "dfb77761", - "metadata": {}, - "source": [ - "##### From a dict of tuples\n", - "\n", - "You can automatically create a MultiIndexed frame by passing a tuples dictionary." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "89af5166", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(\n", - " {\n", - " (\"a\", \"b\"): {(\"A\", \"B\"): 1, (\"A\", \"C\"): 2},\n", - " (\"a\", \"a\"): {(\"A\", \"C\"): 3, (\"A\", \"B\"): 4},\n", - " (\"a\", \"c\"): {(\"A\", \"B\"): 5, (\"A\", \"C\"): 6},\n", - " (\"b\", \"a\"): {(\"A\", \"C\"): 7, (\"A\", \"B\"): 8},\n", - " (\"b\", \"b\"): {(\"A\", \"D\"): 9, (\"A\", \"B\"): 10},\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "e02d86d6", - "metadata": {}, - "source": [ - "##### From a Series\n", - "\n", - "The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "77ff8552", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "ser = pd.Series(range(3), index=list(\"abc\"), name=\"ser\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a86d1926", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame(ser)" - ] - }, - { - "cell_type": "markdown", - "id": "2f824850", - "metadata": {}, - "source": [ - "##### From a list of namedtuples\n", - "\n", - "The field names of the first `namedtuple` in the list determine the columns of the `DataFrame`. The remaining namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the `DataFrame`. If any of those tuples is shorter than the first `namedtuple` then the later columns in the corresponding row are marked as missing values. If any are longer than the first `namedtuple` , a `ValueError` is raised." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "67fd765e", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "from collections import namedtuple" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d4524af3", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "Point = namedtuple(\"Point\", \"x y\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "02f0937c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4c81da05", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "Point3D = namedtuple(\"Point3D\", \"x y z\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6731aad6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)])" - ] - }, - { - "cell_type": "markdown", - "id": "8ff0bca2", - "metadata": {}, - "source": [ - "##### From a list of dataclasses\n", - "\n", - "Data Classes as introduced in PEP557, can be passed into the DataFrame constructor. Passing a list of dataclasses is equivalent to passing a list of dictionaries.\n", - "\n", - "Please be aware, that all values in the list should be dataclasses, mixing types in the list would result in a `TypeError`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5fe92237", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "from dataclasses import make_dataclass" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e13b27cf", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "Point = make_dataclass(\"Point\", [(\"x\", int), (\"y\", int)])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "df6b2816", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])" - ] - }, - { - "cell_type": "markdown", - "id": "8e826768", - "metadata": {}, - "source": [ - "#### Column selection, addition, deletion\n", - "\n", - "You can treat a `DataFrame` semantically like a dict of like-indexed `Series` objects. 
Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a52d0734", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "804405d6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[\"one\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dfa00c9b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[\"three\"] = df[\"one\"] * df[\"two\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0f98ffa9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[\"flag\"] = df[\"one\"] > 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1ef5e1a3", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "f518cd88", - "metadata": {}, - "source": [ - "Columns can be deleted or popped like with a dict:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b418f585", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "del df[\"two\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "209ebb78", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "three = df.pop(\"three\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9aee9b49", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - 
"outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "40b5a135", - "metadata": {}, - "source": [ - "When inserting a scalar value, it will naturally be propagated to fill the column:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1bddfbc5", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[\"foo\"] = \"bar\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e2613bd3", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "d93a6895", - "metadata": {}, - "source": [ - "When inserting a `Series` that does not have the same index as the `DataFrame`, it will be conformed to the DataFrame's index:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c20564a5", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[\"one_trunc\"] = df[\"one\"][:2]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "877b972d-49b8-4225-855e-ec77bd876d8b", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76026aba", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "b7c3f5d9", - "metadata": {}, - "source": [ - "You can insert raw ndarrays but their length must match the length of the DataFrame's index.\n", - "\n", - "By default, columns get inserted at the end. `DataFrame.insert()` inserts at a particular location in the columns:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8dbfb773", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.insert(1, \"bar\", df[\"one\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27dea852", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "4786e42f", - "metadata": {}, - "source": [ - "#### Assigning new columns in method chains\n", - "\n", - "DataFrame has an `assign()` method that allows you to easily create new columns that are potentially derived from existing columns." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e9e4dead", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "iris = pd.read_csv(\"../../assets/data/iris.csv\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "38eef1a4", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "iris.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ed27d63b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "iris.assign(sepal_ratio=iris[\"SepalWidth\"] / iris[\"SepalLength\"]).head()" - ] - }, - { - "cell_type": "markdown", - "id": "c989dbf7", - "metadata": {}, - "source": [ - "In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4f39885a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "iris.assign(sepal_ratio=lambda x: (x[\"SepalWidth\"] / x[\"SepalLength\"])).head()" - ] - }, - { - "cell_type": "markdown", - "id": "abcd0aee", - "metadata": {}, - "source": [ - "`assign()` **always** returns a copy of the data, leaving the original DataFrame untouched.\n", - "\n", - "Passing a callable, as opposed to an actual value to be inserted, is useful when you don't have a reference to the DataFrame at hand. This is common when using `assign()` in a chain of operations. 
For example, we can limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0508916b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "(\n", - " iris.query(\"SepalLength > 5\")\n", - " .assign(\n", - " SepalRatio=lambda x: x.SepalWidth / x.SepalLength,\n", - " PetalRatio=lambda x: x.PetalWidth / x.PetalLength,\n", - " )\n", - " .plot(kind=\"scatter\", x=\"SepalRatio\", y=\"PetalRatio\")\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "7e1e3e3d", - "metadata": {}, - "source": [ - "Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the DataFrame that's been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the ratio calculations. This is an example where we didn't have a reference to the filtered DataFrame available.\n", - "\n", - "The function signature for `assign()` is simply `**kwargs`. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a `Series` or NumPy array), or a function of one argument to be called on the `DataFrame`. A copy of the original `DataFrame` is returned, with the new values inserted.\n", - "\n", - "The order of `**kwargs` is preserved. This allows for dependent assignment, where an expression later in `**kwargs` can refer to a column created earlier in the same `assign()`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "60b7e3c7", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfa = pd.DataFrame({\"A\": [1, 2, 3], \"B\": [4, 5, 6]})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4c821875", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfa.assign(C=lambda x: x[\"A\"] + x[\"B\"], D=lambda x: x[\"A\"] + x[\"C\"])" - ] - }, - { - "cell_type": "markdown", - "id": "822c6838", - "metadata": {}, - "source": [ - "In the second expression, `x['C']` will refer to the newly created column, that's equal to `dfa['A'] + dfa['B']`.\n", - "\n", - "#### Indexing / selection\n", - "\n", - "The basics of indexing are as follows:\n", - "\n", - "|Operation |Syntax |Result |\n", - "|:------- |:----- |:----- |\n", - "|Select column |`df[col]` |Series |\n", - "|Select row by label |`df.loc[label]`|Series |\n", - "|Select row by integer location|`df.iloc[loc]` |Series |\n", - "|Slice rows |`df[5:10] ` |DataFrame|\n", - "|Select rows by boolean vector |`df[bool_vec]` |DataFrame|\n", - "\n", - "Row selection, for example, returns a `Series` whose index is the columns of the `DataFrame`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "82154750", - "metadata": {}, - "outputs": [], - "source": [ - "df.loc[\"b\"]" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "743d6893-bbf3-4fbf-a158-a3aaae040b39", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2fae006c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.iloc[2]" - ] - }, - { - "cell_type": "markdown", - "id": "87fe370b", - "metadata": {}, - "source": [ - "#### Data alignment and arithmetic\n", - "\n", - "Data alignment between `DataFrame` objects automatically aligns on **both the columns and the index (row labels)**. Again, the resulting object will have the union of the column and row labels." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a3e29475", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame(np.random.randn(10, 4), columns=[\"A\", \"B\", \"C\", \"D\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c4634479", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df2 = pd.DataFrame(np.random.randn(7, 3), columns=[\"A\", \"B\", \"C\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "09eb77aa", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df + df2" - ] - }, - { - "cell_type": "markdown", - "id": "9062570a", - "metadata": {}, - "source": [ - "When doing an operation between `DataFrame` and `Series`, the default behavior is to align the `Series` **index** on the `DataFrame` **columns**, thus broadcasting row-wise. 
For example:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c2a8adda", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df - df.iloc[0]" - ] - }, - { - "cell_type": "markdown", - "id": "cf0c0013", - "metadata": {}, - "source": [ - "Arithmetic operations with scalars operate element-wise:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d4cc4904", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df * 5 + 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "131ec689", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "1 / df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2d50c6f", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df ** 4" - ] - }, - { - "cell_type": "markdown", - "id": "ab0cc5cb", - "metadata": {}, - "source": [ - "Boolean operators operate element-wise as well:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "edbec52a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df1 = pd.DataFrame({\"a\": [1, 0, 1], \"b\": [0, 1, 1]}, dtype=bool)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "727cd263", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df2 = pd.DataFrame({\"a\": [0, 1, 1], \"b\": [1, 1, 0]}, dtype=bool)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "523bbe29", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df1 & df2" - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "id": "b1a355fc", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df1 | df2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e89dc58b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df1 ^ df2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9b438ef3", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "-df1" - ] - }, - { - "cell_type": "markdown", - "id": "31d38eb7", - "metadata": {}, - "source": [ - "#### Transposing\n", - "\n", - "To transpose, access the `T` attribute or `DataFrame.transpose()`, similar to an ndarray:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84f274b9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[:5].T" - ] - }, - { - "cell_type": "markdown", - "id": "20c81c1c", - "metadata": {}, - "source": [ - "## Data indexing and selection\n", - "\n", - "The axis labeling information in Pandas objects serves many purposes:\n", - "\n", - "- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.\n", - "- Enables automatic and explicit data alignment.\n", - "- Allows intuitive getting and setting of subsets of the data set.\n", - "\n", - "In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of Pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.\n", - "\n", - ":::{note}\n", - "The Python and NumPy indexing operators `[]` and attribute operator `.` provide quick and easy access to Pandas data structures across a wide range of use cases. 
This makes interactive work intuitive, as there's little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized Pandas data access methods exposed in this chapter.\n", - ":::" - ] - }, - { - "cell_type": "markdown", - "id": "5f5b68a0-0590-48bc-8129-c36c6faf57db", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called `chained assignment` and should be avoided." - ] - }, - { - "cell_type": "markdown", - "id": "cbdec733", - "metadata": {}, - "source": [ - "### Different choices for indexing\n", - "\n", - "Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. Pandas now supports three types of multi-axis indexing.\n", - "\n", - "- `.loc` is primarily label based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found. Allowed inputs are:\n", - " - A single label, e.g. `5` or `'a'` (Note that `5` is interpreted as a label of the index. 
This use is not an integer position along the index.)\n", - " - A list or array of labels `['a', 'b', 'c']`.\n", - " - A slice object with labels `'a':'f'` (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index!)\n", - " - A boolean array (any `NA` values will be treated as `False`).\n", - " - A `callable` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).\n", - "\n", - "- `.iloc` is primarily integer position based (from `0` to `length-1` of the axis), but may also be used with a boolean array. `.iloc` will raise `IndexError` if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:\n", - " - An integer e.g. `5`.\n", - " - A list or array of integers `[4, 3, 0]`.\n", - " - A slice object with ints `1:7`.\n", - " - A boolean array (any `NA` values will be treated as `False`).\n", - " - A `callable` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).\n", - "- `.loc`, `.iloc`, and also `[]` indexing can accept a `callable` as indexer.\n", - "\n", - "Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but the following applies to `.iloc` as well). Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`, e.g. 
`p.loc['a']` is equivalent to `p.loc['a', :]`.\n", - "\n", - "|**Object Type**|**Indexers** |\n", - "|:-- |:- |\n", - "|Series |`s.loc[indexer]` |\n", - "|DataFrame |`df.loc[row_indexer, column_indexer]`|\n", - "\n", - "### Basics\n", - "\n", - "As mentioned when introducing the data structures in the last section, the primary function of indexing with `[]` (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing Pandas objects with `[]`:\n", - "\n", - "|**Object Type**|**Selection** |**Return Value Type** |\n", - "|:- |:- |:- |\n", - "|Series |`series[label]` |scalar value |\n", - "|DataFrame |`frame[colname]`|`Series` corresponding to colname|\n", - "\n", - "Here we construct a simple time series data set to use for illustrating the indexing functionality:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "12d39083", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dates = pd.date_range('1/1/2000', periods=8)\n", - "df = pd.DataFrame(np.random.randn(8, 4),\n", - " index=dates, columns=['A', 'B', 'C', 'D'])\n", - "df" - ] - }, - { - "cell_type": "markdown", - "id": "da294328", - "metadata": {}, - "source": [ - ":::{note}\n", - "None of the indexing functionality is time series specific unless specifically stated.\n", - ":::\n", - "\n", - "Thus, as per above, we have the most basic indexing using `[]`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1eee749c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s = df['A']\n", - "\n", - "s[dates[5]]" - ] - }, - { - "cell_type": "markdown", - "id": "9c672552", - "metadata": {}, - "source": [ - "You can pass a list of columns to `[]` to select columns in that order. 
If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a18bcbc", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "be2e73fe", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[['B', 'A']] = df[['A', 'B']]\n", - "df" - ] - }, - { - "cell_type": "markdown", - "id": "6e6cd9c9", - "metadata": {}, - "source": [ - "You may find this useful for applying a transform (in-place) to a subset of the columns." - ] - }, - { - "cell_type": "markdown", - "id": "b9d41a7f-5d30-40e2-8508-83b4d08e1ef1", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "Pandas aligns all AXES when setting `Series` and `DataFrame` from `.loc`, and `.iloc`.\n", - "\n", - "This will not modify `df` because the column alignment is before value assignment." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4e8a2ee9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[['A', 'B']]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cf8c39ef", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.loc[:, ['B', 'A']] = df[['A', 'B']]\n", - "df[['A', 'B']]" - ] - }, - { - "cell_type": "markdown", - "id": "4ed60d11-3f81-43b5-8274-4d896238b734", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "The correct way to swap column values is by using raw values:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "da9754c5", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()\n", - "df[['A', 'B']]" - ] - }, - { - "cell_type": "markdown", - "id": "beb7928a", - "metadata": {}, - "source": [ - "### Attribute access\n", - "\n", - "You may access an index on a `Series` or column on a `DataFrame` directly as an attribute:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "86dec0c0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "sa = pd.Series([1, 2, 3], index=list('abc'))\n", - "dfa = df.copy()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "69ea1e07", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "sa.b" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ce9f7637", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfa.A" - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "id": "10cead84", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "sa.a = 5\n", - "sa" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6db24b96", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfa.A = list(range(len(dfa.index))) # ok if A already exists\n", - "dfa" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "99790bfe", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column\n", - "dfa" - ] - }, - { - "cell_type": "markdown", - "id": "1dac3787-172d-4127-b483-d08921a0e060", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "- You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.\n", - "\n", - "- The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.\n", - "\n", - "- Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.\n", - "\n", - "- In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column." 
- ] - }, - { - "cell_type": "markdown", - "id": "ae10e002", - "metadata": {}, - "source": [ - "If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.\n", - "\n", - "You can also assign a `dict` to a row of a `DataFrame`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "29d1e1b0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})\n", - "x.iloc[1] = {'x': 9, 'y': 99}\n", - "x" - ] - }, - { - "cell_type": "markdown", - "id": "9e1ee914", - "metadata": {}, - "source": [ - "You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a `UserWarning`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b55c8c4d", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({'one': [1., 2., 3.]})\n", - "df.two = [4, 5, 6]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e0a12bf3", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "96013159", - "metadata": {}, - "source": [ - "### Slicing ranges\n", - "\n", - "For now, we explain the semantics of slicing using the [] operator.\n", - "\n", - "With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab285a63", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s" - ] - }, - { - 
"cell_type": "code", - "execution_count": null, - "id": "73654be5", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s[:5]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bafda5a6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s[::2]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e28c3dc5", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s[::-1]" - ] - }, - { - "cell_type": "markdown", - "id": "13b86bff", - "metadata": {}, - "source": [ - "Note that setting works as well:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46dbb94c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s2 = s.copy()\n", - "s2[:5] = 0\n", - "s2" - ] - }, - { - "cell_type": "markdown", - "id": "89fac206", - "metadata": {}, - "source": [ - "With DataFrame, slicing inside of `[]` slices the rows. This is provided largely as a convenience since it is such a common operation." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d2c3c1af", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[:3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c46fa1e7", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[::-1]" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt2-data-selection.ipynb b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt2-data-selection.ipynb deleted file mode 100644 index f410bed10e..0000000000 --- a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt2-data-selection.ipynb +++ /dev/null @@ -1,2368 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "4d6f93c8-aa6b-458d-9d0e-81244eee5808", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "source": [ - "---\n", - "jupytext:\n", - " cell_metadata_filter: -all\n", - " formats: md:myst\n", - " text_representation:\n", - " extension: .md\n", - " format_name: myst\n", - " format_version: 0.13\n", - " jupytext_version: 1.11.5\n", - "kernelspec:\n", - " display_name: Python 3\n", - " language: python\n", - " name: python3\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "70c2694f-98d3-4846-a4d2-a88ac4da4a56", - "metadata": {}, - "source": [ - "# Pandas Part2-Data Selection " - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "id": "f1931205-8c05-40ca-b266-c0f14e26cff3", - "metadata": {}, - "outputs": [], - "source": [ - "# Install the necessary dependencies\n", - "import os\n", - "import sys\n", - "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython\n", - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": "markdown", - "id": "281fa7e2", - "metadata": {}, - "source": [ - "## Selection by label" - ] - }, - { - "cell_type": "markdown", - "id": "8cfbc1d9-62b9-4f12-a249-0fb7af77d6f3", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called `chained assignment` and should be avoided." - ] - }, - { - "cell_type": "markdown", - "id": "9dd00162-d4ac-4b84-9da4-9fe7e36cbcb5", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "`.loc` is strict when you present slicers that are not compatible (or convertible) with the index type. For example using integers in a `DatetimeIndex`. These will raise a `TypeError`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "19faf0a0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfl = pd.DataFrame(np.random.randn(5, 4),\n", - " columns=list('ABCD'),\n", - " index=pd.date_range('20130101', periods=5))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5cd6165e", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - ":tags: [\"raises-exception\"]\n", - "dfl.loc[2:3]" - ] - }, - { - "cell_type": "markdown", - "id": "f2b699ce-d01f-4afa-8323-c43b9df24b38", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "String likes in slicing can be convertible to the type of the index and lead to natural slicing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3f5fb2f0", - "metadata": {}, - "outputs": [], - "source": [ - "dfl.loc['20130102':'20130104']" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "abe5968b-ffe5-4302-9918-81a1d97ed568", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "f3c046d5-19dc-47cb-828f-880f008d02d4", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "Pandas will raise a `KeyError` if indexing with a list with missing labels." - ] - }, - { - "cell_type": "markdown", - "id": "29221dda", - "metadata": {}, - "source": [ - "Pandas provides a suite of methods in order to have **purely label-based indexing**. This is a strict inclusion-based protocol. Every label asked for must be in the index, or a `KeyError` will be raised. When slicing, both the start bound **AND** the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label **and not the position**.\n", - "\n", - "The `.loc` attribute is the primary access method. The following are valid inputs:\n", - "\n", - "- A single label, e.g. `5` or `'a'` (Note that `5` is interpreted as a label of the index. This use is not an integer position along the index.)\n", - "\n", - "- A list or array of labels `['a', 'b', 'c']`.\n", - "\n", - "- A slice object with labels `'a':'f'` (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index!)\n", - "\n", - "- A boolean array.\n", - "\n", - "- A `callable`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8a174f11", - "metadata": {}, - "outputs": [], - "source": [ - "s1 = pd.Series(np.random.randn(6), index=list('abcdef'))\n", - "s1\n", - "s1.loc['c':]" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "b276bd82-797f-4eb6-8886-51153d771bb0", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "11e56acc", - "metadata": {}, - "outputs": [], - "source": [ - "s1.loc['b']" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "74a7ae51-b334-4d5f-b9a2-e2080958663f", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "eb2dbf2d-cdd9-42e4-b374-fc7944f1996f", - "metadata": {}, - "source": [ - "Note that the setting works as well:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fe78c41", - "metadata": {}, - "outputs": [], - "source": [ - "s1.loc['c':] = 0\n", - "s1" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e32f82e4-6b3e-48a7-ab56-c6ea820274e5", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "With a DataFrame:\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cfb25d9f", - "metadata": {}, - "outputs": [], - "source": [ - "df1 = pd.DataFrame(np.random.randn(6, 4),\n", - " index=list('abcdef'),\n", - " columns=list('ABCD'))\n", - "df1\n", - "df1.loc[['a', 'b', 'd'], :]" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "de1a7123-2c8e-4910-b435-cdd489baff5b", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ef15a4e0-f059-4f51-93f4-6348e1aa549a", - "metadata": {}, - "outputs": [], - "source": [ - "Accessing via label slices:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2934e9e8", - "metadata": {}, - "outputs": [], - "source": [ - "df1.loc['d':, 'A':'C']" - ] - }, - { - "cell_type": "markdown", - "id": "ca6259e9", - "metadata": {}, - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "460c6b87-9248-4b67-bdfb-7e35415d324a", - "metadata": {}, - "outputs": [], - "source": [ - "For getting a cross-section using a label (equivalent to `df.xs('a')`):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ccbffe12", - "metadata": {}, - "outputs": [], - "source": [ - "df1.loc['a']" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "c9570d12-8020-4328-94e8-91266619e666", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "589a2a99", - "metadata": {}, - "source": [ - "For getting values with a boolean array:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e60fdddf", - "metadata": {}, - "outputs": [], - "source": [ - "df1.loc['a'] > 0" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "4a9f2648-9f92-4077-a7ec-00836c2f28fd", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6226934", - "metadata": {}, - "outputs": [], - "source": [ - "df1.loc[:, df1.loc['a'] > 0]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f8ae65cd-dbea-4f40-a464-7b07554b9b11", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "0e52a617", - "metadata": {}, - "source": [ - "NA values in a boolean array propagate as `False`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0ca93c29", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "mask = pd.array([True, False, True, False, pd.NA, False], dtype=\"boolean\")\n", - "mask" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fd577bd5", - "metadata": {}, - "outputs": [], - "source": [ - "df1[mask]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4f1b5f67-5c56-4e47-8953-4d6383f283e1", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "2ff30b9c", - "metadata": {}, - "source": [ - "For getting a value explicitly:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7e425a66", - "metadata": {}, - "outputs": [], - "source": [ - "df1.loc['a', 'A'] # this is also equivalent to ``df1.at['a','A']``" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "50e88f3d-07f0-443d-994c-d7fb36c4dc7a", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "b29c0cd3", - "metadata": {}, - "source": [ - "## Slicing with labels\n", - "\n", - "When using `.loc` with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2bd13eab", - "metadata": {}, - "outputs": [], - "source": [ - "s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])\n", - "s.loc[3:5]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "63081450-8216-403c-8b53-04b2cc18e442", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "0a1f8d46", - "metadata": {}, - "source": [ - "If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a08caf62", - "metadata": {}, - "outputs": [], - "source": [ - "s.sort_index()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7d665bb1-9bd1-4826-9a0f-f13496d64549", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a5f5d2ba", - "metadata": {}, - "outputs": [], - "source": [ - "s.sort_index().loc[1:6]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "81114a6f-4511-4f2e-990b-c7edd5e4cf86", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "5115e1d2", - "metadata": {}, - "source": [ - "However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed-type indexes). For instance, in the above example, `s.loc[1:6]` would raise `KeyError`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "318b8e37", - "metadata": {}, - "outputs": [], - "source": [ - "s = pd.Series(list('abcdef'), index=[0, 3, 2, 5, 4, 2])\n", - "s.loc[3:5]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "537dd0b6-b4fc-468b-88a4-5d828eba5ed8", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "ce05682d", - "metadata": {}, - "source": [ - "\n", - "Also, if the index has duplicate labels and either the start or the stop label is duplicated, an error will be raised. For instance, in the above example, `s.loc[2:5]` would raise a `KeyError`.\n", - "\n", - "## Selection by position" - ] - }, - { - "cell_type": "markdown", - "id": "099c8fa7-d8df-4304-a513-1b142c1021d5", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called `chained assignment` and should be avoided." - ] - }, - { - "cell_type": "markdown", - "id": "9c2e1dab", - "metadata": {}, - "source": [ - "Pandas provides a suite of methods in order to get purely integer-based indexing. The semantics closely follow Python and NumPy slicing, and indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an `IndexError`.\n", - "\n", - "The `.iloc` attribute is the primary access method. The following are valid inputs:\n", - "\n", - "- An integer e.g. `5`.\n", - "\n", - "- A list or array of integers `[4, 3, 0]`.\n", - "\n", - "- A slice object with ints `1:7`.\n", - "\n", - "- A boolean array.\n", - "\n", - "- A `callable`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e7b93cb1", - "metadata": {}, - "outputs": [], - "source": [ - "s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))\n", - "s1\n", - "s1.iloc[:3]" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "24d4de8c-5c42-484b-89d7-e21ebb0ba7c3", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fe63cdf3", - "metadata": {}, - "outputs": [], - "source": [ - "s1.iloc[3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ed15834b-fd14-4000-bbdb-0eb86a214984", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "4ac478c2", - "metadata": {}, - "source": [ - "Note that setting works as well:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9c4e8129", - "metadata": {}, - "outputs": [], - "source": [ - "s1.iloc[:3] = 0\n", - "s1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5b793d9f-5ddb-4121-8218-8a5eda713eab", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "56ced074", - "metadata": {}, - "source": [ - "With a DataFrame, select via integer slicing:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3d55d682", - "metadata": {}, - "outputs": [], - "source": [ - "df1 = pd.DataFrame(np.random.randn(6, 4),\n", - " index=list(range(0, 12, 2)),\n", - " columns=list(range(0, 8, 2)))\n", - "df1\n", - "df1.iloc[:3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "172e44bf-8faf-42a1-b9a7-3adab79b97d1", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5427ec6", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[1:5, 2:4]" - ] - }, - { - "cell_type": "markdown", - "id": "550715ab", - "metadata": {}, - "source": [ - "Select via integer list:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d86dd6d1", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[[1, 3, 5], [1, 3]]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a5e2a6ba-671b-4aab-b63d-5ab4ee92501f", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8528cc39", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[1:3, :]" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "178d6f69-464f-464e-ad45-fac857b9a370", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f9288433", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[:, 1:3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "71859ce4-7ad5-4bea-9df2-f5929c0c2470", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eb3f25f3", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[1, 1] # this is also equivalent to ``df1.iat[1,1]``" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5dad7d1a-0bf5-40d8-a4ef-2c3e573ae6fc", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "6cb0234e", - "metadata": {}, - "source": [ - "\n", - "For getting a cross-section using an integer position (equiv to `df.xs(1)`):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc95030f", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[1]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfa6df43-353d-4ba4-94a0-e65c9a659468", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "bc5305b8", - "metadata": {}, - "source": [ - "Out-of-range slice indexes are handled gracefully just as in Python/NumPy." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c635e2f", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "x = list('abcdef') # these are allowed in Python/NumPy.\n", - "x" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bae9b708", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "x[4:10]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ccb95b2c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "x[8:10]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fcaaeb73", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s = pd.Series(x)\n", - "s" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "19e7f165", - "metadata": {}, - "outputs": [], - "source": [ - "s.iloc[4:10]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b612356-7774-472e-849e-0f3dc267b578", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2a25cc5c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "s.iloc[8:10]" - ] - }, - { - "cell_type": "markdown", - "id": "23aa8371", - "metadata": {}, - "source": [ - "Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f9024d15", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5837f585", - "metadata": {}, - "outputs": [], - "source": [ - "dfl.iloc[:, 2:3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4b81ac82-5d47-4410-90b9-040f0dac662b", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d0e19553", - "metadata": {}, - "outputs": [], - "source": [ - "dfl.iloc[:, 1:3]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "39dab713-a3f6-4189-bad9-cba564f56951", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f91ab868", - "metadata": {}, - "outputs": [], - "source": [ - "dfl.iloc[4:6]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "220aa5af-5003-45e9-87cf-c4f5d0ac6d93", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "59c65e15", - "metadata": {}, - "source": [ - "\n", - "A single indexer that is out of bounds will raise an `IndexError`. A list of indexers where any element is out of bounds will raise an `IndexError`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f3496be2", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - ":tags: [\"raises-exception\"]\n", - "dfl.iloc[[4, 5, 6]]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7b081f89", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - ":tags: [\"raises-exception\"]\n", - "dfl.iloc[:, 4]" - ] - }, - { - "cell_type": "markdown", - "id": "b3fe22e7", - "metadata": {}, - "source": [ - "## Selection by callable\n", - "\n", - "`.loc`, `.iloc`, and also `[]` indexing can accept a `callable` as indexer. The `callable` must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "72420538", - "metadata": {}, - "outputs": [], - "source": [ - "df1 = pd.DataFrame(np.random.randn(6, 4),\n", - " index=list('abcdef'),\n", - " columns=list('ABCD'))\n", - "df1\n", - "df1.loc[lambda df: df['A'] > 0, :]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7206088f-3aa5-4392-9982-cadec553e616", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab18a18f", - "metadata": {}, - "outputs": [], - "source": [ - "df1.loc[:, lambda df: ['A', 'B']]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2166496e-975d-4539-a3b6-54cedd012e73", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aeb4a77e", - "metadata": {}, - "outputs": [], - "source": [ - "df1.iloc[:, lambda df: [0, 1]]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e8fe3be5-15de-4036-ab8a-d6483abf265f", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec331b54", - "metadata": {}, - "outputs": [], - "source": [ - "df1[lambda df: df.columns[0]]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "31840764-a775-4e5f-8023-6c4762005ff6", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "861f0e5e", - "metadata": {}, - "source": [ - "\n", - "You can use callable indexing in `Series`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d4e60491", - "metadata": {}, - "outputs": [], - "source": [ - "df1['A'].loc[lambda s: s > 0]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1d7a46f1-98ce-4d87-924a-288812c6b4ed", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "12d2d96d", - "metadata": {}, - "source": [ - "\n", - "### Combining positional and label-based indexing\n", - "\n", - "If you wish to get the 0th and the 2nd elements from the index in the `'A'` column, you can do:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "978312bb", - "metadata": {}, - "outputs": [], - "source": [ - "dfd = pd.DataFrame({'A': [1, 2, 3],\n", - " 'B': [4, 5, 6]},\n", - " index=list('abc'))\n", - "dfd\n", - "dfd.loc[dfd.index[[0, 2]], 'A']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8844d1c-fdc5-4c85-923c-092ac6367692", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "11210c0d", - "metadata": {}, - "source": [ - "\n", - "This can also be expressed using `.iloc`, by explicitly getting locations on the indexers, and using positional indexing to select things." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2e7e25d2", - "metadata": {}, - "outputs": [], - "source": [ - "dfd.iloc[[0, 2], dfd.columns.get_loc('A')]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "48f7feb0-9334-441f-893a-42815523e739", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "d6c36e79", - "metadata": {}, - "source": [ - "\n", - "For getting multiple indexers, using `.get_indexer`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7c0b22e6", - "metadata": {}, - "outputs": [], - "source": [ - "dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c0924629-67d8-43b6-a435-d91bb8bf6408", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt3-advanced-pandas-techniques.ipynb b/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt3-advanced-pandas-techniques.ipynb deleted file mode 100644 index 0ec8d17e0f..0000000000 --- a/open-machine-learning-jupyter-book/data-science/working-with-data/pandas/pandas_Pt3-advanced-pandas-techniques.ipynb +++ /dev/null @@ -1,2634 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "73bc6d8a-5f93-4207-a34f-68f68f587837", - "metadata": { - "tags": [ - "hide-cell" - ] - }, - "source": [ - "---\n", - "jupytext:\n", - " cell_metadata_filter: -all\n", - " formats: md:myst\n", - " text_representation:\n", - " extension: .md\n", - " format_name: myst\n", - " format_version: 0.13\n", - " jupytext_version: 1.11.5\n", - "kernelspec:\n", - " display_name: Python 3\n", - " language: python\n", - " name: python3\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "aa35406e-c73d-49f1-aa84-5cc5ced6c294", - "metadata": {}, - "source": [ - "# Pandas Part3-Advanced Pandas Techniques" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "b221a566-8a04-4689-8eb1-c266ede5a264", - "metadata": {}, - "outputs": [], - "source": [ - "# Install the necessary dependencies\n", - "import os\n", - "import sys\n", - "!{sys.executable} -m pip install --quiet jupyterlab_myst ipython\n", - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": 
"markdown", - "id": "9edbb0f3", - "metadata": {}, - "source": [ - "## Combining datasets: concat, merge and join\n", - "\n", - "### concat\n", - "\n", - "- Concatenate Pandas objects along a particular axis.\n", - "\n", - "- Allows optional set logic along the other axes.\n", - "\n", - "- Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.\n", - "\n", - "For example:\n", - "\n", - "Combine two `Series`." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "b08dcc94", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0 a\n", - "1 b\n", - "0 c\n", - "1 d\n", - "dtype: object" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s1 = pd.Series(['a', 'b'])\n", - "s2 = pd.Series(['c', 'd'])\n", - "pd.concat([s1, s2])" - ] - }, - { - "cell_type": "markdown", - "id": "b1c47e7c", - "metadata": {}, - "source": [ - "Clear the existing index and reset it in the result by setting the `ignore_index` option to `True`." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "32049abb", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0 a\n", - "1 b\n", - "2 c\n", - "3 d\n", - "dtype: object" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([s1, s2], ignore_index=True)" - ] - }, - { - "cell_type": "markdown", - "id": "31f73f90", - "metadata": {}, - "source": [ - "Add a hierarchical index at the outermost level of the data with the `keys` option." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "d5b95507", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "s1 0 a\n", - " 1 b\n", - "s2 0 c\n", - " 1 d\n", - "dtype: object" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([s1, s2], keys=['s1', 's2'])" - ] - }, - { - "cell_type": "markdown", - "id": "9c618012", - "metadata": {}, - "source": [ - "Label the index keys you create with the `names` option." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "6d54830d", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Series name Row ID\n", - "s1 0 a\n", - " 1 b\n", - "s2 0 c\n", - " 1 d\n", - "dtype: object" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([s1, s2], keys=['s1', 's2'],\n", - " names=['Series name', 'Row ID'])" - ] - }, - { - "cell_type": "markdown", - "id": "31fac69f", - "metadata": {}, - "source": [ - "Combine two `DataFrame` objects with identical columns." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "fec72294", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumber
0a1
1b2
\n", - "
" - ], - "text/plain": [ - " letter number\n", - "0 a 1\n", - "1 b 2" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df1 = pd.DataFrame([['a', 1], ['b', 2]],\n", - " columns=['letter', 'number'])\n", - "df1" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "80a1f5b0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumber
0c3
1d4
\n", - "
" - ], - "text/plain": [ - " letter number\n", - "0 c 3\n", - "1 d 4" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df2 = pd.DataFrame([['c', 3], ['d', 4]],\n", - " columns=['letter', 'number'])\n", - "df2" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "4e9e65f6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumber
0a1
1b2
0c3
1d4
\n", - "
" - ], - "text/plain": [ - " letter number\n", - "0 a 1\n", - "1 b 2\n", - "0 c 3\n", - "1 d 4" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([df1, df2])" - ] - }, - { - "cell_type": "markdown", - "id": "49d878b5", - "metadata": {}, - "source": [ - "Combine `DataFrame` objects with overlapping columns and return everything. Columns outside the intersection will be filled with `NaN` values." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "f50e8ede", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumberanimal
0c3cat
1d4dog
\n", - "
" - ], - "text/plain": [ - " letter number animal\n", - "0 c 3 cat\n", - "1 d 4 dog" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],\n", - " columns=['letter', 'number', 'animal'])\n", - "df3" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "9def1cdd", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumberanimal
0a1NaN
1b2NaN
0c3cat
1d4dog
\n", - "
" - ], - "text/plain": [ - " letter number animal\n", - "0 a 1 NaN\n", - "1 b 2 NaN\n", - "0 c 3 cat\n", - "1 d 4 dog" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([df1, df3], sort=False)" - ] - }, - { - "cell_type": "markdown", - "id": "6f2fcb0c", - "metadata": {}, - "source": [ - "Combine `DataFrame` objects with overlapping columns and return only those that are shared by passing `inner` to the `join` keyword argument." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "ef69d51c", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumber
0a1
1b2
0c3
1d4
\n", - "
" - ], - "text/plain": [ - " letter number\n", - "0 a 1\n", - "1 b 2\n", - "0 c 3\n", - "1 d 4" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([df1, df3], join=\"inner\")" - ] - }, - { - "cell_type": "markdown", - "id": "0fda5cf5", - "metadata": {}, - "source": [ - "Combine `DataFrame` objects horizontally along the x-axis by passing in `axis=1`." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "2159161d", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
letternumberanimalname
0a1birdpolly
1b2monkeygeorge
\n", - "
" - ], - "text/plain": [ - " letter number animal name\n", - "0 a 1 bird polly\n", - "1 b 2 monkey george" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],\n", - " columns=['animal', 'name'])\n", - "pd.concat([df1, df4], axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "adb11ea6", - "metadata": {}, - "source": [ - "Prevent the result from including duplicate index values with the `verify_integrity` option." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "45bea28a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
0
a1
\n", - "
" - ], - "text/plain": [ - " 0\n", - "a 1" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df5 = pd.DataFrame([1], index=['a'])\n", - "df5" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "db871526", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
0
a2
\n", - "
" - ], - "text/plain": [ - " 0\n", - "a 2" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df6 = pd.DataFrame([2], index=['a'])\n", - "df6" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "1ab6b3b0", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - }, - "tags": [ - "raises-exception" - ] - }, - "outputs": [ - { - "ename": "ValueError", - "evalue": "Indexes have overlapping values: Index(['a'], dtype='object')", - "output_type": "error", - "traceback": [ - "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[1;32mIn[15], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m pd\u001b[38;5;241m.\u001b[39mconcat([df5, df6], verify_integrity\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\util\\_decorators.py:331\u001b[0m, in \u001b[0;36mdeprecate_nonkeyword_arguments..decorate..wrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 325\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(args) \u001b[38;5;241m>\u001b[39m num_allow_args:\n\u001b[0;32m 326\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[0;32m 327\u001b[0m msg\u001b[38;5;241m.\u001b[39mformat(arguments\u001b[38;5;241m=\u001b[39m_format_argument_list(allow_args)),\n\u001b[0;32m 328\u001b[0m \u001b[38;5;167;01mFutureWarning\u001b[39;00m,\n\u001b[0;32m 329\u001b[0m stacklevel\u001b[38;5;241m=\u001b[39mfind_stack_level(),\n\u001b[0;32m 330\u001b[0m )\n\u001b[1;32m--> 331\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m func(\u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\reshape\\concat.py:368\u001b[0m, in \u001b[0;36mconcat\u001b[1;34m(objs, 
axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)\u001b[0m\n\u001b[0;32m 146\u001b[0m \u001b[38;5;129m@deprecate_nonkeyword_arguments\u001b[39m(version\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m, allowed_args\u001b[38;5;241m=\u001b[39m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mobjs\u001b[39m\u001b[38;5;124m\"\u001b[39m])\n\u001b[0;32m 147\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mconcat\u001b[39m(\n\u001b[0;32m 148\u001b[0m objs: Iterable[NDFrame] \u001b[38;5;241m|\u001b[39m Mapping[HashableT, NDFrame],\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 157\u001b[0m copy: \u001b[38;5;28mbool\u001b[39m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m,\n\u001b[0;32m 158\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m DataFrame \u001b[38;5;241m|\u001b[39m Series:\n\u001b[0;32m 159\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 160\u001b[0m \u001b[38;5;124;03m Concatenate pandas objects along a particular axis.\u001b[39;00m\n\u001b[0;32m 161\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 366\u001b[0m \u001b[38;5;124;03m 1 3 4\u001b[39;00m\n\u001b[0;32m 367\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[1;32m--> 368\u001b[0m op \u001b[38;5;241m=\u001b[39m _Concatenator(\n\u001b[0;32m 369\u001b[0m objs,\n\u001b[0;32m 370\u001b[0m axis\u001b[38;5;241m=\u001b[39maxis,\n\u001b[0;32m 371\u001b[0m ignore_index\u001b[38;5;241m=\u001b[39mignore_index,\n\u001b[0;32m 372\u001b[0m join\u001b[38;5;241m=\u001b[39mjoin,\n\u001b[0;32m 373\u001b[0m keys\u001b[38;5;241m=\u001b[39mkeys,\n\u001b[0;32m 374\u001b[0m levels\u001b[38;5;241m=\u001b[39mlevels,\n\u001b[0;32m 375\u001b[0m names\u001b[38;5;241m=\u001b[39mnames,\n\u001b[0;32m 376\u001b[0m verify_integrity\u001b[38;5;241m=\u001b[39mverify_integrity,\n\u001b[0;32m 377\u001b[0m copy\u001b[38;5;241m=\u001b[39mcopy,\n\u001b[0;32m 378\u001b[0m 
sort\u001b[38;5;241m=\u001b[39msort,\n\u001b[0;32m 379\u001b[0m )\n\u001b[0;32m 381\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m op\u001b[38;5;241m.\u001b[39mget_result()\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\reshape\\concat.py:563\u001b[0m, in \u001b[0;36m_Concatenator.__init__\u001b[1;34m(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)\u001b[0m\n\u001b[0;32m 560\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mverify_integrity \u001b[38;5;241m=\u001b[39m verify_integrity\n\u001b[0;32m 561\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcopy \u001b[38;5;241m=\u001b[39m copy\n\u001b[1;32m--> 563\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnew_axes \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_new_axes()\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\reshape\\concat.py:633\u001b[0m, in \u001b[0;36m_Concatenator._get_new_axes\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 631\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_get_new_axes\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mlist\u001b[39m[Index]:\n\u001b[0;32m 632\u001b[0m ndim \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_result_dim()\n\u001b[1;32m--> 633\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m [\n\u001b[0;32m 634\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_concat_axis \u001b[38;5;28;01mif\u001b[39;00m i \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbm_axis \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_comb_axis(i)\n\u001b[0;32m 635\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(ndim)\n\u001b[0;32m 636\u001b[0m ]\n", 
- "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\reshape\\concat.py:634\u001b[0m, in \u001b[0;36m\u001b[1;34m(.0)\u001b[0m\n\u001b[0;32m 631\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_get_new_axes\u001b[39m(\u001b[38;5;28mself\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mlist\u001b[39m[Index]:\n\u001b[0;32m 632\u001b[0m ndim \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_result_dim()\n\u001b[0;32m 633\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m [\n\u001b[1;32m--> 634\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_concat_axis \u001b[38;5;28;01mif\u001b[39;00m i \u001b[38;5;241m==\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mbm_axis \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_comb_axis(i)\n\u001b[0;32m 635\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(ndim)\n\u001b[0;32m 636\u001b[0m ]\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\_libs\\properties.pyx:36\u001b[0m, in \u001b[0;36mpandas._libs.properties.CachedProperty.__get__\u001b[1;34m()\u001b[0m\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\reshape\\concat.py:697\u001b[0m, in \u001b[0;36m_Concatenator._get_concat_axis\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 692\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 693\u001b[0m concat_axis \u001b[38;5;241m=\u001b[39m _make_concat_multiindex(\n\u001b[0;32m 694\u001b[0m indexes, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mkeys, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlevels, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnames\n\u001b[0;32m 695\u001b[0m )\n\u001b[1;32m--> 697\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_maybe_check_integrity(concat_axis)\n\u001b[0;32m 699\u001b[0m 
\u001b[38;5;28;01mreturn\u001b[39;00m concat_axis\n", - "File \u001b[1;32mF:\\anaconda\\Lib\\site-packages\\pandas\\core\\reshape\\concat.py:705\u001b[0m, in \u001b[0;36m_Concatenator._maybe_check_integrity\u001b[1;34m(self, concat_index)\u001b[0m\n\u001b[0;32m 703\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m concat_index\u001b[38;5;241m.\u001b[39mis_unique:\n\u001b[0;32m 704\u001b[0m overlap \u001b[38;5;241m=\u001b[39m concat_index[concat_index\u001b[38;5;241m.\u001b[39mduplicated()]\u001b[38;5;241m.\u001b[39munique()\n\u001b[1;32m--> 705\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIndexes have overlapping values: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00moverlap\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", - "\u001b[1;31mValueError\u001b[0m: Indexes have overlapping values: Index(['a'], dtype='object')" - ] - } - ], - "source": [ - "pd.concat([df5, df6], verify_integrity=True)" - ] - }, - { - "cell_type": "markdown", - "id": "90fc36d4", - "metadata": {}, - "source": [ - "Append a single row to the end of a `DataFrame` object." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "007c1ed6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
ab
012
\n", - "
" - ], - "text/plain": [ - " a b\n", - "0 1 2" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])\n", - "df7" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "9dbaddff", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "a 3\n", - "b 4\n", - "dtype: int64" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "new_row = pd.Series({'a': 3, 'b': 4})\n", - "new_row" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "ad2d1313", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
ab
012
134
\n", - "
" - ], - "text/plain": [ - " a b\n", - "0 1 2\n", - "1 3 4" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.concat([df7, new_row.to_frame().T], ignore_index=True)" - ] - }, - { - "cell_type": "markdown", - "id": "39223d1c", - "metadata": {}, - "source": [ - ":::{note}\n", - "`append()` has been deprecated since version 1.4.0: Use `concat()` instead. \n", - ":::\n", - "\n", - "### merge\n", - "\n", - "- Merge DataFrame or named Series objects with a database-style join.\n", - "\n", - "- A named Series object is treated as a DataFrame with a single named column.\n", - "\n", - "- The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross-merge, no column specifications to merge on are allowed." - ] - }, - { - "cell_type": "markdown", - "id": "c1afc536-2209-4fa1-8d63-0b19c18c66c6", - "metadata": { - "attributes": { - "classes": [ - "warning" - ], - "id": "" - } - }, - "source": [ - "If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results." - ] - }, - { - "cell_type": "markdown", - "id": "0f2ffec1", - "metadata": {}, - "source": [ - "For example:" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "e223179b", - "metadata": {}, - "outputs": [], - "source": [ - "df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],\n", - " 'value': [1, 2, 3, 5]})\n", - "df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],\n", - " 'value': [5, 6, 7, 8]})" - ] - }, - { - "cell_type": "markdown", - "id": "ee9441ec", - "metadata": {}, - "source": [ - "Merge DataFrames `df1` and `df2` with specified left and right suffixes appended to any overlapping columns." 
- ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "e22da8fc", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
lkeyvalue_leftrkeyvalue_right
0foo1foo5
1foo1foo8
2foo5foo5
3foo5foo8
4bar2bar6
5baz3baz7
\n", - "
" - ], - "text/plain": [ - " lkey value_left rkey value_right\n", - "0 foo 1 foo 5\n", - "1 foo 1 foo 8\n", - "2 foo 5 foo 5\n", - "3 foo 5 foo 8\n", - "4 bar 2 bar 6\n", - "5 baz 3 baz 7" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "6147bab8-4644-4a23-ba71-205573a1c3f9", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "5112fc3a", - "metadata": {}, - "source": [ - "\n", - "Merge DataFrames `df1` and `df2`, but raise an exception if the DataFrames have any overlapping columns." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3dea68f6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - }, - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))" - ] - }, - { - "cell_type": "markdown", - "id": "86efca65", - "metadata": {}, - "source": [ - "Use the `how` parameter to choose the type of merge to perform." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1026fc27", - "metadata": {}, - "outputs": [], - "source": [ - "df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})\n", - "df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b4379cb1", - "metadata": {}, - "outputs": [], - "source": [ - "df1.merge(df2, how='inner', on='a')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90916930-6a8e-40e3-871e-d0043aae93d8", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2a8bb3d7", - "metadata": {}, - "outputs": [], - "source": [ - "df1.merge(df2, how='left', on='a')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "467da7f9-a710-442e-9fcf-afb4990ea3b0", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8951b7b9", - "metadata": {}, - "outputs": [], - "source": [ - "df1 = pd.DataFrame({'left': ['foo', 'bar']})\n", - "df2 = pd.DataFrame({'right': [7, 8]})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "93051401", - "metadata": {}, - "outputs": [], - "source": [ - "df1.merge(df2, how='cross')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bc243059-83f7-485c-bcd0-453d611c3d1f", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "b58237c9", - "metadata": {}, - "source": [ - "\n", - "### join\n", - "\n", - "- Join columns of another DataFrame.\n", - "\n", - "- Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.\n", - "\n", - "For example:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ad178d6", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],\n", - " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff1aa936", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],\n", - " 'B': ['B0', 'B1', 'B2']}) " - ] - }, - { - "cell_type": "markdown", - "id": "3278bb56", - "metadata": {}, - "source": [ - "Join DataFrames using their indexes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a2517b83", - "metadata": {}, - "outputs": [], - "source": [ - "df.join(other, lsuffix='_caller', rsuffix='_other')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "81738ab5-bc94-4264-bb43-8c64c041c332", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "59935609", - "metadata": {}, - "source": [ - "\n", - "If we want to join using the `key` columns, we need to set `key` to be the index in both `df` and `other`. The joined DataFrame will have `key` as its index." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91c6f0f0", - "metadata": {}, - "outputs": [], - "source": [ - "df.set_index('key').join(other.set_index('key'))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f942120e-c151-473d-aa0a-3ed6b0679204", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "1483f153", - "metadata": {}, - "source": [ - "\n", - "Another option to join using the key columns is to use the `on` parameter. `DataFrame.join` always uses `other`'s index but we can use any column in `df`. This method preserves the original DataFrame's index in the result." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d8fbb1f7", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.join(other.set_index('key'), on='key')" - ] - }, - { - "cell_type": "markdown", - "id": "0ed06755", - "metadata": {}, - "source": [ - "Using non-unique key values shows how they are matched." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b4d1eb0d", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],\n", - " 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})\n", - "df " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f6bc83d", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.join(other.set_index('key'), on='key', validate='m:1')" - ] - }, - { - "cell_type": "markdown", - "id": "61fb9627", - "metadata": {}, - "source": [ - "## Aggregation and grouping\n", - "\n", - "Group `DataFrame` using a mapper or by a `Series` of columns.\n", - "\n", - "A `groupby` operation involves some combination of splitting the object, applying a function, and combining the results. 
This can be used to group large amounts of data and compute operations on these groups.\n", - "\n", - "For example:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "38adb2b7", - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", - " 'Parrot', 'Parrot'],\n", - " 'Max Speed': [380., 370., 24., 26.]})\n", - "df\n", - "df.groupby(['Animal']).mean()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "917ba231-1ee4-4f2c-bcb9-4262d7eba119", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "84fe11db", - "metadata": {}, - "source": [ - "\n", - "### Hierarchical Indexes\n", - "\n", - "We can `groupby` different levels of a hierarchical index using the `level` parameter:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5e84fd8b", - "metadata": {}, - "outputs": [], - "source": [ - "arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],\n", - " ['Captive', 'Wild', 'Captive', 'Wild']]\n", - "index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))\n", - "df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},\n", - " index=index)\n", - "df.groupby(level=0).mean()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d6ff678-1c1e-4629-9e06-1874511ecdf0", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5a7a2d6a", - "metadata": {}, - "outputs": [], - "source": [ - "df.groupby(level=\"Type\").mean()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "31f4c668-6a8b-4dba-a6db-29673e7fbdba", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "fe08b062", - "metadata": {}, - "source": [ - "\n", - "We can also choose whether to include NA values in the group keys by setting the `dropna` parameter. The default is `True`, which drops them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f27b6536", - "metadata": {}, - "outputs": [], - "source": [ - "l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]\n", - "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", - "df.groupby(by=[\"b\"]).sum()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "47261c15-1d74-4a39-a7bb-073f6835cbf8", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "815ba4c3", - "metadata": {}, - "outputs": [], - "source": [ - "df.groupby(by=[\"b\"], dropna=False).sum()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "17c93213-8bcf-4ac8-a30d-09df48b9ca71", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "719dc004", - "metadata": {}, - "outputs": [], - "source": [ - "l = [[\"a\", 12, 12], [None, 12.3, 33.], [\"b\", 12.3, 123], [\"a\", 1, 1]]\n", - "df = pd.DataFrame(l, columns=[\"a\", \"b\", \"c\"])\n", - "df.groupby(by=\"a\").sum()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ba2d22de-ed75-4d52-a6d8-badf4791429f", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cce87c6a", - "metadata": {}, - "outputs": [], - "source": [ - "df.groupby(by=\"a\", dropna=False).sum()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "70cc2217-577e-4b8c-8fc2-ce02f036622b", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "6988f12c", - "metadata": {}, - "source": [ - "\n", - "When using `.apply()`, use `group_keys` to include or exclude the group keys. The `group_keys` argument defaults to `True` (include)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1fa5930a", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',\n", - " 'Parrot', 'Parrot'],\n", - " 'Max Speed': [380., 370., 24., 26.]})\n", - "df.groupby(\"Animal\", group_keys=True).apply(lambda x: x)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "67e4668e", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.groupby(\"Animal\", group_keys=False).apply(lambda x: x)" - ] - }, - { - "cell_type": "markdown", - "id": "c8777695", - "metadata": {}, - "source": [ - "## Pivot table\n", - "\n", - "Create a spreadsheet-style pivot table as a DataFrame.\n", - "\n", - "The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c8e1b317", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({\"A\": [\"foo\", \"foo\", \"foo\", \"foo\", \"foo\",\n", - " \"bar\", \"bar\", \"bar\", \"bar\"],\n", - " \"B\": [\"one\", \"one\", \"one\", \"two\", \"two\",\n", - " \"one\", \"one\", \"two\", \"two\"],\n", - " \"C\": [\"small\", \"large\", \"large\", \"small\",\n", - " \"small\", \"large\", \"small\", \"small\",\n", - " \"large\"],\n", - " \"D\": [1, 2, 2, 3, 3, 4, 5, 6, 7],\n", - " \"E\": [2, 4, 5, 5, 6, 6, 8, 9, 9]})\n", - "df" - ] - }, - { - "cell_type": "markdown", - "id": "ef96918e", - "metadata": {}, - "source": [ - "This first example aggregates values by taking the sum." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7206f156", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", - " columns=['C'], aggfunc='sum')\n", - "table" - ] - }, - { - "cell_type": "markdown", - "id": "e0df6460", - "metadata": {}, - "source": [ - "We can also fill in missing values using the `fill_value` parameter." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6cfd03f9", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "table = pd.pivot_table(df, values='D', index=['A', 'B'],\n", - " columns=['C'], aggfunc='sum', fill_value=0)\n", - "table" - ] - }, - { - "cell_type": "markdown", - "id": "bf713c57", - "metadata": {}, - "source": [ - "The next example aggregates by taking the mean across multiple columns." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "900dc876", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", - " aggfunc={'D': 'mean',\n", - " 'E': 'mean'})\n", - "table" - ] - }, - { - "cell_type": "markdown", - "id": "6a428fdc", - "metadata": {}, - "source": [ - "We can also calculate multiple types of aggregations for any given value column." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "36ccdfaf", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],\n", - " aggfunc={'D': 'mean',\n", - " 'E': ['min', 'max', 'mean']})\n", - "table" - ] - }, - { - "cell_type": "markdown", - "id": "19eeb851", - "metadata": {}, - "source": [ - "## High-performance Pandas: eval() and query()\n", - "\n", - "### eval()\n", - "\n", - "Evaluate a string describing operations on DataFrame columns.\n", - "\n", - "Operates on columns only, not specific rows or elements. 
This allows `eval` to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.\n", - "\n", - "For example:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db6fdd36", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92e71f86", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.eval('A + B')" - ] - }, - { - "cell_type": "markdown", - "id": "e5f51480", - "metadata": {}, - "source": [ - "Assignment is allowed, though by default the original `DataFrame` is not modified; a new `DataFrame` containing the assigned column is returned instead." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b6387047", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.eval('C = A + B')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a5322c51", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "id": "9a0a5d4d", - "metadata": {}, - "source": [ - "Use `inplace=True` to modify the original DataFrame." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "13d2dffa", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.eval('C = A + B', inplace=True)\n", - "df" - ] - }, - { - "cell_type": "markdown", - "id": "e9c14654", - "metadata": {}, - "source": [ - "Multiple columns can be assigned using multi-line expressions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8ee5ceea", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.eval(\n", - " '''\n", - " C = A + B\n", - " D = A - B\n", - " '''\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "9c052b27", - "metadata": {}, - "source": [ - "### query()\n", - "\n", - "Query the columns of a DataFrame with a boolean expression.\n", - "\n", - "For example:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d99bb798", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df = pd.DataFrame({\n", - " 'A': range(1, 6),\n", - " 'B': range(10, 0, -2),\n", - " 'C C': range(10, 5, -1)\n", - "})\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c228b08b", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.query('A > B')" - ] - }, - { - "cell_type": "markdown", - "id": "e90ed305", - "metadata": {}, - "source": [ - "The previous expression is equivalent to" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "28a30c04", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[df.A > df.B]" - ] - }, - { - "cell_type": "markdown", - "id": "454bb2b9", - "metadata": {}, - "source": [ - "For columns with spaces in their name, you can use backtick quoting." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d06bb30", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df.query('B == `C C`')" - ] - }, - { - "cell_type": "markdown", - "id": "2ac03c29", - "metadata": {}, - "source": [ - "The previous expression is equivalent to" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f8dacc1f", - "metadata": { - "attributes": { - "classes": [ - "code-cell" - ], - "id": "" - } - }, - "outputs": [], - "source": [ - "df[df.B == df['C C']]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6ec1ded1-6f8a-46ca-b304-25621fe08677", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import HTML\n", - "\n", - "display(\n", - " HTML(\n", - " \"\"\"\n", - "
\n", - "
\n", - "

Let's visualize it! 🎥

\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - "\"\"\"\n", - " )\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "id": "bc6c4cd4", - "metadata": {}, - "source": [ - "\n", - "## Your turn! 🚀\n", - "\n", - "### Processing image data\n", - "\n", - "Recently, very powerful AI models have been developed that allow us to understand images. Many tasks can be solved using pre-trained neural networks or cloud services. Some examples include:\n", - "\n", - "- **Image Classification** can help you categorize an image into one of several predefined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum).\n", - "- **Object Detection** detects different objects in an image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum) can detect a number of common objects, and you can train a [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-77958-bethanycheum) model to detect specific objects of interest.\n", - "- **Face Detection** includes age, gender, and emotion detection. 
This can be done via [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum).\n", - "\n", - "All of these cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-77958-bethanycheum), and thus can be easily incorporated into your data exploration workflow.\n", - "\n", - "Here are some examples of exploring data from image data sources:\n", - "\n", - "- In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-77958-bethanycheum), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-77958-bethanycheum) to build an interpretable model.\n", - "- In [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-77958-bethanycheum) to extract emotions from people in photographs from events, in order to understand what makes people happy.\n", - "\n", - "### Assignment\n", - "\n", - "[Perform a more detailed data study for the challenges above](../../assignments/data-science/data-processing-in-python.md)\n", - "\n", - "## Self study\n", - "\n", - "In this chapter, we've covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. 
To learn more about Pandas, we recommend the following resources:\n", - "\n", - "- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.\n", - "\n", - "- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do): Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the linked edition is now several years old, and the Pandas package has quite a few new features that it does not cover, so look for the most recent edition.\n", - "\n", - "- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming across; more than likely, you'll find your answer on a Stack Overflow page.\n", - "\n", - "- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. 
The PyCon tutorials in particular tend to be given by very well-vetted presenters.\n", - "\n", - "Using these resources, combined with the walk-through given in this chapter, we hope that you'll be poised to use Pandas to tackle any data analysis problem you come across!\n", - "\n", - "## Acknowledgments\n", - "\n", - "Thanks to the [Pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html), which contributed the majority of the content in this chapter." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}