diff --git a/course.yml b/course.yml
index 0b9a039..e3dea16 100644
--- a/course.yml
+++ b/course.yml
@@ -45,6 +45,7 @@ plan:
slug: eda4
date: 2023-10-09
materials:
+ - lesson: pydata/pandas_joins
- lesson: pydata/pandas_correlations
- title: "Svátky klidu a konce kurzu"
diff --git a/lessons/pydata/pandas_correlations/index.ipynb b/lessons/pydata/pandas_correlations/index.ipynb
index aabdc5e..7151d89 100644
--- a/lessons/pydata/pandas_correlations/index.ipynb
+++ b/lessons/pydata/pandas_correlations/index.ipynb
@@ -4,6100 +4,36 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Pandas - joining tables and relationships between multiple variables\n",
- "\n",
- "This lesson is all about multiplicity and connection - you will learn to:\n",
- "\n",
- "- work with several tables at once\n",
- "- find relationships between two (or more) variables\n",
- "\n",
- "Along the way we will go through cleaning real-world datasets together (not for the first time, and not for the last)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "# The usual imports\n",
- "import pandas as pd\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "\n",
- "%matplotlib inline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "toc-hr-collapsed": false
- },
- "source": [
- "## Joining tables\n",
- "\n",
- "In the lesson where we processed weather data, we showed you that the `concat` function can glue together several `DataFrame` or `Series` objects, provided they have a \"compatible\" index. Now we take a closer look and show how to join tables on various columns, and what to do when the rows of one table do not exactly match the other table.\n",
- "\n",
- "In general, `pandas` offers four functions / methods for joining, each with its own typical use (although their capabilities overlap):\n",
- "\n",
- "- [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) is a general-purpose function for gluing together two or more tables / columns - one below another, side by side, with or without regard to their indexes.\n",
- "- [`append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) (method) is a simpler alternative to `concat` when you just want to add a few rows to a table. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so use `concat` there.\n",
- "- [`merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) is a general-purpose function for joining tables based on a relationship between indexes or columns.\n",
- "- [`join`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) (method) simplifies things when you want to join two tables on their index.\n",
- "\n",
- "A detailed breakdown of what each one can do is in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). We will also go through them one by one."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Simple stacking"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### One below another\n",
- "\n",
- "This is probably the simplest case - we have two `Series` objects or two pieces of a table with the same columns, and we want to put one below the other. The [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function does this:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "a = pd.Series([\"jedna\", \"dvě\", \"tři\"])\n",
- "b = pd.Series([\"čtyři\", \"pět\", \"šest\"])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 jedna\n",
- "1 dvě\n",
- "2 tři\n",
- "0 čtyři\n",
- "1 pět\n",
- "2 šest\n",
- "dtype: object"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.concat([a, b])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "💡 Notice that the index repeats? We created two `Series` without caring about their index. But `pandas`, unlike us, does care, so it dutifully concatenated both indexes, even at the cost of duplicate values. Passing the extra argument `ignore_index=True` avoids this, as the following example of concatenating several `Series` shows:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "0 jedna\n",
- "1 dvě\n",
- "2 tři\n",
- "3 jedna\n",
- "4 dvě\n",
- "5 tři\n",
- "6 jedna\n",
- "7 dvě\n",
- "8 tři\n",
- "9 jedna\n",
- "10 dvě\n",
- "11 tři\n",
- "12 jedna\n",
- "13 dvě\n",
- "14 tři\n",
- "dtype: object"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.concat([a, a, a, a, a], ignore_index=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Side by side\n",
- "\n",
- "You will probably need this rarely, but to \"glue\" things to the right (say, ten `Series`), just add the familiar `axis` argument:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " 0 1 2 3 4\n",
- "0 jedna jedna jedna jedna jedna\n",
- "1 dvě dvě dvě dvě dvě\n",
- "2 tři tři tři tři tři"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.concat([a, a, a, a, a], axis=\"columns\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "**Example:** What is the quickest way to \"draw an empty chessboard\" (with both words in quotation marks)?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " A B C D E F G H\n",
- "8 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
- "7 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜\n",
- "6 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
- "5 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜\n",
- "4 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
- "3 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜\n",
- "2 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
- "1 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜"
- ]
- },
- "execution_count": 6,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "sachy = pd.concat(\n",
- " [\n",
- " pd.concat( \n",
- " [pd.DataFrame([[\"⬜\", \"⬛\"], [\"⬛\", \"⬜\"]])] * 4,\n",
- " axis=1)\n",
- " ] * 4\n",
- ")\n",
- "sachy.index = list(range(8, 0, -1))\n",
- "sachy.columns = list(\"ABCDEFGH\")\n",
- "sachy"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Joining heterogeneous tables"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "🎦 To join heterogeneous data (\"joining\" in data jargon), we reach for somewhat more complex movie data..."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have downloaded several files; let's load them (roughly for now, \"raw\"), keeping in mind that the first two are not truly \"comma-separated\" but use a tab to separate values (the `sep` argument helps here). We also take into account that the string `\"\\N\"` represents missing values in them (the `na_values` argument helps):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "imdb_titles_raw = pd.read_csv(\"title.basics.tsv.gz\", sep=\"\\t\", na_values=\"\\\\N\")\n",
- "imdb_ratings_raw = pd.read_csv(\"title.ratings.tsv.gz\", sep=\"\\t\", na_values=\"\\\\N\")\n",
- "boxoffice_raw = pd.read_csv(\"boxoffice_march_2019.csv.gz\")\n",
- "rotten_tomatoes_raw = pd.read_csv(\"rotten_tomatoes_top_movies_2019-01-15.csv\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "What does each file contain?\n",
- "\n",
- "- The first two files contain freely available (albeit \"only\" for non-commercial use) movie data from IMDb (Internet Movie Database). We picked general information and user (numeric) ratings. A detailed description of the files, along with links to further files, is at https://www.imdb.com/interfaces/. To limit memory use we stripped TV series episodes from the dataset, because they are of no interest to us, and with a bit of luck this lets us get by even on computers with less RAM.\n",
- "\n",
- "- The file `boxoffice_march_2019.csv.gz` contains information about the earnings of individual movies. It comes from the sample dataset for the \"TMDB Box Office Prediction\" competition on Kaggle: https://www.kaggle.com/c/tmdb-box-office-prediction/data\n",
- "\n",
- "- The file `rotten_tomatoes_top_movies_2019-01-15.csv` contains percentage ratings of movies from Rotten Tomatoes, computed as the share of positive reviews from film critics (so it works on a different principle than IMDb). Downloaded from: https://data.world/prasert/rotten-tomatoes-top-movies-by-genre\n",
- "\n",
- "Let's look at the shortcomings of these files and gradually put them together. We would like to know (and hopefully so would you!) how ratings relate to a movie's commercial success, and how Rotten Tomatoes ratings differ from those on IMDb."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " tconst titleType primaryTitle \\\n",
- "0 tt0000001 short Carmencita \n",
- "1 tt0000002 short Le clown et ses chiens \n",
- "2 tt0000003 short Pauvre Pierrot \n",
- "3 tt0000004 short Un bon bock \n",
- "4 tt0000005 short Blacksmith Scene \n",
- "... ... ... ... \n",
- "1783511 tt9916734 video Manca: Peleo \n",
- "1783512 tt9916754 movie Chico Albuquerque - Revelações \n",
- "1783513 tt9916756 short Pretty Pretty Black Girl \n",
- "1783514 tt9916764 short 38 \n",
- "1783515 tt9916856 short The Wind \n",
- "\n",
- " originalTitle isAdult startYear endYear \\\n",
- "0 Carmencita 0 1894.0 NaN \n",
- "1 Le clown et ses chiens 0 1892.0 NaN \n",
- "2 Pauvre Pierrot 0 1892.0 NaN \n",
- "3 Un bon bock 0 1892.0 NaN \n",
- "4 Blacksmith Scene 0 1893.0 NaN \n",
- "... ... ... ... ... \n",
- "1783511 Manca: Peleo 0 2018.0 NaN \n",
- "1783512 Chico Albuquerque - Revelações 0 2013.0 NaN \n",
- "1783513 Pretty Pretty Black Girl 0 2019.0 NaN \n",
- "1783514 38 0 2018.0 NaN \n",
- "1783515 The Wind 0 2015.0 NaN \n",
- "\n",
- " runtimeMinutes genres \n",
- "0 1.0 Documentary,Short \n",
- "1 5.0 Animation,Short \n",
- "2 4.0 Animation,Comedy,Romance \n",
- "3 NaN Animation,Short \n",
- "4 1.0 Comedy,Short \n",
- "... ... ... \n",
- "1783511 NaN Music,Short \n",
- "1783512 NaN Documentary \n",
- "1783513 NaN Short \n",
- "1783514 NaN Short \n",
- "1783515 27.0 Short \n",
- "\n",
- "[1783516 rows x 9 columns]"
- ]
- },
- "execution_count": 8,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "imdb_titles_raw"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "648.8971881866455"
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# How many megabytes of memory does the table take up? (1 MB = 2**20 bytes)\n",
- "imdb_titles_raw.memory_usage(deep=True).sum() / 2**20 "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will certainly want to convert the columns to the right types. What are they by default?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "tconst object\n",
- "titleType object\n",
- "primaryTitle object\n",
- "originalTitle object\n",
- "isAdult int64\n",
- "startYear float64\n",
- "endYear float64\n",
- "runtimeMinutes float64\n",
- "genres object\n",
- "dtype: object"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "imdb_titles_raw.dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "What will we convert them to?\n",
- "\n",
- "- `tconst` is a string that we will later use as the index, because it is a unique identifier in the IMDb database.\n",
- "- `titleType`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "titleType\n",
- "short 676930\n",
- "movie 514654\n",
- "video 227582\n",
- "tvSeries 162781\n",
- "tvMovie 126507\n",
- "tvMiniSeries 25574\n",
- "videoGame 23310\n",
- "tvSpecial 17007\n",
- "tvShort 9171\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "imdb_titles_raw[\"titleType\"].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Only nine distinct values in almost 2 million rows? That is an ideal candidate for conversion to the `\"category\"` type.\n",
- "\n",
- "- `primaryTitle` and `originalTitle` look like ordinary strings (the English title where available, and the original title where available)\n",
- "- `isAdult` indicates whether the work is intended for adults. We should most likely convert this column to `bool`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "isAdult\n",
- "0 1692292\n",
- "1 91224\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "imdb_titles_raw[\"isAdult\"].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "- `startYear` and `endYear` contain years, i.e. integers, but because of missing values they were given the `float64` type. In `pandas` we prefer the so-called \"nullable integer\", which is written with a capital \"I\". When you do not know which specific subtype to use, reach for `Int64`.\n",
- "- the same applies to `runtimeMinutes`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "startYear 2115.0\n",
- "endYear 2027.0\n",
- "runtimeMinutes 125156.0\n",
- "dtype: float64"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "imdb_titles_raw[[\"startYear\", \"endYear\", \"runtimeMinutes\"]].max()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "By the way, did you notice that we have works from the future (the year 2115)?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "startYear\n",
- "2020.0 340\n",
- "2021.0 36\n",
- "2022.0 14\n",
- "2023.0 1\n",
- "2024.0 2\n",
- "2025.0 1\n",
- "2115.0 1\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
"
- ],
- "text/plain": [
- " title \\\n",
- "tconst \n",
- "tt0000009 Miss Jerry \n",
- "tt0000147 The Corbett-Fitzsimmons Fight \n",
- "tt0000335 Soldiers of the Cross \n",
- "tt0000502 Bohemios \n",
- "tt0000574 The Story of the Kelly Gang \n",
- "... ... \n",
- "tt9916622 Rodolpho Teóphilo - O Legado de um Pioneiro \n",
- "tt9916680 De la ilusión al desconcierto: cine colombiano... \n",
- "tt9916706 Dankyavar Danka \n",
- "tt9916730 6 Gunn \n",
- "tt9916754 Chico Albuquerque - Revelações \n",
- "\n",
- " original_title is_adult year \\\n",
- "tconst \n",
- "tt0000009 Miss Jerry False 1894 \n",
- "tt0000147 The Corbett-Fitzsimmons Fight False 1897 \n",
- "tt0000335 Soldiers of the Cross False 1900 \n",
- "tt0000502 Bohemios False 1905 \n",
- "tt0000574 The Story of the Kelly Gang False 1906 \n",
- "... ... ... ... \n",
- "tt9916622 Rodolpho Teóphilo - O Legado de um Pioneiro False 2015 \n",
- "tt9916680 De la ilusión al desconcierto: cine colombiano... False 2007 \n",
- "tt9916706 Dankyavar Danka False 2013 \n",
- "tt9916730 6 Gunn False 2017 \n",
- "tt9916754 Chico Albuquerque - Revelações False 2013 \n",
- "\n",
- " length genres imdb_rating imdb_votes \n",
- "tconst \n",
- "tt0000009 45 Romance 5.5 77.0 \n",
- "tt0000147 20 Documentary,News,Sport 5.2 289.0 \n",
- "tt0000335 Biography,Drama 6.3 39.0 \n",
- "tt0000502 100 NaN NaN NaN \n",
- "tt0000574 70 Biography,Crime,Drama 6.2 505.0 \n",
- "... ... ... ... ... \n",
- "tt9916622 Documentary NaN NaN \n",
- "tt9916680 100 Documentary NaN NaN \n",
- "tt9916706 Comedy NaN NaN \n",
- "tt9916730 116 NaN NaN NaN \n",
- "tt9916754 Documentary NaN NaN \n",
- "\n",
- "[514654 rows x 8 columns]"
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "movies.join(ratings)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Two columns from the `ratings` table were quietly added to the table: the index values (i.e. `tconst`) were compared, and the parts of rows whose index matches were paired up.\n",
- "\n",
- "💡 Keep in mind (although it is not entirely obvious from the `pandas` function calls) that something fundamentally different from \"gluing to the right\" is happening here - the tables are not treated as blocks to be stacked like Lego, but as sources of data about individual objects that need to be joined semantically.\n",
- "\n",
- "As you can see, though, the table contains many rows where the rating columns are missing values (that is, they hold `NaN`). This follows from how the `join` method \"joins\" by default - it uses all rows from the left table, regardless of whether they have a counterpart in the right table. Fortunately, the `how` argument lets you specify other ways of joining:\n",
- "\n",
- "- `left` (default for the `join` method) - takes all items from the left table and the matching items from the right table (where there are none, `NaN` is filled in)\n",
- "- `right` - takes all items from the right table and the matching items from the left table (where there are none, `NaN` is filled in)\n",
- "- `inner` (default for the `merge` function) - takes only the items that are present in both the left and the right table.\n",
- "- `outer` (default for the `concat` function) - takes all items from both the left and the right table; where something is missing, `NaN` is filled in.\n",
- "\n",
- "As a Venn diagram, where the circles represent the sets of rows in the two source tables and the rows of the resulting table are highlighted in blue:\n",
- "\n",
- "![Join types](static/joins.svg)\n",
- "\n",
- "*Image adapted from https://upload.wikimedia.org/wikipedia/commons/9/9d/SQL_Joins.svg (author: Arbeck)*\n",
- "\n",
- "💡 When we cover databases, these four join types will come up again.\n",
- "\n",
- "The following listing shows how many rows we would get with different values of `how`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "movies.join(ratings, how=\"left\"): 514654 rows.\n",
- "movies.join(ratings, how=\"right\"): 923696 rows.\n",
- "movies.join(ratings, how=\"inner\"): 232496 rows.\n",
- "movies.join(ratings, how=\"outer\"): 1205854 rows.\n"
- ]
- }
- ],
- "source": [
- "for how in [\"left\", \"right\", \"inner\", \"outer\"]:\n",
- " print(f\"movies.join(ratings, how=\\\"{how}\\\"):\", movies.join(ratings, how=how).shape[0], \"rows.\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And now the three alternatives:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " title original_title \\\n",
- "tconst \n",
- "tt0000009 Miss Jerry Miss Jerry \n",
- "tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight \n",
- "tt0000335 Soldiers of the Cross Soldiers of the Cross \n",
- "tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n",
- "tt0000615 Robbery Under Arms Robbery Under Arms \n",
- "... ... ... \n",
- "tt9910930 Jeg ser deg Jeg ser deg \n",
- "tt9911774 Padmavyuhathile Abhimanyu Padmavyuhathile Abhimanyu \n",
- "tt9913056 Swarm Season Swarm Season \n",
- "tt9913084 Diabolik sono io Diabolik sono io \n",
- "tt9914286 Sokagin Çocuklari Sokagin Çocuklari \n",
- "\n",
- " is_adult year length genres imdb_rating \\\n",
- "tconst \n",
- "tt0000009 False 1894 45 Romance 5.5 \n",
- "tt0000147 False 1897 20 Documentary,News,Sport 5.2 \n",
- "tt0000335 False 1900 Biography,Drama 6.3 \n",
- "tt0000574 False 1906 70 Biography,Crime,Drama 6.2 \n",
- "tt0000615 False 1907 Drama 4.8 \n",
- "... ... ... ... ... ... \n",
- "tt9910930 False 2019 75 Crime,Documentary 4.6 \n",
- "tt9911774 False 2019 130 Drama 8.5 \n",
- "tt9913056 False 2019 86 Documentary 6.2 \n",
- "tt9913084 False 2019 75 Documentary 6.2 \n",
- "tt9914286 False 2019 98 Drama,Family 9.8 \n",
- "\n",
- " imdb_votes \n",
- "tconst \n",
- "tt0000009 77 \n",
- "tt0000147 289 \n",
- "tt0000335 39 \n",
- "tt0000574 505 \n",
- "tt0000615 14 \n",
- "... ... \n",
- "tt9910930 5 \n",
- "tt9911774 363 \n",
- "tt9913056 5 \n",
- "tt9913084 6 \n",
- "tt9914286 72 \n",
- "\n",
- "[232496 rows x 8 columns]"
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Alternative 1 (preferred)\n",
- "movies_with_rating = movies.join(ratings, how=\"inner\")\n",
- "movies_with_rating"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- ],
- "text/plain": [
- " title original_title \\\n",
- "tconst \n",
- "tt0000009 Miss Jerry Miss Jerry \n",
- "tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight \n",
- "tt0000335 Soldiers of the Cross Soldiers of the Cross \n",
- "tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n",
- "tt0000615 Robbery Under Arms Robbery Under Arms \n",
- "... ... ... \n",
- "tt9910930 Jeg ser deg Jeg ser deg \n",
- "tt9911774 Padmavyuhathile Abhimanyu Padmavyuhathile Abhimanyu \n",
- "tt9913056 Swarm Season Swarm Season \n",
- "tt9913084 Diabolik sono io Diabolik sono io \n",
- "tt9914286 Sokagin Çocuklari Sokagin Çocuklari \n",
- "\n",
- " is_adult year length genres imdb_rating \\\n",
- "tconst \n",
- "tt0000009 False 1894 45 Romance 5.5 \n",
- "tt0000147 False 1897 20 Documentary,News,Sport 5.2 \n",
- "tt0000335 False 1900 Biography,Drama 6.3 \n",
- "tt0000574 False 1906 70 Biography,Crime,Drama 6.2 \n",
- "tt0000615 False 1907 Drama 4.8 \n",
- "... ... ... ... ... ... \n",
- "tt9910930 False 2019 75 Crime,Documentary 4.6 \n",
- "tt9911774 False 2019 130 Drama 8.5 \n",
- "tt9913056 False 2019 86 Documentary 6.2 \n",
- "tt9913084 False 2019 75 Documentary 6.2 \n",
- "tt9914286 False 2019 98 Drama,Family 9.8 \n",
- "\n",
- " imdb_votes \n",
- "tconst \n",
- "tt0000009 77 \n",
- "tt0000147 289 \n",
- "tt0000335 39 \n",
- "tt0000574 505 \n",
- "tt0000615 14 \n",
- "... ... \n",
- "tt9910930 5 \n",
- "tt9911774 363 \n",
- "tt9913056 5 \n",
- "tt9913084 6 \n",
- "tt9914286 72 \n",
- "\n",
- "[232496 rows x 8 columns]"
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Alternative 2 (also good)\n",
- "pd.merge(movies, ratings, left_index=True, right_index=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 232496 rows × 8 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " title original_title \\\n",
- "tconst \n",
- "tt0000009 Miss Jerry Miss Jerry \n",
- "tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight \n",
- "tt0000335 Soldiers of the Cross Soldiers of the Cross \n",
- "tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n",
- "tt0000615 Robbery Under Arms Robbery Under Arms \n",
- "... ... ... \n",
- "tt9910930 Jeg ser deg Jeg ser deg \n",
- "tt9911774 Padmavyuhathile Abhimanyu Padmavyuhathile Abhimanyu \n",
- "tt9913056 Swarm Season Swarm Season \n",
- "tt9913084 Diabolik sono io Diabolik sono io \n",
- "tt9914286 Sokagin Çocuklari Sokagin Çocuklari \n",
- "\n",
- " is_adult year length genres imdb_rating \\\n",
- "tconst \n",
- "tt0000009 False 1894 45 Romance 5.5 \n",
- "tt0000147 False 1897 20 Documentary,News,Sport 5.2 \n",
- "tt0000335 False 1900 Biography,Drama 6.3 \n",
- "tt0000574 False 1906 70 Biography,Crime,Drama 6.2 \n",
- "tt0000615 False 1907 Drama 4.8 \n",
- "... ... ... ... ... ... \n",
- "tt9910930 False 2019 75 Crime,Documentary 4.6 \n",
- "tt9911774 False 2019 130 Drama 8.5 \n",
- "tt9913056 False 2019 86 Documentary 6.2 \n",
- "tt9913084 False 2019 75 Documentary 6.2 \n",
- "tt9914286 False 2019 98 Drama,Family 9.8 \n",
- "\n",
- " imdb_votes \n",
- "tconst \n",
- "tt0000009 77 \n",
- "tt0000147 289 \n",
- "tt0000335 39 \n",
- "tt0000574 505 \n",
- "tt0000615 14 \n",
- "... ... \n",
- "tt9910930 5 \n",
- "tt9911774 363 \n",
- "tt9913056 5 \n",
- "tt9913084 6 \n",
- "tt9914286 72 \n",
- "\n",
- "[232496 rows x 8 columns]"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Alternative 3 (less \"semantic\")\n",
- "pd.concat([movies, ratings], axis=\"columns\", join=\"inner\")"
- ]
- },
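As a small aside, here is a minimal sketch showing that the index-based `pd.merge` and the inner `pd.concat` give the same result. The two toy tables are made up for illustration and only mimic the shape of `movies` and `ratings`; they are not the lesson's data.

```python
import pandas as pd

# Hypothetical toy tables that share an index of film IDs
movies = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index(["tt1", "tt2", "tt3"], name="tconst"),
)
ratings = pd.DataFrame(
    {"imdb_rating": [7.1, 5.4]},
    index=pd.Index(["tt1", "tt3"], name="tconst"),
)

# Join on the index of both tables (inner join is the default for merge)
via_merge = pd.merge(movies, ratings, left_index=True, right_index=True)

# Glue the tables side by side, keeping only index values present in both
via_concat = pd.concat([movies, ratings], axis="columns", join="inner")

print(via_merge)
print(via_concat)
```

Film `tt2` has no rating, so it drops out of both results.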
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's try to reproduce the ranking of the 250 best films on IMDb (see https://www.imdb.com/chart/top/?ref_=nv_mv_250):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 250 rows × 8 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " title original_title is_adult year \\\n",
- "0 The Chaos Class Hababam Sinifi False 1975 \n",
- "1 The Shawshank Redemption The Shawshank Redemption False 1994 \n",
- "2 The Mountain II Dag II False 2016 \n",
- "3 CM101MMXI Fundamentals CM101MMXI Fundamentals False 2013 \n",
- "4 The Godfather The Godfather False 1972 \n",
- ".. ... ... ... ... \n",
- "245 12 Years a Slave 12 Years a Slave False 2013 \n",
- "246 The Sixth Sense The Sixth Sense False 1999 \n",
- "247 The Passion of Joan of Arc La passion de Jeanne d'Arc False 1928 \n",
- "248 Barfi! Barfi! False 2012 \n",
- "249 Platoon Platoon False 1986 \n",
- "\n",
- " length genres imdb_rating imdb_votes \n",
- "0 87 Comedy,Drama 9.4 33394 \n",
- "1 142 Drama 9.3 2071759 \n",
- "2 135 Action,Drama,War 9.3 100095 \n",
- "3 139 Comedy,Documentary 9.2 41327 \n",
- "4 175 Crime,Drama 9.2 1421495 \n",
- ".. ... ... ... ... \n",
- "245 134 Biography,Drama,History 8.1 571204 \n",
- "246 107 Drama,Mystery,Thriller 8.1 836928 \n",
- "247 110 Biography,Drama,History 8.1 40107 \n",
- "248 151 Comedy,Drama,Romance 8.1 68274 \n",
- "249 120 Drama,War 8.1 348628 \n",
- "\n",
- "[250 rows x 8 columns]"
- ]
- },
- "execution_count": 28,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# The best ones (as of June 2019)\n",
- "(movies_with_rating\n",
- " .query(\"imdb_votes > 25000\") # Only films with more than 25000 votes count\n",
- " .sort_values(\"imdb_rating\", ascending=False) # IMDb also weights the individual votes here (with weights we don't know)\n",
- " .reset_index(drop=True)\n",
- ").iloc[:250]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Films that only barely clear the vote threshold made it into the list. We have good reason to suspect that IMDb changed this criterion long ago. Requiring 250,000 votes gets us closer:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
"
- ],
- "text/plain": [
- " rank title studio lifetime_gross \\\n",
- "0 1 Star Wars: The Force Awakens BV 936662225 \n",
- "1 2 Avatar Fox 760507625 \n",
- "2 3 Black Panther BV 700059566 \n",
- "3 4 Avengers: Infinity War BV 678815482 \n",
- "4 5 Titanic Par. 659363944 \n",
- "... ... ... ... ... \n",
- "16262 16263 Dog Eat Dog IFC 80 \n",
- "16263 16264 Paranoid Girls NaN 78 \n",
- "16264 16265 Confession of a Child of the Century Cohen 74 \n",
- "16265 16266 Storage 24 Magn. 72 \n",
- "16266 16267 Zyzzyx Road Reg. 30 \n",
- "\n",
- " year \n",
- "0 2015 \n",
- "1 2009 \n",
- "2 2018 \n",
- "3 2018 \n",
- "4 1997 \n",
- "... ... \n",
- "16262 2009 \n",
- "16263 2015 \n",
- "16264 2015 \n",
- "16265 2013 \n",
- "16266 2006 \n",
- "\n",
- "[16267 rows x 5 columns]"
- ]
- },
- "execution_count": 30,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "boxoffice_raw"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "rank int64\n",
- "title object\n",
- "studio object\n",
- "lifetime_gross int64\n",
- "year int64\n",
- "dtype: object"
- ]
- },
- "execution_count": 31,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "boxoffice_raw.dtypes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We could basically be happy with that; we will only rename `rank` so that after joining we know which table the column came from."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {},
- "outputs": [],
- "source": [
- "boxoffice = (boxoffice_raw\n",
- " .rename({\n",
- " \"rank\": \"boxoffice_rank\"\n",
- " }, axis=\"columns\")\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now let's try the join. In this case we cannot rely on the index (`boxoffice` comes from a different source and has no idea about any IMDb film ID), so we explicitly specify which column (or columns) must match. That is what the `on` argument is for:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 18 rows × 12 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " title original_title is_adult year (imdb) \\\n",
- "1643 Pinocchio Pinocchio False 1940 \n",
- "1644 Pinocchio Pinocchio False 1940 \n",
- "1645 Pinocchio Turlis Abenteuer False 1967 \n",
- "1646 Pinocchio Turlis Abenteuer False 1967 \n",
- "1647 Pinocchio Pinocchio False 1971 \n",
- "1648 Pinocchio Pinocchio False 1971 \n",
- "1649 Pinocchio Pinocchio False 1911 \n",
- "1650 Pinocchio Pinocchio False 1911 \n",
- "1651 Pinocchio Pinocchio False 2002 \n",
- "1652 Pinocchio Pinocchio False 2002 \n",
- "1653 Pinocchio Un burattino di nome Pinocchio False 1971 \n",
- "1654 Pinocchio Un burattino di nome Pinocchio False 1971 \n",
- "1655 Pinocchio Pinocchio False 2012 \n",
- "1656 Pinocchio Pinocchio False 2012 \n",
- "1657 Pinocchio Pinocchio False 2015 \n",
- "1658 Pinocchio Pinocchio False 2015 \n",
- "1659 Pinocchio Pinocchio False 2015 \n",
- "1660 Pinocchio Pinocchio False 2015 \n",
- "\n",
- " length genres imdb_rating imdb_votes \\\n",
- "1643 88 Animation,Comedy,Family 7.5 114689 \n",
- "1644 88 Animation,Comedy,Family 7.5 114689 \n",
- "1645 75 Adventure,Family,Fantasy 7.2 19 \n",
- "1646 75 Adventure,Family,Fantasy 7.2 19 \n",
- "1647 79 Comedy,Fantasy 3.5 123 \n",
- "1648 79 Comedy,Fantasy 3.5 123 \n",
- "1649 50 Fantasy 5.9 69 \n",
- "1650 50 Fantasy 5.9 69 \n",
- "1651 108 Comedy,Family,Fantasy 4.3 7192 \n",
- "1652 108 Comedy,Family,Fantasy 4.3 7192 \n",
- "1653 96 Animation,Family,Fantasy 7.0 117 \n",
- "1654 96 Animation,Family,Fantasy 7.0 117 \n",
- "1655 75 Animation,Family,Fantasy 6.3 218 \n",
- "1656 75 Animation,Family,Fantasy 6.3 218 \n",
- "1657 Family,Fantasy 4.9 43 \n",
- "1658 Family,Fantasy 4.9 43 \n",
- "1659 75 Documentary 6.8 8 \n",
- "1660 75 Documentary 6.8 8 \n",
- "\n",
- " boxoffice_rank studio lifetime_gross year (boxoffice) \n",
- "1643 885 Dis. 84254167 1940 \n",
- "1644 6108 Mira. 3684305 2002 \n",
- "1645 885 Dis. 84254167 1940 \n",
- "1646 6108 Mira. 3684305 2002 \n",
- "1647 885 Dis. 84254167 1940 \n",
- "1648 6108 Mira. 3684305 2002 \n",
- "1649 885 Dis. 84254167 1940 \n",
- "1650 6108 Mira. 3684305 2002 \n",
- "1651 885 Dis. 84254167 1940 \n",
- "1652 6108 Mira. 3684305 2002 \n",
- "1653 885 Dis. 84254167 1940 \n",
- "1654 6108 Mira. 3684305 2002 \n",
- "1655 885 Dis. 84254167 1940 \n",
- "1656 6108 Mira. 3684305 2002 \n",
- "1657 885 Dis. 84254167 1940 \n",
- "1658 6108 Mira. 3684305 2002 \n",
- "1659 885 Dis. 84254167 1940 \n",
- "1660 6108 Mira. 3684305 2002 "
- ]
- },
- "execution_count": 33,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.merge(\n",
- " movies_with_rating,\n",
- " boxoffice,\n",
- " suffixes=[\" (imdb)\", \" (boxoffice)\"],\n",
- " on=\"title\"\n",
- ").query(\"title == 'Pinocchio'\") # \"One\" sample film"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Oops, that is probably not what we wanted. There are many different Pinocchios, and each of them was joined with both films of that name from `boxoffice`. The lesson: when joining, think about whether the values in the key column are unique. A film title evidently is not.\n",
- "\n",
- "In this particular case we spotted the problem ourselves, but when a duplicate key is buried somewhere among millions of values, we would like the computer to catch it for us. That is what the `validate` argument is for: depending on the relationship you expect between the tables, the allowed values are `\"one_to_one\"`, `\"one_to_many\"`, `\"many_to_one\"` or `\"many_to_many\"`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 20567 rows × 12 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " title \\\n",
- "0 Oliver Twist \n",
- "1 Oliver Twist \n",
- "2 Oliver Twist \n",
- "3 Oliver Twist \n",
- "4 Oliver Twist \n",
- "... ... \n",
- "20562 BTS World Tour: Love Yourself in Seoul \n",
- "20563 Mojin: The Worm Valley \n",
- "20564 Extreme Job \n",
- "20565 Peppa Celebrates Chinese New Year \n",
- "20566 Avant qu'on explose \n",
- "\n",
- " original_title is_adult year (imdb) length \\\n",
- "0 Oliver Twist False 1912 \n",
- "1 Oliver Twist False 1912 \n",
- "2 Oliver Twist False 1916 50 \n",
- "3 Oliver Twist False 1922 98 \n",
- "4 Oliver Twist False 1933 80 \n",
- "... ... ... ... ... \n",
- "20562 BTS World Tour: Love Yourself in Seoul False 2019 112 \n",
- "20563 Yun nan chong gu False 2018 110 \n",
- "20564 Geukhanjikeob False 2019 111 \n",
- "20565 xiao zhu pei qi guo da nian False 2019 81 \n",
- "20566 Avant qu'on explose False 2019 108 \n",
- "\n",
- " genres imdb_rating imdb_votes boxoffice_rank studio \\\n",
- "0 Drama 4.7 19 6826 Sony \n",
- "1 Drama 4.4 12 6826 Sony \n",
- "2 Drama 6.6 16 6826 Sony \n",
- "3 Drama 6.8 657 6826 Sony \n",
- "4 Drama 5.0 292 6826 Sony \n",
- "... ... ... ... ... ... \n",
- "20562 Documentary,Music 8.5 439 6173 Fathom \n",
- "20563 Action,Fantasy 4.7 120 11240 WGUSA \n",
- "20564 Action,Comedy 7.3 905 7212 CJ \n",
- "20565 Animation,Family 3.4 41 10811 STX \n",
- "20566 Comedy 6.9 41 10995 EOne \n",
- "\n",
- " lifetime_gross year (boxoffice) \n",
- "0 2080321 2005 \n",
- "1 2080321 2005 \n",
- "2 2080321 2005 \n",
- "3 2080321 2005 \n",
- "4 2080321 2005 \n",
- "... ... ... \n",
- "20562 3509917 2019 \n",
- "20563 101516 2019 \n",
- "20564 1548816 2019 \n",
- "20565 131225 2019 \n",
- "20566 116576 2019 \n",
- "\n",
- "[20567 rows x 12 columns]"
- ]
- },
- "execution_count": 34,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd.merge(\n",
- " movies_with_rating,\n",
- " boxoffice,\n",
- " on=\"title\",\n",
- " suffixes=[\" (imdb)\", \" (boxoffice)\"],\n",
- "# validate=\"one_to_one\" # Uncomment this and an error pops up!\n",
- ")"
- ]
- },
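To see what `validate` does without loading the full datasets, here is a minimal sketch on two tiny made-up tables (the values are illustrative, not taken from `movies_with_rating` or `boxoffice`). It assumes that `pandas.errors.MergeError` is the exception raised when the check fails, which is the case in current pandas versions.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy tables: "Pinocchio" appears twice on both sides
left = pd.DataFrame({
    "title": ["Pinocchio", "Pinocchio", "Coco"],
    "year": [1940, 2002, 2017],
})
right = pd.DataFrame({
    "title": ["Pinocchio", "Pinocchio", "Coco"],
    "gross": [84, 3, 210],
})

# Without validation every Pinocchio pairs with every Pinocchio:
# 2 x 2 rows for "Pinocchio" plus 1 row for "Coco"
unchecked = pd.merge(left, right, on="title")
print(len(unchecked))

# With validation pandas refuses to merge on a non-unique key
try:
    pd.merge(left, right, on="title", validate="one_to_one")
    raised = False
except MergeError as error:
    raised = True
    print("Merge rejected:", error)
```

The check costs one extra argument and turns a silent row explosion into a loud error, which is usually the better failure mode.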
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The fix is simple: we join on two different columns (the `on` argument can take it ;-)). While we are at it, we also realize that it makes no sense to join films that have no year at all, so we drop them:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 3 rows × 11 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " title original_title is_adult year length genres \\\n",
- "6926 Playback Playback False 2012 98 Horror,Thriller \n",
- "6927 Playback Playback False 2012 113 Drama \n",
- "6928 Playback Dur d'être Dieu False 2012 66 Documentary \n",
- "\n",
- " imdb_rating imdb_votes boxoffice_rank studio lifetime_gross \n",
- "6926 4.3 4478 16256 Magn. 264 \n",
- "6927 4.9 27 16256 Magn. 264 \n",
- "6928 5.2 8 16256 Magn. 264 "
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "(\n",
- " pd.merge(\n",
- " movies_with_rating.dropna(subset=[\"year\"]), # Drop all rows without a year\n",
- " boxoffice,\n",
- " on=[\"title\", \"year\"],\n",
- " validate=\"many_to_one\", # movies_with_rating is still not unique!\n",
- " )\n",
- ").query(\"title == 'Playback'\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Still not unique! What now?\n",
- "\n",
- "**Hypothesis:** We are stepping onto dangerous ground and will speculate that box-office figures are most likely available only for the most popular films. Maybe we are right, maybe not, and we will probably introduce some small inaccuracy, but getting at the real truth here is \"expensive\" (and perhaps literally expensive); it cannot be done reliably from the datasets at hand.\n",
- "\n",
- "To get as close to reality as possible, from each repeated (title, year) pair we pick the film with the highest `imdb_votes`. First we sort all films with `sort_values`, then we call `drop_duplicates(..., keep=\"first\")`, which keeps just one row from each run of duplicates:"
- ]
- },
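The sort-then-deduplicate idea can be sketched on a tiny made-up table first. The column names follow the lesson, but the rows are illustrative only, and the `subset` argument is our assumption about what goes in place of the `...` above.

```python
import pandas as pd

# Hypothetical toy table: three films share the same (title, year) pair
films = pd.DataFrame({
    "title": ["Playback", "Playback", "Playback", "Coco"],
    "year": [2012, 2012, 2012, 2017],
    "imdb_votes": [27, 4478, 8, 300000],
})

most_voted = (
    films
    .sort_values("imdb_votes", ascending=False)                # most-voted films first
    .drop_duplicates(subset=["title", "year"], keep="first")   # keep the top row of each pair
)
print(most_voted)
```

Because the rows are sorted before deduplication, "first" means "the one with the most votes" for every (title, year) pair.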
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
"
- ],
- "text/plain": [
- "Empty DataFrame\n",
- "Columns: [title_type, title, original_title, is_adult, start_year, end_year, length, genres, Title, RatingTomatometer, No. of Reviews]\n",
- "Index: []"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Ready to merge?\n",
- "pd.merge(imdb_titles, rotten_tomatoes_nodup, left_on=\"title\", right_on=\"Title\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "0 rows!\n",
- "\n",
- "So far we have manipulated rows and columns as wholes, but now we have to reach into the cell values themselves. That is not unusual when merging data from different sources. Our task is to convert strings like \"Black Panther (2018)\" into two values: the title \"Black Panther\" and the year 2018 (one column into two).\n",
- "\n",
- "Fortunately, we can build those columns easily with the string method [`.str.slice`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html), which cuts a part out of each string (and again works on the whole column: the result is a new column with the function applied to each value). We will trust that the four characters before the last one are the year and that the rest, minus the parentheses, is the actual title:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 947 rows × 4 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " tomatoes_rating tomatoes_votes title year\n",
- "0 97 444 Black Panther 2018\n",
- "1 97 394 Mad Max: Fury Road 2015\n",
- "2 93 410 Wonder Woman 2017\n",
- "3 99 118 Metropolis 1927\n",
- "4 97 308 Coco 2017\n",
- "... ... ... ... ...\n",
- "1585 15 97 Priest 2011\n",
- "1586 14 103 American Outlaws 2001\n",
- "1587 15 54 September Dawn 2007\n",
- "1588 12 147 Jonah Hex 2010\n",
- "1589 2 51 Texas Rangers 2001\n",
- "\n",
- "[947 rows x 4 columns]"
- ]
- },
- "execution_count": 42,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "rotten_tomatoes_beta = (rotten_tomatoes_nodup\n",
- " .assign(\n",
- " title=rotten_tomatoes_nodup[\"Title\"].str.slice(0, -7), \n",
- " year=rotten_tomatoes_nodup[\"Title\"].str.slice(-5, -1).astype(int)\n",
- " )\n",
- " .rename({\n",
- " \"RatingTomatometer\": \"tomatoes_rating\",\n",
- " \"No. of Reviews\": \"tomatoes_votes\",\n",
- " }, axis=\"columns\")\n",
- " .drop([\"Title\"], axis=\"columns\")\n",
- ")\n",
- "rotten_tomatoes_beta"
- ]
- },
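The slice positions are easier to check on a couple of strings in isolation. This is a minimal sketch on a made-up `Series`, assuming every value ends with a space and a four-digit year in parentheses, exactly as the cell above assumes.

```python
import pandas as pd

# Hypothetical sample of the raw "Title" column
raw = pd.Series(["Black Panther (2018)", "Metropolis (1927)"])

# Everything except the trailing " (YYYY)", which is 7 characters long
title = raw.str.slice(0, -7)

# The 4 characters just before the closing parenthesis
year = raw.str.slice(-5, -1).astype(int)

print(title.tolist())
print(year.tolist())
```

If a value did not follow the pattern, `astype(int)` would fail on the non-numeric slice, which is a useful early warning.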
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The parenthesis odyssey is not over: someone helpfully stuffed the original title of non-English-language films into parentheses as well. Let's verify this with the [`.str.contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) method (by default it searches using regular expressions, which we have not learned to use yet, so we must explicitly turn that off with `regex=False`):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "[HTML table, 66 rows × 4 columns; same data as the text/plain output below]"
- ],
- "text/plain": [
- " tomatoes_rating tomatoes_votes \\\n",
- "15 100 58 \n",
- "51 98 46 \n",
- "61 97 71 \n",
- "69 98 47 \n",
- "99 96 139 \n",
- "... ... ... \n",
- "1368 97 59 \n",
- "1457 43 82 \n",
- "1502 71 52 \n",
- "1547 83 64 \n",
- "1559 74 62 \n",
- "\n",
- " title year \n",
- "15 Seven Samurai (Shichinin no Samurai) 1956 \n",
- "51 Aguirre, the Wrath of God (Aguirre, der Zorn G... 1972 \n",
- "61 Ghostbusters (1984 Original) 1984 \n",
- "69 A Fistful of Dollars (Per un Pugno di Dollari) 1964 \n",
- "99 Embrace Of The Serpent (El Abrazo De La Serpie... 2016 \n",
- "... ... ... \n",
- "1368 To Be and to Have (Etre et Avoir) 2003 \n",
- "1457 Goal! The Dream Begins (Goal!: The Impossible ... 2005 \n",
- "1502 Only Human (Seres queridos) 2006 \n",
- "1547 The Good, the Bad, the Weird (Joheun-nom, Nabb... 2010 \n",
- "1559 Fah talai jone (Tears of the Black Tiger) 2007 \n",
- "\n",
- "[66 rows x 4 columns]"
- ]
- },
- "execution_count": 43,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "rotten_tomatoes_beta[rotten_tomatoes_beta[\"title\"].str.contains(\")\", regex=False)]"
- ]
- },
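To see why `regex=False` is needed here, consider a minimal sketch on a made-up `Series` (the titles are illustrative and not taken from the real table). A lone `)` is not a valid regular expression, so the default regex mode raises an error, while the literal search works as expected.

```python
import pandas as pd

# Hypothetical sample titles, for illustration only
titles = pd.Series(
    ["Seven Samurai (Shichinin no Samurai)", "Alien", "Ghostbusters (1984 Original)"]
)

# Literal substring search: ")" is treated as an ordinary character
has_paren = titles.str.contains(")", regex=False)

# In the default regex mode, a lone ")" is an invalid pattern and raises an error
try:
    titles.str.contains(")")
    regex_failed = False
except Exception:
    regex_failed = True

print(has_paren.tolist(), regex_failed)
```

An escaped pattern such as `r"\)"` would also work in regex mode; the literal search simply avoids regular expressions altogether.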
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To keep things simple, we will remove all such parentheses as well. The [`.str.rsplit`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.rsplit.html) method helps here: it splits a string from the right into several parts by a separator and puts them in a list. We choose the left parenthesis `\"(\"` as the separator and allow at most one split (`n=1`), so every list ends up with one or two parts:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "41 [Marvel's The Avengers]\n",
- "61 [Ghostbusters , 1984 Original)]\n",
- "81 [Mad Max 2: The Road Warrior]\n",
- "Name: title, dtype: object"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "split_title = (\n",
- " rotten_tomatoes_beta[\"title\"]\n",
- " .str.rsplit(\"(\", n=1)\n",
- ")\n",
- "split_title.loc[[41, 61, 81]] # Some lists contain one element, others two"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "And how do we now pick the first element of each list?\n",
- "\n",
- "💡 The [`apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method lets you apply an arbitrary transformation (defined as a function) to every row of a table or every value of a `Series`. We can usually do without it, and we should (which is why we do not cover it in much detail), because it is not very computationally efficient. Here, however, it makes it easier to see what is actually being done, i.e. taking the first element of a list:"
- ]
- },
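A minimal sketch of this step on made-up titles (not the real table): `apply` with a small function picks the first element of every list. The `.str[0]` accessor is shown as an equivalent shortcut that pandas also offers; whether the lesson itself uses it is an assumption.

```python
import pandas as pd

# Hypothetical titles: one without parentheses, one with
titles = pd.Series(["Marvel's The Avengers", "Ghostbusters (1984 Original)"])

# Split from the right on "(", at most once -> lists of one or two parts
split_title = titles.str.rsplit("(", n=1)

# Pick the first element of each list with an explicit function...
first_via_apply = split_title.apply(lambda parts: parts[0])
# ...or with the element accessor on .str
first_via_str = split_title.str[0]

# The split leaves a trailing space before the removed parenthesis
cleaned = first_via_apply.str.strip()
print(cleaned.tolist())
```

The final `.str.strip()` is there because the text before the parenthesis ends with a space.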
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
"
]
@@ -7936,7 +1896,7 @@
},
{
"cell_type": "code",
- "execution_count": 72,
+ "execution_count": 26,
"metadata": {},
"outputs": [
{
@@ -7955,7 +1915,7 @@
""
]
},
- "execution_count": 72,
+ "execution_count": 26,
"metadata": {},
"output_type": "execute_result"
},
@@ -8008,7 +1968,7 @@
},
{
"cell_type": "code",
- "execution_count": 73,
+ "execution_count": 27,
"metadata": {},
"outputs": [
{
@@ -8038,7 +1998,7 @@
},
{
"cell_type": "code",
- "execution_count": 74,
+ "execution_count": 28,
"metadata": {},
"outputs": [
{
@@ -8111,7 +2071,7 @@
},
{
"cell_type": "code",
- "execution_count": 76,
+ "execution_count": 30,
"metadata": {},
"outputs": [
{
diff --git a/lessons/pydata/pandas_correlations/movies_complete.csv.gz b/lessons/pydata/pandas_correlations/movies_complete.csv.gz
new file mode 100644
index 0000000..589ffc3
Binary files /dev/null and b/lessons/pydata/pandas_correlations/movies_complete.csv.gz differ
diff --git a/lessons/pydata/pandas_correlations/movies_with_rating.csv.gz b/lessons/pydata/pandas_correlations/movies_with_rating.csv.gz
new file mode 100644
index 0000000..cfdb7eb
Binary files /dev/null and b/lessons/pydata/pandas_correlations/movies_with_rating.csv.gz differ
diff --git a/lessons/pydata/pandas_correlations/boxoffice_march_2019.csv.gz b/lessons/pydata/pandas_joins/boxoffice_march_2019.csv.gz
similarity index 100%
rename from lessons/pydata/pandas_correlations/boxoffice_march_2019.csv.gz
rename to lessons/pydata/pandas_joins/boxoffice_march_2019.csv.gz
diff --git a/lessons/pydata/pandas_joins/index.ipynb b/lessons/pydata/pandas_joins/index.ipynb
new file mode 100644
index 0000000..5c3a49d
--- /dev/null
+++ b/lessons/pydata/pandas_joins/index.ipynb
@@ -0,0 +1,6118 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Pandas - joining tables\n",
+ "\n",
+ "This lesson is all about multiplicity and connecting things: you will learn to work with several tables at once. Along the way we will go through (not for the first time and not for the last) the cleaning of real-world datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The usual imports\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "toc-hr-collapsed": false
+ },
+ "source": [
+ "## Joining tables\n",
+ "\n",
+ "In the lesson where we processed weather data, we showed you that the `concat` function can glue together several `DataFrame` or `Series` objects, provided they have a \"compatible\" index. Now we will look at the topic a little more closely and show how to join tables based on various columns, and what to do when the rows of one table do not exactly match the other table.\n",
+ "\n",
+ "In general, `pandas` offers three functions / methods for combining tables, each with its own typical use (although their capabilities overlap):\n",
+ "\n",
+ "- [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) is a general-purpose function for gluing together two or more tables / columns - below each other or side by side, with or without regard to their indexes.\n",
+ "- [`merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) is a general-purpose function for joining tables based on a relationship between indexes or columns.\n",
+ "- [`join`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) (a method) simplifies the work when you want to join two tables based on their index.\n",
+ "\n",
+ "A detailed breakdown of what each of them can do is in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). We will also go through them one by one."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Simple stacking"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Below each other\n",
+ "\n",
+ "This is probably the simplest case: we have two `Series` objects or two pieces of a table with the same columns and we want to stack them below each other. The [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function is used for that:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "a = pd.Series([\"jedna\", \"dvě\", \"tři\"])\n",
+ "b = pd.Series([\"čtyři\", \"pět\", \"šest\"])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 jedna\n",
+ "1 dvě\n",
+ "2 tři\n",
+ "0 čtyři\n",
+ "1 pět\n",
+ "2 šest\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.concat([a, b])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "💡 Do you see that the index repeats? We created two `Series` without caring about their index. But `pandas`, unlike us, does care, and so it dutifully concatenated both indexes, even at the cost of duplicate values. This can be avoided by passing the additional argument `ignore_index=True`, as the following example with five copies of the same `Series` shows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 jedna\n",
+ "1 dvě\n",
+ "2 tři\n",
+ "3 jedna\n",
+ "4 dvě\n",
+ "5 tři\n",
+ "6 jedna\n",
+ "7 dvě\n",
+ "8 tři\n",
+ "9 jedna\n",
+ "10 dvě\n",
+ "11 tři\n",
+ "12 jedna\n",
+ "13 dvě\n",
+ "14 tři\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.concat([a, a, a, a, a], ignore_index=True)"
+ ]
+ },
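The same behaviour applies to tables with identical columns. A minimal sketch, assuming two small made-up `DataFrame` objects (names and values are purely illustrative):

```python
import pandas as pd

# Two hypothetical tables with identical columns
first = pd.DataFrame({"name": ["Ada", "Bob"], "age": [36, 41]})
second = pd.DataFrame({"name": ["Cyd"], "age": [29]})

# By default the original row labels are kept, so 0 appears twice
stacked = pd.concat([first, second])

# With ignore_index=True the rows are renumbered from 0
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())     # row labels kept from the inputs
print(renumbered.index.tolist())  # fresh consecutive labels
```

Renumbering is usually the safer choice when the old row labels carry no meaning, because duplicate labels make later selection by `.loc` ambiguous.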
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Side by side\n",
+ "\n",
+ "You will probably use this rarely, but when we want to \"glue\" to the right (say, ten `Series`), it is enough to add the familiar `axis` argument:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " 0 1 2 3 4\n",
+ "0 jedna jedna jedna jedna jedna\n",
+ "1 dvě dvě dvě dvě dvě\n",
+ "2 tři tři tři tři tři"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.concat([a, a, a, a, a], axis=\"columns\")"
+ ]
+ },
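In the example above all pieces share the same index, so the columns line up trivially. A minimal sketch of what happens when the indexes differ, assuming two made-up `Series` (the labels and values are illustrative): `concat` aligns the values by index label and fills the gaps with `NaN`.

```python
import pandas as pd

# Two hypothetical Series whose indexes only partly overlap
left = pd.Series([1, 2, 3], index=["a", "b", "c"], name="left")
right = pd.Series([10, 30, 40], index=["a", "c", "d"], name="right")

# Values are matched by index label, not by position
side_by_side = pd.concat([left, right], axis="columns")
print(side_by_side)
```

Label `b` has no value in `right` and label `d` has none in `left`, so those cells end up as `NaN`.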
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Example:** What is the quickest way to \"draw an empty chessboard\" (the quotation marks are intentional)?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " A B C D E F G H\n",
+ "8 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
+ "7 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜\n",
+ "6 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
+ "5 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜\n",
+ "4 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
+ "3 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜\n",
+ "2 ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛\n",
+ "1 ⬛ ⬜ ⬛ ⬜ ⬛ ⬜ ⬛ ⬜"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sachy = pd.concat(\n",
+ " [\n",
+ " pd.concat( \n",
+ " [pd.DataFrame([[\"⬜\", \"⬛\"], [\"⬛\", \"⬜\"]])] * 4,\n",
+ " axis=1)\n",
+ " ] * 4\n",
+ ")\n",
+ "sachy.index = list(range(8, 0, -1))\n",
+ "sachy.columns = list(\"ABCDEFGH\")\n",
+ "sachy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Joining heterogeneous tables"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "🎦 For joining heterogeneous data (\"joining\" in data jargon) we will reach for somewhat more complex movie data..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We have downloaded several files, so let's load them (roughly for now, \"raw\"), bearing in mind that the first two are not \"comma-separated\" in the true sense of the word but use a tab to separate values (the `sep` argument helps here). We also take into account that the string `\"\\N\"` represents missing values in them (the `na_values` argument helps):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "imdb_titles_raw = pd.read_csv(\"title.basics.tsv.gz\", sep=\"\\t\", na_values=\"\\\\N\")\n",
+ "imdb_ratings_raw = pd.read_csv(\"title.ratings.tsv.gz\", sep=\"\\t\", na_values=\"\\\\N\")\n",
+ "boxoffice_raw = pd.read_csv(\"boxoffice_march_2019.csv.gz\")\n",
+ "rotten_tomatoes_raw = pd.read_csv(\"rotten_tomatoes_top_movies_2019-01-15.csv\")"
+ ]
+ },
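The effect of `sep` and `na_values` can be tried without the large files. A minimal sketch, assuming a tiny in-memory sample that only imitates the IMDb layout (the three rows are illustrative):

```python
import io

import pandas as pd

# A tiny tab-separated sample; "\\N" in Python source is the two characters \N
raw = (
    "tconst\tstartYear\truntimeMinutes\n"
    "tt0000001\t1894\t1\n"
    "tt0000004\t1892\t\\N\n"
)

# sep="\t" splits on tabs, na_values="\\N" turns \N into a missing value
sample = pd.read_csv(io.StringIO(raw), sep="\t", na_values="\\N")
print(sample)
print(sample.dtypes)
```

Because one runtime is missing, `runtimeMinutes` is read as `float64`, which is the same effect the lesson later observes on the full table.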
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What does each file contain?\n",
+ "\n",
+ "- The first two files contain freely available (albeit \"only\" for non-commercial use) data about films from IMDb (Internet Movie Database). We picked the general information and the (numerical) user ratings. A detailed description of the files, as well as links to further files, can be found at https://www.imdb.com/interfaces/. To keep memory requirements down, we stripped TV series episodes from the dataset: they are of no interest to us here, and with a bit of luck the data will now fit even on computers with less RAM.\n",
+ "\n",
+ "- The file `boxoffice_march_2019.csv.gz` contains information about the earnings of individual films. It comes from the sample dataset for the \"TMDB Box Office Prediction\" competition on Kaggle: https://www.kaggle.com/c/tmdb-box-office-prediction/data\n",
+ "\n",
+ "- The file `rotten_tomatoes_top_movies_2019-01-15.csv` contains percentage ratings of films from Rotten Tomatoes, computed as the share of positive reviews from film critics (so the principle differs from the one used on IMDb). Downloaded from: https://data.world/prasert/rotten-tomatoes-top-movies-by-genre\n",
+ "\n",
+ "Let's look at the shortcomings of these files and gradually put them together. We would like to know (and hopefully so would you!) how ratings relate to a film's commercial success and how the Rotten Tomatoes ratings differ from those on IMDb."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " tconst titleType primaryTitle \\\n",
+ "0 tt0000001 short Carmencita \n",
+ "1 tt0000002 short Le clown et ses chiens \n",
+ "2 tt0000003 short Pauvre Pierrot \n",
+ "3 tt0000004 short Un bon bock \n",
+ "4 tt0000005 short Blacksmith Scene \n",
+ "... ... ... ... \n",
+ "1783511 tt9916734 video Manca: Peleo \n",
+ "1783512 tt9916754 movie Chico Albuquerque - Revelações \n",
+ "1783513 tt9916756 short Pretty Pretty Black Girl \n",
+ "1783514 tt9916764 short 38 \n",
+ "1783515 tt9916856 short The Wind \n",
+ "\n",
+ " originalTitle isAdult startYear endYear \\\n",
+ "0 Carmencita 0 1894.0 NaN \n",
+ "1 Le clown et ses chiens 0 1892.0 NaN \n",
+ "2 Pauvre Pierrot 0 1892.0 NaN \n",
+ "3 Un bon bock 0 1892.0 NaN \n",
+ "4 Blacksmith Scene 0 1893.0 NaN \n",
+ "... ... ... ... ... \n",
+ "1783511 Manca: Peleo 0 2018.0 NaN \n",
+ "1783512 Chico Albuquerque - Revelações 0 2013.0 NaN \n",
+ "1783513 Pretty Pretty Black Girl 0 2019.0 NaN \n",
+ "1783514 38 0 2018.0 NaN \n",
+ "1783515 The Wind 0 2015.0 NaN \n",
+ "\n",
+ " runtimeMinutes genres \n",
+ "0 1.0 Documentary,Short \n",
+ "1 5.0 Animation,Short \n",
+ "2 4.0 Animation,Comedy,Romance \n",
+ "3 NaN Animation,Short \n",
+ "4 1.0 Comedy,Short \n",
+ "... ... ... \n",
+ "1783511 NaN Music,Short \n",
+ "1783512 NaN Documentary \n",
+ "1783513 NaN Short \n",
+ "1783514 NaN Short \n",
+ "1783515 27.0 Short \n",
+ "\n",
+ "[1783516 rows x 9 columns]"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "imdb_titles_raw"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "648.8971881866455"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# How many megabytes of memory does the table take up? (1 MB = 2**20 bytes)\n",
+ "imdb_titles_raw.memory_usage(deep=True).sum() / 2**20 "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will certainly want to convert the columns to the proper types. What are they by default?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "tconst object\n",
+ "titleType object\n",
+ "primaryTitle object\n",
+ "originalTitle object\n",
+ "isAdult int64\n",
+ "startYear float64\n",
+ "endYear float64\n",
+ "runtimeMinutes float64\n",
+ "genres object\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "imdb_titles_raw.dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "What will we convert to?\n",
+ "\n",
+ "- `tconst` is a string that we will later use as the index, because it is the unique identifier in the IMDb database.\n",
+ "- `titleType`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "titleType\n",
+ "short 676930\n",
+ "movie 514654\n",
+ "video 227582\n",
+ "tvSeries 162781\n",
+ "tvMovie 126507\n",
+ "tvMiniSeries 25574\n",
+ "videoGame 23310\n",
+ "tvSpecial 17007\n",
+ "tvShort 9171\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "imdb_titles_raw[\"titleType\"].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Only nine distinct values in almost 2 million rows? That is an ideal candidate for conversion to the `\"category\"` type.\n",
+ "\n",
+ "- `primaryTitle` and `originalTitle` look like ordinary strings (the English title where available and the original title where available)\n",
+ "- `isAdult` indicates whether the work is intended for adults. We should most likely convert this column to `bool`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "isAdult\n",
+ "0 1692292\n",
+ "1 91224\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "imdb_titles_raw[\"isAdult\"].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "- `startYear` and `endYear` contain years, i.e. integers, but because of the missing values the `float64` type was chosen for them. In `pandas` we prefer the so-called \"nullable integer\" type, which is written with a capital \"I\". If you do not know which subtype to pick, go for `Int64`.\n",
+ "- the same goes for `runtimeMinutes`."
+ ]
+ },
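The conversions discussed so far can be sketched on a miniature table. This is only an illustration under assumptions: the three rows are made up, and the lesson's own conversion code may be organised differently.

```python
import pandas as pd

# Hypothetical miniature of the raw table (column names follow the IMDb file)
raw = pd.DataFrame({
    "titleType": ["short", "movie", "short"],
    "isAdult": [0, 1, 0],
    "startYear": [1894.0, None, 2018.0],  # the missing value forces float64
})

typed = raw.assign(
    titleType=raw["titleType"].astype("category"),  # few distinct values
    isAdult=raw["isAdult"].astype(bool),            # 0/1 flag -> False/True
    startYear=raw["startYear"].astype("Int64"),     # nullable integer, keeps <NA>
)
print(typed.dtypes)
print(typed)
```

Note the capital "I" in `"Int64"`: the lowercase `"int64"` cannot hold missing values and the conversion would fail on the `NaN` row.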
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "startYear 2115.0\n",
+ "endYear 2027.0\n",
+ "runtimeMinutes 125156.0\n",
+ "dtype: float64"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "imdb_titles_raw[[\"startYear\", \"endYear\", \"runtimeMinutes\"]].max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "By the way, did you notice that we have works from the future (the year 2115)?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "startYear\n",
+ "2020.0 340\n",
+ "2021.0 36\n",
+ "2022.0 14\n",
+ "2023.0 1\n",
+ "2024.0 2\n",
+ "2025.0 1\n",
+ "2115.0 1\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "
"
+ ],
+ "text/plain": [
+ " title \\\n",
+ "tconst \n",
+ "tt0000009 Miss Jerry \n",
+ "tt0000147 The Corbett-Fitzsimmons Fight \n",
+ "tt0000335 Soldiers of the Cross \n",
+ "tt0000502 Bohemios \n",
+ "tt0000574 The Story of the Kelly Gang \n",
+ "... ... \n",
+ "tt9916622 Rodolpho Teóphilo - O Legado de um Pioneiro \n",
+ "tt9916680 De la ilusión al desconcierto: cine colombiano... \n",
+ "tt9916706 Dankyavar Danka \n",
+ "tt9916730 6 Gunn \n",
+ "tt9916754 Chico Albuquerque - Revelações \n",
+ "\n",
+ " original_title is_adult year \\\n",
+ "tconst \n",
+ "tt0000009 Miss Jerry False 1894 \n",
+ "tt0000147 The Corbett-Fitzsimmons Fight False 1897 \n",
+ "tt0000335 Soldiers of the Cross False 1900 \n",
+ "tt0000502 Bohemios False 1905 \n",
+ "tt0000574 The Story of the Kelly Gang False 1906 \n",
+ "... ... ... ... \n",
+ "tt9916622 Rodolpho Teóphilo - O Legado de um Pioneiro False 2015 \n",
+ "tt9916680 De la ilusión al desconcierto: cine colombiano... False 2007 \n",
+ "tt9916706 Dankyavar Danka False 2013 \n",
+ "tt9916730 6 Gunn False 2017 \n",
+ "tt9916754 Chico Albuquerque - Revelações False 2013 \n",
+ "\n",
+ " length genres imdb_rating imdb_votes \n",
+ "tconst \n",
+ "tt0000009 45 Romance 5.5 77.0 \n",
+ "tt0000147 20 Documentary,News,Sport 5.2 289.0 \n",
+ "tt0000335 Biography,Drama 6.3 39.0 \n",
+ "tt0000502 100 NaN NaN NaN \n",
+ "tt0000574 70 Biography,Crime,Drama 6.2 505.0 \n",
+ "... ... ... ... ... \n",
+ "tt9916622 Documentary NaN NaN \n",
+ "tt9916680 100 Documentary NaN NaN \n",
+ "tt9916706 Comedy NaN NaN \n",
+ "tt9916730 116 NaN NaN NaN \n",
+ "tt9916754 Documentary NaN NaN \n",
+ "\n",
+ "[514654 rows x 8 columns]"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "movies.join(ratings)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Two columns from the `ratings` table were quietly added to the table: the index values (i.e. `tconst`) were compared and the parts of the rows where the index matches were paired up.\n",
+ "\n",
+ "💡 Note (although it is not entirely obvious from the function calls in `pandas`) that something fundamentally different from \"gluing to the right\" is happening here - the tables are not treated as blocks that can be assembled like Lego bricks, but as sources of information about individual objects that have to be joined semantically.\n",
+ "\n",
+ "As you can see, though, the table contains many rows where the rating columns are missing values (or rather contain `NaN`). This stems from the way the `join` method \"joins\" by default - it uses all rows from the left table regardless of whether they have a counterpart in the right table. Fortunately, other ways of joining can be specified with the `how` argument:\n",
+ "\n",
+ "- `left` (the default for the `join` method) - all rows from the left table are taken, together with the matching rows from the right table (where there are none, `NaN` is filled in)\n",
+ "- `right` - all rows from the right table are taken, together with the matching rows from the left table (where there are none, `NaN` is filled in)\n",
+ "- `inner` (the default for the `merge` function) - only the rows present in both the left and the right table are taken.\n",
+ "- `outer` (the default for the `concat` function) - all rows from both the left and the right table are taken; wherever something is missing, `NaN` is filled in.\n",
+ "\n",
+ "As a Venn diagram, where the circles represent the sets of rows in the two source tables and the rows of the resulting table are highlighted in blue:\n",
+ "\n",
+ "![Join types](static/joins.svg)\n",
+ "\n",
+ "*Image adapted from https://upload.wikimedia.org/wikipedia/commons/9/9d/SQL_Joins.svg (author: Arbeck)*\n",
+ "\n",
+ "💡 When we cover databases, these four types of joins will come up again.\n",
+ "\n",
+ "The following listing shows how many rows we would get with the different values of `how`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "movies.join(ratings, how=\"left\"): 514654 rows.\n",
+ "movies.join(ratings, how=\"right\"): 923696 rows.\n",
+ "movies.join(ratings, how=\"inner\"): 232496 rows.\n",
+ "movies.join(ratings, how=\"outer\"): 1205854 rows.\n"
+ ]
+ }
+ ],
+ "source": [
+ "for how in [\"left\", \"right\", \"inner\", \"outer\"]:\n",
+ "    print(f\"movies.join(ratings, how=\\\"{how}\\\"):\", movies.join(ratings, how=how).shape[0], \"rows.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And now for the three alternatives:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title original_title \\\n",
+ "tconst \n",
+ "tt0000009 Miss Jerry Miss Jerry \n",
+ "tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight \n",
+ "tt0000335 Soldiers of the Cross Soldiers of the Cross \n",
+ "tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n",
+ "tt0000615 Robbery Under Arms Robbery Under Arms \n",
+ "... ... ... \n",
+ "tt9910930 Jeg ser deg Jeg ser deg \n",
+ "tt9911774 Padmavyuhathile Abhimanyu Padmavyuhathile Abhimanyu \n",
+ "tt9913056 Swarm Season Swarm Season \n",
+ "tt9913084 Diabolik sono io Diabolik sono io \n",
+ "tt9914286 Sokagin Çocuklari Sokagin Çocuklari \n",
+ "\n",
+ " is_adult year length genres imdb_rating \\\n",
+ "tconst \n",
+ "tt0000009 False 1894 45 Romance 5.5 \n",
+ "tt0000147 False 1897 20 Documentary,News,Sport 5.2 \n",
+ "tt0000335 False 1900 Biography,Drama 6.3 \n",
+ "tt0000574 False 1906 70 Biography,Crime,Drama 6.2 \n",
+ "tt0000615 False 1907 Drama 4.8 \n",
+ "... ... ... ... ... ... \n",
+ "tt9910930 False 2019 75 Crime,Documentary 4.6 \n",
+ "tt9911774 False 2019 130 Drama 8.5 \n",
+ "tt9913056 False 2019 86 Documentary 6.2 \n",
+ "tt9913084 False 2019 75 Documentary 6.2 \n",
+ "tt9914286 False 2019 98 Drama,Family 9.8 \n",
+ "\n",
+ " imdb_votes \n",
+ "tconst \n",
+ "tt0000009 77 \n",
+ "tt0000147 289 \n",
+ "tt0000335 39 \n",
+ "tt0000574 505 \n",
+ "tt0000615 14 \n",
+ "... ... \n",
+ "tt9910930 5 \n",
+ "tt9911774 363 \n",
+ "tt9913056 5 \n",
+ "tt9913084 6 \n",
+ "tt9914286 72 \n",
+ "\n",
+ "[232496 rows x 8 columns]"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Alternative 1 (preferred)\n",
+ "movies_with_rating = movies.join(ratings, how=\"inner\")\n",
+ "movies_with_rating"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title original_title \\\n",
+ "tconst \n",
+ "tt0000009 Miss Jerry Miss Jerry \n",
+ "tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight \n",
+ "tt0000335 Soldiers of the Cross Soldiers of the Cross \n",
+ "tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n",
+ "tt0000615 Robbery Under Arms Robbery Under Arms \n",
+ "... ... ... \n",
+ "tt9910930 Jeg ser deg Jeg ser deg \n",
+ "tt9911774 Padmavyuhathile Abhimanyu Padmavyuhathile Abhimanyu \n",
+ "tt9913056 Swarm Season Swarm Season \n",
+ "tt9913084 Diabolik sono io Diabolik sono io \n",
+ "tt9914286 Sokagin Çocuklari Sokagin Çocuklari \n",
+ "\n",
+ " is_adult year length genres imdb_rating \\\n",
+ "tconst \n",
+ "tt0000009 False 1894 45 Romance 5.5 \n",
+ "tt0000147 False 1897 20 Documentary,News,Sport 5.2 \n",
+ "tt0000335 False 1900 <NA> Biography,Drama 6.3 \n",
+ "tt0000574 False 1906 70 Biography,Crime,Drama 6.2 \n",
+ "tt0000615 False 1907 <NA> Drama 4.8 \n",
+ "... ... ... ... ... ... \n",
+ "tt9910930 False 2019 75 Crime,Documentary 4.6 \n",
+ "tt9911774 False 2019 130 Drama 8.5 \n",
+ "tt9913056 False 2019 86 Documentary 6.2 \n",
+ "tt9913084 False 2019 75 Documentary 6.2 \n",
+ "tt9914286 False 2019 98 Drama,Family 9.8 \n",
+ "\n",
+ " imdb_votes \n",
+ "tconst \n",
+ "tt0000009 77 \n",
+ "tt0000147 289 \n",
+ "tt0000335 39 \n",
+ "tt0000574 505 \n",
+ "tt0000615 14 \n",
+ "... ... \n",
+ "tt9910930 5 \n",
+ "tt9911774 363 \n",
+ "tt9913056 5 \n",
+ "tt9913084 6 \n",
+ "tt9914286 72 \n",
+ "\n",
+ "[232496 rows x 8 columns]"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Alternative 2 (also fine)\n",
+ "pd.merge(movies, ratings, left_index=True, right_index=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title original_title \\\n",
+ "tconst \n",
+ "tt0000009 Miss Jerry Miss Jerry \n",
+ "tt0000147 The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight \n",
+ "tt0000335 Soldiers of the Cross Soldiers of the Cross \n",
+ "tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang \n",
+ "tt0000615 Robbery Under Arms Robbery Under Arms \n",
+ "... ... ... \n",
+ "tt9910930 Jeg ser deg Jeg ser deg \n",
+ "tt9911774 Padmavyuhathile Abhimanyu Padmavyuhathile Abhimanyu \n",
+ "tt9913056 Swarm Season Swarm Season \n",
+ "tt9913084 Diabolik sono io Diabolik sono io \n",
+ "tt9914286 Sokagin Çocuklari Sokagin Çocuklari \n",
+ "\n",
+ " is_adult year length genres imdb_rating \\\n",
+ "tconst \n",
+ "tt0000009 False 1894 45 Romance 5.5 \n",
+ "tt0000147 False 1897 20 Documentary,News,Sport 5.2 \n",
+ "tt0000335 False 1900 <NA> Biography,Drama 6.3 \n",
+ "tt0000574 False 1906 70 Biography,Crime,Drama 6.2 \n",
+ "tt0000615 False 1907 <NA> Drama 4.8 \n",
+ "... ... ... ... ... ... \n",
+ "tt9910930 False 2019 75 Crime,Documentary 4.6 \n",
+ "tt9911774 False 2019 130 Drama 8.5 \n",
+ "tt9913056 False 2019 86 Documentary 6.2 \n",
+ "tt9913084 False 2019 75 Documentary 6.2 \n",
+ "tt9914286 False 2019 98 Drama,Family 9.8 \n",
+ "\n",
+ " imdb_votes \n",
+ "tconst \n",
+ "tt0000009 77 \n",
+ "tt0000147 289 \n",
+ "tt0000335 39 \n",
+ "tt0000574 505 \n",
+ "tt0000615 14 \n",
+ "... ... \n",
+ "tt9910930 5 \n",
+ "tt9911774 363 \n",
+ "tt9913056 5 \n",
+ "tt9913084 6 \n",
+ "tt9914286 72 \n",
+ "\n",
+ "[232496 rows x 8 columns]"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Alternative 3 (less \"semantic\")\n",
+ "pd.concat([movies, ratings], axis=\"columns\", join=\"inner\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's try to reproduce the ranking of the 250 best films on IMDb (see https://www.imdb.com/chart/top/?ref_=nv_mv_250):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title original_title is_adult year \\\n",
+ "0 The Chaos Class Hababam Sinifi False 1975 \n",
+ "1 The Shawshank Redemption The Shawshank Redemption False 1994 \n",
+ "2 The Mountain II Dag II False 2016 \n",
+ "3 CM101MMXI Fundamentals CM101MMXI Fundamentals False 2013 \n",
+ "4 The Godfather The Godfather False 1972 \n",
+ ".. ... ... ... ... \n",
+ "245 12 Years a Slave 12 Years a Slave False 2013 \n",
+ "246 The Sixth Sense The Sixth Sense False 1999 \n",
+ "247 The Passion of Joan of Arc La passion de Jeanne d'Arc False 1928 \n",
+ "248 Barfi! Barfi! False 2012 \n",
+ "249 Platoon Platoon False 1986 \n",
+ "\n",
+ " length genres imdb_rating imdb_votes \n",
+ "0 87 Comedy,Drama 9.4 33394 \n",
+ "1 142 Drama 9.3 2071759 \n",
+ "2 135 Action,Drama,War 9.3 100095 \n",
+ "3 139 Comedy,Documentary 9.2 41327 \n",
+ "4 175 Crime,Drama 9.2 1421495 \n",
+ ".. ... ... ... ... \n",
+ "245 134 Biography,Drama,History 8.1 571204 \n",
+ "246 107 Drama,Mystery,Thriller 8.1 836928 \n",
+ "247 110 Biography,Drama,History 8.1 40107 \n",
+ "248 151 Comedy,Drama,Romance 8.1 68274 \n",
+ "249 120 Drama,War 8.1 348628 \n",
+ "\n",
+ "[250 rows x 8 columns]"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# The best ones (as of June 2019)\n",
+ "(movies_with_rating\n",
+ " .query(\"imdb_votes > 25000\") # Only films with more than 25000 votes count\n",
+ " .sort_values(\"imdb_rating\", ascending=False) # IMDb also weights individual votes here (weights we don't know)\n",
+ " .reset_index(drop=True)\n",
+ ").iloc[:250]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The list picked up films that clear the vote threshold only narrowly. We have good reason to suspect that IMDb changed this criterion long ago. Requiring 250,000 votes gets us much closer:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
"
+ ],
+ "text/plain": [
+ " rank title studio lifetime_gross \\\n",
+ "0 1 Star Wars: The Force Awakens BV 936662225 \n",
+ "1 2 Avatar Fox 760507625 \n",
+ "2 3 Black Panther BV 700059566 \n",
+ "3 4 Avengers: Infinity War BV 678815482 \n",
+ "4 5 Titanic Par. 659363944 \n",
+ "... ... ... ... ... \n",
+ "16262 16263 Dog Eat Dog IFC 80 \n",
+ "16263 16264 Paranoid Girls NaN 78 \n",
+ "16264 16265 Confession of a Child of the Century Cohen 74 \n",
+ "16265 16266 Storage 24 Magn. 72 \n",
+ "16266 16267 Zyzzyx Road Reg. 30 \n",
+ "\n",
+ " year \n",
+ "0 2015 \n",
+ "1 2009 \n",
+ "2 2018 \n",
+ "3 2018 \n",
+ "4 1997 \n",
+ "... ... \n",
+ "16262 2009 \n",
+ "16263 2015 \n",
+ "16264 2015 \n",
+ "16265 2013 \n",
+ "16266 2006 \n",
+ "\n",
+ "[16267 rows x 5 columns]"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "boxoffice_raw"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "rank int64\n",
+ "title object\n",
+ "studio object\n",
+ "lifetime_gross int64\n",
+ "year int64\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "boxoffice_raw.dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We could essentially be happy with this; we'll only rename `rank` so that after joining we know which table the column came from."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "boxoffice = (boxoffice_raw\n",
+ " .rename({\n",
+ " \"rank\": \"boxoffice_rank\"\n",
+ " }, axis=\"columns\")\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now let's try joining. In this case we can't rely on the index (`boxoffice` comes from a different source and knows nothing about any IMDb film ID), so we explicitly specify which column (or columns) must match - that is what the `on` argument is for:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title original_title is_adult year (imdb) \\\n",
+ "1643 Pinocchio Pinocchio False 1940 \n",
+ "1644 Pinocchio Pinocchio False 1940 \n",
+ "1645 Pinocchio Turlis Abenteuer False 1967 \n",
+ "1646 Pinocchio Turlis Abenteuer False 1967 \n",
+ "1647 Pinocchio Pinocchio False 1971 \n",
+ "1648 Pinocchio Pinocchio False 1971 \n",
+ "1649 Pinocchio Pinocchio False 1911 \n",
+ "1650 Pinocchio Pinocchio False 1911 \n",
+ "1651 Pinocchio Pinocchio False 2002 \n",
+ "1652 Pinocchio Pinocchio False 2002 \n",
+ "1653 Pinocchio Un burattino di nome Pinocchio False 1971 \n",
+ "1654 Pinocchio Un burattino di nome Pinocchio False 1971 \n",
+ "1655 Pinocchio Pinocchio False 2012 \n",
+ "1656 Pinocchio Pinocchio False 2012 \n",
+ "1657 Pinocchio Pinocchio False 2015 \n",
+ "1658 Pinocchio Pinocchio False 2015 \n",
+ "1659 Pinocchio Pinocchio False 2015 \n",
+ "1660 Pinocchio Pinocchio False 2015 \n",
+ "\n",
+ " length genres imdb_rating imdb_votes \\\n",
+ "1643 88 Animation,Comedy,Family 7.5 114689 \n",
+ "1644 88 Animation,Comedy,Family 7.5 114689 \n",
+ "1645 75 Adventure,Family,Fantasy 7.2 19 \n",
+ "1646 75 Adventure,Family,Fantasy 7.2 19 \n",
+ "1647 79 Comedy,Fantasy 3.5 123 \n",
+ "1648 79 Comedy,Fantasy 3.5 123 \n",
+ "1649 50 Fantasy 5.9 69 \n",
+ "1650 50 Fantasy 5.9 69 \n",
+ "1651 108 Comedy,Family,Fantasy 4.3 7192 \n",
+ "1652 108 Comedy,Family,Fantasy 4.3 7192 \n",
+ "1653 96 Animation,Family,Fantasy 7.0 117 \n",
+ "1654 96 Animation,Family,Fantasy 7.0 117 \n",
+ "1655 75 Animation,Family,Fantasy 6.3 218 \n",
+ "1656 75 Animation,Family,Fantasy 6.3 218 \n",
+ "1657 <NA> Family,Fantasy 4.9 43 \n",
+ "1658 <NA> Family,Fantasy 4.9 43 \n",
+ "1659 75 Documentary 6.8 8 \n",
+ "1660 75 Documentary 6.8 8 \n",
+ "\n",
+ " boxoffice_rank studio lifetime_gross year (boxoffice) \n",
+ "1643 885 Dis. 84254167 1940 \n",
+ "1644 6108 Mira. 3684305 2002 \n",
+ "1645 885 Dis. 84254167 1940 \n",
+ "1646 6108 Mira. 3684305 2002 \n",
+ "1647 885 Dis. 84254167 1940 \n",
+ "1648 6108 Mira. 3684305 2002 \n",
+ "1649 885 Dis. 84254167 1940 \n",
+ "1650 6108 Mira. 3684305 2002 \n",
+ "1651 885 Dis. 84254167 1940 \n",
+ "1652 6108 Mira. 3684305 2002 \n",
+ "1653 885 Dis. 84254167 1940 \n",
+ "1654 6108 Mira. 3684305 2002 \n",
+ "1655 885 Dis. 84254167 1940 \n",
+ "1656 6108 Mira. 3684305 2002 \n",
+ "1657 885 Dis. 84254167 1940 \n",
+ "1658 6108 Mira. 3684305 2002 \n",
+ "1659 885 Dis. 84254167 1940 \n",
+ "1660 6108 Mira. 3684305 2002 "
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.merge(\n",
+ " movies_with_rating,\n",
+ " boxoffice,\n",
+ " suffixes=[\" (imdb)\", \" (boxoffice)\"],\n",
+ " on=\"title\"\n",
+ ").query(\"title == 'Pinocchio'\") # \"One\" sample film"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Oops, that's probably not what we wanted. There are many different Pinocchios, and each of them got joined with both films of that name from `boxoffice`. The lesson: when joining, think about whether the values in the key column are unique. A film title clearly is not.\n",
+ "\n",
+ "In our particular case we spotted the problem ourselves, but if a duplicate key were buried among millions of values, we'd like the computer to catch it for us. That is what the `validate` argument is for - depending on the relationship you expect between the tables, the allowed values are `\"one_to_one\"`, `\"one_to_many\"`, `\"many_to_one\"` or `\"many_to_many\"`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title \\\n",
+ "0 Oliver Twist \n",
+ "1 Oliver Twist \n",
+ "2 Oliver Twist \n",
+ "3 Oliver Twist \n",
+ "4 Oliver Twist \n",
+ "... ... \n",
+ "20562 BTS World Tour: Love Yourself in Seoul \n",
+ "20563 Mojin: The Worm Valley \n",
+ "20564 Extreme Job \n",
+ "20565 Peppa Celebrates Chinese New Year \n",
+ "20566 Avant qu'on explose \n",
+ "\n",
+ " original_title is_adult year (imdb) length \\\n",
+ "0 Oliver Twist False 1912 <NA> \n",
+ "1 Oliver Twist False 1912 <NA> \n",
+ "2 Oliver Twist False 1916 50 \n",
+ "3 Oliver Twist False 1922 98 \n",
+ "4 Oliver Twist False 1933 80 \n",
+ "... ... ... ... ... \n",
+ "20562 BTS World Tour: Love Yourself in Seoul False 2019 112 \n",
+ "20563 Yun nan chong gu False 2018 110 \n",
+ "20564 Geukhanjikeob False 2019 111 \n",
+ "20565 xiao zhu pei qi guo da nian False 2019 81 \n",
+ "20566 Avant qu'on explose False 2019 108 \n",
+ "\n",
+ " genres imdb_rating imdb_votes boxoffice_rank studio \\\n",
+ "0 Drama 4.7 19 6826 Sony \n",
+ "1 Drama 4.4 12 6826 Sony \n",
+ "2 Drama 6.6 16 6826 Sony \n",
+ "3 Drama 6.8 657 6826 Sony \n",
+ "4 Drama 5.0 292 6826 Sony \n",
+ "... ... ... ... ... ... \n",
+ "20562 Documentary,Music 8.5 439 6173 Fathom \n",
+ "20563 Action,Fantasy 4.7 120 11240 WGUSA \n",
+ "20564 Action,Comedy 7.3 905 7212 CJ \n",
+ "20565 Animation,Family 3.4 41 10811 STX \n",
+ "20566 Comedy 6.9 41 10995 EOne \n",
+ "\n",
+ " lifetime_gross year (boxoffice) \n",
+ "0 2080321 2005 \n",
+ "1 2080321 2005 \n",
+ "2 2080321 2005 \n",
+ "3 2080321 2005 \n",
+ "4 2080321 2005 \n",
+ "... ... ... \n",
+ "20562 3509917 2019 \n",
+ "20563 101516 2019 \n",
+ "20564 1548816 2019 \n",
+ "20565 131225 2019 \n",
+ "20566 116576 2019 \n",
+ "\n",
+ "[20567 rows x 12 columns]"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.merge(\n",
+ " movies_with_rating,\n",
+ " boxoffice,\n",
+ " on=\"title\",\n",
+ " suffixes=[\" (imdb)\", \" (boxoffice)\"],\n",
+ "# validate=\"one_to_one\" # Uncomment and an error pops up!\n",
+ ")"
+ ]
+ },
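+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of what the `validate` check does, on two hypothetical toy tables (`left` and `right` are illustrative names, not part of the lesson's data; the gross figures are copied from the outputs above). It assumes a pandas version that exposes `pd.errors.MergeError`. Because the title repeats on both sides, requesting a one-to-one relationship should make the merge fail instead of silently multiplying rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical toy tables: the title \"Pinocchio\" appears twice on each side\n",
+ "left = pd.DataFrame({\n",
+ "    \"title\": [\"Pinocchio\", \"Pinocchio\", \"Playback\"],\n",
+ "    \"year\": [1940, 2002, 2012],\n",
+ "})\n",
+ "right = pd.DataFrame({\n",
+ "    \"title\": [\"Pinocchio\", \"Pinocchio\", \"Playback\"],\n",
+ "    \"lifetime_gross\": [84254167, 3684305, 264],\n",
+ "})\n",
+ "\n",
+ "try:\n",
+ "    pd.merge(left, right, on=\"title\", validate=\"one_to_one\")\n",
+ "except pd.errors.MergeError as error:\n",
+ "    # The duplicated key is reported instead of producing 2 x 2 Pinocchio rows\n",
+ "    print(\"Merge rejected:\", error)"
+ ]
+ },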
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The solution is simple - we'll join on two columns (the `on` argument can handle that ;-)). While we're at it, we realize it makes no sense to join films that have no year listed at all, so we drop them:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " title original_title is_adult year length genres \\\n",
+ "6926 Playback Playback False 2012 98 Horror,Thriller \n",
+ "6927 Playback Playback False 2012 113 Drama \n",
+ "6928 Playback Dur d'être Dieu False 2012 66 Documentary \n",
+ "\n",
+ " imdb_rating imdb_votes boxoffice_rank studio lifetime_gross \n",
+ "6926 4.3 4478 16256 Magn. 264 \n",
+ "6927 4.9 27 16256 Magn. 264 \n",
+ "6928 5.2 8 16256 Magn. 264 "
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(\n",
+ " pd.merge(\n",
+ " movies_with_rating.dropna(subset=[\"year\"]), # Drop all rows without a year\n",
+ " boxoffice,\n",
+ " on=[\"title\", \"year\"],\n",
+ " validate=\"many_to_one\", # movies_with_rating still isn't unique!\n",
+ " )\n",
+ ").query(\"title == 'Playback'\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Still not unique! What now?\n",
+ "\n",
+ "**Hypothesis:** We're stepping onto dangerous ground and will speculate that revenue information most likely exists only for the most popular films. Maybe we're right, maybe not, and we'll probably introduce some small inaccuracy, but getting at the actual truth here is \"expensive\" (and possibly literally expensive), and it can't be done reliably from the datasets on offer.\n",
+ "\n",
+ "To get as close to reality as we can, from each repeated (title, year) pair we pick the film with the highest `imdb_votes`. First we sort all films by votes with `sort_values`, then call `drop_duplicates(..., keep=\"first\")`, which keeps only the first row of each group of duplicates:"
+ ]
+ },
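+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The sort-then-deduplicate step can be sketched on a hypothetical toy table (`toy` and `deduplicated` are illustrative names; the vote counts are the three \"Playback\" rows from the output above). This is only an illustration of the technique, assuming a descending sort on `imdb_votes` so that `keep=\"first\"` retains the film with the most votes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical toy table: three films sharing the same (title, year) pair\n",
+ "toy = pd.DataFrame({\n",
+ "    \"title\": [\"Playback\", \"Playback\", \"Playback\"],\n",
+ "    \"year\": [2012, 2012, 2012],\n",
+ "    \"imdb_votes\": [27, 4478, 8],\n",
+ "})\n",
+ "\n",
+ "deduplicated = (toy\n",
+ "    .sort_values(\"imdb_votes\", ascending=False)  # Film with the most votes comes first\n",
+ "    .drop_duplicates(subset=[\"title\", \"year\"], keep=\"first\")  # Keep one row per (title, year)\n",
+ ")\n",
+ "deduplicated"
+ ]
+ },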
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
"
+ ],
+ "text/plain": [
+ "Empty DataFrame\n",
+ "Columns: [title_type, title, original_title, is_adult, start_year, end_year, length, genres, Title, RatingTomatometer, No. of Reviews]\n",
+ "Index: []"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Ready to merge?\n",
+ "pd.merge(imdb_titles, rotten_tomatoes_nodup, left_on=\"title\", right_on=\"Title\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "0 rows!\n",
+ "\n",
+ "So far we have manipulated rows and columns as wholes, but now we have to reach directly into the cell values. That is not unusual when combining data from different sources. Our task is to turn strings like \"Black Panther (2018)\" into two values: the title \"Black Panther\" and the year 2018 (one column into two).\n",
+ "\n",
+ "Fortunately, we can easily create those columns with the string method [`.str.slice`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html), which cuts a part out of each string (and again works on the whole column - the result is a new column with the function applied to each value). We'll trust that the four characters just before the last one are the year and that the rest, apart from the parentheses, is the actual title:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " tomatoes_rating tomatoes_votes title year\n",
+ "0 97 444 Black Panther 2018\n",
+ "1 97 394 Mad Max: Fury Road 2015\n",
+ "2 93 410 Wonder Woman 2017\n",
+ "3 99 118 Metropolis 1927\n",
+ "4 97 308 Coco 2017\n",
+ "... ... ... ... ...\n",
+ "1585 15 97 Priest 2011\n",
+ "1586 14 103 American Outlaws 2001\n",
+ "1587 15 54 September Dawn 2007\n",
+ "1588 12 147 Jonah Hex 2010\n",
+ "1589 2 51 Texas Rangers 2001\n",
+ "\n",
+ "[947 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rotten_tomatoes_beta = (rotten_tomatoes_nodup\n",
+ " .assign(\n",
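+ " # Drop the trailing \" (YYYY)\" suffix (7 characters) to get the title\n",
+ " title=rotten_tomatoes_nodup[\"Title\"].str.slice(0, -7),\n",
+ " # The four characters before the closing parenthesis are the year\n",
+ " year=rotten_tomatoes_nodup[\"Title\"].str.slice(-5, -1).astype(int)\n",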
+ " )\n",
+ " .rename({\n",
+ " \"RatingTomatometer\": \"tomatoes_rating\",\n",
+ " \"No. of Reviews\": \"tomatoes_votes\",\n",
+ " }, axis=\"columns\")\n",
+ " .drop([\"Title\"], axis=\"columns\")\n",
+ ")\n",
+ "rotten_tomatoes_beta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The parenthesis odyssey is not over: someone has proactively stuffed the original titles of non-English-language films into parentheses as well. Let's confirm this with the [`.str.contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) method (by default this method searches using regular expressions, which we have not learned to use yet, so we have to switch that off explicitly with the argument `regex=False`):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " tomatoes_rating tomatoes_votes \\\n",
+ "15 100 58 \n",
+ "51 98 46 \n",
+ "61 97 71 \n",
+ "69 98 47 \n",
+ "99 96 139 \n",
+ "... ... ... \n",
+ "1368 97 59 \n",
+ "1457 43 82 \n",
+ "1502 71 52 \n",
+ "1547 83 64 \n",
+ "1559 74 62 \n",
+ "\n",
+ " title year \n",
+ "15 Seven Samurai (Shichinin no Samurai) 1956 \n",
+ "51 Aguirre, the Wrath of God (Aguirre, der Zorn G... 1972 \n",
+ "61 Ghostbusters (1984 Original) 1984 \n",
+ "69 A Fistful of Dollars (Per un Pugno di Dollari) 1964 \n",
+ "99 Embrace Of The Serpent (El Abrazo De La Serpie... 2016 \n",
+ "... ... ... \n",
+ "1368 To Be and to Have (Etre et Avoir) 2003 \n",
+ "1457 Goal! The Dream Begins (Goal!: The Impossible ... 2005 \n",
+ "1502 Only Human (Seres queridos) 2006 \n",
+ "1547 The Good, the Bad, the Weird (Joheun-nom, Nabb... 2010 \n",
+ "1559 Fah talai jone (Tears of the Black Tiger) 2007 \n",
+ "\n",
+ "[66 rows x 4 columns]"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rotten_tomatoes_beta[rotten_tomatoes_beta[\"title\"].str.contains(\")\", regex=False)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To keep things simple, we will therefore also remove all such parenthesized parts. The [`.str.rsplit`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.rsplit.html) method helps here: it splits a string from the right into parts at a separator and puts them into a list. We choose the left parenthesis `\"(\"` as the separator and allow at most one split (`n=1`), so each list has one or two parts:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "41 [Marvel's The Avengers]\n",
+ "61 [Ghostbusters , 1984 Original)]\n",
+ "81 [Mad Max 2: The Road Warrior]\n",
+ "Name: title, dtype: object"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "split_title = (\n",
+ " rotten_tomatoes_beta[\"title\"]\n",
+ " .str.rsplit(\"(\", n=1)\n",
+ ")\n",
+ "split_title.loc[[41, 61, 81]]  # Some lists contain one element, others two"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And how do we now pick the first element of each list?\n",
+ "\n",
+ "💡 The [`apply`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method lets you apply an arbitrary transformation (defined as a function) to every row of a table or every value in a `Series`. We can usually do without it, and we should (which is why we do not cover it in much detail), because it is not very computationally efficient. Here, though, it makes it easier to see what is actually being done, i.e. picking the first element of a list:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "