diff --git a/exercises/Chapter3.ipynb b/exercises/Chapter3.ipynb index 04d7038..4b6edb3 100644 --- a/exercises/Chapter3.ipynb +++ b/exercises/Chapter3.ipynb @@ -22,26 +22,18 @@ ], "source": [ "import os\n", - "\n", + "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", - "from scipy import stats\n", - "import arviz as az\n", "import pymc3 as pm\n", - "from pandas.plotting import parallel_coordinates\n", + "\n", "from IPython.display import SVG, display\n", + "from pandas.plotting import parallel_coordinates\n", + "from scipy import stats\n", "from theano import shared, tensor\n", "\n", - "np.random.seed(seed=0)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ + "np.random.seed(seed=0)\n", "az.style.use('arviz-darkgrid')" ] }, @@ -50,25 +42,28 @@ "metadata": {}, "source": [ "## Question 1\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The priors in this model are\n", + "***\n", + "\n", + "*Check the following definition of a probabilistic model. Identify the likelihood, the prior and the posterior:*\n", "\n", "\\begin{eqnarray}\n", + "y_i \\text{~} Normal(\\mu, \\sigma) \\newline\n", "\\mu \\text{~} Normal(0,10) \\newline\n", "\\sigma \\text{~} \\left|Normal(0,25) \\right|\n", "\\end{eqnarray}\n", "\n", + "The priors in this model are:\n", + "\n", + "\\begin{eqnarray}\n", + "\\mu \\text{~} Normal(0,10) \\newline\n", + "\\sigma \\text{~} \\left|Normal(0,25) \\right|\n", + "\\end{eqnarray}\n", "\n", - "The likelihood in our model is \n", + "\n", + "The likelihood in our model is :\n", "$$ Normal(\\mu, \\sigma)$$\n", "\n", - "And the posterior will be distribution over $\\mu$ and $\\sigma$, but the posterior is not directly specified in the model." + "And the posterior will be a distribution over $\\mu$ and $\\sigma$, but the posterior is not directly specified in the model (it is the result of Bayes formula!)." ] }, { @@ -76,14 +71,11 @@ "metadata": {}, "source": [ "## Question 2\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There are two parameters in this model, $\\mu$ and $\\sigma$" + "***\n", + "\n", + "*For the model in exercise 1, how many parameters will the posterior have? In other words, how many dimensions will it have?*\n", + "\n", + "There are two parameters in this model: $\\mu$ and $\\sigma$. So the posterior is 2-dimensional." 
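For concreteness, the exercise-1 model could be written in PyMC3 roughly like this. This is a minimal sketch, not part of the notebook; the `y_obs` array is just placeholder data so the snippet runs:

```python
# Sketch of the exercise-1 model: two unknowns (mu, sigma), so a 2-D posterior.
import numpy as np
import pymc3 as pm

y_obs = np.random.normal(0, 1, size=50)  # placeholder data, only to make this runnable

with pm.Model() as model_q1:
    mu = pm.Normal("mu", mu=0, sd=10)        # prior on the mean
    sigma = pm.HalfNormal("sigma", sd=25)    # prior: |Normal(0, 25)|
    y = pm.Normal("y", mu=mu, sd=sigma, observed=y_obs)  # likelihood
    trace = pm.sample()  # samples from the joint posterior p(mu, sigma | y)
```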
] }, { @@ -91,19 +83,16 @@ "metadata": {}, "source": [ "## Question 3\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Without expanding the denominator\n", + "***\n", + "\n", + "*Write Bayes' theorem for the model in exercise 1.*\n", + "\n", + "Without expanding the denominator:\n", "\n", - "$$ p(\\mu, \\sigma | y) = \\frac{\\Pi_i Normal(y| \\mu, \\sigma) Normal(\\mu|0,10) |Normal(\\sigma|0,25)|}{p(y)} $$\n", + "$$ p(\\mu, \\sigma | y) = \\frac{\\Pi_i\\; \\bigg( Normal(y| \\mu, \\sigma)\\quad Normal(\\mu|0,10)\\quad HalfNormal(\\sigma|0,25) \\bigg) }{p(y)} $$\n", "\n", - "Expanding the denominator\n", - "$$ p(\\mu, \\sigma | y) = \\frac{\\Pi_i Normal(y| \\mu, \\sigma) Normal(\\mu|0,10) |Normal(\\sigma|0,25)|}{\\int \\int \\Pi_i Normal(y| \\mu, \\sigma) Normal(\\mu|0,10) |Normal(\\sigma|0,25)| d\\mu d\\sigma} $$" + "Expanding the denominator:\n", + "$$ p(\\mu, \\sigma | y) = \\frac{\\Pi_i\\; \\bigg( Normal(y| \\mu, \\sigma)\\quad Normal(\\mu|0,10)\\quad HalfNormal(\\sigma|0,25) \\bigg) }{\\int \\int\\; \\Pi_i\\; \\bigg( Normal(y| \\mu, \\sigma)\\quad Normal(\\mu|0,10)\\quad HalfNormal(\\sigma|0,25) \\bigg)\\; d\\mu\\; d\\sigma} $$" ] }, { @@ -111,23 +100,28 @@ "metadata": {}, "source": [ "## Question 4\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The linear model is \n", + "***\n", + "\n", + "*Check the following model. Identify the linear model and the likelihood. How many parameters does the posterior have?*\n", + "\n", + "\\begin{eqnarray}\n", + "y \\text{~} Normal(\\mu, \\epsilon) \\newline\n", + "\\mu = \\alpha + \\beta x \\newline\n", + "\\alpha \\text{~} Normal(0,10) \\newline\n", + "\\beta \\text{~} Normal(0,1) \\newline\n", + "\\epsilon \\text{~} \\left|Normal(0,25) \\right|\n", + "\\end{eqnarray}\n", + "\n", + "The linear model is:\n", "\\begin{eqnarray}\n", "\\mu = \\alpha + \\beta x\n", "\\end{eqnarray}\n", "\n", "\n", - "The likelihood in our model is \n", + "The likelihood in our model is: \n", "$$ Normal(\\mu, \\epsilon)$$\n", "\n", - "The posterior will have three parameters\n", + "The posterior will have three parameters:\n", "\n", "$$ \\alpha, \\beta, \\epsilon $$" ] @@ -137,14 +131,17 @@ "metadata": {}, "source": [ "## Question 5\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For this exercise we will generate 57 datapoints from a distribution of $N(4,. 5)$" + "***\n", + "\n", + "*For the model in exercise 1, assume that you have a dataset with 57 data points coming from a Gaussian with a mean of 4 and a standard deviation of 0.5. 
Using PyMC3, compute:*\n", + "- The posterior distribution\n", + "- The prior distribution\n", + "- The posterior predictive distribution\n", + "- The prior predictive distribution\n", + "\n", + "*Tip: Besides `pm.sample()`, PyMC3 has other functions to compute samples.*\n", + "\n", + "For this exercise we will generate 57 datapoints from a distribution of $Normal(4, 0.5)$:" ] }, { @@ -191,15 +188,6 @@ " posterior_predictive = pm.sample_posterior_predictive(trace)" ] }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = az.from_pymc3(trace=trace, posterior_predictive=posterior_predictive, prior=prior_predictive)" - ] - }, { "cell_type": "code", "execution_count": 6, @@ -222,6 +210,7 @@ } ], "source": [ + "dataset = az.from_pymc3(trace=trace, posterior_predictive=posterior_predictive, prior=prior_predictive)\n", "dataset" ] }, @@ -262,14 +251,14 @@ ], "source": [ "# The plot_posterior method can be used to plot priors as well\n", - "az.plot_posterior(dataset.prior, var_names=[\"mu\", \"sd\"])" + "az.plot_posterior(dataset.prior, var_names=[\"mu\", \"sd\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We'll also plot the posterior as well to check the distributions after updates. You'll notice that the posterior for SD is bimodel, this is a result of our model definition which dates the absolute value of sd." + "Now let's plot the posterior, to check the distributions after update:" ] }, { @@ -305,6 +294,13 @@ "az.plot_posterior(dataset)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The exercise also asks for the prior predictive values. We'll need to do some data manipulation to get the data into a format we can use with ArviZ:" + ] + }, { "cell_type": "code", "execution_count": 9, @@ -402,14 +398,14 @@ } ], "source": [ - "az.plot_kde(prior_predictive)" + "az.plot_kde(prior_predictive);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We can them compare this to the posterior predictive distribution" + "We can them compare this to the posterior predictive distribution:" ] }, { @@ -440,7 +436,7 @@ } ], "source": [ - "az.plot_ppc(dataset)" + "az.plot_ppc(dataset);" ] }, { @@ -448,7 +444,9 @@ "metadata": {}, "source": [ "## Question 6\n", - "***" + "***\n", + "\n", + "*Execute `model_g` using NUTS (the default sampler) and then using Metropolis. Compare the results using ArviZ functions like `plot_trace` and `plot_pairs`. Center the variable $x$ and repeat the exercise. What conclusion can you draw from this?*" ] }, { @@ -852,7 +850,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We'll standardize the variables and take samples again. We don't need to redefine the model but will do so for clarities sake" + "Now let's standardize the variables and take samples again. 
We don't need to redefine the model, but we'll do so for clarities sake:" ] }, { @@ -861,7 +859,7 @@ "metadata": {}, "outputs": [], "source": [ - "# or standardize the data\n", + "# standardize the data\n", "x_centered = (x - x.mean())/x.std()\n", "y_centered = (y - y.mean())/y.std()" ] @@ -1029,7 +1027,6 @@ "source": [ "%%time\n", "with model_g_centered:\n", - " # step = pm.Metropolis(vars=[\"μ\"])\n", " step = pm.Metropolis()\n", " trace_mh_centered = pm.sample(step=step)" ] @@ -1118,14 +1115,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Looking through the plots there's a couple things to note\n", + "Looking through the plots there's a couple things to note.\n", "\n", - "Metropolis Hastings is less effective at sampling than NUTS. This is indicated by\n", + "Metropolis Hastings is less effective at sampling than NUTS. This is indicated by:\n", " 1. The Metropolis Hasting trace plots looking \"square\" when compared the NUTS traceplot. This is due to the sampler getting \"stuck\" at a value.\n", " 2. The kernel density estimates of each chain have \"squiggly\" topologies\n", " 3. The effective number of samples for Non Centered Metropolis Hastings is 1\n", " \n", - "One thing to note though is that Metropolis Hastings does sample faster than NUTS. While the the results aren't great credit is due where its deserved!\n", + "One thing to note though is that Metropolis Hastings does sample faster than NUTS. While the results aren't great, credit is due where it's deserved!\n", " \n", "Diving into the problem further, we can see that $\\alpha$ and $\\beta$ are linearly correlated. Metropolis Hastings does not sample well when toplogies have such shapes. We'll talk more about this in Chapter 8, but for now note how centering x helps somewhat with the Metropolis Hasting sampler, as centering decorrelates $\\alpha$ and $\\beta$ parameters.\n", "\n", @@ -1136,15 +1133,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 7\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's import the dataset and create a mask for people older than the age of 18" + "## Question 7\n", + "***\n", + "\n", + "*Use the howell dataset to create a linear model of the weight ($x$) against the height ($y$). Exclude subjects that are younger than 18. Explain the results.*\n", + "\n", + "Let's import the dataset and create a mask for people older than 18:" ] }, { @@ -1288,7 +1282,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When looking at the plot above this is consistent our expectations. As weight increases, height increases as well. From visual inspection it looks like a linear fit with some noise is best. In this case we will assume constant variance. Let's create a model" + "When looking at the plot above this is with consistent our expectations. As weight increases, height increases as well. From visual inspection, it looks like a linear fit with some noise is best. In this case we will assume constant variance. Let's create a model:" ] }, { @@ -1420,22 +1414,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "From visual inspection the average parameters of the fit look quite good, and the 98% interval of the posterior predictive check covers most of the distribution. Overall it looks like a linear fit is great for height vs weight for people over 18." + "From visual inspection the average parameters of the fit look quite good, and the 98% interval of the posterior predictive checks covers most of the distribution. 
Overall, it looks like a linear fit is great for height vs weight for people over 18!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 8\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using our previous fit we can generate datapoints for the height of people for various weights" + "## Question 8\n", + "***\n", + "\n", + "*For four subjects, we get the weights (45.73, 65.8, 54.2, 32.59), but not their heights. Using the model from the previous exercise, predict the height for each subject, together with their 50% and 94% HPDs.*\n", + "\n", + "*Tip 1: Check the [coal mining disaster example](https://docs.pymc.io/notebooks/getting_started.html#Case-study-2:-Coal-mining-disasters) in PyMC3's documentation.*\n", + "\n", + "*Tip 2: Use shared variables.*\n", + "\n", + "Using our previous fit, we can generate predictions for the height of people with various weights:" ] }, { @@ -1493,29 +1488,26 @@ ], "source": [ "ppc_first_col = ppc[\"height_pred\"][:,0]\n", - "az.plot_kde(ppc_first_col)" + "az.plot_kde(ppc_first_col);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The reason we take the 0th value is that in the current version PyMC3 `pm.sample_ppc` returns an array of size samples x observed values. This makes total sense when you want to do a posterior predictive check as you want to compare the simulations/predictions be of the same size/shape as the observations to be able to do one-to-one comparison. But when you just want to make predictions at values other that the input-data that restriction does not apply. In summary I consider this a limitation of the current implementation of `pm.sample_ppc`" + "The reason we take the 0th column is that, in the current version of PyMC3, `pm.sample_ppc` returns an array of size $samples\\; \\times\\; observed\\; values$. This makes total sense when you want to do posterior predictive checks, as you want to check that the simulations/predictions have the same size/shape as the observations. But when you just want to make predictions at values other that the input data, that restriction does not apply. In summary, I consider this a limitation of the current implementation of `pm.sample_ppc`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 9\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's take a look at the data again with out the age limits" + "## Question 9\n", + "***\n", + "\n", + "*Repeat exercise 7, this time including those below 18 years old. Explain the results.*\n", + "\n", + "Let's take a look at the data again, now without the age limit:" ] }, { @@ -1545,7 +1537,7 @@ } ], "source": [ - "howell.plot(kind=\"scatter\", x=\"weight\", y=\"height\")" + "howell.plot(kind=\"scatter\", x=\"weight\", y=\"height\");" ] }, { @@ -1554,9 +1546,9 @@ "source": [ "By removing the age limit we notice a different trend. At lower weights, a single unit of weight generally corresponds to more height. At higher weights however the height still goes up, but by a lesser amount. There also seems to be more \"spread\" in the higher weights, than in the lower weights.\n", "\n", - "Intuitively again this makes sense. Weight is a proxy for age, and when born the variability in height and weight is smaller than for adults. Additionally children tend to grow in both height and weight. 
Unfortunately once humans reach adulthood, the height is mostly fixed, and the weight unfortunately changes all too easily.\n", + "Intuitively again this makes sense. Weight is a proxy for age, and when born the variability in height and weight is smaller than for adults. Additionally children tend to grow in both height and weight. Once humans reach adulthood, the height is mostly fixed, and the weight unfortunately changes all too easily.\n", "\n", - "Another thing to note is the shape of the distribution, it no longer looks linear throughout, but instead looks more like a curve. We could use a square root linear fit like earlier in the chapter but we instead will use a logarithmic fit. We will also model the noise term to be correlated with weight" + "Another thing to note is the shape of the distribution: it no longer looks linear throughout, but instead looks more like a curve. We could use a square root linear fit, like earlier in the chapter, but we instead will use a logarithmic fit. We will also model the noise term to be correlated with weight, as heights seem to vary more when weights get higher." ] }, { @@ -1629,7 +1621,7 @@ } ], "source": [ - "az.plot_trace(trace_heights, var_names = [\"α\",\"β\", \"γ\", \"δ\"])" + "az.plot_trace(trace_heights, var_names = [\"α\",\"β\", \"γ\", \"δ\"]);" ] }, { @@ -1663,21 +1655,21 @@ "\n", "ax.plot(weight, height, \"C0.\")\n", "μ_m = trace_heights[\"μ\"].mean(0)\n", - "ϵ_m = trace_heights[\"ϵ\"].mean(0)\n", "\n", "order = np.argsort(weight)\n", "ax.plot(weight[order], μ_m[order], c=\"k\")\n", "\n", "az.plot_hpd(weight, trace_heights[\"μ\"], credible_interval=.98)\n", "az.plot_hpd(weight, ppc_heights[\"height_pred\"], credible_interval=.98, color=\"gray\")\n", - "fig.suptitle(\"Weight vs Height fit and posterior predictive checks\")" + "\n", + "fig.suptitle(\"Weight vs Height fit and posterior predictive checks\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let's also plot noise as a function of weight" + "Let's also plot the noise as a function of weight:" ] }, { @@ -1709,32 +1701,29 @@ "source": [ "fig, ax = plt.subplots()\n", "ax.plot(weight, trace_heights[\"ϵ\"].mean(0))\n", - "az.plot_hpd(weight, trace_heights[\"ϵ\"], credible_interval=.98)" + "az.plot_hpd(weight, trace_heights[\"ϵ\"], credible_interval=.98);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "It can be seen that in lower weight ranges there tends to be less variability in weight relative to height, versus when older in life. This makes sense with intuition as humans start out roughly the same in their earlier years, and tend to become more different in physical dimensions as they grow older in age and weight" + "We can see that in lower weight ranges there tends to be less variability in height than for bigger weight ranges (i.e when people are older). This makes sense intuitively, as humans start out roughly the same in their earlier years, and tend to become more different in physical dimensions as they grow older in age and weight." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 10\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's fit the model one more time but with a 2nd order polynomial that follows this definition\n", + "## Question 10\n", + "***\n", + "\n", + "*It is known that for many species the weight does not scale with the height, but with the logarithm of the weight. Use this information to fit the howell data (including subjects from all ages). 
Do one more model, this time without using the logarithm but instead a second order polynomial. Compare and explain both results.*\n", + "\n", + "We did the logarithm bit in the previous exercise, so let's directly fit the model with a 2nd order polynomial that follows this definition:\n", "$$\\mu = \\alpha + \\beta_0*x + \\beta_1*x^2$$\n", "\n", - "Note that we could have used the dot product like the model_mlr example, but in this model we chose to explicitly split out the terms" + "Note that we could have used the dot product like in the `model_mlr` example, but in this model we chose to explicitly split out the terms." ] }, { @@ -1764,10 +1753,10 @@ " \n", " weight_shared = shared(weight.values * 1.)\n", " \n", - " μ = pm.Deterministic(\"μ\", α+β[0]*weight_shared + β[1]*weight_shared**2)\n", - " ϵ = pm.Deterministic(\"ϵ\", γ+δ*weight_shared)\n", + " μ = pm.Deterministic(\"μ\", α + β[0] * weight_shared + β[1] * weight_shared**2)\n", + " ϵ = pm.Deterministic(\"ϵ\", γ + δ * weight_shared)\n", " \n", - " height_pred_polynomial = pm.Normal(\"height_pred\", mu=μ, sd=ϵ, observed = height)\n", + " height_pred_polynomial = pm.Normal(\"height_pred\", mu=μ, sd=ϵ, observed=height)\n", " trace_heights_polynomial = pm.sample(tune=2000)\n", " ppc_heights_polynomial = pm.sample_posterior_predictive(trace_heights_polynomial, samples=2000)" ] @@ -1845,7 +1834,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For weights up until around ~50 units the polynomial fit looks good. However past that point the curve starts dropping dropping. Inuitively this does not make sense, that as people weigh more their height increases, until a certain point it starts dropping again. This phenonema is not a property of our data, but of our model choice. Polynomial functions always have to make N-1 turns, where N is the degree of the polynomial. This doesn't necessarily make our model useless, it seems to do a good job in certain parts of the domain, but as a statistical modeler its up to you to understand the tools in your toolbox and the tradeoffs of each." + "For weights up until around ~50 units, the polynomial fit looks good. However past that point the curve starts dropping. Inuitively this does not make sense. This phenonemon is not a property of our data, but of our model choice. Polynomial functions always have to make N-1 turns, where N is the degree of the polynomial. This doesn't necessarily make our model useless, it seems to do a good job in certain parts of the domain, but as a statistical modeler, it's up to you to understand the tools in your toolbox and the tradeoffs of each." ] }, { @@ -1853,7 +1842,9 @@ "metadata": {}, "source": [ "## Question 11\n", - "***" + "***\n", + "\n", + "*Think about a model that's able to fit the first three dataset from the Anscombe quartet. Also, think about a model to fit the fourth dataset.*" ] }, { @@ -1868,13 +1859,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A model that might fit the first three models is a polynomial regression of form\n", + "A model that might fit the first three models is a polynomial regression of the form:\n", "\n", "$ y = \\alpha_2 x^2 + \\alpha_1 x + \\alpha_0 $\n", "\n", "For the more linear datasets the model could have a low value for $\\alpha_2$, and for the second dataset the model would be able to fit the non linearity.\n", "\n", - "For the last dataset there seems to be two distinct patterns, a cluster of points at x=8 and one at x=19. 
We could model this one with two groups as follows" + "For the last dataset there seems to be two distinct patterns, a cluster of points at x=8 and one at x=19. We could model this one with two groups as follows:" ] }, { @@ -1954,10 +1945,11 @@ "with pm.Model() as anscombe:\n", " \n", " # Two groups of points with independent parameters\n", + " mu = pm.Normal(\"mu\", sd=10, shape=2)\n", " sd = pm.HalfNormal(\"sd\", sd=10)\n", - " mu = pm.Normal(\"mu\", shape=2, sd=10)\n", " \n", " y = pm.Normal(\"y\", mu=mu[idx], sd=sd, observed=df[\"y\"].values)\n", + " \n", " trace_4 = pm.sample(draws=10000)\n", " ppc = pm.sample_posterior_predictive(trace_4)" ] @@ -2003,15 +1995,17 @@ } ], "source": [ - "az.plot_trace(trace_4)" + "az.plot_trace(trace_4);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 12\n", - "***" + "## Question 12\n", + "***\n", + "\n", + "*See in the code accompanying the book the `model_t2` (and the data associated with it). Experiment with priors for $\\nu$, like the non-shifted exponential and gamma priors (they are commented in the code below). Plot the prior distribution, to ensure that you understand them. An easy way to do this is to just comment the likelihood in the model and check the trace plot. A more efficient way though is to use the `pm.sample_prior_predictive()` function instead of `pm.sample()`.*" ] }, { @@ -2053,6 +2047,7 @@ " # ν = pm.Gamma('ν', 2, 0.1)\n", "\n", " y_pred = pm.StudentT('y_pred', mu=α + β * x_4, sd=ϵ, nu=ν, observed=y_4)\n", + " \n", " prior_v_exp = pm.sample_prior_predictive(2000)\n", " trace_v_exp = pm.sample(2000)" ] @@ -2095,7 +2090,7 @@ } ], "source": [ - "az.plot_trace(data_exp.prior, var_names=[\"ν\"])" + "az.plot_trace(data_exp.prior, var_names=[\"ν\"]);" ] }, { @@ -2125,6 +2120,7 @@ " #ν = pm.Gamma('ν', 2, 0.1)\n", "\n", " y_pred = pm.StudentT('y_pred', mu=α + β * x_4, sd=ϵ, nu=ν, observed=y_4)\n", + " \n", " prior_v20_15 = pm.sample_prior_predictive(2000)\n", " trace_v_20 = pm.sample(2000)" ] @@ -2167,7 +2163,7 @@ } ], "source": [ - "az.plot_trace(data_20.prior, var_names=[\"ν\"])" + "az.plot_trace(data_20.prior, var_names=[\"ν\"]);" ] }, { @@ -2197,6 +2193,7 @@ " ν = pm.Gamma('ν', 2, 0.1)\n", "\n", " y_pred = pm.StudentT('y_pred', mu=α + β * x_4, sd=ϵ, nu=ν, observed=y_4)\n", + " \n", " prior_v2_01 = pm.sample_prior_predictive(2000)\n", " trace_v_2_01 = pm.sample(2000)" ] @@ -2239,15 +2236,17 @@ } ], "source": [ - "az.plot_trace(data_2.prior, var_names=[\"ν\"])" + "az.plot_trace(data_2.prior, var_names=[\"ν\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 13\n", - "***" + "## Question 13\n", + "***\n", + "\n", + "*For the `unpooled_model`, change the value of `sd` for the $\\beta$ prior. Try values of 1 and 100. Explore how the estimated slopes change for each group. Which group is more affected by this change?*" ] }, { @@ -2276,7 +2275,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's loop through a couple standard deviation values for the slope parameter of each group" + "Let's loop through a couple standard deviation values for the slope parameter of each group:" ] }, { @@ -2366,15 +2365,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we increase standard deviation of the beta prior, for the slope parameter, we can see that for most of the groups the effect is neglible. 
However for group 7 the prior has a strong effect on the posterior estimation because group 7 only has one data point and the unpooled model doesn't consider the datapoints in the other groups. There simple isn't enough data to \"wash out\" the prior distribution in this case." + "As we increase the standard deviation of the beta prior (the slope parameter), we see that for most of the groups the effect is neglible. However, for group 7 the prior has a strong effect on the posterior estimation because group 7 only has one data point and the unpooled model doesn't consider the datapoints in the other groups. There simply isn't enough data to \"wash out\" the prior distribution in this case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 14\n", - "***" + "## Question 14\n", + "***\n", + "\n", + "*Using the model `hierarchical_model`, repeat Figure 3.18, the one with the eight groups and the eight lines, but this time add the uncertainty to the linear fit.*" ] }, { @@ -2473,8 +2474,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 15\n", - "***" + "## Question 15\n", + "***\n", + "\n", + "*Re-run the `model_mlr` example, this time without centering the data. Compare the uncertainty in the $\\alpha$ parameter for one case and the other. Can you explain these results?*\n", + "\n", + "*Tip: Remember the definition of the $\\alpha$ parameter (also known as the intercept).*" ] }, { @@ -2528,15 +2533,35 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the non centered model the alpha parameter changes to compensate for the position of the points. In other words alpha needs to compensate $beta * X$ distance up or down since the x values are not longer centered around the y axis." + "With the non-centered data, $\\alpha$ changes to compensate for the position of the points. In other words, $\\alpha$ needs to compensate $\\beta X$ distance up or down since the $X$ values are no longer centered around the $y$ axis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Question 16\n", + "***\n", + "\n", + "*Read and run [this notebook](https://docs.pymc.io/notebooks/LKJ.html) from PyMC3's documentation*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Question 17\n", + "***\n", + "\n", + "*Choose a dataset that you find interesting and use it with the simple linear regression model. Be sure to explore the results using ArviZ functions and compute the Pearson correlation coefficient. 
If you do not have an interesting dataset, try searching online, for example [here](https://data.worldbank.org/) or [there](http://users.stat.ufl.edu/~winner/datasets.html).*" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "stat-rethink-pymc3", "language": "python", - "name": "python3" + "name": "stat-rethink-pymc3" }, "language_info": { "codemirror_mode": { @@ -2548,7 +2573,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.7.3" } }, "nbformat": 4, diff --git a/exercises/Chapter4.ipynb b/exercises/Chapter4.ipynb index 598dc88..ae85bc8 100644 --- a/exercises/Chapter4.ipynb +++ b/exercises/Chapter4.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Chapter 4 Exercises 1-5" + "# Chapter 4 Exercises" ] }, { @@ -13,15 +13,17 @@ "metadata": {}, "outputs": [], "source": [ + "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", - "from scipy import stats\n", - "from scipy.special import expit as logistic\n", - "import arviz as az\n", "import pymc3 as pm\n", "import seaborn as sns\n", "import theano.tensor as tt\n", + "\n", + "from scipy import stats\n", + "from scipy.special import expit as logistic\n", + "\n", "np.random.seed(123)" ] }, @@ -30,7 +32,9 @@ "metadata": {}, "source": [ "## Exercise 1\n", - "***" + "***\n", + "\n", + "*Re-run the first model using the petal length and then petal width variables. What are the main differences in the results? How wide or narrow is the 95% HPD interval in each case?*" ] }, { @@ -43,7 +47,7 @@ "df = iris.query(\"species == ('setosa', 'versicolor')\")\n", "y_0 = pd.Categorical(df['species']).codes\n", "\n", - "varnames = ['α', 'β', 'bd']\n" + "varnames = ['α', 'β', 'bd']" ] }, { @@ -141,6 +145,7 @@ " yl = pm.Bernoulli('yl', p=θ, observed=y_0)\n", "\n", " trace_0 = pm.sample(1000)\n", + " \n", " print(\"Feature {} summary\".format(feature))\n", " print(az.summary(trace_0, varnames, credible_interval=.95))" ] @@ -149,7 +154,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "From the results we can see that bd variable HPD is the smallest with sepal length, and increases with petal_width, and petal_length." + "From the results, we can see that the `bd` variable's HPD is the smallest with sepal length, and increases with petal_width and petal_length." ] }, { @@ -157,7 +162,9 @@ "metadata": {}, "source": [ "## Exercise 2\n", - "***" + "***\n", + "\n", + "*Repeat exercise 1, this time using a Student's t-distribution as a weakly informative prior. Try different values of $\\nu$.*" ] }, { @@ -243,7 +250,7 @@ } ], "source": [ - "for nu in [1,10, 30]:\n", + "for nu in [1, 10, 30]:\n", "\n", " x_0 = df[\"petal_length\"].values\n", " x_c = x_0 - x_0.mean()\n", @@ -261,7 +268,8 @@ " yl = pm.Bernoulli('yl', p=θ, observed=y_0)\n", "\n", " trace_0 = pm.sample(1000)\n", - " print(\"Feature {} nu {} summary\".format(feature, nu))\n", + " \n", + " print(f\"Feature {feature} nu {nu} summary\")\n", " print(az.summary(trace_0, varnames, credible_interval=.95))" ] }, @@ -270,7 +278,11 @@ "metadata": {}, "source": [ "## Exercise 3\n", - "***" + "***\n", + "\n", + "*Go back to the first example, the logistic regression for classifying setosa or versicolor given sepal length. Try to solve the same problem using a simple linear regression model, as we saw in chapter 3. How useful is linear regression compared to logistic regression? 
Can the result be interpreted as a probability?*\n", + "\n", + "*Tip: check whether the values of $y$ are restricted to the interval [0,1].*" ] }, { @@ -314,7 +326,9 @@ " μ = α + pm.math.dot(x_c, β)\n", "\n", " yl = pm.Normal('yl', mu=μ, sd=sd, observed=y_0)\n", + " \n", " trace_linear = pm.sample(1000)\n", + " \n", " posterior_predictive_linear = pm.sample_posterior_predictive(trace_linear)\n", " print(az.summary(trace_linear, credible_interval=.95))" ] @@ -350,14 +364,14 @@ ], "source": [ "data = az.from_pymc3(trace=trace_linear, posterior_predictive=posterior_predictive_linear)\n", - "az.plot_ppc(data)" + "az.plot_ppc(data);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "From the posterior predictive check this model is not very useful. We are trying to estimate the probability of a species given a sepal_length, but a number of the posterior predictive check values are below 0 and above 1. As is such the result cannot be interpreted as probability." + "From the posterior predictive checks, this model is not very useful. We are trying to estimate the probability of a species given sepal_length, but a number of the posterior predictive check values are below 0 and above 1. As such, the result cannot be interpreted as a probability." ] }, { @@ -365,7 +379,9 @@ "metadata": {}, "source": [ "## Exercise 4\n", - "***" + "***\n", + "\n", + "*In the example from the \"Interpreting the coefficients of a logistic regression\" section, we changed `sepal_length` by 1 unit. Using figure 4.6, corroborate that the value of `log_odds_versicolor_i` corresponds to the value of `probability_versicolor_i`. Do the same for `log_odds_versicolor_f` and `probability_versicolor_f`. Just by noting that `log_odds_versicolor_f` - `log_odds_versicolor_i` is negative, what can you say about the probability? Use figure 4.6 to help you. Is this result clear to you from the definition of log-odds?*" ] }, { @@ -523,7 +539,6 @@ "log_odds_versicolor_i = (summary['mean'] * [1, x_1, x_2]).sum()\n", "probability_versicolor_i = logistic(log_odds_versicolor_i)\n", "\n", - "\n", "log_odds_versicolor_f = (summary['mean'] * [1, x_1, x_2+1]).sum()\n", "probability_versicolor_f = logistic(log_odds_versicolor_f)\n", "\n", @@ -534,9 +549,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The value of -5.19 is consistent across the summary and our \"hand check\". A log odds value of -5.1 means that as $\\beta_1$ increases, the probability that the species is versicolor decreases, or equivalently as sepal_width increases, the probability the flower is versicolor decreases.\n", + "The value of -5.22 is consistent with the summary and our \"hand check\". A log odds value of -5.22 means that as $x_2$ increases by one unit, the probability that the species is versicolor decreases. Or, equivalently, as sepal width increases, the probability that the flower is versicolor decreases.\n", "\n", - "We can verify with a quick plot" + "We can verify this with a quick plot:" ] }, { @@ -569,22 +584,24 @@ ], "source": [ "colors = df[\"species\"].replace({'setosa':\"blue\", 'versicolor':\"green\"})\n", - "df.plot(kind=\"scatter\", x=\"sepal_length\", y=\"sepal_width\", c=colors)" + "df.plot(kind=\"scatter\", x=\"sepal_length\", y=\"sepal_width\", c=colors);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "As sepal with increases from 3 to 4 we get farther from the green dots, reducing the probability that the flower were seeing is of the virginica species." 
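As a quick aside (not part of the original notebook), the log-odds/probability relationship behind this interpretation can be checked numerically; `logistic` is the inverse of the log-odds (logit) transform, so negative log-odds always correspond to probabilities below 0.5, and a drop in log-odds always means a drop in probability:

```python
# Standalone sketch: log-odds vs. probability (the values here are made up for illustration).
import numpy as np
from scipy.special import expit as logistic  # inverse of the logit / log-odds

for log_odds in [2.0, 0.0, -2.0, -5.2]:
    p = logistic(log_odds)
    print(f"log-odds {log_odds:5.1f} -> p = {p:.4f} "
          f"(check: log(p / (1 - p)) = {np.log(p / (1 - p)):.1f})")
```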
+ "We see that, as sepal width increases from 3 to 4, we get further away from the green dots, reducing the probability that the flower we're seeing is of the versicolor species." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Question 5\n", - "***" + "## Question 5\n", + "***\n", + "\n", + "*Use the same example from the previous exercise. For `model_1`, check how much the log-odds change when increasing `sepal_length` from 5.5 to 6.5 (spoiler: it should be 4.66). How much does the probability change? How does this increase compared to when we increase `sepal_length` from 4.5 to 5.5?*" ] }, { @@ -613,14 +630,14 @@ "x_2 = 3 # sepal_width \n", "\n", "for i in (0,1):\n", - " log_odds_versicolor_i = (summary['mean'] * [1, x_1+i, x_2]).sum()\n", + " log_odds_versicolor_i = (summary['mean'] * [1, x_1 + i, x_2]).sum()\n", " probability_versicolor_i = logistic(log_odds_versicolor_i)\n", "\n", "\n", - " log_odds_versicolor_f = (summary['mean'] * [1, x_1+i+1, x_2]).sum()\n", + " log_odds_versicolor_f = (summary['mean'] * [1, x_1 + i + 1, x_2]).sum()\n", " probability_versicolor_f = logistic(log_odds_versicolor_f)\n", "\n", - " print(f\"\"\"sepal_length_i {x_1+i}, sepal_length_f {x_1+i+1}\n", + " print(f\"\"\"sepal_length_i {x_1 + i}, sepal_length_f {x_1 + i + 1}\n", " Log Odds Change {log_odds_versicolor_f - log_odds_versicolor_i}\n", " Probability Change {probability_versicolor_f - probability_versicolor_i}\n", " \"\"\")" @@ -630,7 +647,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "From from the calculation above we see that while the coefficient stays constant, as it should in linear regression, the probability change is not as large from 5.5 to 6.5 as it is from 4.5 to 5.5. Looking at the graphic this intuitively makes sense as well. In the region of sepal length of 4.5 to 5.5, it's hard to differentiate among the species. However when moving from a sepal length of 5.5 to 6.5, only the virginica species has sepal lengths in that range." + "From the calculation above we see that while the log-odds change stays constant, as it should in linear regression, the probability change is not as large from 5.5 to 6.5 as it is from 4.5 to 5.5. Looking at the graphic this intuitively makes sense as well. When sepal length is at 4.5, the chance that the species is versicolor is very small. When sepal length jumps to 5.5, this probability gets a lot bigger. This means that subsequently going from 5.5 to 6.5 still increases the probability of versicolor, but not as much - because, well, at 5.5 there is already a good chance that the species we're seeing is versicolor." ] }, { @@ -638,7 +655,9 @@ "metadata": {}, "source": [ "## Exercise 6\n", - "***" + "***\n", + "\n", + "*In the example for dealing with unbalanced data, change `df = df[45:]` to `df = df[22:78]`. This will keep roughly the same number of data points, but now the classes will be balanced. Compare the new result with the previous ones. 
Which one is more similar to the example using the complete dataset?*" ] }, { @@ -650,7 +669,8 @@ "iris = pd.read_csv('../code/data/iris.csv')\n", "\n", "df = iris.query(\"species == ('setosa', 'versicolor')\") \n", - "df = df[22:78] \n", + "df = df[22:78]\n", + "\n", "y_3 = pd.Categorical(df['species']).codes \n", "x_n = ['sepal_length', 'sepal_width'] \n", "x_3 = df[x_n].values\n", @@ -740,7 +760,8 @@ ], "source": [ "idx = np.argsort(x_3[:,0]) \n", - "bd = trace_3['bd'].mean(0)[idx] \n", + "bd = trace_3['bd'].mean(0)[idx]\n", + "\n", "plt.scatter(x_3[:,0], x_3[:,1], c= [f'C{x}' for x in y_3]) \n", "plt.plot(x_3[:,0][idx], bd, color='k')\n", "\n", @@ -754,7 +775,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The decision boundary in this plot looks more like the unfiltered dataset as the blue datapoints are largely not contained in the 95% HPD. This indicates that the balanced model, even with less data points, is better able to distinguish between classes." + "The decision boundary in this plot looks more like the unfiltered dataset as the blue data points are largely not contained in the boundary decision's 95% HPD. This indicates that the balanced model, even with less data points, is better able to distinguish between classes." ] }, { @@ -762,14 +783,11 @@ "metadata": {}, "source": [ "## Exercise 7\n", - "***" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lets run the model to have data points for a discussion" + "***\n", + "\n", + "*Suppose instead of a softmax regression we use a simple linear model by coding `setosa = 0`, `versicolor = 1` and `virginica = 2`. Under the simple linear regression model, what will happen if we switch the coding? Will we get the same or different results?*\n", + "\n", + "Lets run the model to have data points for a discussion:" ] }, { @@ -798,7 +816,7 @@ "\n", "with pm.Model() as model_s:\n", " α = pm.Normal('α', mu=0, sd=5, shape=3)\n", - " β = pm.Normal('β', mu=0, sd=5, shape=(4,3))\n", + " β = pm.Normal('β', mu=0, sd=5, shape=(4, 3))\n", " μ = pm.Deterministic('μ', α + pm.math.dot(x_s, β))\n", " \n", " θ = tt.nnet.softmax(μ)\n", @@ -856,7 +874,7 @@ "metadata": {}, "source": [ "#### Conceptual Understanding\n", - "Note that the shape of the trace. The dimensions should read as follows. We have 4000 estimations of the 3 softmax class values for each of the 150 rows in the dataset." + "Note the shape of the trace. The dimensions should read as follows: we have 4000 estimations of the 3 softmax class values for each of the 150 rows in the dataset." ] }, { @@ -864,7 +882,7 @@ "metadata": {}, "source": [ "#### Discussion\n", - "If we changed a softmax model to a linear model regression model a couple things would change. First would be the interpretation of the final output. A softmax prediction estimates the probability of each class, whereas a linear regression would just provide one number as an estimate for the class. The other problem is that a liner regression would output a continous value across all real numbers, and how to define when one class starts and another ends is unclear." + "If we changed the softmax model to a linear regression model a couple things would change. First, the interpretation of the final output would be different. A softmax prediction estimates the probability of each class, whereas a linear regression would just provide one number as an estimate for the class. 
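A tiny, self-contained illustration of that difference (the scores below are made up, not taken from the trace):

```python
# Hypothetical linear scores for the three classes of a single observation.
# Softmax maps them to probabilities that sum to 1; a plain linear regression
# would instead return one unbounded number as the "class" estimate.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])
print(softmax(scores))        # roughly [0.79, 0.18, 0.04]
print(softmax(scores).sum())  # 1.0
```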
The other problem is that a linear regression would output continous values across all real numbers, and how to define when one class starts and another ends is unclear." ] }, { @@ -872,7 +890,9 @@ "metadata": {}, "source": [ "## Exercise 8\n", - "***" + "***\n", + "\n", + "*Compare the likelihood of the logistic model versus the likelihood of the LDA model. Use the `sample_posterior_predictive` function to generate predicted data and compare the types of data you get for both cases. Be sure you understand the difference between the types of data the model predicts.*" ] }, { @@ -923,7 +943,8 @@ " θ = pm.Deterministic(\"θ\", pm.math.sigmoid(μ))\n", " bd = pm.Deterministic(\"bd\", -α / β )\n", " \n", - " y1= pm.Bernoulli(\"y1\", p=θ, observed = y_3)\n", + " y1= pm.Bernoulli(\"y1\", p=θ, observed=y_3)\n", + " \n", " trace_logistic = pm.sample(size=2000)\n", " ppc_logistic = pm.sample_posterior_predictive(trace_logistic)" ] @@ -982,7 +1003,7 @@ "Setosa_{sepal\\_length} \\text{~} Normal(\\mu_1, \\sigma)\n", "\\end{eqnarray}\n", "\n", - "In the logistic regression we are not estimating the properties of the sepal length. Were are merely fitting parameters of the inverse link function. In the LDA model we are estimating the sepal length distributions directly." + "In the logistic regression we are not estimating the properties of the sepal length. We are merely fitting parameters of the inverse link function. In the LDA model we are estimating the sepal length distributions directly." ] }, { @@ -1042,7 +1063,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When comparing the posterior predictive it can be seen that the logistic model is binary, estimating either 0 or 1, versus the LDA model which has real numbers that generally look like sepal lengths. This follows our understandings of the models, the discriminative model can only make predictions as to which class a particular sepal length belongs to, whereas the LDA model can make predictions about the sepal lengths directly." + "When comparing the posterior predictive, it can be seen that the logistic model is binary, estimating either 0 or 1, while the LDA model has real numbers that generally look like sepal lengths. This follows our understandings of the models: the logistic regression makes predictions as to which class a particular sepal length belongs to, whereas the LDA model makes predictions about the sepal lengths directly." ] }, { @@ -1050,7 +1071,9 @@ "metadata": {}, "source": [ "## Exercise 9\n", - "***" + "***\n", + "\n", + "*Using the fish data, extend the `ZIP_reg` model to include the persons variable as part of a linear model. Include this variable to model the number of extra zeros. You should get a model that includes two linear models: one connecting the number of children and the presence/absence of a camper to the Poisson rate (as in the example we saw), and another connecting the number of persons to the $\\psi$ variable. For the second case, you will need a logistic inverse link!*" ] }, { @@ -1108,7 +1131,9 @@ "metadata": {}, "source": [ "## Exercise 10\n", - "***" + "***\n", + "\n", + "*Use the data for the robust logistic example to feed a non-robust logistic regression model and to check that the outliers actually affected the results. 
You may want to add or remove outliers to better understand the effect of the estimation on a logistic regression and the robustness of the model introduced in this chapter.*" ] }, { @@ -1132,12 +1157,14 @@ "source": [ "iris = sns.load_dataset(\"iris\") \n", "df = iris.query(\"species == ('setosa', 'versicolor')\") \n", + "\n", "y_0 = pd.Categorical(df['species']).codes \n", "x_n = 'sepal_length' \n", "x_0 = df[x_n].values \n", "y_0 = np.concatenate((y_0, np.ones(6, dtype=int))) \n", "x_0 = np.concatenate((x_0, [4.2, 4.5, 4.0, 4.3, 4.2, 4.4])) \n", "x_c = x_0 - x_0.mean() \n", + "\n", "plt.plot(x_c, y_0, 'o', color='k');" ] }, @@ -1145,7 +1172,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's take the robust logistic regression from the chapter and make it non robust" + "Let's take the robust logistic regression from the chapter and make it non robust:" ] }, { @@ -1174,9 +1201,8 @@ " θ = pm.Deterministic(\"θ\", pm.math.sigmoid(μ))\n", " bd = pm.Deterministic(\"bd\", -α/β)\n", " \n", - " π = pm.Beta(\"π\",1,1)\n", - " \n", - " # Short Circuit Robust regression \n", + " # Short Circuit Robust regression\n", + " # π = pm.Beta(\"π\", 1, 1)\n", " # p = π *.5 + (1-π)*θ\n", " p = θ\n", " \n", @@ -1293,13 +1319,13 @@ "source": [ "theta = trace_rlg['θ'].mean(axis=0)\n", "idx = np.argsort(x_c)\n", - "plt.plot(x_c[idx], theta[idx], color='C2', lw=3);\n", + "\n", "plt.vlines(trace_rlg['bd'].mean(), 0, 1, color='k')\n", "bd_hpd = az.hpd(trace_rlg['bd'])\n", - "\n", "plt.fill_betweenx([0, 1], bd_hpd[0], bd_hpd[1], color='k', alpha=0.5)\n", "\n", "plt.scatter(x_c, np.random.normal(y_0, 0.02), marker='.', color=[f'C{x}' for x in y_0])\n", + "plt.plot(x_c[idx], theta[idx], color='C2', lw=3)\n", "theta_hpd = az.hpd(trace_rlg['θ'])[idx]\n", "plt.fill_between(x_c[idx], theta_hpd[:,0], theta_hpd[:,1], color='C2', alpha=0.5)\n", "\n", @@ -1316,15 +1342,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Compare this plot to figure 4.13. Note that the HPD for the decision boundary is wider, reflecting the additional uncertainty. This is also reflected in the slope which is more gradual. This is reflected both in the plot, but also the beta parameter (15.77 for robust model versus 2.38 for the non robust model)" + "Compare this plot to figure 4.13. Note that the HPD for the decision boundary is wider, reflecting the additional uncertainty. This is also reflected in the slope which is more gradual. This is reflected both in the plot, but also the beta parameter (15.77 for robust model versus 2.38 for the non-robust model)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 11\n", + "***\n", + "\n", + "*Read and run the following notebooks from PyMC3's documentation:*\n", + "\n", + "- [GLM: Linear regression](https://docs.pymc.io/notebooks/GLM-linear.html)\n", + "- [GLM: Robust Linear Regression](https://docs.pymc.io/notebooks/GLM-robust.html)\n", + "- [GLM: Hierarchical Linear Regression](https://docs.pymc.io/notebooks/GLM-hierarchical.html)" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "stat-rethink-pymc3", "language": "python", - "name": "python3" + "name": "stat-rethink-pymc3" }, "language_info": { "codemirror_mode": { @@ -1336,7 +1376,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.7.3" } }, "nbformat": 4,