From 8763a3ccf2de8e0bf40b30fb46c66547f244eacd Mon Sep 17 00:00:00 2001
From: Robert Zaremba
Date: Fri, 14 Oct 2016 01:44:16 +0200
Subject: [PATCH] Fix Python3 notebooks latex syntax

---
 Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb | 13 +++++------
 .../Ch5_LossFunctions_PyMC3.ipynb          | 12 +++++-----
 Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb | 22 +++++++++----------
 to_latex_pdf.sh                            | 10 ++++-----
 4 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb b/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb
index a3cf2f9b..ae98b565 100644
--- a/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb
+++ b/Chapter2_MorePyMC/Ch2_MorePyMC_PyMC3.ipynb
@@ -313,10 +313,9 @@
 "\n",
 "$$\n",
 "\\lambda = \n",
- "\\cases{\n",
- "\\lambda_1 & \\text{if } t \\lt \\tau \\cr\n",
+ "\\begin{cases}\\lambda_1 & \\text{if } t \\lt \\tau \\cr\n",
 "\\lambda_2 & \\text{if } t \\ge \\tau\n",
- "}\n",
+ "\\end{cases}\n",
 "$$\n",
 "\n",
 "And in PyMC3 code:"
@@ -772,7 +771,7 @@
 "source": [
 "Had we had stronger beliefs, we could have expressed them in the prior above.\n",
 "\n",
- "For this example, consider $p_A = 0.05$, and $N = 1500$ users shown site A, and we will simulate whether the user made a purchase or not. To simulate this from $N$ trials, we will use a *Bernoulli* distribution: if $ X\\ \\sim \\text{Ber}(p)$, then $X$ is 1 with probability $p$ and 0 with probability $1 - p$. Of course, in practice we do not know $p_A$, but we will use it here to simulate the data."
+ "For this example, consider $p_A = 0.05$, and $N = 1500$ users shown site A, and we will simulate whether the user made a purchase or not. To simulate this from $N$ trials, we will use a *Bernoulli* distribution: if $X\\ \\sim \\text{Ber}(p)$, then $X$ is 1 with probability $p$ and 0 with probability $1 - p$. Of course, in practice we do not know $p_A$, but we will use it here to simulate the data."
 ]
 },
 {
@@ -2558,9 +2557,9 @@
 "metadata": {
 "anaconda-cloud": {},
 "kernelspec": {
- "display_name": "Python [conda env:bayes]",
+ "display_name": "Python 3",
 "language": "python",
- "name": "conda-env-bayes-py"
+ "name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
@@ -2572,7 +2571,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.5.2"
+ "version": "3.4.5"
 }
 },
 "nbformat": 4,
diff --git a/Chapter5_LossFunctions/Ch5_LossFunctions_PyMC3.ipynb b/Chapter5_LossFunctions/Ch5_LossFunctions_PyMC3.ipynb
index eeebcc39..afa21844 100644
--- a/Chapter5_LossFunctions/Ch5_LossFunctions_PyMC3.ipynb
+++ b/Chapter5_LossFunctions/Ch5_LossFunctions_PyMC3.ipynb
@@ -59,8 +59,8 @@
 "\n",
 "Other popular loss functions include:\n",
 "\n",
- "- $ L( \\theta, \\hat{\\theta} ) = \\mathbb{1}_{ \\hat{\\theta} \\neq \\theta } $ is the zero-one loss often used in machine learning classification algorithms.\n",
- "- $ L( \\theta, \\hat{\\theta} ) = -\\hat{\\theta}\\log( \\theta ) - (1-\\hat{ \\theta})\\log( 1 - \\theta ), \\; \\; \\hat{\\theta} \\in {0,1}, \\; \\theta \\in [0,1]$, called the *log-loss*, also used in machine learning. \n",
+ "- $L( \\theta, \\hat{\\theta} ) = \\mathbb{1}_{ \\hat{\\theta} \\neq \\theta }$ is the zero-one loss often used in machine learning classification algorithms.\n",
+ "- $L( \\theta, \\hat{\\theta} ) = -\\hat{\\theta}\\log( \\theta ) - (1-\\hat{ \\theta})\\log( 1 - \\theta ), \\; \\; \\hat{\\theta} \\in \\{0,1\\}, \\; \\theta \\in [0,1]$, called the *log-loss*, also used in machine learning. \n",
 "\n",
 "Historically, loss functions have been motivated by 1) mathematical convenience, and 2) robustness to application, i.e., they are objective measures of loss. The first reason has really held back the full breadth of loss functions. With computers being agnostic to mathematical convenience, we are free to design our own loss functions, which we take full advantage of later in this Chapter.\n",
 "\n",
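For reference, the two losses above are one-liners in code; a minimal NumPy sketch (the function names are illustrative, not the notebook's):

    import numpy as np

    def zero_one_loss(theta, theta_hat):
        # 1 when the estimate misses the true value, 0 when it hits.
        return float(theta_hat != theta)

    def log_loss(theta, theta_hat):
        # theta_hat is a 0/1 label, theta an estimated probability in (0, 1).
        return -theta_hat * np.log(theta) - (1 - theta_hat) * np.log(1 - theta)

    print(log_loss(0.9, 1))  # small loss: confident and correct
    print(log_loss(0.9, 0))  # large loss: confident and wrong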
\n", "\n", "Historically, loss functions have been motivated from 1) mathematical convenience, and 2) they are robust to application, i.e., they are objective measures of loss. The first reason has really held back the full breadth of loss functions. With computers being agnostic to mathematical convenience, we are free to design our own loss functions, which we take full advantage of later in this Chapter.\n", "\n", @@ -69,10 +69,10 @@ "By shifting our focus from trying to be incredibly precise about parameter estimation to focusing on the outcomes of our parameter estimation, we can customize our estimates to be optimized for our application. This requires us to design new loss functions that reflect our goals and outcomes. Some examples of more interesting loss functions:\n", "\n", "\n", - "- $ L( \\theta, \\hat{\\theta} ) = \\frac{ | \\theta - \\hat{\\theta} | }{ \\theta(1-\\theta) }, \\; \\; \\hat{\\theta}, \\theta \\in [0,1] $ emphasizes an estimate closer to 0 or 1 since if the true value $\\theta$ is near 0 or 1, the loss will be *very* large unless $\\hat{\\theta}$ is similarly close to 0 or 1. \n", + "- $L( \\theta, \\hat{\\theta} ) = \\frac{ | \\theta - \\hat{\\theta} | }{ \\theta(1-\\theta) }, \\; \\; \\hat{\\theta}, \\theta \\in [0,1]$ emphasizes an estimate closer to 0 or 1 since if the true value $\\theta$ is near 0 or 1, the loss will be *very* large unless $\\hat{\\theta}$ is similarly close to 0 or 1. \n", "This loss function might be used by a political pundit who's job requires him or her to give confident \"Yes/No\" answers. This loss reflects that if the true parameter is close to 1 (for example, if a political outcome is very likely to occur), he or she would want to strongly agree as to not look like a skeptic. \n", "\n", - "- $L( \\theta, \\hat{\\theta} ) = 1 - \\exp \\left( -(\\theta - \\hat{\\theta} )^2 \\right) $ is bounded between 0 and 1 and reflects that the user is indifferent to sufficiently-far-away estimates. It is similar to the zero-one loss above, but not quite as penalizing to estimates that are close to the true parameter. \n", + "- $L( \\theta, \\hat{\\theta} ) = 1 - \\exp \\left( -(\\theta - \\hat{\\theta} )^2 \\right)$ is bounded between 0 and 1 and reflects that the user is indifferent to sufficiently-far-away estimates. It is similar to the zero-one loss above, but not quite as penalizing to estimates that are close to the true parameter. \n", "- Complicated non-linear loss functions can programmed: \n", "\n", " def loss(true_value, estimate):\n", @@ -703,7 +703,7 @@ "\n", "$$R_i(x) = \\alpha_i + \\beta_ix + \\epsilon $$\n", "\n", - "where $\\epsilon \\sim \\text{Normal}(0, \\sigma_i) $ and $i$ indexes our posterior samples. We wish to find the solution to \n", + "where $\\epsilon \\sim \\text{Normal}(0, \\sigma_i)$ and $i$ indexes our posterior samples. We wish to find the solution to \n", "\n", "$$ \\arg \\min_{r} \\;\\;E_{R(x)}\\left[ \\; L(R(x), r) \\; \\right] $$\n", "\n", @@ -802,7 +802,7 @@ "1. Construct a prior distribution for the halo positions $p(x)$, i.e. formulate our expectations about the halo positions before looking at the data.\n", "2. Construct a probabilistic model for the data (observed ellipticities of the galaxies) given the positions of the dark matter halos: $p(e | x)$.\n", "3. Use Bayes’ rule to get the posterior distribution of the halo positions, i.e. use to the data to guess where the dark matter halos might be.\n", - "4. 
@@ -802,7 +802,7 @@
 "\n",
 "1. Construct a prior distribution for the halo positions $p(x)$, i.e. formulate our expectations about the halo positions before looking at the data.\n",
 "2. Construct a probabilistic model for the data (observed ellipticities of the galaxies) given the positions of the dark matter halos: $p(e | x)$.\n",
 "3. Use Bayes’ rule to get the posterior distribution of the halo positions, i.e. use the data to guess where the dark matter halos might be.\n",
- "4. Minimize the expected loss with respect to the posterior distribution over the predictions for the halo positions: $ \\hat{x} = \\arg \\min_{\\text{prediction} } E_{p(x|e)}[ L( \\text{prediction}, x) ]$ , i.e. tune our predictions to be as good as possible for the given error metric.\n",
+ "4. Minimize the expected loss with respect to the posterior distribution over the predictions for the halo positions: $\\hat{x} = \\arg \\min_{\\text{prediction} } E_{p(x|e)}[ L( \\text{prediction}, x) ]$, i.e. tune our predictions to be as good as possible for the given error metric.\n",
 "\n"
 ]
 },
diff --git a/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb b/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb
index c5ec49bd..173c63dd 100644
--- a/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb
+++ b/Chapter6_Priorities/Ch6_Priors_PyMC3.ipynb
@@ -112,7 +112,7 @@
 "source": [
 "We must remember that choosing a prior, whether subjective or objective, is still part of the modeling process. To quote Gelman [5]:\n",
 "\n",
- ">...after the model has been fit, one should look at the posterior distribution\n",
+ ">... after the model has been fit, one should look at the posterior distribution\n",
 "and see if it makes sense. If the posterior distribution does not make sense, this implies\n",
 "that additional prior knowledge is available that has not been included in the model,\n",
 "and that contradicts the assumptions of the prior distribution that has been used. It is\n",
@@ -985,7 +985,7 @@
 "\n",
 "$$r_t = \\frac{ S_t - S_{t-1} }{ S_{t-1} } $$\n",
 "\n",
- "The *expected daily return* of a stock is denoted $\\mu = E[ r_t ] $. Obviously, stocks with high expected returns are desirable. Unfortunately, stock returns are so filled with noise that it is very hard to estimate this parameter. Furthermore, the parameter might change over time (consider the rises and falls of AAPL stock), hence it is unwise to use a large historical dataset. \n",
+ "The *expected daily return* of a stock is denoted $\\mu = E[ r_t ]$. Obviously, stocks with high expected returns are desirable. Unfortunately, stock returns are so filled with noise that it is very hard to estimate this parameter. Furthermore, the parameter might change over time (consider the rises and falls of AAPL stock), hence it is unwise to use a large historical dataset. \n",
 "\n",
 "Historically, the expected return has been estimated by using the sample mean. This is a bad idea. As mentioned, the sample mean of a small-sized dataset has enormous potential to be very wrong (again, see Chapter 4 for full details). Thus Bayesian inference is the correct procedure here, since we are able to see our uncertainty along with probable values.\n",
 "\n",
@@ -1440,7 +1440,7 @@
 "\n",
 "Earlier, we talked about objective priors rarely being *objective*. Partly what we mean by this is that we want a prior that doesn't bias our posterior estimates. The flat prior seems like a reasonable choice as it assigns equal probability to all values. \n",
 "\n",
- "But the flat prior is not transformation invariant. What does this mean? Suppose we have a random variable $ \\bf X $ from Bernoulli($\\theta$). We define the prior on $p(\\theta) = 1$. "
+ "But the flat prior is not transformation invariant. What does this mean? Suppose we have a random variable $\\textbf X$ from Bernoulli($\\theta$). We define the prior as $p(\\theta) = 1$. "
 ]
 },
 {
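The non-invariance in the flat-prior cell above is easy to check numerically: push a flat prior on $\theta$ through the log-odds transform and the induced density is anything but flat. A small sketch (the transform is our illustrative choice):

    import numpy as np

    # Flat prior on theta, pushed through the log-odds transform.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 1.0, size=100000)
    psi = np.log(theta / (1.0 - theta))

    # If flatness survived the change of variables, these bin densities would be equal.
    density, _ = np.histogram(psi, bins=50, range=(-5, 5), density=True)
    print(density.max() / density.min())  # ratio >> 1: the induced prior is peaked at 0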
@@ -1532,17 +1532,17 @@
 "\n",
 "We can see this mathematically. First, recall Bayes' Theorem from Chapter 1 that relates the prior to the posterior. The following is a sample from [What is the relationship between sample size and the influence of prior on posterior?](http://stats.stackexchange.com/questions/30387/what-is-the-relationship-between-sample-size-and-the-influence-of-prior-on-poste)[1] on CrossValidated.\n",
 "\n",
- ">The posterior distribution for a parameter $\\theta$, given a data set ${\\bf X}$ can be written as \n",
+ ">The posterior distribution for a parameter $\\theta$, given a data set ${\\textbf X}$ can be written as \n",
 "\n",
- "$$p(\\theta | {\\bf X}) \\propto \\underbrace{p({\\bf X} | \\theta)}_{{\\rm likelihood}} \\cdot \\overbrace{ p(\\theta) }^{ {\\rm prior} } $$\n",
+ "$$p(\\theta | {\\textbf X}) \\propto \\underbrace{p({\\textbf X} | \\theta)}_{{\\textrm{likelihood}}} \\cdot \\overbrace{ p(\\theta) }^{ {\\textrm{prior}} } $$\n",
 "\n",
 "\n",
 "\n",
 ">or, as is more commonly displayed on the log scale, \n",
 "\n",
- "$$ \\log( p(\\theta | {\\bf X}) ) = c + L(\\theta;{\\bf X}) + \\log(p(\\theta)) $$\n",
+ "$$ \\log( p(\\theta | {\\textbf X}) ) = c + L(\\theta;{\\textbf X}) + \\log(p(\\theta)) $$\n",
 "\n",
- ">The log-likelihood, $L(\\theta;{\\bf X}) = \\log \\left( p({\\bf X}|\\theta) \\right)$, **scales with the sample size**, since it is a function of the data, while the prior density does not. Therefore, as the sample size increases, the absolute value of $L(\\theta;{\\bf X})$ is getting larger while $\\log(p(\\theta))$ stays fixed (for a fixed value of $\\theta$), thus the sum $L(\\theta;{\\bf X}) + \\log(p(\\theta))$ becomes more heavily influenced by $L(\\theta;{\\bf X})$ as the sample size increases. \n",
+ ">The log-likelihood, $L(\\theta;{\\textbf X}) = \\log \\left( p({\\textbf X}|\\theta) \\right)$, **scales with the sample size**, since it is a function of the data, while the prior density does not. Therefore, as the sample size increases, the absolute value of $L(\\theta;{\\textbf X})$ is getting larger while $\\log(p(\\theta))$ stays fixed (for a fixed value of $\\theta$), thus the sum $L(\\theta;{\\textbf X}) + \\log(p(\\theta))$ becomes more heavily influenced by $L(\\theta;{\\textbf X})$ as the sample size increases. \n",
 "\n",
 "There is an interesting consequence not immediately apparent. As the sample size increases, the chosen prior has less influence. Hence inference converges regardless of chosen prior, so long as the areas of non-zero probabilities are the same. \n",
 "\n",
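The quoted scaling argument can be verified with a conjugate toy model: two very different Beta priors on a coin's bias give posterior means that agree once enough Bernoulli data arrives. A sketch (the priors and the true bias are arbitrary choices):

    import numpy as np

    # Bernoulli(0.7) data; two very different Beta priors on the bias.
    rng = np.random.default_rng(0)
    data = rng.random(10000) < 0.7

    for n in (10, 100, 10000):
        heads = data[:n].sum()
        flat_mean = (1 + heads) / (2 + n)      # posterior mean under Beta(1, 1)
        biased_mean = (20 + heads) / (22 + n)  # posterior mean under Beta(20, 2)
        print(n, round(flat_mean, 3), round(biased_mean, 3))

By n = 10000 the biased prior's 20 pseudo-heads are noise, and the two columns agree.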
@@ -1619,15 +1619,15 @@
 "Y = X\\beta + \\epsilon\n",
 "\\end{equation}\n",
 "\n",
- "where $\\epsilon \\sim \\text{Normal}( {\\bf 0}, \\sigma{\\bf I })$. Simply, the observed $Y$ is a linear function of $X$ (with coefficients $\\beta$) plus some noise term. Our unknown to be determined is $\\beta$. We use the following property of Normal random variables:\n",
+ "where $\\epsilon \\sim \\text{Normal}( {\\textbf 0}, \\sigma{\\textbf I })$. Simply, the observed $Y$ is a linear function of $X$ (with coefficients $\\beta$) plus some noise term. Our unknown to be determined is $\\beta$. We use the following property of Normal random variables:\n",
 "\n",
 "$$ \\mu' + \\text{Normal}( \\mu, \\sigma ) \\sim \\text{Normal}( \\mu' + \\mu , \\sigma ) $$\n",
 "\n",
 "to rewrite the above linear model as:\n",
 "\n",
 "\\begin{align}\n",
- "& Y = X\\beta + \\text{Normal}( {\\bf 0}, \\sigma{\\bf I }) \\\\\\\\\n",
- "& Y = \\text{Normal}( X\\beta , \\sigma{\\bf I }) \\\\\\\\\n",
+ "& Y = X\\beta + \\text{Normal}( {\\textbf 0}, \\sigma{\\textbf I }) \\\\\\\\\n",
+ "& Y = \\text{Normal}( X\\beta , \\sigma{\\textbf I }) \\\\\\\\\n",
 "\\end{align}\n",
 "\n",
 "In probabilistic notation, denote by $f_Y(y \\; | \\; \\beta )$ the probability distribution of $Y$, and recall the density function for a Normal random variable (see [here](http://en.wikipedia.org/wiki/Normal_distribution) ):\n",
@@ -1660,7 +1660,7 @@
 "\n",
 "2\\. If we have reason to believe the elements of $\\beta$ are not too large, we can suppose that *a priori*:\n",
 "\n",
- "$$ \\beta \\sim \\text{Normal}({\\bf 0 }, \\lambda {\\bf I } ) $$\n",
+ "$$ \\beta \\sim \\text{Normal}({\\textbf 0 }, \\lambda {\\textbf I } ) $$\n",
 "\n",
 "The resulting posterior density function for $\\beta$ is *proportional to*:\n",
 "\n",
diff --git a/to_latex_pdf.sh b/to_latex_pdf.sh
index 95e49abc..376d30eb 100755
--- a/to_latex_pdf.sh
+++ b/to_latex_pdf.sh
@@ -1,6 +1,4 @@
-cd Chapter1_Introduction/ && ipython nbconvert Chapter1.ipynb --to latex --post PDF --template article
-cd ../Chapter2_MorePyMC/ && ipython nbconvert Chapter2.ipynb --to latex --post PDF --template article
-cd ../Chapter3_MCMC/ && ipython nbconvert Chapter3.ipynb --to latex --post PDF --template article
-cd ../Chapter4_TheGreatestTheoremNeverTold/ && ipython nbconvert Chapter4.ipynb --to latex --post PDF --template article
-cd ../Chapter5_LossFunctions/ && ipython nbconvert Chapter5.ipynb --to latex --post PDF --template article
-cd ../Chapter6_Priorities/ && ipython nbconvert Chapter6.ipynb --to latex --post PDF --template article
+find Prologue Chapter* -name "*.ipynb" | grep -v "PyMC2" | xargs ipython3 nbconvert --to pdf --template article
+
+# merge all files:
+pdfjoin Prologue.pdf Ch*.pdf DontOverfit.pdf MachineLearning.pdf
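Reading the Chapter 6 passage above as code: a $\text{Normal}({\textbf 0}, \lambda {\textbf I})$ prior on $\beta$ makes the posterior mode (MAP estimate) a ridge regression with penalty $\sigma^2 / \lambda$. A quick NumPy sketch (the data and constants are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.normal(size=(n, p))
    beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
    y = X @ beta_true + rng.normal(size=n)

    sigma2, lam = 1.0, 10.0  # noise variance and prior variance (illustrative)
    alpha = sigma2 / lam     # ridge penalty implied by the Normal prior
    # MAP estimate: maximizer of the posterior = (X'X + alpha*I)^{-1} X'y
    beta_map = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
    print(np.round(beta_map, 2))  # close to beta_true, shrunk slightly toward 0

A looser prior (larger `lam`) drives `alpha` toward 0 and recovers ordinary least squares.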