diff --git a/docs/reference.md b/docs/reference.md
index f5b3ba4..f6beb5c 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -2051,3 +2051,681 @@ def lqr(A,B,Q,R):
 K=linalg.inv(R) @ B.T @ P
 return K
 ```


# Observers
T. Bretl, S. Bout, J. Virklan

## What is an observer?

Consider the usual state space system

$$\begin{align*}
    \dot{x} = Ax + Bu
\end{align*}$$

The controller $u = -Kx$ assumes complete knowledge of the state $x$. In reality, this is almost never the case. When the state is not fully known, an observer is used. An observer is the "best guess," or the best estimate, of the state based on the information available. The state estimate is denoted $\hat{x}$, and the choice of input $u$ is based on the state estimate instead of the state. The controller

$$\begin{aligned}
    u = -Kx
\end{aligned}$$

becomes

$$\begin{aligned}
    u = -K \hat{x}
\end{aligned}$$

The primary source of the information used to determine $\hat{x}$ is sensors. If the measurements from the sensors are denoted $y$, then the goal is to somehow go from these measurements $y$ to the state estimate $\hat{x}$. The relationship between the measurements and the state is

$$\begin{aligned}
    y = Cx
\end{aligned}$$

An approach we might consider is

$$\begin{aligned}
    \hat{x}(t) = C^{-1}y(t)
\end{aligned}$$

This will not generally work because the $C$ matrix may not be square or invertible.

The approach that will generally work is an observer. An observer is a set of ordinary differential equations that looks very similar to the original state space model, but has $\hat{x}$ instead of $x$, with a corrective term proportional to $C\hat{x} - y$ added. The resulting formula is called an observer

$$\begin{aligned}
    \dot{\hat{x}} = A \hat{x} + Bu - L(C \hat{x} - y)
\end{aligned}$$

If the measurements $y$ and the inputs $u$ are plugged in, and $\dot{\hat{x}}$ is integrated, this will produce a very good state estimate, allowing for performance similar to control applied to the exact value of the state itself.
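To make this concrete, here is a minimal simulation sketch. The system matrices and the gains $K$ and $L$ are hypothetical, chosen by hand for illustration (how to choose them properly is covered below), and the observer never sees $x$ directly, only the measurement $y$.

```
import numpy as np

# Hypothetical example system: a double integrator with a position sensor.
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
C = np.array([[1., 0.]])
K = np.array([[1., 2.]])    # controller gains (assumed given)
L = np.array([[3.], [2.]])  # observer gains (assumed given)

dt = 0.01
x = np.array([[1.], [0.]])     # true state (unknown to the observer)
xhat = np.array([[0.], [0.]])  # state estimate, deliberately wrong at t = 0

for _ in range(1000):
    u = -K @ xhat  # control is based on the estimate, not the state
    y = C @ x      # the only information the observer receives
    # Euler integration of the plant and of the observer
    x = x + dt * (A @ x + B @ u)
    xhat = xhat + dt * (A @ xhat + B @ u - L @ (C @ xhat - y))

print(np.linalg.norm(x - xhat))  # small: the estimate has converged to the state
```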
## Do observers make sense?

So far, we have shown that if we integrate the observer

$$\begin{aligned}
    \dot{\hat{x}} = A\hat{x} + Bu - L(C \hat{x} - y)
\end{aligned}$$

it will produce an estimate of the state of the system

$$\begin{aligned}
    \dot{x} = Ax + Bu \\
    y = Cx
\end{aligned}$$

Why is this the case? And where does this expression for an observer come from? To understand this, we must look back at how controllers are designed. In the absence of a controller, where there is no input, the state would behave as

$$\begin{aligned}
    \dot{x} = Ax
\end{aligned}$$

If we want the system to behave in a different manner, for example to stabilize a system that would otherwise be unstable, we add a term that is negatively proportional to the error, the difference between the actual state and the desired state

$$\begin{aligned}
    \dot{x} = Ax - BK(x - x_{desired})
\end{aligned}$$

In the case of state feedback, the desired state is 0, which generates the usual expression for a closed-loop system

$$\begin{aligned}
    \dot{x} = (A - BK)x
\end{aligned}$$

What about in the case of state estimation? In the absence of any other information, the best guess for the state is produced by integrating the state space model from some initial condition

$$\begin{aligned}
    \dot{\hat{x}} = A\hat{x} + Bu, \qquad \hat{x}(0) = \hat{x}_{0}
\end{aligned}$$

If the guess at the initial condition is perfect, and the state space model is perfect, then integrating this model would produce an exact estimate of the state. In reality, the estimate of the initial condition and the model are never perfect. One way to address that problem is to add a term to the state space model that is negatively proportional to the error, the difference between the state estimate and the actual state, resulting in

$$\begin{aligned}
    \dot{\hat{x}} = A\hat{x} + Bu - L(\hat{x} - x)
\end{aligned}$$

This cannot be implemented, because in order to implement it we would have to know the value of $x$, which of course we do not know; finding an estimate of $x$ is the whole point. However, we can do something similar. We add a term to the state space model that is negatively proportional to some measure of error, but instead of the error between the state estimate and the actual state, which cannot be measured, the term is negatively proportional to the difference between what the model tells us our measurement would be if $\hat{x}$ were the state, $C\hat{x}$, and what the measurement actually is, $y$. The resulting observer is

$$\begin{aligned}
    \dot{\hat{x}} = A\hat{x} + Bu - L(C \hat{x} - y)
\end{aligned}$$

where $C \hat{x} - y$ is the error in the prediction of the measurement $y$. So, just as in the case of control, we take some nominal behaviour and correct it by adding a term that is negatively proportional to some measure of error. Observers make sense!

## When does an observer work?

One way to decide whether an observer is going to work is to examine the estimation error $e = x - \hat{x}$. Using $y = Cx$, we have

$$\begin{aligned}
    e &= x - \hat{x}\\
    \dot{x} &= Ax + Bu\\
    \dot{\hat{x}} &= A\hat{x} + Bu - L(C\hat{x} - Cx)
\end{aligned}$$

Subtracting the third equation from the second yields an equation that can be described purely in terms of the error and its time derivative:

$$\begin{aligned}
    \dot{e} = (A-LC)e
\end{aligned}$$

The error equation is similar to the case of controller design, where $\dot{x} = (A-BK)x$, and it can be analyzed for asymptotic stability in the same way. Defining the characteristic equation

$$\begin{aligned}
    0 = det(sI - (A-LC))
\end{aligned}$$

the solutions to this equation are the eigenvalues of $A-LC$, and, just as in control design, if the real parts of those eigenvalues are all negative, the error dynamics are asymptotically stable.
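A quick numerical check of this condition, reusing the hypothetical system and observer gain from the sketch above:

```
import numpy as np

A = np.array([[0., 1.], [0., 0.]])
C = np.array([[1., 0.]])
L = np.array([[3.], [2.]])

# The error satisfies de/dt = (A - LC)e, so the observer works exactly
# when every eigenvalue of A - LC has negative real part.
eigvals = np.linalg.eigvals(A - L @ C)
print(eigvals)                   # eigenvalues -1 and -2 for this choice of L
print(np.all(eigvals.real < 0))  # True: the error decays to zero
```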
## How to choose L for an observer

The fact that negative real parts of the eigenvalues of $A-LC$ lead to a stable system suggests a method similar to Ackermann's method for controller design: place the poles of the error dynamics so that the observer is stable. The difficulty with applying Ackermann's method directly is that the scripts developed for it expect matrices of specific sizes in order to produce a properly sized $K$ matrix. We can resolve this by transposing $A-LC$, yielding $A^T - C^TL^T$, which now looks exactly like $A-BK$, just transposed to fit properly. Placing this into the acker() algorithm yields $L^T =$ acker($A^T$, $C^T$, $p_{obs}$), where $p_{obs}$ are the desired poles to place. We want $L$ and not the transpose of $L$, which is resolved by taking the transpose of the result of acker().

## Do observers break controllers?

Given the overarching model with the controller and observer implemented together, it is important to check that an observer designed separately from the controller maintains overall system stability. We know that:

$$\begin{aligned}
    \dot{x} & = Ax + Bu\\
    y & = Cx\\
    u &= -K\hat{x}\\
    \dot{\hat{x}} &= A\hat{x} + Bu - L(C\hat{x}-y)
\end{aligned}$$

Combining these equations, and using the error $e = x - \hat{x}$ from before, we get:

$$\begin{aligned}
    \begin{bmatrix}
    \dot{x} \\
    \dot{e}
    \end{bmatrix}
    =
    \begin{bmatrix}
    A-BK & BK\\
    0 & A-LC
    \end{bmatrix}
    \begin{bmatrix}
    x \\
    e
    \end{bmatrix}
    = Hq
\end{aligned}$$

with $0$ being an appropriately sized zero matrix that makes the overall block matrix square. The characteristic equation of the block matrix is:

$$\begin{aligned}
    0 = det(sI - H)
\end{aligned}$$

One useful fact from linear algebra concerns determinants of block triangular matrices:

$$\begin{aligned}
    det\left(\begin{bmatrix}
    F & E \\
    0 & G
    \end{bmatrix}\right) = det(F)det(G)
\end{aligned}$$

Applying this to the block matrix $H$ yields:

$$\begin{aligned}
    det(sI-H) = det(sI-(A-BK))det(sI-(A-LC))
\end{aligned}$$

The roots of both determinants are given by the eigenvalues of their respective matrices. This means that if each determinant on its own corresponds to an asymptotically stable system, then the overall system remains stable in time. This gives rise to the Separation Principle: one can design an observer and a controller separately, merge them together, and still achieve overall stability. We can design the controller and observer while ignoring that the other even exists.

## When is observer design possible?

We know the controllability matrix from controller design, $W_c = \begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix}$, which tells us the system is controllable exactly when it has full rank. Similarly, we can construct an observability matrix:

$$\begin{aligned}
    W_o = \begin{bmatrix}
    C^{T} &
    A^{T} C^{T} &
    \cdots &
    (A^{T})^{n-1} C^T
    \end{bmatrix}
\end{aligned}$$

Wishing to eliminate the messiness of the transposes, we rewrite $W_o$ as:

$$\begin{aligned}
    W_o^T = \begin{bmatrix}
    C \\
    CA\\
    \vdots \\
    CA^{n-1}
    \end{bmatrix}
\end{aligned}$$

Similar to the control problem, the system is observable if $W_o^T$ (equivalently, $W_o$) is of full rank. One should always check the observability of a system before trying to design an observer for it.
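Here is a minimal sketch of this check for a hypothetical system, where `np.vstack` builds the stacked form $W_o^T$ from above:

```
import numpy as np

# Hypothetical example system with n = 2 states and one measurement.
A = np.array([[0., 1.], [0., 0.]])
C = np.array([[1., 0.]])
n = A.shape[0]

# Stack C, CA, ..., CA^(n-1) to form the observability matrix.
Wo = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(n)])

# The system is observable, and observer design possible, exactly when
# this matrix has full rank.
print(np.linalg.matrix_rank(Wo) == n)  # True for this example
```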
# Optimal observers

## What is an optimal observer?

Consider the usual state space system

$$\begin{aligned}
    \dot{x} = Ax + Bu \\
    y = Cx
\end{aligned}$$

Let us denote $n_{x}$ as the number of states in that system, $n_{u}$ as the number of inputs, and $n_{y}$ as the number of outputs. The optimal controller is

$$\begin{aligned}
    u = -Kx
\end{aligned}$$

where $K$ can be found using the Python lqr() function we defined earlier in the "LQR Code" section under "LQR" in the "Optimization and Optimal Control" section.

```
K = lqr(A, B, Qc, Rc)
```

$Q_{c}$ is size $n_x \times n_x$ and $R_{c}$ is size $n_{u} \times n_{u}$. We know these matrices have to satisfy certain properties. In particular, $Q_{c}$ must be positive semidefinite and $R_{c}$ must be positive definite. An easy way to ensure these properties are satisfied is to choose both matrices as diagonal with positive numbers on the diagonal (e.g., the identity matrix). "Optimal" means that the choice of $K$ makes the best trade-off it can between error, which is penalized by $Q_{c}$, and effort, which is penalized by $R_{c}$. If we increase the numbers on the diagonal of either matrix, the resulting $K$ will make a different trade-off between error and effort. The same approach can be used to design an optimal observer. Any observer has the form

$$\begin{aligned}
    \dot{\hat{x}} = A\hat{x} + Bu - L(C \hat{x} - y)
\end{aligned}$$

An optimal observer is one where $L$ is chosen using the same lqr function

```
L = lqr(A.T, C.T, linalg.inv(Ro), linalg.inv(Qo)).T
```

Notice that $A$ and $C$ are transposed, that $Q_{o}$ and $R_{o}$ are inverted, and that their positions are swapped so that each argument has the size lqr expects. The call produces the transpose of $L$, so we take the transpose at the end to get our final $L$. $Q_{o}$ is size $n_y \times n_y$ and $R_{o}$ is size $n_{x} \times n_{x}$. These matrices also need to satisfy certain properties, and an easy way to ensure this is to choose diagonal matrices with positive numbers; an identity matrix is a good place to start. $Q_{o}$ penalizes sensor noise, and $R_{o}$ penalizes process noise. Increasing the numbers in $Q_{o}$ means we trust the sensors more, because we are penalizing the noise that we are allowing the observer design process to assume is happening. Similarly, by increasing the numbers in $R_{o}$, we are trusting our state space model more and de-emphasizing the information we are getting from the sensors. The nice part about this design is that lqr can be used to design both the optimal controller and the optimal observer.

## What problem is solved to produce an optimal observer?

This is the optimal control problem that is solved to produce an optimal controller

$$\begin{aligned}
\mathop{\mathrm{minimize}}_{u_{[t_0, \infty)}}
&\qquad
\int_{t_{0}}^{\infty} \left( x(t)^{T}Q_{c}x(t)+u(t)^{T}R_{c}u(t) \right) dt \\
\text{subject to}
&\qquad
\dot{x}(t) = Ax(t)+Bu(t) \qquad \text{for} \qquad t \in [t_0, \infty) \\
&\qquad
x(t_0) = x_0
\end{aligned}$$

This optimal controller has the form

$$\begin{aligned}
    u = -Kx
\end{aligned}$$

where $K$ can be found using the lqr() Python function we previously defined. This is the optimization problem that is solved to produce an optimal observer

$$\begin{aligned}
\mathop{\mathrm{minimize}}_{x(t_{1}),\,n_{(- \infty,t_{1}]},\,d_{(- \infty,t_{1}]}}
&\qquad
\int_{- \infty}^{t_{1}} \left( n(t)^{T}Q_{o}n(t)+d(t)^{T}R_{o}d(t) \right) dt \\
\text{subject to}
&\qquad
\dot{x}(t) = Ax(t)+Bu(t)+d(t) \qquad \text{for} \qquad t \in (- \infty, t_{1}]\\
&\qquad
y(t) = Cx(t)+n(t)
\end{aligned}$$

Here again, the solution involves a gain matrix, in this case called $L$, which can also be found using the lqr() Python function we previously defined.
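Here is a minimal, self-contained sketch of this computation. The system and weights are hypothetical, and the `lqr` helper is reconstructed from the earlier "LQR Code" section so the sketch runs on its own:

```
import numpy as np
from scipy import linalg

def lqr(A, B, Q, R):
    # ARE-based helper, as in the earlier "LQR Code" section.
    P = linalg.solve_continuous_are(A, B, Q, R)
    return linalg.inv(R) @ B.T @ P

# Hypothetical system; identity weights are a good starting point.
A = np.array([[0., 1.], [0., 0.]])
C = np.array([[1., 0.]])
Qo = np.eye(1)  # n_y x n_y, penalizes sensor noise
Ro = np.eye(2)  # n_x x n_x, penalizes process disturbance

L = lqr(A.T, C.T, linalg.inv(Ro), linalg.inv(Qo)).T

# Sanity check: the error dynamics A - LC must be asymptotically stable.
print(np.linalg.eigvals(A - L @ C).real < 0)  # [ True  True ]
```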
This $L$ enters into a set of ordinary differential equations

$$\begin{aligned}
    \dot{\hat{x}} = A\hat{x}+Bu-L(C\hat{x}-y)
\end{aligned}$$

Let's take a look at some differences between these two problems. In the optimal control problem, the term $x(t)^TQ_c x(t)$ penalizes error. The goal of the controller is to drive $x$ to zero, and when $x$ is not zero, this error grows quadratically. The term $u(t)^TR_c u(t)$ penalizes effort, again quadratically, but this time on the choice of input $u$. If we look at the optimal observer problem, instead of penalizing error and effort, the term $n(t)^T Q_o n(t)$ penalizes sensor noise and the term $d(t)^T R_o d(t)$ penalizes the process disturbance. The variables $n(t)$ and $d(t)$ are added to the state space model to capture real-world faults. The disturbance $d(t)$ captures how the inputs we apply are always corrupted in some way in the real world. We can see this from

$$\begin{aligned}
    d(t) = \dot{x}(t) - (Ax(t)+Bu(t))
\end{aligned}$$

so $d(t)$ captures the error in our model of the dynamics. If our model of the dynamics were perfect, we would have $\dot{x}(t) = Ax(t) + Bu(t)$. The fact that it is not is represented by $d(t)$. The term $d(t)^T R_o d(t)$ recognizes this and imposes a quadratic penalty on the extent to which our model of the dynamics is wrong. Similarly, $n(t)$ captures the extent to which our sensor measurements are wrong in the real world:

$$\begin{aligned}
    -n(t) = Cx(t) - y(t)
\end{aligned}$$

Again, similar to the disturbance, this is the error in our prediction of what the measurement should be. So, while the optimal controller quadratically penalizes error and effort, the optimal observer quadratically penalizes sensor noise and process disturbance. For the optimal controller, we integrate from the current time $t_0$ to the end of time $\infty$. For the optimal observer, we integrate from the beginning of time $-\infty$ of the system to the current time. In particular, for the optimal controller the value we are fundamentally trying to find is the choice of input $u$, while for the optimal observer the value we are fundamentally trying to find is the current state. We might also notice that there is no given current state $x(t_0)$ in the optimal observer problem as there is in the optimal controller problem. This is the whole point of an optimal observer: it gives the best estimate of the current state, which allows the controller to choose the best inputs based on that estimate. The two complement each other.

We also see that the optimal observer minimizes over the noise and the disturbance. This is because we want the optimal observer to explain the estimate it arrives at. The noise and the disturbance effectively explain the differences between the expected and actual measurements and model dynamics, respectively.
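As a small sketch of this bookkeeping, with hypothetical numbers: for any candidate state, the disturbance and the noise are simply the residuals of the dynamics and measurement equations.

```
import numpy as np

# Hypothetical snapshot of data at a single instant.
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
C = np.array([[1., 0.]])
x = np.array([[1.0], [0.5]])     # a candidate state
xdot = np.array([[0.6], [0.9]])  # the observed rate of change
u = np.array([[1.0]])
y = np.array([[1.2]])

# d is whatever the model fails to explain about the dynamics,
# n is whatever it fails to explain about the measurement.
d = xdot - (A @ x + B @ u)
n = y - C @ x
print(d.ravel(), n.ravel())  # [ 0.1 -0.1] [0.2]
```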
The last difference between the optimal controller and optimal observer is a bit more subtle. The final result of the optimal controller is the formula

$$\begin{aligned}
u = -Kx
\end{aligned}$$

The optimal observer does not have such a declarative formula. While there is a similar expression for the state estimate, that formula would depend on all the inputs and measurements that have been chosen and observed, respectively, since the beginning of time $-\infty$. This means that in order to evaluate it, we would need to keep track of every single input and measurement since the beginning of time of that system. If that sounds ridiculous, it's because it is. The best we can do is an expression that explains how the observer evolves over time, in the form of the ordinary differential equation

$$\begin{align*}
\tag{1}
    \dot{\hat{x}} = A\hat{x} + Bu - L (C\hat{x} - y)
\end{align*}$$

Suppose we solve the optimal observer problem. Notice that it has an upper limit of integration $t_1$, which we consider the "current time," and it produces an estimate of the current state $x(t_1)$. For example, say we solved the optimal observer problem with $t_1 = t_\alpha$ and got $x(t_1) = \alpha$, and solved it again with $t_1 = t_\beta$ and got $x(t_1) = \beta$. If we integrate (1) starting at $\hat{x}(t_\alpha) = \alpha$, then the solution is $\hat{x}(t_\beta) = \beta$. Equation (1) can therefore be considered an update expression. It allows us to go from a solution of the observer problem at one time to a new solution at another time, without having to track all the inputs and measurements from the beginning of time. Instead, we only need the inputs and measurements from time $t_\alpha$ to time $t_\beta$ in this scenario. This is why we see an explicit formula for the optimal controller, and a more subtle "update expression" for the optimal observer.
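As a sketch of why this matters in practice, here is a hypothetical one-step (Euler) form of the update expression. Notice that it needs only the current input and measurement, not the full history:

```
import numpy as np

def observer_step(xhat, u, y, A, B, C, L, dt):
    # One Euler step of equation (1). Nothing observed before this
    # instant has to be remembered; the estimate itself carries all
    # the information from past inputs and measurements.
    return xhat + dt * (A @ xhat + B @ u - L @ (C @ xhat - y))
```

Calling this repeatedly carries the estimate forward from one time to the next, which is exactly what distinguishes the update expression from re-solving the optimization over all past data.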
## Do optimal observers make any sense at all?

The equations that govern optimal observers may appear confusing. Using a simple model

$$\begin{aligned}
    y = x
\end{aligned}$$

and a simple sensor measurement $y = 10$, it is safe to assume the state estimate $\hat{x} = 10$. Now assume there are two sensors, $y_1 = 10$ and $y_2 = 20$. The state estimate is no longer so easy. Some might choose $\hat{x} = 15$, an average of the two. Let's say, for example, that these sensors measure altitude: $y_1$ is the value from an altimeter, and $y_2$ is the value from a laser range finder. Those with some familiarity with these two sensors will know that a laser range finder tends to be significantly more accurate, so $\hat{x} = 15$ is not really a reasonable choice of state estimate. Instead, we would prefer a value much closer to the more accurate sensor, in this scenario the laser range finder. For example, we might choose $\hat{x} = 19$ instead, to reflect the higher accuracy of the laser range finder over the altimeter. This could still be described as an average, but it is now a weighted average. In this scenario

$$\begin{align*}
\tag{2}
    \hat{x} = 0.1y_1 + 0.9y_2
\end{align*}$$

This raises the question: is there a best, or optimal, choice of weights for this particular case, or for a more general case? Hint: the answer is yes. One way to see this is to set up a cost whose minimization produces the weighted average as a consequence. To do that, we extend the sensor models by adding a term to each sensor

$$\begin{aligned}
    y_1 = x + n_1 \\
    y_2 = x + n_2
\end{aligned}$$

where $y_1$ is the altimeter, $y_2$ is the laser range finder, and the added terms $n_1$ and $n_2$ are the noise in each sensor. Sensor noise is a way of acknowledging that real-world sensor measurements are never perfect, and of explicitly modeling how much noise is added to the state in order to produce each measurement. For any given guess at what the state is, we can quantify how much noise must have been added to produce the first measurement, and how much noise must have been added to produce the second measurement. For example, if we assume the actual state is $x = 19$ and the altimeter measures $y_1 = 10$, then the measurement from the altimeter

$$\begin{aligned}
    n_1 = y_1 - x \\
    -9 = 10 - 19
\end{aligned}$$

is off by a value of $-9$, and if the laser range finder measures $y_2 = 20$, then

$$\begin{aligned}
    n_2 = y_2 - x \\
    1 = 20 - 19
\end{aligned}$$

is only off by a value of $1$. A trade-off occurs between the assumed noise of the two sensors, in this case the altimeter and the laser range finder, in order to explain the measurements we see for a given choice of the actual state. The best choice of state estimate is the one that minimizes the noise that must be assumed for both measurements. Here is a cost that quantifies that. We want to minimize over all possible choices of $x$, $n_1$, and $n_2$:

$$\begin{aligned}
\mathop{\mathrm{minimize}}_{x, n_1, n_2}
&\qquad
q_1(n_1)^2 + q_2(n_2)^2 \\
\text{subject to}
&\qquad
y_1 = x + n_1\\
&\qquad
y_2 = x+n_2
\end{aligned}$$

where $q_1$ and $q_2$ are weights. So for a state estimate $x$ to be good, it has to minimize the weighted sum of squares of the noise from the first and second sensors. This optimization problem is easily simplified and solved. The first step is to use the constraints to replace the noise variables with their equivalents in terms of the measurements and the state:

$$\begin{aligned}
\mathop{\mathrm{minimize}}_{x}
&\qquad
q_1(y_1-x)^2 + q_2(y_2-x)^2
\end{aligned}$$

Let's call this function $f(x)$. This eliminates both the constraints and the extra minimization variables. We can now recognize that the function we are minimizing is a single scalar-valued function of $x$, and we know how to minimize single scalar-valued functions: set the first derivative of the function to zero

$$\begin{aligned}
    0 = \frac{df}{dx} &= -2q_1(y_1-x) - 2q_2(y_2-x) \\
    &= -2 (q_1(y_1-x)+q_2(y_2-x)) \\
    &= -2((q_1 y_1 + q_2 y_2) - (q_1 + q_2) x)
\end{aligned}$$

which ultimately leads to

$$\begin{aligned}
    x = \left(\frac{q_1}{q_1+q_2}\right)y_1 + \left(\frac{q_2}{q_1+q_2}\right)y_2
\end{aligned}$$

A couple of things to note. We have derived the same weighted average shown earlier in (2). This means taking a weighted average is equivalent to minimizing a weighted sum-of-squares cost on the amount of noise we would have to assume to explain the measurements we see. But how do we choose these weights? Choosing $q_1$ and $q_2$ is quite easy in practice. The right way to choose each weight is to take

$$\begin{align*}
\tag{3}
    \frac{1}{\sigma^2}
\end{align*}$$

where $\sigma$ is the standard deviation associated with the given sensor. This means that if we assume the noise in the laser range finder is normally distributed with zero mean and standard deviation $\sigma_2$, then the right way to choose the weight $q_2$ is by using equation (3).
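Here is a minimal sketch of this rule, with hypothetical standard deviations chosen so the weights come out to the 0.1 and 0.9 used in (2):

```
# Two sensors measuring the same scalar state, as in the example above.
y1, sigma1 = 10.0, 3.0  # altimeter (hypothetical standard deviation)
y2, sigma2 = 20.0, 1.0  # laser range finder (hypothetical, more accurate)

# Equation (3): weight each sensor by the inverse of its variance.
q1 = 1.0 / sigma1**2
q2 = 1.0 / sigma2**2

xhat = (q1 * y1 + q2 * y2) / (q1 + q2)
print(xhat)  # 19.0, leaning toward the more accurate sensor
```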
Looking back at the optimal observer problem

$$\begin{aligned}
\mathop{\mathrm{minimize}}_{x(t_{1}),\,n_{(- \infty,t_{1}]},\,d_{(- \infty,t_{1}]}}
&\qquad
\int_{- \infty}^{t_{1}} \left( n(t)^{T}Q_{o}n(t)+d(t)^{T}R_{o}d(t) \right) dt \\
\text{subject to}
&\qquad
\dot{x}(t) = Ax(t)+Bu(t)+d(t) \qquad \text{for} \qquad t \in (- \infty, t_{1}]\\
&\qquad
y(t) = Cx(t)+n(t)
\end{aligned}$$

we can see that we are minimizing, over all possible choices of state estimate, the weighted sum of squares of the noise $n(t)$ and the disturbance $d(t)$ that would have to be assumed in order to explain the differences we observe between $\dot{x}(t)$ and $Ax(t) + Bu(t)$, and between $y(t)$ and $Cx(t)$. This mirrors exactly what we did in our example: to derive the weighted average combining the altimeter and laser range finder measurements, we minimized the weighted sum of squares of the noises that we would have to assume to explain the differences between the measurements and the state. A note: just as we took the inverse of the squared standard deviation to find the optimal weights $q_1$ and $q_2$, we can choose the matrices $Q_o$ and $R_o$ as

$$\begin{aligned}
    Q_o = \Sigma_{n}^{-1} \\
    R_o = \Sigma_{d}^{-1}
\end{aligned}$$

where $\Sigma_{n}^{-1}$ is the inverse of the covariance matrix associated with the noise $n$ and $\Sigma_{d}^{-1}$ is the inverse of the covariance matrix associated with the disturbance $d$. We can see that it is advantageous to use this method rather than a simple weighted average, because there is a direct way to find the weights that minimize the observer cost. Ultimately, all an optimal observer is doing is computing a more complicated weighted average.
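To close the loop on this idea, here is a minimal sketch, with a hypothetical system and covariances, of choosing $Q_o$ and $R_o$ from covariance matrices and feeding them into the same lqr-based design as before:

```
import numpy as np
from scipy import linalg

def lqr(A, B, Q, R):
    # ARE-based helper, as in the earlier "LQR Code" section.
    P = linalg.solve_continuous_are(A, B, Q, R)
    return linalg.inv(R) @ B.T @ P

# Hypothetical system and covariances.
A = np.array([[0., 1.], [0., 0.]])
C = np.array([[1., 0.]])
Sigma_n = np.diag([0.1])         # covariance of the sensor noise n
Sigma_d = np.diag([0.01, 0.02])  # covariance of the process disturbance d

# Q_o and R_o are the inverse covariances, so the inverses inside the
# lqr call cancel and the covariances can be passed in directly.
L = lqr(A.T, C.T, Sigma_d, Sigma_n).T  # same as inv(Ro), inv(Qo)
print(L)
```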