diff --git a/docs/api/analysis.md b/docs/api/analysis.md new file mode 100644 index 00000000..97c026e4 --- /dev/null +++ b/docs/api/analysis.md @@ -0,0 +1,6 @@ +# Analysis + +```{eval-rst} +.. automodule:: mpol.onedim + :members: +``` \ No newline at end of file diff --git a/docs/api/coordinates.md b/docs/api/coordinates.md new file mode 100644 index 00000000..e31d4243 --- /dev/null +++ b/docs/api/coordinates.md @@ -0,0 +1,6 @@ +# Coordinates + +```{eval-rst} +.. automodule:: mpol.coordinates + :members: +``` \ No newline at end of file diff --git a/docs/api/crossval.md b/docs/api/crossval.md new file mode 100644 index 00000000..922e6fa7 --- /dev/null +++ b/docs/api/crossval.md @@ -0,0 +1,6 @@ +# Cross-validation + +```{eval-rst} +.. automodule:: mpol.crossval + :members: +``` diff --git a/docs/api/datasets.md b/docs/api/datasets.md new file mode 100644 index 00000000..b90d4523 --- /dev/null +++ b/docs/api/datasets.md @@ -0,0 +1,6 @@ +# Datasets + +```{eval-rst} +.. automodule:: mpol.datasets + :members: +``` \ No newline at end of file diff --git a/docs/api/fourier.md b/docs/api/fourier.md new file mode 100644 index 00000000..36081861 --- /dev/null +++ b/docs/api/fourier.md @@ -0,0 +1,6 @@ +# Fourier + +```{eval-rst} +.. automodule:: mpol.fourier + :members: +``` \ No newline at end of file diff --git a/docs/api/geometry.md b/docs/api/geometry.md new file mode 100644 index 00000000..7fbe96b3 --- /dev/null +++ b/docs/api/geometry.md @@ -0,0 +1,6 @@ +# Geometry + +```{eval-rst} +.. automodule:: mpol.geometry + :members: +``` diff --git a/docs/api/gridding.md b/docs/api/gridding.md new file mode 100644 index 00000000..3faeb917 --- /dev/null +++ b/docs/api/gridding.md @@ -0,0 +1,6 @@ +# Gridding + +```{eval-rst} +.. automodule:: mpol.gridding + :members: +``` diff --git a/docs/api/images.md b/docs/api/images.md new file mode 100644 index 00000000..5ae0fc55 --- /dev/null +++ b/docs/api/images.md @@ -0,0 +1,6 @@ +# Images + +```{eval-rst} +.. automodule:: mpol.images + :members: +``` diff --git a/docs/api/losses.md b/docs/api/losses.md new file mode 100644 index 00000000..d8922513 --- /dev/null +++ b/docs/api/losses.md @@ -0,0 +1,227 @@ +# Losses + +We have separated the loss functions into two categories: those involving [data](#data-loss-function-api) (derived from likelihood functions) and those acting as [regularizers](#regularizer-loss-function-api) (or priors). We briefly review likelihood functions and their application to Fourier data. + +## Review + +```{admonition} Reference Literature +If you are new to Bayesian inference or its notation, we recommend reviewing the following excellent resources as a prerequisite to the following discussion. +- [Eadie et al. 2023](https://ui.adsabs.harvard.edu/abs/2023arXiv230204703E/abstract): Practical Guidance for Bayesian Inference in Astronomy +- [Data Analysis: A Bayesian Tutorial](https://www.amazon.com/Data-Analysis-Bayesian-Devinderjit-Sivia/dp/0198568320) by Sivia and Skilling +- [Data analysis recipes: Fitting a model to data](https://ui.adsabs.harvard.edu/abs/2010arXiv1008.4686H/abstract) by Hogg, Bovy, and Lang +- [Data analysis recipes: Probability calculus for inference](https://ui.adsabs.harvard.edu/abs/2012arXiv1205.4446H/abstract) by Hogg. +``` + +### Fitting a line example + +Typically, when astronomers fit a model to some dataset, such as a line $y = m x + b$ to a collection of $\boldsymbol{X} = \{x_1, x_2, \ldots\, x_N\}$ and $\boldsymbol{Y} = \{y_1, y_2, \ldots\, y_N\}$ points, we require a likelihood function. 
The likelihood function specifies the probability of the data, given a model, and encapsulates our assumptions about the data- and noise-generating processes.
+
+For most real-world datasets, we don't measure the "true" $y$ value of the line (i.e., $mx + b$), but rather make a measurement which has been partially corrupted by some "noise." We say that each $y_i$ data point is actually generated by
+
+$$
+y_i = m x_i + b + \epsilon_i
+$$
+
+where $\epsilon_i$ is a noise realization from a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) with standard deviation $\sigma_i$, i.e.,
+
+$$
+\epsilon_i \sim \mathcal{N}(0, \sigma_i).
+$$
+
+This information allows us to write down a likelihood function to calculate the probability of the data, given a set of model parameters, $p(\boldsymbol{Y} |\,\boldsymbol{\theta})$. Sometimes it is written as $\mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y})$ to emphasize that it is in fact a *function* of the model parameters (not a probability distribution over them), and that it takes the data as an argument. Frequently, when employed in computation, we'll use the logarithm of the likelihood function, or "log-likelihood," $\ln \mathcal{L}$, to avoid numerical under/overflow issues. Let's call $\boldsymbol{\theta} = \{m, b\}$ and $M(x_i |\, \boldsymbol{\theta}) = m x_i + b$ to emphasize that we can use the model $M$ to calculate a response at input $x_i$ conditional on model parameters $\boldsymbol{\theta}$. The likelihood function for this line problem is
+
+$$
+\mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y}) = \prod_i^N \frac{1}{\sqrt{2 \pi} \sigma_i} \exp \left [ - \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{2 \sigma_i^2}\right ]
+$$
+
+The logarithm of the likelihood function is
+
+$$
+\ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y}) = -\frac{N}{2} \ln(2 \pi) - \sum_i^N \ln(\sigma_i) - \frac{1}{2} \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2}
+$$
+
+You may recognize that the last term contains the $\chi^2$ metric,
+
+$$
+\chi^2(\boldsymbol{\theta}; \boldsymbol{Y}) = \sum_i^N \frac{(y_i - M(x_i |\,\boldsymbol{\theta}))^2}{\sigma_i^2}
+$$
+
+Assuming that the uncertainty ($\sigma_i$) on each data point is known and fixed, the first two terms are constant and we have
+
+$$
+\ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{Y}) = C - \frac{1}{2} \chi^2 (\boldsymbol{\theta}; \boldsymbol{Y})
+$$
+
+where $C$ is a constant with respect to the model parameters. It is common shorthand to say that "the likelihood function is $\chi^2$" in situations where the data uncertainties are Gaussian. Very often, we are interested in the parameter values $\boldsymbol{\theta}_\mathrm{MLE}$ which maximize the likelihood function, called the *maximum likelihood estimate* (or MLE).
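+
+To make this concrete, here is a minimal sketch of evaluating $\chi^2$ and $\ln \mathcal{L}$ for the line model with PyTorch. The data values below are hypothetical placeholders; a real analysis would substitute your own measurements.
+
+```python
+import math
+
+import torch
+
+# hypothetical measurements of a line y = m * x + b, with per-point uncertainties sigma
+x = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
+y = torch.tensor([0.9, 3.1, 5.2, 6.8, 9.1])
+sigma = torch.tensor([0.3, 0.3, 0.4, 0.3, 0.5])
+
+def log_likelihood(m, b):
+    """Evaluate ln L(theta; Y) for the Gaussian noise model described above."""
+    model = m * x + b
+    chi2 = torch.sum((y - model) ** 2 / sigma**2)
+    constant = -0.5 * x.numel() * math.log(2 * math.pi) - torch.sum(torch.log(sigma))
+    return constant - 0.5 * chi2
+
+print(log_likelihood(2.0, 1.0))
+```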
+
+For Bayesian parameter inference, however, $\chi^2$ is not the end of the story: we also need the posterior probability distribution of the model parameters given the dataset, $p(\boldsymbol{\theta}|\,\boldsymbol{Y})$. We can calculate this quantity using Bayes' rule
+
+$$
+p(\boldsymbol{\theta}|\,\boldsymbol{Y}) = \frac{p(\boldsymbol{Y}|\,\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\boldsymbol{Y})}
+$$
+
+The denominator is a constant so long as the model specification remains the same, leaving
+
+$$
+p(\boldsymbol{\theta}|\,\boldsymbol{Y}) \propto p(\boldsymbol{Y}|\,\boldsymbol{\theta})\, p(\boldsymbol{\theta}).
+$$
+
+So we need a prior probability distribution $p(\boldsymbol{\theta})$ in addition to the likelihood function to calculate the posterior probability distribution of the model parameters. Analogous to the maximum likelihood estimate, there is also the *maximum a posteriori* estimate (or MAP), which includes the effect of the prior probability distribution.
+
+### Fourier data
+
+```{admonition} Reference Literature
+A full introduction to Fourier transforms, radio astronomy, and interferometry is beyond the scope of the MPoL documentation. We recommend reviewing these resources as a prerequisite to the discussion.
+
+- [Essential radio astronomy](https://www.cv.nrao.edu/~sransom/web/xxx.html) textbook by James Condon and Scott Ransom, and in particular, Chapter 3.7 on Radio Interferometry.
+- NRAO's [17th Synthesis Imaging Workshop](http://www.cvent.com/events/virtual-17th-synthesis-imaging-workshop/agenda-0d59eb6cd1474978bce811194b2ff961.aspx) (recorded lectures and slides available)
+- [Interferometry and Synthesis in Radio Astronomy](https://ui.adsabs.harvard.edu/abs/2017isra.book.....T/abstract) by Thompson, Moran, and Swenson. An excellent and comprehensive reference on all things interferometry.
+- NJIT's online course materials for [Radio Astronomy](https://web.njit.edu/~gary/728/)
+- Ian Czekala's lecture notes on [Radio Interferometry and Imaging](https://iancze.github.io/courses/as5003/lectures/)
+```
+
+Interferometers acquire samples of data in the Fourier domain, also called the visibility domain. The visibility function is the Fourier transform of the sky brightness distribution
+
+$$
+\mathcal{V}(u,v) = \iint I(l,m) \exp \left \{- 2 \pi i (ul + vm) \right \} \, \mathrm{d}l\,\mathrm{d}m,
+$$
+
+where $l$ and $m$ are direction cosines (roughly equivalent to R.A. and Dec) which parameterize the surface brightness distribution of the image $I(l,m)$, and $u$ and $v$ are spatial frequencies which parameterize the visibility function $\mathcal{V}(u,v)$. For more information on the meaning of these units, see {ref}`units-conventions-label`.
+
+The visibility function is complex-valued, and each measurement of it (denoted by $V_i$) is made in the presence of noise,
+
+$$
+V_i = \mathcal{V}(u_i, v_i) + \epsilon_i,
+$$
+
+where $\epsilon_i$ represents a noise realization from a [complex normal](https://en.wikipedia.org/wiki/Complex_normal_distribution) (Gaussian) distribution. Thankfully, most interferometric datasets do not exhibit significant covariance between the real and imaginary noise components, so we can equivalently say that the real and imaginary components of the noise are separately generated by draws from normal distributions characterized by standard deviation $\sigma_i$,
+
+$$
+\epsilon_\Re \sim \mathcal{N}(0, \sigma_i) \\
+\epsilon_\Im \sim \mathcal{N}(0, \sigma_i)
+$$
+
+and
+
+$$
+\epsilon_i = \epsilon_\Re + i \epsilon_\Im.
+$$
+
+Radio interferometers commonly represent the uncertainty on each visibility measurement by a "weight" $w_i$, where
+
+$$
+w_i = \frac{1}{\sigma_i^2}.
+$$
+
+A full interferometric dataset is a collection of visibility measurements, which we represent by
+
+$$
+\boldsymbol{V} = \{V_i\}_{i=1}^N.
+$$
+
+A typical ALMA dataset might contain a half-million individual visibility samples, acquired over a range of spatial frequencies.
+
+Assume we have some forward model that can predict the value of the visibility function for any spatial frequency, $\mathcal{V}(u, v) = M_\mathcal{V}(u, v |\,\boldsymbol{\theta})$. It's difficult to reason about all but the simplest models directly in the Fourier plane, so usually models are constructed in the image plane $M_I(l,m |\,\boldsymbol{\theta})$ and then Fourier transformed (either analytically, or via the FFT) to construct visibility models $M_\mathcal{V}(u, v |\,\boldsymbol{\theta}) \leftrightharpoons M_I(l,m |\,\boldsymbol{\theta})$.
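+
+As a schematic illustration of the FFT route (ignoring units, pixel scales, normalization, and phase conventions, which MPoL's `mpol.coordinates` and `mpol.fourier` modules are designed to handle), an image-plane model can be carried to gridded model visibilities like this; the image here is a hypothetical single point source:
+
+```python
+import torch
+
+# hypothetical pixel-basis image model: a single point source at the image center
+npix = 256
+image = torch.zeros(npix, npix)
+image[npix // 2, npix // 2] = 1.0
+
+# the 2D FFT approximates V(u, v) on a regular grid of spatial frequencies
+# (fftshift moves the image center / zero spatial frequency to the conventional position)
+vis_model = torch.fft.fftshift(torch.fft.fft2(torch.fft.fftshift(image)))
+```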
+
+As with the line example, our statement about the data generating process
+
+$$
+V_i = \mathcal{V}(u_i, v_i) + \epsilon_i
+$$
+
+leads to the formulation of the likelihood function
+
+$$
+\ln \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{V}) = - \frac{1}{2} \chi^2(\boldsymbol{\theta}; \boldsymbol{V}) + C
+$$
+
+Because the data and model are complex-valued, $\chi^2$ is evaluated as
+
+$$
+\chi^2(\boldsymbol{\theta}; \boldsymbol{V}) = \sum_i^N w_i |V_i - M_\mathcal{V}(u_i, v_i |\,\boldsymbol{\theta})|^2
+$$
+
+where $| \cdot |$ denotes the complex modulus. Equivalently, the calculation can be broken up into sums over the real ($\Re$) and imaginary ($\Im$) components of the visibility data and model
+
+$$
+\chi^2(\boldsymbol{\theta}; \boldsymbol{V}) = \sum_i^N w_i (V_{\Re,i} - M_\mathcal{V,\Re}(u_i, v_i |\,\boldsymbol{\theta}))^2 + \sum_i^N w_i (V_{\Im,i} - M_\mathcal{V,\Im}(u_i, v_i |\,\boldsymbol{\theta}))^2
+$$
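+
+A minimal sketch of this calculation, using hypothetical (and unrealistically small) arrays for the data, weights, and model predictions:
+
+```python
+import torch
+
+# hypothetical complex visibility data, weights w = 1 / sigma**2, and model predictions
+data_vis = torch.tensor([1.0 + 0.5j, 0.8 - 0.2j, 1.2 + 0.1j])
+weight = torch.tensor([4.0, 2.5, 3.0])
+model_vis = torch.tensor([0.9 + 0.4j, 0.85 - 0.25j, 1.1 + 0.2j])
+
+# chi^2 over the complex residuals
+chi2 = torch.sum(weight * torch.abs(data_vis - model_vis) ** 2)
+
+# the same quantity, split into sums over the real and imaginary components
+chi2_split = torch.sum(weight * (data_vis.real - model_vis.real) ** 2) + torch.sum(
+    weight * (data_vis.imag - model_vis.imag) ** 2
+)
+
+log_likelihood = -0.5 * chi2  # up to the constant C
+```
+
+The data loss functions documented at the end of this page implement these sums, along with the averaged variants discussed below.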
+
+```{admonition} Spectral covariance
+The $\chi^2$ likelihood function as formulated above is appropriate for visibilities with minimal spectral covariance. When modeling spectral line datasets, in particular those that have not been channel-averaged and retain the spectral response function from their Hann windowing, this covariance must be taken into account in the likelihood function. More information on how to derive these covariance matrices is provided in the appendices of [Loomis et al. 2018](https://ui.adsabs.harvard.edu/abs/2018AJ....155..182L/abstract) and will be detailed in forthcoming tutorials.
+```
+
+```{admonition} Hermitian visibilities
+Because the sky brightness $I_\nu$ is real, the visibility function $\mathcal{V}$ is Hermitian, meaning that
+
+$$
+\mathcal{V}(u, v) = \mathcal{V}^*(-u, -v).
+$$
+
+Most datasets (e.g., those extracted from CASA) will only record one visibility measurement per baseline and not include the duplicate Hermitian pair (to save storage space). We recommend that you evaluate all data loss functions *without the Hermitian pair*.
+
+For the likelihood function calculation to be accurate with Hermitian pairs, the simple $\chi^2$ sum over the (data $-$ model) residuals would need to be replaced with a multivariate Gaussian likelihood that included a perfect covariance between each data point and its Hermitian pair. This complicates the calculation unnecessarily.
+
+The one case where Hermitian pairs *do* need to be included is when using the inverse FFT. This applies to the {meth}`mpol.gridding.DirtyImager` and is handled internally.
+
+You can check whether your dataset already includes Hermitian pairs using {meth}`mpol.gridding.verify_no_hermitian_pairs`.
+```
+
+### Averaged loss functions
+
+In an optimization workflow, we usually minimize a loss function $L$ rather than maximize the log likelihood function. If we are just optimizing (instead of sampling), we only care about the value of $\hat{\boldsymbol{\theta}}$ that minimizes the function $L$. The normalization of $L$ does not matter; we only care that $L(\hat{\boldsymbol{\theta}}) < L(\hat{\boldsymbol{\theta}} + \varepsilon)$, not by how much. In these applications we recommend using an *averaged* data loss function, whose value remains approximately constant as the size of the dataset varies.
+
+In the most common scenario, where you are keeping the data and weights fixed, we recommend using a reduced $\chi_R^2$ data loss function, available as either {meth}`mpol.losses.r_chi_squared` or {meth}`mpol.losses.r_chi_squared_gridded`. The hope is that for many applications the reduced $\chi^2_R$ loss function will have a minimum value of $\chi^2_R(\hat{\boldsymbol{\theta}}) \approx 1$ for a well-fit model (regardless of the number of data points).
+
+In the situation where you may be modifying the data and weights (as in a self-calibration workflow), we recommend using the {meth}`mpol.losses.neg_log_likelihood_avg` loss function.
+
+### Regularization
+
+With RML imaging, we're trying to come up with a model that will fit the dataset. But rather than using a parametric model like a series of Gaussian rings for a protoplanetary disk, we're using a non-parametric model of *the image itself*. This could be as simple as parameterizing the image using the intensity values of the pixels themselves, i.e.,
+
+$$
+\boldsymbol{\theta} = \{I_1, I_2, \ldots, I_{N^2} \}
+$$
+
+assuming we have an $N \times N$ image.
+
+This flexible image model is analogous to using a spline or Gaussian process to fit a series of points $\boldsymbol{Y} = \{y_1, y_2, \ldots, y_N\}$---the model will nearly always have enough flexibility to capture the structure that exists in the dataset. The pixel basis set is the most straightforward non-parametric image model, but we could also use more sophisticated basis sets like a set of wavelet coefficients, or even more exotic basis sets constructed from trained neural networks.
+
+Because the Fourier transform is a linear operation with respect to the pixel basis, the maximum likelihood model image (called the dirty image) can be calculated analytically by the inverse Fourier transform. The point spread function of the dirty image is called the dirty beam. By construction, all unsampled spatial frequencies are set to zero power. This means that the dirty image will only contain spatial frequencies about which we have at least some data. This assumption, however, rarely translates into good image fidelity, especially if there are many unsampled spatial frequencies which carry significant power. It's also important to recognize that the dirty image is only *one* out of a set of *many* images that could maximize the likelihood function. From the perspective of the likelihood calculation, we could modify the unsampled spatial frequencies of the dirty image to whatever power we might like, and, because they are *unsampled*, the value of the likelihood calculation won't change, i.e., it will still remain maximal.
+
+When synthesis imaging is described as an "ill-posed inverse problem," this is what is meant. There is a (potentially infinite) range of images that could *exactly* fit the dataset, and without additional information we have no way of discriminating which is best. This is where "regularization" comes in.
+
+One can talk about regularization from a Bayesian perspective using *priors*, i.e., we introduce terms like $p(\boldsymbol{\theta})$ such that we might calculate the maximum a posteriori (MAP) image $\boldsymbol{\theta}_\mathrm{MAP}$ using the posterior probability distribution
+
+$$
+p(\boldsymbol{\theta} |\, \boldsymbol{V}) \propto \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{V}) \, p(\boldsymbol{\theta}).
+$$
+
+For computational reasons related to numerical over/underflow, we would most likely use the logarithm of the posterior probability distribution
+
+$$
+\ln p(\boldsymbol{\theta} |\, \boldsymbol{V}) \propto \ln \mathcal{L}( \boldsymbol{\theta}; \boldsymbol{V}) + \ln p(\boldsymbol{\theta}).
+$$
+
+One could also describe the optimization process without Bayesian terminology as minimizing an objective loss function comprising data loss functions and regularization penalties, e.g.,
+
+$$
+L(\boldsymbol{\theta}) = L_\mathrm{data}(\boldsymbol{\theta}) + L_\mathrm{sparsity}(\boldsymbol{\theta}) + L_\mathrm{TSV}(\boldsymbol{\theta}) + \ldots
+$$
+
+The relative "strength" of the regularization is controlled via a scalar prefactor $\lambda$, internal to each loss function. This is one situation where a normalized data loss function is useful, because it preserves the relative strength of the regularizer even if the dataset (or mini-batches of it, in a stochastic gradient descent setting) changes size.
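+
+To make the workflow concrete, here is a deliberately simplified sketch of such an optimization loop. Everything in it is a hypothetical stand-in: the "data" are synthetic gridded visibilities of a point source, the forward model is a bare FFT, and the data loss and regularizers are written out by hand rather than taken from `mpol.losses`. A real workflow would build the image and Fourier layers from MPoL modules and use the loss functions documented below.
+
+```python
+import torch
+
+# hypothetical "observed" data: gridded visibilities of a point source, with unit weights
+npix = 64
+true_image = torch.zeros(npix, npix)
+true_image[npix // 2, npix // 2] = 1.0
+data_vis = torch.fft.fft2(true_image)
+weight = torch.ones(npix, npix)
+
+# pixel-basis image model and a standard PyTorch optimizer
+image = torch.zeros(npix, npix, requires_grad=True)
+optimizer = torch.optim.Adam([image], lr=0.05)
+lam_sparsity, lam_tsv = 1e-4, 1e-3  # hypothetical regularization strengths
+
+for _ in range(200):
+    optimizer.zero_grad()
+    model_vis = torch.fft.fft2(image)  # stand-in forward model: image -> visibilities
+    loss_data = torch.sum(weight * torch.abs(data_vis - model_vis) ** 2) / data_vis.numel()
+    loss_sparsity = lam_sparsity * torch.sum(torch.abs(image))
+    loss_tsv = lam_tsv * (
+        torch.sum((image[1:, :] - image[:-1, :]) ** 2)
+        + torch.sum((image[:, 1:] - image[:, :-1]) ** 2)
+    )
+    loss = loss_data + loss_sparsity + loss_tsv
+    loss.backward()
+    optimizer.step()
+```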
+
+## Data loss function API
+
+```{eval-rst}
+.. automodule:: mpol.losses
+   :members: r_chi_squared, r_chi_squared_gridded, log_likelihood, log_likelihood_gridded, neg_log_likelihood_avg
+```
+
+## Regularizer loss function API
+
+```{eval-rst}
+.. automodule:: mpol.losses
+   :members: entropy, TV_image, TV_channel, TSV, sparsity, UV_sparsity, PSD, edge_clamp
+```
\ No newline at end of file
diff --git a/docs/api/plotting.md b/docs/api/plotting.md
new file mode 100644
index 00000000..3986f76d
--- /dev/null
+++ b/docs/api/plotting.md
@@ -0,0 +1,6 @@
+# Plotting
+
+```{eval-rst}
+.. automodule:: mpol.plot
+   :members:
+```
diff --git a/docs/api/precomposed.md b/docs/api/precomposed.md
new file mode 100644
index 00000000..915e3806
--- /dev/null
+++ b/docs/api/precomposed.md
@@ -0,0 +1,8 @@
+# Precomposed Modules
+
+For convenience, we provide some "precomposed" [modules](https://pytorch.org/docs/stable/notes/modules.html) which may be useful for simple imaging or modeling applications. In general, though, we encourage you to compose your own set of layers if your application requires it. The source code for a precomposed network can provide a useful starting point. We also recommend checking out the PyTorch documentation on [modules](https://pytorch.org/docs/stable/notes/modules.html).
+
+```{eval-rst}
+.. automodule:: mpol.precomposed
+   :members:
+```
diff --git a/docs/api/train_test.md b/docs/api/train_test.md
new file mode 100644
index 00000000..cb1b5da9
--- /dev/null
+++ b/docs/api/train_test.md
@@ -0,0 +1,6 @@
+# Training and testing
+
+```{eval-rst}
+.. automodule:: mpol.training
+   :members:
+```
\ No newline at end of file
diff --git a/docs/api/utilities.md b/docs/api/utilities.md
new file mode 100644
index 00000000..6cbf7a60
--- /dev/null
+++ b/docs/api/utilities.md
@@ -0,0 +1,6 @@
+# Utilities
+
+```{eval-rst}
+.. automodule:: mpol.utils
+   :members:
+```
diff --git a/docs/background.md b/docs/background.md
new file mode 100644
index 00000000..7af182d1
--- /dev/null
+++ b/docs/background.md
@@ -0,0 +1,42 @@
+# Background and prerequisites
+
+## Radio astronomy
+
+A background in radio astronomy, Fourier transforms, and interferometry is a prerequisite for using MPoL but is beyond the scope of this documentation. We recommend reviewing these resources as needed.
+
+- [Essential radio astronomy](https://www.cv.nrao.edu/~sransom/web/xxx.html) textbook by James Condon and Scott Ransom, and in particular, Chapter 3.7 on Radio Interferometry.
+- NRAO's [17th Synthesis Imaging Workshop](http://www.cvent.com/events/virtual-17th-synthesis-imaging-workshop/agenda-0d59eb6cd1474978bce811194b2ff961.aspx) (recorded lectures and slides available)
+- [Interferometry and Synthesis in Radio Astronomy](https://ui.adsabs.harvard.edu/abs/2017isra.book.....T/abstract) by Thompson, Moran, and Swenson. An excellent and comprehensive reference on all things interferometry.
+- The [Revisiting the radio interferometer measurement equation](https://ui.adsabs.harvard.edu/abs/2011A%26A...527A.106S/abstract) series by O. Smirnov, 2011
+- Ian Czekala's lecture notes on [Radio Interferometry and Imaging](https://iancze.github.io/courses/as5003/lectures/)
+
+RML imaging is different from CLEAN imaging, which operates as a deconvolution procedure in the image plane. However, CLEAN is by far the dominant algorithm used to synthesize images from interferometric data at sub-mm and radio wavelengths, and it is useful to have at least a basic understanding of how it works. We recommend
+
+- [Interferometry and Synthesis in Radio Astronomy](https://ui.adsabs.harvard.edu/abs/2017isra.book.....T/abstract) Chapter 11.1
+- David Wilner's lecture on [Imaging and Deconvolution in Radio Astronomy](https://www.youtube.com/watch?v=mRUZ9eckHZg)
+- For a discussion on using both CLEAN and RML techniques to robustly interpret kinematic data of protoplanetary disks, see Section 3 of [Visualizing the Kinematics of Planet Formation](https://ui.adsabs.harvard.edu/abs/2020arXiv200904345D/abstract) by The Disk Dynamics Collaboration
+
+## Statistics and Machine Learning
+
+MPoL is built on top of the [PyTorch](https://pytorch.org/) machine learning framework and adopts much of the terminology and design principles of machine learning workflows. As a prerequisite, we recommend at least a basic understanding of statistics and machine learning principles. Two excellent (free) textbooks are
+
+- [Dive into Deep Learning](https://d2l.ai/), in particular chapters 1 - 3 to cover the basics of forward models, automatic differentiation, and optimization.
+- [Deep Learning: Foundations and Concepts](https://www.bishopbook.com/) for a lengthier discussion of these concepts and other foundational statistical concepts.
+
+And we highly recommend the informative and entertaining 3b1b lectures on [deep learning](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&ab_channel=3Blue1Brown).
+
+## PyTorch
+
+As a PyTorch library, MPoL expects that the user will write Python code that uses MPoL primitives as building blocks to solve their interferometric imaging workflow, much the same way the artificial intelligence community writes Python code that uses PyTorch layers to implement new neural network architectures (for [example](https://github.com/pytorch/examples)). You will find MPoL easiest to use if you follow PyTorch customs and idioms, e.g., feed-forward neural networks, data storage, GPU acceleration, and train/test optimization loops. Therefore, a basic familiarity with PyTorch is considered a prerequisite for MPoL.
+
+If you are new to PyTorch, we recommend starting with the official [Learn the Basics](https://pytorch.org/tutorials/beginner/basics/intro.html) guide. You can also find high-quality introductions on YouTube and in textbooks.
+
+## RML Imaging
+
+MPoL is a modern PyTorch imaging library; however, many of the key concepts behind Regularized Maximum Likelihood imaging have been around for some time.
We recommend checking out the following (non-exhaustive) list of resources + +- [Regularized Maximum Likelihood Image Synthesis and Validation for ALMA Continuum Observations of Protoplanetary Disks](https://ui.adsabs.harvard.edu/abs/2023PASP..135f4503Z/abstract) by Zawadzki et al. 2023 +- The fourth paper in the 2019 [Event Horizon Telescope Collaboration series](https://ui.adsabs.harvard.edu/abs/2019ApJ...875L...4E/abstract) describing the imaging principles +- [Maximum entropy image restoration in astronomy](https://ui.adsabs.harvard.edu/abs/1986ARA%26A..24..127N/abstract) AR&A by Narayan and Nityananda 1986 +- [Multi-GPU maximum entropy image synthesis for radio astronomy](https://ui.adsabs.harvard.edu/abs/2018A%26C....22...16C/abstract) by Cárcamo et al. 2018 +- Dr. Katie Bouman's Ph.D. thesis ["Extreme Imaging via Physical Model Inversion: Seeing Around Corners and Imaging Black Holes"](https://people.csail.mit.edu/klbouman/pw/papers_and_presentations/thesis.pdf) diff --git a/docs/developer-documentation.md b/docs/developer-documentation.md index 5c59e99d..ac104620 100644 --- a/docs/developer-documentation.md +++ b/docs/developer-documentation.md @@ -18,8 +18,6 @@ Extra packages required for development can be installed via This directs pip to install whatever package is in the current working directory (`.`) as an editable package (`-e`), using the set of `[dev]` optional packages. There is also a more limited set of packages under `[test]`. You can view these packages in the `pyproject.toml` file. - -(testing-reference-label)= ## Testing MPoL includes a test suite written using [pytest](https://docs.pytest.org/). We aim for this test suite to be as comprehensive as possible, since this helps us achieve our goal of shipping stable software. diff --git a/docs/index.md b/docs/index.md index 714a10cf..22d2a5b9 100644 --- a/docs/index.md +++ b/docs/index.md @@ -9,7 +9,7 @@ As a PyTorch *library*, MPoL expects that the user will write Python code to lin MPoL is *not* an imaging application nor a pipeline, though MPoL components could be used to build specialized workflows. We are focused on providing a numerically correct and expressive set of core primitives so the user can leverage the full power of the PyTorch (and Python) ecosystem to solve their research-grade imaging tasks. This is already a significant development and maintenance burden for the limited resources of our small research team, so our immediate scope must necessarily be limited. -To get a sense of what MPoL is and what background material this library assumes, please look at the [](introduction.md). If the package is right for your needs, follow the [installation instructions](installation.md). +To get a sense of what background material MPoL assumes, please look at the [](background.md). If the package is right for your needs, follow the [installation instructions](installation.md). This documentation covers the API and a short set of tutorials demonstrating key components of the MPoL library. Longer examples demonstrating how one might use MPoL components to build an imaging workflow are packaged together in the [MPoL-dev/examples](https://github.com/MPoL-dev/examples) repository. 
@@ -21,8 +21,8 @@ If you'd like to help build the MPoL package, please check out the [](developer- :caption: User Guide :maxdepth: 2 -introduction.md -installation.md +background +installation ci-tutorials/PyTorch ci-tutorials/gridder ci-tutorials/optimization diff --git a/docs/installation.md b/docs/installation.md index 22bf3e20..b60b6dfe 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,6 +1,6 @@ # MPoL Installation -MPoL requires `python >= 3.8`. +MPoL requires `python >= 3.10`. ## Using pip @@ -16,8 +16,6 @@ Or if you require a specific version of MPoL (e.g., `0.2.0`), you can install vi $ pip install MPoL==0.2.0 ``` -We recommend that you install the latest version of MPoL. Though, if you are working on a project across several compute environments, you may wish to ensure that each environment has the same version of MPoL installed. - ## From source If you'd like to install the package from source to access the latest development version, download or `git clone` the MPoL repository and install @@ -30,7 +28,7 @@ $ pip install . If you have trouble installing please raise a [github issue](https://github.com/MPoL-dev/MPoL/issues) with the particulars of your system. -If you're interested in contributing to the MPoL package, please see the {ref}`developer-documentation-label`. +If you're interested in contributing to the MPoL package, please see the [](developer-documentation.md). ## Upgrading @@ -60,10 +58,4 @@ $ python The documentation served online ([here](https://mpol-dev.github.io/MPoL/index.html)) corresponds to the `main` branch. This represents the current state of MPoL and is usually the best place to reference MPoL functionality. However, this documentation may be more current than last tagged version or the version you have installed. If you require the new features detailed in the documentation, then we recommend installing the package from source (as above). -In the (foreseeably rare) situation where the latest online documentation significantly diverges from the package version you wish to use (but there are reasons you do not want to build the `main` branch from source), you can access the documentation for that version by {ref}`building and viewing the documentation locally `. To do so, clone the repository as above, checkout the version tag, and build the docs locally. - -## Using CUDA acceleration - -MPoL uses PyTorch for its Neural Networks as seen in the `Optimization Loop` tutorial. If you are interested in using PyTorch's full potential by utilizing a Nvidia graphics card, then the CUDA tool kit will need to be installed (TensorVision is also required). More information on this is available in the `GPU Tutorial` page. It is worth noting that PyTorch may need to be (re)installed separately using a specific `pip` for your system. - -More information on this can be found on the PyTorch homepage: [pytorch.org](https://pytorch.org/). 
+In the (foreseeably rare) situation where the latest online documentation significantly diverges from the package version you wish to use (but there are reasons you do not want to build the `main` branch from source), you can access the documentation for that version by [building the older documentation locally](developer-documentation.md#older-documentation-versions) \ No newline at end of file diff --git a/docs/introduction.md b/docs/introduction.md deleted file mode 100644 index 6447a201..00000000 --- a/docs/introduction.md +++ /dev/null @@ -1,115 +0,0 @@ -# Orientation - -*Million Points of Light* or "MPoL" is a [PyTorch](https://pytorch.org/) *library* supporting Regularized Maximum Likelihood (RML) imaging and Bayesian Inference workflows with Fourier datasets. We provide the building blocks to create flexible image models and optimize them to fit interferometric datasets. - -## Background and prerequisites - -### Radio astronomy - -A background in radio astronomy, Fourier transforms, and interferometry is a prerequisite for using MPoL but is beyond the scope of this documentation. We recommend reviewing these resources as needed. - -- [Essential radio astronomy](https://www.cv.nrao.edu/~sransom/web/xxx.html) textbook by James Condon and Scott Ransom, and in particular, Chapter 3.7 on Radio Interferometry. -- NRAO's [17th Synthesis Imaging Workshop](http://www.cvent.com/events/virtual-17th-synthesis-imaging-workshop/agenda-0d59eb6cd1474978bce811194b2ff961.aspx) recorded lectures and slides available -- [Interferometry and Synthesis in Radio Astronomy](https://ui.adsabs.harvard.edu/abs/2017isra.book.....T/abstract) by Thompson, Moran, and Swenson. An excellent and comprehensive reference on all things interferometry. -- Ian Czekala's lecture notes on [Radio Interferometry and Imaging](https://iancze.github.io/courses/as5003/lectures/) - -RML imaging is different from CLEAN imaging, which operates as a deconvolution procedure in the image plane. However, CLEAN is by far the dominant algorithm used to synthesize images from interferometric data at sub-mm and radio wavelengths, and it is useful to have at least a basic understanding of how it works. We recommend - -- [Interferometry and Synthesis in Radio Astronomy](https://ui.adsabs.harvard.edu/abs/2017isra.book.....T/abstract) Chapter 11.1 -- David Wilner's lecture on [Imaging and Deconvolution in Radio Astronomy](https://www.youtube.com/watch?v=mRUZ9eckHZg) -- For a discussion on using both CLEAN and RML techniques to robustly interpret kinematic data of protoplanetary disks, see Section 3 of [Visualizing the Kinematics of Planet Formation](https://ui.adsabs.harvard.edu/abs/2020arXiv200904345D/abstract) by The Disk Dynamics Collaboration - -### Statistics and Machine Learning - -MPoL is built on top of the [PyTorch](https://pytorch.org/) machine learning framework and adopts much of the terminology and design principles of machine learning workflows. As a prerequisite, we recommend at least a basic understanding of statistics and machine learning principles. Two excellent (free) textbooks are - -- [Dive into Deep Learning](https://d2l.ai/), in particular chapters 1 - 3 to cover the basics of forward models, automatic differentiation, and optimization. -- [Deep Learning: Foundations and Concepts](https://www.bishopbook.com/) for a lengthier discussion of these concepts and other foundational statistical concepts. 
- -And we highly recommend the informative and entertaining 3b1b lectures on [deep learning](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&ab_channel=3Blue1Brown). - -### PyTorch - -As a PyTorch library, MPoL expects that the user will write Python code that uses MPoL primitives as building blocks to solve their interferometric imaging workflow, much the same way the artificial intelligence community writes Python code that uses PyTorch layers to implement new neural network architectures (for [example](https://github.com/pytorch/examples)). You will find MPoL easiest to use if you follow PyTorch customs and idioms, e.g., feed-forward neural networks, data storage, GPU acceleration, and train/test optimization loops. Therefore, a basic familiarity with PyTorch is considered a prerequisite for MPoL. - -If you are new to PyTorch, we recommend starting with the official [Learn the Basics](https://pytorch.org/tutorials/beginner/basics/intro.html) guide. You can also find high quality introductions on YouTube and in textbooks. - -## MPoL for Regularized Maximum Likelihood imaging - -This document is an attempt to provide a whirlwind introduction to what Regularized Maximum Likelihood (RML) imaging is, and why you might want to use this MPoL package to perform it with your interferometric dataset. Of course, the field is rich, varied, and this short introduction couldn't possibly do justice to cover the topic in depth. We recommend that you check out many of the links and suggestions in this document for further reading and understanding. - - - -We strive to - -- create an open, welcoming, and supportive community for new users and contributors (see our [code of conduct](https://github.com/MPoL-dev/MPoL/blob/main/CODE_OF_CONDUCT.md) and [developer documentation](developer-documentation.md)) -- support well-tested and stable releases (i.e., `pip install mpol`) that run on all currently-supported Python versions, on Linux, MacOS, and Windows -- maintain up-to-date {ref}`API documentation ` -- cultivate tutorials covering real-world applications - -:::{seealso} -We also recommend checking out several other excellent packages for RML imaging: - -- [SMILI](https://github.com/astrosmili/smili) -- [eht-imaging](https://github.com/achael/eht-imaging) -- [GPUVMEM](https://github.com/miguelcarcamov/gpuvmem) -::: - -There are a few things about MPoL that we believe make it an appealing platform for RML modeling. - -**Built on PyTorch**: Many of MPoL's exciting features stem from the fact that it is built on top of a rich computational library that supports autodifferentiation and construction of complex neural networks. Autodifferentiation libraries like [Theano/Aesara](https://github.com/aesara-devs/aesara), [Tensorflow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), and [JAX](https://jax.readthedocs.io/) have revolutionized the way we compute and optimize functions. For now, PyTorch is the library that best satisfies our needs, but we're keeping a close eye on the Python autodifferentiation ecosystem should a more suitable framework arrive. If you are familiar with scientific computing with Python but haven't yet tried any of these frameworks, don't worry, the syntax is easy to pick up and quite similar to working with numpy arrays. For example, check out our tutorial [introduction to PyTorch](ci-tutorials/PyTorch.md). - -**Autodifferentiation**: PyTorch gives MPoL the capacity to autodifferentiate through a model. 
The *gradient* of the objective function is exceptionally useful for finding the "downhill" direction in a large parameter space (such as the set of image pixels). Traditionally, these gradients would have needed to been calculated analytically (by hand) or via finite-difference methods which can be noisy in high dimensions. By leveraging the autodifferentiation capabilities, this allows us to rapidly formulate and implement complex prior distributions which would otherwise be difficult to differentiate by hand. - -**Optimization**: PyTorch provides a full-featured suite of research-grade [optimizers](https://pytorch.org/docs/stable/optim.html) designed to train deep neural networks. These same optimizers can be employed to quickly find the optimum RML image. - -**GPU acceleration**: PyTorch wraps CUDA libraries, making it seamless to take advantage of (multi-)GPU acceleration to optimize images. No need to use a single line of CUDA. - -**Model composability**: Rather than being a monolithic program for single-click RML imaging, MPoL strives to be a flexible, composable, RML imaging *library* that provides primitives that can be used to easily solve your particular imaging challenge. One way we do this is by mimicking the PyTorch ecosystem and writing the RML imaging workflow using [PyTorch modules](https://pytorch.org/tutorials/beginner/nn_tutorial.html). This makes it easy to mix and match modules to construct arbitrarily complex imaging workflows. We're working on tutorials that describe these ideas in depth, but one example would be the ability to use a single latent space image model to simultaneously fit single dish and interferometric data. - -**A bridge to the machine learning/neural network community**: MPoL will happily calculate RML images for you using "traditional" image priors, lest you are the kind of person that turns your nose up at the words "machine learning" or "neural network." However, if you are the kind of person that sees opportunity in these tools, because MPoL is built on PyTorch, it is straightforward to take advantage of them for RML imaging. For example, if one were to train a variational autoencoder on protoplanetary disk emission morphologies, the latent space + decoder architecture could be easily plugged in to MPoL and serve as an imaging basis set. - -To get started with MPoL, we recommend [installing the package](installation.md) and reading through the tutorial series. If you have any questions about the package, we invite you to join us on our [Github discussions page](https://github.com/MPoL-dev/MPoL/discussions). - -second par -* brief description to what RML is, with call-outs to Zawadzki and other seminal resources. - - -## Scope of tutorials nad exmaples - -* scope of tutorial examples to follow, showing module building blocks -* browse the API -* organization of examples folder - - - - - - - - - -Longer workflow exmaples are in examples. - -Including Pyro. - - - -:::{seealso} -That's RML imaging in a nutshell, but we've barely scratched the surface. We highly recommend checking out the following excellent resources. 
- -- The fourth paper in the 2019 [Event Horizon Telescope Collaboration series](https://ui.adsabs.harvard.edu/abs/2019ApJ...875L...4E/abstract) describing the imaging principles -- [Maximum entropy image restoration in astronomy](https://ui.adsabs.harvard.edu/abs/1986ARA%26A..24..127N/abstract) AR&A by Narayan and Nityananda 1986 -- [Multi-GPU maximum entropy image synthesis for radio astronomy](https://ui.adsabs.harvard.edu/abs/2018A%26C....22...16C/abstract) by Cárcamo et al. 2018 -::: - - - - - -```{rubric} Footnotes -``` - -[^mle-solution]: There's actually a lot to unpack here. When your model has many parameters (i.e., the posterior distribution is high dimensional), the MLE (or MAP) solution is unlikely to represent a *typical* realization of your model parameters. This is a quirk of the geometry of high dimensional spaces. For more information, we recommend checking out Chapter 1 of [Betancourt 2017](https://arxiv.org/abs/1701.02434). Still, the MLE solution is often a useful quantity to communicate, summarizing the mode of the probability distribution. - -[^relative-strength]: This is where the factor of $1/2$ in front of $\chi^2$ becomes important. You could use something like $L_\mathrm{nll}(\boldsymbol{\theta}) = \chi^2(\boldsymbol{\theta})$, but then you'd need to change the value of $\lambda_\mathrm{sparsity}$ to achieve the same relative regularization.