diff --git a/joss.06906/paper.jats/10.21105.joss.06906.jats b/joss.06906/paper.jats/10.21105.joss.06906.jats new file mode 100644 index 0000000000..b60d64a204 --- /dev/null +++ b/joss.06906/paper.jats/10.21105.joss.06906.jats @@ -0,0 +1,608 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +6906 +10.21105/joss.06906 + +lintsampler: Easy random sampling via linear +interpolation + + + +https://orcid.org/0000-0001-6841-1496 + +Naik +Aneesh P. + + +* + + +https://orcid.org/0000-0003-1517-3935 + +Petersen +Michael S. + + + + + +Institute for Astronomy, University of Edinburgh, +UK + + + + +* E-mail: + + +14 +6 +2024 + +9 +102 +6906 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Python +statistics +numpy +random variates +random sampling +low discrepancy sequence + + + + + + Summary +

lintsampler provides a Python implementation + of a technique we term ‘linear interpolant sampling’: an algorithm to + efficiently draw pseudo-random samples from an arbitrary probability + density function (PDF). First, the PDF is evaluated on a grid-like + structure. Then, it is assumed that the PDF can be approximated + between grid vertices by the (multidimensional) linear interpolant. + With this assumption, random samples can be efficiently drawn via + inverse transform sampling + (Devroye, + 1986).

+

lintsampler is primarily written with + numpy + (Harris + et al., 2020), drawing some additional functionality from + scipy + (Virtanen + et al., 2020). Under the most basic usage of + lintsampler, the user provides a Python + function defining the target PDF and some parameters describing a + grid-like structure to the LintSampler class, + and is then able to draw samples via the sample + method. Additionally, there is functionality for the user to set the + random seed, employ quasi-Monte Carlo sampling, or sample within a + premade grid (DensityGrid) or tree + (DensityTree) structure.

+
+ + Statement of need +

Below is a (non-exhaustive) list of ‘use cases’, i.e., situations + where a user might find lintsampler (and/or the + the linear interpolant sampling algorithm underpinning it) to be + preferable over random sampling techniques such as importance + sampling, rejection sampling or Markov chain Monte Carlo (MCMC). MCMC + in particular is a powerful class of methods with many excellent + Python implementations + (Coullon + & Nemeth, 2022; + Foreman-Mackey + et al., 2019; + Marignier, + 2023; + Patil + et al., 2010). In certain use cases as described below, + lintsampler can offer more convenient and/or + more efficient sampling compared with these various techniques.

+

We’ll assume that the target PDF the user wishes to sample from + does not have its own exact sampling algorithm (such as the Box-Muller + transform for a Gaussian PDF). The power of + lintsampler lies in its applicability to + arbitrary PDFs for which tailor-made sampling algorithms are not + available.

+ + Use Cases + + 1. Expensive PDF +

If the PDF being sampled from has high computatational overhead + to evaluate (referred to as computationally ‘expensive’) and a + large number of samples is desired, then + lintsampler might be the most + cost-effective option. This is because + lintsampler does not evaluate the PDF for + each sample (as would be the case for other random sampling + techniques), but on the nodes of the user-chosen grid. + Particularly in a low-dimensional setting where the grid does not + have too many nodes, this can mean far fewer PDF evaluations. This + point is demonstrated in the + first + example notebook in the + lintsampler + docs.1

+
+ + 2. Multimodal PDF +

If the target PDF has a highly complex structure with multiple, + well-separated modes, then lintsampler + might be the easiest option (in terms of user + configuration). In such scenarios, MCMC might struggle unless the + walkers are carefully preconfigured to start near the modes. + Similarly, rejection sampling or importance sampling would be + highly suboptimal unless the proposal distribution is carefully + chosen to match the structure of the target PDF. With + lintsampler, the user needs only to ensure + that the resolution of their chosen grid is sufficient to resolve + the PDF structure, and so the setup remains straightforward. This + is demonstrated in the + second + example notebook in the + lintsampler + docs.2

+
+ + 3. PDF with large dynamic range +

If the target PDF has a very large dynamic range, then the + DensityTree object provided by + lintsampler might be an effective solution. + Here, the PDF is evaluated not on a fixed grid, but on the leaves + of a tree. The tree is refined such that regions of concentrated + probability are more finely resolved, based on accuracy criteria. + Such an example is shown in the + third + example notebook in the + lintsampler docs.

+
+ + 4. Noise needs to be minimised +

In Quasi-Monte Carlo (QMC) sampling, one purposefully generates + more ‘balanced’ (and thus less random) draws from a target PDF, so + that sampling noise decreases faster than + + + 𝒪(N1/2). + lintsampler allows easy QMC sampling with + arbitrary PDFs. We are not aware of such capabilities with any + other package. We give an example of using + lintsampler for QMC in the + fourth + example notebook in the + lintsampler docs.

+
+
+ + ‘Real World’ Example +

Any one of the four use cases above would serve by itself as a + sufficient case for choosing lintsampler, but + here we give an example of a real-world scenario that combines all + of the use cases. It is drawn from our own primary research + interests in computational astrophysics.

+

Much of computational astrophysics consists of large-scale high + performance computational simulations of gravitating systems. For + example, simulations of planets evolving and interacting within a + solar system, simulations of stars interacting within a galaxy, or + vast cosmological simulations in which a whole universe is grown + in silico.

+

There exists a myriad of codes used to run these simulations, + each using different algorithms to solve the governing equations. + One class of simulation code that has gained much attention in + recent years is the class of code employing basis function + expansions + (Petersen + et al., 2022; + Vasiliev, + 2019). In these codes, the matter density at any point in + space is represented by a sum over basis functions (not unlike a + Fourier series), with the coefficients in the sum changing over + space and time. As such, the matter comprising the system is + represented everywhere as a smooth, continous fluid, but for many + applications and/or downstream analyses of the simulated system, one + needs to instead represent the system as a set of discrete + particles. These particles can be obtained by drawing samples from + the continuous density field.

+

This is a scenario that satisfies all four of the use cases list + above. To explain further:

+ + +

The PDF we are sampling from (i.e., the basis expansion + representation of the matter density field) can be expensive to + evaluate if a large number of terms are included in the sum.

+
+ +

The PDF can be highly multimodal when the system we are + simulating comprises many distinct gravitating substructures, + such as stellar clusters.

+
+ +

The PDF can have a large dynamic range. Astrophysical + structures such as galaxies and dark matter ‘haloes’ often have + power-law density profiles, such as the Navarro-Frenk-White + profile + (Navarro + et al., 1997). Further complicating the issue is that a + typical dark matter halo will host several ‘subhaloes’, which in + turn might host ‘subsubhaloes’, and so on. In short, a range of + spatial scales needs to be resolved.

+
+ +

If the particle set being sampled is to be used for further + simulation, it can be helpful to draw as ‘noiseless’ a sample as + possible for reasons of numerical stability.

+
+
+

For these reasons, this kind of astrophysical simulation code + provides an excellent example of a ‘real world’ application for + lintsampler. Here, one would cover the + simulation domain with a DensityTree instance + (or several instances, one for each primary structure), call the + refine method to better resolve the + high-density regions, then feed the tree to a + LintSampler instance and call + sample to generate particles. The + qmc flag can be passed to the sampler in + order to employ Quasi-Monte Carlo sampling.

+
+ + Caveats +

In all use cases listed above, it is assumed that the dimension + of the problem is not too high. lintsampler + works by evaluating a given PDF on the nodes of a grid (or grid-like + structure, such as a tree), so the number of evaluations (and memory + occupancy) grows exponentially with the number of dimensions. As a + consequence, many of the efficiency arguments given for + lintsampler below don’t apply to higher + dimensional problems. We probably wouldn’t use + lintsampler in more than 6 dimensions, but + there is no hard limit here: the question of how many dimensions is + too many will depend on the problem at hand.

+
+
+ + Usage +

lintsampler is designed with an interface + that makes sampling from an input PDF straightforward. For example, if + you have PDF with multiple separated peaks:

+ import numpy as np +from scipy.stats import norm + +def gmm_pdf(x): + mu = np.array([-3.0, 0.5, 2.5]) + sig = np.array([1.0, 0.25, 0.75]) + w = np.array([0.4, 0.25, 0.35]) + return np.sum([w[i] * norm.pdf(x, mu[i], sig[i]) for i in range(3)], axis=0) +

lintsampler can efficiently draw samples + from it on some defined interval:

+ from lintsampler import LintSampler + +grid = np.linspace(-7,7,100) +samples = LintSampler(grid,pdf=gmm_pdf).sample(N=10000) +

samples is then an array of 10000 samples + drawn from the PDF. Apart from defining the PDF, + lintsampler enables creating discrete samples + from a continuous PDF in a small handful of lines.

+
+ + Features +

Although lintsampler is written in pure + Python, making the code highly readable, the methods make extensive + use of numpy functionality to provide rapid + sampling. After the structure spanning the domain has been + constructed, sampling proceeds with computational effort scaling + linearly with number of sample points.

+

We provide two methods to define the domain, both optimised with + numpy functionality for efficient construction. + The DensityGrid class takes highly flexible + inputs for defining a grid. In particular, the grid need not be evenly + spaced (or even continuous) in any dimension; the user can + preferentially place grid elements near high-density regions. The + DensityTree class takes error tolerance + parameters and constructs an adaptive structure to achieve the + specified tolerance. We also provide a base class + (DensityStructure) such that the user could + extend the methods for spanning the domain.

+

Documentation for lintsampler, including + example notebooks demonstrating a range of problems, is available via + a + readthedocs + page. The documentation also has an extensive explanation + of the interfaces, including optimisation parameters for increasing + the efficiency in sampling.

+
+ + Acknowledgements +

We would like to thank Sergey Koposov for useful discussions. APN + acknowledges funding support from an Early Career Fellowship from the + Leverhulme Trust. MSP acknowledges funding support from a UKRI Stephen + Hawking Fellowship.

+
+ + + + + + + + DevroyeLuc + + Non-uniform random variate generation + Springer-Verlag + New York, NY, USA + 1986 + 10.1007/978-1-4613-8643-8 + + + + + + NavarroJulio F. + FrenkCarlos S. + WhiteSimon D. M. + + A Universal Density Profile from Hierarchical Clustering + + 199712 + 490 + 2 + https://arxiv.org/abs/astro-ph/9611107 + 10.1086/304888 + 493 + 508 + + + + + + Foreman-MackeyDaniel + FarrWill + SinhaManodeep + ArchibaldAnne + HoggDavid + SandersJeremy + ZuntzJoe + WilliamsPeter + NelsonAndrew + de Val-BorroMiguel + ErhardtTobias + PashchenkoIlya + PlaOriol + + emcee v3: A Python ensemble sampling toolkit for affine-invariant MCMC + The Journal of Open Source Software + 201911 + 4 + 43 + https://arxiv.org/abs/1911.07688 + 10.21105/joss.01864 + 1864 + + + + + + + PatilAnand + HuardDavid + FonnesbeckChristopher J. + + PyMC: Bayesian Stochastic Modelling in Python + Journal of Statistical Software + 2010 + 35 + 4 + https://www.jstatsoft.org/index.php/jss/article/view/v035i04 + 10.18637/jss.v035.i04 + 1 + 81 + + + + + + MarignierAugustin + + PxMCMC: A Python package for proximal Markov Chain Monte Carlo + The Journal of Open Source Software + 202307 + 8 + 87 + 10.21105/joss.05582 + 5582 + + + + + + + CoullonJeremie + NemethChristopher + + SGMCMCJax: A lightweight JAX library for stochastic gradient Markov Chain Monte Carlo algorithms + Journal of Open Source Software + The Open Journal + 2022 + 7 + 72 + https://doi.org/10.21105/joss.04113 + 10.21105/joss.04113 + 4113 + + + + + + + VirtanenPauli + GommersRalf + OliphantTravis E. + HaberlandMatt + ReddyTyler + CournapeauDavid + BurovskiEvgeni + PetersonPearu + WeckesserWarren + BrightJonathan + van der WaltStéfan J. + BrettMatthew + WilsonJoshua + MillmanK. Jarrod + MayorovNikolay + NelsonAndrew R. J. + JonesEric + KernRobert + LarsonEric + CareyC J + Polatİlhan + FengYu + MooreEric W. + VanderPlasJake + LaxaldeDenis + PerktoldJosef + CimrmanRobert + HenriksenIan + QuinteroE. A. + HarrisCharles R. + ArchibaldAnne M. + RibeiroAntônio H. + PedregosaFabian + van MulbregtPaul + SciPy 1.0 Contributors + + SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python + Nature Methods + 2020 + 17 + 10.1038/s41592-019-0686-2 + 261 + 272 + + + + + + HarrisCharles R. + MillmanK. Jarrod + WaltStéfan J. van der + GommersRalf + VirtanenPauli + CournapeauDavid + WieserEric + TaylorJulian + BergSebastian + SmithNathaniel J. + KernRobert + PicusMatti + HoyerStephan + KerkwijkMarten H. van + BrettMatthew + HaldaneAllan + RíoJaime Fernández del + WiebeMark + PetersonPearu + Gérard-MarchantPierre + SheppardKevin + ReddyTyler + WeckesserWarren + AbbasiHameer + GohlkeChristoph + OliphantTravis E. + + Array programming with NumPy + Nature + Springer Science; Business Media LLC + 202009 + 585 + 7825 + https://doi.org/10.1038/s41586-020-2649-2 + 10.1038/s41586-020-2649-2 + 357 + 362 + + + + + + VasilievEugene + + AGAMA: action-based galaxy modelling architecture + + 201901 + 482 + 2 + https://arxiv.org/abs/1802.08239 + 10.1093/mnras/sty2672 + 1525 + 1544 + + + + + + PetersenMichael S. + WeinbergMartin D. + KatzNeal + + EXP: N-body integration using basis function expansions + + 202203 + 510 + 4 + https://arxiv.org/abs/2104.14577 + 10.1093/mnras/stab3639 + 6201 + 6217 + + + + + +

Similarly, there might be situations where the + user is not so concerned about strict statistical representativeness + but wants to generate a huge number of samples from a target PDF + with the least possible computational cost (such as e.g., generating + realistic point cloud distributions in video game graphics). They + can use lintsampler with a coarse grid (so + minimal PDF evaluations), then sample() to + their heart’s content.

+
+ +

It is worth noting that in these kinds of + complex, multimodal problems, a single fixed grid might not be the + most cost-effective sampling domain. For this reason, + lintsampler also provides simple + functionality for sampling over multiple disconnected grids.

+
+
+
+