run black and build docs. In general I should make a GH site hosting the docs
Nathaniel Imel authored and Nathaniel Imel committed Mar 4, 2024
1 parent e8aec1b commit 1ffd6ac
Showing 34 changed files with 8,713 additions and 4,923 deletions.
161 changes: 156 additions & 5 deletions docs/sciterra.html
@@ -45,7 +45,14 @@

<h2>Contents</h2>
<ul>
<li><a href="#sciterra">sciterra</a></li>
<li><a href="#sciterra-a-python-library-for-similarity-based-scientometrics">sciterra: a python library for similarity-based scientometrics</a>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#installing-sciterra">Installing sciterra</a></li>
<li><a href="#usage">Usage</a></li>
<li><a href="#additional-features">Additional features</a></li>
<li><a href="#acknowledgments">Acknowledgments</a></li>
</ul></li>
</ul>


@@ -75,13 +82,157 @@ <h2>API Documentation</h2>
<h1 class="modulename">
sciterra </h1>

<div class="docstring"><h1 id="sciterra">sciterra</h1>
<div class="docstring"><h1 id="sciterra-a-python-library-for-similarity-based-scientometrics">sciterra: a python library for similarity-based scientometrics</h1>

<p><a href="https://github.com/nathimel/sciterra/actions/workflows/test.yml"><img src="https://github.com/nathimel/sciterra/actions/workflows/test.yml/badge.svg" alt="build" /></a></p>
<p><a href="https://github.com/nathimel/sciterra/actions/workflows/build.yml"><img src="https://github.com/nathimel/sciterra/actions/workflows/build.yml/badge.svg" alt="build" /></a></p>

<p>Software library to support data-driven analyses of scientific literature.</p>
<p>Sciterra is a software library to support data-driven analyses of scientific literature, with a focus on unifying different bibliographic database APIs and document-embedding methods for systematic scientometrics research.</p>

<p>This library is a reimplementation of Zach Hafen's <a href="https://github.com/zhafen/cc">cc</a> library.</p>
<h2 id="overview">Overview</h2>

<p>The main purpose of sciterra is to perform similarity-based retrieval of scientific publications for metascience/scientometrics research. While there are many services that can make the individual steps of this simple, this software library exists to</p>

<ol>
<li><p>Unify the different APIs and vector-based retrieval methods</p></li>
<li><p>Support scientometrics analyses of citation dynamics, especially with respect to a vectorized 'landscape' of literature.</p></li>
</ol>

<h2 id="installing-sciterra">Installing sciterra</h2>

<p>First, set up a virtual environment (e.g. via <a href="https://docs.conda.io/projects/miniconda/en/latest/">miniconda</a>, <code>conda create -n sciterra</code>, and <code>conda activate sciterra</code>).</p>

<ol>
<li><p>Install sciterra via git:</p>

<p><code>python -m pip install 'sciterra @ git+https://github.com/nathimel/sciterra.git'</code></p></li>
<li><p>Alternatively, download or clone this repository and navigate to the root folder, and install locally:</p>

<p><code>pip install -e .</code></p></li>
<li><p>You can also install via pip from PyPI, although this is not yet recommended because sciterra is still in development:</p>

<p><code>pip install sciterra</code></p></li>
</ol>

<h2 id="usage">Usage</h2>

<h3 id="atlas">Atlas</h3>

<p>The central object in sciterra is the <a href="src/sciterra/mapping/atlas.py"><code>Atlas</code></a>. This is a basic data structure for containing scientific publications that are returned from calls to various bibliographic database APIs.</p>

<p>An Atlas minimally requires a list of <a href="src/sciterra/mapping/publication.py"><code>Publications</code></a>.</p>

<h4 id="publication">Publication</h4>

<p>A publication object is a minimal wrapper around publication data, and should have a string identifier. It is designed to standardize the basic metadata contained in the results from some bibliographic database API.</p>

<div class="pdoc-code codehilite">
<pre><span></span><code><span class="kn">from</span> <span class="nn">sciterra</span> <span class="kn">import</span> <span class="n">Atlas</span><span class="p">,</span> <span class="n">Publication</span>

<span class="n">atl</span> <span class="o">=</span> <span class="n">Atlas</span><span class="p">([</span><span class="n">Publication</span><span class="p">({</span><span class="s2">&quot;identifier&quot;</span><span class="p">:</span> <span class="s2">&quot;id&quot;</span><span class="p">})])</span>
</code></pre>
</div>

<p>Alternatively, you can construct an Atlas by passing in a .bib file. The entries in this BibTeX file are parsed for unique identifiers (e.g., DOIs), which are sent in an API call; the results are returned as Publications, which then populate an Atlas.</p>

<div class="pdoc-code codehilite">
<pre><span></span><code><span class="n">atl</span> <span class="o">=</span> <span class="n">crt</span><span class="o">.</span><span class="n">bibtex_to_atlas</span><span class="p">(</span><span class="n">bibtex_filepath</span><span class="p">)</span>
</code></pre>
</div>

<p>In the line of code above, the variable <code>crt</code> is an instance of a <a href="src/sciterra/mapping/cartography.py"><code>Cartographer</code></a> object, which encapsulates the bookkeeping involved in querying a bibliographic database for publications.</p>
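<p>To make the identifier-parsing step concrete, here is a minimal sketch of pulling DOIs out of BibTeX text. It is an illustration only, not how sciterra parses .bib files: the helper <code>extract_dois</code>, the regular expression, and the sample entries are all hypothetical, and a real parser would need to handle nested braces and other identifier types.</p>

```python
import re

# Hypothetical helper, NOT part of sciterra: find `doi = {...}` or `doi = "..."` fields.
DOI_FIELD = re.compile(r'\bdoi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)


def extract_dois(bibtex_text):
    """Return the unique DOIs found in a BibTeX string, in order of appearance."""
    seen = set()
    dois = []
    for match in DOI_FIELD.finditer(bibtex_text):
        doi = match.group(1).strip()
        if doi not in seen:
            seen.add(doi)
            dois.append(doi)
    return dois


# Made-up entries for illustration; the DOIs are placeholders.
SAMPLE_BIBTEX = """
@article{a, title = {First}, doi = {10.1000/alpha}}
@article{b, title = {Second}, doi = "10.1000/beta"}
@article{c, title = {Duplicate}, doi = {10.1000/alpha}}
@article{d, title = {No identifier}}
"""

print(extract_dois(SAMPLE_BIBTEX))
```

<p>Entries without an identifier (like the last one in the sample) are simply skipped here; whether that is the right behavior depends on the database being queried.</p>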

<h3 id="cartographer">Cartographer</h3>

<p>The Cartographer class is so named because it interfaces with an Atlas to build out a library of publications. Since it does so via similarity-based retrieval, the resulting Atlas can be considered a 'region' of publications.</p>

<p>To do this, a Cartographer needs two things: an API with which to interface, and a way of getting document embeddings. Both are encapsulated, respectively, by the <a href="src/sciterra/librarians/librarian.py"><code>Librarian</code></a> and the <a href="src/sciterra/vectorization/vectorizer.py"><code>Vectorizer</code></a> classes.</p>

<div class="pdoc-code codehilite">
<pre><span></span><code><span class="kn">from</span> <span class="nn">sciterra</span> <span class="kn">import</span> <span class="n">Cartographer</span>
<span class="kn">from</span> <span class="nn"><a href="sciterra/librarians.html">sciterra.librarians</a></span> <span class="kn">import</span> <span class="n">SemanticScholarLibrarian</span> <span class="c1"># or ADSLibrarian</span>
<span class="kn">from</span> <span class="nn"><a href="sciterra/vectorization.html">sciterra.vectorization</a></span> <span class="kn">import</span> <span class="n">SciBERTVectorizer</span> <span class="c1"># among others</span>

<span class="n">crt</span> <span class="o">=</span> <span class="n">Cartographer</span><span class="p">(</span>
<span class="n">librarian</span><span class="o">=</span><span class="n">SemanticScholarLibrarian</span><span class="p">(),</span>
<span class="n">vectorizer</span><span class="o">=</span><span class="n">SciBERTVectorizer</span><span class="p">(),</span>
<span class="p">)</span>
</code></pre>
</div>

<h4 id="librarian">Librarian</h4>

<p>Each Librarian subclass is designed to be a wrapper for an existing python API service, such as the <a href="https://ads.readthedocs.io/en/latest/">ads</a> package or the <a href="https://github.com/danielnsilva/semanticscholar#">semanticscholar</a> client library.</p>

<p>A Librarian subclass also overrides two methods. The first is <code>get_publications</code>, which takes a list of identifiers, queries the specific API for that Librarian, and returns a list of Publications. Keyword arguments can be passed to specify the metadata that is kept for each publication (e.g. date, title, journal, authors, etc.). The second method is <code>convert_publication</code>, which defines how the result of an API call should be converted to a sciterra Publication object.</p>

<p>Contributions to sciterra in the form of new Librarian subclasses are encouraged and appreciated.</p>
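<p>The division of labor between the two methods can be sketched without any network access. The classes below are hypothetical stand-ins (<code>ToyPublication</code>, <code>ToyLibrarian</code>, the <code>raw_id</code> key, and the <code>keep</code> argument are all assumptions made for illustration); they do not subclass sciterra's <code>Librarian</code> and the real method signatures may differ.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPublication:
    """Hypothetical stand-in for a sciterra Publication."""

    identifier: str
    title: Optional[str] = None
    abstract: Optional[str] = None


class ToyLibrarian:
    """Hypothetical librarian backed by an in-memory dict instead of a web API."""

    def __init__(self, records):
        # Maps identifier -> raw record, playing the role of the remote database.
        self.records = records

    def convert_publication(self, record, keep):
        """Turn one raw record into a ToyPublication, or None if it has no id."""
        if "raw_id" not in record:
            return None
        # `keep` may only name ToyPublication fields ("title", "abstract").
        fields = {name: record.get(name) for name in keep}
        return ToyPublication(identifier=record["raw_id"], **fields)

    def get_publications(self, identifiers, keep=("title", "abstract")):
        """Look up each identifier and return the successfully converted results."""
        found = []
        for identifier in identifiers:
            record = self.records.get(identifier)
            if record is None:
                continue  # the "API" returned nothing for this identifier
            publication = self.convert_publication(record, keep)
            if publication is not None:
                found.append(publication)
        return found


toy_records = {
    "p1": {"raw_id": "p1", "title": "On Dust", "abstract": "Dust is everywhere."},
    "p2": {"title": "Record without an id"},
}
toy_librarian = ToyLibrarian(toy_records)
print(toy_librarian.get_publications(["p1", "p2", "p3"], keep=("title",)))
```

<p>Keeping the conversion logic in its own method means a new database only needs a new mapping from its raw records to the common publication type; the lookup loop stays the same.</p>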

<h3 id="vectorizer">Vectorizer</h3>

<p>Vectorizer subclasses override one function, <code>embed_documents</code>, which takes a list of strings, each representing the text of a publication (currently, just its abstract), and returns an <code>np.ndarray</code> of embeddings.</p>

<p>Under the hood, the <code>project</code> method of Cartographer, which is used during similarity-based retrieval, uses the vectorizer roughly as follows:</p>

<div class="pdoc-code codehilite">
<pre><span></span><code><span class="c1"># Get abstracts</span>
<span class="n">docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">atlas</span><span class="p">[</span><span class="n">identifier</span><span class="p">]</span><span class="o">.</span><span class="n">abstract</span> <span class="k">for</span> <span class="n">identifier</span> <span class="ow">in</span> <span class="n">identifiers</span><span class="p">]</span>

<span class="c1"># Embed abstracts</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">vectorizer</span><span class="o">.</span><span class="n">embed_documents</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;embeddings&quot;</span><span class="p">]</span>

<span class="c1"># depending on the vectorizer, sometimes not all embeddings can be obtained due to out-of-vocab issues</span>
<span class="n">success_indices</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;success_indices&quot;</span><span class="p">]</span> <span class="c1"># shape `(len(embeddings),)`</span>
<span class="n">fail_indices</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;fail_indices&quot;</span><span class="p">]</span> <span class="c1"># shape `(len(docs) - len(embeddings),)`</span>
</code></pre>
</div>

<p>Currently, sciterra has vectorizers using <a href="https://aclanthology.org/D19-1371/">SciBERT</a>, <a href="https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models">SBERT</a>, <a href="https://huggingface.co/docs/transformers/en/model_doc/gpt2">GPT-2</a>, <a href="https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#">Word2Vec</a>, and a simple bag-of-words (BOW) vectorizer that uses the same vocabulary as the Word2Vec vectorizer. Contributions to sciterra in the form of new Vectorizer subclasses are also encouraged and appreciated.</p>
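<p>As a rough illustration of the result layout used in the snippet above (<code>"embeddings"</code>, <code>"success_indices"</code>, <code>"fail_indices"</code>), here is a toy bag-of-words vectorizer. It is a sketch under stated assumptions: <code>ToyBowVectorizer</code> is hypothetical, it uses plain Python lists where sciterra uses <code>np.ndarray</code>, and it treats a document as failed when none of its words are in the vocabulary.</p>

```python
from collections import Counter


class ToyBowVectorizer:
    """Hypothetical bag-of-words vectorizer; not sciterra's implementation."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def embed_documents(self, docs):
        embeddings, success_indices, fail_indices = [], [], []
        for index, doc in enumerate(docs):
            counts = Counter(doc.lower().split())
            # One count per vocabulary word; words outside the vocabulary are ignored.
            vector = [counts[word] for word in self.vocabulary]
            if any(vector):
                embeddings.append(vector)
                success_indices.append(index)
            else:
                # Entirely out-of-vocabulary document: no embedding is produced.
                fail_indices.append(index)
        return {
            "embeddings": embeddings,
            "success_indices": success_indices,
            "fail_indices": fail_indices,
        }


toy_vectorizer = ToyBowVectorizer(["galaxy", "star", "dust"])
toy_result = toy_vectorizer.embed_documents(
    ["Galaxy star star", "quantum chromodynamics", "dust"]
)
print(toy_result)
```

<p>Returning the indices alongside the embeddings lets the caller line each embedding up with its source document even when some documents are dropped.</p>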

<h3 id="putting-it-all-together">Putting it all together</h3>

<p>The main use case for all of these ingredients is to iteratively build out a region of publications. This is done using <code>iterate_expand</code>:</p>

<div class="pdoc-code codehilite">
<pre><span></span><code><span class="kn">from</span> <span class="nn"><a href="sciterra/mapping/tracing.html">sciterra.mapping.tracing</a></span> <span class="kn">import</span> <span class="n">iterate_expand</span>

<span class="c1"># Assuming the initial atlas contains just one publication</span>
<span class="p">(</span><span class="n">atl</span><span class="o">.</span><span class="n">center</span><span class="p">,</span> <span class="p">)</span> <span class="o">=</span> <span class="n">atl</span><span class="o">.</span><span class="n">publications</span><span class="o">.</span><span class="n">values</span><span class="p">()</span>
<span class="c1"># build out an atlas to contain 10,000 publications, with increasing dissimilarity to the initial publication, saving progress in binary files to the directory named &quot;atlas&quot;.</span>
<span class="n">iterate_expand</span><span class="p">(</span>
<span class="n">atl</span><span class="o">=</span><span class="n">atl</span><span class="p">,</span>
<span class="n">crt</span><span class="o">=</span><span class="n">crt</span><span class="p">,</span>
<span class="n">atlas_dir</span><span class="o">=</span><span class="s2">&quot;atlas&quot;</span><span class="p">,</span>
<span class="n">target_size</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span>
<span class="n">center</span><span class="o">=</span><span class="n">atl</span><span class="o">.</span><span class="n">center</span><span class="p">,</span>
<span class="p">)</span>
</code></pre>
</div>

<p>This function accepts a number of keyword arguments for tracking the Atlas expansion, limiting the number of publications per expansion, setting how many times to retry a request if there are connection issues, etc.</p>
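<p>The idea of iteratively growing a region around a center publication can be sketched with a toy loop. This is not sciterra's <code>iterate_expand</code>: the function <code>toy_iterate_expand</code>, its parameters (<code>per_step_limit</code>, <code>max_iterations</code>), the neighbor graph, and the two-dimensional embeddings are all assumptions chosen to keep the example self-contained.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_iterate_expand(seed, center, neighbors, embeddings, target_size,
                       per_step_limit=2, max_iterations=10):
    """Grow `seed` toward `target_size`, adding the most center-similar neighbors first."""
    atlas = list(seed)
    for _ in range(max_iterations):
        if len(atlas) >= target_size:
            break
        # Candidates are cited/citing neighbors of current members not yet included.
        candidates = {
            other
            for member in atlas
            for other in neighbors.get(member, ())
            if other not in atlas
        }
        if not candidates:
            break
        # Most similar to the center first; ties broken by identifier for determinism.
        ranked = sorted(
            candidates,
            key=lambda c: (-cosine_similarity(embeddings[c], embeddings[center]), c),
        )
        room = target_size - len(atlas)
        atlas.extend(ranked[: min(per_step_limit, room)])
    return atlas


# Made-up graph and embeddings for illustration.
toy_neighbors = {"A": ["B", "C", "D"], "B": ["E"], "C": [], "D": []}
toy_embeddings = {
    "A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.5, 0.5],
    "D": [0.0, 1.0], "E": [0.8, 0.2],
}
print(toy_iterate_expand(["A"], "A", toy_neighbors, toy_embeddings, target_size=4))
```

<p>Because each step prefers the candidates closest to the center, publications added in later steps tend to be less similar to it, which is the sense in which the region grows outward.</p>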

<p>In practice, it may be helpful to use the <a href="src/sciterra/mapping/tracing.py"><code><a href="sciterra/mapping/tracing.html#AtlasTracer">sciterra.mapping.tracing.AtlasTracer</a></code></a> data structure to reduce most of the loading/initialization boilerplate described above. For an example, see <a href="src/examples/scratch/main.py">main.py</a>.</p>

<h2 id="additional-features">Additional features</h2>

<ul>
<li>The <a href="src/sciterra/mapping/topography.py">topography</a> submodule contains similarity-based metrics for publications, to support scientometrics analyses.</li>
</ul>

<h2 id="acknowledgments">Acknowledgments</h2>

<p>This software is a reimplementation of Zachary Hafen-Saavedra's library, <a href="https://github.com/zhafen/cc">cc</a>.</p>

<p>To cite sciterra, please use the following workshop paper:</p>

<pre><code>@inproceedings{Imel2023,
author = {Imel, Nathaniel and Hafen, Zachary},
title = {Citation-similarity relationships in astrophysics},
booktitle = {AI for Scientific Discovery: From Theory to Practice Workshop (AI4Science @ NeurIPS)},
year = {2023},
url = {https://openreview.net/pdf?id=mISayy7DPI},
}
</code></pre>
</div>

<input id="mod-sciterra-view-source" class="view-source-toggle-state" type="checkbox" aria-hidden="true" tabindex="-1">
39 changes: 33 additions & 6 deletions docs/sciterra/librarians.html
@@ -56,6 +56,13 @@ <h2>Submodules</h2>
<li><a href="librarians/s2librarian.html">s2librarian</a></li>
</ul>

<h2>API Documentation</h2>
<ul class="memberlist">
<li>
<a class="variable" href="#librarians">librarians</a>
</li>
</ul>



<a class="attribution" title="pdoc: Python API documentation generator" href="https://pdoc.dev" target="_blank">
@@ -75,16 +82,36 @@ <h1 class="modulename">

<label class="view-source-button" for="mod-librarians-view-source"><span>View Source</span></label>

<div class="pdoc-code codehilite"><pre><span></span><span id="L-1"><a href="#L-1"><span class="linenos">1</span></a><span class="kn">from</span> <span class="nn">.librarian</span> <span class="kn">import</span> <span class="n">Librarian</span>
</span><span id="L-2"><a href="#L-2"><span class="linenos">2</span></a><span class="kn">from</span> <span class="nn">.adslibrarian</span> <span class="kn">import</span> <span class="n">ADSLibrarian</span>
</span><span id="L-3"><a href="#L-3"><span class="linenos">3</span></a><span class="kn">from</span> <span class="nn">.s2librarian</span> <span class="kn">import</span> <span class="n">SemanticScholarLibrarian</span>
</span><span id="L-4"><a href="#L-4"><span class="linenos">4</span></a>
</span><span id="L-5"><a href="#L-5"><span class="linenos">5</span></a><span class="sd">&quot;&quot;&quot;Why is there not an ArxivLibrarian? For now, we are restricting to APIs that allow us to traverse literature graphs, and arxiv does not have one. While there is a useful pip-installable package for querying the arxiv api for papers, https://pypi.org/project/arxiv/, the returned object does not have information on references and citations. However, it may still be possible to obtain a large sample of publications with abstracts and submission dates (though no citation counts), because the arxiv API&#39;s limit for a single query is 300,000 results.</span>
</span><span id="L-6"><a href="#L-6"><span class="linenos">6</span></a><span class="sd">&quot;&quot;&quot;</span>
<div class="pdoc-code codehilite"><pre><span></span><span id="L-1"><a href="#L-1"><span class="linenos"> 1</span></a><span class="kn">from</span> <span class="nn">.librarian</span> <span class="kn">import</span> <span class="n">Librarian</span>
</span><span id="L-2"><a href="#L-2"><span class="linenos"> 2</span></a><span class="kn">from</span> <span class="nn">.adslibrarian</span> <span class="kn">import</span> <span class="n">ADSLibrarian</span>
</span><span id="L-3"><a href="#L-3"><span class="linenos"> 3</span></a><span class="kn">from</span> <span class="nn">.s2librarian</span> <span class="kn">import</span> <span class="n">SemanticScholarLibrarian</span>
</span><span id="L-4"><a href="#L-4"><span class="linenos"> 4</span></a>
</span><span id="L-5"><a href="#L-5"><span class="linenos"> 5</span></a><span class="n">librarians</span> <span class="o">=</span> <span class="p">{</span>
</span><span id="L-6"><a href="#L-6"><span class="linenos"> 6</span></a> <span class="s2">&quot;S2&quot;</span><span class="p">:</span> <span class="n">SemanticScholarLibrarian</span><span class="p">,</span>
</span><span id="L-7"><a href="#L-7"><span class="linenos"> 7</span></a> <span class="s2">&quot;ADS&quot;</span><span class="p">:</span> <span class="n">ADSLibrarian</span><span class="p">,</span>
</span><span id="L-8"><a href="#L-8"><span class="linenos"> 8</span></a><span class="p">}</span>
</span><span id="L-9"><a href="#L-9"><span class="linenos"> 9</span></a>
</span><span id="L-10"><a href="#L-10"><span class="linenos">10</span></a><span class="sd">&quot;&quot;&quot;Why is there not an ArxivLibrarian? For now, we are restricting to APIs that allow us to traverse literature graphs, and arxiv does not have one. While there is a useful pip-installable package for querying the arxiv api for papers, https://pypi.org/project/arxiv/, the returned object does not have information on references and citations. However, it may still be possible to obtain a large sample of publications with abstracts and submission dates (though no citation counts), because the arxiv API&#39;s limit for a single query is 300,000 results.</span>
</span><span id="L-11"><a href="#L-11"><span class="linenos">11</span></a><span class="sd">&quot;&quot;&quot;</span>
</span></pre></div>


</section>
<section id="librarians">
<div class="attr variable">
<span class="name">librarians</span> =
<input id="librarians-view-value" class="view-value-toggle-state" type="checkbox" aria-hidden="true" tabindex="-1">
<label class="view-value-button pdoc-button" for="librarians-view-value"></label><span class="default_value">{&#39;S2&#39;: &lt;class &#39;<a href="librarians/s2librarian.html#SemanticScholarLibrarian">sciterra.librarians.s2librarian.SemanticScholarLibrarian</a>&#39;&gt;, &#39;ADS&#39;: &lt;class &#39;<a href="librarians/adslibrarian.html#ADSLibrarian">sciterra.librarians.adslibrarian.ADSLibrarian</a>&#39;&gt;}</span>


</div>
<a class="headerlink" href="#librarians"></a>

<div class="docstring"><p>Why is there not an ArxivLibrarian? For now, we are restricting to APIs that allow us to traverse literature graphs, and arxiv does not have one. While there is a useful pip-installable package for querying the arxiv api for papers, <a href="https://pypi.org/project/arxiv/">https://pypi.org/project/arxiv/</a>, the returned object does not have information on references and citations. However, it may still be possible to obtain a large sample of publications with abstracts and submission dates (though no citation counts), because the arxiv API's limit for a single query is 300,000 results.</p>
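<p>A name-to-class mapping like <code>librarians</code> is typically used to pick a class from a configuration string. The sketch below shows that lookup pattern with hypothetical placeholder classes and a hypothetical <code>make_librarian</code> helper; it does not import sciterra, and sciterra may use the mapping differently.</p>

```python
class ToySemanticScholarLibrarian:
    """Placeholder standing in for SemanticScholarLibrarian."""


class ToyADSLibrarian:
    """Placeholder standing in for ADSLibrarian."""


# Mirrors the shape of the `librarians` dict: short name -> class (not instance).
toy_librarians = {
    "S2": ToySemanticScholarLibrarian,
    "ADS": ToyADSLibrarian,
}


def make_librarian(name):
    """Instantiate the librarian registered under `name` (hypothetical helper)."""
    try:
        librarian_class = toy_librarians[name]
    except KeyError:
        known = ", ".join(sorted(toy_librarians))
        raise ValueError(f"Unknown librarian {name!r}; expected one of: {known}") from None
    return librarian_class()


print(type(make_librarian("ADS")).__name__)
```

<p>Storing classes rather than instances defers construction until a librarian is actually requested.</p>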
</div>


</section>
</main>
<script>
function escapeHTML(html) {