
Custom Downloader

GeoffreyChen777 edited this page Aug 20, 2022 · 2 revisions

This page explains how to set up your custom downloader.

If the built-in PDF downloaders are not suitable for your research, you can write your own downloader.

Design

A PDF downloader consists of three main functions: preProcess, queryProcess, and downloadImpl. preProcess usually returns three values: queryUrl, headers, and enable. queryProcess requests the queryUrl to get the real download URL. downloadImpl first calls preProcess, then calls queryProcess to perform the network request, and finally downloads the PDF. An entityDraft is passed through all enabled downloaders until the PDF is downloaded.
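The "try each enabled downloader until one succeeds" behaviour can be sketched as follows. This is only an illustration, not Paperlib's actual code: the function name downloadWithFallback and the downloaders array are hypothetical, and the sketch assumes each downloader exposes a downloadImpl that resolves to the updated entityDraft or to null.

```javascript
// Hypothetical driver (not part of Paperlib): tries each downloader in order.
// Assumes every downloader exposes an async downloadImpl(entityDraft).
async function downloadWithFallback(downloaders, entityDraft) {
  for (const downloader of downloaders) {
    // downloadImpl is assumed to resolve to the updated entityDraft, or to
    // null when the downloader is disabled or no PDF could be fetched.
    const result = await downloader.downloadImpl(entityDraft);
    if (result !== null) {
      return result; // stop at the first downloader that succeeds
    }
  }
  return null; // no downloader produced a PDF
}
```

Because the loop stops at the first non-null result, downloaders later in the list are only consulted when the earlier ones are disabled or fail.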

Add a custom downloader

Open the preference window, select the downloader tab, and click the + button.

downloadImpl

The default downloadImpl function is:

    // Build the query URL and headers, and check whether this downloader is enabled.
    const { queryUrl, headers, enable } = this.preProcess(entityDraft);

    if (enable) {
      const agent = this.getProxyAgent();
      // Resolve the real download URL from the query URL.
      const downloadUrl = await this.queryProcess(queryUrl, headers, entityDraft);
      if (downloadUrl) {
        this.sharedState.set("viewState.processInformation", "Downloading...");
        const downloadedUrl = await downloadPDFs([downloadUrl], agent);

        if (downloadedUrl.length > 0) {
          entityDraft.mainURL = downloadedUrl[0];
          return entityDraft;
        } else {
          return null;
        }
      } else {
        return null;
      }
    } else {
      return null;
    }

Usually, it is unnecessary to modify this function.

preProcess

Let's use the built-in ArXiv downloader as an example.

    const enable = entityDraft.arxiv !== "" && this.getEnable("arxiv");

    const queryUrl = `https://arxiv.org/pdf/${entityDraft.arxiv}.pdf`;

    const headers = {
      "user-agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    };

    if (enable) {
      this.sharedState.set(
        "viewState.processInformation",
        `Downloading PDF from ArXiv ...`
      );
    }

    // downloadImpl destructures these three values from the result.
    return { queryUrl, headers, enable };

This function first determines whether the downloader should be enabled. Here, enable is true if the entityDraft has a non-empty arxiv property and you have enabled this downloader in the preference window.

After that, we construct the queryUrl.

Finally, we send a message to Paperlib that your downloader is about to download the PDF of this paper.
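The same three steps can be adapted to another source. The sketch below is a hypothetical example rather than a built-in downloader: the host example.org, the "my-doi-source" key, and the assumption that entityDraft carries a doi string are placeholders chosen for illustration. Only getEnable and sharedState.set are taken from the ArXiv example.

```javascript
// Sketch of a custom preProcess for a hypothetical DOI-based source.
// example.org, "my-doi-source" and entityDraft.doi are illustrative
// assumptions, not real Paperlib settings or a real endpoint.
function preProcess(entityDraft) {
  // Step 1: enable only if the draft has a DOI and the user switched this
  // downloader on in the preference window.
  const enable = entityDraft.doi !== "" && this.getEnable("my-doi-source");

  // Step 2: construct the query URL (the DOI is escaped because it contains "/").
  const queryUrl = `https://example.org/api/pdf?doi=${encodeURIComponent(entityDraft.doi)}`;

  const headers = { "user-agent": "Mozilla/5.0" };

  // Step 3: tell Paperlib what is about to happen.
  if (enable) {
    this.sharedState.set(
      "viewState.processInformation",
      "Downloading PDF from example.org ..."
    );
  }

  return { queryUrl, headers, enable };
}
```

The function is written with this so that it can be pasted into a downloader that provides getEnable and sharedState, as the built-in example does.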

queryProcess

    return queryUrl;

The queryProcess function is easy to understand: it requests the queryUrl to get the real download URL.

Here we directly return the queryUrl since it is the real download url.
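When the queryUrl is not itself the PDF link, queryProcess has to make a request first. The following is a minimal sketch under stated assumptions: the endpoint is assumed to answer with JSON containing a pdf_url field (a made-up response shape), and the extra fetchImpl parameter is a hypothetical addition that lets you swap in any fetch-compatible function (the global fetch of Node 18+ by default). It is not how the built-in downloaders are implemented.

```javascript
// Sketch of a queryProcess that looks up the real download URL.
// Assumptions: the server replies with JSON like { "pdf_url": "..." };
// fetchImpl is any fetch-compatible function (hypothetical extra parameter).
async function queryProcess(queryUrl, headers, entityDraft, fetchImpl = fetch) {
  const response = await fetchImpl(queryUrl, { headers });
  if (!response.ok) {
    // An empty string is falsy, so downloadImpl would return null for it.
    return "";
  }
  const body = await response.json();
  return typeof body.pdf_url === "string" ? body.pdf_url : "";
}
```

Returning an empty string on failure matches the `if (downloadUrl)` check in the default downloadImpl, so a failed lookup simply lets the next downloader try.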
