[Doc] update PDF gen (backport #54806) (#54816)
Co-authored-by: Dan Roscigno
mergify[bot] and DanRoscigno authored Jan 8, 2025
1 parent 6094dd9 commit 453ec3d
Showing 5 changed files with 35 additions and 73 deletions.
3 changes: 2 additions & 1 deletion docs/docusaurus/Dockerfile
@@ -2,8 +2,9 @@ FROM node:21

WORKDIR /app/docusaurus
ENV NODE_OPTIONS="--max-old-space-size=8192 --no-warnings=ExperimentalWarning"
ENV DISABLE_VERSIONING=true

RUN apt update && apt install -y neovim python3.11-venv ghostscript
RUN apt update && apt install -y neovim python3.11-venv ghostscript pdftk

EXPOSE 3000

33 changes: 19 additions & 14 deletions docs/docusaurus/PDF/README.md
@@ -1,3 +1,4 @@

# Generate PDFs from the StarRocks Docusaurus documentation site

Node.js code to:
@@ -115,43 +116,47 @@ node generatePdf.js http://0.0.0.0:3000/zh/docs/introduction/StarRocks_intro/

> Note:
>
> There are 900+ PDF files and more than 4,000 pages in total. Combining takes three hours on my laptop, just let it run. I am looking for a faster method to combine the files.
> Change the name of the PDF output file as needed, in the example this is `StarRocks_33`
```bash
source .venv/bin/activate
pdfcombine -y combine.yaml --title="StarRocks 3.3" -o ../../PDFoutput/StarRocks_3.3.pdf
cd ../../PDFoutput/
pdftk 00*pdf output StarRocks_33.pdf
```
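
The combine step above can also be driven from Node.js. The following is a minimal sketch, not part of this commit: `selectNumberedPdfs` and `buildPdftkArgs` are hypothetical helpers, and the sketch assumes the per-page files use the four-digit names (`0000.pdf`, `0001.pdf`, ...) that `requestPage` in `generatePdf.js` writes. It selects every file with a four-digit name, so files numbered `0100` and above are included as well.

```javascript
// Sketch only: selectNumberedPdfs and buildPdftkArgs are hypothetical helpers,
// they are not part of generatePdf.js.
const fs = require('node:fs');
const { execFileSync } = require('node:child_process');

// Keep only files named like 0000.pdf, 0001.pdf, ... and sort them numerically.
function selectNumberedPdfs(fileNames) {
  return fileNames
    .filter((name) => /^\d{4}\.pdf$/.test(name))
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10));
}

// Build the argument list in the same form as the shell command:
// pdftk <inputs...> output <combined file>
function buildPdftkArgs(fileNames, outputName) {
  return [...selectNumberedPdfs(fileNames), 'output', outputName];
}

// Usage (assumes pdftk is installed and the working directory is PDFoutput):
// execFileSync('pdftk', buildPdftkArgs(fs.readdirSync('.'), 'StarRocks_33.pdf'));
```

Filtering on the exact four-digit pattern keeps a previously combined file, such as `StarRocks_33.pdf`, out of the input list when the command is rerun.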

> Note:
>
> You may see this message during the `pdfcombine` step:
>
> `GPL Ghostscript 10.03.1: Missing glyph CID=93, glyph=005d in the font IAAAAA+Menlo-Regular . The output PDF may fail with some viewers.`
>
> I have not had any complaints about the missing glyph from readers of the documents produced with this.
## Finished file

The individual PDF files and the combined file will be on your local machine in `starrocks/docs/PDFoutput/`

## Customizing the docs site for PDF

Gotenberg generates the PDF files without the side navigation, header, and footer as these components are not displayed when the `media` is set to `print`. In our docs it does not make sense to have the edit URLs or Feedback widget show. These are filtered out using CSS by adding `display: none` to the classes of these objects when `@media print`.
Gotenberg generates the PDF files without the side navigation, header, and footer as these components are not displayed when the `media` is set to `print`. In our docs it does not make sense to have the breadcrumbs, edit URLs, or Feedback widget show. These are filtered out using CSS by adding `display: none` to the classes of these objects when `@media print`.

Removing the Feedback form from the PDF can be done with CSS. This snippet is added to the Docusaurus CSS file `src/css/custom.css`:

```css
/* When we generate PDF files we do not need to show the feedback widget. */
/* When we generate PDF files we do not need to show the:
- edit URL
- Feedback widget
- breadcrumbs
*/
@media print {
.feedback_Ak7m {
display: none;
}

.theme-doc-footer-edit-meta-row {
display: none;
};

.breadcrumbs {
display: none;
};
}
```

## Links

- [`docusaurus-prince-pdf`](https://github.com/signcl/docusaurus-prince-pdf)
- [`Gotenberg`](https://pptr.dev/)
- [`pdftk`](https://gitlab.com/pdftk-java/pdftk)
- [Ghostscript](https://www.ghostscript.com/)
- [`pdfcombine`](https://github.com/tdegeus/pdfcombine.git)
10 changes: 1 addition & 9 deletions docs/docusaurus/PDF/docker-compose.yaml
@@ -3,14 +3,10 @@ services:
gotenberg:
image: gotenberg/gotenberg
healthcheck:
test: ["CMD", "curl", "--silent", "--fail", "http://localhost:3000/health"]
test: ["CMD", "curl", "--silent", "--fail", "http://gotenberg:3000/health"]

docusaurus:
build: ../
environment:
- DISABLE_VERSIONING='true'
- PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
- PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser
ports:
- 3000:3000
volumes:
@@ -26,10 +22,6 @@ services:
timeout: 5s
entrypoint: >
/bin/bash -c "
cd PDF && yarn install &&
python3 -m venv .venv &&
source .venv/bin/activate &&
pip3 install pdfcombine &&
cd /app/docusaurus &&
npm install -g [email protected] &&
yarn install &&
54 changes: 5 additions & 49 deletions docs/docusaurus/PDF/generatePdf.js
@@ -5,21 +5,6 @@ const cheerio = require('cheerio');
const process = require('process');
const util = require('node:util');

async function getPageTitle(url) {
try {
const response = await axios.get(url);
const html = response.data;

const $ = cheerio.load(html);
const h1Text = $('h1').text();
if (h1Text !== "") { return h1Text; }
else { return "blank"; }

} catch (error) {
console.error('Error:', error);
}
}

function getUrls(url) {
var execSync = require('child_process').execSync;

@@ -29,7 +14,7 @@ function getUrls(url) {
let docusaurusUrl =
url.replace("localhost", "docusaurus").replace("0.0.0.0", "docusaurus");

var command = `npx docusaurus-prince-pdf --list-only -u ${docusaurusUrl} --file URLs.txt`
var command = `npx docusaurus-prince-pdf --list-only -u ${docusaurusUrl} --include-index --file URLs.txt`

try {
const {stdout, stderr} = execSync(command);
@@ -41,13 +26,15 @@ function getUrls(url) {

async function callGotenberg(docusaurusUrl, fileName) {

const path = require("path");
//const path = require("path");
const FormData = require("form-data");

try {
// Convert URL content to PDF using Gotenberg
const form = new FormData();
form.append('url', `${docusaurusUrl}`)
form.append('waitDelay', `3s`)
form.append('generateDocumentOutline', `true`)

const response = await axios.post(
"http://gotenberg:3000/forms/chromium/convert/url",
@@ -82,7 +69,7 @@ async function processLineByLine() {
});
console.log("Generating PDFs");
for await (const line of rl) {
// Each line in input.txt will be successively available here as `line`.
// Each line in URLs.txt will be successively available here as `line`.
//console.log(`URL: ${line}`);
await requestPage(line).then(resp => {
//console.log(`done.\n`);
@@ -95,48 +82,17 @@ async function processLineByLine() {

async function requestPage(url) {
const fileName = '../../PDFoutput/'.concat(String(i).padStart(4, '0')).concat('.', 'pdf');

// Get the details to write the YAML file
// We need title and filename
const pageTitle = await getPageTitle(url);
const cleanedTitle = pageTitle.replaceAll('\[', '').replaceAll('\]', '').replaceAll(':', '').replaceAll(' | StarRocks', '')
const pageDetails = ` - file: ${fileName}\n title: ${cleanedTitle}\n`;

fs.appendFile('./combine.yaml', pageDetails, err => {
if (err) {
console.error(err);
} else {
//console.log(`Title is ${pageTitle}`);
//console.log(`Filename is ` + fileName );
// file written successfully
}
});

await callGotenberg(url, fileName);
process.stdout.write(".");
i++;

}




function main() {
// startingUrl is the URL for the first page of the docs
// Get all of the URLs and write to URLs.txt
console.log("Crawling from %s", startingUrl);
getUrls(startingUrl);

const yamlHeader = 'files:\n';

fs.writeFile('./combine.yaml', yamlHeader, err => {
if (err) {
console.error(err);
} else {
// file written successfully
}
});

processLineByLine();
};

8 changes: 8 additions & 0 deletions docs/docusaurus/src/css/custom.css
@@ -95,4 +95,12 @@ a {
.feedback_Ak7m {
display: none;
}

.theme-doc-footer-edit-meta-row {
display: none;
};

.breadcrumbs {
display: none;
};
}
