Generating PDFs in software development: Using Puppeteer and Node.js

Danny Humphrey

24 February 2022 - 11 min read

Developing Software to Generate PDFs: The Background

In a recent project we had a requirement to add in the option to generate PDFs to an already extensive reporting tool built using React. All these reports have extensive filters and options that change what data is shown and how it is shown. When generating a PDF, we need to be able to maintain these report filters and options so a user can generate exactly what they have selected.

The Development Project's Tech Stack

For the API, the project uses Node.js and hapi with TypeScript. For the frontend, we are using React with TypeScript.

Why Puppeteer

Before settling on Puppeteer we did look at a few other options:

1.      HTML Canvas

At a basic level, this would create an image using the HTML Canvas that we could then save as a PDF on the client side. Some experiments with this looked promising. However, as most of these reports would span more than one page, getting around the formatting issues was deemed to be too much work for the moment.

2.      Use a third-party PDF generation application

We did look at a few third-party packages for generating PDFs. However, many of these had costs associated with them or needed their own markup, which would need to be maintained separately from the main report.

We decided not to pursue these options for now, as either the level of work required or the cost was too high at this early stage.

That left us with trying to use the browser's existing print-to-PDF function, along with print-specific CSS, to create a PDF. We could quickly verify that this works as expected locally. However, we found that we would get differing results across different browsers and versions, and we really wanted to give consistent results across all users and browsers. This is where Puppeteer comes in: it should allow us to navigate to a given page, generate a PDF report and download it to the client. Because the rendering always happens in the same browser on the server, the output no longer depends on the browser the user happens to have.

Puppeteer is a Node.js library with an easy-to-consume API that allows you to control a Chromium instance (headless by default) programmatically. Puppeteer is not the only package to give this ability; Microsoft also have their own offering called Playwright, which shares much of the same functionality as Puppeteer, whilst also giving the option of using other browsers (which makes it a really good candidate for automated testing). However, we decided to go with Puppeteer as we already had some experience with it on the team.

Prototype Development

Before we committed completely to building a PDF generation tool, we wanted to create a small proof of concept to prove that all the technologies would work together. For us, this entailed taking a branch of the source code, adding an API endpoint that would return a PDF, and updating the front end to call this endpoint and save the PDF to the client.

We decided it best that we build the prototype using a branch of the source code as we really wanted to test that Puppeteer would integrate nicely with the existing tech stack.

Software Development: Back-end

I’m not going to go through setting up a new Node.js/hapi API project, as there’s already plenty of documentation on that. Instead, we start by installing Puppeteer and adding a new API endpoint.

Install Puppeteer in the API project using the command ‘npm install puppeteer’.

Then add a new API endpoint such as the following:

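    server.route({
        // The HTTP method and the path were swapped in the original snippet.
        method: 'POST',
        path: '/generate-pdf',
        handler: async (request, h) => {
            // The handler body is filled in later in this post.
        }
    });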

I never like to have too much going on inside my API handlers, so I created a separate function that wraps around Puppeteer, like so:

import puppeteer from "puppeteer";

interface IGenerator {
  generate(content: string): Promise<Buffer>;
}

const pdfGenerator = (): IGenerator => {
  return {
    async generate(content: string): Promise<Buffer> {
      // Launch a headless browser for this request.
      const browser = await puppeteer.launch();

      try {
        // Load the supplied HTML into a new page.
        const page = await browser.newPage();
        await page.setContent(content);

        // Render the page to a PDF held in memory.
        return await page.pdf();
      } finally {
        // Always close the browser, even if rendering fails.
        await browser.close();
      }
    },
  };
};

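This function launches a new browser instance through Puppeteer, opens a new page and loads the content to render. In the version shown here, the page is populated from an HTML string using ‘page.setContent’; in our quick prototype we instead browsed to a specific web page, which was hard coded for the time being.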

To ensure our PDFs come out consistently and everything has loaded before we generate, we use ‘page.waitForNavigation({ waitUntil: 'networkidle2' })’, which completes when there have been no more than two active network connections for at least 500ms (see the Puppeteer documentation). If we were to wait for all network traffic to stop, we could end up in a situation where we never stop waiting for network calls to finish. For example, we might have something like Application Insights sending out telemetry, which could keep the wait from ever completing.

This also has the added benefit of returning the PDF in the shortest amount of time possible. The alternative of waiting for a fixed period, say 10 seconds, would slow down every request and still would not protect us against slow internet connections, which might make loading a page take longer than the specified wait time.

Once our page has fully loaded, we can generate a PDF using ‘page.pdf()’, close the browser and return the generated PDF as a Buffer (note that this does not save to the server’s file system; it’s all done in memory).
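Putting these steps together, a minimal sketch of navigating to a page, waiting for the network to settle and rendering a PDF might look like the following. The ‘PageLike’ and ‘BrowserLike’ types and the ‘renderUrlToPdf’ name are hypothetical, introduced here so the function can be exercised without a real browser; they are intended to mirror only the small part of the Puppeteer API used (‘newPage’, ‘goto’, ‘pdf’ and ‘close’). The A4 format and ‘printBackground’ options are assumptions rather than settings taken from our prototype.

```typescript
// Hypothetical structural types covering only the Puppeteer calls used here.
interface PageLike {
  goto(url: string, options?: { waitUntil?: "networkidle2" }): Promise<unknown>;
  pdf(options?: { format?: string; printBackground?: boolean }): Promise<Uint8Array>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Navigates to the URL, waits for the network to go quiet, then renders a PDF in memory.
async function renderUrlToPdf(
  launch: () => Promise<BrowserLike>,
  url: string
): Promise<Uint8Array> {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    // Resolves once there have been at most two open connections for 500ms.
    await page.goto(url, { waitUntil: "networkidle2" });
    // Assumed options: A4 paper and background colours included.
    return await page.pdf({ format: "A4", printBackground: true });
  } finally {
    // Close the browser whether or not rendering succeeded.
    await browser.close();
  }
}
```

With Puppeteer installed, this could be called as ‘renderUrlToPdf(() => puppeteer.launch(), url)’. Passing the launch function in, rather than importing Puppeteer directly, is purely a choice made for this sketch so that it can be tested with a fake browser.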

We then call the generator from the API endpoint like so:

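  // This sits inside the async start-up function that creates `server` (opening not shown).
  server.route({
    // The HTTP method and the path were swapped in the original snippet.
    method: "POST",
    path: "/generate-pdf",
    handler: async (request, h) => {
      const generator = pdfGenerator();
      const response = generator
        .generate(request.payload as string)
        .then((pdfResult) => {
          // Return the in-memory PDF with the appropriate headers.
          return h
            .response(pdfResult)
            .header("Content-Type", "application/pdf")
            .header("Content-Length", "" + pdfResult.length);
        });

      return response;
    },
  });

  await server.start();
  console.log("Server running on %s", server.info.uri);
};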

A few things are going on here. The generator returns a Promise that we consume inside the ‘then’ callback, where we create a new response containing the Buffer, set the Content-Type header to ‘application/pdf’ and set Content-Length based on the size of the returned PDF.

Normally we would add additional error handling to this and return an error code to the client but, seeing as this is a quick prototype, we’re going to leave this out for now.
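For illustration only, the kind of error handling we left out could be sketched roughly as follows. The ‘PdfOutcome’ type and ‘toPdfOutcome’ function are hypothetical names that are not part of our prototype, and the choice of a plain-text 500 response is an assumption.

```typescript
// Hypothetical shape describing what the route handler should send back.
interface PdfOutcome {
  statusCode: number;
  contentType: string;
  body: Uint8Array | string;
}

// Runs the supplied generator and converts any failure into a 500 outcome.
// A Node.js Buffer is a Uint8Array, so the generator's result fits this type.
async function toPdfOutcome(
  generate: () => Promise<Uint8Array>
): Promise<PdfOutcome> {
  try {
    const pdf = await generate();
    return { statusCode: 200, contentType: "application/pdf", body: pdf };
  } catch (err) {
    // Log the detail on the server; send only a generic message to the client.
    console.error("PDF generation failed", err);
    return {
      statusCode: 500,
      contentType: "text/plain",
      body: "PDF generation failed",
    };
  }
}

// Inside the hapi handler this could be used roughly as follows:
//   const outcome = await toPdfOutcome(() =>
//     generator.generate(request.payload as string));
//   return h.response(outcome.body)
//     .code(outcome.statusCode)
//     .type(outcome.contentType);
```

Keeping the success and failure mapping in one small function means the handler itself stays thin, in line with the approach of not having too much going on inside the API.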

Client Changes

For us to consume the API response, we need a way to save the result to a file. For this, we used a package called file-saver (installed with ‘npm install file-saver’). This contains a method called ‘saveAs’, which we can use to save the API result as a file and give it a name.

This looks like the following:

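import { post } from './request';
import { saveAs } from 'file-saver';

export const downloadPdf = async (
    fileName: string
): Promise<void> => {
    // Request the PDF from the API; the path must match the route the API exposes.
    const res = await post(`/reportpdf`);

    // Read the response body as raw bytes.
    const buffer = await res.arrayBuffer();

    // Wrap the bytes in a Blob and prompt the browser to save it.
    const blob = new Blob([buffer], { type: 'application/pdf' });
    saveAs(blob, fileName);
};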

Calling this from the UI is a simple matter of attaching it to the onClick callback of a button. Once all of this is wired up, we are given a new PDF of the Audacia home page.
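The client code does not check whether the request succeeded, so a failed request could end up being saved as a broken PDF. A minimal sketch of such a check is shown below. It assumes that the ‘post’ helper returns a fetch-style response exposing ‘ok’, ‘status’ and ‘arrayBuffer()’, which is an assumption since the helper itself is not shown in this post. The ‘ResponseLike’ type and ‘readPdfBytes’ name are hypothetical.

```typescript
// Hypothetical subset of a fetch-style response that this sketch relies on.
interface ResponseLike {
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Resolves with the PDF bytes, or throws if the API reported a failure.
async function readPdfBytes(
  request: Promise<ResponseLike>
): Promise<ArrayBuffer> {
  const res = await request;
  if (!res.ok) {
    throw new Error(`PDF request failed with status ${res.status}`);
  }
  return res.arrayBuffer();
}

// In downloadPdf this could be used roughly as follows:
//   const bytes = await readPdfBytes(post(`/reportpdf`));
//   saveAs(new Blob([bytes], { type: 'application/pdf' }), fileName);
```

The caller can then catch the error and show a message to the user instead of saving a file.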

With all of that working locally, we had one last hurdle to get over. The API is deployed through Docker, so we had to build the Docker image, run it locally and test it.

Running in Docker

We use the following Dockerfile to create our image:

FROM node:14.16-alpine3.12 as base
WORKDIR /usr/src/app

COPY . .
RUN npm install --no-package-lock
RUN rm .npmrc .dockerignore

FROM node:14.13.1-alpine3.12
COPY --from=base /usr/src/app /usr/src/app
WORKDIR /usr/src/app
RUN addgroup -S appgroup && adduser -S nodeuser -G appgroup
RUN chown -R nodeuser /usr/src/app    
ENV NODE_ENV=production
ENV HOST=0.0.0.0
EXPOSE 8081
USER nodeuser
CMD [ "npm", "start" ]

We can successfully build our Docker image using ‘docker build’ and then run it using ‘docker run’ from the CLI.

Once that’s running, we have a page with a single button on it to download a PDF. We immediately run into an issue: nothing generates and we can see a failing request. What’s going on? A quick search around turns up the explanation: running Puppeteer inside of Docker doesn’t quite work out of the box, as the base Docker image is missing some dependencies that it needs to run Chromium.

At a basic level, we need to update our Docker build to install Chromium separately and tell Puppeteer to use that installed version of Chromium. Based on that, we update our Dockerfile by adding the following two snippets.

The first tells Docker to install Chromium and other needed dependencies.

RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      harfbuzz \
      ca-certificates \
      ttf-freefont

The second tells Puppeteer not to download its own copy of Chromium.

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true

When added into the Dockerfile, this looks as follows:

FROM node:14.16-alpine3.12 as base
WORKDIR /usr/src/app
COPY . .

# Set before npm install so that Puppeteer's install step skips its Chromium download.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
RUN npm install --no-package-lock
RUN rm .npmrc .dockerignore

FROM node:14.13.1-alpine3.12
# Install Chromium along with the libraries and fonts it needs.
RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      harfbuzz \
      ca-certificates \
      ttf-freefont

COPY --from=base /usr/src/app /usr/src/app
WORKDIR /usr/src/app
RUN addgroup -S appgroup && adduser -S nodeuser -G appgroup
RUN chown -R nodeuser /usr/src/app
ENV NODE_ENV=production
ENV HOST=0.0.0.0
EXPOSE 8081
USER nodeuser
CMD [ "npm", "start" ]

We also need to update our Puppeteer code to use the installed version of Chromium, which we can do with options that we pass in when launching a new instance of Puppeteer.

We decided to wrap this up inside another function, as shown here:

const getStartUpArgs = () => {
    if (process.env.NODE_ENV === 'production') {
        return {
            args: [
                '--disable-dev-shm-usage',
                '--no-sandbox',
                '--disable-gpu',
            ],
            executablePath: '/usr/bin/chromium-browser',
        };
    }

    return { };
}

We also decided to make the start-up options depend on the environment you are running in, as we still wanted this to work outside of Docker when working locally.

And, when creating a new puppeteer instance, we simply call into that function like so.

const browser = await puppeteer.launch(getStartUpArgs());
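As a possible variation, the same decision can be written as a function that takes the environment as a parameter, which makes it straightforward to unit test. This is only a sketch: the ‘getStartUpArgsFor’ name, the ‘LaunchOptionsSketch’ type and the ‘CHROMIUM_PATH’ variable are hypothetical and are not something our prototype used.

```typescript
// Hypothetical subset of the Puppeteer launch options used in this post.
interface LaunchOptionsSketch {
  args?: string[];
  executablePath?: string;
}

// Returns the launch options for the given environment variables.
function getStartUpArgsFor(
  env: Record<string, string | undefined>
): LaunchOptionsSketch {
  // Outside production, fall back to Puppeteer's defaults.
  if (env.NODE_ENV !== "production") {
    return {};
  }

  return {
    args: ["--disable-dev-shm-usage", "--no-sandbox", "--disable-gpu"],
    // CHROMIUM_PATH is a hypothetical override; the default matches the path used above.
    executablePath: env.CHROMIUM_PATH ?? "/usr/bin/chromium-browser",
  };
}
```

In the API this would be called as ‘puppeteer.launch(getStartUpArgsFor(process.env))’.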

With the above changes, we can now build and start our Docker image as we did before, and we should now be able to generate a PDF using Puppeteer inside a Docker container.

Summary

In this post we have outlined the steps we took in rapidly prototyping a PDF generation tool. The results of this process allowed us to greatly reduce the risk involved in developing the tool and to make more informed decisions moving forward. The code in this document is by no means production-ready and was never intended to be; rather, it is a demonstration of a possible route going ahead.

Audacia is a software development company based in the UK, headquartered in Leeds. View more technical insights from our teams of consultants, business analysts, developers and testers on our technology insights blog.

Danny Humphrey was a Principal Consultant at Audacia from 2012-2022. He was responsible for delivering a variety of projects across many industries, from ticketing systems for telecommunications and manufacturing companies, to HMRC integration for self assessment tax tools and contract management for an international commodities trader.