Converting Office Docs to PDF with AWS Lambda

Madhav Palshikar
Sep 27, 2020

Nowadays almost every platform produces reports, and most of the time we need to convert those reports to PDF. I got a similar task a few days ago. I have done PDF conversion in the past as well, but this time I decided to explore a serverless approach.

There are multiple ways to implement this in AWS, with servers or serverless:
1. EC2 Instances
2. ECS Fargate
3. EKS
4. Step Functions
5. Lambda

We are going to explore Lambda. Lambda is an event-driven, serverless compute service that we can integrate with many other services such as S3, SNS, and DynamoDB.

Why Serverless?

As we know, conversion is a CPU-intensive process, and converting documents to PDF at scale is a common problem. With a serverless approach we don’t need to worry about scaling our resources; AWS scales them automatically. We are responsible only for our code, the required memory, and the execution time.

Our plan…

  1. The user uploads an Office document to the S3 bucket.
  2. The S3 bucket triggers the Lambda function with the uploaded file’s details.
  3. The Lambda function converts the document with LibreOffice.
  4. After conversion, the Lambda function uploads the PDF to the S3 bucket.
  5. After uploading, you can add code to update details in a database or call an API to inform your server.

But there is one problem

Lambda has a size limit on the code package: you can upload 50 MB zipped or 250 MB unzipped. For document conversion we need LibreOffice, which is about 85 MB compressed and about 300 MB once extracted, so it does not fit within that limit.
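The arithmetic above can be written down as a quick check. This is only a sketch using the figures quoted in this post; the constant and function names are made up for illustration.

```javascript
// Limits as quoted in this post, in megabytes.
const ZIPPED_LIMIT_MB = 50;    // zipped code package
const UNZIPPED_LIMIT_MB = 250; // unzipped code package

// Hypothetical helper: can something of this size ship as plain function code?
function fitsInPackage(zippedMb, unzippedMb) {
  return zippedMb <= ZIPPED_LIMIT_MB && unzippedMb <= UNZIPPED_LIMIT_MB;
}

// LibreOffice: about 85 MB compressed, about 300 MB extracted.
console.log(fitsInPackage(85, 300)); // false
```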

Lambda gives us 512 MB of space under /tmp by default. A Lambda function can also pull in additional code and content in the form of layers. A layer is a ZIP archive that contains libraries, a custom runtime, or other dependencies. So we will use a layer that ships LibreOffice as an 85 MB compressed archive. The archive is compressed with Brotli, so we need to extract it to /tmp before using it.
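A minimal sketch of the decompression step, using only Node's built-in `zlib` module (Brotli support needs Node.js 10.16 / 11.7 or later). The function name and the paths in the usage comment are assumptions for illustration, not taken from the layer's documentation. Inflating a `.tar.br` file only yields a tar archive, which still has to be unpacked afterwards.

```javascript
const fs = require('fs');
const zlib = require('zlib');
const { pipeline } = require('stream');
const { promisify } = require('util');

const pipelineAsync = promisify(pipeline);

// Stream a Brotli-compressed file to its decompressed form.
// Streaming avoids holding the whole archive in memory at once.
function inflateBrotli(srcPath, destPath) {
  return pipelineAsync(
    fs.createReadStream(srcPath),
    zlib.createBrotliDecompress(),
    fs.createWriteStream(destPath)
  );
}

// Usage inside a handler. The paths are assumptions: layers are mounted
// under /opt, but check where your layer actually places the archive.
//   await inflateBrotli('/opt/lo.tar.br', '/tmp/lo.tar');
```

Since the extracted files stay in /tmp while the execution environment is reused, a handler can check whether they already exist and skip this step on warm invocations.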

Let’s start setting up…

1. Create S3 Bucket ‘doc-conversion-test’

2. Create Lambda function

3. Add LibreOffice Layer to function

You can find the LibreOffice layer ARN here: https://github.com/shelfio/libreoffice-lambda-layer . Select the ARN for your region and add it.

4. Set Execution Timeout & Memory for our function:

You can decide how much memory and timeout you need depending on your average document size and the complexity of the document content. You can also restrict the size of documents that may be uploaded, which helps you reduce the timeout and memory allocation.

I have set the timeout to 20 seconds because our Lambda function performs the following tasks:

  1. Extract LibreOffice to /tmp
  2. Download Office Document from S3 bucket
  3. Convert the document to PDF
  4. Upload PDF to S3 Bucket

You can reduce or increase the timeout depending on your analysis. Lambda supports up to 15 minutes of execution time, but longer executions cost more, so try to keep your code optimized to reduce execution time.
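To see how timeout and memory interact, recall that Lambda bills compute by duration multiplied by configured memory (GB-seconds), plus a per-request charge. A rough sketch follows; the function names are made up, and no price is hard-coded because prices vary by region and change over time.

```javascript
// Billed GB-seconds for one invocation.
// memoryMb: configured function memory; durationMs: billed duration.
function gbSeconds(memoryMb, durationMs) {
  return (memoryMb / 1024) * (durationMs / 1000);
}

// Compute-only estimate; take pricePerGbSecond from the AWS pricing page.
function estimateComputeCost(memoryMb, durationMs, invocations, pricePerGbSecond) {
  return gbSeconds(memoryMb, durationMs) * invocations * pricePerGbSecond;
}

// A 20-second run at 1024 MB uses 20 GB-seconds; at 2048 MB it uses 40.
console.log(gbSeconds(1024, 20000)); // 20
console.log(gbSeconds(2048, 20000)); // 40
```

More memory also means more CPU on Lambda, so a higher memory setting can shorten the conversion; it is worth measuring a few settings rather than assuming the smallest one is cheapest.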

5. Setup S3 Bucket Trigger Event for Lambda Function

Go to your S3 bucket. Under the Properties tab you will find Events. Click on it and choose Add Notification.

Add a ‘Name’ for the notification and select the PUT event so that we get a notification once a file upload is completed. Add a ‘Prefix’ if you want, or leave it blank. Select Lambda Function from the ‘Send to’ dropdown box, then select our Lambda function name and save it.
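The same notification can also be set up from code. The sketch below only builds the parameter object in the shape that the AWS SDK's `putBucketNotificationConfiguration` call for S3 expects. The function name, the notification `Id`, the `uploads/` prefix, and the ARN are placeholders, and S3 also needs permission to invoke the function (the console adds that for you).

```javascript
// Build parameters for S3's putBucketNotificationConfiguration.
function buildNotificationParams(bucket, lambdaArn, prefix) {
  const config = {
    Id: 'convert-to-pdf',             // hypothetical notification name
    LambdaFunctionArn: lambdaArn,
    Events: ['s3:ObjectCreated:Put'], // the PUT event selected in the console
  };
  if (prefix) {
    // Only keys starting with this prefix will trigger the function.
    config.Filter = { Key: { FilterRules: [{ Name: 'prefix', Value: prefix }] } };
  }
  return {
    Bucket: bucket,
    NotificationConfiguration: { LambdaFunctionConfigurations: [config] },
  };
}

const params = buildNotificationParams(
  'doc-conversion-test',
  'arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME', // placeholder ARN
  'uploads/'
);
console.log(JSON.stringify(params, null, 2));
```

A prefix (or suffix) filter is worth considering when the PDF is written back to the same bucket, since otherwise the uploaded PDF raises another PUT event and invokes the function again.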

6. Let’s upload our code…

We are using Node.js 12.x as our runtime. We will install all required NPM packages locally and then zip them together with our code for uploading to Lambda.

Here we first extract LibreOffice to /tmp. After that we take the uploaded file’s name from the PUT event that the S3 bucket sends to trigger the Lambda function.

Now we download the object to /tmp with the AWS S3 library and execute the LibreOffice command that converts the document to PDF:
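Reading the file name out of the event might look like the sketch below. The `Records[0].s3.bucket.name` and `Records[0].s3.object.key` fields are part of the S3 event format; the helper names and the extension list are assumptions for illustration. Object keys arrive URL-encoded, with spaces as `+`, so they are decoded before use.

```javascript
const path = require('path');

// Extensions we are willing to convert (an illustrative list, adjust as needed).
const CONVERTIBLE = new Set(['.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.odt']);

// Pull the bucket name and the decoded object key out of an S3 event.
function parseS3Event(event) {
  const record = event.Records[0];
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
  return { bucket, key };
}

// Skip PDFs and anything else we cannot convert, so that the PDF we upload
// later is not picked up as a new input.
function isConvertible(key) {
  return CONVERTIBLE.has(path.extname(key).toLowerCase());
}

const sample = {
  Records: [{ s3: { bucket: { name: 'doc-conversion-test' }, object: { key: 'My+Report.docx' } } }],
};
const { bucket, key } = parseS3Event(sample);
console.log(bucket, key, isConvertible(key)); // doc-conversion-test My Report.docx true
```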

`export HOME=/tmp && /tmp/lo/instdir/program/soffice.bin --headless --norestore --invisible --nodefault --nofirststartwizard --nolockcheck --nologo --convert-to "pdf:writer_pdf_Export" --outdir /tmp /tmp/${s3fileName}`
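One way to run that command from Node.js is `child_process.execFileSync`, passing the arguments as an array so that file names containing spaces need no shell quoting. This is a sketch and not necessarily how the full script does it: the helper names are made up, and the `soffice.bin` path is simply the one from the command above.

```javascript
const path = require('path');
const { execFileSync } = require('child_process');

const SOFFICE = '/tmp/lo/instdir/program/soffice.bin'; // path from the command above

// Build the argument list for converting one file under /tmp to PDF.
function buildConvertArgs(fileName) {
  return [
    '--headless', '--norestore', '--invisible', '--nodefault',
    '--nofirststartwizard', '--nolockcheck', '--nologo',
    '--convert-to', 'pdf:writer_pdf_Export',
    '--outdir', '/tmp',
    path.posix.join('/tmp', fileName),
  ];
}

// LibreOffice keeps the base name and swaps the extension for .pdf.
function pdfPathFor(fileName) {
  return path.posix.join('/tmp', path.parse(fileName).name + '.pdf');
}

// Run the conversion and return the path of the generated PDF.
function convertToPdf(fileName) {
  execFileSync(SOFFICE, buildConvertArgs(fileName), {
    env: { ...process.env, HOME: '/tmp' }, // same effect as `export HOME=/tmp`
  });
  return pdfPathFor(fileName);
}

console.log(pdfPathFor('My Report.docx')); // /tmp/My Report.pdf
```

`HOME` is pointed at /tmp because that is the only writable location in the Lambda environment, and LibreOffice needs somewhere to write its profile.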

Finally, we upload the generated PDF back to the S3 bucket.
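A sketch of the download and upload steps, assuming the AWS SDK for JavaScript v2 (`aws-sdk`), whose `getObject` and `upload` calls return a request object with a `.promise()` method. The S3 client is passed in as a parameter so the helpers can be tried without AWS access; the function names are made up for illustration.

```javascript
const fs = require('fs');
const path = require('path');

// Download an object from S3 into a local directory and return the local path.
async function downloadToTmp(s3, bucket, key, tmpDir = '/tmp') {
  const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();
  const localPath = path.join(tmpDir, path.basename(key));
  fs.writeFileSync(localPath, Body);
  return localPath;
}

// Upload a generated PDF and return the key it was stored under.
async function uploadPdf(s3, bucket, pdfPath) {
  const key = path.basename(pdfPath);
  await s3.upload({
    Bucket: bucket,
    Key: key,
    Body: fs.readFileSync(pdfPath),
    ContentType: 'application/pdf',
  }).promise();
  return key;
}

// In the Lambda function (not executed here):
//   const AWS = require('aws-sdk');
//   const s3 = new AWS.S3();
//   const docPath = await downloadToTmp(s3, 'doc-conversion-test', key);
//   ...convert docPath to pdfPath...
//   await uploadPdf(s3, 'doc-conversion-test', pdfPath);
```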

You can download full script here…
https://gist.github.com/madhavpalshikar/96e72889c534443caefd89000b2e69b5

Now we are ready to test it. You can test it with a sample event JSON in the Lambda function console, or just upload a file to the S3 bucket. For debugging you can use CloudWatch Logs and X-Ray from the Monitoring tab.

You can explore further with the Serverless Framework or SAM for Lambda. And let me know if you find any other interesting solutions :)
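For the console test, a trimmed-down S3 PUT event is enough. The sketch below builds one. Only the fields a handler like this typically reads are included, the object key is a made-up example, and a real S3 event carries many more fields.

```javascript
// Build a minimal S3 PUT test event for the Lambda console.
function sampleS3PutEvent(bucket, key) {
  return {
    Records: [
      {
        eventSource: 'aws:s3',
        eventName: 'ObjectCreated:Put',
        s3: {
          bucket: { name: bucket },
          // S3 URL-encodes keys in events, with spaces sent as '+'.
          object: { key: encodeURIComponent(key).replace(/%20/g, '+') },
        },
      },
    ],
  };
}

// Paste the printed JSON into a test event in the console.
console.log(JSON.stringify(sampleS3PutEvent('doc-conversion-test', 'sample report.docx'), null, 2));
```

The object named in the test event has to exist in the bucket, because the function will try to download it.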
