Web automation and scraping have become essential tools for businesses and developers alike. Enter Puppeteer, a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers programmatically. Puppeteer excels at tasks such as web scraping, automated testing, and generating screenshots and PDFs of web pages.
Common use cases for Puppeteer range from data extraction and market research to automated testing of web applications and generating pre-rendered content for static websites.
Before we dive into combining Puppeteer with serverless architecture, let's briefly explore the benefits of serverless computing:
Combining Puppeteer with serverless architecture offers a powerful solution for web automation tasks. This approach allows you to run Puppeteer scripts on-demand without managing dedicated servers.
The Benefits Include:
Moreover, by utilizing layers in serverless platforms, we can enhance the reusability and modularity of our Puppeteer-based functions, making it easier to maintain and update our automation scripts.
To get started with serverless Puppeteer automation, follow these steps:
Step 1. Install Puppeteer
Step 2. Choose a serverless platform: Popular options include AWS Lambda, Azure Functions, and Google Cloud Functions. For this guide, we are going to use AWS Lambda.
Step 3. Set up the AWS CLI and configure your credentials.
Step 4. Create a new Lambda function and configure it to use the Node.js runtime.
Here's a basic example of a serverless function using Puppeteer to take a screenshot of a website:
import chromium from '@sparticuz/chromium';
import puppeteer from 'puppeteer-core';

export const handler = async (event) => {
  let browser = null;
  let result = null;

  try {
    // Launch the Chromium build that ships in the Lambda layer under /opt.
    browser = await puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath('/opt/nodejs/node_modules/@sparticuz/chromium/bin'),
      headless: chromium.headless,
      ignoreHTTPSErrors: true,
    });

    // Open the requested page, falling back to example.com when no URL is given.
    const page = await browser.newPage();
    await page.goto(event.url || 'https://example.com');

    // Capture the page as a base64-encoded PNG.
    const screenshot = await page.screenshot({ encoding: 'base64' });

    result = {
      statusCode: 200,
      headers: {
        'Content-Type': 'image/png',
      },
      body: screenshot,
      isBase64Encoded: true,
    };
  } catch (error) {
    console.error(error);
    result = {
      statusCode: 500,
      body: JSON.stringify({ error: error.message }),
    };
  } finally {
    // Always close the browser so the process is not left running.
    if (browser) {
      await browser.close();
    }
  }

  return result;
};
This function takes a URL as input, navigates to the web page, and returns a base64-encoded screenshot. Note that it uses @sparticuz/chromium for the Chromium binary, because the browser comes from a Chromium Lambda layer built on sparticuz/chromium and referenced by its ARN in AWS.
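Because the handler navigates to whatever URL arrives in the event, it can help to validate that input before the browser starts. The following is a minimal sketch, not part of the original function: `resolveTargetUrl` and `DEFAULT_URL` are hypothetical names, and restricting the protocol to http(s) is an assumption about what you want to allow.

```javascript
// Hypothetical helper (not in the original handler): choose the URL to
// capture from the event and reject anything that is not http(s).
const DEFAULT_URL = 'https://example.com';

function resolveTargetUrl(event) {
  const candidate = (event && event.url) || DEFAULT_URL;

  let parsed;
  try {
    parsed = new URL(candidate); // throws a TypeError on malformed input
  } catch {
    throw new Error(`Invalid URL: ${candidate}`);
  }

  // Assumption: only web pages should be captured, so block file:, data:, etc.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }

  return parsed.href;
}
```

Inside the handler, `await page.goto(resolveTargetUrl(event))` would replace the inline `event.url || 'https://example.com'` fallback. Any error thrown here lands in the existing `catch` block and is returned as a 500 response.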
Leveraging Layers for Modularity and Reusability
Layers in serverless computing let you package common code and dependencies once and share them across multiple functions. When working with Puppeteer on Lambda, layers can significantly simplify managing and deploying your functions, especially when several functions require Puppeteer.
To create a layer containing Puppeteer and its dependencies, run:
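# Node.js layers must keep packages under nodejs/node_modules,
# which Lambda exposes at /opt/nodejs/node_modules.
mkdir -p puppeteer-layer/nodejs && cd puppeteer-layer/nodejs
npm init -y
npm install puppeteer
cd .. && zip -r puppeteer-layer.zip nodejs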
Create the layer from this zip file in the Layers section of the AWS Lambda console, then attach it to the Lambda function.
For the Chromium layer, click Add a layer, choose Specify an ARN, paste the ARN below, then verify it and click Add.
Note: Adjust the region and the layer version in the ARN to match your configuration.
arn:aws:lambda:ap-south-1:764866452798:layer:chrome-aws-lambda:46
Ref: https://github.com/shelfio/chrome-aws-lambda-layer?tab=readme-ov-file
Then add the custom layer we created in the same way. For the custom layer to work with the Lambda function, the executable path must also start from /opt/, which is where Lambda mounts layer contents.
By using layers, you can keep your function code lean and easily update Puppeteer across all your functions by updating the layer.
If you now check the function's layers, you should see the two layers we added.
Running the browser in Lambda needs at least 2 GB of RAM and a timeout of about 3 minutes, because the function has to launch a browser and complete the automation; the default 3-second timeout won't be enough. To change these values, open the Configuration tab of the same Lambda function.
After the change, the configuration should show 2048 MB of memory and a 3-minute timeout.
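Even with a 3-minute timeout, a slow page can use up the whole budget and leave no time to close the browser. One way to guard against that is to derive the navigation timeout from the time the invocation has left. This is a sketch under assumptions: `navigationTimeoutMs` is a hypothetical helper, and the 10-second reserve and 1-second floor are illustrative values, not figures from this guide. It relies on `getRemainingTimeInMillis()`, which the Lambda Node.js runtime provides on the `context` object.

```javascript
// Hypothetical helper: compute how long page.goto may run, keeping a reserve
// for taking the screenshot and shutting the browser down cleanly.
function navigationTimeoutMs(context, reserveMs = 10_000) {
  // Milliseconds left before Lambda terminates this invocation.
  const remaining = context.getRemainingTimeInMillis();

  // Assumption: never go below one second, even when little time remains.
  return Math.max(remaining - reserveMs, 1_000);
}
```

To use it, the handler would accept the second argument, `async (event, context)`, and pass `{ timeout: navigationTimeoutMs(context) }` as the options to `page.goto`.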
After saving the changes, click the Deploy button and run the function with the Test button.
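A test event for this function only needs a `url` field, for example `{ "url": "https://example.com" }`. To confirm that the test run really produced an image, you can check the returned body for the PNG file signature. The sketch below is one way to do that; `isPngResponse` is a hypothetical helper, not something Lambda or Puppeteer provides.

```javascript
// Every PNG file starts with these eight bytes.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Hypothetical check: true only for a 200 response whose base64 body
// decodes to data that begins with the PNG signature.
function isPngResponse(response) {
  if (response.statusCode !== 200 || !response.isBase64Encoded) {
    return false;
  }
  const bytes = Buffer.from(response.body, 'base64');
  return bytes.subarray(0, 8).equals(PNG_SIGNATURE);
}
```

Pasting the JSON result of a test run into a local script and passing it to `isPngResponse` gives a quick pass/fail signal before wiring the function to an API.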
Serverless Puppeteer automation with layers provides a scalable and cost-effective solution for web scraping, testing, and other automation tasks. Whether you're scraping data, generating reports, or running automated tests, combining AWS Lambda with Puppeteer lets you build efficient web automation workflows without managing dedicated servers.
Ready to get started? Set up your first serverless Puppeteer function today and unlock the potential of scalable web automation!