
Using AWS Pandas Layer (AWS Wrangler) in Serverless Framework

When building machine learning models that run on AWS Lambda, managing dependencies becomes a critical challenge. During a recent project that required pandas for data processing in a Lambda-based ML pipeline, I encountered the common issue of Lambda deployment package size limits and version conflicts between numpy and other dependencies.

The Problem

Lambda functions have a deployment package size limit of 250 MB (unzipped). When building models that depend on pandas, you quickly hit this limit because:

  1. Pandas itself is approximately 40-50 MB
  2. Numpy (pandas dependency) adds another 30-40 MB
  3. Additional dependencies for data processing and ML operations compound the size
  4. Transitive dependencies often create version conflicts, particularly with numpy

In my case, the deployment package exceeded the limit, and I encountered the common numpy version error:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy c-extensions failed. This error can happen for
different reasons, often due to issues with your setup.

This error occurs when different packages require incompatible numpy versions, creating a dependency resolution conflict.
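To illustrate the kind of mismatch involved, here is a minimal sketch of version-compatibility checking. This is my own simplified heuristic (comparing major.minor components), not numpy's actual compatibility policy, which is more nuanced:

```python
def version_tuple(version):
    """Parse the leading numeric components of a dotted version string."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def roughly_compatible(installed, required):
    """Treat two versions as compatible when their major.minor
    components match (a simplified heuristic for illustration only)."""
    return version_tuple(installed)[:2] == version_tuple(required)[:2]
```

Under this heuristic, `roughly_compatible("1.26.4", "1.26.0")` holds while `roughly_compatible("1.24.0", "1.26.0")` does not, which is the shape of the conflict: two packages in the same environment each built against a different numpy.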

AWS SDK for pandas (formerly AWS Data Wrangler)

AWS SDK for pandas is an open-source Python library that extends pandas to work with AWS services. AWS provides pre-built Lambda layers containing this library along with its dependencies, including pandas, numpy, and pyarrow.

The layer is compiled and optimized for the Lambda execution environment, eliminating the need to package these dependencies in your deployment artifact.

Solution: Lambda Layers

Lambda layers allow you to package dependencies separately from your function code. The AWS-managed layer for AWS SDK for pandas provides:

  1. Pre-compiled dependencies optimized for Lambda
  2. Compatible numpy version that works across the stack
  3. Significant reduction in deployment package size
  4. Faster deployment times
  5. Consistent dependency versions across functions

Implementation in Serverless Framework

Here’s how to configure the AWS SDK for pandas layer in your serverless.yml:

service: ml-model-service

provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  memorySize: 1024
  timeout: 300

functions:
  processData:
    handler: src/handlers.process_data
    layers:
      - arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23
    events:
      - http:
          path: process
          method: post

  trainModel:
    handler: src/handlers.train_model
    layers:
      - arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23
    events:
      - s3:
          bucket: ml-training-data
          event: s3:ObjectCreated:*

package:
  individually: true
  patterns:
    - "!node_modules/**"
    - "!venv/**"
    - "!.git/**"
    - "!pandas/**"
    - "!numpy/**"

Layer ARN Structure

The layer ARN follows this format:

arn:aws:lambda:{region}:336392948345:layer:AWSSDKPandas-Python{version}:{layer-version}

The account ID 336392948345 is the AWS-managed account that publishes these layers.
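If you generate the ARN programmatically, for example in a deployment script, a small helper like this hypothetical one can assemble it from the format above:

```python
def pandas_layer_arn(region, python_version, layer_version):
    """Build an AWS SDK for pandas layer ARN from its documented format.
    Hypothetical helper; the account ID is the AWS-managed publisher."""
    account = "336392948345"
    py = python_version.replace(".", "")  # e.g. "3.11" -> "311"
    return (f"arn:aws:lambda:{region}:{account}:"
            f"layer:AWSSDKPandas-Python{py}:{layer_version}")
```

For example, `pandas_layer_arn("us-east-1", "3.11", 23)` reproduces the ARN used in the serverless.yml above.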

Usage in Lambda Function

Once configured, use pandas and numpy in your Lambda function without including them in your deployment package:

Note: Because the original code from my work is sensitive, I have replaced it with a simpler example.

import json
import pandas as pd
import numpy as np
import awswrangler as wr

def process_data(event, context):
    """
    Process data using pandas and AWS Wrangler
    """
    # Read data from S3
    df = wr.s3.read_parquet(path="s3://bucket/data.parquet")

    # Data processing with pandas
    df['feature_engineered'] = df['value'].apply(lambda x: np.log1p(x))
    df_grouped = df.groupby('category').agg({
        'value': ['mean', 'std'],
        'feature_engineered': 'sum'
    })

    # Write results back to S3
    wr.s3.to_parquet(
        df=df_grouped,
        path="s3://bucket/processed/",
        dataset=True
    )

    return {
        'statusCode': 200,
        'body': json.dumps({
            'rows_processed': len(df),
            'categories': len(df_grouped)
        })
    }

def train_model(event, context):
    """
    Train ML model using pandas for data preparation
    """
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Load training data
    df = wr.s3.read_csv(path=f"s3://{bucket}/{key}")

    # Model logic here
    # ...

    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'Model trained'})
    }

Resolving Numpy Version Conflicts

The AWS SDK for pandas layer includes a numpy version that is compiled specifically for the Lambda execution environment. This resolves the common version conflict issue because:

  1. The layer’s numpy is built against the correct C API version for the Lambda runtime
  2. All dependencies in the layer (pandas, pyarrow, etc.) are compiled against the same numpy version
  3. You don’t need to manage numpy versions in your requirements.txt

Your requirements.txt can now exclude these dependencies:

scikit-learn==1.3.0
joblib==1.3.2
# pandas - provided by layer
# numpy - provided by layer
# awswrangler - provided by layer
# pyarrow - provided by layer
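As a sanity check before packaging, a short script can strip layer-provided packages from a requirements list. This is a sketch; the set of layer-provided names is taken from the comments above:

```python
# Packages provided by the AWS SDK for pandas layer (per the comments above)
LAYER_PROVIDED = {"pandas", "numpy", "awswrangler", "pyarrow"}

def trim_requirements(lines):
    """Drop requirement lines that the layer already provides."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and comments
        # Take the package name before any version specifier
        name = stripped.split("==")[0].split(">=")[0].strip().lower()
        if name not in LAYER_PROVIDED:
            kept.append(stripped)
    return kept
```

Running it over the requirements above would keep only scikit-learn and joblib.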

Deployment Size Comparison

Without layer:

Deployment package size: ~200 MB (unzipped)
- Function code: 1 MB
- pandas: 45 MB
- numpy: 35 MB
- pyarrow: 40 MB

With layer:

Deployment package size: ~1 MB (unzipped)
- Function code: 1 MB
- pandas/numpy/pyarrow: 0 MB (in layer)

This reduction significantly improves deployment times and keeps the package well under Lambda limits.

Regional Availability

AWS publishes the AWS SDK for pandas layer in multiple regions. To find the correct ARN for your region:

aws lambda list-layer-versions \
  --layer-name AWSSDKPandas-Python311 \
  --region us-east-1 \
  --query 'LayerVersions[0].LayerVersionArn'

Or reference the official AWS GitHub repository: aws-sdk-pandas

Limitations and Considerations

  1. Layer size limit: Lambda layers have a 250 MB limit (unzipped). The AWS SDK for pandas layer is approximately 100 MB.
  2. Total layers limit: Each Lambda function can have up to 5 layers attached.
  3. Version management: AWS updates the layer periodically. Pin to specific versions in production to avoid unexpected changes.
  4. Custom packages: If you need packages not included in the layer, you must include them in your deployment package.
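To catch the size limit (point 1) before deploying, you can measure the unzipped size of a build directory with a short standard-library script. The build path in the comment is a placeholder for wherever your artifact is staged:

```python
import os

def unzipped_size_mb(root):
    """Sum all file sizes under a directory, in megabytes."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / (1024 * 1024)

# Example (path is hypothetical): warn when nearing the 250 MB unzipped limit
# if unzipped_size_mb("build/") > 240:
#     print("Warning: deployment package approaching Lambda's 250 MB limit")
```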

Monitoring and Troubleshooting

Verify that the layer is loaded correctly by adding logging:

import sys
import pandas as pd

def handler(event, context):
    print(f"Pandas version: {pd.__version__}")
    print(f"Pandas location: {pd.__file__}")
    print(f"Python path: {sys.path}")

    # Your logic here

The layer path should appear in the Python path as /opt/python/.
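Building on this, a hypothetical helper can report whether a given module was loaded from a layer rather than from the deployment package:

```python
def is_from_layer(module):
    """Return True when a module was loaded from the /opt/python layer
    mount rather than the deployment package (hypothetical helper)."""
    path = getattr(module, "__file__", "") or ""
    return path.startswith("/opt/python")
```

Inside a Lambda function using the layer, `is_from_layer(pd)` should be True; for modules bundled with your own code it will be False.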

Conclusion

Using the AWS SDK for pandas Lambda layer in Serverless Framework deployments solves multiple problems:

  1. Reduces deployment package size by excluding heavy dependencies
  2. Resolves numpy version conflicts through pre-compiled, compatible binaries
  3. Accelerates deployment times
  4. Provides consistent dependency versions across functions

For ML workloads on Lambda that require pandas and numpy, utilizing this managed layer is a practical solution to dependency management challenges. The configuration is minimal, and the benefits are immediate.

The layer ARN arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23 provides pandas 2.x, numpy 1.x, and awswrangler optimized for Python 3.11 Lambda functions in the us-east-1 region.

