When building machine learning models that run on AWS Lambda, managing dependencies becomes a critical challenge. During a recent project that required pandas for data processing in a Lambda-based ML pipeline, I encountered the common issue of Lambda deployment package size limits and version conflicts between numpy and other dependencies.
The Problem
Lambda functions have a deployment package size limit of 250 MB (unzipped). When building models that depend on pandas, you quickly hit this limit because:
- Pandas itself is approximately 40-50 MB
- Numpy (pandas dependency) adds another 30-40 MB
- Additional dependencies for data processing and ML operations compound the size
- Transitive dependencies often create version conflicts, particularly with numpy
In my case, the deployment package exceeded the limit, and I encountered the common numpy version error:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy c-extensions failed. This error can happen for
different reasons, often due to issues with your setup.
This error occurs when packages in the bundle were compiled against incompatible numpy versions (or built for a platform other than Lambda's Amazon Linux environment), so numpy's C extensions fail to import at runtime.
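A quick way to reproduce the size problem locally is to install your pinned dependencies into a target directory (the flat layout Lambda expects) and measure them; a sketch, assuming a requirements.txt that pins pandas:

# Install dependencies into a local build folder and measure them
pip install -r requirements.txt --target build/
du -sh build/ build/pandas build/numpy

# Check which numpy actually resolves; conflicting pins surface here
python -c "import sys; sys.path.insert(0, 'build'); import numpy; print(numpy.__version__)"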
AWS SDK for pandas (formerly AWS Data Wrangler)
AWS SDK for pandas is an open-source Python library that extends pandas to work with AWS services. AWS provides pre-built Lambda layers containing this library along with its dependencies, including:
- pandas
- numpy
- pyarrow
- awswrangler
- boto3
- Other AWS-related Python packages
The layer is compiled and optimized for the Lambda execution environment, eliminating the need to package these dependencies in your deployment artifact.
Solution: Lambda Layers
Lambda layers allow you to package dependencies separately from your function code. The AWS-managed layer for AWS SDK for pandas provides:
- Pre-compiled dependencies optimized for Lambda
- Compatible numpy version that works across the stack
- Significant reduction in deployment package size
- Faster deployment times
- Consistent dependency versions across functions
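Outside of Serverless Framework, the same layer can be attached to an existing function with a single AWS CLI call; a sketch, where my-function is a placeholder name:

aws lambda update-function-configuration \
  --function-name my-function \
  --layers arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23

Note that --layers replaces the function's entire layer list, so include any other layers the function already uses.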
Implementation in Serverless Framework
Here’s how to configure the AWS SDK for pandas layer in your serverless.yml:
service: ml-model-service

provider:
  name: aws
  runtime: python3.11
  region: us-east-1
  memorySize: 1024
  timeout: 300

functions:
  processData:
    handler: src/handlers.process_data
    layers:
      - arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23
    events:
      - http:
          path: process
          method: post
  trainModel:
    handler: src/handlers.train_model
    layers:
      - arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23
    events:
      - s3:
          bucket: ml-training-data
          event: s3:ObjectCreated:*
package:
  individually: true
  patterns:
    - "!node_modules/**"
    - "!venv/**"
    - "!.git/**"
    - "!pandas/**"
    - "!numpy/**"
Layer ARN Structure
The layer ARN follows this format:
arn:aws:lambda:{region}:336392948345:layer:AWSSDKPandas-Python{version}:{layer-version}
- {region}: the AWS region (e.g., us-east-1)
- Python{version}: the Python runtime (e.g., Python311 for Python 3.11)
- {layer-version}: the layer version number (e.g., 23)

The account ID 336392948345 is the AWS-managed account that publishes these layers.
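If you deploy across several regions or runtimes, a tiny helper keeps the ARN construction in one place; an illustrative sketch (the function name is mine, not part of any SDK):

def pandas_layer_arn(region: str, python_version: str, layer_version: int) -> str:
    """Build the ARN of the AWS-managed AWS SDK for pandas layer."""
    return (
        f"arn:aws:lambda:{region}:336392948345:"
        f"layer:AWSSDKPandas-Python{python_version}:{layer_version}"
    )

# pandas_layer_arn("us-east-1", "311", 23)
# -> "arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23"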
Usage in Lambda Function
Once configured, use pandas and numpy in your Lambda function without including them in your deployment package:
Note: due to the sensitive nature of my actual work, I've replaced the original code with a simpler example.
import json
import pandas as pd
import numpy as np
import awswrangler as wr

def process_data(event, context):
    """
    Process data using pandas and AWS Wrangler
    """
    # Read data from S3
    df = wr.s3.read_parquet(path="s3://bucket/data.parquet")

    # Data processing with pandas
    df['feature_engineered'] = df['value'].apply(lambda x: np.log1p(x))
    df_grouped = df.groupby('category').agg({
        'value': ['mean', 'std'],
        'feature_engineered': 'sum'
    })

    # Flatten the MultiIndex columns produced by agg() (parquet requires
    # string column names) and keep 'category' as a regular column
    df_grouped.columns = ['_'.join(col) for col in df_grouped.columns]
    df_grouped = df_grouped.reset_index()

    # Write results back to S3
    wr.s3.to_parquet(
        df=df_grouped,
        path="s3://bucket/processed/",
        dataset=True
    )

    return {
        'statusCode': 200,
        'body': json.dumps({
            'rows_processed': len(df),
            'categories': len(df_grouped)
        })
    }
def train_model(event, context):
    """
    Train ML model using pandas for data preparation
    """
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Load training data
    df = wr.s3.read_csv(path=f"s3://{bucket}/{key}")

    # Model logic here
    # ...

    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'Model trained'})
    }
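For a quick local smoke test of the S3-triggered handler, you can hand-craft a minimal event; a sketch, assuming valid AWS credentials and that the referenced bucket and object actually exist:

if __name__ == "__main__":
    # Minimal S3 put event; bucket and key are placeholders
    event = {
        "Records": [{
            "s3": {
                "bucket": {"name": "ml-training-data"},
                "object": {"key": "train.csv"}
            }
        }]
    }
    print(train_model(event, None))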
Resolving Numpy Version Conflicts
The AWS SDK for pandas layer includes a numpy version that is compiled specifically for the Lambda execution environment. This resolves the common version conflict issue because:
- The layer’s numpy is built against the correct C API version for the Lambda runtime
- All dependencies in the layer (pandas, pyarrow, etc.) are compiled against the same numpy version
- You don’t need to manage numpy versions in your requirements.txt
Your requirements.txt can now exclude these dependencies:
scikit-learn==1.3.0
joblib==1.3.2
# pandas - provided by layer
# numpy - provided by layer
# awswrangler - provided by layer
# pyarrow - provided by layer
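One caveat: pip will reinstall numpy anyway as a transitive dependency of scikit-learn. A simple way to keep the bundle clean is to prune the layer-provided packages after installing, mirroring the serverless.yml patterns above (a sketch):

pip install -r requirements.txt --target build/
# Remove packages the layer already provides so the layer's copies are used
rm -rf build/pandas* build/numpy* build/pyarrow* build/awswrangler*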
Deployment Size Comparison
Without layer:
Deployment package size: ~200 MB (unzipped)
- Function code: 1 MB
- pandas: 45 MB
- numpy: 35 MB
- pyarrow: 40 MB
With layer:
Deployment package size: ~1 MB (unzipped)
- Function code: 1 MB
- pandas/numpy/pyarrow: 0 MB (in layer)
This reduction significantly improves deployment times and keeps the package well under Lambda limits.
Regional Availability
AWS publishes the AWS SDK for pandas layer in multiple regions. To find the latest version for your region, query the layer by its full ARN (a bare layer name would search your own account instead of the AWS-managed one):

aws lambda list-layer-versions \
  --layer-name arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311 \
  --region us-east-1 \
  --query 'LayerVersions[0].LayerVersionArn'
Or reference the official AWS GitHub repository: aws-sdk-pandas
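To confirm that a pinned layer version exists in every region you deploy to, a short loop over get-layer-version-by-arn works; a sketch, with an illustrative region list:

for region in us-east-1 eu-west-1 ap-southeast-2; do
  aws lambda get-layer-version-by-arn \
    --region "$region" \
    --arn "arn:aws:lambda:$region:336392948345:layer:AWSSDKPandas-Python311:23" \
    --query 'LayerVersionArn' --output text
done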
Limitations and Considerations
- Layer size limit: the 250 MB (unzipped) limit applies to the function and all attached layers combined, not to each layer individually. The AWS SDK for pandas layer is approximately 100 MB, so budget the remainder for your own code and packages.
- Total layers limit: Each Lambda function can have up to 5 layers attached.
- Version management: AWS updates the layer periodically. Pin to specific versions in production to avoid unexpected changes.
- Custom packages: If you need packages not included in the layer, you must include them in your deployment package.
Monitoring and Troubleshooting
Check the layer is loaded correctly by adding logging:
import sys
import pandas as pd

def handler(event, context):
    print(f"Pandas version: {pd.__version__}")
    print(f"Pandas location: {pd.__file__}")
    print(f"Python path: {sys.path}")
    # Your logic here
The layer path should appear in the Python path as /opt/python/.
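To fail fast when the layer is missing, or shadowed by a copy bundled into the deployment package, you can add a module-level assertion that runs once per cold start; a minimal sketch:

import pandas as pd

# Runs at import time (once per cold start); aborts init if pandas
# is not being loaded from the layer mounted at /opt/python
assert pd.__file__.startswith("/opt/python"), (
    f"pandas loaded from {pd.__file__}, expected the layer at /opt/python"
)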
Conclusion
Using the AWS SDK for pandas Lambda layer in Serverless Framework deployments solves multiple problems:
- Reduces deployment package size by excluding heavy dependencies
- Resolves numpy version conflicts through pre-compiled, compatible binaries
- Accelerates deployment times
- Provides consistent dependency versions across functions
For ML workloads on Lambda that require pandas and numpy, utilizing this managed layer is a practical solution to dependency management challenges. The configuration is minimal, and the benefits are immediate.
The layer ARN arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python311:23 provides pandas 2.x, numpy 1.x, and awswrangler optimized for Python 3.11 Lambda functions in the us-east-1 region.