Machine learning operations (MLOps) combines people, processes, and technology to bring ML use cases into production efficiently. Enterprises need MLOps platforms that provide reproducibility, robustness, and end-to-end observability across the ML lifecycle. These platforms often rely on multi-account setups and CI/CD practices, and they restrict direct user interaction with the environments to strengthen security.
Build a Secure MLOps Platform with Terraform and GitHub
Terraform’s infrastructure as code (IaC) approach is widely used to develop, deploy, and standardize AWS infrastructure, including in organizations that also operate multi-cloud environments. GitHub and GitHub Actions are popular choices for source repositories and CI/CD, respectively. This article shows how to implement an MLOps platform with Terraform, GitHub, and GitHub Actions that automates the deployment of ML use cases.
We’ll explore the necessary infrastructure and how to use custom templates, which include example repositories to help data scientists and ML engineers deploy ML services like endpoints or batch transform jobs via Terraform. The complete source code is available on GitHub.
Solution Overview
This MLOps architecture creates the resources needed for a complete training pipeline, registers models in the model registry, and deploys them to pre-production and production environments. The result is a standardized path from model development to deployment.
End users (data scientists or ML engineers) select the SageMaker project template that suits their use case. SageMaker Projects helps standardize developer environments and CI/CD systems. Deploying a project creates a private GitHub repository and CI/CD resources that data scientists can customize. Additional project-specific resources are also created, depending on the selected template.
Custom SageMaker Project Templates
SageMaker Projects deploys an associated template from a Service Catalog product to provision and manage infrastructure and resources, including source code repository integration. Currently, four custom SageMaker Projects templates are available:
- MLOps template for LLM training and evaluation: A one-account setup for large language models (LLMs). The template supports fine-tuning and evaluation.
- MLOps template for model building and training: A simple one-account SageMaker Pipelines setup. The template supports model training and evaluation.
- MLOps template for model building, training, and deployment: Trains models using SageMaker Pipelines and deploys them to pre-production and production accounts. The template supports real-time inference, batch inference pipelines, and bring-your-own-container (BYOC) workflows.
- MLOps template for promoting the full ML pipeline across environments: Demonstrates how to move a SageMaker pipeline across environments from development to production. The template supports a pipeline for batch inference.
Each SageMaker project template has associated GitHub repository templates:
- MLOps template for LLM training and evaluation: Associated with the LLM training repository.
- MLOps template for model building and training: Associated with the model training repository.
- MLOps template for model building, training, and deployment: Associated with the BYOC repository (optional), model training repository, and real-time inference repository or batch inference repository.
- MLOps template for promoting the full ML pipeline across environments: Associated with the pipeline promotion repository.
When a custom SageMaker project is deployed, the associated GitHub template repositories are cloned via a Lambda function called <prefix>_clone_repo_lambda, creating a new GitHub repository for the project.
Infrastructure Terraform Modules
The Terraform code, located under base-infrastructure/terraform, is structured with reusable modules used across different deployment environments. Their instantiation can be found under base-infrastructure/terraform/<environment>/main.tf. Key reusable modules include:
- KMS: Creates an AWS Key Management Service (AWS KMS) key for encryption.
- Lambda: Creates a Lambda function and associated CloudWatch log group.
- Networking: Creates a virtual private cloud (VPC), subnets, security groups, NAT gateway, internet gateway, route tables, routes, and VPC endpoints for secure communication.
- S3: Creates an S3 bucket for artifact storage.
- SageMaker: Creates a SageMaker Studio domain and SageMaker user profiles.
- SageMaker Roles: Creates AWS Identity and Access Management (IAM) roles for SageMaker Studio with appropriate permissions.
- Service Catalog: Creates Service Catalog products from CloudFormation templates.
Environment-specific resources are located directly under base-infrastructure/terraform/<environment>.
Prerequisites
Before deploying this MLOps platform, ensure you have completed these essential steps:
- Prepare AWS Accounts: Set up at least three AWS accounts for development, pre-production, and production environments. While you can test with a single account, a multi-account setup is strongly recommended for production use to maintain proper isolation and security boundaries.
- Create a GitHub Organization: Set up a GitHub organization that will host all your MLOps repositories and manage access control for your team.
- Install Required Tools: Ensure you have the AWS CLI, Terraform (version 1.0 or higher), and Git installed on your local machine.
- Configure AWS Credentials: Set up AWS credentials with appropriate permissions to create resources across your accounts.
- Generate a GitHub Personal Access Token: Create a GitHub personal access token with the repo and admin:org scopes for repository automation.
Step-by-Step Deployment Guide
Step 1: Configure Environment Variables
First, export the required environment variables for your deployment. Replace the placeholder values with your actual configuration:
# Set your environment name (dev, preprod, or prod)
export ENV=dev
# Set your GitHub organization name
export GITHUB_ORG=your-org-name
# Optional: Customize Terraform state bucket prefix
export TerraformStateBucketPrefix=terraform-state
# Optional: Customize Terraform state lock table name
export TerraformStateLockTableName=terraform-state-locks
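Step 2 interpolates these variables directly into the CloudFormation command, so an unset value would only surface later as a malformed stack name or an empty parameter. A minimal sketch of a pre-flight check follows. The helper name `require_vars` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of the original repository): fail early when a
# variable that later steps depend on is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_vars ENV GITHUB_ORG` before Step 2 returns a non-zero status and names each variable that still needs to be exported.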
Step 2: Deploy the Bootstrap Stack
The bootstrap stack creates the foundational resources needed for Terraform state management and GitHub integration. Run the following AWS CloudFormation command:
aws cloudformation create-stack \
  --stack-name mlops-bootstrap-${ENV} \
  --template-body file://bootstrap.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=Environment,ParameterValue=$ENV \
    ParameterKey=GitHubOrg,ParameterValue=$GITHUB_ORG \
    ParameterKey=OIDCProviderArn,ParameterValue="" \
    ParameterKey=TerraformStateBucketPrefix,ParameterValue=$TerraformStateBucketPrefix \
    ParameterKey=TerraformStateLockTableName,ParameterValue=$TerraformStateLockTableName
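The `create-stack` call returns as soon as CloudFormation accepts the request, not when the stack has finished creating. Because Terraform needs the state bucket and lock table to exist, it can help to wait for completion before moving on. The sketch below uses the AWS CLI waiter for that. The function name `wait_for_bootstrap` is hypothetical, and the stack name is assumed to match the one used in the command above.

```shell
# Hypothetical wrapper: block until the bootstrap stack from Step 2 has been
# created, then print its final status. Assumes ENV is exported as in Step 1.
wait_for_bootstrap() {
  stack_name="mlops-bootstrap-${ENV}"
  # The waiter polls CloudFormation and exits non-zero if creation fails.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].StackStatus' \
    --output text
}
```

A successful run prints the stack status reported by CloudFormation. If the waiter fails, the stack events in the CloudFormation console usually show which resource could not be created.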
Step 3: Configure Environment Settings
Create a configuration file for your environments with the following structure. Save this as environments.json:
{
  "dev": {
    "region": "us-east-1",
    "dev_account_number": "111111111111",
    "preprod_account_number": "222222222222",
    "prod_account_number": "333333333333"
  },
  "preprod": {
    "region": "us-east-1",
    "dev_account_number": "111111111111",
    "preprod_account_number": "222222222222",
    "prod_account_number": "333333333333"
  },
  "prod": {
    "region": "us-east-1",
    "dev_account_number": "111111111111",
    "preprod_account_number": "222222222222",
    "prod_account_number": "333333333333"
  }
}
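AWS account IDs are always 12 digits, and leading zeros are significant, which is why the file stores them as JSON strings rather than numbers. A mistyped ID tends to show up only as a confusing cross-account permission error later, so a quick check can save time. The sketch below assumes the layout shown above, with one key per line. The helper names `is_account_id` and `check_account_numbers` are hypothetical.

```shell
# Hypothetical helpers, not part of the repository.

# Succeeds only if the argument is exactly 12 decimal digits.
is_account_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#1}" -eq 12 ]
}

# Reports every malformed *_account_number value in the given file and
# returns non-zero if at least one is found. Assumes one key per line.
check_account_numbers() {
  sed -n 's/.*_account_number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | (
    bad=0
    while IFS= read -r value; do
      if ! is_account_id "$value"; then
        echo "Invalid account number: '$value'" >&2
        bad=1
      fi
    done
    exit "$bad"
  )
}
```

Running `check_account_numbers environments.json` before Step 4 prints nothing and returns zero when every account number is well formed.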
Step 4: Initialize and Deploy Terraform
Navigate to the Terraform directory for your environment, initialize the backend, review the plan, and then apply it:
cd base-infrastructure/terraform/${ENV}
terraform init
terraform plan
terraform apply
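When `terraform plan` and `terraform apply` run as separate commands, the apply step computes a fresh plan, which can differ from the one that was reviewed. A common alternative, sketched here, saves the plan to a file and applies exactly that file. The function name `plan_and_apply` and the default file name `tfplan` are illustrative choices, not something the repository prescribes.

```shell
# Hypothetical helper: plan to a file, ask for confirmation, then apply
# exactly the plan that was shown.
plan_and_apply() {
  plan_file="${1:-tfplan}"
  terraform init || return 1
  terraform plan -out="$plan_file" || return 1
  # A saved plan is applied without Terraform's own approval prompt, so ask
  # for confirmation here after the plan output has been reviewed.
  printf 'Apply %s? [y/N] ' "$plan_file"
  read -r answer
  [ "$answer" = "y" ] || { echo "Aborted."; return 1; }
  terraform apply "$plan_file"
}
```

Run it from base-infrastructure/terraform/${ENV}. Answering anything other than `y` leaves the infrastructure untouched.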
Step 5: Verify Deployment
After successful deployment, verify that all resources have been created:
- Check that the SageMaker Studio domain is active in the AWS Console
- Verify that GitHub repositories have been created in your organization
- Confirm that Service Catalog products are available
- Test that GitHub Actions workflows are configured correctly
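The AWS side of this checklist can also be checked from the command line. The sketch below covers the SageMaker Studio domain and the Service Catalog products only. It assumes credentials for the target account and Region are already configured and that the caller has Service Catalog administrator permissions. The function name `verify_deployment` is hypothetical.

```shell
# Hypothetical helper: list the resources the checklist above looks for.
verify_deployment() {
  echo "SageMaker Studio domains (expect status InService):"
  aws sagemaker list-domains \
    --query 'Domains[].[DomainName,Status]' \
    --output text || return 1

  echo "Service Catalog products:"
  aws servicecatalog search-products-as-admin \
    --query 'ProductViewDetails[].ProductViewSummary.Name' \
    --output text
}
```

A domain that is still being provisioned reports a status other than `InService`, so an early run may need to be repeated after a few minutes.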
Post-Deployment Configuration
Setting Up User Access: Add data scientists and ML engineers to your SageMaker Studio domain and assign appropriate IAM roles. Configure GitHub team permissions to control repository access.
Customizing Project Templates: Modify the SageMaker project templates in base-infrastructure/terraform/dev/sagemaker_templates/ to match your organization’s specific requirements and workflows.
Implementing Monitoring: Set up CloudWatch dashboards and alarms to monitor your ML pipelines, model endpoints, and infrastructure costs. Configure SNS topics for alerting on critical events.
Best Practices for Production
- Enable MFA: Require multi-factor authentication for all users accessing production environments.
- Implement Least Privilege: Grant only the minimum necessary permissions to each role and user.
- Enable Logging: Turn on CloudTrail, VPC Flow Logs, and S3 access logging for comprehensive audit trails.
- Regular Backups: Implement automated backups for critical data and model artifacts stored in S3.
- Cost Monitoring: Use AWS Cost Explorer and AWS Budgets to track spending across environments.
- Version Control: Always use tagged releases for production deployments and maintain detailed change logs.
Troubleshooting Common Issues
Issue: Terraform state lock timeout
Solution: Check if another process is holding the lock in DynamoDB. If necessary, manually release the lock after confirming no other operations are running.
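When Terraform cannot acquire the state lock, its error message includes the ID of the lock that is currently held. That ID is what `terraform force-unlock` expects. The wrapper below is a small sketch that refuses to run without an ID. The name `release_state_lock` is hypothetical, and the command must be run from the same environment directory whose backend holds the lock.

```shell
# Hypothetical wrapper around terraform force-unlock.
release_state_lock() {
  lock_id="${1:-}"
  if [ -z "$lock_id" ]; then
    echo "Usage: release_state_lock <LOCK_ID>" >&2
    return 2
  fi
  # force-unlock asks for confirmation unless -force is passed; keeping the
  # prompt is a deliberate safety net.
  terraform force-unlock "$lock_id"
}
```

Releasing a lock that another run still holds can corrupt the state, so confirm that no pipeline or teammate is mid-apply first.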
Issue: GitHub repository creation fails
Solution: Verify your GitHub personal access token has the correct permissions and hasn’t expired. Check Lambda function logs in CloudWatch for detailed error messages.
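To read the clone function's logs without opening the console, AWS CLI version 2 offers `aws logs tail`. The sketch below assumes the log group follows Lambda's default naming, /aws/lambda/<function name>, and that the function is named <prefix>_clone_repo_lambda as described earlier. Both helper names are hypothetical, and the actual log group name should be confirmed in your account.

```shell
# Assumption: the log group uses Lambda's default naming convention.
clone_lambda_log_group() {
  printf '/aws/lambda/%s_clone_repo_lambda\n' "$1"
}

# Hypothetical helper: print recent log events for the clone function.
# $1 is the deployment prefix, $2 an optional time window (default: 1h).
tail_clone_lambda_logs() {
  if [ -z "${1:-}" ]; then
    echo "Usage: tail_clone_lambda_logs <prefix> [since]" >&2
    return 2
  fi
  aws logs tail "$(clone_lambda_log_group "$1")" --since "${2:-1h}"
}
```

For example, `tail_clone_lambda_logs <prefix> 30m` shows the last 30 minutes of events, which is usually enough to see an authentication or permission error returned by GitHub.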
Issue: SageMaker project deployment fails
Solution: Ensure the Service Catalog products are properly shared across accounts and that IAM roles have the necessary permissions to create SageMaker resources.