What Is Amazon SageMaker Catalog Data Lineage?
Amazon SageMaker Catalog’s data lineage feature provides a clear view of data’s journey across AWS analytics services such as AWS Glue, Amazon EMR Serverless, and Amazon Redshift, eliminating the challenge of tracing data origins and transformations across different systems.
Prerequisites for Getting Started
Before using SageMaker Catalog, ensure the following prerequisites are met: an active AWS account with billing enabled, an AWS Identity and Access Management (IAM) user with administrator access or specific permissions, an S3 bucket (e.g., `datazone-{account_id}`), sufficient VPC capacity, and AWS IAM Identity Center setup.
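If the S3 bucket does not already exist, a minimal boto3 sketch along these lines can create it, following the `datazone-{account_id}` naming convention above (the region shown is an assumption; adjust to your own):

```python
import boto3

# Derive the bucket name from the current account ID, per the
# datazone-{account_id} convention described above.
account_id = boto3.client("sts").get_caller_identity()["Account"]
bucket_name = f"datazone-{account_id}"

# us-east-1 is an assumption; other regions require a
# CreateBucketConfiguration with a LocationConstraint.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=bucket_name)
```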
Step-by-Step Guide to Visualizing Data Lineage
The process of visualizing data lineage involves several steps:
Step 1: Configure SageMaker Unified Studio with AWS CloudFormation
Launch the `vpc-analytics-lineage-sus` stack using the CloudFormation template. Required parameters include:
- DatazoneS3Bucket
- DomainName
- EnvironmentName
- PrivateSubnet CIDRs
- ProjectName
- PublicSubnet CIDRs
- UsersList
- VpcCIDR
This configuration process takes approximately 20 minutes.
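The stack can also be launched programmatically. The sketch below uses boto3 with illustrative placeholder values; the template URL is hypothetical, and the exact subnet CIDR parameter keys should be taken from the template itself:

```python
import boto3

cfn = boto3.client("cloudformation")

# All parameter values below are illustrative placeholders.
cfn.create_stack(
    StackName="vpc-analytics-lineage-sus",
    # Hypothetical template location; point this at your copy of the template.
    TemplateURL="https://example-bucket.s3.amazonaws.com/vpc-analytics-lineage-sus.yaml",
    Parameters=[
        {"ParameterKey": "DatazoneS3Bucket", "ParameterValue": "datazone-111122223333"},
        {"ParameterKey": "DomainName", "ParameterValue": "lineage-domain"},
        {"ParameterKey": "EnvironmentName", "ParameterValue": "dev"},
        {"ParameterKey": "ProjectName", "ParameterValue": "lineage-project"},
        {"ParameterKey": "UsersList", "ParameterValue": "analyst1,analyst2"},
        {"ParameterKey": "VpcCIDR", "ParameterValue": "10.0.0.0/16"},
        # Add the private/public subnet CIDR parameters as named in the template.
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM resources
)
```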
Step 2: Prepare Your Data
Upload the sample data files (`attendance.csv` and `employees.csv`) to the designated S3 bucket (e.g., `s3://datazone-{account_id}/csv/`).
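As a sketch, the upload can be done with boto3 (the bucket name is a placeholder, and the files are assumed to be in the working directory):

```python
import boto3

s3 = boto3.client("s3")
bucket = "datazone-111122223333"  # replace with your datazone-{account_id} bucket

# Upload both sample files under the csv/ prefix used later by the Glue job.
for filename in ("attendance.csv", "employees.csv"):
    s3.upload_file(filename, bucket, f"csv/{filename}")
```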
Then, ingest the employee data from the `employees.csv` file into an Amazon RDS for MySQL table:
- Connect to the MySQL instance using the MySQL client
- Create a database named `employeedb`
- Create a table named `employee` within that database
The `employee` table schema includes the following columns:
- EmployeeID (text)
- Name (text)
- Department (text)
- Role (text)
- HireDate (text)
- Salary (decimal)
- PerformanceRating (integer)
- Shift (text)
- Location (text)
Populate the `employee` table with data from the `employees.csv` file.
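A minimal Python sketch of this ingestion step, assuming the PyMySQL client library and placeholder connection details (the DECIMAL precision is an assumption; the schema above only specifies a decimal type):

```python
import csv
import pymysql  # assumes the PyMySQL client library is installed

# Connection details are placeholders; use your RDS endpoint and credentials.
conn = pymysql.connect(host="your-rds-endpoint.rds.amazonaws.com",
                       user="admin", password="your-password")
try:
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS employeedb")
        cur.execute("USE employeedb")
        # Columns follow the schema listed above; DECIMAL(10, 2) is an assumption.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS employee (
                EmployeeID TEXT,
                Name TEXT,
                Department TEXT,
                Role TEXT,
                HireDate TEXT,
                Salary DECIMAL(10, 2),
                PerformanceRating INT,
                Shift TEXT,
                Location TEXT
            )
        """)
        # Load rows from the CSV file; assumes a header row matching the schema.
        with open("employees.csv", newline="") as f:
            rows = [tuple(r) for r in csv.reader(f)][1:]  # skip header
        cur.executemany(
            "INSERT INTO employee VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)",
            rows,
        )
    conn.commit()
finally:
    conn.close()
```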
Step 3: Capture Lineage from AWS Glue ETL Job
Create a new AWS Glue ETL job and enable the “Generate lineage events” option, providing the appropriate domain ID.
Use the provided Python code snippet in the AWS Glue ETL job script. The script performs the following actions:
- Reads employee data from the MySQL database using JDBC connection details extracted from a Glue connection named “connectionname”
- Reads attendance data from the S3 bucket
- Joins the two datasets on “EmployeeID”
- Writes the resulting joined data as a Parquet file to an S3 location
- Saves it as a table named “gluedbname.tablename” in the Glue Data Catalog
Confirm the job completes successfully.
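The original post supplies the full script; as a rough sketch under the placeholder names above (`connectionname`, `gluedbname.tablename`) and a hypothetical bucket, the job logic might look like the following. Note that the exact keys returned by `extract_jdbc_conf` (such as `fullUrl`) vary by Glue version:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Pull JDBC details from the Glue connection named "connectionname".
conn = glue_context.extract_jdbc_conf("connectionname")

# Read the employee table from MySQL over JDBC.
employees = (spark.read.format("jdbc")
             .option("url", conn["fullUrl"])
             .option("user", conn["user"])
             .option("password", conn["password"])
             .option("dbtable", "employeedb.employee")
             .load())

# Read the attendance CSV from S3 (bucket name is a placeholder).
attendance = (spark.read.option("header", "true")
              .csv("s3://datazone-111122223333/csv/attendance.csv"))

# Join on EmployeeID and write the result as Parquet, registered in the
# Glue Data Catalog as gluedbname.tablename.
joined = employees.join(attendance, on="EmployeeID", how="inner")
(joined.write.mode("overwrite")
 .format("parquet")
 .option("path", "s3://datazone-111122223333/joined/")
 .saveAsTable("gluedbname.tablename"))

job.commit()
```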
Step 4: Capture Lineage from AWS Glue Notebook
Create an AWS Glue notebook and use the provided code, ensuring to add the necessary Spark configuration to generate lineage.
The configuration includes specifying:
- OpenLineage Spark listener
- Amazon DataZone API transport type
- Domain ID
- AWS account ID
- Custom environment variables
- Glue job name
The Python code reads attendance data from an S3 bucket, writes it as a Parquet file to another S3 location, and saves it as a table named “gluedbname.tablename” in the Glue Data Catalog.
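A sketch of such a notebook follows. The first cell sets the Spark configuration via the `%%configure` magic; the listener and transport settings mirror the list above, while the exact conf key names for the account ID and job name (shown here as `spark.glue.accountId` and `spark.glue.JobName`) and the domain ID value are assumptions:

```
%%configure
{
  "--conf": "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=dzd_xxxxxxxx --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] --conf spark.glue.accountId=111122223333 --conf spark.glue.JobName=notebook-lineage-job"
}
```

A subsequent cell then performs the read and write (bucket and table names are placeholders):

```python
# Read the attendance CSV from S3 and persist it as a Parquet-backed
# table in the Glue Data Catalog.
attendance = (spark.read.option("header", "true")
              .csv("s3://datazone-111122223333/csv/attendance.csv"))

(attendance.write.mode("overwrite")
 .format("parquet")
 .option("path", "s3://datazone-111122223333/attendance-parquet/")
 .saveAsTable("gluedbname.tablename"))
```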
Step 5: Capture Lineage from Amazon Redshift
Create the `employee` and `absent` tables in Amazon Redshift.
Absent Table Schema
- employeeid (text)
- date (date)
- shiftstart (timestamp)
- shiftend (timestamp)
- absent (boolean)
- overtimehours (integer)
Employee Table Schema
- employeeid (text)
- name (text)
- department (text)
- role (text)
- hiredate (date)
- salary (double precision)
- performancerating (integer)
- shift (text)
- location (text)
Load data from the corresponding CSV files in Amazon S3 into these tables using the `COPY` command, specifying the IAM role for access. Finally, create a new table called `employeewithabsent` in Amazon Redshift by joining the `employee` and `absent` tables on the `EmployeeID` column.
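A hedged Python sketch of this step using the boto3 Redshift Data API (the workgroup, database, role ARN, and the mapping of `attendance.csv` to the `absent` table are all assumptions):

```python
import boto3

client = boto3.client("redshift-data")

# Target identifiers are placeholders; for a provisioned cluster, use
# ClusterIdentifier (plus DbUser or SecretArn) instead of WorkgroupName.
COMMON = {"Database": "dev", "WorkgroupName": "lineage-workgroup"}
IAM_ROLE = "arn:aws:iam::111122223333:role/RedshiftCopyRole"

statements = [
    # Table definitions follow the schemas listed above.
    """CREATE TABLE IF NOT EXISTS employee (
           employeeid VARCHAR, name VARCHAR, department VARCHAR, role VARCHAR,
           hiredate DATE, salary DOUBLE PRECISION, performancerating INT,
           shift VARCHAR, location VARCHAR)""",
    """CREATE TABLE IF NOT EXISTS absent (
           employeeid VARCHAR, date DATE, shiftstart TIMESTAMP,
           shiftend TIMESTAMP, absent BOOLEAN, overtimehours INT)""",
    # Load each table from its CSV in S3; attendance.csv feeding the
    # absent table is an assumption based on the column names.
    f"""COPY employee FROM 's3://datazone-111122223333/csv/employees.csv'
        IAM_ROLE '{IAM_ROLE}' CSV IGNOREHEADER 1""",
    f"""COPY absent FROM 's3://datazone-111122223333/csv/attendance.csv'
        IAM_ROLE '{IAM_ROLE}' CSV IGNOREHEADER 1""",
    # Join the two tables into employeewithabsent.
    """CREATE TABLE employeewithabsent AS
       SELECT e.*, a.date, a.shiftstart, a.shiftend, a.absent, a.overtimehours
       FROM employee e JOIN absent a ON e.employeeid = a.employeeid""",
]

# execute_statement is asynchronous; in practice, poll
# describe_statement(Id=...) before running a dependent statement.
for sql in statements:
    client.execute_statement(Sql=sql, **COMMON)
```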
Step 6: Capture Lineage from EMR Serverless Job
Create an EMR Serverless Spark application with the specified configuration, which includes enabling Iceberg support.
Submit a Spark application to generate lineage events. The Spark application:
- Reads employee data from the MySQL database via JDBC
- Reads attendance data from the Redshift database via JDBC
- Joins the dataframes on the EmployeeID column
- Writes the resulting joined dataframe as a Parquet file to an S3 location
- Registers it as a table in the Glue Data Catalog
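As a sketch of the Spark application under assumed endpoints, credentials, and output names (the original post provides the actual script; the Redshift JDBC driver JAR must be on the classpath):

```python
from pyspark.sql import SparkSession

# Spark session with Hive support so saveAsTable registers the result in
# the Glue Data Catalog; lineage conf is supplied at job submission.
spark = (SparkSession.builder
         .appName("emr-serverless-lineage")
         .enableHiveSupport()
         .getOrCreate())

# Endpoints, credentials, and table names below are placeholders.
employees = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://your-rds-endpoint:3306/employeedb")
             .option("driver", "com.mysql.cj.jdbc.Driver")
             .option("dbtable", "employee")
             .option("user", "admin")
             .option("password", "your-password")
             .load())

attendance = (spark.read.format("jdbc")
              .option("url", "jdbc:redshift://your-redshift-endpoint:5439/dev")
              .option("driver", "com.amazon.redshift.jdbc42.Driver")
              .option("dbtable", "absent")
              .option("user", "awsuser")
              .option("password", "your-password")
              .load())

# Join the dataframes on the EmployeeID column.
joined = employees.join(attendance,
                        employees["EmployeeID"] == attendance["employeeid"])

# Write as Parquet and register in the Glue Data Catalog
# (output path and table name are hypothetical).
(joined.write.mode("overwrite")
 .format("parquet")
 .option("path", "s3://datazone-111122223333/emr-joined/")
 .saveAsTable("gluedbname.emr_tablename"))
```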
The `aws emr-serverless start-job-run` command is used to submit the Spark job, specifying:
- Application ID
- Execution role ARN
- Job name
- Job driver details including the entry point script
- Spark configuration parameters for executor and driver memory
- AWS Glue Data Catalog configuration
- OpenLineage listener settings
- DataZone API transport details
- Necessary JAR files for MySQL connectivity
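An equivalent submission via boto3 might look like this sketch; the application ID, role ARN, script path, JAR path, and domain ID are placeholders, and the OpenLineage conf keys mirror the list above:

```python
import boto3

emr = boto3.client("emr-serverless")

response = emr.start_job_run(
    applicationId="00f1234567890abc",  # placeholder application ID
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    name="employee-lineage-job",
    jobDriver={
        "sparkSubmit": {
            # Entry point script uploaded to S3 (path is a placeholder).
            "entryPoint": "s3://datazone-111122223333/scripts/join_job.py",
            "sparkSubmitParameters": (
                # Executor/driver memory and Glue Data Catalog metastore.
                "--conf spark.executor.memory=4g "
                "--conf spark.driver.memory=4g "
                "--conf spark.hadoop.hive.metastore.client.factory.class="
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory "
                # OpenLineage listener and DataZone API transport.
                "--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener "
                "--conf spark.openlineage.transport.type=amazon_datazone_api "
                "--conf spark.openlineage.transport.domainId=dzd_xxxxxxxx "
                # MySQL connector JAR for the JDBC read.
                "--jars s3://datazone-111122223333/jars/mysql-connector-j-8.4.0.jar"
            ),
        }
    },
)
print(response["jobRunId"])
```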
Real-World Example: Employee Attendance Analysis
A practical example involves analyzing employee attendance data. SageMaker Catalog allows tracing raw data from an Amazon RDS for MySQL database and Amazon S3, transforming it through an AWS Glue ETL job, and ultimately creating a unified dataset in the Data Catalog.
Best Practices for Optimal Data Lineage
For optimal data lineage, consider these best practices:
- Automate lineage capture using SageMaker Catalog’s features
- Monitor data quality using lineage information
- Regularly update data sources to maintain accuracy
Next Steps: Explore and Learn More
Explore the Amazon SageMaker and Amazon SageMaker Catalog documentation for comprehensive details, and start visualizing your data’s journey to unlock new insights.