Guidance for Generating Rule Recommendations for Entity Resolution on AWS

Overview

This Guidance demonstrates an automated approach for generating rule recommendations to match, link, and enhance related records using AWS Entity Resolution rule-based matching. It showcases an AWS Glue notebook that streamlines the process of creating effective matching rules. The Guidance reads input data from Amazon S3, performs data quality analysis, and harnesses the power of a large language model (LLM) on Amazon Bedrock to produce customized rule recommendations. Each recommendation comes with accompanying reasoning, providing insights into the suggested rules. Furthermore, the Guidance implements a sampling approach to test the generated rules and resolve entities.

How it works

Overview

This architecture diagram shows an overview of how to generate rule recommendations using an LLM hosted on Amazon Bedrock and an AWS Glue notebook and how to use these rules in a rule-based matching workflow in AWS Entity Resolution.

Download the architecture diagram Overview Step 1
Load your input dataset (CSV/parquet) in an Amazon Simple Storage Service (Amazon S3) bucket and use an AWS Glue Crawler to create an AWS Glue table within the AWS Glue Data Catalog.
Step 2
Create a schema mapping in AWS Entity Resolution using the AWS Glue table as the source.
Step 3
Run the notebook in AWS Glue, which uses the AWS Entity Resolution schema mapping to understand the shape of the data. The notebook reads the data from Amazon S3 and generates data quality metrics. It feeds these metrics to an LLM hosted on Amazon Bedrock. The LLM recommends rules to apply to an AWS Entity Resolution matching workflow for resolving entities.
Step 4
The recommended rules generated by the AWS Glue notebook are used to create a rule-based matching workflow within AWS Entity Resolution.
Step 5
An AWS Step Functions workflow orchestrates the execution of the rule-based matching workflow to process the incremental source data.
Incremental rule-based workflow

This architecture diagram shows how to run an incremental rule-based matching workflow in AWS Entity Resolution using an AWS Step Functions workflow.

Download the architecture diagram Incremental rule-based workflow Step 1
Create a schedule in Amazon EventBridge to trigger Step Functions at a desired frequency.
Step 2
Step Functions triggers an AWS Glue extract, transform, load (ETL) job that pre-processes the incremental source data and prepares it for AWS Entity Resolution rule-based matching workflow.
Step 3
An AWS Lambda function triggers the rule-based matching workflow in AWS Entity Resolution. The workflow reads the incremental data from the source Amazon S3 bucket and processes it.
Step 4
The Lambda function checks the status of the matching workflow running in AWS Entity Resolution until the job status changes to Completed.
Step 5
Upon completion, the AWS Entity Resolution matching workflow writes the output to an S3 output bucket.
Step 6
The AWS Glue post-processing ETL job reads the output from AWS Entity Resolution and writes it to an Amazon S3 table. The Amazon S3 table is chosen as the destination because it supports Atomicity, Consistency, Isolation, Durability (ACID) transactions.
Step 7
The AWS Entity Resolution incremental matching workflow has the capability to merge or split records. Given this ability, a datastore that supports ACID transactions is an ideal choice to help ensure data integrity and consistency.

Deploy with confidence

Everything you need to launch this Guidance in your account is right here.

Let's make it happen

Ready to deploy? Review the sample code on GitHub for detailed deployment instructions to deploy as-is or customize to fit your needs.

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

AWS Glue is a managed service that runs workloads and provides monitoring metrics for jobs. It offers fault tolerance with support for retries in case of failures. AWS Glue Crawler automates the discovery of data schematics. These features create a scalable, fault-tolerant system that provides insights into runtime metrics of jobs.

Read the Operational Excellence whitepaper

Security

AWS Identity and Access Management (IAM) policies are scoped down to the minimum permissions required for services to function properly. Data stored in Amazon S3 uses encryption at rest. These measures limit unauthorized access to resources and protect data integrity. By implementing tight access controls and encrypting data at rest, the Guidance enhances overall security posture and helps meet compliance requirements.

Read the Security whitepaper

Reliability

As managed services, AWS Glue, AWS Entity Resolution, Amazon Bedrock, and Step Functions reduce the operational burden of maintaining reliability, allowing the system to recover from failures automatically. These services support retries for recovery from failures and integrate with Amazon CloudWatch to provide operational insights.

Read the Reliability whitepaper

Performance Efficiency

AWS Glue offers a serverless architecture that scales compute resources up or down based on workload demands. It provides different instance types for users to choose based on their specific workload requirements. AWS Glue connects with other AWS services through AWS networking services and can run within a virtual private cloud (VPC). This flexibility in resource selection and automatic scaling helps ensure that the system can efficiently handle varying workload intensities.

Read the Performance Efficiency whitepaper

Cost Optimization

This Guidance uses managed services that follow a pay-as-you-go pricing model, meaning you only pay for the resources you use. AWS Glue is serverless, providing scaling capabilities that help optimize costs. AWS Entity Resolution charges based on the volume of ingested data. Amazon S3 costs depend on data storage and access patterns. Step Functions charges based on the number of state transitions. This usage-based pricing across services helps ensure that costs align closely with actual resource consumption.

Read the Cost Optimization whitepaper

Sustainability

As a serverless service, AWS Glue only consumes resources when actively processing data. It offers features like data partitioning and compression, which reduce storage and compute resource requirements for data processing pipelines. AWS Glue offers automatic scaling based on workload helps optimize resource utilization and reduce energy consumption.

Read the Sustainability whitepaper