

# Validating data quality in AWS Glue DataBrew

To ensure the quality of your datasets, you can define a list of data quality rules in a ruleset. A *ruleset* is a set of rules that compare different data metrics against expected values. If any of a rule's criteria isn't met, the ruleset as a whole fails validation. You can then inspect individual results for each rule. For any rule that causes a validation failure, you can make the necessary corrections and revalidate.

Examples of rules include the following: 
+ Value in column `"APY"` is between 0 and 100
+ Number of missing values in column `group_name` doesn't exceed 5%

You can define each rule for an individual column or independently apply it to several selected columns, for example:
+ Max value doesn’t exceed 100 for columns `"rate"`, `"pay"`, `"increase"`.

A rule can consist of multiple simple checks. You can define whether all of the checks must be true or whether any one of them being true is enough, for example:
+ Value in column `"ProductId"` should start with `"asin-"` AND length of value in column `"ProductId"` is 32. 

You can verify rules against either aggregate values such as `max`, `min`, or `number of duplicate values` where there is only one value being compared, or nonaggregate values in each row of a column. In the latter case, you can also define a "passing" threshold such as `value in columnA > value in columnB for at least 95% of rows`.
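As an illustration of the last example, the rule `value in columnA > value in columnB for at least 95% of rows` might be expressed as a DataBrew `Rule` structure roughly as follows. This is a hedged sketch based on the public `Rule` API shape; the column names are placeholders:

```python
# Hypothetical sketch of a DataBrew Rule expressing
# "value in columnA > value in columnB for at least 95% of rows".
# Column names are illustrative placeholders.
rule = {
    "Name": "columnA greater than columnB",
    # Explicit expression: both columns are named in the expression itself.
    "CheckExpression": ":col1 > :col2",
    "SubstitutionMap": {
        ":col1": "`columnA`",  # column names are backtick-quoted in substitution maps
        ":col2": "`columnB`",
    },
    # The "passing" threshold: at least 95% of rows must satisfy the check.
    "Threshold": {
        "Value": 95,
        "Type": "GREATER_THAN_OR_EQUAL",
        "Unit": "PERCENTAGE",
    },
}
```

With `"Unit": "ROWS_COUNT"`-style count units or different threshold types, the same structure can express limits such as "no more than 10 failing rows"; see the DataBrew API reference for the exact enum values.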

As with profile information, you can define column-level data quality rules only for columns of simple types, such as strings and numbers. You can't define data quality rules for columns of complex types, such as arrays or structures. For more details about working with profile information, see [Creating and working with AWS Glue DataBrew profile jobs](jobs.profile.md).

## Validating data quality rules

After a ruleset is defined, you can add it to a profile job for validation. You can define more than one ruleset for a dataset. 

For example, one ruleset might contain rules with minimally acceptable criteria. A validation failure for that ruleset might mean that the data isn't acceptable for further use. An example is missing values in key columns of a dataset used for machine learning training. You can use a second ruleset with stricter rules to verify whether the dataset has such good quality that no cleanup is required. 

You can apply one or more rulesets defined for a given dataset in a profile job configuration. When the profile job runs, it produces a validation report in addition to the data profile. The validation report is available at the same location as your profile data. As with profile information, you can explore the results in the DataBrew console. In the **Dataset details** view, choose the **Data Quality** tab to view the results. For more details about working with profile information, see [Creating and working with AWS Glue DataBrew profile jobs](jobs.profile.md). 

## Acting on validation results

When a DataBrew profile job completes, DataBrew sends an Amazon CloudWatch event with the details of that job run. If you also configured your job to validate data quality rules, DataBrew sends an event for each validated ruleset. The event contains the validation result (`SUCCEEDED`, `FAILED`, or `ERROR`) and a link to the detailed data quality validation report. You can then automate a follow-up action based on the validation status. For more information on connecting events to target actions, such as Amazon SNS notifications and AWS Lambda function invocations, see [Getting started with Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eventbridge-getting-set-up.html). 

The following is an example of a DataBrew ruleset validation result event:

```
{
  "version": "0",
  "id": "fb27348b-112d-e7c2-560d-85e7c2c09964",
  "detail-type": "DataBrew Ruleset Validation Result",   
  "source": "aws.databrew",                     
  "account": "123456789012",          
  "time": "2021-11-18T13:15:46Z", 
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "datasetName": "MyDataset", 
    "jobName": "MyProfileJob",
    "jobRunId": "db_f07954d20d083de0c1fc1eee11498d8635ee5be4ca416af27d33933e91ff4e6e",
    "rulesetName": "MyRuleset",
    "validationState": "FAILED",
    "validationReportLocation": "s3://MyBucket/MyKey/MyDataset_f07954d20d083de0c1fc1eee11498d8635ee5be4ca416af27d33933e91ff4e6e_dq-validation-report.json"
  }
}
```
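A Lambda function targeted at such events might inspect the validation state and surface the report location for follow-up. The following is a minimal sketch, not a definitive handler; the returned structure is purely illustrative:

```python
def handler(event, context):
    """Hypothetical Lambda handler for DataBrew ruleset validation events.

    Reads the validation state from the event detail and, for failed or
    errored validations, returns the ruleset name and the S3 location of
    the detailed validation report so a downstream step can act on it.
    """
    detail = event.get("detail", {})
    state = detail.get("validationState")
    if state in ("FAILED", "ERROR"):
        # In practice you might notify a team channel or open a ticket here.
        return {
            "action": "alert",
            "ruleset": detail.get("rulesetName"),
            "report": detail.get("validationReportLocation"),
        }
    return {"action": "none", "ruleset": detail.get("rulesetName")}
```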

You can use attributes of events such as `detail-type` and `source`, along with nested properties of the `detail` attribute, to [create event patterns](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns.html#eb-create-pattern) in Amazon EventBridge. For example, an event pattern that matches all failed validations from any DataBrew job looks like this:

```
{
  "source": ["aws.databrew"],
  "detail-type": ["DataBrew Ruleset Validation Result"],
  "detail": {
   "validationState": ["FAILED"]
  }
}
```
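You can register such a pattern programmatically. The following is a hedged sketch using the standard EventBridge API calls (`put_rule`, `put_targets`); the rule name and target ARN are placeholders:

```python
import json

# The event pattern shown above: match failed DataBrew ruleset validations.
FAILED_VALIDATION_PATTERN = {
    "source": ["aws.databrew"],
    "detail-type": ["DataBrew Ruleset Validation Result"],
    "detail": {"validationState": ["FAILED"]},
}

def create_failed_validation_rule(rule_name="databrew-validation-failed",
                                  target_arn=None):
    """Create an EventBridge rule for failed validations and optionally
    attach a target (e.g., an SNS topic or Lambda function ARN)."""
    import boto3  # imported here so the sketch loads without AWS credentials
    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(FAILED_VALIDATION_PATTERN),
    )
    if target_arn:
        events.put_targets(
            Rule=rule_name,
            Targets=[{"Id": "1", "Arn": target_arn}],
        )
```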

For an example of creating a ruleset and validating its rules, see [Creating a ruleset with data quality rules](profile.data-quality-rules-create.md). For more information about working with CloudWatch events in DataBrew, see [Automating DataBrew with CloudWatch Events](monitoring.cloudwatch-events.md).

# Creating a ruleset with data quality rules

In the following procedure, you can find an example of creating a ruleset and applying it to a dataset. A *ruleset* is a set of rules that compare different data metrics against expected values. You then can use this ruleset in a profile job to validate the data quality rules that it includes.

**To create an example ruleset with data quality rules**

1. Sign in to the AWS Management Console and open the DataBrew console at [https://console.aws.amazon.com/databrew/](https://console.aws.amazon.com/databrew/).

1. Choose **DQ RULES** from the navigation pane, and then choose **Create data quality ruleset**.

1. Enter a name for your ruleset. Optionally, enter a description for your ruleset.

1. Under **Associated dataset**, choose a dataset to associate with the ruleset.

   After you select a dataset, you can view the **Dataset preview** pane at right. 

1. Use the preview in the **Dataset preview** pane to explore the values and schema for the dataset as you determine the data quality rules to create. The preview can give you insight about potential issues that you might have with the data.

   Some data sources, such as databases, don't support data preview. In that case, you can run a profile job without validating the data quality rules first. Then you can get information about the data schema and values distribution by using the data profile. 

1. Check the **Recommendations** tab, which lists some rule suggestions that you can use when creating your ruleset. You can select all, some, or none of the recommendations. 

   After selecting relevant recommendations, choose **Add to ruleset**.

   Doing this adds the rules to your ruleset. Inspect them and modify their parameters if needed. Note that only columns of simple types, such as *string*, *number*, and *boolean*, can be used in data quality rules.

1. Choose **Add another rule** to add a rule not covered by recommendations. You can change rule names to make it easier to interpret validation results later.

1. Use **Data quality check scope** to choose whether columns are selected individually for each check in this rule, or whether the checks apply to a group of columns that you select. For example, if your dataset has several numeric columns that should all have values between 0 and 100, you can define the rule once and select all of those columns to be checked by it.

1. If your rule has more than one check, then in the **Rule success criteria** dropdown, choose whether all checks must be met or whether meeting any single check is enough.

1. In the **Data quality check** dropdown, select the check to perform for this rule. For more information about available checks, see [Available checks](profile.data-quality-available-checks.md). 

1. If you chose **Individual check for each column** for **Data quality check scope**, select or type the column name for this check.

1. Set parameters depending on the check. Some conditions accept only custom values that you provide, and some also support a reference to another column.

1. If you choose checks for **Column values**, such as the *Contains* condition for string values, you can specify a "passing" threshold. For example, if you want at least 95 percent of values to satisfy the condition, choose *Greater than equals* as the threshold's **Condition**, enter 95 for **Threshold**, and leave *%(percent) rows* selected in the next dropdown in the **Threshold** section. Or, if you want no more than 10 rows where the *value is missing* condition is true, choose *Less than equals* as the **Condition**, enter 10 for **Threshold**, and choose **rows** in the next dropdown. Note that you might get different results if you use samples of different sizes during validation.

1. Add more rules if needed.

1. Choose **Create ruleset**.

# Creating a profile job using a ruleset

After you create a ruleset as described in the preceding procedure, you are directed to the **Data quality rules** page, which displays all rulesets in your account.

**To create a profile job including a ruleset**

1. Choose the name of the ruleset that you previously created to view its details.

1. Choose **Create profile job with ruleset**. 

   The **Job name** is automatically filled, but you can change it as needed.

1. For **Job run sample**, you can choose to run the entire dataset or a limited number of rows. 

   If you choose to run a limited sample size, be aware that for certain rules, results might differ compared to the full dataset. 

1. For **Job output settings**, choose an **S3** location for the job output. Choose any folder in an Amazon S3 bucket that you have access to. If you enter the name of a folder that doesn't exist in that bucket, DataBrew creates it. 

   Upon successful completion of the profile job, this folder contains the data profile and the data quality rules validation report in JSON format. 

1. Under **Data quality rules**, note that your ruleset is listed under **Data quality ruleset name**.

1. Under **Permissions**, select or create a role to grant DataBrew access to read from the input Amazon S3 location and write to the job output location. If you don't have a role ready, select **Create new IAM role**.

1. Modify any other optional settings as described in [Creating and working with AWS Glue DataBrew profile jobs](jobs.profile.md), if needed.

1. Choose **Create and run job**.
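The console steps above can also be performed through the API. The following is a hedged sketch of the equivalent `CreateProfileJob` request, where the names, role ARN, bucket, and ruleset ARN are placeholders; `ValidationConfigurations` is what attaches the ruleset so the profile job validates it on each run:

```python
# Hypothetical request body for creating a profile job with an attached
# ruleset. All identifiers below are illustrative placeholders.
PROFILE_JOB_CONFIG = {
    "Name": "MyProfileJob",
    "DatasetName": "MyDataset",
    "RoleArn": "arn:aws:iam::123456789012:role/MyDataBrewRole",
    "OutputLocation": {"Bucket": "MyBucket", "Key": "profile-output/"},
    "ValidationConfigurations": [
        {
            "RulesetArn": "arn:aws:databrew:us-east-1:123456789012:ruleset/MyRuleset",
            "ValidationMode": "CHECK_ALL",
        }
    ],
}

def create_profile_job_with_ruleset(config=PROFILE_JOB_CONFIG):
    import boto3  # imported here so the sketch loads without AWS credentials
    databrew = boto3.client("databrew")
    return databrew.create_profile_job(**config)
```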

## Inspecting validation results and updating data quality rules


After your profile job completes, you can view the validation results for your data quality rules and update your rules as needed. 

**To view validation data for your data quality rules**

1. On the DataBrew console, choose **View data profile**. Doing this displays the **Data profile overview** tab for your dataset.

1. Choose the **Data quality rules** tab. On this tab, you can view the results for all of your data quality rules.

1. Select an individual rule for more details about that rule. 

For any rule that failed validation, you can make the necessary corrections.

**To update your data quality rules**

1. On the navigation pane, choose **DQ RULES**.

1. Under **Data quality ruleset name**, choose the ruleset that contains the rules that you plan to edit.

1. Choose the rule that you want to change, and then choose **Edit**.

1. Make the necessary corrections, and then choose **Update ruleset**.

1. Rerun the job. Repeat this process until all validations pass.

# Available checks


The following table lists references for all available conditions that can be used in your rules. Note that aggregated conditions cannot be combined with non-aggregated conditions in the same rule. 

**Note**  
For SDK users: to apply the same rule to multiple columns, use the [ColumnSelectors](https://docs.aws.amazon.com/databrew/latest/dg/API_ColumnSelector.html) attribute of a [Rule](https://docs.aws.amazon.com/databrew/latest/dg/API_Rule.html) and specify the validated columns using either their names or a regular expression. In this case, use an implicit *CheckExpression*, such as `"> :val"`, to compare the values in each of the selected columns with the provided value. DataBrew uses the same implicit syntax for defining a [FilterExpression](https://docs.aws.amazon.com/databrew/latest/dg/API_FilterExpression.html) in dynamic datasets. If you want to specify columns for each check individually, don't set the *ColumnSelectors* attribute. Instead, provide an explicit expression, such as `":col > :val"`, as the *CheckExpression* in a *Rule*.
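For illustration, a `CreateRuleset` request using the implicit form with a `ColumnSelectors` regular expression might look like the following sketch. The dataset ARN, names, and regex are placeholders, and the exact request shape should be checked against the DataBrew API reference:

```python
# Hypothetical CreateRuleset request: apply "> :val" to every column whose
# name matches a regular expression. All identifiers are placeholders.
ruleset_request = {
    "Name": "MyRuleset",
    "TargetArn": "arn:aws:databrew:us-east-1:123456789012:dataset/MyDataset",
    "Rules": [
        {
            "Name": "rate columns are positive",
            # Implicit form: the expression is applied to each selected column.
            "CheckExpression": "> :val",
            "SubstitutionMap": {":val": "0"},
            "ColumnSelectors": [{"Regex": "rate_.*"}],
        }
    ],
}

def create_ruleset(request=ruleset_request):
    import boto3  # imported here so the sketch loads without AWS credentials
    return boto3.client("databrew").create_ruleset(**request)
```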


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/databrew/latest/dg/profile.data-quality-available-checks.html)

**Numeric comparisons**

DataBrew supports the following operations for numeric comparison: *Is equals (==)*, *Is not equals (!=)*, *Less than (<)*, *Less than equals (<=)*, *Greater than (>)*, *Greater than equals (>=)*, and *Is between (is between :val1 and :val2)*.

**String comparisons**

The following string comparisons are supported: *Starts with*, *Doesn’t start with*, *Ends with*, *Doesn’t end with*, *Contains*, *Doesn’t contain*, *Is equals*, *Is not equals*, *Matches*, *Doesn’t match*. 

The following table displays the available statistics that you can use for **Value distribution statistics** and **Numerical statistics**:


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/databrew/latest/dg/profile.data-quality-available-checks.html)