

# DQDL rule type reference
<a name="dqdl-rule-types"></a>

This section provides a reference for each rule type that AWS Glue Data Quality supports.

**Note**  
DQDL doesn't currently support nested or list-type column data.
Bracketed values in the following table are replaced with the information provided in the rule arguments.
Most rules also require an additional expression argument.


| Rule type | Description | Arguments | Reported Metrics | Supported as Rule? | Supported as Analyzer? | Returns row-level Results? | Dynamic rule support? | Generates Observations | Supports Where Clause Syntax? | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| AggregateMatch | Checks if two datasets match by comparing summary metrics like total sales amount. Useful for financial institutions to compare whether all data is ingested from source systems. | One or more aggregations |  When the first and second aggregation column names match: `Column.[Column].AggregateMatch` When the first and second aggregation column names are different: `Column.[Column1,Column2].AggregateMatch`  | Yes | No | No | No | No | No | 
| AllStatistics | Standalone analyzer to gather multiple metrics for the provided column in a dataset. | A single column name |  For columns of all types: `Dataset.*.RowCount` `Column.[Column].Completeness` `Column.[Column].Uniqueness` Additional metrics for string-valued columns: `ColumnLength metrics` Additional metrics for numeric-valued columns: `ColumnValues metrics`  | No | Yes | No | No | No | No | 
| ColumnCorrelation | Checks how well two columns are correlated. | Exactly two column names | Multicolumn.[Column1,Column2].ColumnCorrelation | Yes | Yes | No | Yes | No | Yes | 
| ColumnCount | Checks if any columns are dropped. | None | Dataset.\*.ColumnCount | Yes | Yes | No | Yes | Yes | No | 
| ColumnDataType | Checks if a column is compliant with a datatype. | Exactly one column name | Column.[Column].ColumnDataType.Compliance | Yes | No | No | Yes, in row-level threshold expression | No | Yes | 
| ColumnExists | Checks if columns exist in a dataset. This allows customers building self service data platforms to ensure certain columns are made available. | Exactly one column name | N/A | Yes | No | No | No | No | No | 
| ColumnLength | Checks if length of data is consistent. | Exactly one column name |  `Column.[Column].MaximumLength` `Column.[Column].MinimumLength` Additional metric when row-level threshold provided: `Column.[Column].ColumnValues.Compliance`  | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing Minimum and Maximum length | Yes | 
| ColumnNamesMatchPattern | Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency.  | A regex for column names | Dataset.\*.ColumnNamesPatternMatchRatio | Yes | No | No | No | No | No | 
| ColumnValues | Checks if data is consistent per defined values. This rule supports regular expressions. | Exactly one column name |  `Column.[Column].Maximum` `Column.[Column].Minimum` Additional metric when row-level threshold provided: `Column.[Column].ColumnValues.Compliance`  | Yes | Yes | Yes, when row-level threshold provided | No | Yes. Only generates observations by analyzing Minimum and Maximum values | Yes | 
| Completeness | Checks for any blank or NULLs in data. | Exactly one column name | `Column.[Column].Completeness` | Yes | Yes | Yes | Yes | Yes | Yes | 
| CustomSql |  Customers can implement almost any type of data quality check in SQL. |  A SQL statement (Optional) A row-level threshold  |  `Dataset.*.CustomSQL` Additional metric when row-level threshold provided: `Dataset.*.CustomSQL.Compliance`  | Yes | No | Yes, when row-level threshold provided | Yes | No | No | 
| DataFreshness | Checks if data is fresh. | Exactly one column name | Column.[Column].DataFreshness.Compliance | Yes | No | Yes | No | No | Yes | 
| DatasetMatch | Compares two datasets and identifies if they are in sync. |  Name of a reference dataset A column mapping (Optional) Columns to check for matches  | Dataset.[ReferenceDatasetAlias].DatasetMatch | Yes | No | Yes | Yes | No | No | 
| DistinctValuesCount | Checks for duplicate values. | Exactly one column name | Column.[Column].DistinctValuesCount | Yes | Yes | Yes | Yes | Yes | Yes | 
| DetectAnomalies | Checks for anomalies in another rule type's reported metrics. | A rule type | Metric(s) reported by the rule type argument | Yes | No | No | No | No | No | 
| Entropy | Checks for entropy of the data. | Exactly one column name | Column.[Column].Entropy | Yes | Yes | No | Yes | No | Yes | 
| IsComplete | Checks if 100% of the data is complete. | Exactly one column name | Column.[Column].Completeness | Yes | No | Yes | No | No | Yes | 
| IsPrimaryKey | Checks if a column is a primary key (not NULL and unique). | Exactly one column name |  For single column: `Column.[Column].Uniqueness` For multiple columns: `Multicolumn.[CommaDelimitedColumns].Uniqueness`  | Yes | No | Yes | No | No | Yes | 
| IsUnique | Checks if 100% of the data is unique. | Exactly one column name | Column.[Column].Uniqueness | Yes | No | Yes | No | No | Yes | 
| Mean | Checks if the mean matches the set threshold. | Exactly one column name | Column.[Column].Mean | Yes | Yes | Yes | Yes | No | Yes | 
| ReferentialIntegrity | Checks if two datasets have referential integrity. |  One or more column names from dataset One or more column names from reference dataset  | Column.[ReferenceDatasetAlias].ReferentialIntegrity | Yes | No | Yes | Yes | No | No | 
| RowCount | Checks if record counts match a threshold. | None | Dataset.\*.RowCount | Yes | Yes | No | Yes | Yes | Yes | 
| RowCountMatch | Checks if record counts between two datasets match. | Reference dataset alias | Dataset.[ReferenceDatasetAlias].RowCountMatch | Yes | No | No | Yes | No | No | 
| StandardDeviation | Checks if standard deviation matches the threshold. | Exactly one column name | Column.[Column].StandardDeviation | Yes | Yes | Yes | Yes | No | Yes | 
| SchemaMatch | Checks if schema between two datasets match. | Reference dataset alias | Dataset.[ReferenceDatasetAlias].SchemaMatch | Yes | No | No | Yes | No | No | 
| Sum | Checks if sum matches a set threshold. | Exactly one column name | Column.[Column].Sum | Yes | Yes | No | Yes | No | Yes | 
| Uniqueness | Checks if uniqueness of dataset matches threshold. | Exactly one column name | Column.[Column].Uniqueness | Yes | Yes | Yes | Yes | No | Yes | 
| UniqueValueRatio | Checks if the unique value ratio matches a threshold. | Exactly one column name | Column.[Column].UniqueValueRatio | Yes | Yes | Yes | Yes | No | Yes | 
| FileFreshness | Checks if files in Amazon S3 are fresh. | File or Folder path and a threshold. |  `Dataset.*.FileFreshness.Compliance` `Dataset.*.FileCount`  | Yes | No | No | No | No | No | 
| FileMatch | Checks if the contents of a file match a checksum or the contents of another file. This rule uses checksums to validate whether two files are the same. | Source file or folder path and target file or folder path. | No statistics are generated. | Yes | No | No | No | No | No | 
| FileSize | Checks if the size of a file matches with a specified condition. | File or folder path and threshold. | `Dataset.*.FileSize.Compliance` `Dataset.*.FileCount` `Dataset.*.MaximumFileSize` `Dataset.*.MinimumFileSize`  | Yes | No | No | No | No | No | 
| FileUniqueness | Checks if files are unique using checksums. | File or folder path and threshold. | `Dataset.*.FileUniquenessRatio` `Dataset.*.FileCount`  | Yes | No | No | No | No | No | 

**Topics**
+ [AggregateMatch](dqdl-rule-types-AggregateMatch.md)
+ [ColumnCorrelation](dqdl-rule-types-ColumnCorrelation.md)
+ [ColumnCount](dqdl-rule-types-ColumnCount.md)
+ [ColumnDataType](dqdl-rule-types-ColumnDataType.md)
+ [ColumnExists](dqdl-rule-types-ColumnExists.md)
+ [ColumnLength](dqdl-rule-types-ColumnLength.md)
+ [ColumnNamesMatchPattern](dqdl-rule-types-ColumnNamesMatchPattern.md)
+ [ColumnValues](dqdl-rule-types-ColumnValues.md)
+ [Completeness](dqdl-rule-types-Completeness.md)
+ [CustomSQL](dqdl-rule-types-CustomSql.md)
+ [DataFreshness](dqdl-rule-types-DataFreshness.md)
+ [DatasetMatch](dqdl-rule-types-DatasetMatch.md)
+ [DistinctValuesCount](dqdl-rule-types-DistinctValuesCount.md)
+ [Entropy](dqdl-rule-types-Entropy.md)
+ [IsComplete](dqdl-rule-types-IsComplete.md)
+ [IsPrimaryKey](dqdl-rule-types-IsPrimaryKey.md)
+ [IsUnique](dqdl-rule-types-IsUnique.md)
+ [Mean](dqdl-rule-types-Mean.md)
+ [ReferentialIntegrity](dqdl-rule-types-ReferentialIntegrity.md)
+ [RowCount](dqdl-rule-types-RowCount.md)
+ [RowCountMatch](dqdl-rule-types-RowCountMatch.md)
+ [StandardDeviation](dqdl-rule-types-StandardDeviation.md)
+ [Sum](dqdl-rule-types-Sum.md)
+ [SchemaMatch](dqdl-rule-types-SchemaMatch.md)
+ [Uniqueness](dqdl-rule-types-Uniqueness.md)
+ [UniqueValueRatio](dqdl-rule-types-UniqueValueRatio.md)
+ [DetectAnomalies](dqdl-rule-types-DetectAnomalies.md)
+ [FileFreshness](dqdl-rule-types-FileFreshness.md)
+ [FileMatch](dqdl-rule-types-FileMatch.md)
+ [FileUniqueness](dqdl-rule-types-FileUniqueness.md)
+ [FileSize](dqdl-rule-types-FileSize.md)

# AggregateMatch
<a name="dqdl-rule-types-AggregateMatch"></a>

Checks the ratio of two column aggregations against a given expression. This rule type works across multiple datasets. The two column aggregations are evaluated, and a ratio is produced by dividing the result of the first column aggregation by the result of the second. The ratio is checked against the provided expression to produce a Boolean response.

**Syntax**

**Column aggregation**

```
AggregateMatch <AGG_OPERATION> (<OPTIONAL_REFERENCE_ALIAS>.<COL_NAME>)
```
+ **AGG\_OPERATION** – The operation to use for the aggregation. Currently, `sum` and `avg` are supported.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **OPTIONAL\_REFERENCE\_ALIAS** – This parameter must be provided if the column is from a reference dataset and not the primary dataset. If you are using this rule in the AWS Glue Data Catalog, your reference alias must follow the format "<database\_name>.<table\_name>.<column\_name>".

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **COL\_NAME** – The name of the column to aggregate.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short

**Example: Average**

```
"avg(rating)"
```

**Example: Sum**

```
"sum(amount)"
```

**Example: Average of column in reference dataset**

```
"avg(reference.rating)"
```

**Rule**

```
AggregateMatch <AGG_EXP_1> <AGG_EXP_2> <EXPRESSION>
```
+ **AGG\_EXP\_1** – The first column aggregation.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **AGG\_EXP\_2** – The second column aggregation.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Aggregate Match using sum**

The following example rule checks whether the sum of the values in the `amount` column is exactly equal to the sum of the values in the `total_amount` column.

```
AggregateMatch "sum(amount)" "sum(total_amount)" = 1.0
```

**Example: Aggregate Match using average**

The following example rule checks whether the average of the values in the `ratings` column is equal to at least 90% of the average of the values in the `ratings` column in the `reference` dataset. The reference dataset is provided as an additional data source in the ETL or Data Catalog experience.

In AWS Glue ETL, you can use:

```
AggregateMatch "avg(ratings)" "avg(reference.ratings)" >= 0.9
```

In the AWS Glue Data Catalog, you can use:

```
AggregateMatch "avg(ratings)" "avg(database_name.tablename.ratings)" >= 0.9
```

**Null behavior**

 The `AggregateMatch` rule will ignore rows with NULL values in the calculation of the aggregation methods (sum/mean). For example: 

```
+---+-----------+
|id |units      |
+---+-----------+
|100|0          | 
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+
```

 The mean of column `units` will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation. 
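
The null-ignoring aggregation above can be sketched in plain Python (a hypothetical illustration of the semantics, not the actual Glue implementation):

```python
def agg_ignore_nulls(values, op):
    """Aggregate a column, skipping NULL (None) rows as AggregateMatch does."""
    present = [v for v in values if v is not None]
    if op == "sum":
        return sum(present)
    if op == "avg":
        return sum(present) / len(present)
    raise ValueError(f"unsupported operation: {op}")

units = [0, None, 20, None, 40]  # the units column from the table above
assert agg_ignore_nulls(units, "avg") == 20  # (0 + 20 + 40) / 3
```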

# ColumnCorrelation
<a name="dqdl-rule-types-ColumnCorrelation"></a>

Checks the *correlation* between two columns against a given expression. AWS Glue Data Quality uses the Pearson correlation coefficient to measure the linear correlation between two columns. The result is a number between -1 and 1 that measures the strength and direction of the relationship. 

**Syntax**

```
ColumnCorrelation <COL_1_NAME> <COL_2_NAME> <EXPRESSION>
```
+ **COL\_1\_NAME** – The name of the first column that you want to evaluate the data quality rule against.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **COL\_2\_NAME** – The name of the second column that you want to evaluate the data quality rule against.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Column correlation**

The following example rule checks whether the correlation coefficient between the columns `height` and `weight` has a strong positive correlation (a coefficient value greater than 0.8).

```
ColumnCorrelation "height" "weight" > 0.8
```

```
ColumnCorrelation "weightinkgs" "Salary" > 0.8 where "weightinkgs > 40"
```

 **Sample dynamic rules** 
+ `ColumnCorrelation "colA" "colB" between min(last(10)) and max(last(10))`
+ `ColumnCorrelation "colA" "colB" < avg(last(5)) + std(last(5))`

**Null behavior**

 The `ColumnCorrelation` rule will ignore rows with `NULL` values in the calculation of the correlation. For example: 

```
+---+-----------+
|id |units      |
+---+-----------+
|100|0          | 
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+
```

 Rows 101 and 103 will be ignored, and the `ColumnCorrelation` will be 1.0. 
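
To make the example concrete, here is a plain-Python sketch of the Pearson coefficient with the same null handling (an illustration of the semantics, not the Glue implementation):

```python
import math

def pearson_ignore_nulls(xs, ys):
    """Pearson correlation over row pairs where both values are non-NULL,
    mirroring how ColumnCorrelation drops NULL rows."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)

ids = [100, 101, 102, 103, 104]
units = [0, None, 20, None, 40]
# rows 101 and 103 are dropped; the remaining points are perfectly linear
assert abs(pearson_ignore_nulls(ids, units) - 1.0) < 1e-9
```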

# ColumnCount
<a name="dqdl-rule-types-ColumnCount"></a>

Checks the column count of the primary dataset against a given expression. In the expression, you can specify the number of columns or a range of columns using operators like `>` and `<`.

**Syntax**

```
ColumnCount <EXPRESSION>
```
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Column count numeric check**

The following example rule checks whether the column count is within a given range.

```
ColumnCount between 10 and 20
```

**Sample dynamic rules**
+ `ColumnCount >= avg(last(10))`
+ `ColumnCount between min(last(10))-1 and max(last(10))+1`
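
Dynamic rules compare the current metric with aggregates over previous runs. A minimal sketch of the `avg(last(k))` semantics (the helper names here are hypothetical):

```python
def avg_last(history, k):
    """Average over the last k previously observed metric values."""
    window = history[-k:]
    return sum(window) / len(window)

column_counts = [10, 10, 11, 12, 12]  # hypothetical metrics from prior runs
current_count = 12
# ColumnCount >= avg(last(10)) compares the current count with this average
assert current_count >= avg_last(column_counts, 10)  # 12 >= 11.0
```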

# ColumnDataType
<a name="dqdl-rule-types-ColumnDataType"></a>

 Checks if the values in a given column can be cast in Apache Spark to the provided type. Accepts a `with threshold` expression to check for a subset of the values in the column. 

 **Syntax** 

```
ColumnDataType <COL_NAME> = <EXPECTED_TYPE>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: String
+ **EXPECTED\_TYPE** – The expected type of the values in the column.

  **Supported values**: Boolean, Date, Timestamp, Integer, Double, Float, Long
+ **EXPRESSION** – An optional expression to specify the percentage of values that should be of the expected type.

 **Example: Column data type integers as strings** 

 The following example rule checks whether the values in the given column, which is of type string, can be cast as integers.

```
ColumnDataType "colA" = "INTEGER"
```

 **Example: Column data type integers as strings check for a subset of the values** 

 The following example rule checks whether more than 90% of the values in the given column, which is of type string, can be cast as integers. 

```
ColumnDataType "colA" = "INTEGER" with threshold > 0.9
```
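
The threshold check can be pictured as computing the fraction of castable values (a sketch under the assumption that compliance is simply castable values divided by total values):

```python
def column_datatype_compliance(values, cast):
    """Fraction of values that can be cast to the expected type."""
    ok = 0
    for v in values:
        try:
            cast(v)
            ok += 1
        except (TypeError, ValueError):
            pass
    return ok / len(values)

col_a = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "x"]  # one bad value
assert column_datatype_compliance(col_a, int) == 0.9
# 'with threshold > 0.9' would fail here: 0.9 is not strictly greater than 0.9
```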

# ColumnExists
<a name="dqdl-rule-types-ColumnExists"></a>

Checks whether a column exists.

**Syntax**

```
ColumnExists <COL_NAME>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type

**Example: Column exists**

The following example rule checks whether the column named `Middle_Name` exists.

```
ColumnExists "Middle_Name"
```

# ColumnLength
<a name="dqdl-rule-types-ColumnLength"></a>

Checks whether the length of each row in a column conforms to a given expression.

**Syntax**

```
ColumnLength <COL_NAME> <EXPRESSION>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: String
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Column row length**

The following example rule checks whether the value in each row in the column named `Postal_Code` is 5 characters long.

```
ColumnLength "Postal_Code" = 5
ColumnLength "weightinkgs" = 2 where "weightinkgs > 10"
```

**Null behavior**

 The `ColumnLength` rule treats `NULL` values as zero-length strings. For a `NULL` row: 

```
ColumnLength "Postal_Code" > 4 # this will fail
```

```
ColumnLength "Postal_Code" < 6 # this will succeed 
```

 The following example compound rule provides a way to explicitly fail `NULL` values: 

```
(ColumnLength "Postal_Code" > 4) AND (ColumnValues "Postal_Code" != NULL)
```
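
The zero-length treatment of `NULL` can be sketched as follows (an illustration of the behavior only):

```python
def column_length_passes(value, predicate):
    """ColumnLength treats NULL (None) as a zero-length string."""
    length = 0 if value is None else len(value)
    return predicate(length)

assert column_length_passes(None, lambda n: n > 4) is False  # NULL fails > 4
assert column_length_passes(None, lambda n: n < 6) is True   # NULL passes < 6
assert column_length_passes("90210", lambda n: n == 5) is True
```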

# ColumnNamesMatchPattern
<a name="dqdl-rule-types-ColumnNamesMatchPattern"></a>

Checks whether the names of all columns in the primary dataset match the given regular expression.

**Syntax**

```
ColumnNamesMatchPattern <PATTERN>
```
+ **PATTERN** – The regular expression pattern that you want to evaluate the column names against.

**Example: Column names match pattern**

The following example rule checks whether all column names start with the prefix "aws\_".

```
ColumnNamesMatchPattern "aws_.*"
ColumnNamesMatchPattern "aws_.*" where "weightinkgs > 10"
```

# ColumnValues
<a name="dqdl-rule-types-ColumnValues"></a>

Runs an expression against the values in a column.

**Syntax**

```
ColumnValues <COL_NAME> <EXPRESSION>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Allowed values**

The following example rule checks whether each value in the specified column is in a set of allowed values (including null, empty, and strings with only whitespaces).

```
ColumnValues "Country" in [ "US", "CA", "UK", NULL, EMPTY, WHITESPACES_ONLY ]
ColumnValues "gender" in ["F", "M"] where "weightinkgs < 10"
```

**Example: Regular expression**

The following example rule checks the values in a column against a regular expression.

```
ColumnValues "First_Name" matches "[a-zA-Z]*"
```

**Example: Date values**

The following example rule checks the values in a date column against a date expression.

```
ColumnValues "Load_Date" > (now() - 3 days)
```

**Example: Numeric values**

The following example rule checks whether the column values match a certain numeric constraint.

```
ColumnValues "Customer_ID" between 1 and 2000
```

**Null behavior**

 For all `ColumnValues` rules (other than `!=` and `NOT IN`), `NULL` rows will fail the rule. If the rule fails due to a null value, the failure reason will display the following: 

```
Value: NULL does not meet the constraint requirement!
```

 The following example compound rule provides a way to explicitly allow for `NULL` values: 

```
(ColumnValues "Age" > 21) OR (ColumnValues "Age" = NULL)
```

 Negated ColumnValues rules using the `!=` and `not in` syntax will pass for `NULL` rows. For example: 

```
ColumnValues "Age" != 21
```

```
ColumnValues "Age" not in [21, 22, 23]
```

 The following examples provide a way to explicitly fail `NULL` values 

```
(ColumnValues "Age" != 21) AND (ColumnValues "Age" != NULL)
```

```
ColumnValues "Age" not in [21, 22, 23, NULL]
```
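
The `not in` semantics for `NULL` rows happen to mirror Python's own membership test, which makes for a compact sketch (an illustration of the behavior, not the implementation):

```python
def passes_not_in(value, disallowed):
    """A 'ColumnValues not in [...]' check: NULL rows pass unless NULL is listed."""
    return value not in disallowed

assert passes_not_in(None, [21, 22, 23]) is True         # NULL row passes
assert passes_not_in(None, [21, 22, 23, None]) is False  # NULL explicitly failed
assert passes_not_in(21, [21, 22, 23]) is False
```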

# Completeness
<a name="dqdl-rule-types-Completeness"></a>

Checks the percentage of complete (non-null) values in a column against a given expression.

**Syntax**

```
Completeness <COL_NAME> <EXPRESSION>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Null value percentage**

The following example rules check if more than 95 percent of the values in a column are complete.

```
Completeness "First_Name" > 0.95
Completeness "First_Name" > 0.95 where "weightinkgs > 10"
```

 **Sample dynamic rules** 
+ `Completeness "colA" between min(last(5)) - 1 and max(last(5)) + 1`
+ `Completeness "colA" <= avg(last(10))`

**Null behavior**

 Note on CSV Data Formats: Blank rows on CSV columns can display multiple behaviors. 
+  If a column is of `String` type, the blank row will be recognized as an empty string and will not fail the `Completeness` rule. 
+  If a column is of another data type like `Int`, the blank row will be recognized as `NULL` and will fail the `Completeness` rule. 
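
A minimal sketch of the completeness metric, including the CSV string case where an empty string still counts as complete:

```python
def completeness(values):
    """Fraction of non-NULL (non-None) values in a column."""
    return sum(v is not None for v in values) / len(values)

first_name = ["Ana", "Ben", None, "Dee", "Eli", "Fay", "Gus", "Hal", "Ivy", "Joe"]
assert completeness(first_name) == 0.9   # Completeness "First_Name" > 0.95 fails
assert completeness(["", "a"]) == 1.0    # empty strings are complete, not NULL
```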

# CustomSQL
<a name="dqdl-rule-types-CustomSql"></a>

This rule type has been extended to support two use cases:
+ Run a custom SQL statement against a dataset and check the return value against a given expression.
+ Run a custom SQL statement that specifies a column name in the SELECT statement, and compare that column against a condition to get row-level results.

**Syntax**

```
CustomSql <SQL_STATEMENT> <EXPRESSION>
```
+ **SQL\_STATEMENT** – A SQL statement that returns a single numeric value, surrounded by double quotes.
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Custom SQL to retrieve an overall rule outcome**

This example rule uses a SQL statement to retrieve the record count for a data set. The rule then checks that the record count is between 10 and 20.

```
CustomSql "select count(*) from primary" between 10 and 20
```

**Example: Custom SQL to retrieve row-level results**

This example rule uses a SQL statement that specifies a column name in the SELECT statement, which is then compared against a condition to get row-level results. A threshold condition expression defines how many records can fail before the entire rule fails. Note that a rule may not contain both a condition and a threshold keyword together.

```
CustomSql "select Name from primary where Age > 18"
```

or

```
CustomSql "select Name from primary where Age > 18" with threshold > 3
```

**Important**  
The `primary` alias stands in for the name of the data set that you want to evaluate. When you work with visual ETL jobs on the console, `primary` always represents the `DynamicFrame` being passed to the `EvaluateDataQuality.apply()` transform. When you use the AWS Glue Data Catalog to run data quality tasks against a table, `primary` represents the table.

If you are in AWS Glue Data Catalog, you can also use the actual table names:

```
CustomSql "select count(*) from database.table" between 10 and 20
```

You can also join multiple tables to compare different data elements:

```
CustomSql "select count(*) from database.table inner join database.table2 on id1 = id2" between 10 and 20
```

 In AWS Glue ETL, CustomSQL can identify records that failed the data quality checks. For this to work, you need to return records that are part of the primary table for which you are evaluating data quality. Records that are returned as part of the query are considered successful and records that are not returned are considered failed. This works by joining the result of your CustomSQL query with the original dataset. There may be performance implications based on the complexity of your SQL query. 

 To do this: 
+  You need to select at least one column from your primary table. 
  +  `select count(*) from primary` is a valid query for OVERALL CustomSQL DQ rule but not for Row Level Custom SQL. 
  +  This rule will throw an error during evaluation: `The output from CustomSQL must contain at least one column that matches the input dataset for AWS Glue Data Quality to provide row level results. The SQL query is a valid query but the columns from the SQL result are not present in the Input Dataset. Ensure that matching columns are returned from the SQL.` 
+  In your SQL query, select a `Primary Key` from your table or select a set of columns that form a composite key. Not doing so may result in inconsistent results due to matching of duplicate rows and degraded performance. 
+  Select keys ONLY from your primary table and not from your reference tables. 

The following rule ensures that records with age < 100 are identified as successful, and records with age >= 100 are marked as failed.

```
CustomSql "select id from primary where age < 100" 
```

This CustomSql rule will pass when more than 50% of the records have age > 10, and it will also identify the records that failed. The records returned by this CustomSql query are considered passed, while the ones not returned are considered failed.

```
CustomSql "select ID, CustomerID from primary where age > 10" with threshold > 0.5
```

Note: The CustomSql rule will fail if you return records that are not available in the dataset. 
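
The row-level semantics can be sketched with SQLite (the table and column names here are hypothetical; the `primary` alias is renamed because `PRIMARY` is a reserved word in SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE primary_tbl (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO primary_tbl VALUES (?, ?)",
                 [(1, 5), (2, 15), (3, 30), (4, 8)])

# Rows returned by the query are treated as passed; all other rows fail.
passed = {row[0] for row in
          conn.execute("SELECT id FROM primary_tbl WHERE age > 10")}
total = conn.execute("SELECT count(*) FROM primary_tbl").fetchone()[0]
compliance = len(passed) / total      # 2 of 4 rows pass -> 0.5
rule_passes = compliance > 0.5        # 'with threshold > 0.5' fails at exactly 0.5
```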

# DataFreshness
<a name="dqdl-rule-types-DataFreshness"></a>

Checks the freshness of data in a column by evaluating the difference between the current time and the values of a date column. You can specify a time-based expression for this rule type to make sure that column values are up to date.

**Syntax**

```
DataFreshness <COL_NAME> <EXPRESSION>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Date 
+ **EXPRESSION** – A numeric expression in hours or days. You must specify the time unit in your expression.

**Example: Data freshness**

The following example rules check for data freshness.

```
DataFreshness "Order_Date" <= 24 hours
DataFreshness "Order_Date" between 2 days and 5 days
```

**Null behavior**

 The `DataFreshness` rules will fail for rows with `NULL` values. If the rule fails due to a null value, the failure reason will display the following: 

```
80.00 % of rows passed the threshold
```

 where the 20% of rows that failed include the rows with `NULL` values. 

 The following example compound rule provides a way to explicitly allow for `NULL` values: 

```
(DataFreshness "Order_Date" <= 24 hours) OR (ColumnValues "Order_Date" = NULL)
```
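
A plain-Python sketch of the freshness check, with `NULL` rows counted as failures as described above (the helper names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def freshness_ok(value, max_age_hours, now):
    """A row is fresh if now - value is within the window; NULL (None) fails."""
    if value is None:
        return False
    return now - value <= timedelta(hours=max_age_hours)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [now - timedelta(hours=3), now - timedelta(hours=30), None]
compliance = sum(freshness_ok(d, 24, now) for d in rows) / len(rows)
assert abs(compliance - 1 / 3) < 1e-9  # only the 3-hour-old row passes
```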

**Data Freshness for Amazon S3 objects**

 Sometimes you need to validate the freshness of data based on the Amazon S3 file creation time. To do this, you can use the following code to get the timestamp, add it to your DataFrame, and then apply data freshness checks. 

```
df = glueContext.create_data_frame.from_catalog(database="default", table_name="mytable")
df = df.withColumn("file_ts", df["_metadata.file_modification_time"])

ruleset = """
Rules = [
    DataFreshness "file_ts" < 24 hours
]
"""
```

# DatasetMatch
<a name="dqdl-rule-types-DatasetMatch"></a>

 Checks if the data in the primary dataset matches the data in a reference dataset. The two datasets are joined using the provided key column mappings. Additional column mappings can be provided if you want to check the equality of the data in only those columns. Note that for **DatasetMatch** to work, your join keys must be unique and must not be NULL (they must form a primary key). If these conditions are not met, you will get the error message, "Provided key map not suitable for given data frames". In cases where you cannot have unique join keys, consider using other rule types such as **AggregateMatch** to match on summary data. 

**Syntax**

```
DatasetMatch <REFERENCE_DATASET_ALIAS> <KEY_COLUMN_MAPPINGS> <OPTIONAL_MATCH_COLUMN_MAPPINGS> <EXPRESSION>
```
+ **REFERENCE\_DATASET\_ALIAS** – The alias of the reference dataset with which you compare data from the primary dataset.
+ **KEY\_COLUMN\_MAPPINGS** – A comma-separated list of column names that form a key in the datasets. If the column names are not the same in both datasets, you must separate them with a `->`.
+ **OPTIONAL\_MATCH\_COLUMN\_MAPPINGS** – You can supply this parameter if you want to check for matching data only in certain columns. It uses the same syntax as the key column mappings. If this parameter is not provided, the data in all remaining columns is matched. The remaining, non-key columns must have the same names in both datasets.
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Match set datasets using ID column**

The following example rule checks that more than 90% of the primary dataset matches the reference dataset, using the "ID" column to join the two datasets. It compares all columns in this case.

```
DatasetMatch "reference" "ID" >= 0.9
```

**Example: Match set datasets using multiple key columns**

In the following example, the primary dataset and the reference dataset have different names for the key columns. `ID_1` and `ID_2` together form a composite key in the primary dataset. `ID_ref1` and `ID_ref2` together form a composite key in the reference dataset. In this scenario, you can use the special syntax to supply the column names.

```
DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" >= 0.9
```

**Example: Match set datasets using multiple key columns and check that specific column matches**

This example builds on the previous example. We want to check that only the column containing the amounts match. This column is named `Amount1` in the primary dataset and `Amount2` in the reference dataset. You want an exact match.

```
DatasetMatch "reference" "ID_1->ID_ref1,ID_2->ID_ref2" "Amount1->Amount2" >= 0.9
```
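
A simplified sketch of the join-and-compare semantics (pure Python over lists of dicts; the real rule runs as a dataset join, so this is only an illustration):

```python
def dataset_match(primary, reference, key_map, match_cols):
    """Join on mapped keys, then compare mapped columns.

    primary/reference: lists of row dicts.
    key_map/match_cols: {primary_column: reference_column} mappings.
    Returns the fraction of primary rows with a fully matching reference row.
    """
    ref_index = {tuple(r[c] for c in key_map.values()): r for r in reference}
    matched = 0
    for row in primary:
        ref = ref_index.get(tuple(row[c] for c in key_map))
        if ref and all(row[p] == ref[q] for p, q in match_cols.items()):
            matched += 1
    return matched / len(primary)

primary = [{"ID_1": 1, "ID_2": "a", "Amount1": 10},
           {"ID_1": 2, "ID_2": "b", "Amount1": 20}]
reference = [{"ID_ref1": 1, "ID_ref2": "a", "Amount2": 10},
             {"ID_ref1": 2, "ID_ref2": "b", "Amount2": 99}]
ratio = dataset_match(primary, reference,
                      {"ID_1": "ID_ref1", "ID_2": "ID_ref2"},
                      {"Amount1": "Amount2"})
assert ratio == 0.5  # only the first row's amount matches
```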

# DistinctValuesCount
<a name="dqdl-rule-types-DistinctValuesCount"></a>

Checks the number of distinct values in a column against a given expression.

**Syntax**

```
DistinctValuesCount <COL_NAME> <EXPRESSION>
```
+ **COL\_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Distinct column value count**

The following example rule checks that the column named `State` contains more than 3 distinct values.

```
DistinctValuesCount "State" > 3
DistinctValuesCount "Customer_ID" < 6  where "Customer_ID < 10"
```

**Sample dynamic rules**
+ `DistinctValuesCount "colA" between avg(last(10))-1 and avg(last(10))+1`
+ `DistinctValuesCount "colA" <= index(last(10),2) + std(last(5))`

# Entropy
<a name="dqdl-rule-types-Entropy"></a>

Checks whether the *entropy* value of a column matches a given expression. Entropy measures the level of information that's contained in a message. Given the probability distribution over values in a column, entropy describes how many bits are required to identify a value.

**Syntax**

```
Entropy <COL_NAME> <EXPRESSION>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Column entropy**

The following example rule checks that the column named `Star_Rating` has an entropy value greater than one.

```
Entropy "Star_Rating" > 1
Entropy "First_Name" > 1  where "Customer_ID < 10"
```

 **Sample dynamic rules** 
+ `Entropy "colA" < max(last(10))`
+ `Entropy "colA" between min(last(10)) and max(last(10))`

# IsComplete
<a name="dqdl-rule-types-IsComplete"></a>

Checks whether all of the values in a column are complete (non-null). 

**Syntax**

```
IsComplete <COL_NAME>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type

**Example: Null values**

The following example checks whether all of the values in a column named `email` are non-null.

```
IsComplete "email"
IsComplete "Email" where "Customer_ID between 1 and 50"
IsComplete "Customer_ID"  where "Customer_ID < 16 and Customer_ID != 12"
IsComplete "passenger_count" where "payment_type<>0"
```

**Null behavior**

 Note on CSV data formats: blank values in CSV columns can display multiple behaviors. 
+  If a column is of `String` type, the blank value will be recognized as an empty string and will not fail the `Completeness` rule. 
+  If a column is of another data type, such as `Int`, the blank value will be recognized as `NULL` and will fail the `Completeness` rule. 
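This behavior can be sketched in a few lines of Python. The sketch is illustrative only; the parsing step mimics, rather than reproduces, how CSV readers type blank cells.

```python
def completeness(values):
    """Fraction of values that are non-null."""
    return sum(v is not None for v in values) / len(values)

blank_cell = ""                                   # a blank cell from a CSV file

as_string = blank_cell                            # String column: stays an empty string
as_int = int(blank_cell) if blank_cell else None  # Int column: becomes NULL

string_col = ["a@example.com", as_string, "b@example.com"]
int_col = [1, as_int, 3]

string_result = completeness(string_col)  # 1.0 -> IsComplete passes
int_result = completeness(int_col)        # 2/3 -> IsComplete fails
```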

# IsPrimaryKey
<a name="dqdl-rule-types-IsPrimaryKey"></a>

Checks whether a column contains a primary key. A column contains a primary key if all of the values in the column are unique and complete (non-null). You can also check for primary keys with multiple columns. 

**Syntax**

```
IsPrimaryKey <COL_NAME>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type

**Example: Primary key**

The following example rule checks whether the column named `Customer_ID` contains a primary key.

```
IsPrimaryKey "Customer_ID"
IsPrimaryKey "Customer_ID" where "Customer_ID < 10"
```

 **Example: Primary key with multiple columns. Any of the following examples are valid.** 

```
IsPrimaryKey "colA" "colB"
IsPrimaryKey "colA" "colB" "colC"
IsPrimaryKey colA "colB" "colC"
```

# IsUnique
<a name="dqdl-rule-types-IsUnique"></a>

Checks whether all of the values in a column are unique, and returns a Boolean value.

**Syntax**

```
IsUnique <COL_NAME>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type

**Examples**

The following example rule checks whether all of the values in a column named `email` are unique.

```
IsUnique "email"
IsUnique "Customer_ID" where "Customer_ID < 10"
```

The following example rule checks multiple columns. 

```
IsUnique "vendorid" "tpep_pickup_datetime"
```

# Mean
<a name="dqdl-rule-types-Mean"></a>

Checks whether the mean (average) of all the values in a column matches a given expression.

**Syntax**

```
Mean <COL_NAME> <EXPRESSION>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Average value**

The following example rule checks whether the average of all of the values in a column exceeds a threshold.

```
Mean "Star_Rating" > 3
Mean "Salary" < 6200 where "Customer_ID < 10"
```

 **Sample dynamic rules** 
+ `Mean "colA" > avg(last(10)) + std(last(2))`
+ `Mean "colA" between min(last(5)) - 1 and max(last(5)) + 1`

**Null behavior**

 The `Mean` rule will ignore rows with `NULL` values in the calculation of the mean. For example: 

```
+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+
```

 The mean of column `units` will be (0 + 20 + 40) / 3 = 20. Rows 101 and 103 are not considered in this calculation. 
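This null handling can be reproduced with a short sketch:

```python
def mean_ignoring_nulls(values):
    # Drop nulls before computing the mean, as the Mean rule does.
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

units = [0, None, 20, None, 40]      # the column from the table above
result = mean_ignoring_nulls(units)  # (0 + 20 + 40) / 3 = 20.0
```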

# ReferentialIntegrity
<a name="dqdl-rule-types-ReferentialIntegrity"></a>

Checks to what extent the values of a set of columns in the primary dataset are a subset of the values of a set of columns in a reference dataset.

**Syntax**

```
ReferentialIntegrity <PRIMARY_COLS> <REFERENCE_DATASET_COLS> <EXPRESSION>
```
+ **PRIMARY_COLS** – A comma-separated list of column names in the primary dataset.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **REFERENCE_DATASET_COLS** – This parameter contains two parts separated by a period. The first part is the alias of the reference dataset. The second part is the comma-separated list of column names in the reference dataset enclosed in braces.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Check the referential integrity of a zip code column**

The following example rule checks that at least 90% of the values in the `zipcode` column in the primary dataset are present in the `zipcode` column in the `reference` dataset.

```
ReferentialIntegrity "zipcode" "reference.zipcode" >= 0.9
```

**Example: Check the referential integrity of the city and state columns**

In the following example, columns containing city and state information exist in the primary dataset and the reference dataset. The names of the columns are different in both datasets. The rule checks if the set of values of the columns in the primary dataset is exactly equal to the set of values of the columns in the reference dataset.

```
ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 1.0
```
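The subset check described above can be sketched in plain Python (an illustration under invented data, not the AWS implementation):

```python
def referential_integrity(primary_rows, reference_rows, primary_cols, reference_cols):
    """Fraction of primary (composite) values present in the reference columns."""
    ref_values = {tuple(r[c] for c in reference_cols) for r in reference_rows}
    hits = sum(tuple(r[c] for c in primary_cols) in ref_values for r in primary_rows)
    return hits / len(primary_rows)

primary = [
    {"city": "Seattle", "state": "WA"},
    {"city": "Portland", "state": "OR"},
    {"city": "Atlantis", "state": "XX"},  # not present in the reference data
]
reference = [
    {"ref_city": "Seattle", "ref_state": "WA"},
    {"ref_city": "Portland", "ref_state": "OR"},
]

# In the spirit of:
# ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 1.0
ratio = referential_integrity(primary, reference,
                              ["city", "state"], ["ref_city", "ref_state"])
# ratio is 2/3, so the = 1.0 expression would fail.
```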

 **Sample dynamic rules** 
+ `ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" > avg(last(10))`
+ `ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" between min(last(10)) - 1 and max(last(10)) + 1`

# RowCount
<a name="dqdl-rule-types-RowCount"></a>

Checks the row count of a dataset against a given expression. In the expression, you can specify the number of rows or a range of rows using operators like `>` and `<`.

**Syntax**

```
RowCount <EXPRESSION>
```
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Row count numeric check**

The following example rule checks whether the row count is within a given range.

```
RowCount between 10 and 100
RowCount between 1 and 50 where "Customer_ID < 10"
```

 **Sample dynamic rules** 

```
RowCount > avg(last(10)) *0.8
```
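As a rough illustration of how such a dynamic expression could be evaluated: `last(10)` is read as the last 10 recorded values of the metric, and `avg` averages them. The history values below are invented, and `last`/`avg` are hypothetical stand-ins for the DQDL functions.

```python
def last(history, k):
    """Stand-in for DQDL's last(k): the k most recent metric values."""
    return history[-k:]

def avg(values):
    return sum(values) / len(values)

# Invented history of RowCount values from previous runs.
row_count_history = [980, 1010, 1005, 990, 1020, 1000, 995, 1015, 985, 1000]
current_row_count = 900

# RowCount > avg(last(10)) * 0.8
rule_passes = current_row_count > avg(last(row_count_history, 10)) * 0.8  # 900 > 800
```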

# RowCountMatch
<a name="dqdl-rule-types-RowCountMatch"></a>

Checks the ratio of the row count of the primary dataset and the row count of a reference dataset against the given expression.

**Syntax**

```
RowCountMatch <REFERENCE_DATASET_ALIAS> <EXPRESSION>
```
+ **REFERENCE_DATASET_ALIAS** – The alias of the reference dataset against which to compare row counts.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Row count check against a reference dataset**

The following example rule checks whether the row count of the primary dataset is at least 90% of the row count of the reference dataset.

```
RowCountMatch "reference" >= 0.9
```

# StandardDeviation
<a name="dqdl-rule-types-StandardDeviation"></a>

Checks the standard deviation of all of the values in a column against a given expression.

**Syntax**

```
StandardDeviation <COL_NAME> <EXPRESSION>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Standard deviation**

The following example rule checks whether the standard deviation of the values in the column named `Star_Rating` is less than a specified value.

```
StandardDeviation "Star_Rating" < 1.5
StandardDeviation "Salary" < 3500 where "Customer_ID < 10"
```

 **Sample dynamic rules** 
+ `StandardDeviation "colA" > avg(last(10)) + 0.1`
+ `StandardDeviation "colA" between min(last(10)) - 1 and max(last(10)) + 1`

**Null behavior**

 The `StandardDeviation` rule will ignore rows with `NULL` values in the calculation of standard deviation. For example: 

```
+---+-----------+-----------+
|id |units1     |units2     |
+---+-----------+-----------+
|100|0          |0          |
|101|null       |0          |
|102|20         |20         |
|103|null       |0          |
|104|40         |40         |
+---+-----------+-----------+
```

 The standard deviation of column `units1` does not consider rows 101 and 103 and results in 16.33. The standard deviation of column `units2` results in 16. 
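The worked example above can be reproduced with a short sketch, assuming a population standard deviation computed over the non-null values only:

```python
import math

def std_ignoring_nulls(values):
    """Population standard deviation over the non-null values only."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    variance = sum((v - mean) ** 2 for v in present) / len(present)
    return math.sqrt(variance)

units1 = [0, None, 20, None, 40]  # rows 101 and 103 ignored -> ~16.33
units2 = [0, 0, 20, 0, 40]        # all rows counted -> 16.0
```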

# Sum
<a name="dqdl-rule-types-Sum"></a>

Checks the sum of all the values in a column against a given expression.

**Syntax**

```
Sum <COL_NAME> <EXPRESSION>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Sum**

The following example rule checks whether the sum of all of the values in a column exceeds a given threshold.

```
Sum "transaction_total" > 500000
Sum "Salary" < 55600 where "Customer_ID < 10"
```

 **Sample dynamic rules** 
+ `Sum "ColA" > avg(last(10))`
+ `Sum "colA" between min(last(10)) - 1 and max(last(10)) + 1`

**Null behavior**

 The `Sum` rule will ignore rows with `NULL` values in the calculation of sum. For example: 

```
+---+-----------+
|id |units      |
+---+-----------+
|100|0          |
|101|null       |
|102|20         |
|103|null       |
|104|40         |
+---+-----------+
```

 The sum of column `units` will not consider rows 101 and 103 and will result in (0 + 20 + 40) = 60. 

# SchemaMatch
<a name="dqdl-rule-types-SchemaMatch"></a>

Checks if the schema of the primary dataset matches the schema of a reference dataset. The schema check is done column by column. The schema of two columns match if the names are identical and the types are identical. The order of the columns does not matter.

**Syntax**

```
SchemaMatch <REFERENCE_DATASET_ALIAS> <EXPRESSION>
```
+ **REFERENCE_DATASET_ALIAS** – The alias of the reference dataset against which to compare schemas.

  **Supported column types**: Byte, Decimal, Double, Float, Integer, Long, Short
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: SchemaMatch**

The following example rule checks whether the schema of the primary dataset exactly matches the schema of a reference dataset.

```
SchemaMatch "reference" = 1.0
```
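A simplified sketch of this column-by-column comparison, with schemas modeled as hypothetical name-to-type mappings:

```python
def schema_match(primary_schema, reference_schema):
    """Fraction of primary columns with an identically named and typed
    reference column; column order is ignored."""
    matches = sum(
        reference_schema.get(name) == col_type
        for name, col_type in primary_schema.items()
    )
    return matches / len(primary_schema)

primary_schema = {"id": "long", "name": "string", "amount": "double"}
reference_schema = {"name": "string", "id": "long", "amount": "decimal"}  # order differs

score = schema_match(primary_schema, reference_schema)  # 2 of 3 columns match
```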

# Uniqueness
<a name="dqdl-rule-types-Uniqueness"></a>

Checks the percentage of unique values in a column against a given expression. Unique values occur exactly once.

**Syntax**

```
Uniqueness <COL_NAME> <EXPRESSION>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example**

The following example rule checks whether the percentage of unique values in a column matches certain numeric criteria.

```
Uniqueness "email" = 1.0
Uniqueness "Customer_ID" != 1.0 where "Customer_ID < 10"
```

The following example rule checks multiple columns. 

```
Uniqueness "vendorid" "tpep_pickup_datetime" = 1
```

 **Sample dynamic rules** 
+ `Uniqueness "colA" between min(last(10)) and max(last(10))`
+ `Uniqueness "colA" >= avg(last(10))`

# UniqueValueRatio
<a name="dqdl-rule-types-UniqueValueRatio"></a>

Checks the *unique value ratio* of a column against a given expression. A unique value ratio is the fraction of unique values divided by the number of all distinct values in a column. Unique values occur exactly one time, while distinct values occur *at least* once. 

For example, the set `[a, a, b]` contains one unique value (`b`) and two distinct values (`a` and `b`). So the unique value ratio of the set is ½ = 0.5.

**Syntax**

```
UniqueValueRatio <COL_NAME> <EXPRESSION>
```
+ **COL_NAME** – The name of the column that you want to evaluate the data quality rule against.

  **Supported column types**: Any column type
+ **EXPRESSION** – An expression to run against the rule type response in order to produce a Boolean value. For more information, see [Expressions](dqdl.md#dqdl-syntax-rule-expressions).

**Example: Unique value ratio**

This example checks the unique value ratio of a column against a range of values.

```
UniqueValueRatio "test_score" between 0 and 0.5
UniqueValueRatio "Customer_ID" between 0 and 0.9 where "Customer_ID < 10"
```
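The definition above can be sketched as:

```python
from collections import Counter

def unique_value_ratio(values):
    counts = Counter(values)
    unique = sum(1 for c in counts.values() if c == 1)  # values occurring exactly once
    return unique / len(counts)                         # divided by distinct values

ratio = unique_value_ratio(["a", "a", "b"])  # 1 unique / 2 distinct = 0.5
```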

 **Sample dynamic rules** 
+ `UniqueValueRatio "colA" > avg(last(10))`
+ `UniqueValueRatio "colA" <= index(last(10),2) + std(last(5))`

# DetectAnomalies
<a name="dqdl-rule-types-DetectAnomalies"></a>

 Detects anomalies for a given data quality rule. Each execution of a DetectAnomalies rule saves the evaluated value for that rule. When enough data has been gathered, the anomaly detection algorithm takes all of the historical data for the rule and runs anomaly detection. A DetectAnomalies rule fails when an anomaly is detected. More information about the detected anomaly is available from Observations. 

**Syntax**

```
DetectAnomalies <RULE_NAME> <RULE_PARAMETERS>
```

 `RULE_NAME` – The name of the rule that you want to evaluate and detect anomalies for. Supported rules: 
+ "RowCount"
+ "Completeness"
+ "Uniqueness"
+ "Mean"
+ "Sum"
+ "StandardDeviation"
+ "Entropy"
+ "DistinctValuesCount"
+ "UniqueValueRatio"
+ "ColumnLength"
+ "ColumnValues"
+ "ColumnCorrelation"
+ "CustomSQL"
+ "ColumnCount"

 `RULE_PARAMETERS` – Some rules require additional parameters to run. Refer to the documentation for the given rule to see its required parameters. 

**Example: Anomalies for RowCount**

For example, if we want to detect RowCount anomalies, we provide RowCount as a rule name. 

```
DetectAnomalies "RowCount"
```

**Example: Anomalies for ColumnLength**

For example, if we want to detect ColumnLength anomalies, we provide ColumnLength as a rule name and the column name. 

```
DetectAnomalies "ColumnLength" "id"
```

# FileFreshness
<a name="dqdl-rule-types-FileFreshness"></a>

 FileFreshness ensures your data files are fresh based on the condition you provide. It uses your files' last modified time to ensure that your data files, or an entire folder, are up to date. 

 This rule gathers two metrics: 
+ FileFreshness compliance based on the rule you set up
+ The number of files that were scanned by the rule

```
Dataset.*.FileFreshness.Compliance: 1, Dataset.*.FileCount: 1
```

 Anomaly detection does not consider these metrics. 

 **Checking file freshness** 

 The following rule ensures that tickets.parquet was created in the past 24 hours. 

```
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/tickets.parquet" > (now() - 24 hours)
```

 **Checking folder freshness** 

 The following rule passes if all files in the folder were created or modified in the past 24 hours. 

```
FileFreshness "s3://bucket/" >= (now() - 1 days)
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" >= (now() - 24 hours)
```

 **Checking folder or file freshness with threshold** 

 The following rule passes if more than 10% of the files in the folder `tickets` were created or modified in the past 10 days. 

```
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > (now() - 10 days) with threshold > 0.1
```

 **Checking files or folders with specific dates** 

 You can check for file freshness for specific days. 

```
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > "2020-01-01"
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between "2023-01-01" and "2024-01-01"
```

 **Checking files or folders with time** 

 You can use FileFreshness to ensure that files have arrived based on certain times. 

```
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between (now() - 45 minutes) and now()
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between "9:30 AM" and "9:30 PM"
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > (now() - 10 minutes)
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > now()
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between (now() - 2 hours) and (now() + 15 minutes)
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between (now() - 3 days) and (now() + 15 minutes)
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between "2001-02-07" and (now() + 15 minutes)
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > "21:45"
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > "2024-01-01"
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between "02:30" and "04:30"
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" between "9:30 AM" and "22:15"
```

 Key considerations: 
+  FileFreshness can evaluate files using days, hours, and minutes units 
+  For times, it supports AM / PM and 24-hour 
+  Times are calculated in UTC unless an override is specified 
+  Dates are calculated in UTC at time 00:00 

 Time-based FileFreshness rules work as follows: 

```
FileFreshness "s3://amzn-s3-demo-bucket/artifacts/file/tickets/" > "21:45"
```
+  First, the time “21:45” is combined with today’s date in UTC format to create a date-time field 
+  Next, if a timezone override is specified, the date-time is converted to that timezone (otherwise UTC is used) 
+  Finally, the rule is evaluated 
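These steps can be sketched with the Python standard library (illustrative only; the real rule compares against the files' last-modified times in Amazon S3):

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

def freshness_cutoff(time_str, tz_override=None):
    """Combine a wall-clock time with today's date, in UTC or an
    overridden time zone, to form the comparison cutoff."""
    today = datetime.now(timezone.utc).date()
    tz = ZoneInfo(tz_override) if tz_override else timezone.utc
    return datetime.combine(today, time.fromisoformat(time_str), tzinfo=tz)

# FileFreshness "s3://path/" > "21:45" with timeZone = "America/New_York"
cutoff = freshness_cutoff("21:45", "America/New_York")
last_modified = datetime.now(timezone.utc)  # stand-in for a file's last-modified time
rule_passes = last_modified > cutoff
```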

 **Optional File-based Rule Tags:** 

 Tags allow you to control the rule behavior. 

 **recentFiles** 

 This tag limits the number of files processed, keeping only the most recent files. 

```
FileFreshness "s3://amzn-s3-demo-bucket/" between (now() - 100 minutes) and (now() + 10 minutes) with recentFiles = 1
```

 **uriRegex** 

**Note**  
 The `uriRegex` tag is available in AWS Glue 5.0 and later. 

 This tag filters files by applying a regex pattern to the file path. Only files whose paths match the pattern are processed. You can also use a negative lookahead to exclude files that match a pattern. 

```
# Match only files with a .csv extension
FileFreshness "s3://amzn-s3-demo-bucket/" > (now() - 24 hours) with uriRegex = "\.csv$"
# Match Parquet files that contain "orders_" in the path
FileFreshness "s3://amzn-s3-demo-bucket/" > (now() - 24 hours) with uriRegex = ".*orders_.*\.parquet"
# Exclude files ending in .tmp using a negative lookahead
FileFreshness "s3://amzn-s3-demo-bucket/" > (now() - 24 hours) with uriRegex = "(?!.*\.tmp$).*"
```

 **filterOrder** 

**Note**  
 The `filterOrder` tag is available in AWS Glue 5.0 and later. 

 When you use multiple filter tags such as `recentFiles` and `uriRegex` together, the `filterOrder` tag controls the order in which they are applied. The default order is `recentFiles` first, then `uriRegex`. 

```
FileFreshness "s3://amzn-s3-demo-bucket/" > (now() - 24 hours) with recentFiles = 1 with uriRegex = "inventory_" with filterOrder = ["uriRegex","recentFiles"]
```

 In the example above, the `uriRegex` filter is applied first to select only files matching "inventory_", and then `recentFiles = 1` takes the most recent file from that filtered set. Without `filterOrder`, the default behavior would take the single most recent file first and then apply the regex, which could result in no files matching if the most recent file doesn't match the pattern. 

**Note**  
 All values in the `filterOrder` list must reference other filter tags (`recentFiles` or `uriRegex`) that are also present on the same rule. Non-filter tags such as `timeZone` or `failFast` are not valid in `filterOrder`. 
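A sketch of how the two filters could compose, with `filterOrder` controlling the application order. The file paths and timestamps are invented, and this is not the AWS implementation:

```python
import re

def apply_filters(files, recent=None, uri_regex=None,
                  order=("recentFiles", "uriRegex")):
    """files: list of (path, last_modified) tuples; filters applied in order."""
    for tag in order:
        if tag == "recentFiles" and recent is not None:
            files = sorted(files, key=lambda f: f[1], reverse=True)[:recent]
        elif tag == "uriRegex" and uri_regex is not None:
            files = [f for f in files if re.search(uri_regex, f[0])]
    return files

files = [
    ("s3://bucket/inventory_2024.csv", 100),
    ("s3://bucket/orders_2024.csv", 200),  # most recent, but not an inventory file
]

# Default order: most recent file first, then regex -> nothing matches.
default = apply_filters(files, recent=1, uri_regex="inventory_")
# filterOrder = ["uriRegex","recentFiles"]: regex first, then most recent.
reordered = apply_filters(files, recent=1, uri_regex="inventory_",
                          order=("uriRegex", "recentFiles"))
```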

 **failFast** 

 When set to `"true"`, the rule returns failure immediately on the first file that fails the freshness condition, instead of evaluating all files and computing a compliance ratio. 

```
FileFreshness "s3://amzn-s3-demo-bucket/" > (now() - 24 hours) with failFast = "true"
```

 **timeZone** 

 The `timeZone` tag accepts a time zone override. See [Allowed Time Zones](https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html) for the supported time zones. 

```
FileFreshness "s3://path/" > "21:45" with timeZone = "America/New_York"
FileFreshness "s3://path/" > "21:45" with timeZone = "America/Chicago"
FileFreshness "s3://path/" > "21:45" with timeZone = "Europe/Paris"
FileFreshness "s3://path/" > "21:45" with timeZone = "Asia/Shanghai"
FileFreshness "s3://path/" > "21:45" with timeZone = "Australia/Darwin"
```

 **Inferring file names directly from data frames** 

 You don't always have to provide a file path. For instance, when you are authoring the rule in the AWS Glue Data Catalog, it may be hard to find which folders the catalog tables are using. AWS Glue Data Quality can find the specific folders or files used to populate your dataframe and can detect if they are fresh. 

**Note**  
 This feature will only work when files are successfully read into the DynamicFrame or DataFrame. 

```
FileFreshness > (now() - 24 hours)
```

 This rule will find the folder path or files that are used to populate the dynamic frame or data frame. This works for Amazon S3 paths or Amazon S3-based AWS Glue Data Catalog tables. There are a few considerations: 

1.  In AWS Glue ETL, you must have the **EvaluateDataQuality** Transform immediately after an Amazon S3 or AWS Glue Data Catalog transform.   
![\[The screenshot shows an Evaluate Data Quality node connected to an Amazon S3 node.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-quality-file-freshness.png)

1.  This rule will not work in AWS Glue Interactive Sessions. 

 In either of these cases, or when AWS Glue can’t find the files, AWS Glue throws the following error: `“Unable to parse file path from DataFrame”` 

# FileMatch
<a name="dqdl-rule-types-FileMatch"></a>

 The FileMatch rule allows you to compare files against other files or checksums. This can be useful in a few scenarios: 

1.  Validating files received from external sources: You can use FileMatch to ensure that you have received the correct files from external sources by comparing against checksums. This helps validate the integrity of the data you're ingesting. 

1.  Comparing data in two different folders: FileMatch can be used to compare files between two folders. 

 This rule gathers one metric: the number of files that were scanned by the rule. 

```
Dataset.*.FileCount: 1
```

 **Validate file with a checksum:** 

 FileMatch accepts a file and a set of checksums, and ensures that at least one checksum matches the file. 

```
FileMatch "s3://amzn-s3-demo-bucket/file.json" in ["3ee0d8617ac041793154713e5ef8f319"] with hashAlgorithm = "MD5"
FileMatch "s3://amzn-s3-demo-bucket/file.json" in ["3ee0d8617ac041793154713e5ef8f319"] with hashAlgorithm = "SHA-1"
FileMatch "s3://amzn-s3-demo-bucket/file.json" in ["3ee0d8617ac041793154713e5ef8f319"] with hashAlgorithm = "SHA-256"
FileMatch "s3://amzn-s3-demo-bucket/file.json" in ["3ee0d8617ac041793154713e5ef8f319"]
```

 The following standard algorithms are supported: 
+ MD5
+ SHA-1
+ SHA-256

 If you do not supply an algorithm, the default is SHA-256. 
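A sketch of the checksum comparison using Python's `hashlib` (the file content and digest below are invented):

```python
import hashlib

def file_matches(content, checksums, hash_algorithm="sha256"):
    """True if the content's digest appears in the allowed checksum set.
    SHA-256 is the default, mirroring the rule behavior described above."""
    digest = hashlib.new(hash_algorithm, content).hexdigest()
    return digest in checksums

data = b'{"ticket": 1}\n'
expected_md5 = hashlib.md5(data).hexdigest()

ok = file_matches(data, [expected_md5], hash_algorithm="md5")          # True
bad = file_matches(b"tampered", [expected_md5], hash_algorithm="md5")  # False
```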

 **Validate all files in a folder with a set of checksums:** 

```
FileMatch "s3://amzn-s3-demo-bucket/" in ["3ee0d8617ac041793154713e5ef8f319", "7e8617ac041793154713e5ef8f319"] with hashAlgorithm = "MD5"
FileMatch "s3://amzn-s3-demo-bucket/internal-folder/" in ["3ee0d8617ac041793154713e5ef8f319", "7e8617ac041793154713e5ef8f319"]
```

 **Compare files in different folders** 

```
# Compare all files across two buckets
FileMatch "s3://original_bucket/" "s3://archive_bucket/"
# Compare files within specific subfolders
FileMatch "s3://original_bucket/internal-folder/" "s3://original_bucket/other-folder/"
# Compare only .json files across two folders
FileMatch "s3://original_bucket/" "s3://archive_bucket/" with uriRegex = "\.json$"
# Compare only the 5 most recent .csv files
FileMatch "s3://original_bucket/" "s3://archive_bucket/" with recentFiles = 5 with uriRegex = "\.csv$" with filterOrder = ["uriRegex","recentFiles"]
```

 FileMatch checks the contents of the files in `original_bucket` and ensures that they match what’s in `archive_bucket`. The rule fails if they don’t match exactly. It can also check the contents of internal folders or individual files. 

 FileMatch can also check individual files against each other. 

```
FileMatch "s3://amzn-s3-demo-bucket/file_old.json" "s3://amzn-s3-demo-bucket/file_new.json"
```

 **Inferring file names directly from data frames** 

 You don't always have to provide a file path. For instance, when you are authoring the rule in the AWS Glue Data Catalog (backed by Amazon S3), it may be hard to find which folders the catalog tables are using. AWS Glue Data Quality can find the specific folders or files used to populate your data frame. 

**Note**  
 This feature will only work when files are successfully read into the DynamicFrame or DataFrame. 

```
FileMatch in ["3ee0d8617ac041793154713e5ef8f319"] with hashAlgorithm = "MD5"
FileMatch in ["3ee0d8617ac041793154713e5ef8f319"] with hashAlgorithm = "SHA-1"
FileMatch in ["3ee0d8617ac041793154713e5ef8f319"] with hashAlgorithm = "SHA-256"
FileMatch in ["3ee0d8617ac041793154713e5ef8f319"]
```

 If the supplied checksum is different than what’s computed, FileMatch will alert you to the difference. 

![\[The screenshot shows a rule with the DQ status of Rule failed. FileMatch explains the failure.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-quality-file-match.png)


 **Optional File-based Rule Tags:** 

 Tags allow you to control the rule behavior. 

 **recentFiles** 

 This tag limits the number of files processed, keeping only the most recent files. 

```
FileMatch "s3://bucket/" in ["3ee0d8617ac04179sam4713e5ef8f319"] with recentFiles = 1
```

 **uriRegex** 

**Note**  
 The `uriRegex` tag is available in AWS Glue 5.0 and later. 

 This tag filters files by applying a regex pattern to the file path. Only files whose paths match the pattern are processed. You can also use a negative lookahead to exclude files that match a pattern. 

```
# Match only files with a .json extension
FileMatch "s3://bucket/" in ["3ee0d8617ac04179sam4713e5ef8f319"] with uriRegex = "\.json$"
# Exclude files ending in .tmp using a negative lookahead
FileMatch "s3://bucket/" in ["3ee0d8617ac04179sam4713e5ef8f319"] with uriRegex = "(?!.*\.tmp$).*"
```

 **filterOrder** 

**Note**  
 The `filterOrder` tag is available in AWS Glue 5.0 and later. 

 When you use multiple filter tags such as `recentFiles` and `uriRegex` together, the `filterOrder` tag controls the order in which they are applied. The default order is `recentFiles` first, then `uriRegex`. 

```
FileMatch "s3://bucket/" in ["3ee0d8617ac04179sam4713e5ef8f319"] with recentFiles = 1 with uriRegex = "\.json$" with filterOrder = ["uriRegex","recentFiles"]
```

 **matchFileName** 

 This tag ensures that files don’t have duplicate names. Default behavior is false. 

```
FileMatch "s3://amzn-s3-demo-bucket/file.json" in ["3ee0d8617ac04179sam4713e5ef8f319"] with matchFileName = "true"
```

 There are a few considerations: 

1.  In AWS Glue ETL, you must have the **EvaluateDataQuality** Transform immediately after an Amazon S3 or AWS Glue Data Catalog transform.   
![\[The screenshot shows a rule with the DQ status of Rule failed. FileMatch explains the failure.\]](http://docs.aws.amazon.com/glue/latest/dg/images/data-quality-file-match-transform.png)

1.  This rule will not work in AWS Glue Interactive Sessions. 

# FileUniqueness
<a name="dqdl-rule-types-FileUniqueness"></a>

 FileUniqueness allows you to ensure that there are no duplicate files in the data you have received from your data producers. 

 It gathers the following data statistics: 

1.  The number of files that were scanned by the rule 

1.  The Uniqueness Ratio of the files 

```
Dataset.*.FileUniquenessRatio: 1.00, Dataset.*.FileCount: 8
```

 **Find duplicate files in a folder:** 

```
FileUniqueness "s3://bucket/" > 0.5
FileUniqueness "s3://bucket/folder/" = 1
```

 **Inferring folder names directly from data frames to detect duplicates:** 

 You don't always have to provide a file path. For instance, when you are authoring the rule in the AWS Glue Data Catalog, it may be hard to find which folders the catalog tables are using. AWS Glue Data Quality can find the specific folders or files used to populate your data frame. 

**Note**  
 When using inference, file-based rules can only detect files successfully read into the DynamicFrame or DataFrame. 

```
FileUniqueness > 0.5
```

 **Optional File-based Rule Tags:** 

 Tags allow you to control the rule behavior. 

 **recentFiles** 

 This tag limits the number of files processed, keeping only the most recent files. 

```
FileUniqueness "s3://amzn-s3-demo-bucket/" > 0.5 with recentFiles = 1
```

 **uriRegex** 

**Note**  
 The `uriRegex` tag is available in AWS Glue 5.0 and later. 

 This tag filters files by applying a regex pattern to the file path. Only files whose paths match the pattern are processed. You can also use a negative lookahead to exclude files that match a pattern. 

```
# Match only files with a .csv extension
FileUniqueness "s3://bucket/" > 0.5 with uriRegex = "\.csv$"
# Exclude files ending in .tmp using a negative lookahead
FileUniqueness "s3://bucket/" > 0.5 with uriRegex = "(?!.*\.tmp$).*"
```

 **filterOrder** 

**Note**  
 The `filterOrder` tag is available in AWS Glue 5.0 and later. 

 When you use multiple filter tags such as `recentFiles` and `uriRegex` together, the `filterOrder` tag controls the order in which they are applied. The default order is `recentFiles` first, then `uriRegex`. 

```
FileUniqueness "s3://bucket/" > 0.5 with recentFiles = 5 with uriRegex = "\.csv$" with filterOrder = ["uriRegex","recentFiles"]
```

 **matchFileName** 

 This tag checks that files don't have duplicate file names. The default value is `"false"`. 

```
FileUniqueness "s3://amzn-s3-demo-bucket/" > 0.5 with matchFileName = "true"
```

 There are a few considerations: 

1.  In AWS Glue ETL, you must place the **EvaluateDataQuality** transform immediately after an Amazon S3 or AWS Glue Data Catalog transform. 

1.  This rule will not work in AWS Glue Interactive Sessions. 

# FileSize
<a name="dqdl-rule-types-FileSize"></a>

 The FileSize rule type lets you verify that files meet file-size criteria. This is useful for the following use cases: 

1.  Ensure that producers are not sending empty or unexpectedly small files for processing. 

1.  Ensure that your target buckets don't contain many small files, which can lead to performance issues. 

 FileSize gathers the following metrics: 

1.  Compliance: the percentage of files that meet the rule threshold you have established 

1.  File Count: the number of files that were scanned by the rule 

1.  Minimum file size in bytes 

1.  Maximum file size in bytes 

```
Dataset.*.FileSize.Compliance: 1, Dataset.*.FileCount: 8, Dataset.*.MaximumFileSize: 327413121, Dataset.*.MinimumFileSize: 204558920
```

 Anomaly detection is not supported for these metrics. 

 **Validate size of files** 

 This rule passes when file.dat is larger than 2 MB. 

```
FileSize "s3://amzn-s3-demo-bucket/file.dat" > 2 MB
```

 The supported units are B (bytes), KB (kilobytes), MB (megabytes), GB (gigabytes), and TB (terabytes). 
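For example, size thresholds can be expressed in any of these units (the bucket and file names below are placeholders):

```
FileSize "s3://amzn-s3-demo-bucket/file.dat" > 512 B
FileSize "s3://amzn-s3-demo-bucket/file.dat" > 100 KB
FileSize "s3://amzn-s3-demo-bucket/file.dat" < 1 TB
```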

 **Validate size of files in folders** 

```
# Check that files in the bucket are larger than 5 bytes
FileSize "s3://bucket/" > 5 B
# Check that files in the bucket are smaller than 2 GB
FileSize "s3://bucket/" < 2 GB
```

 This rule passes if more than 70% of the files in s3://amzn-s3-demo-bucket are between 2 GB and 1 TB. 

```
FileSize "s3://amzn-s3-demo-bucket/" between 2 GB and 1 TB with threshold > 0.7
```

 **Inferring file names directly from data frames** 

 You don't always have to provide a file path. For instance, when you are authoring the rule in the Data Catalog, it may be hard to find which folders the catalog tables are using. AWS Glue Data Quality can find the specific folders or files used to populate your data frame. 

**Note**  
 This feature will only work when files are successfully read into the DynamicFrame or DataFrame. 

```
FileSize < 10 MB with threshold > 0.7
```

 **Optional File-based Rule Tags:** 

 Tags let you control the rule behavior. 

 **recentFiles** 

 This tag limits the number of files processed, keeping only the most recent files. For example, `recentFiles = 1` processes only the single most recent file. 

```
FileSize "s3://amzn-s3-demo-bucket/" > 5 B with recentFiles = 1
```

 **uriRegex** 

**Note**  
 The `uriRegex` tag is available in AWS Glue 5.0 and later. 

 This tag filters files by applying a regex pattern to the file path. Only files whose paths match the pattern are processed. You can also use a negative lookahead to exclude files that match a pattern. 

```
# Match only files with a .dat extension
FileSize "s3://bucket/" > 5 B with uriRegex = "\.dat$"
# Exclude files ending in .tmp using a negative lookahead
FileSize "s3://bucket/" > 5 B with uriRegex = "(?!.*\.tmp$).*"
```

 **filterOrder** 

**Note**  
 The `filterOrder` tag is available in AWS Glue 5.0 and later. 

 When you use multiple filter tags such as `recentFiles` and `uriRegex` together, the `filterOrder` tag controls the order in which they are applied. The default order is `recentFiles` first, then `uriRegex`. 

```
FileSize "s3://bucket/" > 5 B with recentFiles = 5 with uriRegex = "\.dat$" with filterOrder = ["uriRegex","recentFiles"]
```

 **failFast** 

 When set to `"true"`, the rule returns failure immediately on the first file that fails the size condition, instead of evaluating all files and computing a compliance ratio. 

```
FileSize "s3://bucket/" > 2 MB with failFast = "true"
```

 There are a few considerations: 

1.  In AWS Glue ETL, you must place the **EvaluateDataQuality** transform immediately after the Amazon S3 or AWS Glue Data Catalog transform. 

1.  This rule will not work in AWS Glue Interactive Sessions. 