

# Creating, running, and scheduling AWS Glue DataBrew jobs

AWS Glue DataBrew has a job subsystem that serves two purposes:

1. Applying a data transformation recipe to a DataBrew dataset. You do this with a DataBrew recipe job.

1. Analyzing a dataset to create a comprehensive profile of the data. You do this with a DataBrew profile job.

**Topics**
+ [Creating and working with AWS Glue DataBrew recipe jobs](jobs.recipe.md)
+ [Creating and working with AWS Glue DataBrew profile jobs](jobs.profile.md)

# Creating and working with AWS Glue DataBrew recipe jobs

Use a DataBrew *recipe job* to clean and normalize the data in a DataBrew dataset and write the result to an output location of your choice. Running a recipe job doesn't affect the dataset or the underlying source data. When a job runs, it connects to the source data in a read-only fashion. The job output is written to an output location that you define in Amazon S3, the AWS Glue Data Catalog, or a supported JDBC database.

Use the following procedure to create a DataBrew recipe job.

**To create a recipe job**

1. Sign in to the AWS Management Console and open the DataBrew console at [https://console.aws.amazon.com/databrew/](https://console.aws.amazon.com/databrew/).

1. Choose **JOBS** from the navigation pane, choose the **Recipe jobs** tab, and then choose **Create job**.

1. Enter a name for your job, and then choose **Create a recipe job**.

1. For **Job input**, enter details on the job that you want to create: the name of the dataset to be processed, and the recipe to use.

   A recipe job uses a DataBrew recipe to transform a dataset. To use a recipe, make sure to publish it first.

1. Configure your job output settings.

   Provide a destination for your job output. If you don't have a DataBrew connection configured for your output destination, configure it first on the **DATASETS** tab as described in [Supported connections for data sources and outputs](supported-data-connection-sources.md). Choose one of the following output destinations: 
   + Amazon S3, with or without AWS Glue Data Catalog support
   + Amazon Redshift, with or without AWS Glue Data Catalog support
   + JDBC
   + Snowflake tables
   + Amazon RDS database tables with AWS Glue Data Catalog support, for the following database engines: 
     + Amazon Aurora
     + MySQL
     + Oracle
     + PostgreSQL
     + Microsoft SQL Server

   For AWS Glue Data Catalog output based on AWS Lake Formation, DataBrew supports only replacing existing files. In this approach, the files are replaced to keep your existing Lake Formation permissions intact for your data access role. Also, DataBrew gives precedence to the Amazon S3 location from the AWS Glue Data Catalog table. Thus, you can't override the Amazon S3 location when creating a recipe job. 

   In some cases, the Amazon S3 location in the job output differs from the Amazon S3 location in the Data Catalog table. In these cases, DataBrew updates the job definition automatically with the Amazon S3 location from the catalog table. It does this when you update or start your existing jobs.

1. For Amazon S3 output destinations only, you have further choices:

   1. Choose one of the available data output formats for Amazon S3, optional compression, and an optional custom delimiter. Supported delimiters for output files are the same as those for input: comma, colon, semicolon, pipe, tab, caret, backslash, and space. For formatting details, see the following table.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/databrew/latest/dg/jobs.recipe.html)

   1. <a name="singlefileoutput"></a>Choose whether to output a single file or multiple files. There are three options for file output with Amazon S3:
      + **Autogenerate files (recommended)** – Has DataBrew determine the optimal number of output files.
      + **Single file output** – Causes a single output file to be generated. This option might result in additional job execution time because post-processing is required.
      + **Multiple file output** – Lets you specify the number of files for your job output. Valid values are 2–999. Fewer files than you specify might be output if column partitioning is used or if the number of rows in the output is fewer than the number of files that you specify.

   1. <a name="columnpartioning"></a>(Optional) Choose column partitioning for recipe job output. 

      Column partitioning provides another way to partition your recipe job output into multiple files. Column partitioning can be used with new or existing Amazon S3 output or with new Data Catalog Amazon S3 output. It can't be used with existing Data Catalog Amazon S3 tables. The output files are based on the values of the columns that you specify. If the column names that you specify are unique, the resulting Amazon S3 folder paths are based on the order of the column names.

      For an example of column partitioning, see [Example of column partitioning](#jobs.recipe-column-partition-example), following.

1. (Optional) Choose **Enable encryption for job output** to encrypt the job output that DataBrew writes to your output location, and then choose the encryption method:
   + **Use SSE-S3 encryption** – The output is encrypted using server-side encryption with Amazon S3–managed encryption keys.
   + **Use AWS Key Management Service (AWS KMS)** – The output is encrypted using AWS KMS. To use this option, choose the Amazon Resource Name (ARN) of the AWS KMS key that you want to use. If you don't have an AWS KMS key, you can create one by choosing **Create an AWS KMS key**.

1. For **Access permissions**, choose an AWS Identity and Access Management (IAM) role that allows DataBrew to write to your output location. For a location owned by your AWS account, you can choose the `AwsGlueDataBrewDataAccessRole` service-managed role. Doing this allows DataBrew to access AWS resources that you own.

1. On the **Advanced job settings** pane, you can choose more options for how your job is to run:
   + **Maximum number of units** – DataBrew processes jobs using multiple compute nodes, running in parallel. The default number of nodes is 5. The maximum number of nodes is 149.
   + **Job timeout** – If a job runs longer than the number of minutes that you set here, it fails with a timeout error. The default value is 2,880 minutes, or 48 hours.
   + **Number of retries** – If a job fails while running, DataBrew can try to run it again. By default, the job isn't retried.
   + **Enable Amazon CloudWatch Logs for job** – Allows DataBrew to publish diagnostic information to CloudWatch Logs. These logs can be useful for troubleshooting purposes, or for more details on how the job is processed.

1. For **Schedule jobs**, you can apply a DataBrew job schedule so that your job runs at a particular time, or on a recurring basis. For more information, see [Automating job runs with a schedule](#jobs.scheduling).

1. When the settings are as you want them, choose **Create job**. Or, if you want to run the job immediately, choose **Create and run job**.

You can monitor your job's progress by checking its status while the job is running. When the job run is complete, the status changes to **Succeeded**. The job output is now available at your chosen output location.

DataBrew saves your job definition, so that you can run the same job later. To rerun a job, choose **Jobs** from the navigation pane. Choose the job that you want to work with, and then choose **Run job**.
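You can also create and run a recipe job programmatically with the AWS CLI or an AWS SDK. The following is a minimal sketch using the AWS SDK for Python (Boto3); the job name, dataset, recipe, bucket, and role ARN are placeholder values, and only a subset of the settings described in the procedure is shown.

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job (placeholder names; the recipe must already be published).
databrew.create_recipe_job(
    Name="my-recipe-job",                       # hypothetical job name
    DatasetName="my-dataset",                   # existing DataBrew dataset
    RecipeReference={
        "Name": "my-recipe",
        "RecipeVersion": "1.0",                 # published recipe version
    },
    RoleArn="arn:aws:iam::111122223333:role/AwsGlueDataBrewDataAccessRole",
    Outputs=[
        {
            "Location": {"Bucket": "amzn-s3-demo-bucket", "Key": "output-folder/"},
            "Format": "CSV",
            "Overwrite": True,
        }
    ],
    MaxCapacity=5,            # maximum number of compute units
    Timeout=2880,             # job timeout, in minutes (48 hours)
    MaxRetries=0,             # by default, the job isn't retried
    LogSubscription="ENABLE", # publish diagnostic logs to CloudWatch Logs
)

# Run the job immediately instead of waiting for a schedule.
databrew.start_job_run(Name="my-recipe-job")
```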

## Example of column partitioning


As an example of column partitioning, assume that you specify three columns, each of which contains one of two possible values. The `Dept` column can have the value `Admin` or `Eng`. The `Staff-type` column can have the value `Part-time` or `Full-time`. The `Location` column can have the value `Office1` or `Office2`. The Amazon S3 folder paths for your job output look something like the following.

```
s3://bucket/output-folder/Dept=Admin/Staff-type=Part-time/Location=Office1/jobId_timestamp_part0001.csv
s3://bucket/output-folder/Dept=Admin/Staff-type=Part-time/Location=Office2/jobId_timestamp_part0002.csv
s3://bucket/output-folder/Dept=Admin/Staff-type=Full-time/Location=Office1/jobId_timestamp_part0003.csv
s3://bucket/output-folder/Dept=Admin/Staff-type=Full-time/Location=Office2/jobId_timestamp_part0004.csv
s3://bucket/output-folder/Dept=Eng/Staff-type=Part-time/Location=Office1/jobId_timestamp_part0005.csv
s3://bucket/output-folder/Dept=Eng/Staff-type=Part-time/Location=Office2/jobId_timestamp_part0006.csv
s3://bucket/output-folder/Dept=Eng/Staff-type=Full-time/Location=Office1/jobId_timestamp_part0007.csv
s3://bucket/output-folder/Dept=Eng/Staff-type=Full-time/Location=Office2/jobId_timestamp_part0008.csv
```
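If you create the job through an SDK instead of the console, the same partitioning can be requested by listing the partition columns on the Amazon S3 output. The following is a sketch of a Boto3 `Outputs` entry; the bucket, key, and column names are taken from the example above and are illustrative only.

```python
# Sketch of an Outputs entry that partitions job output by the three example
# columns. Pass it in the Outputs list of create_recipe_job or update_recipe_job.
output_with_partitions = {
    "Location": {"Bucket": "bucket", "Key": "output-folder/"},
    "Format": "CSV",
    "Overwrite": True,
    "PartitionColumns": ["Dept", "Staff-type", "Location"],
}
```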

## Automating job runs with a schedule


You can rerun DataBrew jobs at any time and also automate DataBrew job runs with a schedule. 

**To rerun a DataBrew job**

1. Sign in to the AWS Management Console and open the DataBrew console at [https://console.aws.amazon.com/databrew/](https://console.aws.amazon.com/databrew/). 

1. On the navigation pane, choose **Jobs**. Choose the job that you want to run, and then choose **Run job**.

To run a DataBrew job at a particular time, or on a recurring basis, create a DataBrew job schedule. You can then set up your job to run according to the schedule.

**To create a DataBrew job schedule**

1. On the DataBrew console's navigation pane, choose **Jobs**. Choose the **Schedules** tab, and choose **Add schedule**.

1. Enter a name for your schedule, and then choose a value for **Run frequency**:
   + **Recurring** – Choose how frequently you want the job to run (for example, every 12 hours). Then choose which day or days to run the job on. Optionally, you can enter the time of day when the job runs.
   + **At a particular time** – Enter the time of day when you want the job to run. Then choose which day or days to run the job on. 
   + **Enter CRON** – Define the job schedule by entering a valid cron expression. For more information, see [Working with cron expressions for recipe jobs](#jobs.cron).

1. When the settings are as you want them, choose **Save**.

**To associate a job with a schedule**

1. On the navigation pane, choose **Jobs**.

1. Choose the job that you want to work with, and then for **Actions**, choose **Edit**.

1. On the **Schedule jobs** pane, choose **Associate schedule**. Choose the name of the schedule that you want to use.

1. When the settings are as you want them, choose **Save**.
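You can also create a schedule and associate jobs with it programmatically. The following Boto3 sketch assumes a job named `my-recipe-job` already exists; the schedule name is a placeholder, and the `cron(...)` wrapper around the expression is an assumption, so check the CreateSchedule API reference for the exact format that DataBrew expects.

```python
import boto3

databrew = boto3.client("databrew")

# Create a schedule and associate an existing job with it in one call.
databrew.create_schedule(
    Name="daily-1215-utc",                 # hypothetical schedule name
    JobNames=["my-recipe-job"],            # jobs to run on this schedule
    CronExpression="cron(15 12 * * ? *)",  # 12:15 PM UTC every day (format assumed)
)

# To associate more jobs later, update the schedule with a new job list.
databrew.update_schedule(
    Name="daily-1215-utc",
    JobNames=["my-recipe-job", "my-profile-job"],
    CronExpression="cron(15 12 * * ? *)",
)
```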

## Working with cron expressions for recipe jobs


Cron expressions have six required fields, which are separated by white space. The syntax is as follows.

```
Minutes Hours Day-of-month Month Day-of-week Year
```

In the preceding syntax, the following values and wildcards are used for the indicated fields.


| **Fields** | **Values** | **Wildcards** | 
| --- | --- | --- | 
|  Minutes  |  0–59  |  , - \* /  | 
|  Hours  |  0–23  |  , - \* /  | 
|  Day-of-month  |  1–31  |  , - \* ? / L W  | 
|  Month  |  1–12 or JAN-DEC  |  , - \* /  | 
|  Day-of-week  |  1–7 or SUN-SAT  |  , - \* ? / L  | 
|  Year  |  1970–2199  |  , - \* /  | 

Use these wildcards as follows:
+ The **,** (comma) wildcard includes additional values. In the `Month` field, `JAN,FEB,MAR` includes January, February, and March.
+ The **-** (dash) wildcard specifies ranges. In the `Day-of-month` field, `1-15` includes days 1 through 15 of the specified month.
+ The **\*** (asterisk) wildcard includes all values in the field. In the `Hours` field, **\*** includes every hour.
+ The **/** (slash) wildcard specifies increments. In the `Minutes` field, you can enter **1/10** to specify every 10th minute, starting from the first minute of the hour (for example, the 11th, 21st, and 31st minute).
+ The **?** (question mark) wildcard specifies one or another. For example, suppose that in the `Day-of-month` field you enter **7**. If you don't care what day of the week the seventh is, you can enter **?** in the `Day-of-week` field.
+ The **L** wildcard in the `Day-of-month` or `Day-of-week` field specifies the last day of the month or week.
+ The **W** wildcard in the `Day-of-month` field specifies a weekday. In the `Day-of-month` field, `3W` specifies the weekday closest to the third day of the month.

These fields and values have the following limitations: 
+ You can't specify the `Day-of-month` and `Day-of-week` fields in the same cron expression. If you specify a value in one of the fields, you must use a **?** (question mark) in the other.
+ Cron expressions that lead to rates faster than 5 minutes aren't supported. 

When creating a schedule, you can use the following sample cron strings.


| Minutes | Hours | Day of month | Month | Day of week | Year | Meaning | 
| --- | --- | --- | --- | --- | --- | --- | 
|  0  |  10  |  \*  |  \*  |  ?  |  \*  |  Run at 10:00 AM (UTC) every day  | 
|  15  |  12  |  \*  |  \*  |  ?  |  \*  |  Run at 12:15 PM (UTC) every day  | 
|  0  |  18  |  ?  |  \*  |  MON-FRI  |  \*  |  Run at 6:00 PM (UTC) every Monday through Friday  | 
|  0  |  8  |  1  |  \*  |  ?  |  \*  |  Run at 8:00 AM (UTC) on the first day of every month  | 
|  0/15  |  \*  |  \*  |  \*  |  ?  |  \*  |  Run every 15 minutes  | 
|  0/10  |  \*  |  ?  |  \*  |  MON-FRI  |  \*  |  Run every 10 minutes Monday through Friday  | 
|  0/5  |  8-17  |  ?  |  \*  |  MON-FRI  |  \*  |  Run every 5 minutes Monday through Friday between 8:00 AM and 5:55 PM (UTC)  | 

For example, you can use the following cron expression to run a job every day at 12:15 UTC.

```
15 12 * * ? *
```

## Deleting jobs and job schedules


If you no longer need a job or job schedule, you can delete it.

**To delete a job**

1. On the navigation pane, choose **Jobs**.

1. Choose the job that you want to delete, and then for **Actions**, choose **Delete**.

**To delete a job schedule**

1. On the navigation pane, choose **Jobs**, and then choose the **Schedules** tab.

1. Choose the schedule that you want to delete, and then for **Actions**, choose **Delete**.
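The same cleanup can be done with an SDK. A brief Boto3 sketch, reusing the placeholder job and schedule names from the earlier examples:

```python
import boto3

databrew = boto3.client("databrew")

# Delete a job and a schedule that are no longer needed (placeholder names).
databrew.delete_job(Name="my-recipe-job")
databrew.delete_schedule(Name="daily-1215-utc")
```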

# Creating and working with AWS Glue DataBrew profile jobs

*Profile jobs* run a series of evaluations on a dataset and output the results to Amazon S3. The information that data profiling gathers helps you understand your dataset and decide what kind of data preparation steps you might want to run in your recipe jobs. 

The simplest way to run a profile job is using the default DataBrew settings. You can configure your profile job before running it so that it returns just the information that you want.

Use the following procedure to create a DataBrew profile job.

**To create a profile job**

1. Sign in to the AWS Management Console and open the DataBrew console at [https://console.aws.amazon.com/databrew/](https://console.aws.amazon.com/databrew/).

1. Choose **JOBS** from the navigation pane, choose the **Profile jobs** tab, and then choose **Create job**.

1. Enter a name for your job, and then choose **Create a profile job**.

1. For **Job input**, provide the name of the dataset to be profiled.

1. (Optional) Configure the following on the **Data profile configurations** pane:
   + **Dataset level configurations** – Configure details of your profile job for all columns in your dataset. 

     Optionally, you can turn on the ability to detect and count duplicate rows in the dataset. You can also choose **Enable correlations matrix** and select columns to see how closely the values in multiple columns are related. For details of the statistics that you can configure at the dataset level, see [Configurable statistics at the dataset level](profile.configuration.md#statistics.table01). You can configure statistics on the DataBrew console, or using the DataBrew API or AWS SDKs.
   + **Column level configurations** – Using **Default profile configuration settings**, you can select the columns to include in your profile job. Use **Add configuration override** to select the columns for which to limit the number of statistics gathered, or override the default configuration of certain statistics. For details of the statistics that you can configure at the column level, see [Configurable statistics at the column level](profile.configuration.md#statistics.table02). You can configure statistics on the DataBrew console, or using the DataBrew API or AWS SDKs.

     Be sure that any configuration overrides that you specify apply to columns that you included in your profile job. If there are conflicts between different overrides that you configured for a column, the last conflicting override has priority.

1. (Optional) For **Data quality rules**, you can create and apply additional rulesets associated with this dataset, or remove rulesets that are already applied. For more information on data quality validation, see [Validating data quality in AWS Glue DataBrew](profile.data-quality-rules.md).

1. On the **Advanced job settings** pane, you can choose more options for how your job is to run:
   + **Maximum number of units** – DataBrew processes jobs using multiple compute nodes, running in parallel. The default number of nodes is 5. The maximum number of nodes is 149.
   + **Job timeout** – If a job runs longer than the number of minutes that you set here, it fails with a timeout error. The default value is 2,880 minutes, or 48 hours.
   + **Number of retries** – If a job fails while running, DataBrew can try to run it again. By default, the job isn't retried.
   + **Enable Amazon CloudWatch Logs for job** – Allows DataBrew to publish diagnostic information to CloudWatch Logs. These logs can be useful for troubleshooting purposes, or for more details on how the job is processed.

1. For **Associated Schedule**, you can apply a DataBrew job schedule so that your job runs at a particular time, or on a recurring basis. For more information, see [Automating job runs with a schedule](jobs.recipe.md#jobs.scheduling).

1. When the settings are as you want them, choose **Create job**. Or, if you want to run the job immediately, choose **Create and run job**.
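To create a profile job programmatically instead, the AWS CLI and SDKs expose equivalent settings. A minimal Boto3 sketch follows; the job name, dataset, bucket, and role ARN are placeholders, and because no `Configuration` is passed, the default profile configuration is used.

```python
import boto3

databrew = boto3.client("databrew")

# Create a profile job with default profile settings (placeholder names).
databrew.create_profile_job(
    Name="my-profile-job",
    DatasetName="my-dataset",
    OutputLocation={"Bucket": "amzn-s3-demo-bucket", "Key": "profile-output/"},
    RoleArn="arn:aws:iam::111122223333:role/AwsGlueDataBrewDataAccessRole",
    MaxCapacity=5,            # maximum number of compute units
    Timeout=2880,             # job timeout, in minutes
    MaxRetries=0,
    LogSubscription="ENABLE", # publish diagnostic logs to CloudWatch Logs
)

# Run the profile job right away.
databrew.start_job_run(Name="my-profile-job")
```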

# Building a profile job configuration programmatically in AWS Glue DataBrew

In this section, you can find descriptions of profile job steps and functions that you can use programmatically. You can use them either from the AWS Command Line Interface (AWS CLI) or by using one of the AWS SDKs.

In a profile job, you can customize a configuration to control how DataBrew evaluates your dataset. You can apply the configuration to a dataset or apply it to particular columns. You can build the configuration when creating a profile job, and then update it anytime.

A profile configuration structure includes four parts:
+ [ProfileColumns section](#profile-columns.statistics)
+ [DatasetStatisticsConfiguration section](#profile-dataset-stats-config)
+ [ColumnStatisticsConfigurations section](#profile-column-stats-config)
+ [EntityDetectorConfiguration section for configuring PII](#entity-detector-configuration)

Following is an example.

```
{
    "ProfileColumns": [
        {
            "Name": "example"
        },
        {
            "Regex": "example.*"
        }
    ],
    "DatasetStatisticsConfiguration": {
        "IncludedStatistics": [
            "CORRELATION"
        ],
        "Overrides": [
            {
                "Statistic": "CORRELATION",
                "Parameters": {
                    "columnSelectors": "[{\"name\":\"example\"}, {\"regex\":\"example.*\"}]"
                }
            }
        ]
    },
    "ColumnStatisticsConfigurations": [
        {
            "Selectors": [
                {
                    "Name": "example"
                }
            ],
            "Statistics": {
                "IncludedStatistics": [
                    "CORRELATION",
                    "DUPLICATE_ROWS_COUNT"
                ],
                "Overrides": [
                    {
                        "Statistic": "VALUE_DISTRIBUTION",
                        "Parameters": {
                            "binNumber": "10"
                        }
                    }
                ]
            }
        }
    ]
}
```
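To apply a configuration like the one above, pass it as the `Configuration` parameter when creating or updating a profile job. The following Boto3 sketch reuses part of the example structure as a Python dictionary; the job name, bucket, and role ARN are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# Part of the example profile configuration, expressed as a Python dictionary.
profile_configuration = {
    "ProfileColumns": [{"Name": "example"}, {"Regex": "example.*"}],
    "DatasetStatisticsConfiguration": {
        "IncludedStatistics": ["CORRELATION"],
        "Overrides": [
            {
                "Statistic": "CORRELATION",
                "Parameters": {
                    "columnSelectors": '[{"name":"example"}, {"regex":"example.*"}]'
                },
            }
        ],
    },
}

# Update an existing profile job (placeholder names) with the configuration.
databrew.update_profile_job(
    Name="my-profile-job",
    OutputLocation={"Bucket": "amzn-s3-demo-bucket", "Key": "profile-output/"},
    RoleArn="arn:aws:iam::111122223333:role/AwsGlueDataBrewDataAccessRole",
    Configuration=profile_configuration,
)
```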

## ProfileColumns section


In the `ProfileColumns` section of your structure, set the columns from your dataset that you want to evaluate in your profile job. `ProfileColumns` is a list of column selectors (`Selectors`). You can specify either a column name or a regular expression in a column selector. An example follows.

```
"ProfileColumns": [{"Name": "example"}, {"Regex": "example.*"}]
```

When `ProfileColumns` is specified, only columns whose names match a name or regular expression in `ProfileColumns` are included in the profile job. If the profile job doesn't support a selected column's data type, DataBrew skips the selected column during the job run.

If `ProfileColumns` is undefined, the profile job evaluates all supported columns. Supported columns are columns containing data of a supported data type: `ByteType`, `ShortType`, `IntegerType`, `LongType`, `FloatType`, `DoubleType`, `String`, or `Boolean`.

## DatasetStatisticsConfiguration section


In the `DatasetStatisticsConfiguration` section of your structure, you can build a configuration for intercolumn evaluations. The configuration includes `IncludedStatistics` and `Overrides`. An example follows.

```
"DatasetStatisticsConfiguration": {
    "IncludedStatistics": ["CORRELATION"],
    "Overrides": [
        {
            "Statistic": "CORRELATION",
            "Parameters": {
                "columnSelectors": "[{\"name\":\"example\"}, {\"regex\":\"example.*\"}]"
            }
        }
    ]
}
```

You can select the evaluations that you want by adding evaluation names to `IncludedStatistics`. An example follows.

```
"IncludedStatistics": ["CORRELATION", "DUPLICATE_ROWS_COUNT"]
```

When you specify `IncludedStatistics`, only evaluations in the list are included in the profile job. If `IncludedStatistics` is undefined, the profile job runs all supported evaluations with default settings. You can exclude all evaluations by adding `NONE` to `IncludedStatistics`. An example follows.

```
"IncludedStatistics": ["NONE"]
```

### Configurable statistics at the dataset level


In the `DatasetStatisticsConfiguration` section of your structure, a profile job supports the evaluations shown in the table following.


| **Statistic name** | **Description** | **Supported data types** | **Default status** | **Attributes of profile result** | **Type of profile result** | 
| --- | --- | --- | --- | --- | --- | 
| DUPLICATE\_ROWS\_COUNT | Count of duplicate rows in the dataset | all | Enabled | duplicateRowsCount | Int | 
| CORRELATION | Pearson correlation coefficient between two columns | number | Enabled | correlations (in each selected column) | Object | 

In `Overrides`, you can override each evaluation's default settings by adding an override. Each override includes the name of a particular evaluation and a parameter map.

In `DatasetStatisticsConfiguration`, a profile job supports the `CORRELATION` override. This override calculates the Pearson Correlation Coefficient between two columns from a list of selected columns. The default setting is selecting the first 10 numeric columns. You can specify either a number of columns or a list of column selectors to override the default setting.

`CORRELATION` takes these parameters:
+ `columnNumber` – The number of numeric columns. The profile job selects the first *n* columns from the dataset. This value should be greater than 1. Use `"ALL"` to select all numeric columns.
+ `columnSelectors` – A list of column selectors. Each selector can have either a column name or a regular expression.

An example follows.

```
{
    "Statistic": "CORRELATION",
    "Parameters": {
        "columnSelectors": "[{\"name\":\"example\"}, {\"regex\":\"example.*\"}]"
    }
}
```
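As a variant, the override can specify `columnNumber` instead of naming columns. The following sketch, written as a Python dictionary for use with the SDK, selects the first 20 numeric columns and is based on the parameter description above; the count is an illustrative value.

```python
# CORRELATION override that selects the first 20 numeric columns
# (parameter values are strings, matching the other examples in this guide).
correlation_override = {
    "Statistic": "CORRELATION",
    "Parameters": {"columnNumber": "20"},
}
```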

## ColumnStatisticsConfigurations section


In the `ColumnStatisticsConfigurations` section of your structure, you can build configurations for particular columns. `ColumnStatisticsConfigurations` is a list of `ColumnStatisticsConfiguration` settings. In `ColumnStatisticsConfiguration`, there are `Selectors`, a list of column selectors, and `Statistics` for the configuration of statistics. An example follows.

```
{
    "Selectors": [
        {"Name": "example"}
    ],
    "Statistics": {
        "IncludedStatistics": ["CORRELATION", "DUPLICATE_ROWS_COUNT"],
        "Overrides": [
            {
                "Statistic": "VALUE_DISTRIBUTION",
                "Parameters": {
                    "binNumber": "10"
                }
            }
        ]
    }
}
```

`Selectors` is a list of column selectors. As with `ProfileColumns`, you can specify either a column name or a regular expression in each column selector. When you specify `Selectors`, the column configuration is applied to columns that match any column selector in `Selectors`. Otherwise, the configuration is applied to all supported columns.

In `Statistics`, you can override settings of selected columns. As with `DatasetStatisticsConfiguration`, `Statistics` has `IncludedStatistics` and `Overrides`.

To select the evaluations that you want, add evaluation names to `IncludedStatistics`. 

```
"IncludedStatistics": ["CORRELATION", "DUPLICATE_ROWS_COUNT"]                    
```

When you specify `IncludedStatistics`, only evaluations in the list are included in the profile job. Otherwise, the profile job runs all supported evaluations with default settings.

You can exclude all evaluations by adding `NONE` to `IncludedStatistics`.

```
"IncludedStatistics": ["NONE"]                    
```

 In some cases, there might be multiple configurations in `ColumnStatisticsConfigurations` that have different `IncludedStatistics` that you can apply to the same column. In these cases, the profile job picks the last configuration in `ColumnStatisticsConfigurations` and applies its `IncludedStatistics` to the selected column. A new configuration overrides older configurations.

### Configurable statistics at the column level


In `ColumnStatisticsConfigurations`, a profile job supports the evaluations shown in the table following.

A supported data type of `number` in this table means that the attribute's data type is one of the following: `ByteType`, `ShortType`, `IntegerType`, `LongType`, `FloatType`, or `DoubleType`.


| **Statistic name** | **Description** | **Supported data types** | **Default status** | **Attributes of profile result** | **Type of profile result** | 
| --- | --- | --- | --- | --- | --- | 
| – | Name of the column. | all | – | name | string | 
| – | Data type of the column. | all | – | type | string | 
| DISTINCT\_VALUES\_COUNT | Number of distinct values. A *distinct value* is a value that appears at least once. | number/boolean/string | Enabled | distinctValuesCount | Int | 
| ENTROPY | Entropy (information theory). | number/boolean/string | Enabled | entropy | Double | 
| INTER\_QUARTILE\_RANGE | Range between the 25th percentile and 75th percentile of the numbers. | number | Enabled | interquartileRange | Double | 
| KURTOSIS | Kurtosis of the column. | number | Enabled | kurtosis | Double | 
| MAX | Maximum value in the column. | number/string length | Enabled | max | Int/Double | 
| MAXIMUM\_VALUES | List of the maximum values in the column and their counts. | number | Enabled | maximumValues | List | 
| MEAN | Mean of the values in the column. | number/string length | Enabled | mean | Double | 
| MEDIAN | Median of the values in the column. | number/string length | Enabled | median | Double | 
| MEDIAN\_ABSOLUTE\_DEVIATION | The median of the absolute differences between each data point and the median of a numeric column. | number | Enabled | medianAbsoluteDeviation | Double | 
| MIN | Minimum value in the column. | number/string length | Enabled | min | Int/Double | 
| MINIMUM\_VALUES | List of the minimum values in the column and their counts. | number | Enabled | minimumValues | List | 
| MISSING\_VALUES\_COUNT | Number of missing values in the column. Null values and empty strings are considered missing. | all | Enabled | missingValuesCount | Int | 
| MODE | The most frequently occurring value in the column. If several values tie for the highest frequency, the mode is one of those values. | number/string length | Enabled | mode | Int/Double | 
| MOST\_COMMON\_VALUES | List of the most common values in the column. | number/boolean/string | Enabled | mostCommonValues | List | 
| OUTLIER\_DETECTION | Detects outliers in the column by using the Z-score algorithm. Counts the number of outliers and extracts a list of samples from the detected outliers. | number/string length | Enabled | zScoreOutliersCount, zScoreOutliersSample | Int/List | 
| PERCENTILES | Percentile values of a numeric column (5%, 25%, 75%, 95%). | number | Enabled | percentile5, percentile25, percentile75, percentile95 | Double | 
| RANGE | Range of the values in the column. | number | Enabled | range | Int/Double | 
| SKEWNESS | Skewness of the values in the column. | number | Enabled | skewness | Double | 
| STANDARD\_DEVIATION | Unbiased sample standard deviation of the values in the column. | number/string length | Enabled | standardDeviation | Double | 
| SUM | Sum of the values in the column. | number | Enabled | sum | Int/Double | 
| UNIQUE\_VALUES\_COUNT | Number of unique values. A *unique value* is a value that appears exactly once. | number/boolean/string | Enabled | uniqueValuesCount | Int | 
| VALUE\_DISTRIBUTION | Measure of the distribution of values in the column by range. | number/string length | Enabled | valueDistribution | List | 
| VARIANCE | Variance of the values in the column. | number | Enabled | variance | Double | 
| Z\_SCORE\_DISTRIBUTION | Measure of the distribution of data points' z-score values by range. | number | Enabled | zScoreDistribution | List | 
| ZEROS\_COUNT | Number of zeroes (0s) in the column. | number | Enabled | zerosCount | Int | 

In `Overrides`, you can override each evaluation's default parameters by adding an override. Each override includes the name of a particular evaluation and a parameter map.

## Parameters for ColumnStatisticsConfigurations columns


In `ColumnStatisticsConfigurations`, a profile job supports the following parameters. 


### MAXIMUM\_VALUES


Lists the maximum values in the numeric column and their counts. The default list size is 5. You can override the list size by specifying a value for `sampleSize`.

**Settings**

`sampleSize` – The size of the list of maximum values and their counts in the numeric column. This value should be greater than 0. Use `"ALL"` to list all values. 

**Example**

```
{ 
    "Statistic": "MAXIMUM_VALUES",
    "Parameters": {
        "sampleSize": "5"
    }
}
```

### MINIMUM\_VALUES


Lists the minimum values in the numeric column and their counts. The default list size is 5. You can override the list size by specifying a value for `sampleSize`.

**Settings**

`sampleSize` – The size of the list of minimum values and their counts in the numeric column. This value should be greater than 0. Use `"ALL"` to list all values. 

**Example**

```
{
    "Statistic": "MINIMUM_VALUES",
    "Parameters": {
        "sampleSize": "5"
    }
}
```

### MOST\_COMMON\_VALUES


Lists the most common values in the column and their counts. The default list size is 50. You can override the list size by specifying a value for `sampleSize`.

**Settings**

`sampleSize` – The size of the list of the most common values and their counts in the column. This value should be greater than 0. Use `"ALL"` to list all values.

**Example**

```
{
    "Statistic": "MOST_COMMON_VALUES",
    "Parameters": {
        "sampleSize": "50"
    }
}
```

### OUTLIER\_DETECTION


Detects outliers in the numeric column or string column (based on string length) by using the Z-score algorithm. 

Your profile job counts the number of outliers and generates a sample list of outliers and their z-scores. The sample list is ordered by the z-score's absolute value. The default list size is 50.

The Z-score algorithm identifies a value as an outlier when it deviates from the mean by more than the threshold number of standard deviations. The default outlier threshold is 3.

You can provide one more threshold, a mild threshold, to get more information. Your mild threshold should be less than your threshold. This feature is turned off by default. When a mild threshold is specified, your profile job returns one more count, `zScoreMildOutliersCount`. Also, `zScoreOutliersSample` can include a sample of mild threshold outliers in this case.

**Settings**
+ `threshold` – The threshold value to use when detecting outliers. This value should be greater than or equal to 0. 
+ `mildThreshold` – The mild threshold value to use when detecting outliers. This value should be greater than or equal to 0 and less than `threshold`.
+ `sampleSize` – The size of list that includes outliers in the column. Use `"ALL"` to list all values.

**Example**

```
{
    "Statistic": "OUTLIER_DETECTION",
    "Parameters": {
        "threshold": "5",
        "mildThreshold": "3.5",
        "sampleSize": "20"
    }
}
```

### VALUE\_DISTRIBUTION


Measures the distribution of values in the column by the values' ranges. A profile job groups values from a numeric column or string column (based on string length) into bins by numeric ranges, and generates a list of bins. Bins are consecutive, and the upper bound of one bin is the lower bound of the next bin.

**Settings**

`binNumber` – Number of bins. This value should be greater than 0. 

**Example**

```
{
    "Statistic": "VALUE_DISTRIBUTION",
    "Parameters": {
        "binNumber": "5"
    }
}
```

### Z\_SCORE\_DISTRIBUTION


Measures the distribution of the values' z-scores in a numeric column. A profile job groups the z-scores of the values into bins by numeric ranges, and generates a list of bins. Bins are consecutive, and the upper bound of one bin is the lower bound of the next bin.

**Settings**

`binNumber` – Number of bins. This value should be greater than 0. 

**Example**

```
{
    "Statistic": "Z_SCORE_DISTRIBUTION",
    "Parameters": {
        "binNumber": "5"
    }
}
```

## EntityDetectorConfiguration section for configuring PII


In the `EntityDetectorConfiguration` section of your structure, you can configure the entity types in your dataset that you want DataBrew to detect as **personally identifiable information** (PII) for a profile job.

### EntityTypes


You configure the entity types you want DataBrew to detect as PII for your profile job. When `EntityDetectorConfiguration` is undefined, entity detection is disabled. The following entity types can be detected in your dataset:
+ USA\_SSN
+ EMAIL
+ USA\_ITIN
+ USA\_PASSPORT\_NUMBER
+ PHONE\_NUMBER
+ USA\_DRIVING\_LICENSE
+ BANK\_ACCOUNT
+ CREDIT\_CARD
+ IP\_ADDRESS
+ MAC\_ADDRESS
+ USA\_DEA\_NUMBER
+ USA\_HCPCS\_CODE
+ USA\_NATIONAL\_PROVIDER\_IDENTIFIER
+ USA\_NATIONAL\_DRUG\_CODE
+ USA\_HEALTH\_INSURANCE\_CLAIM\_NUMBER
+ USA\_MEDICARE\_BENEFICIARY\_IDENTIFIER
+ USA\_CPT\_CODE
+ PERSON\_NAME
+ DATE

The entity type group `USA_ALL` is also supported, and includes all of the above entity types except `PERSON_NAME` and `DATE`.

The type of `EntityTypes` is an array of strings.

### AllowedStatistics


Configure the statistics that are allowed to be run on columns that contain detected entities. If `AllowedStatistics` is undefined, no statistics will be computed on columns that contain detected entities. See [Configurable statistics at the column level](#statistics.table02) for a list of valid values for the `AllowedStatistics` parameter.

The type of `AllowedStatistics` is an array of `AllowedStatistics` objects.
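For illustration, the following is a hedged sketch of an `EntityDetectorConfiguration` as it might appear inside a profile configuration, written as a Python dictionary for use with the SDK. The entity types and allowed statistics shown are example choices, not required values.

```python
# Sketch of an EntityDetectorConfiguration inside a profile configuration.
# The entity types and allowed statistics below are illustrative choices;
# allowed statistic names come from the column-level statistics table above.
entity_detector_configuration = {
    "EntityTypes": ["USA_SSN", "EMAIL", "CREDIT_CARD"],
    "AllowedStatistics": [
        {"Statistics": ["MISSING_VALUES_COUNT", "DISTINCT_VALUES_COUNT"]}
    ],
}
```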