

# Use a crawler to add a table
<a name="schema-crawlers"></a>

AWS Glue crawlers examine your data, infer its schema, and register the result as tables in the AWS Glue Data Catalog. A crawler can also detect and register partitions. For more information, see [Defining crawlers](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html) in the *AWS Glue Developer Guide*. Tables created from a successful crawl can be queried from Athena.

**Note**  
Athena does not recognize [exclude patterns](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude) that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both `.csv` and `.json` files and you exclude the `.json` files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location. 

## Create an AWS Glue crawler
<a name="data-sources-glue-crawler-setup"></a>

You can start from the Athena console and complete the crawler setup in the AWS Glue console. When you create the crawler, you specify a data location in Amazon S3 to crawl.

**To create a crawler in AWS Glue starting from the Athena console**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. In the query editor, next to **Tables and views**, choose **Create**, and then choose **AWS Glue crawler**. 

1. On the **AWS Glue** console **Add crawler** page, follow the steps to create a crawler. For more information, see [Use a crawler to add a table](#schema-crawlers) in this guide and [Populating the AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/populate-catalog-methods.html) in the *AWS Glue Developer Guide*.
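As a rough sketch, the same crawler can also be created programmatically through the AWS Glue `CreateCrawler` API (for example, with boto3). The crawler name, IAM role, and database below are placeholders, not values from this guide.

```python
# Sketch: request parameters for glue.create_crawler().
# The name, role, and database values are hypothetical placeholders.
# With AWS credentials configured, you would send the request with:
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_params)
crawler_params = {
    "Name": "athena-example-crawler",         # placeholder crawler name
    "Role": "AWSGlueServiceRole-example",     # IAM role the crawler assumes
    "DatabaseName": "example_db",             # target Data Catalog database
    "Targets": {
        "S3Targets": [
            # Amazon S3 location to crawl; a trailing slash marks a folder
            {"Path": "s3://amzn-s3-demo-bucket/folder1/table1/"}
        ]
    },
}
```

After the crawler exists, `glue.start_crawler(Name=...)` runs it on demand; the resulting table then appears in the Athena query editor.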

After a crawl, the AWS Glue crawler automatically assigns certain table metadata to help make it compatible with other external technologies like Apache Hive, Presto, and Spark. Occasionally, the crawler may incorrectly assign metadata properties. Manually correct the properties in AWS Glue before querying the table using Athena. For more information, see [Viewing and editing table details](https://docs.aws.amazon.com/glue/latest/dg/console-tables.html#console-tables-details) in the *AWS Glue Developer Guide*.

One common case is CSV data that has quotes around each field, for which the crawler may assign the wrong `serializationLib` property. For more information, see [Handling CSV data enclosed in quotes](schema-csv.md#schema-csv-quotes).
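As a minimal sketch of the correction, the table's SerDe can be switched to `OpenCSVSerde` and written back with the AWS Glue `UpdateTable` API. The table definition below is a stub standing in for what `glue.get_table()` would return; the database and table names are placeholders.

```python
# Sketch: correcting the SerDe on a crawler-created table for quoted CSV data.
# "table" is a minimal stand-in for the TableInput you would build from
# the response of glue.get_table(); names here are hypothetical.
table = {
    "Name": "quoted_csv_table",
    "StorageDescriptor": {
        "SerdeInfo": {
            # The crawler may have assigned LazySimpleSerDe,
            # which does not handle quoted fields
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "Parameters": {},
        }
    },
}

# Switch to OpenCSVSerde, which understands quoted CSV fields
serde = table["StorageDescriptor"]["SerdeInfo"]
serde["SerializationLibrary"] = "org.apache.hadoop.hive.serde2.OpenCSVSerde"
serde["Parameters"] = {"separatorChar": ",", "quoteChar": '"'}

# The corrected definition would then be sent back with:
#   boto3.client("glue").update_table(DatabaseName="example_db", TableInput=table)
```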

# Use multiple data sources with a crawler
<a name="schema-crawlers-data-sources"></a>

When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.

For example, consider the following partition structure in Amazon S3:

```
s3://amzn-s3-demo-bucket/folder1/table1/partition1/file.txt
s3://amzn-s3-demo-bucket/folder1/table1/partition2/file.txt
s3://amzn-s3-demo-bucket/folder1/table1/partition3/file.txt
s3://amzn-s3-demo-bucket/folder1/table2/partition4/file.txt
s3://amzn-s3-demo-bucket/folder1/table2/partition5/file.txt
```

If the schemas for `table1` and `table2` are similar, and a single data source is set to `s3://amzn-s3-demo-bucket/folder1/` in AWS Glue, the crawler may create a single table with two partition columns: one partition column that contains `table1` and `table2`, and a second partition column that contains `partition1` through `partition5`.

To have the AWS Glue crawler create two separate tables, set the crawler to have two data sources, `s3://amzn-s3-demo-bucket/folder1/table1/` and `s3://amzn-s3-demo-bucket/folder1/table2/`, as shown in the following procedure.
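The same two-data-source configuration can be sketched as an AWS Glue `UpdateCrawler` request (here via boto3). The crawler name is a placeholder; the paths come from the example above.

```python
# Sketch: one S3 target per table root, so the crawler creates two separate
# tables instead of one table with extra partition columns.
targets = {
    "S3Targets": [
        {"Path": "s3://amzn-s3-demo-bucket/folder1/table1/"},
        {"Path": "s3://amzn-s3-demo-bucket/folder1/table2/"},
    ]
}

# Applied to an existing crawler (the name is a placeholder) with:
#   boto3.client("glue").update_crawler(
#       Name="athena-example-crawler", Targets=targets)
```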

**To add an S3 data store to an existing crawler in AWS Glue**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Crawlers**.

1. Choose the link to your crawler, and then choose **Edit**. 

1. For **Step 2: Choose data sources and classifiers**, choose **Edit**. 

1. For **Data sources and catalogs**, choose **Add a data source**.

1. In the **Add data source** dialog box, for **S3 path**, choose **Browse**. 

1. Select the bucket that you want to use, and then choose **Choose**.

1. Make sure that the S3 path ends in a trailing slash, and then choose **Add an S3 data source**.

   The data source that you added appears in the **Data sources** list.

1. Choose **Next**.

1. On the **Configure security settings** page, create or choose an IAM role for the crawler, and then choose **Next**.

1. On the **Set output and scheduling** page, for **Output configuration**, choose the target database.

1. Choose **Next**.

1. On the **Review and update** page, review the choices that you made. To edit a step, choose **Edit**.

1. Choose **Update**.

# Schedule a crawler to keep the AWS Glue Data Catalog and Amazon S3 in sync
<a name="schema-crawlers-schedule"></a>

AWS Glue crawlers can be set up to run on a schedule or on demand. For more information, see [Time-based schedules for jobs and crawlers](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html) in the *AWS Glue Developer Guide*.

If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler to run on a schedule to detect and update table partitions. This can eliminate the need to run a potentially long and expensive `MSCK REPAIR TABLE` command or to run `ALTER TABLE ADD PARTITION` commands manually. For more information, see [Table partitions](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#tables-partition) in the *AWS Glue Developer Guide*.
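As a sketch, a schedule is attached to a crawler as an AWS cron expression (six fields: minute, hour, day-of-month, month, day-of-week, year). The crawler name below is a placeholder.

```python
# Sketch: a cron expression that runs the crawler daily at 01:00 UTC, so that
# newly arrived partitions are registered without running MSCK REPAIR TABLE.
# AWS cron format: cron(minute hour day-of-month month day-of-week year);
# one of day-of-month or day-of-week must be "?".
schedule = "cron(0 1 * * ? *)"

# Attached to an existing crawler (the name is a placeholder) with:
#   boto3.client("glue").update_crawler(
#       Name="athena-example-crawler", Schedule=schedule)
```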