本文為英文版的機器翻譯版本，如內容有任何歧義或不一致之處，概以英文版為準。 # 從查詢結果建立資料表 (CTAS) `CREATE TABLE AS SELECT` (CTAS) 查詢會根據另一個查詢的 `SELECT` 陳述式結果，在 Athena 中建立新的資料表。Athena 會將 CTAS 陳述式建立的資料檔案存放在 Simple Storage Service (Amazon S3) 中的指定位置。如需語法，請參閱[CREATE TABLE AS](create-table-as.md)。 `CREATE TABLE AS` 結合使用 `CREATE TABLE` DDL 陳述式與 `SELECT` DML 陳述式，因此技術上同時包含 DDL 和 DML。不過請注意，針對 Service Quotas 而言，Athena 中的 CTAS 查詢會被視為 DML。如需有關 Athena 中的 Service Quotas 的相關資訊，請參閱 [Service Quotas](service-limits.md)。使用 CTAS 查詢來： + 只要執行一個步驟就能從查詢結果建立資料表，而不需重複查詢原始資料集。如此更方便您使用原始資料集。 + 轉換查詢結果並將資料表遷移為其他資料表格式，例如 Apache Iceberg。這可改善查詢效能並降低 Athena 中的查詢成本。如需相關資訊，請參閱[建立 Iceberg 資料表](querying-iceberg-creating-tables.md)。 + 將查詢結果轉換為 Parquet 和 ORC 等儲存格式。這可改善查詢效能並降低 Athena 中的查詢成本。如需相關資訊，請參閱[使用單欄式儲存格式](columnar-storage.md)。 + 建立僅包含您所需資料的現有資料表的副本。 **Topics** + [CTAS 查詢的考量事項與限制](ctas-considerations-limitations.md) + [建立 CTAS 查詢](ctas-console.md) + [CTAS 範例](ctas-examples.md) + [使用 CTAS 和 INSERT INTO 以進行 ETL](ctas-insert-into-etl.md) + [解決 100 個分割區限制](ctas-insert-into.md) # CTAS 查詢的考量事項與限制下列各節會描述在 Athena 中使用 `CREATE TABLE AS SELECT` (CTAS) 查詢時要記住的注意事項和限制。 ## 了解 CTAS 查詢語法 CTAS 查詢語法與用於建立資料表的 `CREATE [EXTERNAL] TABLE` 語法不同。請參閱 [CREATE TABLE AS](create-table-as.md)。 ## 檢視與 CTAS 查詢之間的差異 CTAS 查詢會將新資料寫入到 Amazon S3 中的指定位置。檢視不會寫入任何資料。 ## 指定 CTAS 查詢結果的位置如果您的工作群組為查詢結果位置[覆寫用戶端設定](workgroups-settings-override.md)，則 Athena 會在位置 `s3://amzn-s3-demo-bucket/tables//` 中建立資料表。若要查看為工作群組指定的查詢結果位置，請[檢視工作群組的詳細資訊](viewing-details-workgroups.md)。如果您的工作群組未覆寫查詢結果位置，您可以在 CTAS 查詢中使用語法 `WITH (external_location ='s3://amzn-s3-demo-bucket/')` 來指定 CTAS 查詢結果的儲存位置。 **注意** `external_location` 屬性必須指定空白的位置。CTAS 查詢會檢查儲存貯體中的路徑位置 (前綴) 是空的，並且如果該位置中已有資料，則不會加以覆寫。若要再次使用相同的位置，請刪除儲存貯體的金鑰前綴位置中的資料。如果您省略 `external_location` 語法而且未使用工作群組設定，則 Athena 會使用查詢結果位置的[用戶端設定](query-results-specify-location-console.md)，並在位置 `s3://amzn-s3-demo-bucket////tables//` 中建立資料表。 ## 尋找孤立檔案如果 `CTAS` 或 `INSERT INTO` 陳述式失敗，則失敗或取消的查詢所在的目標資料位置可能遺留孤立的資料檔案。由於 Athena 在某些案例中不會從查詢的目標儲存貯體中刪除資料，因此後續查詢中可能會包含部分資料。若要尋找孤立檔案以進行檢測或刪除，您可以使用 Athena 提供的資料資訊清單檔案來追蹤要寫入的檔案清單。在某些極少數情況下，Athena 查詢會突然失敗，此時可能不存在資訊清單檔案。您可以手動檢查目標 S3 位置，以尋找孤立的檔案。如需詳細資訊，請參閱[識別查詢輸出檔案](querying-finding-output-files.md#querying-identifying-output-files)和 [DataManifestLocation](https://docs.aws.amazon.com/athena/latest/APIReference/API_QueryExecutionStatistics.html#athena-Type-QueryExecutionStatistics-DataManifestLocation)。我們強烈建議使用 Apache Iceberg 來實現資料表的原子交易。如需詳細資訊，請參閱[查詢 Apache Iceberg 資料表](querying-iceberg.md)。 ## 請記住，會忽略 ORDER BY 子句在 CTAS 查詢中，Athena 會忽略查詢 `SELECT` 部分中的 `ORDER BY` 子句。根據 SQL 規格 (ISO 9075 第 2 部分)，只有立即包含 `ORDER BY` 子句的查詢表達式，才能保證查詢表達式所指定之資料表資料列的順序。在任何情況下，SQL 中的資料表本身是無順序的，並且在子查詢子句中實作 `ORDER BY` 都會導致查詢執行不彰，並且將會導致無序輸出。因此，在 Athena CTAS 查詢中，無法保證在寫入資料時會保留 `ORDER BY` 子句指定的順序。 ## 選擇存放查詢結果的格式您可以將 CTAS 結果存放在 `PARQUET`、`ORC`、`AVRO`、`JSON` 和 `TEXTFILE` 中。CTAS `TEXTFILE` 格式不支援多字元分隔符號。如果您不指定資料儲存格式，則依預設，CTAS 查詢的結果會存放在 Parquet 中。 CTAS 查詢不要求指定 SerDe 來解譯格式轉換。請參閱 [Example: Writing query results to a different format](ctas-examples.md#ctas-example-format)。 ## 考慮壓縮格式 `GZIP` 壓縮用於 JSON 和 TEXTFILE 格式的 CTAS 查詢結果。對於 Parquet，您可以使用 `GZIP` 或 `SNAPPY`，預設值為 `GZIP`。對於 ORC，您可以使用 `LZ4`、`SNAPPY`、`ZLIB` 或 `ZSTD`，預設值為 `ZLIB`。如需指定壓縮的 CTAS 範例，請參閱 [Example: Specifying data storage and compression formats](ctas-examples.md#ctas-example-compression)。如需有關 Athena 中壓縮的資訊，請參閱 [在 Athena 中使用壓縮](compression-formats.md)。 ## 分割和歸納結果您可以分割和歸納 CTAS 查詢的結果資料。若要指定目標資料表的屬性，請在 `WITH` 子句的結尾包含分割和歸納述詞。如需詳細資訊，請參閱[使用分割和歸納](ctas-partitioning-and-bucketing.md)及[Example: Creating bucketed and partitioned tables](ctas-examples.md#ctas-example-bucketed)。當您使用 CTAS 建立分割的資料表時，Athena 的寫入限制為 100 個分割區。如需解決 100 個分割區限制的相關資訊，請參閱 [使用 CTAS 和 INSERT INTO 來解決 100 個分割區限制](ctas-insert-into.md)。 ## 加密結果您可以在 Simple Storage Service (Amazon S3) 中加密 CTAS 查詢結果，類似於您在 Athena 中加密其他查詢結果的方式。如需詳細資訊，請參閱[加密 Amazon S3 中存放的 Athena 查詢結果](encrypting-query-results-stored-in-s3.md)。 ## 預期的儲存貯體擁有者設定不適用於 CTAS 對於 CTAS 陳述式，預期的儲存貯體擁有者設定不適用於 Simple Storage Service (Amazon S3) 中的目的地資料表位置。預期的儲存貯體擁有者設定僅適用於您為 Athena 查詢結果指定的 Simple Storage Service (Amazon S3) 輸出位置。如需詳細資訊，請參閱[使用 Athena 主控台指定查詢結果位置](query-results-specify-location-console.md)。 ## 保留資料欄資料類型 CTAS 查詢的資料欄資料類型與為原始查詢所指定的相同。 # 在 Athena 主控台中建立 CTAS 查詢在 Athena 主控台中，您可以從另一個查詢建立一個 CTAS 查詢。 **若要從另一個查詢建立 CTAS 查詢** 1. 在 Athena 主控台查詢編輯器中執行查詢。 1. 選擇查詢編輯器最下方的 **Create** (建立) 選項，然後選擇 **Table from query** (查詢中的資料表)。 1. 在 **Create table as select** (建立所選資料表) 表單中，按如下方式填寫欄位： 1. 在 **Table name** (資料表名稱) 中，輸入新資料表的名稱。僅使用小寫和底線，例如 `my_select_query_parquet`。 1. 在 **Database configuration** (資料庫組態) 中，使用選項來選擇現有資料庫或建立資料庫。 1. (選用) 在 **Result configuration** (結果組態) 的 **Location of CTAS query results** (CTAS 查詢結果的位置) 中，如果您的工作群組查詢結果位置設定未覆寫此選項，請執行下列其中一項作業： + 在搜尋方塊中輸入現有 S3 位置的路徑，或選擇 **Browse S3** (瀏覽 S3) 從清單中選擇位置。 + 選擇 **View** (檢視)，以開啟 Amazon S3 主控台的 **Buckets** (儲存貯體) 頁面，您可以在其中檢視現有儲存貯體的詳細資訊，並選擇儲存貯體，或使用自己的設定建立儲存貯體。您應該在 Amazon S3 中指定將進行資料輸出的空白位置。如果您指定的位置中已存有資料，則查詢失敗並出現錯誤。如果您的工作群組查詢結果位置設定覆寫此位置的設定，則 Athena 會在位置 `s3://amzn-s3-demo-bucket/tables/query_id/` 中建立資料表。 1. 在 **Data format** (資料格式) 中，指定資料所使用的格式。 + **Table type** (資料表類型) – Athena 中的預設資料表類型為 Apache Hive。 + **File format** (檔案格式) – 選擇 CSV、TSV、JSON、Parquet 或 ORC 等選項。如需 Parquet 和 ORC 格式的相關資訊，請參閱 [使用單欄式儲存格式](columnar-storage.md)。 + **Write compression** (寫入壓縮) – (選用) 選擇壓縮格式。Athena 支援各種壓縮格式來讀取和寫入資料，包括從使用多種壓縮格式的資料表讀取。例如，當某些 Parquet 檔案使用 Snappy 壓縮而其他 Parquet 檔案使用 GZIP 壓縮時，Athena 可以成功讀取使用 Parquet 檔案格式的資料表中的資料。相同的原則適用於 ORC、文字檔案和 JSON 儲存格式。如需詳細資訊，請參閱[在 Athena 中使用壓縮](compression-formats.md)。 + **Partitions** (分割區) – (選用) 選取要分割的資料欄。分割您的資料會限制每個查詢所掃描的資料量，從而可以提高效能和降低成本。您可透過任何索引鍵來分割您的資料。如需詳細資訊，請參閱[分割您的資料](partitions.md)。 + **Buckets** (儲存貯體)– (選用) 選取您要歸納的資料欄。歸納是一種依照特定資料欄將資料分組到單個分割區的技術。這些資料欄稱為*儲存貯體金鑰*。將相關資料分組至單一儲存貯體 (分割區內的檔案)，可大幅減少 Athena 掃描的資料量，進而改善查詢效能並降低成本。如需詳細資訊，請參閱[使用分割和歸納](ctas-partitioning-and-bucketing.md)。 1. 使用 **Preview table query** (預覽資料表查詢) 可請檢閱您的查詢。如需查詢語法，請參閱[CREATE TABLE AS](create-table-as.md)。 1. 選擇 **Create Table** (建立資料表)。 Athena 主控台具備 SQL 範本，您也可使用該範本來建立 CTAS 查詢。 **使用 SQL 範本建立 CTAS 查詢** 使用 `CREATE TABLE AS SELECT` 範本在查詢編輯器中建立 CTAS 查詢。 1. 在 Athena 主控台中，選擇位在 **Tables and views** (資料表和檢視) 旁的 **Create table** (建立資料表)，然後選擇 **CREATE TABLE AS SELECT** (建立資料表作為選擇)。這會使用帶有預留位置值的 CTAS 查詢填入查詢編輯器。 1. 在查詢編輯器中，視所需編輯查詢。如需查詢語法，請參閱[CREATE TABLE AS](create-table-as.md)。 1. 選擇**執行**。如需範例，請參閱 [CTAS 查詢的範例](ctas-examples.md)。 # CTAS 查詢的範例使用下列範例來建立 CTAS 查詢。如需 CTAS 語法的相關資訊，請參閱[CREATE TABLE AS](create-table-as.md)。在本節中： + [Example: Duplicating a table by selecting all columns](#ctas-example-dupe-table) + [Example: Selecting specific columns from one or more tables](#ctas-example-specify-columns) + [Example: Creating an empty copy of an existing table](#ctas-example-empty-table) + [Example: Specifying data storage and compression formats](#ctas-example-compression) + [Example: Writing query results to a different format](#ctas-example-format) + [Example: Creating unpartitioned tables](#ctas-example-unpartitioned) + [Example: Creating partitioned tables](#ctas-example-partitioned) + [Example: Creating bucketed and partitioned tables](#ctas-example-bucketed) + [Example: Creating an Iceberg table with Parquet data](#ctas-example-iceberg-parquet) + [Example: Creating an Iceberg table with Avro data](#ctas-example-iceberg-avro) + [Example: Creating an S3 table using CTAS](#ctas-example-s3-table) **Example 範例：選取所有資料欄以複製資料表** 以下範例會透過複製資料表的所有資料欄來建立資料表： ``` CREATE TABLE new_table AS SELECT * FROM old_table; ``` 在相同範例的以下變化中，您的 `SELECT` 陳述式也包含 `WHERE` 子句。在這種情況下，查詢只會從資料表中選取滿足 `WHERE` 子句的那些資料列： ``` CREATE TABLE new_table AS SELECT * FROM old_table WHERE condition; ``` **Example 範例：從一或多個資料表選取特定資料欄** 以下範例會建立對來自另一個資料表的一組資料欄執行的新查詢： ``` CREATE TABLE new_table AS SELECT column_1, column_2, ... column_n FROM old_table; ``` 相同範例的這個變化會來自多個資料表的特定資料欄建立新的資料表： ``` CREATE TABLE new_table AS SELECT column_1, column_2, ... column_n FROM old_table_1, old_table_2, ... old_table_n; ``` **Example 範例：建立現有資料表的空白副本** 以下範例使用 `WITH NO DATA` 來建立空白且與原始資料表具有相同結構描述的新資料表： ``` CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA; ``` **Example 範例：指定資料儲存和壓縮格式** 利用 CTAS，您可以使用一種儲存格式的來源資料表，建立不同儲存格式的另一個資料表。使用 `format` 屬性來指定 `ORC`、`PARQUET`、`AVRO`、`JSON` 或 `TEXTFILE` 作為新資料表的儲存格式。對於 `PARQUET`、`ORC`、`TEXTFILE`，和 `JSON` 儲存格式，請使用 `write_compression` 屬性指定新資料表資料的壓縮格式。如需每個檔案格式支援之壓縮格式的詳細資訊，請參閱[在 Athena 中使用壓縮](compression-formats.md)。下列範例指定資料表 `new_table` 中的資料以 Parquet 格式儲存並使用 Snappy 壓縮。Parquet 的預設壓縮為 `GZIP`。 ``` CREATE TABLE new_table WITH ( format = 'Parquet', write_compression = 'SNAPPY') AS SELECT * FROM old_table; ``` 下列範例指定使用 Snappy 壓縮以 ORC 格式儲存資料表 `new_table` 中的資料。ORC 的預設壓縮為 ZLIB。 ``` CREATE TABLE new_table WITH (format = 'ORC', write_compression = 'SNAPPY') AS SELECT * FROM old_table ; ``` 下列範例指定使用 Snappy 壓縮以文字檔案格式儲存資料表 `new_table` 中的資料。Textfile 和 JSON 格式的預設壓縮皆為 GZIP。 ``` CREATE TABLE new_table WITH (format = 'TEXTFILE', write_compression = 'SNAPPY') AS SELECT * FROM old_table ; ``` **Example 範例：將查詢結果寫入不同的格式** 以下 CTAS 查詢會從 `old_table` 選取所有記錄 (可用 CSV 或其他格式儲存)，並使用以 ORC 格式儲存至 Simple Storage Service (Amazon S3) 的基礎資料建立新的資料表： ``` CREATE TABLE my_orc_ctas_table WITH ( external_location = 's3://amzn-s3-demo-bucket/my_orc_stas_table/', format = 'ORC') AS SELECT * FROM old_table; ``` **Example 範例：建立未分割的資料表** 以下範例會建立未分割的資料表。資料表資料會以不同格式存放。其中的一些範例會指定外部位置。以下範例會建立可將結果存放為文字檔案的 CTAS 查詢： ``` CREATE TABLE ctas_csv_unpartitioned WITH ( format = 'TEXTFILE', external_location = 's3://amzn-s3-demo-bucket/ctas_csv_unpartitioned/') AS SELECT key1, name1, address1, comment1 FROM table1; ``` 在下列範例中，結果會存放在 Parquet 中，並且會使用預設結果位置： ``` CREATE TABLE ctas_parquet_unpartitioned WITH (format = 'PARQUET') AS SELECT key1, name1, comment1 FROM table1; ``` 在以下查詢中，資料表會存放在 JSON 中，並且從原始資料表的結果選取特定資料欄： ``` CREATE TABLE ctas_json_unpartitioned WITH ( format = 'JSON', external_location = 's3://amzn-s3-demo-bucket/ctas_json_unpartitioned/') AS SELECT key1, name1, address1, comment1 FROM table1; ``` 在下列範例中，格式為 ORC： ``` CREATE TABLE ctas_orc_unpartitioned WITH ( format = 'ORC') AS SELECT key1, name1, comment1 FROM table1; ``` 在下列範例中，格式為 Avro： ``` CREATE TABLE ctas_avro_unpartitioned WITH ( format = 'AVRO', external_location = 's3://amzn-s3-demo-bucket/ctas_avro_unpartitioned/') AS SELECT key1, name1, comment1 FROM table1; ``` **Example 範例：建立分割的資料表** 以下範例顯示採用不同儲存格式的分割資料表的 `CREATE TABLE AS SELECT` 查詢，使用 `partitioned_by`，以及 `WITH` 子句中的其他屬性。如需語法，請參閱[CTAS 資料表屬性](create-table-as.md#ctas-table-properties)。如需選擇用於分割的資料欄的詳細資訊，請參閱[使用分割和歸納](ctas-partitioning-and-bucketing.md)。在 `SELECT` 陳述式中資料欄清單結尾列出分割區資料欄。您可以依多於一個資料欄分割，且最多可以有 100 個唯一分割區和儲存貯體組合。例如，如果未指定儲存貯體，則可有 100 個分割區。 ``` CREATE TABLE ctas_csv_partitioned WITH ( format = 'TEXTFILE', external_location = 's3://amzn-s3-demo-bucket/ctas_csv_partitioned/', partitioned_by = ARRAY['key1']) AS SELECT name1, address1, comment1, key1 FROM tables1; ``` ``` CREATE TABLE ctas_json_partitioned WITH ( format = 'JSON', external_location = 's3://amzn-s3-demo-bucket/ctas_json_partitioned/', partitioned_by = ARRAY['key1']) AS select name1, address1, comment1, key1 FROM table1; ``` **Example 範例：建立歸納和分割的資料表** 以下範例顯示一個 `CREATE TABLE AS SELECT` 查詢，同時使用歸納和分割來在 Simple Storage Service (Amazon S3) 中存放查詢結果。資料表結果會依不同的資料欄分割和歸納。Athena 支援最多 100 個唯一儲存貯體和分割區組合。例如，如果您建立包含 5 個儲存貯體的資料表，則支援 20 個分割區 (每個都有 5 個儲存貯體)。如需語法，請參閱[CTAS 資料表屬性](create-table-as.md#ctas-table-properties)。如需選擇用於歸納的資料欄的詳細資訊，請參閱[使用分割和歸納](ctas-partitioning-and-bucketing.md)。 ``` CREATE TABLE ctas_avro_bucketed WITH ( format = 'AVRO', external_location = 's3://amzn-s3-demo-bucket/ctas_avro_bucketed/', partitioned_by = ARRAY['nationkey'], bucketed_by = ARRAY['mktsegment'], bucket_count = 3) AS SELECT key1, name1, address1, phone1, acctbal, mktsegment, comment1, nationkey FROM table1; ``` **Example 範例：建立一個包含 Parquet 資料的 Iceberg 資料表。** 下列範例會建立包含 Parquet 資料檔案的 Iceberg 資料表。使用 `table1` 中的 `dt` 資料欄按月份分割檔案。此範例會更新資料表上的保留屬性，以便在資料表中的每個分支上保留 10 個快照。另外，還會保留過去 7 天內的快照。如需有關 Athena 中的 Iceberg 資料表屬性的詳細資訊，請參閱 [指定資料表屬性](querying-iceberg-creating-tables.md#querying-iceberg-table-properties)。 ``` CREATE TABLE ctas_iceberg_parquet WITH (table_type = 'ICEBERG', format = 'PARQUET', location = 's3://amzn-s3-demo-bucket/ctas_iceberg_parquet/', is_external = false, partitioning = ARRAY['month(dt)'], vacuum_min_snapshots_to_keep = 10, vacuum_max_snapshot_age_seconds = 604800 ) AS SELECT key1, name1, dt FROM table1; ``` **Example 範例：建立一個包含 Avro 資料的 Iceberg 資料表。** 下列範例會建立包含 Avro 資料檔案的 Iceberg 資料表，其中資料表會依 `key1` 進行分割。 ``` CREATE TABLE ctas_iceberg_avro WITH ( format = 'AVRO', location = 's3://amzn-s3-demo-bucket/ctas_iceberg_avro/', is_external = false, table_type = 'ICEBERG', partitioning = ARRAY['key1']) AS SELECT key1, name1, date FROM table1; ``` **Example 範例：使用 CTAS 建立 S3 資料表** 以下範例會使用 CTAS 建立 S3 資料表。請注意，位置屬性會被省略，且 `table_type` 預設為 `ICEBERG`： ``` CREATE TABLE "s3tablescatalog/amzn-s3-demo-bucket"."namespace"."s3-table-name" WITH ( format = 'PARQUET' ) AS SELECT * FROM source_table; ``` 您可以指定所有其他 Iceberg 資料表屬性，例如使用與一般 Iceberg 資料表相同的語法進行分割和歸納。 # 使用 CTAS 和 INSERT INTO 以進行 ETL 和資料分析您可以在 Athena 中使用 Create Table as Select ([CTAS](ctas.md)) 與 [INSERT INTO](insert-into.md) 陳述式，以擷取、轉換和載入 (ETL) 資料至 Amazon S3 中進行資料處理。本主題說明如何使用這些陳述式以將資料集分割及轉換成單欄式資料格式，以最佳化資料分析。 CTAS 陳述式使用標準 [SELECT](select.md) 查詢來建立新的資料表。您可以使用 CTAS 陳述式來建立資料的子集以供分析。在一個 CTAS 陳述式中，您可以分割資料、指定壓縮，並將資料轉換成一個單欄格式，如 Apache Parquet 或 Apache ORC。當您執行 CTAS 查詢時，它所建立的資料表和分割會自動加入到 [AWS Glue Data Catalog](https://aws.amazon.com/glue)。這會讓它建立的新資料表和分割區立即可供後續查詢使用。 INSERT INTO 陳述式會根據在來源資料表上執行的 SELECT 查詢陳述式，將新的資料列插入目的地資料表。您可以使用 INSERT INTO 陳述式將 CSV 格式的來源資料表資料轉換並載入目的地資料表資料 (此資料使用 CTAS 支援的所有轉換)。 ## 概觀在 Athena 中，使用 CTAS 陳述式執行資料的初始批次轉換。然後使用多個 INSERT INTO 陳述式，對 CTAS 陳述式所建立的資料表進行累加式更新。 **步驟** + [步驟 1：建立以原始資料集為基礎的資料表](#ctas-insert-into-etl-step-1-create-a-table-based-on-the-original-dataset) + [步驟 2：使用 CTAS 來分割、轉換和壓縮資料](#ctas-insert-into-etl-step-2-use-ctas-to-partition-convert-and-compress-the-data) + [步驟 3：使用 INSERT INTO 以新增資料](#ctas-insert-into-etl-step-3-use-insert-into-to-add-data) + [步驟 4：測量效能與成本差異](#ctas-insert-into-etl-step-4-measure-performance-and-cost-differences) ## 步驟 1：建立以原始資料集為基礎的資料表本主題中的範例使用公開提供之 [NOAA Global Historical Climatology Network Daily (GHCN-D)](https://registry.opendata.aws/noaa-ghcn/) 資料集的 Amazon S3 可讀取子集。Amazon S3 上的資料具有下列特性。 ``` Location: s3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/ Total objects: 41727 Size of CSV dataset: 11.3 GB Region: us-east-1 ``` 原始資料存放在 Amazon S3 中，無任何分割區。這些資料是 CSV 格式的文件，如下所示。 ``` 2019-10-31 13:06:57 413.1 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0000 2019-10-31 13:06:57 412.0 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0001 2019-10-31 13:06:57 34.4 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0002 2019-10-31 13:06:57 412.2 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0100 2019-10-31 13:06:57 412.7 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0101 ``` 此範例中的檔案大小相對較小。藉由將檔案合併至較大的檔案中，您可以減少檔案總數，以提高查詢效能。您可以使用 CTAS 和 INSERT INTO 陳述式來增強查詢效能。 **根據範例資料集建立資料庫和資料表** 1. 在 Athena 主控台中，選擇 **US East (N. Virginia)** (美國東部 (維吉尼亞北部)) AWS 區域。請務必在 `us-east-1` 的本教學課程中執行所有查詢。 1. 在 Athena 查詢編輯器中，執行 [CREATE DATABASE](create-database.md) 命令以建立資料庫。 ``` CREATE DATABASE blogdb ``` 1. 執行下列陳述式來 [建立資料表](create-table.md)。 ``` CREATE EXTERNAL TABLE `blogdb`.`original_csv` ( `id` string, `date` string, `element` string, `datavalue` bigint, `mflag` string, `qflag` string, `sflag` string, `obstime` bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/' ``` ## 步驟 2：使用 CTAS 來分割、轉換和壓縮資料建立資料表之後，您可以使用單一 [CTAS](ctas.md) 陳述式將資料轉換為具有 Snappy 壓縮的 Parquet 格式與，並按年份分割資料。您在步驟 1 中建立的表格有一個 `date` 欄位，其日期格式為 `YYYYMMDD` (例如，`20100104`)。因為新的資料表會在 `year` 被分割，下列程序中的範例陳述式會使用 Presto 函式 `substr("date",1,4)` 以從 `date` 欄位擷取 `year` 值。 **將資料轉換為 Snappy 壓縮的 Parquet 格式，按年份分割** + 執行以下 CTAS 陳述式，將 *your-bucket* 替換為您的 Amazon S3 儲存貯體位置。 ``` CREATE table new_parquet WITH (format='PARQUET', parquet_compression='SNAPPY', partitioned_by=array['year'], external_location = 's3://amzn-s3-demo-bucket/optimized-data/') AS SELECT id, date, element, datavalue, mflag, qflag, sflag, obstime, substr("date",1,4) AS year FROM original_csv WHERE cast(substr("date",1,4) AS bigint) >= 2015 AND cast(substr("date",1,4) AS bigint) <= 2019 ``` **注意** 在本範例中，您建立的資料表會包含從 2015 年到 2019 年的資料。在步驟 3 中，您可以使用 INSERT INTO 命令將新的資料加入到此資料表。查詢完成時，請使用下列程序來確認您在 CTAS 陳述式中指定的 Amazon S3 位置中的輸出。 **查看 CTAS 陳述式所建立的分割區和 parquet 檔案** 1. 若要顯示建立的分割區，請執行下列 AWS CLI 命令。請務必包含最後的正斜線 (/)。 ``` aws s3 ls s3://amzn-s3-demo-bucket/optimized-data/ ``` 輸出會顯示分割區。 ``` PRE year=2015/ PRE year=2016/ PRE year=2017/ PRE year=2018/ PRE year=2019/ ``` 1. 若要查看 Parquet 檔案，請執行以下命令。請注意，在 Windows 上不可使用將輸出限制為前五個結果的 `|` *head-5* 選項。 ``` aws s3 ls s3://amzn-s3-demo-bucket/optimized-data/ --recursive --human-readable | head -5 ``` 輸出結果與以下內容相似。 ``` 2019-10-31 14:51:05 7.3 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_1be48df2-3154-438b-b61d-8fb23809679d 2019-10-31 14:51:05 7.0 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_2a57f4e2-ffa0-4be3-9c3f-28b16d86ed5a 2019-10-31 14:51:05 9.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_34381db1-00ca-4092-bd65-ab04e06dc799 2019-10-31 14:51:05 7.5 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_354a2bc1-345f-4996-9073-096cb863308d 2019-10-31 14:51:05 6.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_42da4cfd-6e21-40a1-8152-0b902da385a1 ``` ## 步驟 3：使用 INSERT INTO 以新增資料在步驟 2 中，您已使用 CTAS 來建立 2015 年到 2019 年的分割區資料表。不過，原始資料集也包含 2010 年至 2014 年的資料。現在您使用 [INSERT INTO](insert-into.md) 陳述式新增該資料。 **使用一或多個 INSERT INTO 陳述式將資料新增至資料表** 1. 執行下列 INSERT INTO 指令，在 WHERE 子句中指定 2015 年之前的年份。 ``` INSERT INTO new_parquet SELECT id, date, element, datavalue, mflag, qflag, sflag, obstime, substr("date",1,4) AS year FROM original_csv WHERE cast(substr("date",1,4) AS bigint) < 2015 ``` 1. 使用下列語法再次執行 `aws s3 ls` 命令。 ``` aws s3 ls s3://amzn-s3-demo-bucket/optimized-data/ ``` 該輸出會顯示新的分割區。 ``` PRE year=2010/ PRE year=2011/ PRE year=2012/ PRE year=2013/ PRE year=2014/ PRE year=2015/ PRE year=2016/ PRE year=2017/ PRE year=2018/ PRE year=2019/ ``` 1. 若要查看在 Parquet 格式中使用壓縮和單欄儲存所獲得的資料集大小縮減，請執行以下命令。 ``` aws s3 ls s3://amzn-s3-demo-bucket/optimized-data/ --recursive --human-readable --summarize ``` 以下結果顯示，用 Snappy 壓縮 parquet 後的資料集大小為 1.2 GB。 ``` ... 2020-01-22 18:12:02 2.8 MiB optimized-data/year=2019/20200122_181132_00003_nja5r_f0182e6c-38f4-4245-afa2-9f5bfa8d6d8f 2020-01-22 18:11:59 3.7 MiB optimized-data/year=2019/20200122_181132_00003_nja5r_fd9906b7-06cf-4055-a05b-f050e139946e Total Objects: 300 Total Size: 1.2 GiB ``` 1. 如果有更多 CSV 資料新增至原始資料表，則您可以使用 INSERT INTO 陳述式將該資料加入至 parquet 資料表。例如，如果您有 2020 年的新資料，則您可以執行下列 INSERT INTO 陳述式。該陳述式會將資料和相關的磁碟分割新增至 `new_parquet` 資料表。 ``` INSERT INTO new_parquet SELECT id, date, element, datavalue, mflag, qflag, sflag, obstime, substr("date",1,4) AS year FROM original_csv WHERE cast(substr("date",1,4) AS bigint) = 2020 ``` **注意** INSERT INTO 陳述式支援最多將 100 個分割區寫入目標資料表。不過，若要新增超過 100 個分割區，您可以執行多個 INSERT INTO 陳述式。如需詳細資訊，請參閱[使用 CTAS 和 INSERT INTO 來解決 100 個分割區限制](ctas-insert-into.md)。 ## 步驟 4：測量效能與成本差異在您轉換資料之後，您可以在新舊資料表上執行相同的查詢，並比較結果來衡量效能提升和成本節省。 **注意** 如需 Athena 每個查詢成本的資訊，請參閱 [Amazon Athena 定價](https://aws.amazon.com/athena/pricing)。 **衡量效能提升與成本差異** 1. 在原始資料表上執行下列查詢。該查詢會尋找一年中每個值的不同 ID 數目。 ``` SELECT substr("date",1,4) as year, COUNT(DISTINCT id) FROM original_csv GROUP BY 1 ORDER BY 1 DESC ``` 1. 請注意查詢執行的時間，以及掃描的資料量。 1. 在新資料表上執行相同的查詢，並記下查詢執行時間和掃描的資料量。 ``` SELECT year, COUNT(DISTINCT id) FROM new_parquet GROUP BY 1 ORDER BY 1 DESC ``` 1. 比較結果並計算效能和成本差異。下列範例結果顯示，在新資料表上的測試查詢比舊資料表上的查詢更快且更便宜。 **** [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/athena/latest/ug/ctas-insert-into-etl.html) 1. 在原始資料表上執行下列範例查詢。該查詢會計算 2018 年地球的平均最高溫度 (攝氏)、平均最低溫度 (攝氏) 和平均降雨量 (mm)。 ``` SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value FROM original_csv WHERE element IN ('TMIN', 'TMAX', 'PRCP') AND substr("date",1,4) = '2018' GROUP BY 1 ``` 1. 請注意查詢執行的時間，以及掃描的資料量。 1. 在新資料表上執行相同的查詢，並記下查詢執行時間和掃描的資料量。 ``` SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value FROM new_parquet WHERE element IN ('TMIN', 'TMAX', 'PRCP') and year = '2018' GROUP BY 1 ``` 1. 比較結果並計算效能和成本差異。下列範例結果顯示，在新資料表上的測試查詢比舊資料表上的查詢更快且更便宜。 **** [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/athena/latest/ug/ctas-insert-into-etl.html) ## 摘要本主題說明如何使用 Athena 中的 CTAS 和 INSERT INTO 陳述式來執行 ETL 操作。您會使用 CTAS 陳述式將資料轉換為具有 Snappy 壓縮的 Parquet 格式，以執行第一組轉換。CTAS 陳述式也會將資料集從非分割轉換為已分割。這會減少其大小並降低執行查詢的成本。當新資料可用時，您可以使用 INSERT INTO 陳述式來轉換並將資料載入您使用 CTAS 陳述式建立的資料表。 # 使用 CTAS 和 INSERT INTO 來解決 100 個分割區限制 Athena 每個 `CREATE TABLE AS SELECT` ([CTAS](ctas.md)) 查詢的分割區限制為 100 個。同樣地，您可以使用 [INSERT INTO](https://docs.aws.amazon.com/athena/latest/ug/insert-into.html) 陳述式將最多 100 個分割區新增至目的地資料表。如果您超出此限制，您可能會收到錯誤訊息 HIVE\$1TOO\$1MANY\$1OPEN\$1PARTITIONS: Exceeded limit of 100 open writers for partitions/buckets (HIVE\$1TOO\$1MANY\$1OPEN\$1PARTITIONS：超過分割區／儲存貯體 100 個開啟寫入器的限制)。若要避開此限制，您可以使用 CTAS 陳述式和一系列的 `INSERT INTO` 陳述式，每個陳述式可建立或插入最多 100 個分割區。本主題中的範例使用名為 `tpch100` 的資料庫，其資料位於 Amazon S3 儲存貯體位置 s3://amzn-s3-demo-bucket/。 **使用 CTAS 和 INSERT INTO 來建立超過 100 個分割區的資料表** 1. 使用 `CREATE EXTERNAL TABLE` 陳述式在您要的欄位上建立分割的資料表。下列範例陳述式會依資料欄 `l_shipdate` 來分割資料。該資料表有 2525 個分割區。 ``` CREATE EXTERNAL TABLE `tpch100.lineitem_parq_partitioned`( `l_orderkey` int, `l_partkey` int, `l_suppkey` int, `l_linenumber` int, `l_quantity` double, `l_extendedprice` double, `l_discount` double, `l_tax` double, `l_returnflag` string, `l_linestatus` string, `l_commitdate` string, `l_receiptdate` string, `l_shipinstruct` string, `l_comment` string) PARTITIONED BY ( `l_shipdate` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://amzn-s3-demo-bucket/lineitem/' ``` 1. 執行如下所示的 `SHOW PARTITIONS ` 命令以列出分割區。 ``` SHOW PARTITIONS lineitem_parq_partitioned ``` 以下是部分樣本結果。 ``` /* l_shipdate=1992-01-02 l_shipdate=1992-01-03 l_shipdate=1992-01-04 l_shipdate=1992-01-05 l_shipdate=1992-01-06 ... l_shipdate=1998-11-24 l_shipdate=1998-11-25 l_shipdate=1998-11-26 l_shipdate=1998-11-27 l_shipdate=1998-11-28 l_shipdate=1998-11-29 l_shipdate=1998-11-30 l_shipdate=1998-12-01 */ ``` 1. 執行 CTAS 查詢以建立分割的資料表。下列範例會建立名為 `my_lineitem_parq_partitioned` 的資料表，並使用 `WHERE ` 子句來限制 `DATE` 早於 `1992-02-01`。因為範例資料集開始於 1992 年 1 月，所以只會建立 1992 年 1 月的分割區。 ``` CREATE table my_lineitem_parq_partitioned WITH (partitioned_by = ARRAY['l_shipdate']) AS SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_commitdate, l_receiptdate, l_shipinstruct, l_comment, l_shipdate FROM tpch100.lineitem_parq_partitioned WHERE cast(l_shipdate as timestamp) < DATE ('1992-02-01'); ``` 1. 執行 `SHOW PARTITIONS` 命令，以確認資料表包含您想要的分割區。 ``` SHOW PARTITIONS my_lineitem_parq_partitioned; ``` 範例中的分割區是從 1992 年 1 月開始。 ``` /* l_shipdate=1992-01-02 l_shipdate=1992-01-03 l_shipdate=1992-01-04 l_shipdate=1992-01-05 l_shipdate=1992-01-06 l_shipdate=1992-01-07 l_shipdate=1992-01-08 l_shipdate=1992-01-09 l_shipdate=1992-01-10 l_shipdate=1992-01-11 l_shipdate=1992-01-12 l_shipdate=1992-01-13 l_shipdate=1992-01-14 l_shipdate=1992-01-15 l_shipdate=1992-01-16 l_shipdate=1992-01-17 l_shipdate=1992-01-18 l_shipdate=1992-01-19 l_shipdate=1992-01-20 l_shipdate=1992-01-21 l_shipdate=1992-01-22 l_shipdate=1992-01-23 l_shipdate=1992-01-24 l_shipdate=1992-01-25 l_shipdate=1992-01-26 l_shipdate=1992-01-27 l_shipdate=1992-01-28 l_shipdate=1992-01-29 l_shipdate=1992-01-30 l_shipdate=1992-01-31 */ ``` 1. 使用 `INSERT INTO` 陳述式將分割區新增至資料表。下列範例會針對 1992 年 2 月份的日期新增分割區。 ``` INSERT INTO my_lineitem_parq_partitioned SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_commitdate, l_receiptdate, l_shipinstruct, l_comment, l_shipdate FROM tpch100.lineitem_parq_partitioned WHERE cast(l_shipdate as timestamp) >= DATE ('1992-02-01') AND cast(l_shipdate as timestamp) < DATE ('1992-03-01'); ``` 1. 再次執行 `SHOW PARTITIONS`。 ``` SHOW PARTITIONS my_lineitem_parq_partitioned; ``` 範例資料表現在有來自 1992 年 1 月和 2 月的分割區。 ``` /* l_shipdate=1992-01-02 l_shipdate=1992-01-03 l_shipdate=1992-01-04 l_shipdate=1992-01-05 l_shipdate=1992-01-06 ... l_shipdate=1992-02-20 l_shipdate=1992-02-21 l_shipdate=1992-02-22 l_shipdate=1992-02-23 l_shipdate=1992-02-24 l_shipdate=1992-02-25 l_shipdate=1992-02-26 l_shipdate=1992-02-27 l_shipdate=1992-02-28 l_shipdate=1992-02-29 */ ``` 1. 繼續使用每個讀取和新增不超過 100 個分割區的 `INSERT INTO` 陳述式。繼續進行，直到達到您需要的分割區數目為止。 **重要** 設定 `WHERE` 條件時，請確定查詢不會重疊。否則，某些分割區可能會有重複的資料。