

# Setting up network access to data stores
<a name="start-connecting"></a>

To run your extract, transform, and load (ETL) jobs, AWS Glue must be able to access your data stores. If a job doesn't need to run in your virtual private cloud (VPC) subnet—for example, transforming data from Amazon S3 to Amazon S3—no additional configuration is needed.

If a job needs to run in your VPC subnet—for example, transforming data from a JDBC data store in a private subnet—AWS Glue sets up [elastic network interfaces](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_ElasticNetworkInterfaces.html) that enable your jobs to connect securely to other resources within your VPC. Each elastic network interface is assigned a private IP address from the IP address range within the subnet you specified. No public IP addresses are assigned. Security groups specified in the AWS Glue connection are applied on each of the elastic network interfaces. For more information, see [Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue](setup-vpc-for-glue-access.md).

All JDBC data stores that are accessed by the job must be reachable from the VPC subnet. To access Amazon S3 from within your VPC, a [VPC endpoint](vpc-endpoints-s3.md) is required. If your job needs to access both VPC resources and the public internet, the VPC must also have a Network Address Translation (NAT) gateway.
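As an illustration, a gateway endpoint for Amazon S3 can be created with the AWS CLI. This is a sketch, not a definitive setup: the VPC ID, route table ID, and Region below are placeholders, and the call requires appropriate IAM permissions:

```
# Create a gateway VPC endpoint so jobs in the VPC can reach Amazon S3
# without traversing the public internet. All IDs and the Region are
# placeholders -- substitute your own values.
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0example1234567890 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0example1234567890
```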

A job or development endpoint can access only one VPC (and subnet) at a time. If you need to access data stores in different VPCs, you have the following options:
+ Use VPC peering to access the data stores. For more about VPC peering, see [VPC Peering Basics](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html).
+ Use an Amazon S3 bucket as an intermediary storage location. Split the work into two jobs, with the Amazon S3 output of job 1 as the input to job 2.

For details on how to connect to an Amazon Redshift data store using Amazon VPC, see [Configuring Redshift connections](aws-glue-programming-etl-connect-redshift-home.md#aws-glue-programming-etl-connect-redshift-configure).

For details on how to connect to Amazon RDS data stores using Amazon VPC, see [Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue](setup-vpc-for-glue-access.md).

After the necessary rules are set in Amazon VPC, create a connection in AWS Glue with the properties required to connect to your data stores. For more information about connections, see [Connecting to data](glue-connections.md).
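For illustration, a JDBC connection with VPC settings can also be created from the AWS CLI. Every name and value below is a placeholder; substitute your own endpoint, credentials, subnet, and security group:

```
# Create an AWS Glue JDBC connection whose PhysicalConnectionRequirements
# place the elastic network interfaces in a specific subnet and security
# group. All values are placeholders for illustration.
aws glue create-connection --connection-input '{
    "Name": "my-jdbc-connection",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:postgresql://myhost.example.com:5432/mydb",
        "USERNAME": "my_user",
        "PASSWORD": "my_password"
    },
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-0example1234567890",
        "SecurityGroupIdList": ["sg-0example1234567890"],
        "AvailabilityZone": "us-east-1a"
    }
}'
```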

**Note**  
Make sure you set up your DNS environment for AWS Glue. For more information, see [Setting up DNS in your VPC](set-up-vpc-dns.md). 

**Topics**
+ [Setting up a VPC to connect to PyPI for AWS Glue](setup-vpc-for-pypi.md)
+ [Setting up DNS in your VPC](set-up-vpc-dns.md)

# Setting up a VPC to connect to PyPI for AWS Glue
<a name="setup-vpc-for-pypi"></a>

The Python Package Index (PyPI) is a repository of software for the Python programming language. This topic describes the configuration needed to support pip-installed packages (as specified by the session creator using the `--additional-python-modules` flag).

Using AWS Glue interactive sessions with a connector routes the session's network traffic through your VPC via the subnet specified for the connector. Consequently, AWS services and other network destinations are not available unless you set up additional configuration.

You can resolve this issue in one of the following ways:
+ Use an internet gateway that is reachable from your session.
+ Set up and use an Amazon S3 bucket hosting a PyPI/simple repository that contains the transitive closure of your package set's dependencies.
+ Use a CodeArtifact repository that mirrors PyPI and is attached to your VPC.

## Setting up an internet gateway
<a name="setup-vpc-for-pypi-internet-gateway"></a>

The technical aspects are detailed in [NAT gateway use cases](https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-scenarios.html). Because `--additional-python-modules` installs packages with pip, it requires network access to pypi.org, which is determined by the configuration of your VPC. Note the following requirements:

1. Installing additional Python modules via pip install applies to a user's session. If the session uses a connector, your configuration may be affected.

1. When a connector is used with `--additional-python-modules`, the subnet associated with the connector's `PhysicalConnectionRequirements` must provide a network path to pypi.org at the time the session is started.

1. Verify that your configuration provides this network path before starting sessions.
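One way to check the path is to test HTTPS reachability of pypi.org from a host in the connector's subnet; this fragment is illustrative and assumes `curl` is available:

```
# From a host in the connector's subnet, confirm that pypi.org is
# reachable over HTTPS. A 200 response means pip can reach the index.
curl --silent --max-time 10 --output /dev/null \
     --write-out "HTTP %{http_code}\n" https://pypi.org/simple/
```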

## Setting up an Amazon S3 bucket to host a targeted PyPI/simple repo
<a name="setup-vpc-for-pypi-s3-bucket"></a>

This example sets up a PyPI mirror in Amazon S3 for a set of packages and their dependencies.

To set up the PyPI mirror for a set of packages:

```
# pip download all the dependencies
pip download -d s3pypi --only-binary :all: plotly ggplot
pip download -d s3pypi --platform manylinux_2_17_x86_64 --only-binary :all: psycopg2-binary
# create and upload the pypi/simple index and wheel files to the s3 bucket
s3pypi -b test-domain-name --put-root-index -v s3pypi/*
```
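To confirm that the mirror serves a complete dependency set, you can attempt an install against it from a machine with access to the bucket. This is an illustrative fragment: the bucket name matches the placeholder used above, and the package pin is only an example:

```
# Attempt an install using only the S3-hosted index. If pip resolves
# and installs the package, the mirror contains the needed wheels.
pip install --no-cache-dir \
    --index-url https://test-domain-name.s3.amazonaws.com/ \
    --trusted-host test-domain-name.s3.amazonaws.com \
    psycopg2-binary==2.9.5
```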

If you already have an artifact repository, it has an index URL for pip that you can provide in place of the Amazon S3 bucket URL shown in the example above.

To use the custom index-url, with some example packages:

```
%%configure
{
    "--additional-python-modules": "psycopg2_binary==2.9.5",
    "python-modules-installer-option": "--no-cache-dir --verbose --index-url https://test-domain-name.s3.amazonaws.com/ --trusted-host test-domain-name.s3.amazonaws.com"
}
```

## Setting up a CodeArtifact mirror of pypi attached to your VPC
<a name="setup-vpc-for-pypi-code-artifact"></a>

To set up a mirror:

1. Create a repository in the same region as the subnet used by the connector.

   Select `Public upstream repositories` and choose `pypi-store`.

1. Provide access to the repository from the VPC that contains the subnet.

1. Specify the correct `--index-url` using the `python-modules-installer-option`. 

   ```
   %%configure
   {
       "--additional-python-modules": "psycopg2_binary==2.9.5",
       "python-modules-installer-option": "--no-cache-dir --verbose --index-url https://my-domain-111122223333.d.codeartifact.us-east-1.amazonaws.com/pypi/my-repo/simple/"
   }
   ```

   The index URL shown is the general form of a CodeArtifact pip endpoint; replace the domain, account ID, Region, and repository name with your own values.

For more information, see [Use CodeArtifact from a VPC](https://docs.aws.amazon.com/codeartifact/latest/ug/use-codeartifact-from-vpc.html).
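The VPC access in step 2 is typically provided through interface VPC endpoints for CodeArtifact. The commands below are a sketch: the IDs and Region are placeholders, while `codeartifact.api` and `codeartifact.repositories` are the two endpoint services CodeArtifact uses:

```
# Interface endpoints that let hosts in the VPC reach CodeArtifact.
# All IDs and the Region are placeholders -- substitute your own values.
aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Interface \
    --vpc-id vpc-0example1234567890 \
    --service-name com.amazonaws.us-east-1.codeartifact.api \
    --subnet-ids subnet-0example1234567890 \
    --security-group-ids sg-0example1234567890

aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Interface \
    --vpc-id vpc-0example1234567890 \
    --service-name com.amazonaws.us-east-1.codeartifact.repositories \
    --subnet-ids subnet-0example1234567890 \
    --security-group-ids sg-0example1234567890
```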

# Setting up DNS in your VPC
<a name="set-up-vpc-dns"></a>

Domain Name System (DNS) is a standard by which names used on the internet are resolved to their corresponding IP addresses. A DNS hostname uniquely names a computer and consists of a host name and a domain name. DNS servers resolve DNS hostnames to their corresponding IP addresses.

To set up DNS in your VPC, ensure that DNS hostnames and DNS resolution are both enabled in your VPC. The VPC network attributes `enableDnsHostnames` and `enableDnsSupport` must be set to `true`. To view and modify these attributes, go to the VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

For more information, see [Using DNS with your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html). You can also use the AWS CLI and call the [modify-vpc-attribute](https://docs.aws.amazon.com/cli/latest/reference/ec2/modify-vpc-attribute.html) command to configure the VPC network attributes.
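For example, both attributes can be enabled from the AWS CLI. The VPC ID below is a placeholder; note that `modify-vpc-attribute` accepts only one attribute per invocation, so two calls are needed:

```
# Enable DNS resolution and DNS hostnames for the VPC.
# modify-vpc-attribute accepts only one attribute per call.
aws ec2 modify-vpc-attribute --vpc-id vpc-0example1234567890 --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-0example1234567890 --enable-dns-hostnames "{\"Value\":true}"
```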

**Note**  
If you are using Route 53, confirm that your configuration does not override DNS network attributes.