

# Using PennyLane with Amazon Braket

Hybrid algorithms are algorithms that contain both classical and quantum instructions. The classical instructions run on classical hardware (an EC2 instance or your laptop), and the quantum instructions run either on a simulator or on a quantum computer. We recommend that you run hybrid algorithms using the Hybrid Jobs feature. For more information, see [When to use Amazon Braket Jobs](braket-jobs.md#braket-jobs-use).

Amazon Braket enables you to set up and run hybrid quantum algorithms with the assistance of the **Amazon Braket PennyLane plugin**, or with the **Amazon Braket Python SDK** and example notebook repositories. Amazon Braket example notebooks, based on the SDK, enable you to set up and run certain hybrid algorithms without the PennyLane plugin. However, we recommend PennyLane because it provides a richer experience.

**About hybrid quantum algorithms**

Hybrid quantum algorithms are important to the industry today because contemporary quantum computing devices generally produce noise, and therefore, errors. Every quantum gate added to a computation increases the chance of adding noise; therefore, long-running algorithms can be overwhelmed by noise, which results in faulty computation.

Pure quantum algorithms such as Shor's [(Quantum Phase Estimation example)](https://github.com/amazon-braket/amazon-braket-examples/tree/main/examples/advanced_circuits_algorithms/Quantum_Phase_Estimation) or Grover's [(Grover's example)](https://github.com/aws/amazon-braket-examples/tree/main/examples/advanced_circuits_algorithms/Grover) require thousands, or millions, of operations. For this reason, they can be impractical for existing quantum devices, which are generally referred to as *noisy intermediate-scale quantum* (NISQ) devices.

In hybrid quantum algorithms, quantum processing units (QPUs) work as co-processors for classical CPUs, specifically to speed up certain calculations in a classical algorithm. Circuit executions become much shorter, within reach of the capabilities of today's devices.
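The division of labor in a hybrid algorithm can be sketched as a toy variational loop: a classical optimizer repeatedly updates a parameter based on an expectation value that, in a real workload, would come from a parametrized circuit on a QPU or simulator. Here the quantum step is stubbed with a closed-form function purely for illustration.

```python
import math

def expectation(theta):
    # Stand-in for a quantum execution: in a real hybrid workload this value
    # would come from running a parametrized circuit on a QPU or simulator.
    return math.cos(theta)

def gradient(theta, eps=1e-6):
    # Classical finite-difference gradient of the (stubbed) quantum expectation
    return (expectation(theta + eps) - expectation(theta - eps)) / (2 * eps)

theta, stepsize = 0.5, 0.4
for _ in range(100):
    theta -= stepsize * gradient(theta)  # classical update step

# The optimizer drives the expectation toward its minimum of -1 at theta = pi
print(round(expectation(theta), 3))
```

Each loop iteration only needs one short circuit execution, which is what keeps hybrid algorithms within reach of NISQ devices; frameworks like PennyLane automate exactly this classical-update-from-quantum-measurement structure.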

**Topics**
+ [Amazon Braket with PennyLane](#pennylane-option)
+ [Hybrid algorithms in Amazon Braket example notebooks](#braket-hybrid-workflow)
+ [Hybrid algorithms with embedded PennyLane simulators](#hybrid-alorithms-pennylane)
+ [Adjoint gradient on PennyLane with Amazon Braket simulators](#adjoint-gradient-pennylane)
+ [Using Hybrid Jobs and PennyLane to run a QAOA algorithm](braket-jobs-run-qaoa-algorithm.md)
+ [Run hybrid workloads with PennyLane embedded simulators](pennylane-embedded-simulators.md)

## Amazon Braket with PennyLane


Amazon Braket provides support for [PennyLane](https://pennylane.ai), an open-source software framework built around the concept of *quantum differentiable programming*. You can use this framework to train quantum circuits in the same way that you would train a neural network to find solutions for computational problems in quantum chemistry, quantum machine learning, and optimization.

The PennyLane library provides interfaces to familiar machine learning tools, including PyTorch and TensorFlow, to make training quantum circuits quick and intuitive.
+ **The PennyLane library** — PennyLane is pre-installed in Amazon Braket notebooks. To access Amazon Braket devices from PennyLane, open a notebook and import the PennyLane library with the following command.

```
import pennylane as qml
```

Tutorial notebooks help you get started quickly. Alternatively, you can use PennyLane on Amazon Braket from an IDE of your choice.
+ **The Amazon Braket PennyLane plugin** — To use your own IDE, you can install the Amazon Braket PennyLane plugin manually. The plugin connects PennyLane with the [Amazon Braket Python SDK](https://github.com/aws/amazon-braket-sdk-python), so you can run circuits in PennyLane on Amazon Braket devices. To install the PennyLane plugin, use the following command.

```
pip install amazon-braket-pennylane-plugin
```

The following example demonstrates how to set up access to Amazon Braket devices in PennyLane:

```
# to use SV1
import pennylane as qml
sv1 = qml.device("braket.aws.qubit", device_arn="arn:aws:braket:::device/quantum-simulator/amazon/sv1", wires=2)

# to run a circuit:
@qml.qnode(sv1)
def circuit(x):
    qml.RZ(x, wires=0)
    qml.CNOT(wires=[0,1])
    qml.RY(x, wires=1)
    return qml.expval(qml.PauliZ(1))

result = circuit(0.543)


# To use the local simulator:
local = qml.device("braket.local.qubit", wires=2)
```

For tutorial examples and more information about PennyLane, see the [Amazon Braket examples repository](https://github.com/aws/amazon-braket-examples/tree/main/examples/pennylane).

The Amazon Braket PennyLane plugin enables you to switch between Amazon Braket QPU and embedded simulator devices in PennyLane with a single line of code. It offers two Amazon Braket quantum devices to work with PennyLane:
+  `braket.aws.qubit` for running with the Amazon Braket service's quantum devices, including QPUs and simulators
+  `braket.local.qubit` for running with the Amazon Braket SDK's local simulator

The Amazon Braket PennyLane plugin is open source. You can install it from the [PennyLane Plugin GitHub repository](https://github.com/aws/amazon-braket-pennylane-plugin-python).

For more information about PennyLane, see the documentation on the [PennyLane website](https://pennylane.ai).

## Hybrid algorithms in Amazon Braket example notebooks


Amazon Braket provides a variety of example notebooks that do not rely on the PennyLane plugin to run hybrid algorithms. You can get started with any of these [Amazon Braket hybrid example notebooks](https://github.com/aws/amazon-braket-examples/tree/main/examples/hybrid_quantum_algorithms) that illustrate *variational methods*, such as the Quantum Approximate Optimization Algorithm (QAOA) or the Variational Quantum Eigensolver (VQE).

The Amazon Braket example notebooks rely on the [Amazon Braket Python SDK](https://github.com/aws/amazon-braket-sdk-python). The SDK provides a framework to interact with quantum computing hardware devices through Amazon Braket. It is an open source library that is designed to assist you with the quantum portion of your hybrid workflow.

You can explore Amazon Braket further with our [example notebooks](https://github.com/aws/amazon-braket-examples).

## Hybrid algorithms with embedded PennyLane simulators


Amazon Braket Hybrid Jobs now comes with high-performance CPU- and GPU-based embedded simulators from [PennyLane](https://github.com/PennyLaneAI/pennylane-lightning). This family of simulators can be embedded directly within your hybrid jobs container and includes the fast state-vector `lightning.qubit` simulator, the `lightning.gpu` simulator accelerated by NVIDIA's [cuQuantum library](https://developer.nvidia.com/cuquantum-sdk), and others. These embedded simulators are ideally suited for variational algorithms, such as quantum machine learning, that can benefit from advanced methods such as the [adjoint differentiation method](https://docs.pennylane.ai/en/stable/introduction/interfaces.html#simulation-based-differentiation). You can run these embedded simulators on one or more CPU or GPU instances.

With Hybrid Jobs, you can now run your variational algorithm code using a combination of a classical co-processor and a QPU, an Amazon Braket on-demand simulator such as SV1, or directly using the embedded simulator from PennyLane.

The embedded simulator is already available in the Hybrid Jobs container; you only need to decorate your main Python function with the `@hybrid_job` decorator. To use the PennyLane `lightning.gpu` simulator, you also need to specify a GPU instance in the `InstanceConfig`, as shown in the following code snippet:

```
import pennylane as qml
from braket.jobs import hybrid_job
from braket.jobs.config import InstanceConfig


@hybrid_job(device="local:pennylane/lightning.gpu", instance_config=InstanceConfig(instanceType="ml.g4dn.xlarge"))
def function(wires):
    dev = qml.device("lightning.gpu", wires=wires)
    ...
```

Refer to the [example notebook](https://github.com/aws/amazon-braket-examples/blob/main/examples/hybrid_jobs/4_Embedded_simulators_in_Braket_Hybrid_Jobs/Embedded_simulators_in_Braket_Hybrid_Jobs.ipynb) to get started with using a PennyLane embedded simulator with Hybrid Jobs.

## Adjoint gradient on PennyLane with Amazon Braket simulators


With the PennyLane plugin for Amazon Braket, you can compute gradients using the adjoint differentiation method when running on the local state vector simulator or SV1.

 **Note:** To use the adjoint differentiation method, you must specify `diff_method='device'` in your `qnode`, and **not** `diff_method='adjoint'`. See the following example.

```
device_arn = "arn:aws:braket:::device/quantum-simulator/amazon/sv1"
dev = qml.device("braket.aws.qubit", wires=wires, shots=0, device_arn=device_arn)
                
@qml.qnode(dev, diff_method="device")
def cost_function(params):
    circuit(params)
    return qml.expval(cost_h)

gradient = qml.grad(cost_function)
initial_gradient = gradient(params0)
```

**Note**  
Currently, PennyLane computes grouping indices for QAOA Hamiltonians and uses them to split the Hamiltonian into multiple expectation values. If you want to use SV1's adjoint differentiation capability when running QAOA from PennyLane, you need to reconstruct the cost Hamiltonian by removing the grouping indices, like so: `cost_h, mixer_h = qml.qaoa.max_clique(g, constrained=False)`, followed by `cost_h = qml.Hamiltonian(cost_h.coeffs, cost_h.ops)`.

# Using Hybrid Jobs and PennyLane to run a QAOA algorithm


In this section, you use what you have learned to write an actual hybrid program using PennyLane with parametric compilation. You use the algorithm script to address a Quantum Approximate Optimization Algorithm (QAOA) problem. The program creates a cost function corresponding to a classical Max Cut optimization problem, specifies a parametrized quantum circuit, and uses a gradient descent method to optimize the parameters so that the cost function is minimized. In this example, we generate the problem graph in the algorithm script for simplicity, but for more typical use cases, the best practice is to provide the problem specification through a dedicated channel in the input data configuration. The flag `parametrize_differentiable` defaults to `True`, so you automatically get the benefits of improved runtime performance from parametric compilation on supported QPUs.

```
import os
import json
import time

from braket.jobs import save_job_result
from braket.jobs.metrics import log_metric

import networkx as nx
import pennylane as qml
from pennylane import numpy as np
from matplotlib import pyplot as plt

def init_pl_device(device_arn, num_nodes, shots, max_parallel):
    return qml.device(
        "braket.aws.qubit",
        device_arn=device_arn,
        wires=num_nodes,
        shots=shots,
        # Set s3_destination_folder=None to output task results to a default folder
        s3_destination_folder=None,
        parallel=True,
        max_parallel=max_parallel,
        parametrize_differentiable=True, # This flag is True by default.
    )

def start_here():
    input_dir = os.environ["AMZN_BRAKET_INPUT_DIR"]
    output_dir = os.environ["AMZN_BRAKET_JOB_RESULTS_DIR"]
    job_name = os.environ["AMZN_BRAKET_JOB_NAME"]
    checkpoint_dir = os.environ["AMZN_BRAKET_CHECKPOINT_DIR"]
    hp_file = os.environ["AMZN_BRAKET_HP_FILE"]
    device_arn = os.environ["AMZN_BRAKET_DEVICE_ARN"]

    # Read the hyperparameters
    with open(hp_file, "r") as f:
        hyperparams = json.load(f)

    p = int(hyperparams["p"])
    seed = int(hyperparams["seed"])
    max_parallel = int(hyperparams["max_parallel"])
    num_iterations = int(hyperparams["num_iterations"])
    stepsize = float(hyperparams["stepsize"])
    shots = int(hyperparams["shots"])

    # Generate random graph
    num_nodes = 6
    num_edges = 8
    graph_seed = 1967
    g = nx.gnm_random_graph(num_nodes, num_edges, seed=graph_seed)

    # Output figure to file
    positions = nx.spring_layout(g, seed=seed)
    nx.draw(g, with_labels=True, pos=positions, node_size=600)
    plt.savefig(f"{output_dir}/graph.png")

    # Set up the QAOA problem
    cost_h, mixer_h = qml.qaoa.maxcut(g)

    def qaoa_layer(gamma, alpha):
        qml.qaoa.cost_layer(gamma, cost_h)
        qml.qaoa.mixer_layer(alpha, mixer_h)

    def circuit(params, **kwargs):
        for i in range(num_nodes):
            qml.Hadamard(wires=i)
        qml.layer(qaoa_layer, p, params[0], params[1])

    dev = init_pl_device(device_arn, num_nodes, shots, max_parallel)

    np.random.seed(seed)
    cost_function = qml.ExpvalCost(circuit, cost_h, dev, optimize=True)
    params = 0.01 * np.random.uniform(size=[2, p])

    optimizer = qml.GradientDescentOptimizer(stepsize=stepsize)
    print("Optimization start")

    for iteration in range(num_iterations):
        t0 = time.time()

        # Evaluates the cost, then does a gradient step to new params
        params, cost_before = optimizer.step_and_cost(cost_function, params)
        # Convert cost_before to a float so it's easier to handle
        cost_before = float(cost_before)

        t1 = time.time()

        if iteration == 0:
            print("Initial cost:", cost_before)
        else:
            print(f"Cost at step {iteration}:", cost_before)

        # Log the current loss as a metric
        log_metric(
            metric_name="Cost",
            value=cost_before,
            iteration_number=iteration,
        )

        print(f"Completed iteration {iteration + 1}")
        print(f"Time to complete iteration: {t1 - t0} seconds")

    final_cost = float(cost_function(params))
    log_metric(
        metric_name="Cost",
        value=final_cost,
        iteration_number=num_iterations,
    )

    # We're done with the hybrid job, so save the result.
    # This will be returned in job.result()
    save_job_result({"params": params.numpy().tolist(), "cost": final_cost})
```

**Note**  
Parametric compilation is supported on all superconducting, gate-based QPUs from Rigetti Computing, with the exception of pulse-level programs.

# Run hybrid workloads with PennyLane embedded simulators


Let's look at how you can use embedded simulators from PennyLane on Amazon Braket Hybrid Jobs to run hybrid workloads. PennyLane's GPU-based embedded simulator, `lightning.gpu`, uses the [NVIDIA cuQuantum library](https://developer.nvidia.com/cuquantum-sdk) to accelerate circuit simulations. The embedded GPU simulator is pre-configured in all of the Braket [job containers](https://github.com/amazon-braket/amazon-braket-containers), which users can use out of the box. On this page, we show you how to use `lightning.gpu` to speed up your hybrid workloads.

## Using `lightning.gpu` for QAOA workloads


Consider the Quantum Approximate Optimization Algorithm (QAOA) examples from this [notebook](https://github.com/amazon-braket/amazon-braket-examples/tree/main/examples/hybrid_jobs/2_Using_PennyLane_with_Braket_Hybrid_Jobs). To select an embedded simulator, specify the `device` argument as a string of the form `"local:<provider>/<simulator_name>"`; for example, `"local:pennylane/lightning.gpu"` for `lightning.gpu`. The device string you give the hybrid job when you launch it is passed to the job as the environment variable `AMZN_BRAKET_DEVICE_ARN`.

```
device_string = os.environ["AMZN_BRAKET_DEVICE_ARN"]
prefix, device_name = device_string.split("/")
device = qml.device(device_name, wires=n_wires)
```
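Because the device string contains exactly one `/`, a plain string split recovers the simulator name. The following stdlib-only sketch shows what the snippet above extracts; the device string value is hard-coded here for illustration.

```python
# Example value of the AMZN_BRAKET_DEVICE_ARN environment variable inside the job
device_string = "local:pennylane/lightning.gpu"

prefix, device_name = device_string.split("/")
print(prefix)       # local:pennylane
print(device_name)  # lightning.gpu
```

The recovered `device_name` is what gets passed to `qml.device` above, so the same algorithm script can target `lightning.qubit` or `lightning.gpu` purely by changing the `device` argument at launch time.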

On this page, we compare the two embedded PennyLane state-vector simulators, `lightning.qubit` (CPU-based) and `lightning.gpu` (GPU-based), and provide the simulators with custom gate decompositions to compute various gradients.

Now you are ready to prepare the hybrid job launching script. Run the QAOA algorithm using two instance types: `ml.m5.2xlarge` and `ml.g4dn.xlarge`. The `ml.m5.2xlarge` instance type is comparable to a standard developer laptop. The `ml.g4dn.xlarge` is an accelerated computing instance with a single NVIDIA T4 GPU and 16 GB of GPU memory.

To run on the GPU, we first need to specify a compatible image and the correct instance (which defaults to an `ml.m5.2xlarge` instance).

```
from braket.aws import AwsSession
from braket.jobs.config import InstanceConfig
from braket.jobs.image_uris import Framework, retrieve_image

image_uri = retrieve_image(Framework.PL_PYTORCH, AwsSession().region)
instance_config = InstanceConfig(instanceType="ml.g4dn.xlarge")
```

We then pass these to the hybrid job decorator, along with updated device parameters, in both the system and hybrid job arguments.

```
import os

import networkx as nx
import pennylane as qml
from pennylane import numpy as np
from pennylane import qaoa

from braket.jobs import hybrid_job
from braket.tracking import Tracker

@hybrid_job(
        device="local:pennylane/lightning.gpu",
        input_data=input_file_path,
        image_uri=image_uri,
        instance_config=instance_config)
def run_qaoa_hybrid_job_gpu(p=1, steps=10):
    params = np.random.rand(2, p)

    braket_task_tracker = Tracker()

    graph = nx.read_adjlist(input_file_path, nodetype=int)
    wires = list(graph.nodes)
    cost_h, _mixer_h = qaoa.maxcut(graph)

    device_string = os.environ["AMZN_BRAKET_DEVICE_ARN"]
    prefix, device_name = device_string.split("/")
    dev = qml.device(device_name, wires=len(wires))
    ...
```

**Note**  
If you specify the `instance_config` as using a GPU-based instance, but choose the `device` to be the embedded CPU-based simulator (`lightning.qubit`), the GPU will not be used. Make sure to use the embedded GPU-based simulator if you wish to target the GPU.

The mean iteration time for the `ml.m5.2xlarge` instance is about 73 seconds, while for the `ml.g4dn.xlarge` instance it is about 0.6 seconds. For this 21-qubit workload, the GPU instance gives a speedup of more than 100x. If you look at the Amazon Braket Hybrid Jobs [pricing page](https://aws.amazon.com/braket/pricing/), you can see that the cost per minute for an `ml.m5.2xlarge` instance is \$0.00768, while for the `ml.g4dn.xlarge` instance it is \$0.01227. In this case, it is both faster and cheaper to run on the GPU instance.
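As a sanity check on those numbers, here is the per-iteration cost arithmetic; the rates and timings are the figures quoted above, so consult the pricing page for current values before relying on them.

```python
# Per-minute rates and mean iteration times quoted above; verify current
# pricing on the Amazon Braket pricing page before relying on these numbers.
cpu_rate_per_min, gpu_rate_per_min = 0.00768, 0.01227
cpu_iter_s, gpu_iter_s = 73, 0.6

cpu_cost = cpu_iter_s / 60 * cpu_rate_per_min  # cost of one CPU iteration
gpu_cost = gpu_iter_s / 60 * gpu_rate_per_min  # cost of one GPU iteration

print(f"CPU cost per iteration: ${cpu_cost:.5f}")
print(f"GPU cost per iteration: ${gpu_cost:.5f}")
print(f"Speedup: {cpu_iter_s / gpu_iter_s:.0f}x")
```

Even though the GPU instance has a higher per-minute rate, the much shorter iteration time dominates, which is why the GPU run ends up cheaper overall.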

## Quantum machine learning and data parallelism


If your workload is quantum machine learning (QML) that trains on datasets, you can further accelerate it using data parallelism. In QML, the model contains one or more quantum circuits, and may also contain classical neural nets. When training the model with the dataset, the parameters in the model are updated to minimize the loss function. A loss function is usually defined for a single data point, and the total loss is the average loss over the whole dataset. In QML, the losses are usually computed serially before being averaged into the total loss for gradient computations. This procedure is time-consuming, especially when there are hundreds of data points.

Because the loss from one data point does not depend on the others, the losses can be evaluated in parallel: losses and gradients associated with different data points can be evaluated at the same time. This is known as data parallelism. With SageMaker's distributed data parallel library, Amazon Braket Hybrid Jobs makes it easier for you to use data parallelism to accelerate your training.
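The key property, that per-point losses are independent and the total loss is their average, can be illustrated without any ML framework: splitting the dataset across workers and averaging the shard results gives the same total loss as a serial pass. A minimal sketch with toy loss values:

```python
# Toy per-data-point losses (exact binary fractions, so averages match exactly)
losses = [0.75, 0.25, 0.5, 1.0, 0.25, 0.5, 0.75, 1.0]

# Serial: one pass over the whole dataset
serial_loss = sum(losses) / len(losses)

# Data parallel: four workers each take a disjoint shard and work independently
world_size = 4
shards = [losses[rank::world_size] for rank in range(world_size)]
shard_sums = [sum(shard) for shard in shards]  # these can run at the same time
parallel_loss = sum(shard_sums) / len(losses)

print(serial_loss, parallel_loss)  # both 0.625
```

The only coordination needed is the final reduction of the shard sums, which is exactly the communication step a distributed data parallel library optimizes.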

Consider the following QML workload for data parallelism, which uses the [Sonar dataset](https://archive.ics.uci.edu/dataset/151/connectionist+bench+sonar+mines+vs+rocks) from the well-known UCI repository as an example for binary classification. The Sonar dataset has 208 data points, each with 60 features collected from sonar signals bouncing off materials. Each data point is labeled either "M" for mines or "R" for rocks. Our QML model consists of an input layer, a quantum circuit as a hidden layer, and an output layer. The input and output layers are classical neural nets implemented in PyTorch. The quantum circuit is integrated with the PyTorch neural nets using PennyLane's `qml.qnn` module. See our [example notebooks](https://github.com/aws/amazon-braket-examples) for more detail about the workload. As in the QAOA example above, you can harness the power of GPUs by using embedded GPU-based simulators, such as PennyLane's `lightning.gpu`, to improve performance over embedded CPU-based simulators.

To create a hybrid job, you can call `AwsQuantumJob.create` and specify the algorithm script, device, and other configurations through its keyword arguments.

```
instance_config = InstanceConfig(instanceType='ml.g4dn.xlarge')

hyperparameters={"nwires": "10",
                 "ndata": "32",
                 ...
}

job = AwsQuantumJob.create(
    device="local:pennylane/lightning.gpu",
    source_module="qml_source",
    entry_point="qml_source.train_single",
    hyperparameters=hyperparameters,
    instance_config=instance_config,
    ...
)
```

In order to use data parallelism, you need to modify a few lines of code in the algorithm script so that the SageMaker distributed library correctly parallelizes the training. First, you import the `smdistributed` package, which does most of the heavy lifting for distributing your workloads across multiple GPUs and multiple instances. This package is preconfigured in the Braket PyTorch and TensorFlow containers. The `dist` module tells our algorithm script the total number of GPUs for the training (`world_size`), as well as the `rank` and `local_rank` of a GPU core. `rank` is the absolute index of a GPU across all instances, while `local_rank` is the index of a GPU within an instance. For example, if there are four instances, each with eight GPUs allocated for the training, `rank` ranges from 0 to 31 and `local_rank` ranges from 0 to 7.

```
import smdistributed.dataparallel.torch.distributed as dist

dp_info = {
    "world_size": dist.get_world_size(),
    "rank": dist.get_rank(),
    "local_rank": dist.get_local_rank(),
}
batch_size //= dp_info["world_size"] // 8
batch_size = max(batch_size, 1)
```
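To make the rank bookkeeping concrete, here is the mapping for the four-instances-with-eight-GPUs example in plain Python (no `smdistributed` required; the numbers mirror the description above):

```python
# Four instances with eight GPUs each, as in the example above
num_instances, gpus_per_instance = 4, 8
world_size = num_instances * gpus_per_instance  # 32 GPUs in total

ranks = range(world_size)
instance_of = [rank // gpus_per_instance for rank in ranks]   # which instance a GPU sits on
local_rank_of = [rank % gpus_per_instance for rank in ranks]  # GPU index within that instance

print(min(ranks), max(ranks))                  # rank runs 0..31
print(min(local_rank_of), max(local_rank_of))  # local_rank cycles 0..7

# The snippet above divides the global batch size by world_size // 8,
# i.e. by the number of instances when each instance has eight GPUs
batch_size = 64
batch_size //= world_size // 8
batch_size = max(batch_size, 1)
print(batch_size)  # 16
```

Each instance then processes its reduced batch concurrently, so the effective global batch size per step stays close to the original value.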

Next, you define a `DistributedSampler` according to the `world_size` and `rank` and then pass it into the data loader. This sampler avoids GPUs accessing the same slice of a dataset.

```
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset,
    num_replicas=dp_info["world_size"],
    rank=dp_info["rank"]
)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=True,
    sampler=train_sampler,
)
```

Next, you use the `DistributedDataParallel` class to enable data parallelism.

```
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

model = DressedQNN(qc_dev).to(device)
model = DDP(model)
torch.cuda.set_device(dp_info["local_rank"])
model.cuda(dp_info["local_rank"])
```

These are the changes you need to make to use data parallelism. In QML, you often want to save results and print training progress. If every GPU runs the save and print commands, the log is flooded with repeated information and the results overwrite each other. To avoid this, save and print only from the GPU that has `rank` 0.

```
if dp_info["rank"] == 0:
    print('elapsed time: ', elapsed)
    torch.save(model.state_dict(), f"{output_dir}/test_local.pt")
    save_job_result({"last loss": loss_before})
```

Amazon Braket Hybrid Jobs supports the `ml.g4dn.12xlarge` instance type for the SageMaker distributed data parallel library. You configure the instance type through the `InstanceConfig` argument in Hybrid Jobs. For the SageMaker distributed data parallel library to know that data parallelism is enabled, you need to add two additional hyperparameters: `"sagemaker_distributed_dataparallel_enabled"` set to `"true"`, and `"sagemaker_instance_type"` set to the instance type you are using. These two hyperparameters are used by the `smdistributed` package; your algorithm script does not need to use them explicitly. The Amazon Braket SDK provides a convenient keyword argument, `distribution`. With `distribution="data_parallel"` in hybrid job creation, the Amazon Braket SDK automatically inserts the two hyperparameters for you. If you use the Amazon Braket API, you need to include these two hyperparameters yourself.

With the instance and data parallelism configured, you can now submit your hybrid job. There are four GPUs in an `ml.g4dn.12xlarge` instance. When you set `instanceCount=1`, the workload is distributed across the four GPUs in the instance. When you set `instanceCount` greater than one, the workload is distributed across the GPUs available in all instances. When using multiple instances, each instance incurs a charge based on how long you use it. For example, when you use four instances, the billable time is four times the run time per instance, because four instances run your workloads at the same time.
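The billing behavior described above is simple arithmetic; this hypothetical helper (not part of any Braket API) makes it explicit:

```python
def billable_instance_minutes(instance_count, runtime_minutes):
    # Instances run concurrently, so wall-clock time stays roughly constant,
    # but each instance is billed separately for its own run time.
    return instance_count * runtime_minutes

print(billable_instance_minutes(1, 30))  # 30
print(billable_instance_minutes(4, 30))  # 120: four instances, same wall-clock time
```

The trade-off is worthwhile when adding instances shortens the run time enough to offset the multiplied billing, as in the data-parallel training described here.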

```
instance_config = InstanceConfig(instanceType='ml.g4dn.12xlarge',
                                 instanceCount=1,
)

hyperparameters={"nwires": "10",
                 "ndata": "32",
                 ...,
}

job = AwsQuantumJob.create(
    device="local:pennylane/lightning.gpu",
    source_module="qml_source",
    entry_point="qml_source.train_dp",
    hyperparameters=hyperparameters,
    instance_config=instance_config,
    distribution="data_parallel",
    ...
)
```

**Note**  
In the above hybrid job creation, `train_dp.py` is the modified algorithm script for using data parallelism. Keep in mind that data parallelism only works correctly when you modify your algorithm script according to the above section. If the data parallelism option is enabled without a correctly modified algorithm script, the hybrid job may throw errors, or each GPU may repeatedly process the same data slice, which is inefficient.

Used correctly, multiple instances can reduce both time and cost by orders of magnitude. See the [example notebook](https://github.com/amazon-braket/amazon-braket-examples/blob/main/examples/hybrid_jobs/5_Parallelize_training_for_QML/Parallelize_training_for_QML.ipynb) for more details.