

# Training a model using Neptune ML
Model training

After you have processed the data that you exported from Neptune for model training, you can start a model-training job using a command like the following:

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-model-training-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --id "(a unique model-training job ID)" \
  --data-processing-job-id "(the data-processing job-id of a completed job)" \
  --train-model-s3-location "s3://(your S3 bucket)/neptune-model-graph-autotrainer"
```

For more information, see [start-ml-model-training-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-model-training-job.html) in the AWS CLI Command Reference.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)

response = client.start_ml_model_training_job(
    id='(a unique model-training job ID)',
    dataProcessingJobId='(the data-processing job-id of a completed job)',
    trainModelS3Location='s3://(your S3 bucket)/neptune-model-graph-autotrainer'
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/modeltraining \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "id" : "(a unique model-training job ID)",
        "dataProcessingJobId" : "(the data-processing job-id of a completed job)",
        "trainModelS3Location" : "s3://(your S3 bucket)/neptune-model-graph-autotrainer"
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/modeltraining \
  -H 'Content-Type: application/json' \
  -d '{
        "id" : "(a unique model-training job ID)",
        "dataProcessingJobId" : "(the data-processing job-id of a completed job)",
        "trainModelS3Location" : "s3://(your S3 bucket)/neptune-model-graph-autotrainer"
      }'
```

------

The details of how to use this command are explained in [The modeltraining command](machine-learning-api-modeltraining.md), along with information about how to get the status of a running job, how to stop a running job, and how to list all running jobs.

You can also supply a `previousModelTrainingJobId` to use information from a completed Neptune ML model training job to accelerate the hyperparameter search in a new training job. This is useful during [model retraining on new graph data](machine-learning-overview-evolving-data-incremental.md#machine-learning-overview-model-retraining), as well as [incremental training on the same graph data](machine-learning-overview-evolving-data-incremental.md#machine-learning-overview-incremental). Use a command like this one:

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-model-training-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --id "(a unique model-training job ID)" \
  --data-processing-job-id "(the data-processing job-id of a completed job)" \
  --train-model-s3-location "s3://(your S3 bucket)/neptune-model-graph-autotrainer" \
  --previous-model-training-job-id "(the model-training job-id of a completed job)"
```

For more information, see [start-ml-model-training-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-model-training-job.html) in the AWS CLI Command Reference.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)

response = client.start_ml_model_training_job(
    id='(a unique model-training job ID)',
    dataProcessingJobId='(the data-processing job-id of a completed job)',
    trainModelS3Location='s3://(your S3 bucket)/neptune-model-graph-autotrainer',
    previousModelTrainingJobId='(the model-training job-id of a completed job)'
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/modeltraining \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "id" : "(a unique model-training job ID)",
        "dataProcessingJobId" : "(the data-processing job-id of a completed job)",
        "trainModelS3Location" : "s3://(your S3 bucket)/neptune-model-graph-autotrainer",
        "previousModelTrainingJobId" : "(the model-training job-id of a completed job)"
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/modeltraining \
  -H 'Content-Type: application/json' \
  -d '{
        "id" : "(a unique model-training job ID)",
        "dataProcessingJobId" : "(the data-processing job-id of a completed job)",
        "trainModelS3Location" : "s3://(your S3 bucket)/neptune-model-graph-autotrainer",
        "previousModelTrainingJobId" : "(the model-training job-id of a completed job)"
      }'
```

------

You can train your own model implementation on the Neptune ML training infrastructure by supplying a `customModelTrainingParameters` object, like this:

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-model-training-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --id "(a unique model-training job ID)" \
  --data-processing-job-id "(the data-processing job-id of a completed job)" \
  --train-model-s3-location "s3://(your Amazon S3 bucket)/neptune-model-graph-autotrainer" \
  --model-name "custom" \
  --custom-model-training-parameters '{
    "sourceS3DirectoryPath": "s3://(your Amazon S3 bucket)/(path to your Python module)",
    "trainingEntryPointScript": "(your training script entry-point name in the Python module)",
    "transformEntryPointScript": "(your transform script entry-point name in the Python module)"
  }'
```

For more information, see [start-ml-model-training-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-model-training-job.html) in the AWS CLI Command Reference.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)

response = client.start_ml_model_training_job(
    id='(a unique model-training job ID)',
    dataProcessingJobId='(the data-processing job-id of a completed job)',
    trainModelS3Location='s3://(your Amazon S3 bucket)/neptune-model-graph-autotrainer',
    modelName='custom',
    customModelTrainingParameters={
        'sourceS3DirectoryPath': 's3://(your Amazon S3 bucket)/(path to your Python module)',
        'trainingEntryPointScript': '(your training script entry-point name in the Python module)',
        'transformEntryPointScript': '(your transform script entry-point name in the Python module)'
    }
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/modeltraining \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "id" : "(a unique model-training job ID)",
        "dataProcessingJobId" : "(the data-processing job-id of a completed job)",
        "trainModelS3Location" : "s3://(your Amazon S3 bucket)/neptune-model-graph-autotrainer",
        "modelName": "custom",
        "customModelTrainingParameters" : {
          "sourceS3DirectoryPath": "s3://(your Amazon S3 bucket)/(path to your Python module)",
          "trainingEntryPointScript": "(your training script entry-point name in the Python module)",
          "transformEntryPointScript": "(your transform script entry-point name in the Python module)"
        }
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/modeltraining \
  -H 'Content-Type: application/json' \
  -d '{
        "id" : "(a unique model-training job ID)",
        "dataProcessingJobId" : "(the data-processing job-id of a completed job)",
        "trainModelS3Location" : "s3://(your Amazon S3 bucket)/neptune-model-graph-autotrainer",
        "modelName": "custom",
        "customModelTrainingParameters" : {
          "sourceS3DirectoryPath": "s3://(your Amazon S3 bucket)/(path to your Python module)",
          "trainingEntryPointScript": "(your training script entry-point name in the Python module)",
          "transformEntryPointScript": "(your transform script entry-point name in the Python module)"
        }
      }'
```

------



See [The modeltraining command](machine-learning-api-modeltraining.md) for more information, such as about how to get the status of a running job, how to stop a running job, and how to list all running jobs. See [Custom models in Neptune ML](machine-learning-custom-models.md) for information about how to implement and use a custom model.

**Topics**
+ [Models and model training in Amazon Neptune ML](machine-learning-models-and-training.md)
+ [Customizing model hyperparameter configurations in Neptune ML](machine-learning-customizing-hyperparams.md)
+ [Model training best practices](machine-learning-improve-model-performance.md)

# Models and model training in Amazon Neptune ML
Models and training

Neptune ML uses graph neural networks (GNNs) to create models for the various machine-learning tasks. Graph neural networks have achieved state-of-the-art results on graph machine-learning tasks and excel at extracting informative patterns from graph-structured data.

## Graph neural networks (GNNs) in Neptune ML
Graph neural networks

Graph Neural Networks (GNNs) belong to a family of neural networks that compute node representations by taking into account the structure and features of nearby nodes. GNNs complement other traditional machine learning and neural network methods that are not well-suited for graph data.

GNNs are used to solve machine-learning tasks such as node classification and regression (predicting properties of nodes), edge classification and regression (predicting properties of edges), and link prediction (predicting whether two nodes in the graph should be connected).

In general, using a GNN for a machine learning task involves two stages:
+ An encoding stage, where the GNN computes a d-dimensional vector for each node in the graph. These vectors are also called *representations* or *embeddings*. 
+ A decoding stage, which makes predictions based on the encoded representations.

For node classification and regression, the node representations are used directly for the classification and regression tasks. For edge classification and regression, the node representations of the incident nodes on an edge are used as input for the classification or regression. For link prediction, an edge likelihood score is computed by using a pair of node representations and an edge type representation.

The [Deep Graph Library (DGL)](https://www.dgl.ai/) facilitates the efficient definition and training of GNNs for these tasks.

Different GNN models are unified under the formulation of message passing. In this view, the representation of a node in a graph is calculated using the representations of the node's neighbors (the messages), together with the node's initial representation. In Neptune ML, the initial representation of a node is derived from the features extracted from its node properties, or is learnable and depends on the identity of the node.
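As an illustrative sketch (not Neptune ML's actual implementation), one message-passing round on a toy graph might combine each node's own vector with the mean of its neighbors' vectors:

```python
# One round of mean-aggregation message passing on a toy graph.
# Real GNN layers also apply learnable weight matrices and
# nonlinearities; this sketch shows only the aggregation idea.

def message_passing_round(features, neighbors):
    updated = {}
    for node, own in features.items():
        nbrs = neighbors.get(node, [])
        # The "message" is the mean of the neighbors' representations
        msg = sum(features[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        # Combine the node's own representation with the message
        updated[node] = 0.5 * own + 0.5 * msg
    return updated

# Toy path graph a - b - c with 1-dimensional "features"
features = {"a": 1.0, "b": 0.0, "c": 1.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

round1 = message_passing_round(features, neighbors)
print(round1)  # {'a': 0.5, 'b': 0.5, 'c': 0.5}
```

Stacking such rounds is what lets a multi-layer network pull in information from progressively larger neighborhoods.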

Neptune ML also provides the option to concatenate node features and learnable node representations to serve as the original node representation.

For the various tasks in Neptune ML involving graphs with node properties, we use the [Relational Graph Convolutional Network](https://arxiv.org/abs/1703.06103) (R-GCN) to perform the encoding stage. R-GCN is a GNN architecture that is well-suited for graphs that have multiple node and edge types (known as heterogeneous graphs).

The R-GCN network consists of a fixed number of layers, stacked one after the other. Each layer of the R-GCN uses its learnable model parameters to aggregate information from the immediate, 1-hop neighborhood of a node. Since subsequent layers use the previous layer's output representations as input, the radius of the graph neighborhood that influences a node's final embedding depends on the number of layers (`num-layer`) of the R-GCN network.

For example, this means that a 2-layer network uses information from nodes that are 2 hops away.

To learn more about GNNs, see [A Comprehensive Survey on Graph Neural Networks](https://arxiv.org/abs/1901.00596). For more information about the Deep Graph Library (DGL), visit the DGL [webpage](https://www.dgl.ai/). For a hands-on tutorial about using DGL with GNNs, see [Learning graph neural networks with Deep Graph Library](https://www.amazon.science/videos-webinars/learning-graph-neural-networks-with-deep-graph-library).

## Training Graph Neural Networks
Training GNNs

In machine learning, the process of getting a model to learn how to make good predictions for a task is called model training. This is usually performed by specifying a particular objective to optimize, as well as an algorithm to use to perform this optimization.

The same process is used when training a GNN to learn good representations for the downstream task. We create an objective function for that task that is minimized during model training. For example, for node classification, we use [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) as the objective, which penalizes misclassifications, and for node regression we minimize [MeanSquareError](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html).

The objective is usually a loss function that takes the model predictions for a particular data point and compares them to the ground-truth value for that data point. It returns the loss value, which shows how far off the model's predictions are. The goal of the training process is to minimize the loss and ensure that model predictions are close to the ground-truth.
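To make the loss idea concrete, here is a minimal, hypothetical sketch of the two objectives mentioned above (actual training uses the PyTorch implementations, which operate on logits and batched tensors):

```python
import math

def cross_entropy_loss(predicted_probs, true_class):
    # Penalizes assigning low probability to the ground-truth class
    return -math.log(predicted_probs[true_class])

def mean_squared_error(predictions, targets):
    # Average squared difference between predictions and ground truth
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# A confident correct prediction gives a small loss ...
print(cross_entropy_loss([0.9, 0.05, 0.05], 0))   # ~0.105
# ... and a confident wrong one gives a large loss.
print(cross_entropy_loss([0.05, 0.9, 0.05], 0))   # ~3.0
print(mean_squared_error([2.5, 0.0], [3.0, -0.5]))  # 0.25
```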

The optimization algorithm used in deep learning for the training process is usually a variant of gradient descent. In Neptune ML, we use [Adam](https://arxiv.org/pdf/1412.6980.pdf), which is an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments.

While the model training process tries to ensure that the learned model parameters are close to the minima of the objective function, the overall performance of a model also depends on the model's *hyperparameters*, which are model settings that aren't learned by the training algorithm. For example, the dimensionality of the learned node representation, `num-hidden`, is a hyperparameter that affects model performance. Therefore, it is common in machine learning to perform hyperparameter optimization (HPO) to choose the suitable hyperparameters.

Neptune ML uses a SageMaker AI hyperparameter tuning job to launch multiple instances of model training with different hyperparameter configurations, to try to find the best model across a range of hyperparameter settings. See [Customizing model hyperparameter configurations in Neptune ML](machine-learning-customizing-hyperparams.md).

## Knowledge graph embedding models in Neptune ML
Knowledge graph embedding models

Knowledge graphs (KGs) are graphs that encode information about different entities (nodes) and their relations (edges). In Neptune ML, knowledge graph embedding models are applied by default for performing link prediction when the graph contains no node properties, only relations with other nodes. Although R-GCN models with learnable embeddings can also be used for these graphs (by specifying the model type as `"rgcn"`), knowledge graph embedding models are simpler and are designed to be effective at learning representations for large-scale knowledge graphs.

Knowledge graph embedding models are used in a link prediction task to predict the nodes or relations that complete a triple `(h, r, t)` where `h` is the source node, `r` is the relation type and `t` is the destination node.

The knowledge graph embedding models implemented in Neptune ML are `distmult`, `transE`, and `rotatE`. To learn more about knowledge graph embedding models, see [DGL-KE](https://github.com/awslabs/dgl-ke).
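The scoring functions behind two of these models can be sketched in a few lines. This is a simplified illustration with toy integer embeddings, not the DGL-KE implementation (and `rotatE`, which uses complex-valued rotations, is omitted):

```python
def distmult_score(h, r, t):
    # DistMult scores a triple (h, r, t) by the sum of elementwise
    # products of the head, relation, and tail embeddings.
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def transe_score(h, r, t):
    # TransE treats the relation as a translation: in a plausible
    # triple, h + r lies close to t (higher score = more plausible).
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

print(distmult_score([1, 2], [2, 1], [2, 1]))   # 6
h, r = [1, 0], [0, 1]
print(transe_score(h, r, [1, 1]))               # 0  (h + r lands exactly on t)
print(transe_score(h, r, [0, 0]))               # -2 (implausible triple)
```

During link prediction, candidate triples are ranked by these scores to predict the missing node or relation.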

## Training custom models in Neptune ML
Custom models

Neptune ML lets you define and implement custom models of your own, for particular scenarios. See [Custom models in Neptune ML](machine-learning-custom-models.md) for information about how to implement a custom model and how to use Neptune ML infrastructure to train it.

# Customizing model hyperparameter configurations in Neptune ML
Customizing hyperparameters

When you start a Neptune ML model-training job, Neptune ML automatically uses the information inferred from the preceding [data-processing](machine-learning-on-graphs-processing.md) job. It uses the information to generate hyperparameter configuration ranges that are used to create a [SageMaker AI hyperparameter tuning job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) to train multiple models for your task. That way, you don’t have to specify a long list of hyperparameter values for the models to be trained with. Instead, the model hyperparameter ranges and defaults are selected based on the task type, graph type, and the tuning-job settings.

However, you can also override the default hyperparameter configuration and provide custom hyperparameters by modifying a JSON configuration file that the data-processing job generates.

Using the Neptune ML [modelTraining API](machine-learning-api-modeltraining.md), you can control several high-level hyperparameter tuning job settings like `maxHPONumberOfTrainingJobs`, `maxHPOParallelTrainingJobs`, and `trainingInstanceType`. For more fine-grained control over the model hyperparameters, you can customize the `model-HPO-configuration.json` file that the data-processing job generates. The file is saved in the Amazon S3 location that you specified for processing-job output.

You can download the file, edit it to override the default hyperparameter configurations, and upload it back to the same Amazon S3 location. Do not change the name of the file, and be careful to follow these instructions as you edit.

To download the file from Amazon S3:

```
aws s3 cp \
  s3://(bucket name)/(path to output folder)/model-HPO-configuration.json \
  ./
```

When you have finished editing, upload the file back to where it was:

```
aws s3 cp \
  model-HPO-configuration.json \
  s3://(bucket name)/(path to output folder)/model-HPO-configuration.json
```
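Between the two `aws s3 cp` commands, a small script can make the edit. The sketch below is illustrative only (the helper name `with_eval_metric` is hypothetical); it assumes the file has the structure shown later in this section:

```python
import json

def with_eval_metric(config, metric, **extra):
    # Override eval_metric in the parsed model-HPO-configuration.json
    # contents. Write the result back under the SAME file name before
    # uploading, because the file name must not change.
    models = config["models"]
    assert len(models) == 1, "the models array must contain exactly one entry"
    models[0]["eval_metric"] = {"metric": metric, **extra}
    return config

# A minimal stand-in for the downloaded file's contents:
config = json.loads(
    '{"models": [{"model": "rgcn", "task_type": "node_class",'
    ' "eval_metric": {"metric": "acc"}}]}'
)
updated = with_eval_metric(config, "acc_topk", topk=2)
print(updated["models"][0]["eval_metric"])  # {'metric': 'acc_topk', 'topk': 2}
```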

## Structure of the `model-HPO-configuration.json` file
File structure

The `model-HPO-configuration.json` file specifies the model to be trained, the machine learning `task_type` and the hyperparameters that should be varied or fixed for the various runs of model training.

The hyperparameters are categorized as belonging to various tiers that signify the precedence given to the hyperparameters when the hyperparameter tuning job is invoked:
+ Tier-1 hyperparameters have the highest precedence. If you set `maxHPONumberOfTrainingJobs` to a value less than 10, only Tier-1 hyperparameters are tuned, and the rest take their default values.
+ Tier-2 hyperparameters have lower precedence, so if you have more than 10 but less than 50 total training jobs for a tuning job, then both Tier-1 and Tier-2 hyperparameters are tuned.
+ Tier-3 hyperparameters are tuned together with Tier-1 and Tier-2 hyperparameters only if you have more than 50 total training jobs.
+ Finally, fixed hyperparameters are not tuned at all, and always take their default values.
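These precedence rules can be summarized in a small helper (illustrative only; the behavior at exactly 10 or 50 jobs follows the "more than" wording above):

```python
def tuned_tiers(max_hpo_number_of_training_jobs):
    # Tiers whose hyperparameters are varied by the tuning job;
    # hyperparameters in all other tiers (and every fixed-param
    # entry) keep their default values.
    tiers = ["1-tier-param"]
    if max_hpo_number_of_training_jobs > 10:
        tiers.append("2-tier-param")
    if max_hpo_number_of_training_jobs > 50:
        tiers.append("3-tier-param")
    return tiers

print(tuned_tiers(5))    # ['1-tier-param']
print(tuned_tiers(30))   # ['1-tier-param', '2-tier-param']
print(tuned_tiers(100))  # ['1-tier-param', '2-tier-param', '3-tier-param']
```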

### Example of a `model-HPO-configuration.json` file
Sample file

The following is a sample `model-HPO-configuration.json` file:

```
{
  "models": [
    {
      "model": "rgcn",
      "task_type": "node_class",
      "eval_metric": {
        "metric": "acc"
      },
      "eval_frequency": {
          "type":  "evaluate_every_epoch",
          "value":  1
      },
      "1-tier-param": [
        {
            "param": "num-hidden",
            "range": [16, 128],
            "type": "int",
            "inc_strategy": "power2"
        },
        {
          "param": "num-epochs",
          "range": [3,30],
          "inc_strategy": "linear",
          "inc_val": 1,
          "type": "int",
          "node_strategy": "perM"
        },
        {
          "param": "lr",
          "range": [0.001,0.01],
          "type": "float",
          "inc_strategy": "log"
        }
      ],
      "2-tier-param": [
        {
          "param": "dropout",
          "range": [0.0,0.5],
          "inc_strategy": "linear",
          "type": "float",
          "default": 0.3
        },
        {
          "param": "layer-norm",
          "type": "bool",
          "default": true
        }
      ],
      "3-tier-param": [
        {
          "param": "batch-size",
          "range": [128, 4096],
          "inc_strategy": "power2",
          "type": "int",
          "default": 1024
        },
        {
          "param": "fanout",
          "type": "int",
          "options": [[10, 30],[15, 30], [15, 30]],
          "default": [10, 15, 15]
        },
        {
          "param": "num-layer",
          "range": [1, 3],
          "inc_strategy": "linear",
          "inc_val": 1,
          "type": "int",
          "default": 2
        },
        {
          "param": "num-bases",
          "range": [0, 8],
          "inc_strategy": "linear",
          "inc_val": 2,
          "type": "int",
          "default": 0
        }
      ],
      "fixed-param": [
        {
          "param": "concat-node-embed",
          "type": "bool",
          "default": true
        },
        {
          "param": "use-self-loop",
          "type": "bool",
          "default": true
        },
        {
          "param": "low-mem",
          "type": "bool",
          "default": true
        },
        {
          "param": "l2norm",
          "type": "float",
          "default": 0
        }
      ]
    }
  ]
}
```

### Elements of a `model-HPO-configuration.json` file
File elements

The file contains a JSON object with a single top-level array named `models` that contains a single model-configuration object. When customizing the file, make sure the `models` array only has one model-configuration object in it. If your file contains more than one model-configuration object, the tuning job will fail with a warning.

The model-configuration object contains the following top-level elements:
+ **`model`**   –   (*String*) The model type to be trained (**do not modify**). Valid values are:
  + `"rgcn"`   –   This is the default for node classification and regression tasks, and for heterogeneous link prediction tasks.
  + `"transe"`   –   This is the default for KGE link prediction tasks.
  + `"distmult"`   –   This is an alternative model type for KGE link prediction tasks.
  + `"rotate"`   –   This is an alternative model type for KGE link prediction tasks.

  As a rule, don't directly modify the `model` value, because different model types often have substantially different applicable hyperparameters, which can result in a parsing error after the training job has started.

  To change the model type, use the `modelName` parameter in the [modelTraining API](machine-learning-api-modeltraining.md#machine-learning-api-modeltraining-create-job) rather than change it in the `model-HPO-configuration.json` file.

  A way to change the model type and make fine-grained hyperparameter changes is to copy the default model configuration template for the model that you want to use and paste it into the `model-HPO-configuration.json` file. If the inferred task type supports multiple models, there is a folder named `hpo-configuration-templates` in the same Amazon S3 location as the `model-HPO-configuration.json` file. This folder contains all the default hyperparameter configurations for the other models that are applicable to the task.

  For example, if you want to change the model and hyperparameter configurations for a `KGE` link-prediction task from the default `transe` model to a `distmult` model, simply paste the contents of the `hpo-configuration-templates/distmult.json` file into the `model-HPO-configuration.json` file and then edit the hyperparameters as necessary.
**Note**  
If you set the `modelName` parameter in the `modelTraining` API and also change the `model` and hyperparameter specification in the `model-HPO-configuration.json` file, and these are different, the `model` value in the `model-HPO-configuration.json` file takes precedence, and the `modelName` value is ignored.
+ **`task_type`**   –   (*String*) The machine learning task type inferred by or passed directly to the data-processing job (**do not modify**). Valid values are:
  + `"node_class"`
  + `"node_regression"`
  + `"link_prediction"`

  The data-processing job infers the task type by examining the exported dataset and the generated training-job configuration file for properties of the dataset.

  This value should not be changed. If you want to train a different task, you need to [run a new data-processing job](machine-learning-on-graphs-processing.md). If the `task_type` value is not what you were expecting, you should check the inputs to your data-processing job to make sure that they are correct. This includes parameters to the `modelTraining` API, as well as in the training-job configuration file generated by the data-export process.
+ **`eval_metric`**   –   (*String*) The evaluation metric to be used for evaluating model performance and for selecting the best-performing model across HPO runs. Valid values are:
  + `"acc"`   –   Standard classification accuracy. This is the default for single-label classification tasks, unless imbalanced labels are found during data processing, in which case the default is `"F1"`.
  + `"acc_topk"`   –   The number of times the correct label is among the top **`k`** predictions. You can also set the value of **`k`** by passing `topk` as an extra key.
  + `"F1"`   –   The [F1 score](https://en.wikipedia.org/wiki/F-score).
  + `"mse"`   –   [Mean-squared error metric](https://en.wikipedia.org/wiki/Mean_squared_error), for regression tasks.
  + `"mrr"`   –   [Mean reciprocal rank metric](https://en.wikipedia.org/wiki/Mean_reciprocal_rank).
  + `"precision"`   –   The model precision, calculated as the ratio of true positives to predicted positives: `= true-positives / (true-positives + false-positives)`.
  + `"recall"`   –   The model recall, calculated as the ratio of true positives to actual positives: `= true-positives / (true-positives + false-negatives)`.
  + `"roc_auc"`   –   The area under the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). This is the default for multi-label classification.

  For example, to change the metric to `F1`, change the `eval_metric` value as follows:

  ```
    "eval_metric": {
      "metric": "F1"
    },
  ```

  Or, to change the metric to a `topk` accuracy score, you would change `eval_metric` as follows:

  ```
    "eval_metric": {
      "metric": "acc_topk",
      "topk": 2
    },
  ```
+ **`eval_frequency`**   –   (*Object*) Specifies how often during training the performance of the model on the validation set should be checked. Based on the validation performance, early stopping can then be initiated and the best model can be saved.

  The `eval_frequency` object contains two elements, namely `"type"` and `"value"`. For example:

  ```
    "eval_frequency": {
      "type":  "evaluate_every_pct",
      "value":  0.1
    },
  ```

  Valid `type` values are:
  + **`evaluate_every_pct`**   –   Specifies the percentage of training to be completed for each evaluation.

    For `evaluate_every_pct`, the `"value"` field contains a floating-point number between zero and one that expresses that percentage as a fraction.

    
  + **`evaluate_every_batch`**   –   Specifies the number of training batches to be completed for each evaluation.

    For `evaluate_every_batch`, the `"value"` field contains an integer which expresses that batch count.
  + **`evaluate_every_epoch`**   –   Specifies the number of training epochs to be completed for each evaluation, where an epoch is one full pass over the training data.

    For `evaluate_every_epoch`, the `"value"` field contains an integer which expresses that epoch count.

  The default setting for `eval_frequency` is:

  ```
    "eval_frequency": {
      "type":  "evaluate_every_epoch",
      "value":  1
    },
  ```
+ **`1-tier-param`**   –   (*Required*) An array of Tier-1 hyperparameters.

  If you don't want to tune any hyperparameters, you can set this to an empty array. This does not affect the total number of training jobs launched by the SageMaker AI hyperparameter tuning job. It just means that all the training jobs (when there is more than one but fewer than 10) run with the same set of hyperparameters.

  On the other hand, if you want to treat all your tunable hyperparameters with equal significance then you can put all the hyperparameters in this array.
+ **`2-tier-param`**   –   (*Required*) An array of Tier-2 hyperparameters.

  These parameters are only tuned if `maxHPONumberOfTrainingJobs` has a value greater than 10. Otherwise, they are fixed to the default values.

  If you have a training budget of at most 10 training jobs, or don't want a separate Tier-2 for any other reason but still want all your tunable hyperparameters to be tuned, you can move them all into `1-tier-param` and set this to an empty array.
+ **`3-tier-param`**   –   (*Required*) An array of Tier-3 hyperparameters.

  These parameters are only tuned if `maxHPONumberOfTrainingJobs` has a value greater than 50. Otherwise, they are fixed to the default values.

  If you don't want Tier-3 hyperparameters, you can set this to an empty array.
+ **`fixed-param`**   –   (*Required*) An array of fixed hyperparameters that take only their default values and do not vary in different training jobs.

  If you want to vary all hyperparameters, you can set this to an empty array and either set the value for `maxHPONumberOfTrainingJobs` large enough to vary all tiers or make all hyperparameters Tier-1.

The JSON object that represents each hyperparameter in `1-tier-param`, `2-tier-param`, `3-tier-param`, and `fixed-param` contains the following elements:
+ **`param`**   –   (*String*) The name of the hyperparameter (**do not change**).

  See the [list of valid hyperparameter names in Neptune ML](#machine-learning-hyperparams-list).
+ **`type`**   –   (*String*) The hyperparameter type (**do not change**).

  Valid types are: `bool`, `int`, and `float`.
+ **`default`**   –   (*String*) The default value for the hyperparameter.

  You can set a new default value.

Tunable hyperparameters can also contain the following elements:
+ **`range`**   –   (*Array*) The range for a continuous tunable hyperparameter.

  This should be an array with two values, namely the minimum and maximum of the range (`[min, max]`).
+ **`options`**   –   (*Array*) The options for a categorical tunable hyperparameter.

  This array should contain all the options to consider:

  ```
    "options" : [value1, value2, ... valuen]
  ```
+ **`inc_strategy`**   –   (*String*) The type of incremental change for continuous tunable hyperparameter ranges (**do not change**).

  Valid values are `log`, `linear`, and `power2`. This applies only when the range key is set.

  Modifying this may result in not using the full range of your hyperparameter for tuning.
+ **`inc_val`**   –   (*Float*) The amount by which successive increments differ for continuous tunable hyperparameters (**do not change**).

  This applies only when the range key is set.

  Modifying this may result in not using the full range of your hyperparameter for tuning.
+ **`node_strategy`**   –   (*String*) Indicates that the effective range for this hyperparameter should change based on the number of nodes in the graph (**do not change**).

  Valid values are `"perM"` (per million), `"per10M"` (per 10 million), and `"per100M"` (per 100 million).

  Rather than change this value, change the `range` instead.
+ **`edge_strategy`**   –   (*String*) Indicates that the effective range for this hyperparameter should change based on the number of edges in the graph (**do not change**).

  Valid values are `"perM"` (per million), `"per10M"` (per 10 million), and `"per100M"` (per 100 million).

  Rather than change this value, change the `range` instead.
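As an illustration, a continuous tunable hyperparameter that combines several of these elements might be declared as follows. The values shown are examples, not the defaults generated for your graph:

```
{
  "param": "lr",
  "type": "float",
  "range": [0.001, 0.01],
  "inc_strategy": "log",
  "default": 0.005
}
```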

### List of all the hyperparameters in Neptune ML
List of hyperparameters

The following list contains all the hyperparameters that can be set anywhere in Neptune ML, for any model type and task. Because they are not all applicable to every model type, it is important that you only set hyperparameters in the `model-HPO-configuration.json` file that appear in the template for the model you're using.
+ **`batch-size`**   –   The size of the batch of target nodes used in one forward pass. *Type*: `int`.

  Setting this to a much larger value can cause memory issues for training on GPU instances.
+ **`concat-node-embed`**   –   Indicates whether to get the initial representation of a node by concatenating its processed features with learnable initial node embeddings in order to increase the expressivity of the model. *Type*: `bool`.
+ **`dropout`**   –   The dropout probability applied to dropout layers. *Type*: `float`.

  
+ **`edge-num-hidden`**   –   The hidden layer size or number of units for the edge feature module. Only used when `use-edge-features` is set to `True`. *Type*: `int`.
+ **`enable-early-stop`**   –   Toggles whether or not to use the early stopping feature. *Type*: `bool`. *Default*: `true`.

  Use this Boolean parameter to turn off the early stop feature.
+ **`fanout`**   –   The number of neighbors to sample for a target node during neighbor sampling. *Type*: `int`.

  This value is tightly coupled with `num-layers` and should always be in the same hyperparameter tier. This is because you can specify a fanout for each potential GNN layer.

  Because this hyperparameter can cause model performance to vary widely, it should be fixed or set as a Tier-2 or Tier-3 hyperparameter. Setting it to a large value can cause memory issues for training on GPU instances.
+ **`gamma`**   –   The margin value in the score function. *Type*: `float`.

  This applies to `KGE` link-prediction models only.
+ **`l2norm`**   –   The weight decay value used in the optimizer, which imposes an L2 normalization penalty on the weights. *Type*: `float`.
+ **`layer-norm`**   –   Indicates whether to use layer normalization for `rgcn` models. *Type*: `bool`.
+ **`low-mem`**   –   Indicates whether to use a low-memory implementation of the relation message passing function at the expense of speed. *Type*: `bool`.

  
+ **`lr`**   –   The learning rate. *Type*: `float`.

  This should be set as a Tier-1 hyperparameter.
+ **`neg-share`**   –   In link prediction, indicates whether positive sampled edges can share negative edge samples. *Type*: `bool`.
+ **`num-bases`**   –   The number of bases for basis decomposition in an `rgcn` model. Using a value of `num-bases` that is less than the number of edge types in the graph acts as a regularizer for the `rgcn` model. *Type*: `int`.
+ **`num-epochs`**   –   The number of epochs of training to run. *Type*: `int`.

  An epoch is a complete training pass through the graph.
+ **`num-hidden`**   –   The hidden layer size or number of units. *Type*: `int`.

  This also sets the initial embedding size for featureless nodes.

  Setting this to a much larger value without reducing `batch-size` can cause out-of-memory issues for training on GPU instances.
+ **`num-layer`**   –   The number of GNN layers in the model. *Type*: `int`.

  This value is tightly coupled with the `fanout` hyperparameter and should come after `fanout` is set in the same hyperparameter tier.

  Because this can cause model performance to vary widely, it should be fixed or set as a Tier-2 or Tier-3 hyperparameter.
+ **`num-negs`**   –   In link prediction, the number of negative samples per positive sample. *Type*: `int`.
+ **`per-feat-name-embed`**   –   Indicates whether to embed each feature by independently transforming it before combining features. *Type*: `bool`.

  When set to `true`, each feature per node is independently transformed to a fixed dimension size before all the transformed features for the node are concatenated and further transformed to the `num_hidden` dimension.

  When set to `false`, the features are concatenated without any feature-specific transformations.
+ **`regularization-coef`**   –   In link prediction, the coefficient of regularization loss. *Type*: `float`.
+ **`rel-part`**   –   Indicates whether to use relation partition for `KGE` link prediction. *Type*: `bool`.
+ **`sparse-lr`**   –   The learning rate for learnable node embeddings. *Type*: `float`.

  Learnable initial node embeddings are used for nodes without features or when `concat-node-embed` is set. The parameters of the sparse learnable node embedding layer are trained using a separate optimizer which can have a separate learning rate.
+ **`use-class-weight`**   –   Indicates whether to apply class weights for imbalanced classification tasks. If set to `true`, the label counts are used to set a weight for each class label. *Type*: `bool`.
+ **`use-edge-features`**   –   Indicates whether to use edge features during message passing. If set to `true`, a custom edge feature module is added to the RGCN layer for edge types that have features. *Type*: `bool`.
+ **`use-self-loop`**   –   Indicates whether to include self loops in training an `rgcn` model. *Type*: `bool`.
+ **`window-for-early-stop`**   –   Controls the number of latest validation scores that are averaged to decide on an early stop. *Type*: `int`. *Default*: `3`. See also [Early stopping of the model training process in Neptune ML](machine-learning-improve-model-performance.md#machine-learning-model-training-early-stop).

## Customizing hyperparameters in Neptune ML
Customizing hyperparameters

When you are editing the `model-HPO-configuration.json` file, the following are the most common kinds of changes to make:
+ Edit the minimum and/or maximum values of `range` hyperparameters.
+ Set a hyperparameter to a fixed value by moving it to the `fixed-param` section and setting its default value to the fixed value you want it to take.
+ Change the priority of a hyperparameter by placing it in a particular tier, editing its range, and making sure that its default value is set appropriately.
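For example, to stop tuning `num-epochs` and pin it to a single value, you could move its entry into the `fixed-param` array and set its default accordingly. This is a sketch with illustrative values:

```
"fixed-param": [
  {
    "param": "num-epochs",
    "type": "int",
    "default": 10
  }
]
```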

# Model training best practices
Training best practices

There are things you can do to improve the performance of Neptune ML models.

## Choose the right node property
Choose the property

Not all the properties in your graph may be meaningful or relevant to your machine learning tasks. Any irrelevant properties should be excluded during data export.

Here are some best practices:
+ Use domain experts to help evaluate the importance of features and the feasibility of using them for predictions.
+ Remove the features that you determine are redundant or irrelevant to reduce noise in the data and unimportant correlations.
+ Iterate as you build your model. Adjust the features, feature combinations, and tuning objectives as you go along.

[Feature Processing](https://docs.aws.amazon.com/machine-learning/latest/dg/feature-processing.html) in the Amazon Machine Learning Developer Guide provides additional guidelines for feature processing that are relevant to Neptune ML.

## Handle outlier data points
Handle outliers

An outlier is a data point that is significantly different from the remaining data. Data outliers can spoil or mislead the training process, resulting in longer training time or less accurate models. Unless they are truly important, you should eliminate outliers before exporting the data.

## Remove duplicate nodes and edges
Remove duplicates

Graphs stored in Neptune may have duplicate nodes or edges. These redundant elements will introduce noise for ML model training. Eliminate duplicate nodes or edges before exporting the data.

## Tune the graph structure
Tune the graph

When the graph is exported, you can change the way features are processed and how the graph is constructed, to improve the model performance.

Here are some best practices:
+ When an edge property represents a category of edges, it is sometimes worth turning it into an edge type.
+ The default normalization policy used for a numerical property is `min-max`, but in some cases other normalization policies work better. You can preprocess the property and change the normalization policy as explained in [Elements of a `model-HPO-configuration.json` file](machine-learning-customizing-hyperparams.md#machine-learning-hyperparams-file-elements).
+ The export process automatically generates feature types based on property types. For example, it treats `String` properties as categorical features and `Float` and `Int` properties as numerical features. If you need to, you can modify the feature type after export (see [Elements of a `model-HPO-configuration.json` file](machine-learning-customizing-hyperparams.md#machine-learning-hyperparams-file-elements)).

## Tune the hyperparameter ranges and defaults
Change the HPO file

The data-processing operation infers hyperparameter configuration ranges from the graph. If the generated model hyperparameter ranges and defaults don't work well for your graph data, you can edit the HPO configuration file to create your own hyperparameter tuning strategy.

Here are some best practices:
+ When the graph is large, the default hidden dimension size may not be large enough to contain all the information. You can change the `num-hidden` hyperparameter to control the hidden dimension size.
+ For knowledge graph embedding (KGE) models, you may want to change the specific model being used according to your graph structure and budget.

  `TransE` models have difficulty dealing with one-to-many (1-N), many-to-one (N-1), and many-to-many (N-N) relations. `DistMult` models have difficulty dealing with symmetric relations. `RotatE` is good at modeling all kinds of relations, but is more expensive than `TransE` and `DistMult` during training.
+ In some cases, when both node identification and node feature information are important, you should use `concat-node-embed` to tell the Neptune ML model to get the initial representation of a node by concatenating its features with its initial embeddings.
+ When you are getting reasonably good performance over some hyperparameters, you can adjust the hyperparameter search space according to those results.
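As a sketch of the first point above, you might widen the search range for `num-hidden` in the `model-HPO-configuration.json` file for a large graph. The values shown are illustrative, not the defaults generated for your graph:

```
{
  "param": "num-hidden",
  "type": "int",
  "range": [64, 256],
  "inc_strategy": "power2",
  "default": 128
}
```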

## Early stopping of the model training process in Neptune ML
Model training early stop

Early stopping can significantly reduce the model-training run time and associated costs without degrading model performance. It also prevents the model from overfitting on the training data.

Early stopping depends on regular measurements of validation-set performance. Initially, performance improves as training proceeds, but when the model starts overfitting, it begins to decline. The early stopping feature identifies the point at which the model starts overfitting and halts model training there.

Neptune ML monitors the validation metric and compares the most recent value to the average of the validation metrics over the last **`n`** evaluations, where **`n`** is a number set using the `window-for-early-stop` parameter. As soon as the validation metric is worse than that average, Neptune ML stops the model training and saves the best model so far.

You can control early stopping using the following parameters:
+ **`window-for-early-stop`**   –   The value of this parameter is an integer that specifies the number of recent validation scores to average when deciding on an early stop. The default value is `3`.
+ **`enable-early-stop`**   –   Use this Boolean parameter to turn off the early stop feature. By default, its value is `true`.
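The stopping rule described above can be sketched in Python as follows. This illustrates the logic only and is not Neptune ML's actual implementation; the function name and signature are hypothetical:

```python
def should_stop_early(validation_scores, window=3):
    """Return True when the most recent validation score is worse than
    the average of the `window` scores that preceded it (assuming a
    higher score is better, as with accuracy)."""
    if len(validation_scores) <= window:
        return False  # not enough history to form a full window yet
    latest = validation_scores[-1]
    # Average of the `window` scores immediately before the latest one
    window_avg = sum(validation_scores[-window - 1:-1]) / window
    return latest < window_avg

# Performance is still improving, so training continues:
print(should_stop_early([0.50, 0.60, 0.70, 0.80]))        # False
# The latest score has dropped below the recent average, so stop:
print(should_stop_early([0.50, 0.70, 0.80, 0.75, 0.60]))  # True
```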

## Early stopping of the HPO process in Neptune ML
HPO early stop

The early stop feature in Neptune ML also stops training jobs that are not performing well compared to other training jobs, using the SageMaker AI HPO warm-start feature. This too can reduce costs and improve the quality of HPO.

See [Run a warm start hyperparameter tuning job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html) for a description of how this works.

Warm start makes it possible to pass information learned from previous training jobs to subsequent training jobs, which provides two distinct benefits:
+ First, the results of previous training jobs are used to select good combinations of hyperparameters to search over in the new tuning job.
+ Second, it allows early stopping to access more model runs, which reduces tuning time.

This feature is enabled automatically in Neptune ML, and lets you strike a balance between model training time and performance. If you are satisfied with the performance of the current model, you can use it. Otherwise, you can run more HPO jobs that are warm-started with the results of previous runs to discover a better model.

## Get professional support services
Get professional support

AWS offers professional support services to help you with problems in your machine learning on Neptune projects. If you get stuck, reach out to [AWS support](https://aws.amazon.com/premiumsupport/).