

# Converting other dataset formats to a manifest file
<a name="md-converting-to-sm-format"></a>

You can use the following information to create Amazon SageMaker AI format manifest files from a variety of source dataset formats. After creating the manifest file, use it to create a dataset. For more information, see [Using a manifest file to import images](md-create-dataset-ground-truth.md).

**Topics**
+ [Transforming a COCO dataset into a manifest file format](md-transform-coco.md)
+ [Transforming multi-label SageMaker AI Ground Truth manifest files](md-gt-cl-transform.md)
+ [Creating a manifest file from a CSV file](ex-csv-manifest.md)

# Transforming a COCO dataset into a manifest file format
<a name="md-transform-coco"></a>

[COCO](http://cocodataset.org/#home) is a format for specifying large-scale object detection, segmentation, and captioning datasets. This Python [example](md-coco-transform-example.md) shows you how to transform a COCO object detection format dataset into an Amazon Rekognition Custom Labels [bounding box format manifest file](md-create-manifest-file-object-detection.md). This section also includes information that you can use to write your own code.

A COCO format JSON file consists of five sections providing information for *an entire dataset*. For more information, see [The COCO dataset format](md-coco-overview.md). 
+ `info` – general information about the dataset. 
+ `licenses` – license information for the images in the dataset.
+ [`images`](md-coco-overview.md#md-coco-images) – a list of images in the dataset.
+ [`annotations`](md-coco-overview.md#md-coco-annotations) – a list of annotations (including bounding boxes) that are present in all images in the dataset.
+ [`categories`](md-coco-overview.md#md-coco-categories) – a list of label categories.

You need information from the `images`, `annotations`, and `categories` lists to create an Amazon Rekognition Custom Labels manifest file.

An Amazon Rekognition Custom Labels manifest file is in JSON lines format, where each line has the bounding box and label information for one or more objects *on an image*. For more information, see [Object localization in manifest files](md-create-manifest-file-object-detection.md).
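For example, a minimal sketch that parses manifest lines one at a time (the `source-ref` values here are hypothetical, and a real line also contains the bounding box and metadata fields):

```
import json

# Two hypothetical manifest lines, abbreviated to the source-ref field.
manifest_lines = [
    '{"source-ref": "s3://my-bucket/images/000000242287.jpg"}',
    '{"source-ref": "s3://my-bucket/images/000000245915.jpg"}',
]

# Each line is an independent JSON document, so parse line by line
# rather than loading the whole file as a single JSON object.
entries = [json.loads(line) for line in manifest_lines]
for entry in entries:
    print(entry["source-ref"])
```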

## Mapping COCO objects to a Custom Labels JSON line
<a name="md-mapping-coco"></a>

To transform a COCO format dataset, you map the COCO dataset to an Amazon Rekognition Custom Labels manifest file for object localization. For more information, see [Object localization in manifest files](md-create-manifest-file-object-detection.md). To build a JSON line for each image, the manifest file needs to map the COCO dataset `image`, `annotation`, and `category` object field IDs. 

The following is an example COCO manifest file. For more information, see [The COCO dataset format](md-coco-overview.md).

```
{
    "info": {
        "description": "COCO 2017 Dataset","url": "http://cocodataset.org","version": "1.0","year": 2017,"contributor": "COCO Consortium","date_created": "2017/09/01"
    },
    "licenses": [
        {"url": "http://creativecommons.org/licenses/by/2.0/","id": 4,"name": "Attribution License"}
    ],
    "images": [
        {"id": 242287, "license": 4, "coco_url": "http://images.cocodataset.org/val2017/xxxxxxxxxxxx.jpg", "flickr_url": "http://farm3.staticflickr.com/2626/xxxxxxxxxxxx.jpg", "width": 426, "height": 640, "file_name": "xxxxxxxxx.jpg", "date_captured": "2013-11-15 02:41:42"},
        {"id": 245915, "license": 4, "coco_url": "http://images.cocodataset.org/val2017/nnnnnnnnnnnn.jpg", "flickr_url": "http://farm1.staticflickr.com/88/xxxxxxxxxxxx.jpg", "width": 640, "height": 480, "file_name": "nnnnnnnnnn.jpg", "date_captured": "2013-11-18 02:53:27"}
    ],
    "annotations": [
        {"id": 125686, "category_id": 0, "iscrowd": 0, "segmentation": [[164.81, 417.51,......167.55, 410.64]], "image_id": 242287, "area": 42061.80340000001, "bbox": [19.23, 383.18, 314.5, 244.46]},
        {"id": 1409619, "category_id": 0, "iscrowd": 0, "segmentation": [[376.81, 238.8,........382.74, 241.17]], "image_id": 245915, "area": 3556.2197000000015, "bbox": [399, 251, 155, 101]},
        {"id": 1410165, "category_id": 1, "iscrowd": 0, "segmentation": [[486.34, 239.01,..........495.95, 244.39]], "image_id": 245915, "area": 1775.8932499999994, "bbox": [86, 65, 220, 334]}
    ],
    "categories": [
        {"supercategory": "speaker","id": 0,"name": "echo"},
        {"supercategory": "speaker","id": 1,"name": "echo dot"}
    ]
}
```

The following diagram shows how the COCO dataset lists for a *dataset* map to Amazon Rekognition Custom Labels JSON lines for an *image*. Every JSON line for an image has a source-ref field, a job field, and a job metadata field. Matching colors indicate information for a single image. Note that in the manifest file, an individual image can have multiple annotations and multiple class-map entries in its metadata.

![\[Diagram showing the structure of Coco Manifest, with images, annotations, and categories contained within it.\]](http://docs.aws.amazon.com/rekognition/latest/customlabels-dg/images/coco-transform.png)


**To get the COCO objects for a single JSON line**

1. For each image in the images list, get the annotation from the annotations list where the value of the annotation field `image_id` matches the image `id` field.

1. For each annotation matched in step 1, read through the `categories` list and get each `category` where the value of the `category` field `id` matches the `annotation` object `category_id` field.

1. Create a JSON line for the image using the matched `image`, `annotation`, and `category` objects. To map the fields, see [Mapping COCO object fields to Custom Labels JSON line object fields](#md-mapping-fields-coco). 

1. Repeat steps 1–3 until you have created JSON lines for each `image` object in the `images` list.

For example code, see [Transforming a COCO dataset](md-coco-transform-example.md).
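A compact sketch of that matching, using abbreviated stand-ins for the three COCO lists (the linked example implements the full transform):

```
# Abbreviated stand-ins for the COCO images, annotations, and categories lists.
images = [{"id": 245915, "file_name": "000000245915.jpg"}]
annotations = [
    {"image_id": 245915, "category_id": 0, "bbox": [399, 251, 155, 101]},
    {"image_id": 245915, "category_id": 1, "bbox": [86, 65, 220, 334]},
]
categories = [{"id": 0, "name": "echo"}, {"id": 1, "name": "echo dot"}]

category_names = {c["id"]: c["name"] for c in categories}

# Step 1: group annotations by the image they belong to.
annotations_by_image = {}
for ann in annotations:
    annotations_by_image.setdefault(ann["image_id"], []).append(ann)

# Steps 2 and 3: for each image, collect its annotations and label names.
for img in images:
    img_annotations = annotations_by_image.get(img["id"], [])
    labels = [category_names[a["category_id"]] for a in img_annotations]
    # img, img_annotations, and labels hold everything needed for one JSON line.
```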

## Mapping COCO object fields to Custom Labels JSON line object fields
<a name="md-mapping-fields-coco"></a>

After you identify the COCO objects for an Amazon Rekognition Custom Labels JSON line, you need to map the COCO object fields to the respective Amazon Rekognition Custom Labels JSON line object fields. The following example Amazon Rekognition Custom Labels JSON line maps one image (`id`=`245915`) from the preceding COCO JSON example. Note the following information.
+ `source-ref` is the location of the image in an Amazon S3 bucket. If your COCO images aren't stored in an Amazon S3 bucket, you need to move them to an Amazon S3 bucket.
+ The `annotations` list contains an `annotation` object for each object on the image. An `annotation` object includes bounding box information (`top`, `left`,`width`, `height`) and a label identifier (`class_id`).
+ The label identifier (`class_id`) maps to the `class-map` list in the metadata. It lists the labels used on the image.

```
{
	"source-ref": "s3://custom-labels-bucket/images/000000245915.jpg",
	"bounding-box": {
		"image_size": {
			"width": 640,
			"height": 480,
			"depth": 3
		},
		"annotations": [{
			"class_id": 0,
			"top": 251,
			"left": 399,
			"width": 155,
			"height": 101
		}, {
			"class_id": 1,
			"top": 65,
			"left": 86,
			"width": 220,
			"height": 334
		}]
	},
	"bounding-box-metadata": {
		"objects": [{
			"confidence": 1
		}, {
			"confidence": 1
		}],
		"class-map": {
			"0": "Echo",
			"1": "Echo Dot"
		},
		"type": "groundtruth/object-detection",
		"human-annotated": "yes",
		"creation-date": "2018-10-18T22:18:13.527256",
		"job-name": "my job"
	}
}
```

Use the following information to map Amazon Rekognition Custom Labels manifest file fields to COCO dataset JSON fields. 

### source-ref
<a name="md-source-ref-coco"></a>

The Amazon S3 URL for the location of the image. The image must be stored in an S3 bucket. For more information, see [source-ref](md-create-manifest-file-object-detection.md#cd-manifest-source-ref). If the `coco_url` COCO field points to an S3 bucket location, you can use the value of `coco_url` for the value of `source-ref`. Alternatively, you can map `source-ref` to the `file_name` (COCO) field and, in your transform code, add the required S3 path to where the image is stored. 

### *bounding-box*
<a name="md-label-attribute-id-coco"></a>

A label attribute name of your choosing. For more information, see [*bounding-box*](md-create-manifest-file-object-detection.md#md-manifest-source-bounding-box).

#### image_size
<a name="md-image-size-coco"></a>

The size of the image in pixels. Maps to an `image` object in the [images](md-coco-overview.md#md-coco-images) list.
+ `height` -> `image.height`
+ `width` -> `image.width`
+ `depth` -> Not used by Amazon Rekognition Custom Labels, but a value must be supplied.
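As a minimal sketch, the mapping for the second `image` object in the example COCO file above looks like this (`image_size_from_coco` is a hypothetical helper name):

```
def image_size_from_coco(img):
    """Map a COCO image object to Custom Labels image_size fields."""
    return {
        "width": img["width"],
        "height": img["height"],
        "depth": 3,  # not used by Amazon Rekognition Custom Labels, but required
    }

print(image_size_from_coco({"id": 245915, "width": 640, "height": 480}))
# {'width': 640, 'height': 480, 'depth': 3}
```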

#### annotations
<a name="md-annotations-coco"></a>

A list of `annotation` objects. There’s one `annotation` for each object on the image.

#### annotation
<a name="md-annotation-coco"></a>

Contains bounding box information for one instance of an object on the image. 
+ `class_id` -> Numerical ID that maps to the Custom Labels `class-map` list.
+ `top` -> `bbox[1]`
+ `left` -> `bbox[0]`
+ `width` -> `bbox[2]`
+ `height` -> `bbox[3]`
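Because a COCO `bbox` is ordered `[x, y, width, height]` from the top-left corner of the image, the mapping above is a direct reordering. A minimal sketch (`coco_bbox_to_custom_labels` is a hypothetical helper name):

```
def coco_bbox_to_custom_labels(bbox):
    """Map a COCO [x, y, width, height] bbox to Custom Labels annotation fields."""
    left, top, width, height = bbox
    return {"left": left, "top": top, "width": width, "height": height}

# The second annotation in the example COCO file has bbox [399, 251, 155, 101].
print(coco_bbox_to_custom_labels([399, 251, 155, 101]))
# {'left': 399, 'top': 251, 'width': 155, 'height': 101}
```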

### *bounding-box*-metadata
<a name="md-metadata-coco"></a>

Metadata for the label attribute. Includes the labels and label identifiers. For more information, see [*bounding-box*-metadata](md-create-manifest-file-object-detection.md#md-manifest-source-bounding-box-metadata).

#### Objects
<a name="cd-metadata-objects-coco"></a>

An array of objects in the image. Maps to the `annotations` list by index.

##### Object
<a name="cd-metadata-object-coco"></a>
+ `confidence` -> Not used by Amazon Rekognition Custom Labels, but a value (1) is required.

#### class-map
<a name="md-metadata-class-map-coco"></a>

A map of the labels (classes) that apply to objects detected in the image. Maps to category objects in the [categories](md-coco-overview.md#md-coco-categories) list.
+ key -> `category.id`
+ value -> `category.name`
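For example, a minimal sketch that builds a `class-map` from the example `categories` list. The keys are category IDs converted to strings, matching the example JSON line earlier in this section:

```
categories = [
    {"supercategory": "speaker", "id": 0, "name": "echo"},
    {"supercategory": "speaker", "id": 1, "name": "echo dot"},
]

# Map each category id (as a string key) to its label name.
class_map = {str(category["id"]): category["name"] for category in categories}
print(class_map)
# {'0': 'echo', '1': 'echo dot'}
```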

#### type
<a name="md-type-coco"></a>

Must be `groundtruth/object-detection`.

#### human-annotated
<a name="md-human-annotated-coco"></a>

Specify `yes` or `no`. For more information, see [*bounding-box*-metadata](md-create-manifest-file-object-detection.md#md-manifest-source-bounding-box-metadata).

#### creation-date -> [image](md-coco-overview.md#md-coco-images).date_captured
<a name="md-creation-date-coco"></a>

The creation date and time of the image. Maps to the [image](md-coco-overview.md#md-coco-images).date_captured field of an image in the COCO images list. Amazon Rekognition Custom Labels expects the format of `creation-date` to be *Y-M-DTH:M:S*.
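For example, a COCO `date_captured` value such as `2013-11-18 02:53:27` can be converted with the `datetime` module (a minimal sketch; `to_creation_date` is a hypothetical helper name):

```
import datetime

def to_creation_date(date_captured):
    """Convert a COCO date_captured string to the manifest creation-date format."""
    parsed = datetime.datetime.strptime(date_captured, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")

print(to_creation_date("2013-11-18 02:53:27"))
# 2013-11-18T02:53:27
```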

#### job-name
<a name="md-job-name-coco"></a>

A job name of your choosing. 

# The COCO dataset format
<a name="md-coco-overview"></a>

A COCO dataset consists of five sections that provide information for the entire dataset. The format for a COCO object detection dataset is documented at [COCO Data Format](http://cocodataset.org/#format-data). 
+ info – general information about the dataset. 
+ licenses – license information for the images in the dataset.
+ [images](#md-coco-images) – a list of images in the dataset.
+ [annotations](#md-coco-annotations) – a list of annotations (including bounding boxes) that are present in all images in the dataset.
+ [categories](#md-coco-categories) – a list of label categories.

To create a Custom Labels manifest, you use the `images`, `annotations`, and `categories` lists from the COCO manifest file. The other sections (`info`, `licenses`) aren’t required. The following is an example COCO manifest file.

```
{
    "info": {
        "description": "COCO 2017 Dataset","url": "http://cocodataset.org","version": "1.0","year": 2017,"contributor": "COCO Consortium","date_created": "2017/09/01"
    },
    "licenses": [
        {"url": "http://creativecommons.org/licenses/by/2.0/","id": 4,"name": "Attribution License"}
    ],
    "images": [
        {"id": 242287, "license": 4, "coco_url": "http://images.cocodataset.org/val2017/xxxxxxxxxxxx.jpg", "flickr_url": "http://farm3.staticflickr.com/2626/xxxxxxxxxxxx.jpg", "width": 426, "height": 640, "file_name": "xxxxxxxxx.jpg", "date_captured": "2013-11-15 02:41:42"},
        {"id": 245915, "license": 4, "coco_url": "http://images.cocodataset.org/val2017/nnnnnnnnnnnn.jpg", "flickr_url": "http://farm1.staticflickr.com/88/xxxxxxxxxxxx.jpg", "width": 640, "height": 480, "file_name": "nnnnnnnnnn.jpg", "date_captured": "2013-11-18 02:53:27"}
    ],
    "annotations": [
        {"id": 125686, "category_id": 0, "iscrowd": 0, "segmentation": [[164.81, 417.51,......167.55, 410.64]], "image_id": 242287, "area": 42061.80340000001, "bbox": [19.23, 383.18, 314.5, 244.46]},
        {"id": 1409619, "category_id": 0, "iscrowd": 0, "segmentation": [[376.81, 238.8,........382.74, 241.17]], "image_id": 245915, "area": 3556.2197000000015, "bbox": [399, 251, 155, 101]},
        {"id": 1410165, "category_id": 1, "iscrowd": 0, "segmentation": [[486.34, 239.01,..........495.95, 244.39]], "image_id": 245915, "area": 1775.8932499999994, "bbox": [86, 65, 220, 334]}
    ],
    "categories": [
        {"supercategory": "speaker","id": 0,"name": "echo"},
        {"supercategory": "speaker","id": 1,"name": "echo dot"}
    ]
}
```

## images list
<a name="md-coco-images"></a>

The images referenced by a COCO dataset are listed in the images array. Each image object contains information about the image such as the image file name. In the following example image object, note the following information and which fields are required to create an Amazon Rekognition Custom Labels manifest file.
+ `id` – (Required) A unique identifier for the image. The `id` field maps to the `image_id` field in the annotations array (where bounding box information is stored).
+ `license` – (Not required) Maps to the licenses array. 
+ `coco_url` – (Optional) The location of the image.
+ `flickr_url` – (Not required) The location of the image on Flickr.
+ `width` – (Required) The width of the image.
+ `height` – (Required) The height of the image.
+ `file_name` – (Required) The image file name. In this example, `file_name` and `id` match, but this is not a requirement for COCO datasets. 
+ `date_captured` – (Required) The date and time the image was captured. 

```
{
    "id": 245915,
    "license": 4,
    "coco_url": "http://images.cocodataset.org/val2017/nnnnnnnnnnnn.jpg",
    "flickr_url": "http://farm1.staticflickr.com/88/nnnnnnnnnnnnnnnnnnn.jpg",
    "width": 640,
    "height": 480,
    "file_name": "000000245915.jpg",
    "date_captured": "2013-11-18 02:53:27"
}
```

## annotations (bounding boxes) list
<a name="md-coco-annotations"></a>

Bounding box information for all objects on all images is stored in the annotations list. A single annotation object contains bounding box information for a single object and the object's label on an image. There is an annotation object for each instance of an object on an image. 

In the following example, note the following information and which fields are required to create an Amazon Rekognition Custom Labels manifest file. 
+ `id` – (Not required) The identifier for the annotation.
+ `image_id` – (Required) Corresponds to the image `id` in the images array.
+ `category_id` – (Required) The identifier for the label that identifies the object within a bounding box. It maps to the `id` field of the categories array. 
+ `iscrowd` – (Not required) Specifies whether the image contains a crowd of objects. 
+ `segmentation` – (Not required) Segmentation information for objects on an image. Amazon Rekognition Custom Labels doesn't support segmentation. 
+ `area` – (Not required) The area of the annotation.
+ `bbox` – (Required) Contains the coordinates, in pixels, of a bounding box around an object on the image, in the order `[left, top, width, height]`.

```
{
    "id": 1409619,
    "category_id": 1,
    "iscrowd": 0,
    "segmentation": [
        [86.0, 238.8,..........382.74, 241.17]
    ],
    "image_id": 245915,
    "area": 3556.2197000000015,
    "bbox": [86, 65, 220, 334]
}
```
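Because `bbox` is ordered `[x, y, width, height]` with pixel coordinates measured from the top-left corner of the image, you can recover the box corners directly. A minimal sketch (`bbox_corners` is a hypothetical helper name):

```
def bbox_corners(bbox):
    """Return (left, top, right, bottom) pixel coordinates for a COCO bbox."""
    x, y, width, height = bbox
    return (x, y, x + width, y + height)

# The example annotation above has bbox [86, 65, 220, 334].
print(bbox_corners([86, 65, 220, 334]))
# (86, 65, 306, 399)
```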

## categories list
<a name="md-coco-categories"></a>

Label information is stored in the categories array. In the following example category object, note the following information and which fields are required to create an Amazon Rekognition Custom Labels manifest file. 
+ `supercategory` – (Not required) The parent category for a label. 
+ `id` – (Required) The label identifier. The `id` field maps to the `category_id` field in an `annotation` object. In the following example, the identifier for an echo dot is 2. 
+ `name` – (Required) The label name. 

```
{"supercategory": "speaker","id": 2,"name": "echo dot"}
```

# Transforming a COCO dataset
<a name="md-coco-transform-example"></a>

Use the following Python example to transform bounding box information from a COCO format dataset into an Amazon Rekognition Custom Labels manifest file. The code uploads the created manifest file to your Amazon S3 bucket. The code also provides an AWS CLI command that you can use to upload your images. 

**To transform a COCO dataset (SDK)**

1. If you haven't already:

   1. Make sure you have `AmazonS3FullAccess` permissions. For more information, see [Set up SDK permissions](su-sdk-permissions.md).

   1. Install and configure the AWS CLI and the AWS SDKs. For more information, see [Step 4: Set up the AWS CLI and AWS SDKs](su-awscli-sdk.md).

1. Use the following Python code to transform a COCO dataset. Set the following values.
   + `s3_bucket` – The name of the S3 bucket in which you want to store the images and Amazon Rekognition Custom Labels manifest file. 
   + `s3_key_path_images` – The path to where you want to place the images within the S3 bucket (`s3_bucket`).
   + `s3_key_path_manifest_file` – The path to where you want to place the Custom Labels manifest file within the S3 bucket (`s3_bucket`).
   + `local_path` – The local path to where the example opens the input COCO dataset and also saves the new Custom Labels manifest file.
   + `local_images_path` – The local path to the images that you want to use for training.
   + `coco_manifest` – The input COCO dataset filename.
   + `cl_manifest_file` – A name for the manifest file created by the example. The file is saved at the location specified by `local_path`. By convention, the file has the extension `.manifest`, but this is not required.
   + `job_name` – A name for the Custom Labels job.

   ```
   import json
   import os
   import random
   import shutil
   import datetime
   import botocore
   import boto3
   import PIL.Image as Image
   import io
   
   #S3 location for images
   s3_bucket = 'bucket'
   s3_key_path_manifest_file = 'path to custom labels manifest file/'
   s3_key_path_images = 'path to images/'
   s3_path='s3://' + s3_bucket  + '/' + s3_key_path_images
   s3 = boto3.resource('s3')
   
   #Local file information
   local_path='path to input COCO dataset and output Custom Labels manifest/'
   local_images_path='path to COCO images/'
   coco_manifest = 'COCO dataset JSON file name'
   coco_json_file = local_path + coco_manifest
   job_name='Custom Labels job name'
   cl_manifest_file = 'custom_labels.manifest'
   
   label_attribute ='bounding-box'
   
   open(local_path + cl_manifest_file, 'w').close()
   
   # class representing a Custom Label JSON line for an image
   class cl_json_line:  
       def __init__(self,job, img):  
   
           #Get image info. Annotations are dealt with separately
           sizes=[]
           image_size={}
           image_size["width"] = img["width"]
           image_size["depth"] = 3
           image_size["height"] = img["height"]
           sizes.append(image_size)
   
           bounding_box={}
           bounding_box["annotations"] = []
           bounding_box["image_size"] = sizes
   
           self.__dict__["source-ref"] = s3_path + img['file_name']
           self.__dict__[job] = bounding_box
   
           #get metadata
           metadata = {}
           metadata['job-name'] = job_name
           metadata['class-map'] = {}
           metadata['human-annotated']='yes'
           metadata['objects'] = [] 
           date_time_obj = datetime.datetime.strptime(img['date_captured'], '%Y-%m-%d %H:%M:%S')
           metadata['creation-date']= date_time_obj.strftime('%Y-%m-%dT%H:%M:%S') 
           metadata['type']='groundtruth/object-detection'
           
           self.__dict__[job + '-metadata'] = metadata
   
   
   print("Getting image, annotations, and categories from COCO file...")
   
   with open(coco_json_file) as f:
   
       #Get custom label compatible info    
       js = json.load(f)
       images = js['images']
       categories = js['categories']
       annotations = js['annotations']
   
       print('Images: ' + str(len(images)))
       print('annotations: ' + str(len(annotations)))
       print('categories: ' + str(len (categories)))
   
   
   print("Creating CL JSON lines...")
       
   images_dict = {image['id']: cl_json_line(label_attribute, image) for image in images}
   
   print('Parsing annotations...')
   for annotation in annotations:
   
       image=images_dict[annotation['image_id']]
   
       cl_annotation = {}
       cl_class_map={}
   
       # get bounding box information
       cl_bounding_box={}
       cl_bounding_box['left'] = annotation['bbox'][0]
       cl_bounding_box['top'] = annotation['bbox'][1]
    
       cl_bounding_box['width'] = annotation['bbox'][2]
       cl_bounding_box['height'] = annotation['bbox'][3]
       cl_bounding_box['class_id'] = annotation['category_id']
   
       getattr(image, label_attribute)['annotations'].append(cl_bounding_box)
   
   
       for category in categories:
           if annotation['category_id'] == category['id']:
               getattr(image, label_attribute + '-metadata')['class-map'][category['id']]=category['name']
           
       
       cl_object={}
       cl_object['confidence'] = int(1)  #not currently used by Custom Labels
       getattr(image, label_attribute + '-metadata')['objects'].append(cl_object)
   
   print('Done parsing annotations')
   
   # Create manifest file.
   print('Writing Custom Labels manifest...')
   
   with open(local_path + cl_manifest_file, 'a') as outfile:
       for im in images_dict.values():
           json.dump(im.__dict__, outfile)
           outfile.write('\n')
   
   # Upload manifest file to S3 bucket.
   print ('Uploading Custom Labels manifest file to S3 bucket')
   print('Uploading ' + local_path + cl_manifest_file + ' to ' + s3_key_path_manifest_file)
   print(s3_bucket)
   s3 = boto3.resource('s3')
   s3.Bucket(s3_bucket).upload_file(local_path + cl_manifest_file, s3_key_path_manifest_file + cl_manifest_file)
   
   # Print S3 URL to manifest file,
   print ('S3 URL Path to manifest file. ')
   print('\033[1m s3://' + s3_bucket + '/' + s3_key_path_manifest_file + cl_manifest_file + '\033[0m') 
   
   # Display aws s3 sync command.
   print ('\nAWS CLI s3 sync command to upload your images to S3 bucket. ')
   print ('\033[1m aws s3 sync ' + local_images_path + ' ' + s3_path + '\033[0m')
   ```

1. Run the code.

1. In the program output, note the `s3 sync` command. You need it in the next step.

1. At the command prompt, run the `s3 sync` command. Your images are uploaded to the S3 bucket. If the command fails during upload, run it again until your local images are synchronized with the S3 bucket.

1. In the program output, note the S3 URL path to the manifest file. You need it in the next step.

1. Follow the instruction at [Creating a dataset with a SageMaker AI Ground Truth manifest file (Console)](md-create-dataset-ground-truth.md#md-create-dataset-ground-truth-console) to create a dataset with the uploaded manifest file. For step 8, in **.manifest file location**, enter the Amazon S3 URL you noted in the previous step. If you are using the AWS SDK, do [Creating a dataset with a SageMaker AI Ground Truth manifest file (SDK)](md-create-dataset-ground-truth.md#md-create-dataset-ground-truth-sdk).

# Transforming multi-label SageMaker AI Ground Truth manifest files
<a name="md-gt-cl-transform"></a>

This topic shows you how to transform a multi-label Amazon SageMaker AI Ground Truth manifest file to an Amazon Rekognition Custom Labels format manifest file. 

SageMaker AI Ground Truth manifest files for multi-label jobs are formatted differently than Amazon Rekognition Custom Labels format manifest files. In multi-label classification, an image is classified into a set of classes but might belong to multiple classes at once. In this case, the image can have multiple labels (multi-label), such as *football* and *ball*.

For information about multi-label SageMaker AI Ground Truth jobs, see [Image Classification (Multi-label)](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-image-classification-multilabel.html). For information about multi-label format Amazon Rekognition Custom Labels manifest files, see [Adding multiple image-level labels to an image](md-create-manifest-file-classification.md#md-dataset-purpose-classification-multiple-labels).

## Getting the manifest file for a SageMaker AI Ground Truth job
<a name="md-get-gt-manifest"></a>

The following procedure shows you how to get the output manifest file (`output.manifest`) for an Amazon SageMaker AI Ground Truth job. You use `output.manifest` as input to the next procedure.

**To download a SageMaker AI Ground Truth job manifest file**

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). 

1. In the navigation pane, choose **Ground Truth** and then choose **Labeling Jobs**. 

1. Choose the labeling job that contains the manifest file that you want to use.

1. On the details page, choose the link under **Output dataset location**. The Amazon S3 console is opened at the dataset location. 

1. Choose `Manifests`, `output` and then `output.manifest`.

1. Choose **Object Actions** and then choose **Download** to download the manifest file.

## Transforming a multi-label SageMaker AI manifest file
<a name="md-transform-ml-gt"></a>

The following procedure creates a multi-label format Amazon Rekognition Custom Labels manifest file from an existing multi-label format SageMaker AI Ground Truth manifest file.

**Note**  
To run the code, you need Python version 3 or higher.<a name="md-procedure-multi-label-transform"></a>

**To transform a multi-label SageMaker AI manifest file**

1. Run the following Python code. Supply the name of the manifest file that you created in [Getting the manifest file for a SageMaker AI Ground Truth job](#md-get-gt-manifest) as a command line argument.

   ```
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier:  Apache-2.0
   """
   Purpose
   Shows how to create an Amazon Rekognition Custom Labels format
   manifest file from an Amazon SageMaker Ground Truth Image
   Classification (Multi-label) format manifest file.
   """
   import json
   import logging
   import argparse
   import os.path
   
   logger = logging.getLogger(__name__)
   
   def create_manifest_file(ground_truth_manifest_file):
       """
       Creates an Amazon Rekognition Custom Labels format manifest file from
       an Amazon SageMaker Ground Truth Image Classification (Multi-label) format
       manifest file.
       :param: ground_truth_manifest_file: The name of the Ground Truth manifest file,
       including the relative path.
       :return: The name of the new Custom Labels manifest file.
       """
   
       logger.info('Creating manifest file from %s', ground_truth_manifest_file)
       new_manifest_file = f'custom_labels_{os.path.basename(ground_truth_manifest_file)}'
   
       # Read the SageMaker Ground Truth manifest file into memory.
       with open(ground_truth_manifest_file) as gt_file:
           lines = gt_file.readlines()
   
       #Iterate through the lines one at a time to generate the
       #new lines for the Custom Labels manifest file.
       with open(new_manifest_file, 'w') as the_new_file:
           for line in lines:
               #job_name - The name of the Amazon SageMaker Ground Truth job.
               job_name = ''
               # Load in the old json item from the Ground Truth manifest file
               old_json = json.loads(line)
   
               # Get the job name
               keys = old_json.keys()
               for key in keys:
                   if 'source-ref' not in key and '-metadata' not in key:
                       job_name = key
   
               new_json = {}
               # Set the location of the image
               new_json['source-ref'] = old_json['source-ref']
   
               # Temporarily store the list of labels
               labels = old_json[job_name]
   
               # Iterate through the labels and reformat to Custom Labels format
               for index, label in enumerate(labels):
                   new_json[f'{job_name}{index}'] = index
                   metadata = {}
                   metadata['class-name'] = old_json[f'{job_name}-metadata']['class-map'][str(label)]
                   metadata['confidence'] = old_json[f'{job_name}-metadata']['confidence-map'][str(label)]
                   metadata['type'] = 'groundtruth/image-classification'
                   metadata['job-name'] = old_json[f'{job_name}-metadata']['job-name']
                   metadata['human-annotated'] = old_json[f'{job_name}-metadata']['human-annotated']
                   metadata['creation-date'] = old_json[f'{job_name}-metadata']['creation-date']
                   # Add the metadata to new json line
                   new_json[f'{job_name}{index}-metadata'] = metadata
               # Write the current line to the json file
               the_new_file.write(json.dumps(new_json))
               the_new_file.write('\n')
   
       logger.info('Created %s', new_manifest_file)
       return  new_manifest_file
   
   def add_arguments(parser):
       """
       Adds command line arguments to the parser.
       :param parser: The command line parser.
       """
   
       parser.add_argument(
           "manifest_file", help="The Amazon SageMaker Ground Truth manifest file"
           "that you want to use."
       )
   
   
   def main():
       logging.basicConfig(level=logging.INFO,
                           format="%(levelname)s: %(message)s")
       try:
           # get command line arguments
           parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
           add_arguments(parser)
           args = parser.parse_args()
           # Create the manifest file
           manifest_file = create_manifest_file(args.manifest_file)
           print(f'Manifest file created: {manifest_file}')
       except FileNotFoundError as err:
           logger.exception('File not found: %s', err)
           print(f'File not found: {err}. Check your manifest file.')
   
   if __name__ == "__main__":
       main()
   ```

1. Note the name of the new manifest file that the script displays. You use it in the next step.

1. [Upload your manifest files](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) to the Amazon S3 bucket that you want to use for storing the manifest file.
**Note**  
Make sure Amazon Rekognition Custom Labels has access to the Amazon S3 bucket referenced in the `source-ref` field of the manifest file JSON lines. For more information, see [Accessing external Amazon S3 Buckets](su-console-policy.md#su-external-buckets). If your Ground Truth job stores images in the Amazon Rekognition Custom Labels Console Bucket, you don't need to add permissions.

1. Follow the instructions at [Creating a dataset with a SageMaker AI Ground Truth manifest file (Console)](md-create-dataset-ground-truth.md#md-create-dataset-ground-truth-console) to create a dataset with the uploaded manifest file. For step 8, in **.manifest file location**, enter the Amazon S3 URL for the location of the manifest file. If you are using the AWS SDK, do [Creating a dataset with a SageMaker AI Ground Truth manifest file (SDK)](md-create-dataset-ground-truth.md#md-create-dataset-ground-truth-sdk).

# Creating a manifest file from a CSV file
<a name="ex-csv-manifest"></a>

This example Python script simplifies the creation of a manifest file by using a Comma Separated Values (CSV) file to label images. You create the CSV file. The manifest file is suitable for [Image classification](getting-started.md#gs-image-classification-example) or [Multi-label image classification](getting-started.md#gs-multi-label-image-classification-example). For more information, see [Find objects, scenes, and concepts](understanding-custom-labels.md#tm-classification). 

**Note**  
This script doesn't create a manifest file suitable for finding [object locations](understanding-custom-labels.md#tm-object-localization) or for finding [brand locations](understanding-custom-labels.md#tm-brand-detection-localization).

A manifest file describes the images used to train a model, such as image locations and the labels assigned to each image. A manifest file is made up of one or more JSON lines. Each JSON line describes a single image. For more information, see [Importing image-level labels in manifest files](md-create-manifest-file-classification.md).

A CSV file represents tabular data over multiple rows in a text file. Fields on a row are separated by commas. For more information, see [comma separated values](https://en.wikipedia.org/wiki/Comma-separated_values). For this script, each row in your CSV file represents a single image and maps to a JSON Line in the manifest file. To create a CSV file for a manifest file that supports [Multi-label image classification](getting-started.md#gs-multi-label-image-classification-example), you add one or more image-level labels to each row. To create a manifest file suitable for [Image classification](getting-started.md#gs-image-classification-example), you add a single image-level label to each row.

For example, the following CSV file describes the images in the [Multi-label image classification](getting-started.md#gs-multi-label-image-classification-example) (Flowers) *Getting started* project. 

```
camellia1.jpg,camellia,with_leaves
camellia2.jpg,camellia,with_leaves
camellia3.jpg,camellia,without_leaves
helleborus1.jpg,helleborus,without_leaves,not_fully_grown
helleborus2.jpg,helleborus,with_leaves,fully_grown
helleborus3.jpg,helleborus,with_leaves,fully_grown
jonquil1.jpg,jonquil,with_leaves
jonquil2.jpg,jonquil,with_leaves
jonquil3.jpg,jonquil,with_leaves
jonquil4.jpg,jonquil,without_leaves
mauve_honey_myrtle1.jpg,mauve_honey_myrtle,without_leaves
mauve_honey_myrtle2.jpg,mauve_honey_myrtle,with_leaves
mauve_honey_myrtle3.jpg,mauve_honey_myrtle,with_leaves
mediterranean_spurge1.jpg,mediterranean_spurge,with_leaves
mediterranean_spurge2.jpg,mediterranean_spurge,without_leaves
```

The script generates a JSON Line for each row. For example, the following is the JSON Line for the first row (`camellia1.jpg,camellia,with_leaves`).

```
{"source-ref": "s3://bucket/flowers/train/camellia1.jpg","camellia": 1,"camellia-metadata":{"confidence": 1,"job-name": "labeling-job/camellia","class-name": "camellia","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"},"with_leaves": 1,"with_leaves-metadata":{"confidence": 1,"job-name": "labeling-job/with_leaves","class-name": "with_leaves","human-annotated": "yes","creation-date": "2022-01-21T14:21:05","type": "groundtruth/image-classification"}}
```
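Because each line of the manifest is standard JSON, you can inspect an entry with a few lines of Python. The following sketch (not part of the script) parses the example line above and recovers the image location and its labels; image-level labels are the top-level keys that have a companion `-metadata` key:

```python
import json

# The example JSON line from above, split across strings for readability.
json_line = (
    '{"source-ref": "s3://bucket/flowers/train/camellia1.jpg",'
    '"camellia": 1,"camellia-metadata":{"confidence": 1,'
    '"job-name": "labeling-job/camellia","class-name": "camellia",'
    '"human-annotated": "yes","creation-date": "2022-01-21T14:21:05",'
    '"type": "groundtruth/image-classification"},'
    '"with_leaves": 1,"with_leaves-metadata":{"confidence": 1,'
    '"job-name": "labeling-job/with_leaves","class-name": "with_leaves",'
    '"human-annotated": "yes","creation-date": "2022-01-21T14:21:05",'
    '"type": "groundtruth/image-classification"}}'
)

entry = json.loads(json_line)

# Image-level labels are the keys that have a matching "<label>-metadata" key.
labels = [key for key in entry if f"{key}-metadata" in entry]

print(entry["source-ref"])  # s3://bucket/flowers/train/camellia1.jpg
print(labels)               # ['camellia', 'with_leaves']
```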

In the example CSV, the Amazon S3 path to the images is not present. If your CSV file doesn't include the Amazon S3 path for the images, use the `--s3_path` command line argument to specify the Amazon S3 path to the images. 
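The script builds each `source-ref` value by plain string concatenation of the path and the file name, so a value passed with `--s3_path` must end with a slash. A minimal sketch of that behavior (the function name is illustrative, not part of the script):

```python
def build_source_ref(file_name, s3_path=""):
    # Mirrors the script's str(s3_path) + row[0]: no separator is inserted,
    # so the S3 path must already end with "/".
    return f"{s3_path}{file_name}"

print(build_source_ref("camellia1.jpg", "s3://my-bucket/flowers/train/"))
# s3://my-bucket/flowers/train/camellia1.jpg
```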

The script records the first entry for each image in a deduplicated image CSV file. The deduplicated image CSV file contains a single instance of each image found in the input CSV file. Further occurrences of an image in the input CSV file are recorded in a duplicate image CSV file. If the script finds duplicate images, review the duplicate image CSV file and update the deduplicated image CSV file as necessary, then rerun the script with the deduplicated file. If the input CSV file contains no duplicates, the script deletes the deduplicated image CSV file and the duplicate image CSV file, as they are empty. 
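The deduplication step keeps only the first row for each image file name (field 1) and sets the rest aside for review. A condensed sketch of the logic that `check_duplicates` applies, run against in-memory rows instead of files:

```python
import csv
import io

# Sample input with one duplicate image (camellia1.jpg appears twice).
rows = """camellia1.jpg,camellia,with_leaves
camellia1.jpg,camellia,without_leaves
jonquil1.jpg,jonquil,with_leaves"""

seen = set()
deduplicated, duplicates = [], []
for row in csv.reader(io.StringIO(rows)):
    if row[0] not in seen:           # first occurrence of this image
        seen.add(row[0])
        deduplicated.append(row)
    else:                            # later occurrences go to the duplicates file
        duplicates.append(row)

print([r[0] for r in deduplicated])  # ['camellia1.jpg', 'jonquil1.jpg']
print(duplicates)                    # [['camellia1.jpg', 'camellia', 'without_leaves']]
```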

 In this procedure, you create the CSV file and run the Python script to create the manifest file. 

**To create a manifest file from a CSV file**

1. Create a CSV file with the following fields in each row (one row per image). Don't add a header row to the CSV file.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/rekognition/latest/customlabels-dg/ex-csv-manifest.html)

   For example `camellia1.jpg,camellia,with_leaves` or `s3://my-bucket/flowers/train/camellia1.jpg,camellia,with_leaves` 

1. Save the CSV file.

1. Run the following Python script. Supply the following arguments:
   + `csv_file` – The CSV file that you created in step 1. 
   + `manifest_file` – The name of the manifest file that you want to create.
   + (Optional) `--s3_path s3://path_to_folder/` – The Amazon S3 path to add to the image file names (field 1). Use `--s3_path` if the images in field 1 don't already contain an S3 path.

   ```
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier:  Apache-2.0
   
   from datetime import datetime, timezone
   import argparse
   import logging
   import csv
   import os
   import json
   
   """
   Purpose
   Amazon Rekognition Custom Labels model example used in the service documentation.
   Shows how to create an image-level (classification) manifest file from a CSV file.
   You can specify multiple image level labels per image.
   CSV file format is
   image,label,label,..
    If necessary, use the --s3_path argument to specify the S3 bucket folder for the images.
    https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/ex-csv-manifest.html
   """
   
   logger = logging.getLogger(__name__)
   
   
   def check_duplicates(csv_file, deduplicated_file, duplicates_file):
       """
       Checks for duplicate images in a CSV file. If duplicate images
       are found, deduplicated_file is the deduplicated CSV file - only the first
        occurrence of a duplicate is recorded. Other duplicates are recorded in duplicates_file.
       :param csv_file: The source CSV file.
       :param deduplicated_file: The deduplicated CSV file to create. If no duplicates are found
       this file is removed.
       :param duplicates_file: The duplicate images CSV file to create. If no duplicates are found
       this file is removed.
       :return: True if duplicates are found, otherwise false.
       """
   
       logger.info("Deduplicating %s", csv_file)
   
       duplicates_found = False
   
       # Find duplicates.
       with open(csv_file, 'r', newline='', encoding="UTF-8") as f,\
               open(deduplicated_file, 'w', encoding="UTF-8") as dedup,\
               open(duplicates_file, 'w', encoding="UTF-8") as duplicates:
   
           reader = csv.reader(f, delimiter=',')
           dedup_writer = csv.writer(dedup)
           duplicates_writer = csv.writer(duplicates)
   
           entries = set()
           for row in reader:
               # Skip empty lines.
               if not ''.join(row).strip():
                   continue
   
               key = row[0]
               if key not in entries:
                   dedup_writer.writerow(row)
                   entries.add(key)
               else:
                   duplicates_writer.writerow(row)
                   duplicates_found = True
   
       if duplicates_found:
           logger.info("Duplicates found check %s", duplicates_file)
   
       else:
           os.remove(duplicates_file)
           os.remove(deduplicated_file)
   
       return duplicates_found
   
   
   def create_manifest_file(csv_file, manifest_file, s3_path):
       """
       Reads a CSV file and creates a Custom Labels classification manifest file.
       :param csv_file: The source CSV file.
       :param manifest_file: The name of the manifest file to create.
       :param s3_path: The S3 path to the folder that contains the images.
       """
       logger.info("Processing CSV file %s", csv_file)
   
       image_count = 0
       label_count = 0
   
       with open(csv_file, newline='', encoding="UTF-8") as csvfile,\
               open(manifest_file, "w", encoding="UTF-8") as output_file:
   
           image_classifications = csv.reader(
               csvfile, delimiter=',', quotechar='|')
   
           # Process each row (image) in CSV file.
           for row in image_classifications:
               source_ref = str(s3_path)+row[0]
   
               image_count += 1
   
               # Create JSON for image source ref.
               json_line = {}
               json_line['source-ref'] = source_ref
   
               # Process each image level label.
               for index in range(1, len(row)):
                   image_level_label = row[index]
   
                   # Skip empty columns.
                   if image_level_label == '':
                       continue
                   label_count += 1
   
                    # Create the JSON line metadata.
                   json_line[image_level_label] = 1
                   metadata = {}
                   metadata['confidence'] = 1
                   metadata['job-name'] = 'labeling-job/' + image_level_label
                   metadata['class-name'] = image_level_label
                   metadata['human-annotated'] = "yes"
                   metadata['creation-date'] = \
                       datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.%f')
                   metadata['type'] = "groundtruth/image-classification"
   
                   json_line[f'{image_level_label}-metadata'] = metadata
   
                # Write the image JSON Line.
               output_file.write(json.dumps(json_line))
               output_file.write('\n')
   
       logger.info("Finished creating manifest file %s\nImages: %s\nLabels: %s",
                   manifest_file, image_count, label_count)
   
       return image_count, label_count
   
   
   def add_arguments(parser):
       """
       Adds command line arguments to the parser.
       :param parser: The command line parser.
       """
   
       parser.add_argument(
           "csv_file", help="The CSV file that you want to process."
       )
   
       parser.add_argument(
           "--s3_path", help="The S3 bucket and folder path for the images."
           " If not supplied, column 1 is assumed to include the S3 path.", required=False
       )
   
   
   def main():
   
       logging.basicConfig(level=logging.INFO,
                           format="%(levelname)s: %(message)s")
   
       try:
   
           # Get command line arguments
           parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
           add_arguments(parser)
           args = parser.parse_args()
   
           s3_path = args.s3_path
           if s3_path is None:
               s3_path = ''
   
           # Create file names.
           csv_file = args.csv_file
           file_name = os.path.splitext(csv_file)[0]
           manifest_file = f'{file_name}.manifest'
           duplicates_file = f'{file_name}-duplicates.csv'
           deduplicated_file = f'{file_name}-deduplicated.csv'
   
           # Create manifest file, if there are no duplicate images.
           if check_duplicates(csv_file, deduplicated_file, duplicates_file):
               print(f"Duplicates found. Use {duplicates_file} to view duplicates "
                     f"and then update {deduplicated_file}. ")
               print(f"{deduplicated_file} contains the first occurence of a duplicate. "
                     "Update as necessary with the correct label information.")
               print(f"Re-run the script with {deduplicated_file}")
           else:
               print("No duplicates found. Creating manifest file.")
   
               image_count, label_count = create_manifest_file(csv_file,
                                                               manifest_file,
                                                               s3_path)
   
               print(f"Finished creating manifest file: {manifest_file} \n"
                     f"Images: {image_count}\nLabels: {label_count}")
   
       except FileNotFoundError as err:
           logger.exception("File not found: %s", err)
           print(f"File not found: {err}. Check your input CSV file.")
   
   
   if __name__ == "__main__":
       main()
   ```

1. If you plan to use a test dataset, repeat steps 1–3 to create a manifest file for your test dataset.

1. If necessary, copy the images to the Amazon S3 bucket path that you specified in column 1 of the CSV file (or that you specified with the `--s3_path` command line argument). You can use the following AWS CLI command.

   ```
   aws s3 cp --recursive your-local-folder s3://your-target-S3-location
   ```

1. [Upload your manifest files](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) to the Amazon S3 bucket that you want to use for storing the manifest file.
**Note**  
Make sure Amazon Rekognition Custom Labels has access to the Amazon S3 bucket referenced in the `source-ref` field of the manifest file JSON lines. For more information, see [Accessing external Amazon S3 Buckets](su-console-policy.md#su-external-buckets). If your Ground Truth job stores images in the Amazon Rekognition Custom Labels Console Bucket, you don't need to add permissions.

1. Follow the instructions at [Creating a dataset with a SageMaker AI Ground Truth manifest file (Console)](md-create-dataset-ground-truth.md#md-create-dataset-ground-truth-console) to create a dataset with the uploaded manifest file. For step 8, in **.manifest file location**, enter the Amazon S3 URL for the location of the manifest file. If you are using the AWS SDK, follow [Creating a dataset with a SageMaker AI Ground Truth manifest file (SDK)](md-create-dataset-ground-truth.md#md-create-dataset-ground-truth-sdk).