

# The importance of the sourceIdentifier attribute to lineage nodes


Every lineage node is uniquely identified by its `sourceIdentifier` (usually provided as part of the OpenLineage event), in addition to the system-generated `nodeId`. The `sourceIdentifier` is generated from the `<namespace>` and `<name>` of the node in the lineage event.

The following are examples of sourceIdentifier values for different types of nodes:
+ **Job nodes**
  + The `sourceIdentifier` of a job node is populated from `<namespace>.<name>` of the job in the OpenLineage run event.
+ **Jobrun nodes**
  + The `sourceIdentifier` of a jobrun node is populated from `<job's namespace>.<job's name>/<run_id>`.
+ **Dataset nodes**
  + Dataset nodes representing AWS resources: `sourceIdentifier` is in ARN format
    + AWS Glue table: `arn:aws:glue:<region>:<account-id>:table/<database>/<table-name>`
    + AWS Glue table with federated sources: `arn:aws:glue:<region>:<account-id>:table/<catalog>/<database>/<table-name>`
      + Example: the catalog can be "s3tablescatalog"/"s3tablesBucket", "lakehouse_catalog", and so on.
    + Amazon Redshift table:
      + serverless: `arn:aws:redshift-serverless:<region>:<account-id>:table/<workgroup-name>/<database>/<schema>/<table-name>`
      + provisioned: `arn:aws:redshift:<region>:<account-id>:table/<cluster-identifier>/<database>/<schema>/<table-name>`
    + Amazon Redshift view:
      + serverless: `arn:aws:redshift-serverless:<region>:<account-id>:view/<workgroup-name>/<database>/<schema>/<view-name>`
      + provisioned: `arn:aws:redshift:<region>:<account-id>:view/<cluster-identifier>/<database>/<schema>/<view-name>`
  + Dataset nodes representing SageMaker catalog resources:
    + Asset: `amazon.datazone.asset/<assetId>`
    + Listing (published asset): `amazon.datazone.listing/<listingId>`
  + In all other cases, the dataset node's `sourceIdentifier` is populated from `<namespace>/<name>` of the dataset in the OpenLineage run event.
    + See https://openlineage.io/docs/spec/naming/ for the naming conventions for various datastores.
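
The job, jobrun, and fallback dataset rules in the list above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical and are not part of any Amazon DataZone or OpenLineage API, and the string formats are taken directly from the patterns listed above.

```python
def job_source_identifier(namespace: str, name: str) -> str:
    """Hypothetical helper: <namespace>.<name> of the job."""
    return f"{namespace}.{name}"


def jobrun_source_identifier(job_namespace: str, job_name: str, run_id: str) -> str:
    """Hypothetical helper: <job's namespace>.<job's name>/<run_id>."""
    return f"{job_namespace}.{job_name}/{run_id}"


def generic_dataset_source_identifier(namespace: str, name: str) -> str:
    """Hypothetical helper for datasets with no special handling:
    <namespace>/<name> of the dataset."""
    return f"{namespace}/{name}"


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(job_source_identifier("my_namespace", "my_job"))
    print(jobrun_source_identifier("my_namespace", "my_job", "run-0001"))
    print(generic_dataset_source_identifier("namespace.output", "output_name"))
```

Note that job identifiers join namespace and name with a period, while the fallback dataset identifier joins them with a slash.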

The following table shows examples of how `sourceIdentifier` is generated for datasets of different types.



| Source for lineage event | Sample OpenLineage event data | Source ID computed by Amazon DataZone | 
| --- | --- | --- | 
|  AWS Glue ETL  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />           "environment-properties":{<br />                 ....<br />                "environment-properties":{<br />                     "GLUE_VERSION":"3.0",<br />                     "GLUE_COMMAND_CRITERIA":"glueetl",<br />                     "GLUE_PYTHON_VERSION":"3"<br />                }<br />           }<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"namespace.output",<br />         "name":"output_name",<br />         "facets":{<br />             "symlinks":{<br />                 .... <br />                 "identifiers":[<br />                    {<br />                       "namespace":"arn:aws:glue:us-west-2:123456789012",<br />                       "name":"table/testdb/testtb-1",<br />                       "type":"TABLE"<br />                    }<br />                 ]<br />             }<br />        }<br />     }<br />   ]<br />    <br />}<br />                               </pre>  | `arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1`<br />If the `environment-properties` facet contains `GLUE_VERSION`, `GLUE_PYTHON_VERSION`, and so on, Amazon DataZone uses the `namespace` and `name` from the `symlinks` facet of the dataset (input or output) to construct the AWS Glue table ARN for `sourceIdentifier`. | 
|  Amazon Redshift (provisioned)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "inputs":[<br />      {<br />         "namespace":"redshift://cluster-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7",<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />    <br />}<br />                                </pre>  | `arn:aws:redshift:us-east-1:123456789012:table/cluster-20240715/tpcds_data/public/dws_tpcds_7`<br />If the namespace prefix is `redshift`, Amazon DataZone constructs the Amazon Redshift ARN from the values of the `namespace` and `name` attributes. | 
|  Amazon Redshift (serverless)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"redshift://workgroup-20240715.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7",<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />}<br />                                </pre>  | `arn:aws:redshift-serverless:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7`<br />Per the OpenLineage naming convention, the namespace for an Amazon Redshift dataset should be `provider://{cluster_identifier or workgroup}.{region_name}:{port}`. If the namespace contains `redshift-serverless`, Amazon DataZone constructs the Amazon Redshift ARN from the values of the `namespace` and `name` attributes. | 
|  Any other datastore  |  We recommend populating `namespace` and `name` per the OpenLineage convention defined in [https://openlineage.io/docs/spec/naming/](https://openlineage.io/docs/spec/naming/).  |  Amazon DataZone populates `sourceIdentifier` as `<namespace>/<name>`.  | 
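
The mappings in the table can be approximated with a short Python sketch. This is not Amazon DataZone's actual implementation. The function names and the `is_glue_run` flag are hypothetical, and the host layout assumed for Redshift namespaces (`<identifier>.<account-id>.<region>.<service>.amazonaws.com`) is inferred only from the sample events in the table.

```python
from typing import Optional
from urllib.parse import urlparse


def glue_table_arn_from_symlinks(dataset: dict) -> Optional[str]:
    """Build a Glue table ARN from the dataset's symlinks facet, if present."""
    identifiers = dataset.get("facets", {}).get("symlinks", {}).get("identifiers", [])
    for ident in identifiers:
        ns = ident.get("namespace", "")
        if ident.get("type") == "TABLE" and ns.startswith("arn:aws:glue:"):
            # namespace holds the ARN prefix, name holds "table/<database>/<table-name>"
            return f"{ns}:{ident['name']}"
    return None


def redshift_table_arn(namespace: str, name: str) -> str:
    """Build a Redshift table ARN from a redshift:// namespace and a dotted name.

    Assumption: the host follows the layout seen in the sample events.
    """
    parsed = urlparse(namespace)
    labels = (parsed.hostname or "").split(".")
    if parsed.scheme != "redshift" or len(labels) < 4:
        raise ValueError(f"unexpected Redshift namespace: {namespace}")
    identifier, account_id, region, service = labels[:4]
    if service not in ("redshift", "redshift-serverless"):
        raise ValueError(f"unexpected service label: {service}")
    # "<database>.<schema>.<table>" becomes "<database>/<schema>/<table>"
    resource_path = "/".join(name.split("."))
    return f"arn:aws:{service}:{region}:{account_id}:table/{identifier}/{resource_path}"


def resolve_dataset_source_identifier(dataset: dict, is_glue_run: bool = False) -> str:
    """Pick the identifier rule for a dataset, falling back to <namespace>/<name>.

    is_glue_run is a hypothetical flag standing in for the check that the run's
    environment-properties facet contains GLUE_VERSION and related keys.
    """
    if is_glue_run:
        arn = glue_table_arn_from_symlinks(dataset)
        if arn is not None:
            return arn
    if dataset["namespace"].startswith("redshift://"):
        return redshift_table_arn(dataset["namespace"], dataset["name"])
    return f"{dataset['namespace']}/{dataset['name']}"


if __name__ == "__main__":
    sample = {
        "namespace": "redshift://cluster-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439",
        "name": "tpcds_data.public.dws_tpcds_7",
    }
    print(resolve_dataset_source_identifier(sample))
```

The sketch handles tables only. Views would use `view/` instead of `table/` in the ARN, and the event samples in the table do not show how a view is distinguished from a table, so that case is left out.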