# FlagDuplicateRows class

The `FlagDuplicateRows` transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# Sample data with repeated rows, including rows that contain nulls.
input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    # Add a "flag_row" column that marks each row as a duplicate or not.
    df_output = data_quality.FlagDuplicateRows.apply(
        data_frame=input_df,
        spark_context=sc,
        target_column="flag_row",
        true_string="True",
        false_string="False",
        target_index=1,
    )
except Exception:
    print("Unexpected error happened")
    raise
```

## Output

The output is a PySpark DataFrame with an additional column `flag_row` that indicates whether a row is a duplicate or not, based on the `source_column_1` column. The resulting `df_output` DataFrame contains the following rows:

```
+---------------+---------------+--------+
|source_column_1|source_column_2|flag_row|
+---------------+---------------+--------+
|        105.111|          13.12|   False|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
+---------------+---------------+--------+
```

The `flag_row` column indicates whether a row is a duplicate or not. The `true_string` is set to "True", and the `false_string` is set to "False". The `target_index` is set to 1, which means that the `flag_row` column will be inserted at the second position (index 1) in the output DataFrame.
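The flagging rule stated at the top of this page (a row is flagged only when it exactly matches an earlier row) can be illustrated with a minimal plain-Python sketch. This is not the `awsgluedi` implementation. The helper name `flag_duplicate_rows` and the sample rows are hypothetical, and two behaviors are assumptions made for illustration: `None` values compare equal to each other, and the flag is appended as the last column when `target_index` is `None`.

```python
from typing import Any, List, Optional, Sequence, Tuple


def flag_duplicate_rows(
    rows: Sequence[Tuple[Any, ...]],
    true_string: str = "True",
    false_string: str = "False",
    target_index: Optional[int] = None,
) -> List[Tuple[Any, ...]]:
    """Hypothetical helper: flag rows that exactly match an earlier row."""
    seen = set()  # rows encountered so far
    flagged = []
    for row in rows:
        # The first occurrence is not in `seen`, so it gets false_string.
        flag = true_string if row in seen else false_string
        seen.add(row)
        values = list(row)
        # Assumption: append the flag as the last column when no index is given.
        position = len(values) if target_index is None else target_index
        values.insert(position, flag)
        flagged.append(tuple(values))
    return flagged


sample_rows = [("a", 1), ("b", 2), ("a", 1), (None, 2), (None, 2)]
for flagged_row in flag_duplicate_rows(sample_rows, target_index=1):
    print(flagged_row)
```

In this sketch, only the third and fifth sample rows receive `true_string`, because each repeats a row that appears earlier in the list.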
## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FlagDuplicateRows-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING, target\_index=None)

The `FlagDuplicateRows` transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

+ `true_string` – Value to be inserted if the row matches an earlier row.
+ `false_string` – Value to be inserted if the row is unique.
+ `target_column` – Name of the new column that is inserted in the dataset.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).
## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).