# Data science recipe steps
<a name="recipe-actions.data-science"></a>

Use these recipe steps to tabulate and summarize data from different perspectives, or to perform advanced transformations.

**Topics**
+ [BINARIZATION](recipe-actions.BINARIZATION.md)
+ [BUCKETIZATION](recipe-actions.BUCKETIZATION.md)
+ [CATEGORICAL\_MAPPING](recipe-actions.CATEGORICAL_MAPPING.md)
+ [ONE\_HOT\_ENCODING](recipe-actions.ONE_HOT_ENCODING.md)
+ [SCALE](#recipe-actions.SCALE)
+ [SKEWNESS](recipe-actions.SKEWNESS.md)
+ [TOKENIZATION](recipe-actions.TOKENIZATION.md)

# BINARIZATION
<a name="recipe-actions.BINARIZATION"></a>

Takes all the values in a selected numeric source column, compares them to a threshold value, and outputs a new column with a 1 or 0 for each row. 

**Parameters**
+ `sourceColumn` – The name of an existing column.
+ `targetColumn` – The name of the new column to be created.
+ `threshold` – Number indicating the threshold for assigning the value of 0 or 1.
+ `flip` – Option to flip binary assignment so that lower values are assigned 1 and higher values are assigned 0. When the `flip` parameter is true, values lower than or equal to the threshold value result in 1, and values greater than the threshold value result in 0.

**Example**  
  

```
{
    "Action": {
        "Operation": "BINARIZATION",
        "Parameters": {
            "sourceColumn": "level",
            "targetColumn": "bin",
            "threshold": "100.0",
            "flip": "false"
        }
    }
}
```
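
The following Python sketch illustrates the comparison rule described above. It is not the service's implementation: the function name `binarize` is hypothetical, and the behavior when `flip` is false (values greater than the threshold become 1, all others become 0) is an assumption inferred from the description of `flip`.

```python
def binarize(values, threshold, flip=False):
    """Return a 1 or 0 for each value, based on a threshold comparison."""
    result = []
    for v in values:
        # Assumption: only values strictly greater than the threshold count as "higher".
        above = v > threshold
        # With flip=True, values <= threshold become 1 and values > threshold become 0.
        result.append(int(not above) if flip else int(above))
    return result


levels = [50.0, 100.0, 150.0]
print(binarize(levels, 100.0))             # [0, 0, 1]
print(binarize(levels, 100.0, flip=True))  # [1, 1, 0]
```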

# BUCKETIZATION
<a name="recipe-actions.BUCKETIZATION"></a>

Bucketization (called Binning in the console) takes the items in a column of numeric values, groups them into bins defined by numeric ranges, and outputs a new column that displays the bin for each row. You can define the bins using either splits or a percentage. The first example below uses splits, and the second example uses a percentage.

**Parameters**
+ `sourceColumn` – The name of an existing column.
+ `targetColumn` – The name of the new column to be created.
+ `bucketNames` – List of bucket names.
+ `splits` – List of bucket levels. Buckets are consecutive, and an upper bound for a bucket will be a lower bound for the next bucket.
+ `percentage` – Each bucket will be described as a percentage.

**Example using splits**  
  

```
{
    "Action": {
        "Operation": "BUCKETIZATION",
        "Parameters": {
            "sourceColumn": "level",
            "targetColumn": "bin",
            "bucketNames": "[\"Bin1\",\"Bin2\",\"Bin3\"]",
            "splits": "[\"-Infinity\",\"2\",\"20\",\"Infinity\"]"
        }
    }
}
```

**Example using a percentage**  

```
{
    "Action": {
        "Operation": "BUCKETIZATION",
        "Parameters": {
            "sourceColumn": "level",
            "targetColumn": "bin",
            "bucketNames": "[\"Bin1\",\"Bin2\"]",
            "percentage": "50"
        }
    }
}
```
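
To make the splits variant more concrete, here is a minimal Python sketch that assigns bucket names from consecutive bounds. The function name `bucketize` is hypothetical, and treating each lower bound as inclusive and each upper bound as exclusive is an assumption, because the parameter description doesn't say which side of a shared bound a value falls on. The percentage variant is not sketched.

```python
import json
from bisect import bisect_right


def bucketize(values, splits, bucket_names):
    """Label each value with the name of the bucket whose range contains it."""
    # float() accepts the strings "Infinity" and "-Infinity".
    bounds = [float(s) for s in splits]
    if len(bucket_names) != len(bounds) - 1:
        raise ValueError("need exactly one name per consecutive pair of splits")
    labels = []
    for v in values:
        # Assumption: lower bound inclusive, upper bound exclusive.
        idx = bisect_right(bounds, v) - 1
        # Clamp so values equal to the outermost bounds still land in a bucket.
        idx = min(max(idx, 0), len(bucket_names) - 1)
        labels.append(bucket_names[idx])
    return labels


# The recipe passes both lists as JSON-encoded strings, as in the first example.
splits = json.loads('["-Infinity","2","20","Infinity"]')
names = json.loads('["Bin1","Bin2","Bin3"]')
print(bucketize([1, 2, 19.5, 20, 1000], splits, names))
# ['Bin1', 'Bin2', 'Bin2', 'Bin3', 'Bin3']
```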

# CATEGORICAL\_MAPPING
<a name="recipe-actions.CATEGORICAL_MAPPING"></a>

Maps one or more categorical values to numeric or other values.

**Parameters**
+ `sourceColumn` – The name of an existing column.
+ `categoryMap` – A JSON-encoded string representing a map of values to categories.
+ `deleteOtherRows` – If `true`, all non-mapped rows will be removed from the dataset.
+ `other` – When provided, all non-mapped values will be replaced by this value.
+ `keepOthers` – If `true`, all non-mapped values will remain the same.
+ `mapType` – The data type of the mapped column.
+ `targetColumn` – The name of a column to contain the results.

**Example**  
  

```
{
    "Action": {
        "Operation": "CATEGORICAL_MAPPING",
        "Parameters": {
            "categoryMap": "{\"United States of America\":\"1\",\"Canada\":\"2\",\"Cuba\":\"3\",\"Haiti\":\"4\",\"Dominican Republic\":\"5\"}",
            "deleteOtherRows": "false",
            "keepOthers": "true",           
            "mapType": "NUMERIC",
            "sourceColumn": "state_name",
            "targetColumn": "state_name_mapped"
        }
    }
}
```
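
The sketch below shows one way to produce the JSON-encoded `categoryMap` string from a Python dictionary, and how the options for non-mapped values might interact. The function name `apply_category_map` is hypothetical. Letting `other` take precedence over `keepOthers`, and returning `None` when neither applies, are assumptions; `deleteOtherRows` is not modeled.

```python
import json


def apply_category_map(values, category_map, keep_others=True, other=None):
    """Map each value through category_map, handling non-mapped values."""
    out = []
    for v in values:
        if v in category_map:
            out.append(category_map[v])
        elif other is not None:
            out.append(other)  # assumption: `other` wins over `keepOthers`
        elif keep_others:
            out.append(v)      # non-mapped value stays the same
        else:
            out.append(None)   # assumption: nothing specified, leave empty
    return out


mapping = {"United States of America": "1", "Canada": "2"}
category_map_param = json.dumps(mapping)  # string form used for `categoryMap`
print(category_map_param)
print(apply_category_map(["Canada", "Mexico"], mapping))             # ['2', 'Mexico']
print(apply_category_map(["Canada", "Mexico"], mapping, other="0"))  # ['2', '0']
```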

# ONE\_HOT\_ENCODING
<a name="recipe-actions.ONE_HOT_ENCODING"></a>

Creates *n* numerical columns, where *n* is the number of unique values in a selected categorical variable.

For example, consider a column named `shirt_size`. Shirts are available in small, medium, large, or extra large. The column data might look like the following.

```
shirt_size
-----------
L
XL
M
S
M
M
S
XL
M
L
XL
M
```

In this scenario, there are four distinct values for `shirt_size`. Therefore, `ONE_HOT_ENCODING` generates four new columns. Each new column is named `shirt_size_x`, where `x` represents a distinct `shirt_size` value.

The results of `shirt_size` and the four generated columns look like this.

```
shirt_size     shirt_size_S    shirt_size_M    shirt_size_L    shirt_size_XL
----------     ------------    ------------    ------------    -------------
L              0               0               1               0
XL             0               0               0               1
M              0               1               0               0
S              1               0               0               0
M              0               1               0               0
M              0               1               0               0
S              1               0               0               0
XL             0               0               0               1
M              0               1               0               0
L              0               0               1               0
XL             0               0               0               1
M              0               1               0               0
```

The column that you specify for `ONE_HOT_ENCODING` can have a maximum of ten (10) distinct values.

**Parameters**
+ `sourceColumn` – The name of an existing column. The column can have a maximum of 10 distinct values.

**Example**  
  

```
{
    "Action": {
        "Operation": "ONE_HOT_ENCODING",
        "Parameters": {
            "sourceColumn": "shirt_size"
        }
    }
}
```
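
For readers who want to see the encoding spelled out, here is a small Python sketch of the same idea. The function name `one_hot_encode` is hypothetical, and ordering the generated columns by first appearance is an assumption; the limit of 10 distinct values comes from the description above.

```python
def one_hot_encode(column_name, values, max_distinct=10):
    """Return a dict of new column name -> list of 0/1 indicators."""
    # dict.fromkeys keeps the first-seen order while removing duplicates.
    distinct = list(dict.fromkeys(values))
    if len(distinct) > max_distinct:
        raise ValueError(f"{column_name} has more than {max_distinct} distinct values")
    return {
        f"{column_name}_{d}": [int(v == d) for v in values]
        for d in distinct
    }


sizes = ["L", "XL", "M", "S", "M"]
encoded = one_hot_encode("shirt_size", sizes)
print(list(encoded))            # ['shirt_size_L', 'shirt_size_XL', 'shirt_size_M', 'shirt_size_S']
print(encoded["shirt_size_M"])  # [0, 0, 1, 0, 1]
```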

## SCALE
<a name="recipe-actions.SCALE"></a>

Scales or normalizes the range of data in a numeric column.

**Parameters**
+ `sourceColumn` – The name of an existing column.
+ `strategy` – The operation to be applied to the column values:
  + `MIN_MAX` – Rescales the values into a range of [0,1].
  + `SCALE_BETWEEN` – Rescales the values into a range between two specified values.
  + `MEAN_NORMALIZATION` – Rescales the data to have a mean (μ) of 0 and standard deviation (σ) of 1 within a range of [-1, 1].
  + `Z_SCORE` – Linearly scales data values to have a mean (μ) of 0 and standard deviation (σ) of 1. Best for handling outliers.
+ `targetColumn` – The name of a column to contain the results.

**Example**  
  

```
{
    "Action": {
        "Operation": "NORMALIZATION",
        "Parameters": {
            "sourceColumn": "all_votes",
            "strategy": "MIN_MAX",
            "targetColumn": "all_votes_normalized"
        }
    }
}
```
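
As a rough illustration of two of the strategies listed above, the following Python sketch applies the textbook min-max and z-score formulas. The function names are hypothetical, and using the population standard deviation for the z-score is an assumption; the service might compute it differently.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant column")
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift and scale values to a mean of 0 and a standard deviation of 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant column")
    return [(v - mu) / sigma for v in values]


all_votes = [10, 20, 30]
print(min_max(all_votes))  # [0.0, 0.5, 1.0]
print(z_score(all_votes))
```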

# SKEWNESS
<a name="recipe-actions.SKEWNESS"></a>

Applies transformations on your data values to change the distribution shape and its skew.

**Parameters**
+ `sourceColumn` – The name of an existing column.
+ `targetColumn` – The name of the new column to be created.
+ `skewFunction` – The transformation to apply:
  + `ROOT` – Extracts the root of each value. The root can be provided in the `value` parameter.
  + `LOG` – Takes the logarithm of each value. The log base can be provided in the `value` parameter.
  + `SQUARE` – Squares each value.
+ `value` – The argument of the `skewFunction`.

**Example**  
  

```
{
    "Action": {
        "Operation": "SKEWNESS",
        "Parameters": {
            "sourceColumn": "level",
            "targetColumn": "bin",
            "skewFunction": "LOG",
            "value": "2.718281828"
        }
    }
}
```
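
The three skew functions can be sketched in Python as follows. The function name `apply_skew` is hypothetical, and the sketch assumes that `value` is always supplied for `ROOT` and `LOG`; it doesn't try to reproduce any default the service might apply. `LOG` also requires positive inputs.

```python
import math


def apply_skew(values, skew_function, value=None):
    """Apply a ROOT, LOG, or SQUARE transformation to each value."""
    if skew_function == "ROOT":
        return [v ** (1.0 / value) for v in values]   # value-th root
    if skew_function == "LOG":
        return [math.log(v, value) for v in values]   # logarithm with base `value`
    if skew_function == "SQUARE":
        return [v ** 2 for v in values]
    raise ValueError(f"unknown skewFunction: {skew_function}")


levels = [1, 2, 3]
print(apply_skew(levels, "SQUARE"))            # [1, 4, 9]
print(apply_skew(levels, "LOG", 2.718281828))  # approximately the natural log
```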

# TOKENIZATION
<a name="recipe-actions.TOKENIZATION"></a>

Splits text into smaller units, or tokens, such as individual words or terms.

**Parameters**
+ `sourceColumn` – The name of an existing column.
+ `delimiter` – A custom delimiter that appears between tokenized words. (The default behavior is to separate each token by a space.)
+ `expandContractions` – If `ENABLED`, expands contracted words. For example: "don't" becomes "do not".
+ `stemmingMode` – Reduces each token to its word stem. Two stemming modes are available: `PORTER` and `LANCASTER`.
+ `stopWordRemovalMode` – Removes common words like a, an, the, and more.
+ `customStopWords` – For `stopWordRemovalMode`, allows you to specify a custom list of stop words.
+ `targetColumn` – The name of a column to contain the results.

**Example**  
  

```
{
    "Action": {
        "Operation": "TOKENIZATION",
        "Parameters": {
            "customStopWords": "[]",
            "delimiter": "- ",
            "expandContractions": "ENABLED",
            "sourceColumn": "dimensions",
            "stemmingMode": "PORTER",
            "stopWordRemovalMode": "DEFAULT",
            "targetColumn": "dimensions_tokenized"
        }
    }
}
```
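
The Python sketch below gives a simplified picture of how the delimiter, contraction expansion, and stop word removal could fit together. Everything in it is illustrative: the function name `tokenize`, the tiny contraction table, the stop word set, the lowercasing, and the regular expression used to find tokens are all assumptions, and stemming is left out.

```python
import re

# Tiny illustrative tables; the service's actual lists are not documented here.
CONTRACTIONS = {"don't": "do not", "it's": "it is"}
STOP_WORDS = {"a", "an", "the"}


def tokenize(text, delimiter=" ", expand_contractions=False, stop_words=frozenset()):
    """Split text into tokens and join them with the given delimiter."""
    text = text.lower()  # assumption: tokens are lowercased
    if expand_contractions:
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)
    tokens = re.findall(r"[a-z0-9']+", text)
    tokens = [t for t in tokens if t not in stop_words]
    return delimiter.join(tokens)


print(tokenize("Don't fold the box", delimiter="- ",
               expand_contractions=True, stop_words=STOP_WORDS))
# do- not- fold- box
```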