

We are no longer updating the Amazon Machine Learning service or accepting new users for it. This documentation is available for existing users, but we are no longer updating it. For more information, see [ What is Amazon Machine Learning](https://docs.aws.amazon.com/machine-learning/latest/dg/what-is-amazon-machine-learning.html).

# Data Transformations for Machine Learning


 Machine learning models are only as good as the data that is used to train them. A key characteristic of good training data is that it is provided in a way that is optimized for learning and generalization. The process of putting together the data in this optimal format is known in the industry as *feature transformation*. 

**Topics**
+ [Importance of Feature Transformation](importance-of-feature-transformation.md)
+ [Feature Transformations with Data Recipes](feature-transformations-with-data-recipes.md)
+ [Recipe Format Reference](recipe-format-reference.md)
+ [Suggested Recipes](suggested-recipes.md)
+ [Data Transformations Reference](data-transformations-reference.md)
+ [Data Rearrangement](data-rearrangement.md)

# Importance of Feature Transformation


 Consider a machine learning model whose task is to decide whether a credit card transaction is fraudulent or not. Based on your application background knowledge and data analysis, you might decide which data fields (or features) are important to include in the input data. For example, transaction amount, merchant name, address, and credit card owner's address are important to provide to the learning process. On the other hand, a randomly generated transaction ID carries no information (if we know that it really is random), and is not useful. 

 Once you have decided on which fields to include, you transform these features to help the learning process. Transformations add background experience to the input data, enabling the machine learning model to benefit from this experience. For example, the following merchant address is represented as a string: 

 "123 Main Street, Seattle, WA 98101" 

 By itself, the address has limited expressive power – it is useful only for learning patterns associated with that exact address. Breaking it up into constituent parts, however, can create additional features like "Address" (123 Main Street), "City" (Seattle), "State" (WA) and "Zip" (98101). Now, the learning algorithm can group more disparate transactions together, and discover broader patterns – perhaps some merchant zip codes experience more fraudulent activity than others. 

 For more information about the feature transformation approach and process, see [Machine Learning Concepts](https://docs.aws.amazon.com/machine-learning/latest/dg/machine-learning-concepts.html). 

# Feature Transformations with Data Recipes


 There are two ways to transform features before creating ML models with Amazon ML: you can transform your input data before providing it to Amazon ML, or you can use the built-in data transformations of Amazon ML. Built-in transformations are applied through Amazon ML recipes, which are pre-formatted instructions for common transformations. With recipes, you can do the following: 
+  Choose from a list of built-in common machine learning transformations, and apply these to individual variables or groups of variables 
+  Select which of the input variables and transformations are made available to the machine learning process 

 Using Amazon ML recipes offers several advantages. Amazon ML performs the data transformations for you, so you do not need to implement them yourself. In addition, they are fast because Amazon ML applies the transformations while reading input data, and provides results to the learning process without the intermediate step of saving results to disk. 

# Recipe Format Reference


Amazon ML recipes contain instructions for transforming your data as a part of the machine learning process. Recipes are defined using a JSON-like syntax, but they have additional restrictions beyond the normal JSON restrictions. Recipes have the following sections, which must appear in the order shown here:
+  **Groups** enable grouping of multiple variables, for ease of applying transformations. For example, you can create a group of all variables having to do with free-text parts of a web page (title, body), and then perform a transformation on all of these parts at once. 
+  **Assignments** enable the creation of intermediate named variables that can be reused in processing. 
+  **Outputs** define which variables will be used in the learning process, and what transformations (if any) apply to these variables. 

## Groups


 You can define groups of variables in order to collectively transform all variables within the groups, or to use these variables for machine learning without transforming them. By default, Amazon ML creates the following groups for you:

 **ALL\_TEXT, ALL\_NUMERIC, ALL\_CATEGORICAL, ALL\_BINARY** – Type-specific groups based on variables defined in the datasource schema. 

**Note**  
You cannot create a group with `ALL_INPUTS`.

 These variables can be used in the outputs section of your recipe without being defined. You can also create custom groups by adding variables to or removing variables from existing groups, or directly from a collection of variables. The following example demonstrates the syntax for the grouping assignment: 

```
"groups": {

"Custom_Group": "group(var1, var2)",
"All_Categorical_plus_one_other": "group(ALL_CATEGORICAL, var2)"

}
```

 Group names need to start with an alphabetical character and can be between 1 and 64 characters long. If the group name does not start with an alphabetical character or if it contains special characters (, ' " \t \r \n ( ) \), then the name needs to be quoted to be included in the recipe. 

## Assignments


 You can assign one or more transformations to an intermediate variable, for convenience and readability. For example, if you have a text variable named email\_subject, and you apply the lowercase transformation to it, you can name the resulting variable email\_subject\_lowercase, making it easy to keep track of it elsewhere in the recipe. Assignments can also be chained, enabling you to apply multiple transformations in a specified order. The following example shows single and chained assignments in recipe syntax: 

```
"assignments": {

"email_subject_lowercase": "lowercase(email_subject)",

"email_subject_lowercase_ngram":"ngram(lowercase(email_subject), 2)"

}
```

 Intermediate variable names need to start with an alphabetical character and can be between 1 and 64 characters long. If the name does not start with an alphabetical character or if it contains special characters (, ' " \t \r \n ( ) \), then the name needs to be quoted to be included in the recipe. 

## Outputs


 The outputs section controls which input variables will be used for the learning process, and which transformations apply to them. An empty or missing outputs section is an error, because no data will be passed to the learning process. 

 The simplest outputs section simply includes the predefined **ALL\_INPUTS** group, instructing Amazon ML to use all of the variables defined in the datasource for learning: 

```
"outputs": [

"ALL_INPUTS"

]
```

 The output section can also refer to the other predefined groups by instructing Amazon ML to use all the variables in these groups: 

```
"outputs": [

"ALL_NUMERIC",

"ALL_CATEGORICAL"

]
```

 The output section can also refer to custom groups. In the following example, only one of the custom groups defined in the grouping assignments section in the preceding example will be used for machine learning. All other variables will be dropped: 

```
"outputs": [

"All_Categorical_plus_one_other"

]
```

The outputs section can also refer to variable assignments defined in the assignment section:

```
"outputs": [

"email_subject_lowercase"

]
```

 And input variables or transformations can be defined directly in the outputs section: 

```
"outputs": [

"var1",

"lowercase(var2)"

]
```

 Output needs to explicitly specify all variables and transformed variables that are expected to be available to the learning process. Say, for example, that you include in the output a Cartesian product of var1 and var2. If you would like to include both the raw variables var1 and var2 as well, then you need to add the raw variables in the output section: 

```
"outputs": [

"cartesian(var1,var2)",

"var1",

"var2"

]
```

 Outputs can include comments for readability by adding the comment text along with the variable: 

```
"outputs": [

"quantile_bin(age, 10) //quantile bin age",

"age // explicitly include the original numeric variable along with the binned version"

]
```

 You can mix and match all of these approaches within the outputs section. 

**Note**  
Comments are not allowed in the Amazon ML console when adding a recipe.

## Complete Recipe Example


 The following example refers to several built-in data processors that were introduced in preceding examples: 

```
{

"groups": {

"LONGTEXT": "group_remove(ALL_TEXT, title, subject)",

"SPECIALTEXT": "group(title, subject)",

"BINCAT": "group(ALL_CATEGORICAL, ALL_BINARY)"

},

"assignments": {

"binned_age" : "quantile_bin(age,30)",

"country_gender_interaction" : "cartesian(country, gender)"

},

"outputs": [

"lowercase(no_punct(LONGTEXT))",

"ngram(lowercase(no_punct(SPECIALTEXT)),3)",

"quantile_bin(hours-per-week, 10)",

"hours-per-week // explicitly include the original numeric variable along with the binned version",

"cartesian(binned_age, quantile_bin(hours-per-week,10)) // this one is critical",

"country_gender_interaction",

"BINCAT"

]

}
```

# Suggested Recipes


 When you create a new datasource in Amazon ML and statistics are computed for that datasource, Amazon ML will also create a suggested recipe that can be used to create a new ML model from the datasource. The suggested recipe is based on the data and target attribute present in the data, and provides a useful starting point for creating and fine-tuning your ML models. 

 To use the suggested recipe on the Amazon ML console, choose **Datasource** or **Datasource and ML model** from the **Create new** drop-down list. For ML model settings, you will have a choice of Default or Custom Training and Evaluation settings in the **ML Model Settings** step of the **Create ML Model** wizard. If you pick the Default option, Amazon ML automatically uses the suggested recipe. If you pick the Custom option, the recipe editor in the next step displays the suggested recipe, and you can verify or modify it as needed. 

**Note**  
 Amazon ML allows you to create a datasource and then immediately use it to create an ML model, before statistics computation is completed. In this case, you will not be able to see the suggested recipe in the Custom option, but you will still be able to proceed past that step and have Amazon ML use the default recipe for model training. 

 To use the suggested recipe with the Amazon ML API, you can pass an empty string in both Recipe and RecipeUri API parameters. It is not possible to retrieve the suggested recipe using the Amazon ML API. 

# Data Transformations Reference


**Topics**
+ [N-gram Transformation](#n-gram-transformation)
+ [Orthogonal Sparse Bigram (OSB) Transformation](#orthogonal-sparse-bigram-osb-transformation)
+ [Lowercase Transformation](#lowercase-transformation)
+ [Remove Punctuation Transformation](#remove-punctuation-transformation)
+ [Quantile Binning Transformation](#quantile-binning-transformation)
+ [Normalization Transformation](#normalization-transformation)
+ [Cartesian Product Transformation](#cartesian-product-transformation)

## N-gram Transformation


 The n-gram transformation takes a text variable as input and produces strings by sliding a window of n words over the text, where n is user-configurable. For example, consider the text string "I really enjoyed reading this book". 

 Specifying the n-gram transformation with window size = 1 simply gives you all the individual words in that string: 

```
{"I", "really", "enjoyed", "reading", "this", "book"}
```

 Specifying the n-gram transformation with window size = 2 gives you all the two-word combinations as well as the one-word combinations: 

```
{"I really", "really enjoyed", "enjoyed reading", "reading this", "this
book", "I", "really", "enjoyed", "reading", "this", "book"}
```

 Specifying the n-gram transformation with window size = 3 will add the three-word combinations to this list, yielding the following: 

```
{"I really enjoyed", "really enjoyed reading", "enjoyed reading this",
"reading this book", "I really", "really enjoyed", "enjoyed reading",
"reading this", "this book", "I", "really", "enjoyed", "reading",
"this", "book"}
```

 You can request n-grams with a size ranging from 2 to 10 words. N-grams with size 1 are generated implicitly for all inputs whose type is marked as text in the data schema, so you do not have to ask for them. Finally, keep in mind that n-grams are generated by breaking the input data on whitespace characters. That means that, for example, punctuation characters will be considered a part of the word tokens: generating n-grams with a window of 2 for the string "red, green, blue" will yield {"red,", "green,", "blue", "red, green,", "green, blue"}. You can use the punctuation remover processor (described later in this document) to remove the punctuation symbols if this is not what you want. 

 To compute n-grams of window size 3 for variable var1: 

```
"ngram(var1, 3)"
```
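The sliding-window behavior described above can be sketched in a few lines of Python. This is an illustration of the concept only, not Amazon ML's implementation:

```python
def ngrams(text, size):
    """Return all n-grams of length 1 through `size`, longest first,
    by sliding a window over the whitespace-separated tokens."""
    tokens = text.split()
    grams = []
    for n in range(size, 0, -1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

ngrams("I really enjoyed reading this book", 2)
# the two-word combinations followed by the individual words
```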

## Orthogonal Sparse Bigram (OSB) Transformation


 The OSB transformation is intended to aid in text string analysis and is an alternative to the bi-gram transformation (n-gram with window size 2). OSBs are generated by sliding the window of size n over the text, and outputting every pair of words that includes the first word in the window. 

 To build each OSB, its constituent words are joined by the "_" (underscore) character, and every skipped token is indicated by adding another underscore into the OSB. Thus, the OSB encodes not just the tokens seen within a window, but also an indication of the number of tokens skipped within that same window. 

 To illustrate, consider the string "The quick brown fox jumps over the lazy dog", and OSBs of size 4. The six four-word windows, and the last two shorter windows from the end of the string, are shown in the following example, as well as the OSBs generated from each: 

 Window, {OSBs generated} 

```
"The quick brown fox", {The_quick, The__brown, The___fox}

"quick brown fox jumps", {quick_brown, quick__fox, quick___jumps}

"brown fox jumps over", {brown_fox, brown__jumps, brown___over}

"fox jumps over the", {fox_jumps, fox__over, fox___the}

"jumps over the lazy", {jumps_over, jumps__the, jumps___lazy}

"over the lazy dog", {over_the, over__lazy, over___dog}

"the lazy dog", {the_lazy, the__dog}

"lazy dog", {lazy_dog}
```

 Orthogonal sparse bigrams are an alternative to n-grams that might work better in some situations. If your data has large text fields (10 or more words), experiment to see which works better. Note that what constitutes a large text field may vary depending on the situation. However, with larger text fields, OSBs have been empirically shown to uniquely represent the text due to the special *skip* symbol (the underscore). 

 You can request a window size of 2 to 10 for OSB transformations on input text variables. 

 To compute OSBs with window size 5 for variable var1: 

 "osb(var1, 5)" 
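The window-and-skip construction above can be sketched in Python. This is illustrative only; Amazon ML's internals may differ:

```python
def osbs(text, size):
    """Pair the first word of each sliding window with every later word
    in that window, adding one extra underscore per skipped token."""
    tokens = text.split()
    out = []
    for i in range(len(tokens) - 1):
        window = tokens[i:i + size]
        for skip, word in enumerate(window[1:]):
            out.append(window[0] + "_" * (skip + 1) + word)
    return out

osbs("The quick brown fox jumps over the lazy dog", 4)[:3]
# ['The_quick', 'The__brown', 'The___fox']
```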

## Lowercase Transformation


 The lowercase transformation processor converts text inputs to lowercase. For example, given the input "The Quick Brown Fox Jumps Over the Lazy Dog", the processor will output "the quick brown fox jumps over the lazy dog". 

 To apply lowercase transformation to the variable var1: 

 "lowercase(var1)" 

## Remove Punctuation Transformation


 Amazon ML implicitly splits inputs marked as text in the data schema on whitespace. Punctuation in the string ends up either adjoining word tokens, or as separate tokens entirely, depending on the whitespace surrounding it. If this is undesirable, the punctuation remover transformation may be used to remove punctuation symbols from generated features. For example, given the string "Welcome to Amazon ML - please fasten your seat-belts!", the following set of tokens is implicitly generated: 

```
{"Welcome", "to", "Amazon", "ML", "-", "please", "fasten", "your", "seat-belts!"}
```

 Applying the punctuation remover processor to this string results in this set: 

```
{"Welcome", "to", "Amazon", "ML", "please", "fasten", "your", "seat-belts"}
```

 Note that only the prefix and suffix punctuation marks are removed. Punctuation marks that appear in the middle of a token (for example, the hyphen in "seat-belts") are not removed. 

 To apply punctuation removal to the variable var1: 

 "no_punct(var1)" 
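The prefix/suffix stripping can be approximated with the Python standard library. This sketches the behavior described above; the exact character set Amazon ML treats as punctuation is an assumption here:

```python
import string

def no_punct(text):
    """Strip leading and trailing punctuation from each whitespace token;
    tokens that were pure punctuation (like a lone "-") disappear."""
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in stripped if tok]

no_punct("Welcome to Amazon ML - please fasten your seat-belts!")
# ['Welcome', 'to', 'Amazon', 'ML', 'please', 'fasten', 'your', 'seat-belts']
```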

## Quantile Binning Transformation


 The quantile binning processor takes two inputs, a numerical variable and a parameter called *bin number*, and outputs a categorical variable. The purpose is to discover non-linearity in the variable's distribution by grouping observed values together. 

 In many cases, the relationship between a numeric variable and the target is not linear (the numeric variable value does not increase or decrease monotonically with the target). In such cases, it might be useful to bin the numeric feature into a categorical feature representing different ranges of the numeric feature. Each categorical feature value (bin) can then be modeled as having its own linear relationship with the target. For example, let's say you know that the continuous numeric feature *account\_age* is not linearly correlated with likelihood to purchase a book. You can bin *account\_age* into categorical features that might be able to capture the relationship with the target more accurately. 

 The quantile binning processor can be used to instruct Amazon ML to establish n bins of equal size based on the distribution of all input values of the age variable, and then to substitute each number with a text token containing the bin. The optimum number of bins for a numeric variable is dependent on characteristics of the variable and its relationship to the target, and this is best determined through experimentation. Amazon ML suggests the optimal bin number for a numeric feature based on data statistics in the [Suggested Recipe](https://docs.aws.amazon.com/machine-learning/latest/dg/suggested-recipes.html). 

 You can request between 5 and 1000 quantile bins to be computed for any numeric input variable. 

 The following example shows how to compute and use 50 bins in place of numeric variable var1: 

 "quantile_bin(var1, 50)" 
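Equal-frequency binning of this kind can be sketched as follows. This is illustrative only; Amazon ML's bin boundaries and token format are internal, and the `bin_N` labels are an assumption:

```python
def quantile_bin(values, bins):
    """Replace each number with a categorical token such that roughly
    equal numbers of observations fall into each bin."""
    ranked = sorted(values)
    n = len(ranked)

    def bin_index(v):
        # rank-based bin assignment; ties map to the lowest matching rank
        return min(bins - 1, ranked.index(v) * bins // n)

    return ["bin_%d" % bin_index(v) for v in values]

quantile_bin([10, 1, 5], 3)
# ['bin_2', 'bin_0', 'bin_1']
```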

## Normalization Transformation


 The normalization transformer normalizes numeric variables to have a mean of zero and variance of one. Normalization of numeric variables can help the learning process if there are very large range differences between numeric variables, because variables with the highest magnitude could dominate the ML model, regardless of whether the feature is informative with respect to the target. 

 To apply this transformation to numeric variable var1, add this to the recipe: 

 normalize(var1) 

 This transformer can also take a user-defined group of numeric variables or the pre-defined group for all numeric variables (ALL\_NUMERIC) as input: 

 normalize(ALL\_NUMERIC) 

 **Note** 

 It is *not* mandatory to use the normalization processor for numeric variables. 
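The z-score computation this transformer performs can be sketched as follows (population variance assumed; the exact estimator Amazon ML uses is not documented here):

```python
def normalize(values):
    """Rescale a numeric column to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant columns
    return [(v - mean) / std for v in values]
```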

## Cartesian Product Transformation


 The Cartesian transformation generates permutations of two or more text or categorical input variables. This transformation is used when an interaction between variables is suspected. For example, consider the bank marketing dataset that is used in Tutorial: Using Amazon ML to Predict Responses to a Marketing Offer. Using this dataset, we would like to predict whether a person would respond positively to a bank promotion, based on the economic and demographic information. We might suspect that the person's job type is somewhat important (perhaps there is a correlation between being employed in certain fields and having the money available), and the highest level of education attained is also important. We might also have a deeper intuition that there is a strong signal in the interaction of these two variables—for example, that the promotion is particularly well-suited to customers who are entrepreneurs who earned a university degree. 

 The Cartesian product transformation takes categorical variables or text as input, and produces new features that capture the interaction between these input variables. Specifically, for each training example, it will create a combination of features, and add them as a standalone feature. For example, let's say our simplified input rows look like this: 

```
target, education, job
0, university.degree, technician
0, high.school, services
1, university.degree, admin
```

 If we specify that the Cartesian transformation is to be applied to the categorical variables education and job, the resultant feature education\_job\_interaction will look like this: 

```
target, education_job_interaction
0, university.degree_technician
0, high.school_services
1, university.degree_admin
```

 The Cartesian transformation is even more powerful when it comes to working on sequences of tokens, as is the case when one of its arguments is a text variable that is implicitly or explicitly split into tokens. For example, consider the task of classifying a book as being a textbook or not. Intuitively, we might think that there is something about the book's title that can tell us it is a textbook (certain words might occur more frequently in textbooks' titles), and we might also think that there is something about the book's binding that is predictive (textbooks are more likely to be hardcover), but it's really the combination of some words in the title and binding that is most predictive. For a real-world example, the following table shows the results of applying the Cartesian processor to the input variables binding and title: 


|  Textbook  |  Title  |  Binding  |  Cartesian product of no\_punct(Title) and Binding  | 
| --- | --- | --- | --- | 
|  1  |  Economics: Principles, Problems, Policies  |  Hardcover  |  {"Economics\_Hardcover", "Principles\_Hardcover", "Problems\_Hardcover", "Policies\_Hardcover"}  | 
|  0  |  The Invisible Heart: An Economics Romance  |  Softcover  |  {"The\_Softcover", "Invisible\_Softcover", "Heart\_Softcover", "An\_Softcover", "Economics\_Softcover", "Romance\_Softcover"}  | 
|  0  |  Fun With Problems  |  Softcover  |  {"Fun\_Softcover", "With\_Softcover", "Problems\_Softcover"}  | 

 The following example shows how to apply the Cartesian transformer to var1 and var2: 

 cartesian(var1, var2) 
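The token-pairing behavior shown in the table can be sketched in Python (illustrative only; assumes whitespace-tokenized inputs):

```python
def cartesian(feature_a, feature_b):
    """Join every token of one feature with every token of the other,
    underscore-separated, to form interaction features for a row."""
    return [a + "_" + b
            for a in feature_a.split()
            for b in feature_b.split()]

cartesian("Fun With Problems", "Softcover")
# ['Fun_Softcover', 'With_Softcover', 'Problems_Softcover']
```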

# Data Rearrangement


 The data rearrangement functionality enables you to create a datasource that is based on only a portion of the input data that it points to. For example, when you create an ML Model using the **Create ML Model** wizard in the Amazon ML console, and choose the default evaluation option, Amazon ML automatically reserves 30% of your data for ML model evaluation, and uses the other 70% for training. This functionality is enabled by the Data Rearrangement feature of Amazon ML. 

 If you are using the Amazon ML API to create datasources, you can specify which part of the input data a new datasource will be based on. You do this by passing instructions in the `DataRearrangement` parameter to the `CreateDataSourceFromS3`, `CreateDataSourceFromRedshift`, or `CreateDataSourceFromRDS` APIs. The contents of the `DataRearrangement` string are a JSON string containing the beginning and end locations of your data, expressed as percentages, a complement flag, and a splitting strategy. For example, the following `DataRearrangement` string specifies that the first 70% of the data will be used to create the datasource: 

```
{
    "splitting": {
        "percentBegin": 0,
        "percentEnd": 70,
        "complement": false,
        "strategy": "sequential"
    }
}
```
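The effect of `percentBegin`, `percentEnd`, and the `complement` flag under the sequential strategy can be sketched in Python (illustrative only, not Amazon ML's implementation):

```python
def sequential_split(rows, percent_begin, percent_end, complement=False):
    """Keep the rows whose position falls in the [percent_begin, percent_end)
    slice of the file; with complement=True, keep the other rows instead."""
    n = len(rows)
    lo = n * percent_begin // 100
    hi = n * percent_end // 100
    if complement:
        return rows[:lo] + rows[hi:]
    return rows[lo:hi]

sequential_split(list(range(10)), 0, 70)
# [0, 1, 2, 3, 4, 5, 6]
```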

## DataRearrangement Parameters


To change how Amazon ML creates a datasource, use the following parameters.

**PercentBegin (Optional)**  
Use `percentBegin` to indicate where the data for the datasource starts. If you do not include `percentBegin` and `percentEnd`, Amazon ML includes all of the data when creating the datasource.  
Valid values are `0` to `100`, inclusive.

**PercentEnd (Optional)**  
Use `percentEnd` to indicate where the data for the datasource ends. If you do not include `percentBegin` and `percentEnd`, Amazon ML includes all of the data when creating the datasource.  
Valid values are `0` to `100`, inclusive.

**Complement (Optional)**  
The `complement` parameter tells Amazon ML to use the data that is not included in the range of `percentBegin` to `percentEnd` to create a datasource. The `complement` parameter is useful if you need to create complementary datasources for training and evaluation. To create a complementary datasource, use the same values for `percentBegin` and `percentEnd`, along with the `complement` parameter.  
For example, the following two datasources do not share any data, and can be used to train and evaluate a model. The first datasource has 25 percent of the data, and the second one has 75 percent of the data.  
Datasource for evaluation:  

```
{
    "splitting":{
        "percentBegin":0, 
        "percentEnd":25
    }
}
```
Datasource for training:  

```
{
    "splitting":{
        "percentBegin":0,
        "percentEnd":25,
        "complement":true
    }
}
```
Valid values are `true` and `false`.

**Strategy (Optional)**  
To change how Amazon ML splits the data for a datasource, use the `strategy` parameter.  
The default value for the `strategy` parameter is `sequential`, meaning that Amazon ML takes all of the data records between the `percentBegin` and `percentEnd` parameters for the datasource, in the order that the records appear in the input data.  
The following two `DataRearrangement` lines are examples of sequentially ordered training and evaluation datasources:  
Datasource for evaluation: `{"splitting":{"percentBegin":70, "percentEnd":100, "strategy":"sequential"}}`  
Datasource for training: `{"splitting":{"percentBegin":70, "percentEnd":100, "strategy":"sequential", "complement":true}}`  
To create a datasource from a random selection of the data, set the `strategy` parameter to `random` and provide a string that is used as the seed value for the random data splitting (for example, you can use the S3 path to your data as the random seed string). If you choose the random split strategy, Amazon ML assigns each row of data a pseudo-random number, and then selects the rows that have an assigned number between `percentBegin` and `percentEnd`. Pseudo-random numbers are assigned using the byte offset as a seed, so changing the data results in a different split. Any existing ordering is preserved. The random splitting strategy ensures that variables in the training and evaluation data are distributed similarly. It is useful in the cases where the input data may have an implicit sort order, which would otherwise result in training and evaluation datasources containing non-similar data records.  
The following two `DataRearrangement` lines are examples of non-sequentially ordered training and evaluation datasources:  
Datasource for evaluation:  

```
{
    "splitting":{
        "percentBegin":70, 
        "percentEnd":100, 
        "strategy":"random", 
        "strategyParams": {
            "randomSeed":"RANDOMSEED"
        }
    }
}
```
Datasource for training:  

```
{
    "splitting":{
        "percentBegin":70,
        "percentEnd":100,
        "strategy":"random",
        "strategyParams": {
            "randomSeed":"RANDOMSEED"
        },
        "complement":true
    }
}
```
Valid values are `sequential` and `random`.

**RandomSeed (Optional)**  
Amazon ML uses the `randomSeed` to split the data. The default seed for the API is an empty string. To specify a seed for the random split strategy, pass in a string. For more information about random seeds, see [Randomly Splitting Your Data](splitting-types.md#random-splitting) in the *Amazon Machine Learning Developer Guide*.
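The row-scoring idea behind the random strategy can be sketched as follows. This only illustrates the deterministic-seed concept; Amazon ML's actual hash function and its use of the byte offset are internal, and the MD5-based scoring here is an assumption:

```python
import hashlib

def random_split(rows, percent_begin, percent_end, seed=""):
    """Deterministically assign each row a pseudo-random score in [0, 100)
    from the seed and row position, keep rows scoring inside the range,
    and preserve the original row order."""
    def score(position):
        digest = hashlib.md5((seed + str(position)).encode()).hexdigest()
        return int(digest, 16) % 100

    return [row for position, row in enumerate(rows)
            if percent_begin <= score(position) < percent_end]
```

Because the scores are derived from the seed rather than drawn fresh each call, splitting the same rows with the same seed into the 0–70 and 70–100 ranges always yields two complementary, non-overlapping datasources.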

For sample code that demonstrates how to use cross-validation with Amazon ML, go to [GitHub Machine Learning Samples](https://github.com/awslabs/machine-learning-samples).