

# Understanding the ML algorithm used by Amazon Quick Sight


**Note**  
You don't need any technical experience in machine learning to use the ML-powered features in Amazon Quick Sight. This section dives into the technical aspects of the algorithm for those who want the details of how it works. This information isn't required reading to use the features.

Amazon Quick Sight uses a built-in version of the Random Cut Forest (RCF) algorithm. The following sections explain what that means and how it is used in Amazon Quick Sight.

First, let's look at some of the terminology involved: 
+ Anomaly – Something that is characterized by its difference from the majority of the other things in the same sample. Also known as an outlier, an exception, a deviation, and so on.
+ Data point – A discrete unit—or simply put, a row—in a dataset. However, a row can have multiple data points if you use a measure over different dimensions.
+ Decision tree – A way of visualizing the decision process of the algorithm as it evaluates patterns in the data.
+ Forecast – A prediction of future behavior based on current and past behavior.
+ Model – A mathematical representation of the algorithm or what the algorithm learns.
+ Seasonality – The repeating patterns of behavior that occur cyclically in time series data.
+ Time series – An ordered set of date or time data in one field or column.
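To ground these terms, the following minimal Python sketch builds a time series with weekly seasonality and one injected anomaly. The metric, values, and anomaly location are invented for illustration:

```python
import math
import random

random.seed(7)

# Hypothetical daily metric: each (day, value) pair is one data point, the
# repeating weekly cycle is the seasonality, and day 45 is an injected
# anomaly (outlier) that stands apart from the rest of the sample.
series = []
for day in range(90):
    seasonal = 10 * math.sin(2 * math.pi * day / 7)  # weekly pattern
    noise = random.gauss(0, 1)
    value = 100 + seasonal + noise
    if day == 45:
        value += 40  # the anomaly
    series.append((day, value))

print(series[45])
```

Every other value stays within the seasonal band around 100, so the day-45 point is the only one that qualifies as an anomaly here.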

**Topics**
+ [What's the difference between anomaly detection and forecasting?](difference-between-anomaly-detection-and-forecasting.md)
+ [What is RCF?](what-is-random-cut-forest.md)
+ [How RCF is applied to detect anomalies](how-does-rcf-detect-anomalies.md)
+ [How RCF is applied to generate forecasts](how-does-rcf-generate-forecasts.md)
+ [References for machine learning and RCF](learn-more-about-machine-learning-and-rcf.md)

# What's the difference between anomaly detection and forecasting?


Anomaly detection identifies outliers and their contributing drivers to answer the question "What happened that doesn't usually happen?" Forecasting answers the question "If everything continues to happen as expected, what happens in the future?" The math that allows forecasting also enables us to ask "If a few things change, what happens then?" 

Both anomaly detection and forecasting begin by examining the current known data points. Amazon Quick Sight anomaly detection begins with what is known so it can establish what is outside the known set, and identify those data points as anomalous (outliers). Amazon Quick Sight forecasting excludes the anomalous data points, and sticks with the known pattern. Forecasting focuses on the established pattern of data distribution. In contrast, anomaly detection focuses on the data points that deviate from what is expected. Each method approaches decision-making from a different direction. 

# What is RCF?


A *random cut forest* (RCF) is a special type of *random forest* (RF) algorithm, a widely used and successful technique in machine learning. It takes a set of random data points, cuts them down to the same number of points, and then builds a collection of models. In this case, each model corresponds to a decision tree—thus the name *forest*. Because RFs can't easily be updated incrementally, RCFs were invented with variables in tree construction designed to allow incremental updates. 

As an unsupervised algorithm, RCF uses cluster analysis to detect spikes in time series data, breaks in periodicity or seasonality, and data point exceptions. Random cut forests can work as a synopsis or sketch of a dynamic data stream (or a time-indexed sequence of numbers). The answers to our questions about the stream come out of that synopsis. The following characteristics address the stream and how we make connections to anomaly detection and forecasting:
+ A *streaming algorithm* is an online algorithm with a small memory footprint. An online algorithm makes its decision about the input point indexed by time **t** before it sees the **(t+1)**-st point. The small memory footprint allows nimble algorithms that produce answers with low latency and let a user interact with the data.
+ Respecting the ordering imposed by time, as in an *online* algorithm, is necessary in anomaly detection and forecasting. If we already know what will happen the day after tomorrow, then predicting what happens tomorrow isn't a forecast—it's just interpolating an unknown missing value. Similarly, a new product introduced today can be an anomaly, but it doesn't necessarily remain an anomaly at the end of the next quarter. 
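To illustrate the streaming constraint, here is a minimal sketch of an online, small-memory detector. It is a stand-in, not the RCF algorithm itself: it keeps only a running mean and variance (Welford's method) and must score each point before seeing the next one:

```python
# A stand-in online detector (not RCF): constant memory, one pass, and the
# score for point t is decided before point t+1 arrives.
class StreamingZScore:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def score(self, x):
        """Return |z-score| of x against what has been seen so far, then update."""
        if self.n < 2:
            z = 0.0
        else:
            var = self.m2 / (self.n - 1)
            z = abs(x - self.mean) / (var ** 0.5) if var > 0 else 0.0
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return z

detector = StreamingZScore()
scores = [detector.score(x) for x in [10, 11, 9, 10, 11, 50, 10]]
print(scores)
```

The spike at 50 receives a large score the moment it arrives, using only two running statistics—no history is stored.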

# How RCF is applied to detect anomalies


A human can easily distinguish a data point that stands out from the rest of the data. RCF does the same thing by building a "forest" of decision trees, and then monitoring how new data points change the forest. 

An *anomaly* is a data point that draws your attention away from the normal points—think of an image of a red flower in a field of yellow flowers. This "displacement of attention" is encoded in the (expected) position within a tree (that is, a model in RCF) that would be occupied by the input point. The idea is to create a forest where each decision tree grows out of a partition of the data sampled for training the algorithm. In more technical terms, each tree builds a specific type of binary space partitioning tree on the samples.

As Amazon Quick Sight samples the data, RCF assigns each data point an anomaly score, giving higher scores to data points that look anomalous. The score is approximately inversely proportional to the resulting depth of the point in the tree. The random cut forest assigns the final anomaly score by computing the average score from each constituent tree and scaling the result with respect to the sample size. 
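The depth-based scoring idea can be sketched as follows. This is a deliberately simplified, isolation-style version—the production RCF algorithm differs in tree construction and in how scores are scaled—but it shows the core relationship: points that separate from their sample after only a few random cuts sit at shallow depth, and shallow depth maps to a high score:

```python
import random

random.seed(0)

# Simplified random cut tree on 1-D samples: pick a random cut inside the
# bounding range and recurse on each side.
def build_tree(points, depth=0, max_depth=10):
    if len(points) <= 1 or depth >= max_depth:
        return ("leaf", depth)
    lo, hi = min(points), max(points)
    if lo == hi:
        return ("leaf", depth)
    cut = random.uniform(lo, hi)
    left = [p for p in points if p < cut]
    right = [p for p in points if p >= cut]
    return ("node", cut, build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def depth_of(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, cut, left, right = tree
    return depth_of(left if x < cut else right, x)

def anomaly_score(forest, x):
    # Average depth over the forest; the score is inversely related to depth.
    # (The 1/(1+depth) form stands in for RCF's sample-size scaling.)
    avg_depth = sum(depth_of(t, x) for t in forest) / len(forest)
    return 1.0 / (1.0 + avg_depth)

sample = [random.gauss(0, 1) for _ in range(256)]
forest = [build_tree(random.sample(sample, 64)) for _ in range(50)]

print(anomaly_score(forest, 0.0), anomaly_score(forest, 8.0))
```

A point in the dense center of the data takes many cuts to isolate and scores low; a far-out point like 8.0 isolates almost immediately and scores high.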

The votes or scores of the different models are aggregated because each of the models by itself is a weak predictor. Amazon Quick Sight identifies a data point as anomalous when its score is significantly different from the recent points. What qualifies as an anomaly depends on the application. 
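The "significantly different from recent points" step can be sketched as a sliding-window threshold. The mean-plus-three-sigma rule below is an illustrative assumption, not the actual test Amazon Quick Sight applies:

```python
from collections import deque

# Flag a point only when its anomaly score stands out from the scores of
# recent points (mean + 3 sigma over a sliding window; illustrative choice).
def flag_anomalies(scores, window=20, sigmas=3.0):
    recent = deque(maxlen=window)
    flags = []
    for s in scores:
        if len(recent) >= 5:  # wait for a minimal baseline
            mean = sum(recent) / len(recent)
            var = sum((r - mean) ** 2 for r in recent) / len(recent)
            flags.append(s > mean + sigmas * var ** 0.5)
        else:
            flags.append(False)
        recent.append(s)
    return flags

scores = [0.1, 0.12, 0.09, 0.11, 0.1, 0.1, 0.11, 0.9, 0.1]
print(flag_anomalies(scores))
```

Only the score of 0.9 clears the threshold; the ordinary fluctuations around 0.1 do not, which is the sense in which "what qualifies as an anomaly depends on the application"—the window and threshold are tunable.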

The paper [Random Cut Forest Based Anomaly Detection On Streams](http://proceedings.mlr.press/v48/guha16.pdf) provides multiple examples of this state-of-the-art online (time-series) anomaly detection. RCFs are used on contiguous segments, or "shingles," of data, where the data in the immediately preceding segment acts as context for the most recent one. Previous versions of RCF-based anomaly-detection algorithms score an entire shingle. The algorithm in Amazon Quick Sight also provides an approximate location of the anomaly within the current extended context. This approximate location can be useful when there is a delay in detecting the anomaly. Delays occur because any algorithm needs to distinguish "previously seen deviations" from "anomalous deviations," which can unfold over some time. 
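Shingling itself is a simple transformation—each shingle is a contiguous window of the stream, so the values immediately before the newest one travel with it as context. A minimal sketch:

```python
# Turn a stream into overlapping contiguous windows ("shingles"). Each
# shingle is then treated as a single multi-dimensional point by the forest.
def shingle(stream, size):
    return [stream[i:i + size] for i in range(len(stream) - size + 1)]

stream = [3, 1, 4, 1, 5, 9, 2, 6]
print(shingle(stream, 4))
```

With a shingle size of 4, each new value is scored together with the three values that preceded it.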

# How RCF is applied to generate forecasts


To forecast the next value in a stationary time sequence, the RCF algorithm answers the question "Given a candidate value, what would be the most likely completion?" It uses a single tree in the RCF to search for the best candidate. The candidates across different trees are aggregated, because each tree by itself is a weak predictor. The aggregation also allows the generation of quantile errors. This process is repeated **t** times to predict the **t**-th value in the future. 
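The "most likely completion" idea can be sketched without trees at all. In this simplified stand-in (not the algorithm Amazon Quick Sight uses), each weak model searches its own random sample of historical shingles for the prefix closest to the recent context and proposes the value that followed it; aggregating the candidates gives both a point forecast and quantile bounds:

```python
import math
import random
import statistics

random.seed(1)

# Each weak model proposes the completion of the best-matching historical
# context; the ensemble is aggregated into a median and quantile bounds.
def forecast(history, context_len=4, n_models=30):
    shingles = [history[i:i + context_len + 1]
                for i in range(len(history) - context_len)]
    context = history[-context_len:]
    candidates = []
    for _ in range(n_models):
        sample = random.sample(shingles, max(1, len(shingles) // 2))
        # zip stops at context_len, so only the prefix is compared.
        best = min(sample, key=lambda s: sum((a - b) ** 2
                                             for a, b in zip(s, context)))
        candidates.append(best[-1])  # the value that completed the best match
    candidates.sort()
    lo = candidates[len(candidates) // 10]       # ~10th percentile
    hi = candidates[-1 - len(candidates) // 10]  # ~90th percentile
    return statistics.median(candidates), (lo, hi)

# A clean periodic series: the forecast should continue the cycle,
# which returns to 0 at t = 64.
history = [math.sin(2 * math.pi * t / 8) for t in range(64)]
point, (lo, hi) = forecast(history)
print(point, lo, hi)
```

Repeating the procedure, feeding each forecast back in as history, extends it **t** steps ahead, as the text describes.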

The algorithm in Amazon Quick Sight is called *BIFOCAL*. It uses two RCFs to create a CALibrated BI-FOrest architecture. The first RCF is used to filter out anomalies and provide a weak forecast, which is corrected by the second. Overall, this approach provides significantly more robust forecasts in comparison to other widely available algorithms such as ETS. 

The number of parameters in the Amazon Quick Sight forecasting algorithm is significantly smaller than for other widely available algorithms. This makes it useful out of the box, without manual tuning, across a large number of time series. As more data accumulates in a particular time series, the forecasts in Amazon Quick Sight can adjust to data drifts and changes of pattern. For time series that show trends, trend detection is performed first to make the series stationary. The forecast of that stationary sequence is projected back with the trend. 
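The detrend-forecast-retrend step can be sketched as follows. The residual forecaster here is a naive seasonal repeat standing in for the RCF forecaster—the trend handling is the point of the example, not the forecaster:

```python
# Fit a linear trend, forecast the stationary residual, project the trend back.
def linear_fit(ys):
    n = len(ys)
    mx, my = (n - 1) / 2, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    den = sum((x - mx) ** 2 for x in range(n))
    slope = num / den
    return my - slope * mx, slope  # intercept, slope

def forecast_with_trend(series, horizon, period):
    intercept, slope = linear_fit(series)  # detect the trend
    residual = [y - (intercept + slope * t) for t, y in enumerate(series)]
    forecasts = []
    for h in range(horizon):
        t = len(series) + h
        # Naive seasonal forecast of the stationary residual.
        stationary = residual[len(residual) - period + h % period]
        forecasts.append(intercept + slope * t + stationary)  # re-add trend
    return forecasts

# Trend 0.5*t plus a period-4 seasonal pattern.
seasonal = [2, -2, -2, 2]
series = [0.5 * t + seasonal[t % 4] for t in range(16)]
print(forecast_with_trend(series, horizon=4, period=4))
```

The forecasts continue both the upward trend and the seasonal cycle, which is exactly what projecting a stationary forecast back through the trend achieves.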

Because the algorithm relies on an efficient online algorithm (RCF), it can support interactive "what-if" queries. In these, some of the forecasts can be altered and treated as hypotheticals to provide conditional forecasts. This is the origin of the ability to explore "what-if" scenarios during analysis. 
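A conditional forecast of this kind can be sketched with any sequential forecaster: values pinned by the analyst are treated as if observed, and later steps forecast from the extended history. The moving-average forecaster below is a stand-in for RCF:

```python
# "What-if" forecasting: pinned hypothetical values enter the history, so
# subsequent forecast steps are conditioned on them.
def conditional_forecast(history, horizon, overrides, window=4):
    """overrides maps a future step index to a pinned hypothetical value."""
    extended = list(history)
    out = []
    for h in range(horizon):
        if h in overrides:
            value = overrides[h]  # the analyst's hypothetical
        else:
            value = sum(extended[-window:]) / window  # stand-in forecaster
        out.append(value)
        extended.append(value)  # later steps condition on it
    return out

history = [10.0, 10.0, 10.0, 10.0]
baseline = conditional_forecast(history, horizon=3, overrides={})
whatif = conditional_forecast(history, horizon=3, overrides={0: 20.0})
print(baseline, whatif)
```

Pinning the first future step to 20 pulls the later steps upward, while the baseline stays flat—this is the shape of an interactive "what-if" query.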

# References for machine learning and RCF


To learn more about machine learning and this algorithm, we suggest the following resources:
+ The article [Robust Random Cut Forest (RRCF): A No Math Explanation](https://www.linkedin.com/pulse/robust-random-cut-forest-rrcf-math-explanation-logan-wilt/) provides a lucid explanation without the mathematical equations. 
+ The book [*The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, Second Edition (Springer Series in Statistics)](https://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576) provides a thorough foundation on machine learning. 
+ The paper [Random Cut Forest Based Anomaly Detection On Streams](http://proceedings.mlr.press/v48/guha16.pdf) dives deep into the technicalities of both anomaly detection and forecasting, with examples. 

A different approach to RCF appears in other AWS services. If you want to explore how RCF is used in other services, see the following:
+ *Amazon Managed Service for Apache Flink SQL Reference:* [RANDOM\_CUT\_FOREST](https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html) and [RANDOM\_CUT\_FOREST\_WITH\_EXPLANATION](https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest-with-explanation.html)
+ *Amazon SageMaker Developer Guide:* [Random Cut Forest (RCF) Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html). This approach is also explained in [The Random Cut Forest Algorithm](https://freecontent.manning.com/the-randomcutforest-algorithm/), a chapter in [Machine Learning for Business](https://www.amazon.com/Machine-Learning-Business-Doug-Hudgeon/dp/1617295833/ref=sr_1_3) (October 2018). 