Friday, January 12, 2018

Annotating detected outliers

The so-called Twitter Anomaly Detection function for R is excellent but also very minimalistic. The input is a two-column data frame where the first column consists of the timestamps and the second column contains the observations. In addition to a plot, the output is a data frame comprising timestamps, values, and optionally, expected values.

In practice, we usually have some semantic information that we would also like to include in the output, so we do not have to refer back to the original data. Fortunately, there is a quick-and-dirty way to add a description to the outlier data frame.

We start with the annotated data frame containing at least columns with the timestamps, the observations, and factors providing contextual or semantic information on each observation. We then create a simple data frame with just the first two columns, which we pass to the outlier detection function.
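A minimal sketch of this step might look as follows. The column names (timestamp, value, description) and the data are made up for illustration, and the detector call is shown only as a comment because it needs the package to be installed.

```r
# Hypothetical annotated data frame; column names and values are
# assumptions made for illustration only.
annotated <- data.frame(
  timestamp   = as.POSIXct("2017-01-17 06:50:00", tz = "UTC") + 60 * (0:4),
  value       = c(10.2, 10.4, 10.1, 55.0, 10.3),
  description = factor(c("ok", "ok", "ok", "gear display flashing", "ok"))
)

# Simple two-column data frame: timestamps first, observations second.
simple <- annotated[, c("timestamp", "value")]

# The simple data frame is what goes to the detector, for example:
# anomalies <- AnomalyDetectionTs(simple, plot = TRUE)
```

Because the simple data frame is a column subset of the annotated one, both keep the same row order, which is what makes a later lookup by row index possible.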

We can write a trivial function that for each outlier finds the row index in the simple data frame and looks up the semantic information in the annotated data frame:

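AddDescription <- function(series1, series2, outliers) {
  # series1: simple data frame (timestamp, observation) passed to the detector
  # series2: annotated data frame with the same row order as series1
  # outliers: result of the detection function; outliers$anoms holds the anomalies
  quantity <- NROW(outliers$anoms)
  if (quantity < 1) return(NULL)
  result <- NULL
  for (i in seq_len(quantity)) {
    # Row of the outlier in the simple data frame (first match if duplicated)
    rowIndex <- which(series1$timestamp == outliers$anoms$timestamp[i])[1]
    newRow <- data.frame(outliers$anoms$timestamp[i],
                         outliers$anoms$anoms[i],
                         # assumption: the annotation column is named "description"
                         series2$description[rowIndex])
    result <- rbind(result, newRow)
  }
  colnames(result) <- c("timestamp", "outlier_value", "description")
  return(result)
}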

This function is just an elementary example. It is easy to extend it so that each outlier carries more detailed information compiled from the full data frame.
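One way to carry over more than one column is to look the rows up with base R's match instead of a loop. This is only a sketch: the helper name annotate_outliers, the column names, and the data are hypothetical, and the mock object imitates just the anoms data frame (columns timestamp and anoms) so that the example runs without the package.

```r
# Hypothetical helper: attach every extra column of the annotated data frame
# to the outliers, matching rows on the timestamp.
annotate_outliers <- function(annotated, anoms) {
  if (NROW(anoms) < 1) return(NULL)
  # Compare seconds since the epoch so the displayed time zone does not matter.
  idx <- match(as.numeric(as.POSIXct(anoms$timestamp)),
               as.numeric(as.POSIXct(annotated$timestamp)))
  # Everything except the two columns already present in the outlier output.
  extra <- annotated[idx, setdiff(names(annotated), c("timestamp", "value")),
                     drop = FALSE]
  result <- cbind(data.frame(timestamp = anoms$timestamp,
                             outlier_value = anoms$anoms),
                  extra)
  rownames(result) <- NULL
  result
}

# Made-up data for illustration only.
annotated <- data.frame(
  timestamp   = as.POSIXct("2017-01-17 06:50:00", tz = "UTC") + 60 * (0:4),
  value       = c(10.2, 10.4, 10.1, 55.0, 10.3),
  description = c("ok", "ok", "ok", "gear display flashing", "ok"),
  vehicle     = c("A", "A", "A", "B", "B")
)

# Mock of the detector's output: only the anoms data frame is imitated.
anomalies <- list(anoms = data.frame(timestamp = annotated$timestamp[4],
                                     anoms     = annotated$value[4]))

annotated_outliers <- annotate_outliers(annotated, anomalies$anoms)
print(annotated_outliers)
```

Matching on the timestamp rather than on a row index means the annotated data frame does not have to be in the same order as the data passed to the detector, at the cost of assuming the timestamps are unique.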

Time series with outliers at green markers

Outliers with descriptions

timestamp             outlier_value   description
2017-01-17 06:53:00                   gear display flashing
2017-09-19 09:10:00                   gear shift failure
2017-11-17 07:26:00                   check engine lamp on

Dates are a sore point of analytics: they always get you. When no time zone is specified, i.e., tz = "", R assumes the local time zone. In the data frame returned by Twitter's AnomalyDetectionTs function, however, the timestamp column has UTC as its time zone. Therefore, the following statement is useful after the call to AnomalyDetectionTs:

anomalies$anoms$timestamp <- as.POSIXct(anomalies$anoms$timestamp, tz = "")
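The difference between re-labelling a time and re-interpreting it is easy to miss, so here is a small base-R sketch of the two. The zone name America/Los_Angeles and the sample time are arbitrary choices for illustration, and the sketch does not use the package at all.

```r
utc <- as.POSIXct("2017-01-17 06:53:00", tz = "UTC")

# Option 1: keep the instant, change only the zone used for printing.
shown_locally <- utc
attr(shown_locally, "tzone") <- "America/Los_Angeles"

# Option 2: keep the clock reading and re-interpret it in another zone.
# This yields a different instant.
reread <- as.POSIXct(format(utc, "%Y-%m-%d %H:%M:%S"),
                     tz = "America/Los_Angeles")

print(format(shown_locally, "%Y-%m-%d %H:%M:%S"))  # "2017-01-16 22:53:00"
print(format(reread, "%Y-%m-%d %H:%M:%S"))         # "2017-01-17 06:53:00"

# Hours between the re-interpreted time and the original instant.
shift_hours <- as.numeric(difftime(reread, utc, units = "hours"))
print(shift_hours)                                 # 8
```

Which of the two effects a given conversion has can depend on the class of the timestamp column, so it is worth checking attr(x, "tzone") and one known value after converting.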