Anomaly Detector Usage Guide

TL;DR:
Start the container, send a JSON array of strings to: POST /api/v1/novelty/{name}/observe
Read the score field in the response.
Full docs at: http://<container>:8080/docs

Introduction

thatDot Anomaly Detector allows users to stream in categorical data and immediately receive information about how unusual that data is, with an explanation of why it is so unusual. Built-in visualization tools let you understand how a single observation relates to the entirety of your dataset. All of this is done on your own system, without any data ever leaving the machines you control.

Unlike other anomaly detection systems, thatDot Anomaly Detector works on categorical data—which means non-numeric data such as names, identifiers, email addresses, IP addresses, status codes, natural language text, and other strings.

No training data or data labeling is required. Simply start it up, feed it data, and get results back in real time. The system adapts to the data you feed it. The anomaly scores returned are based on a complex model of the data seen so far, so once a representative amount of data has been fed in, the scores will reliably highlight unusual observations.

What’s wrong with existing anomaly detection methods?

Traditional unsupervised anomaly detection methods—clustering (e.g. K-Means), Random Forests, Isolation Forests, and others—require converting all data to a numeric representation. This works well when the data is naturally numeric and there is a small set of features, but it breaks down when categorical data has more than a handful of possible values.

Terminology

  • observation – A list of strings fed into the anomaly detector, e.g.: ["my", "sample", "observation"]. The list can be any length, but all observations made into the same context should have the same length.
  • component – One observation is made up of many components. Each string in the observation is one component, e.g.: "sample"
  • context – A named group of observations. Each observation in the same context should have the same structure. One running anomaly detector instance can have any number of contexts. Each context is entirely separate from the others.
  • novelty – A measure of how anomalous an observation or component is. Unlike the terms “anomaly” or “anomalous”, which each tend to be binary (“it is or isn’t an anomaly”), novelty comes in shades of gray. Data can be more or less novel. The most novel is the most anomalous.

Step #1: Choose Data Structure

We’ve worked hard to make thatDot Anomaly Detector very simple to use. Pairing the power and simplicity of this system with the many varieties of data in the world requires one last decision: you, the data owner, must decide how to structure the data you feed into the system. In general, this means making an array of strings. But your choices about the content and order of the data will affect how valuable the results are. Choosing the content and order is how you define which question you want to analyze and monitor with thatDot Anomaly Detector.

Contextual Learning

Under the hood, thatDot Anomaly Detector builds a rich graphical model based on the data you stream in. That model is contextually tailored for each component of the observation. Practically, that means the system learns what represents normal vs. novel behavior for each component in an observation, given the components that come before it in that observation. So the order in which you feed in data is relevant to answering different kinds of questions.

Choosing Your First Observation Structure

If you’re just getting started, a good “rule-of-thumb” approach is to choose the values from your data which you believe relate to your question. Then arrange those values in ascending order of their expected cardinality. For instance, if geographic location is relevant to your topic, then you would want to include country before city. Since there are only about 200 countries in the world, and tens of thousands of cities, the cardinality of country is much lower and should come first. This ordering lets the system learn a “fingerprint” for what is normal in Athens, Greece which is entirely separate from the fingerprint learned for Athens, Georgia in the United States.

If you choose more values than you need, it won’t harm the results, but it will probably require more data to get useful explanations—though this will depend on the actual data itself.
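One way to apply this rule of thumb in practice is to estimate each field’s cardinality from a sample of your records and sort the fields in ascending order of distinct values. The sketch below is illustrative only; the sample records and field names are hypothetical:

```python
from collections import defaultdict

def order_by_cardinality(records, fields):
    """Order fields by ascending count of distinct values in a sample."""
    distinct = defaultdict(set)
    for rec in records:
        for f in fields:
            distinct[f].add(rec[f])
    return sorted(fields, key=lambda f: len(distinct[f]))

# Hypothetical sample records used only to estimate cardinality
sample = [
    {"country": "GR", "city": "Athens", "user": "alice"},
    {"country": "US", "city": "Athens", "user": "bob"},
    {"country": "US", "city": "Austin", "user": "carol"},
    {"country": "GR", "city": "Thessaloniki", "user": "dave"},
]
fields = order_by_cardinality(sample, ["user", "city", "country"])
# country (2 distinct) comes before city (3), which comes before user (4)
```

On a larger sample, the estimate stabilizes; the point is only that lower-cardinality values come first in the observation structure.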

Example observation structures for common use cases

  • Operational Security – [user_id, service_name, access_location, access_type_read_or_write, path_accessed, response_code]
  • Network Optimization – [country, region, status_code, cache_status, server_ip_address, client_subnet]
  • E-Commerce Intelligence – [web_property, region, demographic_profile, previously_viewed_product_id, product_id_purchased]
  • Log Analysis and Reduction – [application, hostname, function_call, status_code]

Observation Order in Depth

The order of components within an observation is used to generate conditional probabilities under the hood. For each component, the order determines which values are treated as given.

For example: in the “Log Analysis and Reduction” example above, the observation structure given has four components: [application, hostname, function_call, status_code]. This is like asking the following four questions:

  1. Given all data observed so far, what is the chance of seeing this application name?
  2. Given all data observed so far, and given this particular application name, what is the chance of seeing this hostname value?
  3. Given all data observed so far, and given these particular application and hostname values, what is the chance of seeing this particular function_call value?
  4. Given all data observed so far, and given these particular application, hostname, and function_call values, what is the chance of seeing this particular status_code value?

These conditional probabilities are the first step in the underlying algorithm. Once you choose the order, the rest of the process requires no additional choices or interaction. If you’d like to experiment with different orderings to see how they affect the results, you can do so in the same anomaly detector instance by feeding the reordered observations into a different novelty context as described in Step 2.
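To build intuition for the chain-rule decomposition above, the four questions can be sketched with naive counting in Python. This is a toy illustration only; the real model is far more sophisticated than raw frequency counts:

```python
def conditional_probs(history, observation):
    """Estimate P(component | preceding components) with raw counts over
    past observations. Toy illustration of the chain-rule decomposition."""
    probs = []
    for i, comp in enumerate(observation):
        prefix = tuple(observation[:i])
        # Past observations whose first i components match this prefix
        matching = [obs for obs in history if tuple(obs[:i]) == prefix]
        if not matching:
            probs.append(0.0)
            continue
        hits = sum(1 for obs in matching if obs[i] == comp)
        probs.append(hits / len(matching))
    return probs

# Hypothetical past observations in a log-analysis context
history = [
    ["appA", "host1", "init", "200"],
    ["appA", "host1", "init", "200"],
    ["appA", "host2", "read", "200"],
    ["appB", "host3", "init", "500"],
]
p = conditional_probs(history, ["appA", "host1", "init", "200"])
# p[0] = P(appA) = 3/4; p[1] = P(host1 | appA) = 2/3; and so on
```

Each entry of `p` answers one of the numbered questions above for the given observation.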

Step #2: Stream Data In

Streaming data into the Anomaly Detector is simply a matter of making REST API calls. Two endpoints are available:

POST /api/v1/novelty/{name}/observe

POST /api/v1/novelty/{name}/observe/bulk

The only difference between these two endpoints is how many observations they receive. …/observe takes a single observation and returns a single result. …/observe/bulk accepts an array of observations and returns a single response containing an array of results. These two endpoints can be used together. Full interactive documentation for these and all other API endpoints is available in each running instance of thatDot Anomaly Detector.

Feeding in observations requires choosing a value for {name} in the POST URL. This value is chosen by the user and can be any URL-encoded string; after being decoded, it is used as the name of the “context” for the provided observation(s). As mentioned above, a “context” is just a group of observations. All observations passed to the same context will use previous observations in that context as part of their novelty calculations. Contexts with different names have absolutely no bearing on each other.
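As a sketch, a minimal client for the single-observation endpoint can be written with only the Python standard library. The base URL is an assumption (substitute your own host); the endpoint path comes from the API above:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: adjust to your deployment

def observe_url(base, context, bulk=False):
    """Build the observe endpoint URL, URL-encoding the context name."""
    path = f"/api/v1/novelty/{urllib.parse.quote(context, safe='')}/observe"
    return base + path + ("/bulk" if bulk else "")

def observe(context, observation):
    """POST a single observation (a JSON array of strings) and return
    the parsed result."""
    req = urllib.request.Request(
        observe_url(BASE_URL, context),
        data=json.dumps(observation).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Against a running instance:
# result = observe("log-analysis", ["appA", "host1", "init", "200"])
```

For the bulk endpoint, POST a JSON array of such arrays to `observe_url(BASE_URL, context, bulk=True)` instead.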

Step #3: Interpret the Results

Response Payload

The response payload returned from the system has the following form:

{
  "observation": [
    "my",
    "sample",
    "observation"
  ],
  "score": 0.36231689108923804,
  "totalObsScore": 0.36231689108923804,
  "sequence": 3,
  "probability": 0.6666666666666666,
  "uniqueness": 0.9943363088569088,
  "infoContent": 0.5849625007211563,
  "mostNovelComponent": {
    "index": 2,
    "value": "observation",
    "novelty": 0.5849625007211563
  }
}

The fields returned have the following meaning:

  • observation – This is the same value passed in to produce the output. It is returned here only for reference.
  • score – The score is the total calculation of how novel the particular observation is. The value is always between 0 and 1, where zero is entirely normal and not anomalous, and one is highly novel and clearly anomalous. The score is the result of a complex analysis of the observation and other contextual data. In contrast to the next field, this score is weighted primarily by the novelty of individual components of the observation. Depending on the dataset and corresponding observation structure (see Step 1), real-world datasets will often see exponentially fewer results at higher scores. Practically, this often means that 0.99 is a reasonable threshold for finding only the most anomalous results, and a threshold of 0.999 is likely to return about half as many results. But to reiterate, the actual values and results will depend on the data and observation structure.
  • totalObsScore – While the score field is biased toward novel components the totalObsScore field is a similar computation applied to all components of the entire observation. One of the practical uses of this field is when using thatDot Anomaly Detector for finding “anti-anomalies”: data which is very typical.
  • sequence – Each observation passed into thatDot Anomaly Detector is given a unique sequence number. This value represents a total order for all observations and can be used to explore the data visualization as it was at the time when this observation was observed.
  • probability – This field represents the probability of seeing this entire observation (exactly) given all previous data when the observation was made.
  • uniqueness – A value between 0 and 1 which indicates how unique this entire observation is, given all previously observed data. A value of 1 means that this observation has never been seen before (in its entirety). Values approaching 0 indicate that this observation is incredibly common.
  • infoContent – The “information content”, “Shannon information”, or “self-information” contained in this entire observation, given all prior observations. This value is measured in bits, and answers the question: on average, how many “yes/no” questions would I need to ask to identify this observation, given all previous observations made to the system?
  • mostNovelComponent – An object describing which component of the observation was the most novel. It contains the following fields:
      • index – Which component in the list from the observation field was the most novel. This value is the index into that list, and is zero-indexed.
      • value – The string from the observation field which is the most novel component. This is the value you would find by extracting the component at position index from the observation array.
      • novelty – An abstract measure of how novel this one particular (most novel) component is. The maximum theoretical value of this field is equivalent to the value in the infoContent field. This field is not directly a measure of information content, however; it is weighted by many additional factors. The ratio of novelty over infoContent will always be between 0 and 1 and explains how much of the total infoContent is attributable to this particular component.
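As a sketch of how these fields might be consumed, the snippet below works through the sample payload (abridged to the fields it uses). The threshold value and the `triage` helper are illustrative assumptions, not part of the API:

```python
import math

# Abridged sample response payload from the documentation above
result = {
    "observation": ["my", "sample", "observation"],
    "score": 0.36231689108923804,
    "probability": 0.6666666666666666,
    "infoContent": 0.5849625007211563,
    "mostNovelComponent": {
        "index": 2,
        "value": "observation",
        "novelty": 0.5849625007211563,
    },
}

THRESHOLD = 0.99  # assumption: tune per dataset, per the guidance above

def triage(result, threshold=THRESHOLD):
    """Flag anomalous observations and name the component responsible."""
    mnc = result["mostNovelComponent"]
    info = result["infoContent"]
    # Share of the total information content attributable to this component
    share = mnc["novelty"] / info if info else 0.0
    return {
        "anomalous": result["score"] >= threshold,
        "culprit": mnc["value"],
        "novelty_share": share,
    }

# infoContent is the self-information of the whole observation, in bits:
assert math.isclose(result["infoContent"], -math.log2(result["probability"]))
```

Note that in this sample the most novel component accounts for the full information content, so its novelty share is 1.0.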

Full documentation for the payload values is also included in the interactive API documentation built in to each instance of thatDot Anomaly Detector.

Conditioning the System Instead of Training

thatDot Anomaly Detector requires no labeled training data, as it is an unsupervised process. The system will produce scored results immediately, starting with the very first observation passed in. The first results will not be very useful, however! The system adapts its scoring to the data it has seen so far in that particular novelty context. Before the system has seen a representative sample of your data, the scores won’t have much to go on. So thatDot recommends ignoring the first results while the system is still learning a representative sample of your data.

There is no universal guidance possible for how much data to ignore, since this depends on the dataset itself and the user’s choice of observation ordering (in Step 1). In practice, we find that many users have a good intuition for how much data is representative—but if not, a reasonable first estimate would be a few thousand observations. We have provided free usage tiers so that users can experiment with enough data to see useful results.
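One simple way to implement this warm-up period is to filter results by the sequence number the system assigns to each observation. The cutoff of 5,000 below is an illustrative assumption, not a recommendation for your data:

```python
WARMUP = 5000  # assumption: "a few thousand", per the guidance above

def scored_after_warmup(results, warmup=WARMUP):
    """Yield only results observed after the warm-up period, keyed off the
    sequence number the system assigns to each observation."""
    for result in results:
        if result["sequence"] > warmup:
            yield result

# Synthetic stream of results standing in for real API responses
stream = [{"sequence": i, "score": 0.5} for i in range(1, 10001)]
kept = list(scored_after_warmup(stream))
# keeps sequences 5001..10000
```

Because each context learns independently, the warm-up applies per context, not per instance.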

Our professional services team is available for engagements which require deeper analysis or collaboration on specific customer datasets and use cases.

Exploring the Data

thatDot Anomaly Detector includes a web-based exploration UI designed to help you build an intuition for the data passed in to any particular context. By default, the exploration UI is hosted at: http://<ip-or-domain-name>:8080/. It exposes a simplified interactive visualization of the underlying model built from the observations passed in.

Operational Considerations

Saving Data

The application will save data produced from the observations passed in. Upon restart, the system resumes with all data available from a prior session, so the underlying model created from past observations is preserved across system restarts, provided the underlying file system is preserved.

A set of administrative API endpoints is included in the built-in documentation. To clear out all data from all contexts and restore a running instance to the empty state it began with, use the POST /api/v1/admin/reset API endpoint.

For alternative options for data storage, high availability, clustered operation, and alternate data storage formats, please contact thatDot’s sales team.

Shutting Down

To shut down the system cleanly and ensure that all data is properly persisted, the API endpoint at POST /api/v1/admin/shutdown is provided. Please be sure to call this API and await a successful response before shutting down the application or container. Failure to shut down the system properly can result in loss of data.
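Both admin endpoints named above can be called with the Python standard library; a minimal sketch, assuming the default host and port from this guide:

```python
import urllib.request

def admin_url(base, action):
    """Build an admin endpoint URL ('reset' or 'shutdown')."""
    return f"{base}/api/v1/admin/{action}"

def admin_post(base, action):
    """POST to an admin endpoint and return the HTTP status code."""
    req = urllib.request.Request(admin_url(base, action), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Clean shutdown: await a successful response before stopping the container.
# status = admin_post("http://localhost:8080", "shutdown")
```

The same helper works for the reset endpoint described under Saving Data.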

System Sizing

thatDot Anomaly Detector can run in a container or on a full host machine. Custom deployment configurations (e.g. on-premises deployment, non-containerized deployments, alternate storage configurations, etc.) are available upon request. Performance is measured primarily as throughput: the number of observations processed per second. A system the size of a modern laptop (with SSD) will often achieve throughput of around 20,000 observations per second. Increasing CPU and RAM allows for higher throughput, though total throughput is limited by several other considerations, including the parallelism query parameter in the observe and observe/bulk API endpoints.

If you are interested in optimizing a deployment for high-throughput, or massive data volumes, please contact our solutions engineering team to discuss your specific situation.