Introduction: A New Approach to Anomaly Detection
Anomaly detection is a technique for finding important data. Decades of research have been spent on creating tools for anomaly detection with numeric data. But most data produced in the real world is not numbers: it is user names, identifiers, log statements, email addresses, URLs, access credentials, service names, file paths, timestamps, IP addresses, API paths, and a seemingly endless list of other valuable data that is not numeric. Non-numeric data is called “categorical data,” and it has been mostly ignored by data analysis tools. So how can you find important categorical data with existing anomaly detection tools if they only work with numeric data?
The state of the art for using anomaly detection (or most other artificial intelligence techniques) on categorical data is to begin by converting the categorical data into numbers. There are standard ways of doing this. The most common by far is “one-hot encoding,” and at least 16 other encodings have been catalogued. These techniques are all cumbersome and lossy in one form or another. Each of those 17 transformations requires a data scientist to bake in some interpretation, which future data may not align with. And critically, the standard approach, one-hot encoding, requires that you know the cardinality of the categorical data ahead of time, and that it remains very low. Each new value of the categorical data requires another dimension to be added to the matrix computations.
Adding dimensions leads to a highly complex feature space that is ruinous for anomaly detection! Larger matrices require more computation time, and achieving useful results typically becomes impossible because of what has become known as “the curse of dimensionality”: as the number of dimensions used for anomaly detection increases, every data point starts looking like an anomaly in some way, and the results are useless.
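To make the dimension growth concrete, here is a minimal sketch of one-hot encoding in plain Python. The function name one_hot_encode and the sample user names are hypothetical and chosen only for illustration; production code would normally rely on an existing encoding library instead.

```python
def one_hot_encode(values):
    """Return (vocabulary, vectors) for a list of categorical values."""
    # Every distinct value gets its own dimension, in first-seen order.
    vocabulary = list(dict.fromkeys(values))
    index = {value: i for i, value in enumerate(vocabulary)}
    vectors = []
    for value in values:
        row = [0] * len(vocabulary)
        row[index[value]] = 1
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical sample data.
users = ["alice", "bob", "alice", "carol"]
vocab, vectors = one_hot_encode(users)
print(len(vocab), vectors[0])

# One previously unseen value forces a new dimension onto every vector.
vocab2, vectors2 = one_hot_encode(users + ["dave"])
print(len(vocab2), vectors2[0])
```

The width of every vector equals the number of distinct values seen, so a field such as a user name or URL with millions of distinct values would need millions of dimensions.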
The Standard For a New Technique
When we set out to create a new way to detect important data directly using categorical data, we set our sights rather high. The standards we wanted to meet for this new technique were:
Streaming / Online / Real-Time. Answers now are much better than answers in 15 minutes, or an hour, or tomorrow, or next week. To be useful in the greatest number of applications, we should be able to provide results immediately. Since the streaming use case is a superset of the batch use case, any tool that can provide streaming results can also provide batch results; but not the other way around.
Unsupervised. Humans should not have to manually label the data to indicate what is and isn’t anomalous. Supervised AI techniques require manually creating training sets. That process is extremely laborious and time-consuming, and it often leads to bias encoded into the system. Instead, this system should train itself automatically from the data.
High cardinality. One of the greatest challenges with categorical data is that it often has extremely high cardinality. A system designed to handle this should not cause a data or analysis explosion when some fields have an extremely high number of values. Instead, the system should support constant-time processing and storage, incorporating new values and dimensions in the data on the fly, while still delivering real-time results.
Distinguish unique from anomalous. Sometimes “new” is actually “normal.” When a data set regularly includes new unseen values, an anomaly detection system should take that into account. Instead of producing false alarms because data is unique, the system should learn from the context and the rest of the data seen so far to understand when unique data is actually typical.
Learn behavioral patterns. Human behavior is complex, and system behavior can often compound to be even more complex. The ideal anomaly detection system would be able to learn idiosyncratic behaviors, applicable only in specific situations, and incorporate that learning into the final evaluation of data.
Rank results in a total order. It is helpful to be told “yes” or “no” for whether data is anomalous. But in many real-world environments, we would also like to know how anomalous it is and how that compares to another anomaly we might be looking at. A strength of existing numerical methods is that they often give a total ranking of results. A new anomaly detection system should preserve this total ordering of anomaly scores but deliver those results immediately—with scores that are still totally ordered as the stream continues. In practice, if I’m already looking at a rather anomalous event, I want to be told if new data is even more anomalous.
It’s great to reach a conclusion and learn that a piece of data is very novel and important. But it is much better to also understand WHY that datum is important. A system for categorical anomaly detection should embrace the goals of “Explainable AI” so that the people involved can understand its conclusions.
The Achilles heel of most AI systems today is that they are empirically tested but often lack a solid theoretical justification for the results they produce. The individual steps in modern methods are often well-established, but evaluating the fully-composed technique is limited to empirical measurement instead of theoretical grounding. In contrast, we believe that new techniques are more powerful if they are theoretically sound all the way to their core, and the results reflect that. A system that is theoretically sound produces more reliable results, especially when faced with unfamiliar data. Since anomaly detection is entirely about finding and explaining unfamiliar data, the soundness of the approach is of critical importance.
Terminology: Novel vs. Anomalous
We describe the system overall as an “anomaly detection system” because that is its most common use, and the name best known in the industry. But what we actually compute is a continuous score describing how novel each piece of data is. More than just swapping words, this distinction between “novel” and “anomalous” is an important one. Novelty is an objective feature we can ascribe to the data and evaluate on a continuous spectrum. When applying this system to a data set, a user can use the novelty scores to decide whether or not data is anomalous. That decision is a separate, application-specific process of translating the continuous-valued novelty score into a binary-valued anomaly decision.
Consider an example application in cybersecurity, where “anomalous” is often the term used to describe what an analyst would probably prefer to call “suspicious.” These terms get used interchangeably in practice, but with subtle subjective distinctions. When approaching the theoretical justification for a new technique, we think it is important to deliberately use the term “novelty” for an objective feature of the data, one that is determined before the results are interpreted as “anomalous” or “suspicious.” Thus, “novelty” is the best term for the theoretical determination arrived at by this system.
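The separation between an objective novelty score and a subjective anomaly decision can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function is_anomalous, the 0-to-1 score range, the threshold value, and the sample events are hypothetical and are not taken from the product.

```python
def is_anomalous(novelty_score, threshold):
    """Translate a continuous novelty score into a binary anomaly decision."""
    return novelty_score >= threshold


# Hypothetical (event, novelty score) pairs; the values are made up.
scored_events = [
    ("login from usual workstation", 0.03),
    ("login from new country", 0.91),
    ("first use of admin API path", 0.67),
]

# The threshold is an application-specific choice, not a property of the data.
THRESHOLD = 0.8

flagged = [event for event, score in scored_events if is_anomalous(score, THRESHOLD)]
print(flagged)

# The continuous scores also rank events from most to least novel.
ranked = sorted(scored_events, key=lambda pair: pair[1], reverse=True)
print([event for event, _ in ranked])
```

A different application could reuse the same scores with a different threshold, which is why the score and the decision are kept apart.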
Probabilistic Graphical Models
A collection of data can be described with a graphical model. This is a representation of the data built by structuring nodes and edges in relationships that represent the core features of the data. When that graph is structured with historical facts and probabilistic information about the data being examined, it can provide a wealth of statistical information about the dataset as a whole, and each individual data observation.
That statistical information provides a probabilistic view of the data which allows us to measure the Information Content (also known as “Shannon Information” or “Self-Information”) represented in the data. Information Content is the basis for the novelty scores returned from thatDot Novelty Detector. This approach runs parallel to traditional AI techniques where Information Content is often the primary measure used in the cost function (or “loss function”) of many machine learning methods. In fact, to help machine learning researchers and engineers use this system for other purposes beyond just anomaly detection, we even return the Information Content in the result payload for each observation. Inside Novelty Detector, the Information Content is computed and then combined with other information from a complex graph built by the system.
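For readers unfamiliar with the measure, the Information Content of an observation with probability p is -log2(p) bits, so rare values carry more information than common ones. The sketch below estimates p from running frequency counts over a stream. It is a simplified illustration only: the class name StreamingInformationContent is hypothetical, and estimating probabilities from raw counts is an assumption, not a description of how Novelty Detector computes its scores.

```python
import math
from collections import Counter


class StreamingInformationContent:
    """Estimate self-information in bits from running frequency counts."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, value):
        """Record one value, then return its information content in bits."""
        self.counts[value] += 1
        self.total += 1
        # -log2(count / total), written as log2(total / count).
        return math.log2(self.total / self.counts[value])


# Hypothetical stream of categorical values.
scorer = StreamingInformationContent()
for value in ["GET", "GET", "GET", "DELETE"]:
    bits = scorer.observe(value)
    print(value, bits)
```

In this stream the repeated value scores zero bits each time, while the first appearance of a new value after three repeats scores two bits.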
A Dynamic Graph
thatDot Novelty Detector builds a dynamic graphical model of the dataset in real time. This is generally a challenging problem, often calling for a graph database or other complex tools. But those tools aren’t built for streaming data, and they end up crushed under the load of high-volume streaming input and the voluminous computation that results. To overcome this challenge, we built Novelty Detector on top of Quine. Quine is a streaming graph interpreter, capable of high-volume data processing and storage. Using Quine, we construct and maintain the graphical model that represents the incoming data. Quine records the historical facts and computes all the probabilities and information measures needed to produce a score for each incoming data item. That graph interpreter can compare scores across the entire data context and explain what about the data was so novel. All of this is accomplished in real time so that streaming results are scored immediately as they flow through the system.
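The idea of a graph that is updated and queried by each incoming observation can be illustrated with a toy model. The sketch below is not Quine and is not the algorithm used by Novelty Detector; it is a hypothetical in-memory tree (the names Node and StreamingPathModel are invented here) in which each observation is a path of categorical values, each node keeps a count, and each step is scored by the conditional probability of that value given the path so far.

```python
import math


class Node:
    """One node of the toy graph: a count plus edges to child values."""

    def __init__(self):
        self.count = 0
        self.children = {}


class StreamingPathModel:
    """Toy streaming model where each observation is a path from the root."""

    def __init__(self):
        self.root = Node()

    def observe(self, values):
        """Add one observation; return (total_bits, per_step_bits)."""
        node = self.root
        node.count += 1
        per_step = []
        for value in values:
            child = node.children.setdefault(value, Node())
            child.count += 1
            # Bits for this value given the path so far:
            # -log2(child.count / node.count).
            per_step.append((value, math.log2(node.count / child.count)))
            node = child
        total = sum(bits for _, bits in per_step)
        return total, per_step


# Hypothetical observations: (user, HTTP method, path).
model = StreamingPathModel()
for _ in range(3):
    model.observe(["alice", "GET", "/home"])
total_bits, breakdown = model.observe(["alice", "DELETE", "/home"])
print(total_bits, breakdown)
```

The per-step breakdown hints at how a graph-based score can also explain itself: in this example only the unexpected method contributes any bits, so it is the part of the observation that made it novel.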
The final result is a high-throughput, low-latency, high-cardinality, categorical data analysis tool capable of scoring the novelty of all incoming data in real time. Streaming data dynamically updates a probabilistic graphical model, which computes information content and assesses it holistically against the data context to produce a novelty score that is useful for finding anomalies and explaining the result. All together, thatDot Novelty Detector represents a breakthrough in the field of anomaly detection, with wide-ranging applications across industries. You can try thatDot Novelty Detector for free on AWS right away.