Novelty Usage Guide
TL;DR:
• Start the container, send a JSON array of strings to: POST /api/v1/novelty/{name}/observe
• Read the score field in the response.
• Full docs at: http://<container>:8080/docs
Introduction
thatDot Novelty allows users to stream in categorical data and immediately receive information about how unusual that data is, with an explanation of why it is unusual. Built-in visualization tools let you understand how a single observation relates to the entirety of your dataset. All this is done on your own system, without any data ever leaving the machines you control.
Unlike other anomaly detection systems, thatDot Novelty Detector works on categorical data, meaning non-numeric data such as names, identifiers, email addresses, IP addresses, status codes, natural language text, and other strings.
No training data or data labeling is required. Simply start it up, feed it data, and get results back in real time. The system adapts to the data you feed it. The anomaly scores returned are based on a model of the data seen so far, so once a representative amount of data has been fed in, the scores become a useful guide to which observations are unusual.
What’s wrong with existing anomaly detection methods?
Traditional unsupervised anomaly detection methods, such as clustering (e.g. K-Means), Random Forests, Isolation Forest, and others, require converting all data to a numeric representation. This works well when the data is naturally numeric and there is a small set of features, but it becomes impractical when there is categorical data with more than a handful of possible values.
Terminology
observation – An observation is a list of strings fed into Novelty, e.g. ["my", "sample", "observation"]. The list can be any length, but all observations made into the same context should have the same length.
component – One observation is made up of many components. Each string in the observation is one component, e.g. "sample".
context – A context is the name of a group of observations. Each observation in the same context should have the same structure. One running instance of Novelty can have any number of contexts. Each context is entirely separate from the others.
novelty – A measure of how anomalous an observation or component is. Unlike the terms “anomaly” or “anomalous”, which each tend to be binary (“it is or isn’t an anomaly”), novelty comes in shades of gray. Data can be more or less novel. The most novel is the most anomalous.
Step #1: Choose Data Structure
We’ve worked hard to make thatDot Novelty Detector very simple to use. Pairing the power and simplicity of this system with the many varieties of data in the world requires one last decision: you, the data owner, decide how to structure the data you feed into the system. In general, this means making an array of strings. But your choices about the content and order of the data affect how valuable the results are. Choosing the content and order is how you define which question you want to analyze and monitor with thatDot Novelty Detector.
Contextual Learning
Under the hood, thatDot Novelty Detector builds a rich graphical model based on the data you stream in. That model is contextually tailored for each component of the observation. Practically, that means the system will learn what represents normal vs. novel behavior for each component in an observation, given the prior components that come before it in that observation. So the order in which you feed in data is relevant to answering different kinds of questions.
Choosing Your First Observation Structure
If you’re just getting started, a good rule of thumb is to choose the values from your data which you believe relate to your question, then arrange those values in ascending order of their expected cardinality. For instance, if geographic location is relevant to your topic, you would want to include country before city. Since there are only about 200 countries in the world and tens of thousands of cities, the cardinality of country is much lower and it should come first. This would learn a “fingerprint” for what is normal in Athens, Greece which is entirely separate from the fingerprint learned from Athens, Georgia in the United States.
If you choose more values than you need, it won’t harm the results, but it will probably require more data to get useful explanations—though this will depend on the actual data itself.
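The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the helper names (order_fields_by_cardinality, to_observation) and the sample records are hypothetical, and cardinality is estimated from whatever sample of your data you pass in.

```python
from collections import defaultdict


def order_fields_by_cardinality(records, fields):
    """Return `fields` sorted by ascending number of distinct values in `records`."""
    distinct = defaultdict(set)
    for record in records:
        for field in fields:
            distinct[field].add(record[field])
    # sorted() is stable, so fields with equal cardinality keep their given order.
    return sorted(fields, key=lambda field: len(distinct[field]))


def to_observation(record, ordered_fields):
    """Turn one record into a list of strings in the chosen order."""
    return [str(record[field]) for field in ordered_fields]


# Hypothetical sample data: city has 3 distinct values, country has 2.
sample = [
    {"city": "Athens", "country": "Greece"},
    {"city": "Athens", "country": "United States"},
    {"city": "Atlanta", "country": "United States"},
    {"city": "Thessaloniki", "country": "Greece"},
]

ordered = order_fields_by_cardinality(sample, ["city", "country"])
observations = [to_observation(record, ordered) for record in sample]
```

With this sample, country is placed before city, so the two Athens records become distinct observations that start with different countries.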
Example observation structures for common use cases
- Operational Security – [user_id, service_name, access_location, access_type_read_or_write, path_accessed, response_code]
- Network Optimization – [country, region, status_code, cache_status, server_ip_address, client_subnet]
- E-Commerce Intelligence – [web_property, region, demographic_profile, previously_viewed_product_id, product_id_purchased]
- Log Analysis and Reduction – [application, hostname, function_call, status_code]
Observation Order in Depth
The order of the components within an observation is used to generate conditional probabilities under the hood. The order determines which values are treated as given when each component is evaluated.
For example: in the “Log Analysis and Reduction” example above, the observation structure given has four components: [application, hostname, function_call, status_code]
This is like asking the following four questions:
- Given all data observed so far, what is the chance of seeing this application name?
- Given all data observed so far, and given this particular application name, what is the chance of seeing this hostname value?
- Given all data observed so far, and given these particular application and hostname values, what is the chance of seeing this particular function_call value?
- Given all data observed so far, and given these particular application, hostname, and function_call values, what is the chance of seeing this particular status_code value?
These conditional probabilities are the first step in the underlying algorithm. Once you choose the order, the rest of the process requires no additional choices or interaction. If you’d like to experiment with different ordering to see how it affects the results, you can do so in the same novelty detector instance by feeding the reordered observations into a different novelty context as described in Step 2.
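To make the four questions concrete, here is a minimal sketch that answers them with plain empirical frequencies over prefix counts. It is a naive illustration of conditional probabilities only, not thatDot Novelty Detector's actual model; the function names and the sample history are hypothetical.

```python
from collections import Counter


def prefix_counts(observations):
    """Count how often every prefix (including the empty one) has been seen."""
    counts = Counter()
    for obs in observations:
        for end in range(len(obs) + 1):
            counts[tuple(obs[:end])] += 1
    return counts


def conditional_probabilities(counts, observation):
    """For each component, estimate P(component | all components before it)."""
    probabilities = []
    for i in range(len(observation)):
        given = tuple(observation[:i])      # the components treated as given
        full = tuple(observation[:i + 1])   # the given components plus this one
        seen_given = counts[given]
        probabilities.append(counts[full] / seen_given if seen_given else 0.0)
    return probabilities


# Hypothetical history in the order [application, hostname, function_call, status_code]
history = [
    ["billing", "host-a", "charge", "200"],
    ["billing", "host-a", "charge", "200"],
    ["billing", "host-a", "refund", "500"],
    ["search", "host-b", "query", "200"],
]
counts = prefix_counts(history)
probs = conditional_probabilities(counts, ["billing", "host-a", "refund", "500"])
```

For this history, probs holds one value per question: 3/4 for the application, 3/3 for the hostname given the application, 1/3 for the function call given both, and 1/1 for the status code given all three. Reordering the components changes which values are conditioned on, and therefore which component stands out.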
Step #2: Stream Data In
Streaming data into the Novelty Detector is simply a matter of making REST API calls. Two endpoints are available:
POST /api/v1/novelty/{name}/observe
POST /api/v1/novelty/{name}/observe/bulk
The only difference between these two endpoints is how many observations they receive. …/observe takes a single observation and returns a single result. …/observe/bulk accepts an array of observations and returns a single response containing an array of results. These two endpoints can be used together. Full interactive documentation for these and all other API endpoints is available in each running instance of thatDot Novelty Detector.
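When feeding a long stream through the bulk endpoint, one option is to group observations into fixed-size batches first. The sketch below shows only the batching step; the batched helper and the default batch size of 500 are assumptions chosen for illustration, not values recommended by thatDot.

```python
from itertools import islice


def batched(observations, batch_size=500):
    """Yield lists of at most `batch_size` observations from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(observations)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stream of three observations, grouped two at a time.
stream = [["a", "b", "c"], ["a", "b", "d"], ["a", "x", "c"]]
batches = list(batched(stream, batch_size=2))
```

Each yielded batch is an array of observations, which is the shape the bulk endpoint is described as accepting.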
Feeding in observations requires choosing a value for {name} in the POST URL. This value is chosen by the user and can be any URL-encoded string which (after being decoded) will be used as the name of the “context” for the provided observation(s). As mentioned above, a “context” is just a group of observations. All observations passed to the same context will use previous observations in that context as a part of their novelty calculations. Contexts with different names have absolutely no bearing on each other.
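A minimal sketch of calling the two endpoints with Python's standard library follows. The base URL, the context name, the helper names, and the Content-Type header are assumptions made for illustration; check the interactive API documentation in your running instance for the authoritative request format.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8080"  # assumption: adjust to your container's address


def observe_url(base_url, context, bulk=False):
    """Build the observe URL, URL-encoding the user-chosen context name."""
    path = f"/api/v1/novelty/{quote(context, safe='')}/observe"
    return base_url.rstrip("/") + path + ("/bulk" if bulk else "")


def build_request(base_url, context, payload, bulk=False):
    """Prepare (but do not send) a POST carrying the JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        observe_url(base_url, context, bulk),
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical context name containing characters that need encoding.
single = build_request(BASE_URL, "log analysis/v1", ["my", "sample", "observation"])
bulk = build_request(
    BASE_URL, "log analysis/v1", [["a", "b", "c"], ["a", "b", "d"]], bulk=True
)
# result = send(single)  # uncomment when an instance is running
```

Sending the same observations to a different context name is all that is needed to experiment with a different ordering, since contexts do not affect each other.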
Step #3: Interpret the Results
Response Payload
The response payload returned from the system has the following form:
{
  "observation": [
    "my",
    "sample",
    "observation"
  ],
  "score": 0.36231689108923804,
  "totalObsScore": 0.36231689108923804,
  "sequence": 3,
  "probability": 0.666666666666666,
  "uniqueness": 0.9943363088569088,
  "infoContent": 0.5849625007211563,
  "mostNovelComponent": {
    "index": 2,
    "value": "observation",
    "novelty": 0.5849625007211563
  }
}
The fields returned have the following meaning:
observation – This is the same value passed in to produce the output. It is returned here only for reference.
score – The score is the total calculation of how novel the particular observation is. The value is always between 0 and 1, where zero is entirely normal and not anomalous, and one is highly novel and clearly anomalous. The score is the result of a complex analysis of the observation and other contextual data. In contrast to the next field, this score is weighted primarily by the novelty of individual components of the observation. Depending on the dataset and corresponding observation structure (see Step 1), real-world datasets will often see this score distributed with exponentially fewer results at higher scores. Practically, this often means that 0.99 is a reasonable threshold for finding only the most anomalous results, and 0.999 is likely to return half as many results. But to reiterate, the actual values and results will depend on the data and observation structure.
totalObsScore – While the score field is biased toward novel components, the totalObsScore field is a similar computation applied to all components of the entire observation. One of the practical uses of this field is when using thatDot Novelty Detector for finding “anti-anomalies”: data which is very typical.
sequence – Each observation passed into thatDot Novelty Detector is given a unique sequence number. This value represents a total order for all observations and can be used to explore the data visualization as it was at the time when this observation was observed.
probability – The probability of seeing this entire observation, given all prior observations.
uniqueness – A value between 0 and 1. A value of 1 means that this observation has never been seen before (in its entirety). Values approaching 0 indicate that this observation is incredibly common.
infoContent – The “Information Content”, “Shannon Information”, or “self-information” contained in this entire observation, given all prior observations. This value is measured in bits, and is an answer to the question: on average, how many “yes/no” questions would I need to ask to identify this observation, given this and all previous observations made to the system?
mostNovelComponent – An object describing which component of the observation was the most novel.
  index – Which component in the list from the observation field was the most novel. This value is the index into that list, and is zero-indexed.
  value – The value from the observation field which is the most novel component. This is the value you would find by extracting the component at position index from the observation array.
  novelty – The novelty of this component, which is related to the infoContent field. This field is not directly a measure of information content, however. Instead it is weighted by many additional factors. The ratio of novelty over infoContent will always be between 0 and 1 and will explain how much of the total infoContent is attributable to this particular component.
Full documentation for the payload values is also included in the interactive API documentation built in to each instance of thatDot Novelty Detector.
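As a sketch of how a client might consume this payload, the Python below parses the sample response shown above, applies the 0.99 threshold mentioned in the score description, and summarizes the most novel component. The helper names are hypothetical, and the threshold should be tuned to your own data.

```python
import json
import math

# The sample response from this guide, used here as test data.
SAMPLE_RESPONSE = json.loads("""
{
  "observation": ["my", "sample", "observation"],
  "score": 0.36231689108923804,
  "totalObsScore": 0.36231689108923804,
  "sequence": 3,
  "probability": 0.666666666666666,
  "uniqueness": 0.9943363088569088,
  "infoContent": 0.5849625007211563,
  "mostNovelComponent": {
    "index": 2,
    "value": "observation",
    "novelty": 0.5849625007211563
  }
}
""")


def is_anomalous(result, threshold=0.99):
    """Flag a result whose score meets the chosen threshold."""
    return result["score"] >= threshold


def describe(result):
    """Summarize the score and the component that contributed the most novelty."""
    component = result["mostNovelComponent"]
    return (
        f"score={result['score']:.3f}; most novel component "
        f"#{component['index']} = {component['value']!r}"
    )


summary = describe(SAMPLE_RESPONSE)
flagged = is_anomalous(SAMPLE_RESPONSE)

# In this one sample, infoContent matches -log2(probability); treat that as an
# observation about the sample rather than a documented guarantee.
bits = -math.log2(SAMPLE_RESPONSE["probability"])
```

The sample scores about 0.362, well under the threshold, so it would not be flagged.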
Conditioning the System Instead of Training
thatDot Novelty Detector requires no labeled training data, as it is an unsupervised process. The system will produce a scored result immediately, starting with the very first observation passed in. The first results will not be very useful, however! The system adapts its scoring to the data it has seen so far in that particular novelty context. Before the system has seen a representative sample of your data, the scores won’t have much to go on. So thatDot recommends ignoring the first results while the system is still learning a representative sample of your data.
There is no universal guidance possible for how much data to ignore, since this depends on the dataset itself and the user’s choice of observation ordering (in Step 1). In practice, we find that many users have a good intuition for how much data is representative. If not, a reasonable first estimate would be a few thousand observations. We have provided free usage tiers so that users can experiment with enough data to see useful results.
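One simple way to apply this advice in client code is to keep feeding observations but discard the results from a warm-up period. The sketch below assumes results arrive in the order they were observed; the after_warm_up helper and the default of 3000 are illustrative assumptions, not recommended values.

```python
def after_warm_up(results, warm_up=3000):
    """Yield only the results that arrive after the first `warm_up` results."""
    for count, result in enumerate(results, start=1):
        if count > warm_up:
            yield result


# Hypothetical results: five scored observations with a warm-up of three.
fake_results = [{"score": s} for s in (0.9, 0.8, 0.7, 0.2, 0.995)]
kept = list(after_warm_up(fake_results, warm_up=3))
alerts = [r for r in kept if r["score"] >= 0.99]
```

The early high scores (0.9, 0.8, 0.7) are dropped because the model had little to compare against at that point; only results after the warm-up are considered for alerting.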
Our professional services team is available for engagements which require deeper analysis or collaboration on specific customer datasets and use cases.
Exploring the Data
thatDot Novelty Detector includes a web-based exploration UI meant to build an intuition for what the data passed in to any particular context is like. By default, the exploration UI is hosted at: http://<ip-or-domain-name>:8080/
The exploration UI exposes a simplified interactive visualization of the underlying model built from the observations passed in.
Operational Considerations
Saving Data
The application will save data produced from the observations passed in. Upon restart, the system resumes with all data available from a prior session. So the underlying model created from past observations will be preserved across system restarts, if the underlying file system is restored.
A set of administrative API endpoints is included in the built-in documentation. To clear out all data from all contexts and restore a running instance of the application to the empty state it began with, you can use the POST /api/v1/admin/reset API endpoint.
For alternative options for data storage, high availability, clustered operation, and alternate data storage formats, please contact thatDot’s sales team.
Shutting Down
To shut down the system cleanly and ensure that all data is properly persisted, the API endpoint at POST /api/v1/admin/shutdown is provided. Please be sure to call this API and await a successful response before shutting down the application or container. Failure to properly shut down the system can result in loss of data.
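A minimal sketch of calling the administrative endpoints from Python's standard library follows. The base URL, helper names, empty request body, and timeout are assumptions for illustration; the two paths are the ones named in this guide.

```python
from urllib.request import Request, urlopen

ADMIN_ACTIONS = ("reset", "shutdown")  # the two admin paths named in this guide


def admin_request(base_url, action):
    """Prepare (but do not send) a POST to /api/v1/admin/<action>."""
    if action not in ADMIN_ACTIONS:
        raise ValueError(f"unsupported admin action: {action!r}")
    url = f"{base_url.rstrip('/')}/api/v1/admin/{action}"
    return Request(url, data=b"", method="POST")  # assumed: empty body


def shutdown_cleanly(base_url, timeout=30):
    """Request shutdown and wait for the response before stopping the container.

    urlopen raises an exception on HTTP error statuses, so a normal return
    means the call was answered successfully.
    """
    with urlopen(admin_request(base_url, "shutdown"), timeout=timeout) as response:
        return 200 <= response.status < 300


request = admin_request("http://localhost:8080/", "shutdown")
# shutdown_cleanly("http://localhost:8080")  # uncomment against a running instance
```

Only stop the application or container after shutdown_cleanly returns True.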
System Sizing
thatDot Novelty Detector can run in a container or on a full host machine. Custom deployment configurations (e.g. on-premises deployment, non-containerized deployments, alternate storage configurations, etc.) are available upon request. Performance is primarily measured as throughput, in observations per second. A system the size of a modern laptop (with an SSD) will often see throughput of around 20,000 observations per second. Increasing CPU and RAM allows for higher throughput, though total throughput is also limited by several other considerations, including the parallelism query parameter of the observe and observe/bulk API endpoints.