Adding Quine to the Insider Threat Detection Proof of Concept
A lot has changed since we first published the Stop Insider Threats With Automated Behavioral Anomaly Detection blog post. Most significantly, thatDot released Quine, our streaming graph, as an open source project just as the industry is recognizing the value of real-time ETL and complex event processing in service of business requirements. This is especially true in finance and cybersecurity, where minutes, seconds, or even milliseconds can mean the difference between disaster, survival, and success.
Our goal at the time was to show how anomaly detection on categorical data could resolve complex challenges, using an industry-recognized standard benchmark dataset that happened to be static. Our approach then was to pre-process (batch) the VAST Insider Threat challenge dataset with Python, then ingest that processed stream of data with thatDot's Novelty Detector to identify the bad actor.
With a new tool in our kit, we decided to see what it would take to update the workflow by replacing the Python pre-processing with Quine, placed in front of Novelty Detector in our pipeline. The work came down to two tasks:
- Defining the ingest queries required to consume and shape the VAST datasets; and
- Developing a standing query to output the data to Novelty Detector for anomaly detection.
Data from the dataset is broken into three files:
- Employee to office and source IP address mapping in employeeData.csv
- Proximity reader data from door badge scanners in proxLog.csv
- Network traffic in IPLog3.5.csv
These ingests form a basic structure that looks like:
Because we feed `idFrom()` deterministic and descriptive data, we have an intuitive schema for identifying nodes, and we can query for them very efficiently (with sub-millisecond latency).
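To make the idea of deterministic, descriptive node IDs concrete, here is a minimal Python sketch. It is only an analogy: the `id_from` helper, the hashing scheme, and the example values are all assumptions made for illustration, and none of it is Quine's actual `idFrom()` implementation. The point it shows is that the same descriptive inputs always yield the same ID, so a node can be addressed directly without a lookup or an index scan.

```python
import hashlib
import uuid


def id_from(*parts: object) -> uuid.UUID:
    """Derive a stable ID from descriptive values.

    Hypothetical stand-in used only to illustrate the concept behind
    Quine's idFrom(); it does not reproduce Quine's algorithm.
    """
    # Join with a separator unlikely to appear in the values, so that
    # ("ab", "c") and ("a", "bc") produce different keys.
    key = "\x1f".join(str(p) for p in parts)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # A UUID needs exactly 16 bytes; take the first 16 of the hash.
    return uuid.UUID(bytes=digest[:16])


# Hypothetical employee identifiers, not taken from the VAST dataset.
employee_a = id_from("employee", 30)
employee_b = id_from("employee", 30)
other = id_from("employee", 31)

print(employee_a == employee_b)  # same inputs give the same ID
print(employee_a == other)       # different inputs give a different ID
```

Because the ID is a pure function of the inputs, every ingest stream that mentions the same employee resolves to the same node without coordinating with the others.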
Moving from Batch to Real Time Monitoring
While this is certainly an improvement over our previous workflow, it is still highly manual (i.e., we have to explicitly query for the data we’re looking for). The promise of a Quine to Novelty Detector workflow is automation with real-time results.
By ingesting the data in chronological order (as presented in the source files), we can easily match network events to the last associated proximity badge event in real time.
This is accomplished via standing query matches like:
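The matching logic itself can be sketched outside of Quine. The Python below is not the standing query and not Quine's Cypher; it is a rough illustration, under assumptions, of pairing each network event with the most recent badge event for the same employee when events arrive in chronological order. The `Event` type, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative only."""
    timestamp: str  # ISO-8601 string
    kind: str       # "badge" or "network"
    employee: str
    detail: str     # badge location, or destination IP


def match_to_last_badge(
    events: Iterable[Event],
) -> Iterator[Tuple[Event, Optional[Event]]]:
    """Pair each network event with the employee's latest badge event.

    Assumes `events` is already in chronological order, as the source
    files are described to be.
    """
    last_badge: Dict[str, Event] = {}
    for event in events:
        if event.kind == "badge":
            # Remember only the most recent badge event per employee.
            last_badge[event.employee] = event
        elif event.kind == "network":
            # Emit a match as soon as the network event arrives.
            yield event, last_badge.get(event.employee)


# Hypothetical sample stream, not taken from the VAST dataset.
stream = [
    Event("2008-01-02T08:00:00", "badge", "emp30", "building"),
    Event("2008-01-02T08:05:00", "network", "emp30", "10.0.0.1"),
    Event("2008-01-02T12:00:00", "badge", "emp30", "classified"),
    Event("2008-01-02T12:10:00", "network", "emp30", "10.0.0.2"),
    Event("2008-01-02T12:15:00", "network", "emp31", "10.0.0.3"),
]
matches = list(match_to_last_badge(stream))
for network_event, badge_event in matches:
    location = badge_event.detail if badge_event else "unknown"
    print(network_event.employee, network_event.detail, location)
```

Only one badge event per employee is kept in memory, which is what makes this kind of matching workable on a stream: each network event is resolved the moment it arrives rather than after a batch join.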
The question remains, “How do we share the standing query matches from Quine to Novelty Detector?” This can be done in a number of ways (all via standing query outputs) including, but not limited to:
- Writing results to a file that Novelty Detector ingests;
- Emitting webhooks from Quine to Novelty Detector; or
- Publishing results to a Kafka topic to be ingested by Novelty Detector.
Although the first two choices will work, they are severely suboptimal. Consider a simple example of a single employee’s data:
Writing the aggregate 115,434 matches to the filesystem would happen one record at a time, once per standing query match.
Using webhooks suffers from the same issue as writing to a file, and adds latency from the HTTP transactions.
Ultimately, we settled on the third option, as it most closely resembles production environments and is the most performant.
The big question - did it work?
The anomalous activity has been identified.
Was it worthwhile?
It Don’t Mean a Thing If It Ain’t Got That Real-Time Swing
Although we were able to accomplish the same results with Quine in a single step, this was still a batch-processing exercise. The true value of a Quine to Novelty Detector pipeline is in the melding of complex event stream processing in Quine with shallow learning techniques (no training data) in Novelty Detector, providing an efficient solution for detecting persistent threats and unwanted behaviors in your network. This pattern, moving from batch processing that requires heavy lifting and grooming of datasets to real-time stream processing, is one where Quine and Novelty Detector thrive.
Try it Yourself