Dynamic Duo: Quine & Novelty Detector for Insider Threats

April 18, 2023 · by thatDot · 5 min read

Adding Quine to the Insider Threat Detection Proof of Concept

A lot has changed since we first posted the Stop Insider Threats With Automated Behavioral Anomaly Detection blog post. Most significantly, thatDot released Quine, our streaming graph, as an open source project just as the industry is recognizing the value of real-time ETL and complex event processing in service of business requirements. This is especially true in finance and cybersecurity, where minutes (seconds or even milliseconds) can mean the difference between disaster, survival or success.

Our goal at the time was to show how anomaly detection on categorical data could be used to resolve complex challenges, using an industry-recognized benchmark dataset that happened to be static. Our approach then was to pre-process (batch) the VAST Insider Threat challenge dataset with Python, then ingest the processed data with thatDot's Novelty Detector to identify the bad actor.

But with a new tool in our kit, we decided to see what would be involved in updating the workflow by replacing the Python pre-processing with Quine, placed in front of Novelty Detector in our pipeline.

This involved:

Defining the ingest queries required to consume and shape the VAST datasets; and
Developing a standing query to output the data to Novelty Detector for anomaly detection.

Data from the dataset is broken into three files:

Employee to office and source IP address mapping in employeeData.csv

ingestStreams:
  - type: FileIngest
    path: employeeData.csv
    parallelism: 61
    format:
      type: CypherCsv
      headers: true
      query: >-
        MATCH (employee), (ipAddress), (office)
        WHERE id(employee) = idFrom("employee", $that.EmployeeID)
          AND id(ipAddress) = idFrom("ipAddress",$that.IP)
          AND id(office) = idFrom("office",$that.Office)

        SET employee.id = $that.EmployeeID,
            employee:employee
            
        SET ipAddress.ip = $that.IP,
            ipAddress:ipAddress
            
        SET office.office = $that.Office,
            office:office

        CREATE (ipAddress)<-[:USES_IP]-(employee)-[:SHARES_OFFICE]->(office)

Proximity reader data from door badge scanners in proxLog.csv

  - type: FileIngest
    path: proxLog.csv
    format:
      type: CypherCsv
      headers: true
      query: >-
        MATCH (employee), (badgeStatus)
        WHERE id(employee) = idFrom("employee", $that.ID)
          AND id(badgeStatus) = idFrom("badgeStatus",$that.ID,$that.Datetime,$that.Type,$that.ID)

        SET employee.id = $that.ID,
            employee:employee

        SET badgeStatus.type = $that.Type,
            badgeStatus.employee = $that.ID,
            badgeStatus.datetime = $that.Datetime,
            badgeStatus:badgeStatus

        CREATE (employee)-[:BADGED]->(badgeStatus)

Network traffic in IPLog3.5.csv

  - type: FileIngest
    path: IPLog3.5.csv
    format:
      type: CypherCsv
      headers: true
      query: >-
        MATCH (ipAddress), (request)
        WHERE id(ipAddress) = idFrom("ipAddress",$that.SourceIP)
          AND id(request) = idFrom("request", $that.SourceIP, $that.AccessTime, $that.DestIP, $that.Socket)

        SET request.reqSize = $that.ReqSize,
            request.respSize = $that.RespSize,
            request.datetime = $that.AccessTime,
            request.dst = $that.DestIP,
            request.dstport = $that.Socket,
            request:request

        SET ipAddress.ip = $that.SourceIP,
            ipAddress:ipAddress

        CREATE (ipAddress)-[:MADE_REQUEST]->(request)

These ingests form a basic structure that looks like:

A snapshot of the graph created by the ingest streams, showing Employee 51 connected by a BADGED edge to a door reader event node and by a USES_IP edge to an IP address node, which is connected to a request node by a MADE_REQUEST edge.

The ingest streams combine to create the essential graph structure.

Because we identify nodes by feeding idFrom() deterministic and descriptive data, we have an intuitive schema that lets us query for those nodes very efficiently (with sub-millisecond latency).
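To illustrate why deterministic IDs make lookups cheap, here is a minimal Python sketch of the general idea. It is an analogy only: the `id_from` helper and the namespace value are hypothetical, and Quine's idFrom() uses its own scheme that is not shown here.

```python
import uuid

# Hypothetical namespace used only for this illustration.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def id_from(*parts):
    """Derive a repeatable ID from an ordered list of values."""
    # Join with a separator unlikely to appear in the data,
    # so ("a", "bc") and ("ab", "c") produce different keys.
    key = "\x1f".join(str(p) for p in parts)
    return uuid.uuid5(NAMESPACE, key)

# Two different ingest streams referring to the same employee
# compute the same ID without any lookup or coordination.
from_employee_file = id_from("employee", "51")
from_badge_log = id_from("employee", "51")
```

Because the ID is a pure function of the values, any stream that sees the same employee, IP address, or office value can address the same node directly instead of searching for it.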

The same basic graph as above but this time showing a very efficient query for node properties.

A quick query efficiently displays relevant properties from connected nodes.

Moving from Batch to Real Time Monitoring

While this is certainly an improvement from our previous workflow, it is still highly manual (i.e., having to explicitly query for the data we’re looking for). The promise of a Quine to Novelty Detector workflow is automation with real-time results.

By ingesting the data in chronological order (as presented in the source files), we are able to easily match network events to the most recent associated proximity badge event in real time.

This is accomplished via standing query matches like:

standingQueries:
   - pattern:
       query: >-
         MATCH (request)<-[:MADE_REQUEST]-(ipAddress)<-[:USES_IP]-(employee)-[:BADGED]->(badgeStatus)
         RETURN DISTINCT id(request) AS requestid
       type: Cypher
     outputs:
       print-output:
         type: CypherQuery
         query: >-
          MATCH (request)<-[:MADE_REQUEST]-(ipAddress)<-[:USES_IP]-(employee)-[:BADGED]->(badgeStatus) 
          WHERE id(request) = $that.data.requestid 
            AND badgeStatus.datetime<=request.datetime 
          WITH max(badgeStatus.datetime) AS date, request, ipAddress 
          MATCH (request)<-[:MADE_REQUEST]-(ipAddress)<-[:USES_IP]-(employee)-[:BADGED]->(badgeStatus) 
          WHERE badgeStatus.datetime=date 

          RETURN badgeStatus.type AS status,ipAddress.ip AS src,request.dstport AS port,request.dst AS dst
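The matching logic in the output query above (find the latest badge event at or before each request) can be sketched in plain Python. This is a rough illustration under stated assumptions, not how Quine evaluates the query: the `last_badge_status` helper is hypothetical, and the sample events are made up rather than taken from the VAST dataset.

```python
from bisect import bisect_right
from datetime import datetime

def last_badge_status(badge_events, request_time):
    """Return the type of the latest badge event at or before request_time.

    badge_events: list of (datetime, type) tuples sorted by datetime.
    Returns None when there is no badge event at or before request_time.
    """
    times = [t for t, _ in badge_events]
    # bisect_right counts the events with time <= request_time.
    i = bisect_right(times, request_time)
    return badge_events[i - 1][1] if i else None

# Hypothetical sample data for one employee.
badges = [
    (datetime(2008, 1, 3, 8, 0), "prox-in-building"),
    (datetime(2008, 1, 3, 9, 30), "prox-in-classified"),
    (datetime(2008, 1, 3, 10, 15), "prox-out-classified"),
]
status = last_badge_status(badges, datetime(2008, 1, 3, 9, 45))
```

Pairing each request with the employee's last known badge state is what turns raw network traffic into a behavior record (status, source, port, destination) for anomaly detection.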

The question remains, “How do we share the standing query matches from Quine to Novelty Detector?” This can be done in a number of ways (all via standing query outputs) including, but not limited to:

Writing results to a file that Novelty Detector ingests;
Emitting webhooks from Quine to Novelty Detector; or
Publishing results to a Kafka topic to be ingested by Novelty Detector.

Although the first two choices will work, they are severely suboptimal. Consider a simple example of a single employee’s data:

A graph showing employee's data that renders as thousands of nodes connected to four main clusters.

Visualizing data from a single employee.

In aggregate, all 115,434 matches would be written to the filesystem one record at a time (once per standing query match).

andThen:
    type: WriteToFile
    path: behaviors.jsonl

Using webhooks suffers from the same issue as writing to a file, and adds latency from the HTTP transactions.

andThen:
    type: PostToEndpoint
    url: http://localhost:8080/api/v1/novelty/behaviors/observe?transformation=behaviors

Ultimately, we settled on the third option as it most closely resembles production environments, and is the most performant.

andThen:
    type: WriteToKafka
    bootstrapServers: localhost:9092
    topic: vast
    format:
        type: JSON

The big question - did it work?

A scatter graph of Novelty Detector results showing the anomalous behavior connected to a compromised facility.

Results from the Novelty Detector UI.

Absolutely.

The anomalous activity has been identified.

Was it worthwhile?

Sure, but…

It Don’t Mean a Thing If It Ain’t Got That Real-Time Swing

Although we were able to accomplish the same results with Quine in a single step, this was still a batch-processing exercise. The true value of a Quine to Novelty Detector pipeline is in the melding of complex event stream processing in Quine with shallow learning (no training data) techniques in Novelty Detector, providing an efficient solution for detecting persistent threats and unwanted behaviors in your network. This pattern, moving from batch processing that requires heavy lifting and grooming of datasets to real-time stream processing, is one where Quine and Novelty Detector thrive.

Try it Yourself

If you'd like to try the VAST test case yourself, you can run Novelty Detector on AWS with a generous free usage tier. Instructions for configuring Novelty Detector are available here.

And the open source version of Quine is available for download here. If you are interested, there is also an enterprise version that offers clustering for horizontal scaling and resilience.

And if you'd prefer a demo or have additional questions, check out the Quine community Slack or send us an email.
