Insider Threat Detection

Stop Insider Threats With Automated Behavioral Anomaly Detection

Introduction

Finding a malicious employee is one of the toughest challenges in cyber-security: someone who was deliberately given access to sensitive information violates that trust and secretly steals private data for a third party. Detecting such a threat has always been a murky, labor-intensive task. Until now.

Behavioral Anomaly Detection

We’ve developed a new technique for measuring how unusual each behavior is. Unlike traditional approaches, it uses non-numeric categorical data directly instead of trying to convert it to numbers first. We can analyze that data in real time, scoring each behavioral event by how unusual it is. As a result, it’s very easy to automate the analysis of behavioral data and get powerful results in real time.

VAST Insider Threat Dataset

A standard benchmark for insider threat detection is the publicly available VAST Insider Threat challenge dataset. Originally released in 2009, it remains a good example of the data available to, and the challenge facing, cyber-security professionals who are trying to detect and stop a malicious insider. This post uses the VAST data for mini-challenge #1, “MC1 – Badge and Network Traffic,” to demonstrate automated insider threat detection. For comparison, the accepted solution to the problem can be found here; it relied on a laborious manual exploration of the data.

The Embassy Scenario

The data describes the workplace environment of 60 employees at an embassy. There are 30 offices, each shared by two people. A classified space is separated from the offices. There are two doors: one provides access between the outside and the office area, and the other provides access from the office area to the classified space.

Employees are expected to scan their badges every time they enter from the outside, and every time they enter and leave the classified space. Each office holds the desktop computers assigned uniquely to each employee who shares that office.

Source Data

The dataset is broken into three files: employee data in employeeData.csv, proximity data from door badge scanners in proxLog.csv, and network traffic in IPLog3.5.csv. Some sample records from each file (below) show their shape.

Note: the original data includes a USER WARNING field stating that this is Synthetic Data. This field appears on each record of the employee and network traffic files, but is deleted from this point on for brevity.

Data is read from each .csv file and parsed into Python dictionaries, also parsing out timestamps into datetime objects:

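import csv
from datetime import datetime
from pprint import pprint

employeeData = []
ipLog = []
proxLog = []
# Employee records: drop the synthetic-data warning field.
with open("employeeData.csv", "r") as fd:
    for l in csv.DictReader(fd):
        del l['USER WARNING']
        employeeData.append(l)
# Network traffic: drop the warning field and parse the timestamp.
with open("IPLog3.5.csv", "r") as fd:
    for l in csv.DictReader(fd):
        del l['USER WARNING']
        l['AccessTime'] = datetime.fromisoformat(l['AccessTime'])
        ipLog.append(l)
# Badge scans: parse the timestamp.
with open("proxLog.csv", "r") as fd:
    for l in csv.DictReader(fd):
        l['Datetime'] = datetime.fromisoformat(l['Datetime'])
        proxLog.append(l)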

Looking at the first three sample records from each file helps build a sense of what this data looks like:

pprint(employeeData[:3])
len(employeeData)
[{'EmployeeID': '0', 'IP': '37.170.100.0', 'Office': '0'},
 {'EmployeeID': '1', 'IP': '37.170.100.1', 'Office': '0'},
 {'EmployeeID': '2', 'IP': '37.170.100.2', 'Office': '1'}]

60 records
pprint(proxLog[:3])
len(proxLog)
[{'Datetime': datetime.datetime(2008, 1, 1, 7, 28),
  'ID': '44',
  'Type': 'prox-in-building'}, 
 {'Datetime': datetime.datetime(2008, 1, 1, 8, 31),
  'ID': '44',
  'Type': 'prox-in-classified'}, 
 {'Datetime': datetime.datetime(2008, 1, 1, 9, 23),
  'ID': '38',
  'Type': 'prox-in-building'}]

10162 records
pprint(ipLog[:3])
len(ipLog)
[{'AccessTime': datetime.datetime(2008, 1, 1, 9, 40, 29, 276000),
  'DestIP': '37.170.100.200',
  'ReqSize': 7063,
  'RespSize': 49591,
  'Socket': '80',
  'SourceIP': '37.170.100.38'},
 {'AccessTime': datetime.datetime(2008, 1, 1, 9, 43, 8, 861000),
  'DestIP': '37.157.76.124',
  'ReqSize': 5171,
  'RespSize': 434285,
  'Socket': '80',
  'SourceIP': '37.170.100.38'},
 {'AccessTime': datetime.datetime(2008, 1, 1, 9, 47, 41, 282000),
  'DestIP': '37.170.30.250',
  'ReqSize': 32818,
  'RespSize': 182798,
  'Socket': '25',
  'SourceIP': '37.170.100.38'}]
115414 records

Behavioral Data

The proximity data in proxLog and the network traffic data in ipLog represent different kinds of behavioral data. Though they come from separate files, each is an aspect of the same behavior we’re trying to analyze. They can be combined by adding the relevant employee’s badge status from proxLog to each record from ipLog:

# Pre-sort each employee's badge events by time.
sorted_emp_prox = {}
for emp_id in [e['EmployeeID'] for e in employeeData]:
    sorted_emp_prox[emp_id] = sorted([x for x in proxLog if x['ID'] == emp_id], key=lambda x: x['Datetime'])

def emp_state_at_time(employee, dt):
    """Return the type of the employee's last badge event before dt."""
    last = None
    for e in sorted_emp_prox[employee]:
        if e['Datetime'] < dt:
            last = e
        else:  # e['Datetime'] >= dt: no later event can precede dt
            break
    return last['Type'] if last is not None else "unknown"

# Map each desktop's IP address to the employee it is assigned to.
emp_for_ip = {e['IP']: e['EmployeeID'] for e in employeeData}

# One observation per network record: [badge state, source IP, port, destination IP]
behaviors = [[emp_state_at_time(emp_for_ip[x['SourceIP']], x['AccessTime']), x['SourceIP'], x['Socket'], x['DestIP']] for x in ipLog]

pprint(behaviors[:5])

We’ve added proximity badge status to each network record, and then trimmed those records down to only four values relevant for our automated detection:

[['prox-in-building', '37.170.100.38', '80', '37.170.100.200'],
 ['prox-in-building', '37.170.100.38', '80', '37.157.76.124'],
 ['prox-in-building', '37.170.100.38', '25', '37.170.30.250'],
 ['prox-in-building', '37.170.100.38', '80', '37.116.192.39'],
 ['prox-in-building', '37.170.100.38', '80', '10.24.74.254'], ...]
115414 records
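The lookup in emp_state_at_time walks an employee’s badge events one by one for every network record. As a possible alternative, here is a minimal sketch that answers the same question with a binary search. The names build_index and state_at are hypothetical and not part of the original code, and the two sample events are the first two proxLog records shown earlier.

```python
from bisect import bisect_left
from datetime import datetime

def build_index(prox_log):
    """Group badge events by employee ID as parallel, time-sorted lists."""
    index = {}
    for event in sorted(prox_log, key=lambda x: x['Datetime']):
        times, types = index.setdefault(event['ID'], ([], []))
        times.append(event['Datetime'])
        types.append(event['Type'])
    return index

def state_at(index, employee, dt):
    """Return the type of the last badge event strictly before dt."""
    times, types = index.get(employee, ([], []))
    pos = bisect_left(times, dt)  # number of events with Datetime < dt
    return types[pos - 1] if pos > 0 else "unknown"

# Two badge events for employee 44, used only to exercise the sketch
sample_log = [
    {'ID': '44', 'Datetime': datetime(2008, 1, 1, 7, 28), 'Type': 'prox-in-building'},
    {'ID': '44', 'Datetime': datetime(2008, 1, 1, 8, 31), 'Type': 'prox-in-classified'},
]
sample_index = build_index(sample_log)
print(state_at(sample_index, '44', datetime(2008, 1, 1, 8, 0)))
```

Like the original function, this treats an event at exactly the query time as not yet having happened, and returns "unknown" when the employee has no earlier badge event.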

Automated Detection

With the combined proximity+network data, we can dump that out as a .csv file and use the built-in “Getting Started” page to easily feed this to thatDot’s Categorical Anomaly Detection system:

Loading data trains the system and scores each result as it streams in.
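The .csv export mentioned above might look like the following sketch. The function name write_behaviors and the output file name are hypothetical, and whether the tool expects a header row is not stated in this post, so none is written here.

```python
import csv
import os
import tempfile

def write_behaviors(path, rows):
    """Write one observation per line; every field stays a plain string."""
    with open(path, "w", newline="") as fd:
        csv.writer(fd).writerows(rows)

# Two observations copied from the combined data shown earlier
sample = [
    ['prox-in-building', '37.170.100.38', '80', '37.170.100.200'],
    ['prox-in-building', '37.170.100.38', '25', '37.170.30.250'],
]
# Hypothetical output location, chosen only for this illustration
path = os.path.join(tempfile.gettempdir(), "behaviors_sample.csv")
write_behaviors(path, sample)
```

Keeping every value as a string matters here: ports and IP addresses are categories, not quantities, so nothing should be converted to a number on the way out.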

Streaming results are often more difficult to produce but more useful than batch results. However, since this scenario is a batch of data and meant to be analyzed as a batch, we can use a quick Python script to make a second pass that feeds the data to the novelty scoring system in a read-only fashion (does not update the model) to get results where each is informed by the entire batch:

e = read_results("behaviors", behaviors, 10000)
Read: 10,000  Rate: 52,631 / second
Read: 10,000  Rate: 38,910 / second
Read: 10,000  Rate: 56,497 / second
Read: 10,000  Rate: 43,668 / second
Read: 10,000  Rate: 55,248 / second
Read: 10,000  Rate: 56,497 / second
Read: 10,000  Rate: 47,169 / second
Read: 10,000  Rate: 35,087 / second
Read: 10,000  Rate: 49,019 / second
Read: 10,000  Rate: 46,296 / second
Read: 10,000  Rate: 56,818 / second
Read:  5,414  Rate: 40,103 / second
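To make the difference between the streaming pass and the read-only second pass concrete, here is a toy sketch. It is not thatDot’s algorithm: ToyNoveltyModel is a hypothetical stand-in that scores an observation as one minus the fraction of past observations identical to it, and the data is made up.

```python
from collections import Counter

class ToyNoveltyModel:
    """Toy stand-in: novelty = 1 - relative frequency of the exact observation."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score(self, obs):
        """Read-only scoring: does not change the model."""
        if self.total == 0:
            return 1.0  # nothing seen yet, so everything is novel
        return 1.0 - self.counts[tuple(obs)] / self.total

    def observe(self, obs):
        """Streaming step: score against the past, then learn the observation."""
        s = self.score(obs)
        self.counts[tuple(obs)] += 1
        self.total += 1
        return s

data = [['a', 'x']] * 9 + [['b', 'y']]        # made-up observations
model = ToyNoveltyModel()
streaming = [model.observe(o) for o in data]  # first pass: trains as it scores
batch = [model.score(o) for o in data]        # second pass: read-only
```

In the streaming pass the very first ['a', 'x'] looks completely novel because nothing precedes it. In the second pass it scores low, since the model has by then seen that it is the most common observation in the batch.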

And plot the results:

import plotly.express as px
px.scatter(limit([x for x in e if x['score'] > 0.5]), title="Scores after observing all data (sampled)", y='score', x='sequence', marginal_y='violin').show()

More novel, higher-scoring results show up higher on the plot. This sampling of results (not every data point is shown) reveals a distinct group of points with very high scores, around 0.94.

thatDot Categorical Anomaly Detection has found a small number of clearly unusual behaviors. Looking at the results shows that each of these events was network traffic to the same destination IP address: 100.59.151.133. What’s more, these connections were usually made from different source IP addresses, but while the respective employee responsible for that computer was badged into the classified space.
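One way to arrive at that observation is to tally the fields of the high-scoring results. The sketch below assumes each result is a dictionary with score and observation keys, which is how the lookup code later in this post reads them. The function name summarize is hypothetical and the sample scores are made up.

```python
from collections import Counter

def summarize(results, threshold=0.9):
    """Count destination IPs and badge states among high-scoring results."""
    top = [r['observation'] for r in results if r['score'] > threshold]
    dest_ips = Counter(obs[3] for obs in top)      # field 3: destination IP
    badge_states = Counter(obs[0] for obs in top)  # field 0: badge state
    return dest_ips, badge_states

sample_results = [  # made-up scores, for illustration only
    {'score': 0.94, 'observation': ['prox-in-classified', '37.170.100.31', '8080', '100.59.151.133']},
    {'score': 0.94, 'observation': ['prox-in-classified', '37.170.100.16', '8080', '100.59.151.133']},
    {'score': 0.12, 'observation': ['prox-in-building', '37.170.100.38', '80', '37.170.100.200']},
]
dest_ips, badge_states = summarize(sample_results)
print(dest_ips.most_common(1), badge_states.most_common(1))
```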

Who Done It?

thatDot Categorical Anomaly Detection successfully found the needle in the haystack! Eighteen records in the network traffic log correspond with traffic sent to this suspicious IP address:

suspicious_ip = list(set([x['observation'][3] for x in e if x['score'] > 0.9]))[0]
suspicious_ip_traffic = [x for x in ipLog if x['DestIP'] == suspicious_ip]
pprint(suspicious_ip_traffic[:3])
[{'AccessTime': datetime.datetime(2008, 1, 8, 17, 1, 33, 1000),
 'DestIP': '100.59.151.133',
 'ReqSize': 8889677,
 'RespSize': 12223,
 'Socket': '8080',
 'SourceIP': '37.170.100.31'},
{'AccessTime': datetime.datetime(2008, 1, 10, 14, 27, 12, 238000),
 'DestIP': '100.59.151.133',
 'ReqSize': 6543216,
 'RespSize': 22315,
 'Socket': '8080',
 'SourceIP': '37.170.100.31'},
{'AccessTime': datetime.datetime(2008, 1, 10, 16, 1, 53, 956000),
 'DestIP': '100.59.151.133',
 'ReqSize': 8543125,
 'RespSize': 12312,
 'Socket': '8080',
 'SourceIP': '37.170.100.16'}]

18 records

So who is the guilty party?

Using the behavioral data about employee proximity, we can put together a set of people who were available to access the compromised computer during each of the 18 illicit data transmissions:

def available_emp(atTime):
    a = []
    available = ['prox-out-classified', 'prox-in-building']
    for e in employeeData:
        if emp_state_at_time(e['EmployeeID'], atTime) in available:
            a.append(e['EmployeeID'])
    return a

opportunities = [available_emp(x['AccessTime']) for x in suspicious_ip_traffic]

[len(x) for x in opportunities]

At each of the relevant times, between 38 and 60 employees were available.

[52, 43, 49, 42, 42, 38, 53, 60, 47, 54, 52, 48, 47, 44, 44, 56, 45, 45]

But who was available during ALL times?

had_opportunity = set(opportunities[0])
for o in opportunities:
    had_opportunity = had_opportunity.intersection(o)
had_opportunity
{'27', '30'}
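The same intersection can be written in a single call. This is only an equivalent sketch: the helper name common_members is hypothetical and the sample lists are made up.

```python
def common_members(groups):
    """Return the members present in every group (empty set if no groups)."""
    groups = [set(g) for g in groups]
    return set.intersection(*groups) if groups else set()

sample_opportunities = [['27', '30', '12'], ['30', '27'], ['5', '27', '30']]  # made up
common = common_members(sample_opportunities)
```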

Only two employees! Maybe they are colluding. But if not, one of them would have been a witness when their officemate’s computer was compromised. Looking at the traffic records, we can narrow the list to suspects whose own officemate’s computer sent suspicious traffic, since they could use it without being caught by an officemate in the room:

def not_caught_by_officemate(suspects, suspicious_events):
    # IPs of the computers that sent the suspicious traffic
    sent_traffic = {x['SourceIP'] for x in suspicious_events}
    still_suspect = set()
    for s in suspects:
        suspect_office = [e['Office'] for e in employeeData if e['EmployeeID'] == s][0]
        officemate_ip = [e['IP'] for e in employeeData if e['Office'] == suspect_office and e['EmployeeID'] != s][0]
        # Keep the suspect only if their officemate's computer was one of the senders
        if officemate_ip in sent_traffic:
            still_suspect.add(s)
    return still_suspect
not_caught_by_officemate(had_opportunity, suspicious_ip_traffic)
{'30'}

Which leaves only Employee #30 as the guilty party.

Bonus: Real-Time Automated Detection

thatDot Categorical Anomaly Detection doesn’t have to be used only on static data. In fact, it’s meant to run on live streaming data and to produce results immediately, in real time. What would the experience be like for a person monitoring this data live as it came in? Could they have stopped the malicious insider sooner?

This image shows the “Results” section of thatDot’s Categorical Anomaly Detection tool, which updates live with all the recent results:

This plot shows the scores as they were produced in real time; higher dots correspond to more novel data. The colors in this plot indicate how unique each behavioral observation was (dark red means we had never seen that specific observation before). You can see from the x-axis that it only shows the most recent results. Another plot on this page shows the highest-scoring results overall:

Since these results are the real-time streaming results, each observation is scored immediately without any knowledge of what data will arrive after it. This system uses past data to understand how novel new data is, so at the beginning of the stream, there is a lot of novel data. Those early high scores are genuinely novel at that moment, given what little has been seen so far, but for those of us with knowledge of the whole dataset, we know that is not enough data to have a representative sample of what is yet to come.

After about 25k observations in this dataset, the novelty scoring system has automatically learned what this data looks like. Notice that no data labeling is required! Simply turn on the system and let it observe a representative sample of your data. Ignore the early results while it is automatically training to fit the data. In a real-world scenario, this system would likely be running long before the insider began their malicious activity, so the early tuning would have long since been accomplished.
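Ignoring the early results can be as simple as skipping a warm-up period before raising alerts. The sketch below is an assumption about how a consumer of the score stream might do that, not a feature of the tool: the alerts function is hypothetical, the 25,000 warm-up default comes from the figure above, the 0.9 threshold is the cutoff used elsewhere in this post, and the sample stream is made up.

```python
def alerts(scored_stream, warmup=25_000, threshold=0.9):
    """Yield (sequence, result) for high scores seen after the warm-up period."""
    for seq, result in enumerate(scored_stream):
        if seq >= warmup and result['score'] > threshold:
            yield seq, result

# Made-up stream with a tiny warm-up so the effect is visible
stream = [{'score': 0.95}, {'score': 0.2}, {'score': 0.1}, {'score': 0.94}]
found = list(alerts(stream, warmup=2, threshold=0.9))
```

The first high score is dropped because it arrives during the warm-up, while the later one is reported along with its position in the stream.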

A security analyst observing this stream of results would regularly see low-scoring results they would easily ignore during the course of normal business. But the moment the malicious insider sneaked onto an employee’s computer, it would produce the first top-scoring result:

The observer would have a clear signal at the first occurrence of malicious activity: in this case, when Employee #30 used their office-mate’s computer while #31 was in the classified space. That warning would come with enough information for an observer to understand the context of what is happening, why it is unusual, and where to go RIGHT NOW to catch the malicious insider red-handed! If the observer missed this opportunity or needed to wait for more evidence, subsequent malicious activity on other computers would give the authorities additional opportunities and evidence.

This Is Just the Beginning…

Categorical Anomaly Detection is a major new innovation in cyber-security threat detection. It comes out of years of DARPA-funded R&D with some of the world’s foremost experts. As shown here, it’s a game-changer for insider threat detection, but it doesn’t stop here! This tool can be used in a wide variety of other domains to measure, in real time, how unusual each piece of data is. Our blog includes other examples, like detecting stolen credential use or data exfiltration in the cloud.

While anomaly detection algorithms have been around for decades, there has never been an effective way to apply them to categorical data… until now. The first of its kind, thatDot’s Categorical Anomaly Detection is currently being used to detect cyber-security threats in real time, reduce the cost of data analysis tools (e.g., by analyzing only the unusual data), find fraudulent activity, detect stolen credentials, analyze log data, audit financial transactions, and much more.