thatDot

Quine 2.0 Released!

2026-05-20T00:00:00+00:00

We are thrilled to announce the release of Quine 2.0! With major updates to key enterprise integrations, A.I. capabilities, and UI/UX enhancements, Quine continues to push beyond what's possible anywhere else for understanding high-volume data in real-time.

The Quine 2.0 Thesis
Your A.I. tools need reliable reasoning to be useful in the enterprise. Hallucinations rob the confidence from an otherwise transformative A.I. initiative. Quine is the reliable reasoning engine which anchors those A.I. agents with an always-up-to-date, perfectly reliable, and completely traceable context graph for enterprise agent workflows. Our largest customers are using "Reverse-RAG" with Quine to turn real-time event streams into 100% confident conclusions at scale for critical business processes. Quine is used throughout the Fortune 500 at the world's leading banks, cybersecurity companies, and many U.S. government agencies and allied partners to ensure employees and agents always have the right context, at the right time, to make the right decision, when it matters most.
One of the world's leading global banks is using Quine to process hundreds of thousands of infrastructure events per second, routing employees to the right resources in real time, before the underlying data shifts and the recommendation becomes stale. The goal is moving from "siloed data and teams" to "real-time accessible intelligence that fuels automation and makes proactive recommendations."
Whether the challenge is dynamically provisioning IT infrastructure at scale, adversaries on the global stage, threat actors in the cyber domain, fraudsters, or adversarial A.I., Quine helps you see, understand, and react faster than the situation can evolve. It plugs into high-volume data streams and "connects the dots" to build a live streaming graph from millions of events every second. Data agents efficiently monitor that graph to find critical patterns immediately and run reliable algorithms to feed the perfect context for reliable action—all without an army of consultants or forward deployed engineers.
What's New in Quine Enterprise 2.0

Identity, Access, and Audit - Governance and access controls dominate enterprise system deployment concerns. Quine Enterprise 2.0 integrates with many kinds of existing enterprise RBAC systems and provides a tailored experience with permission-based UI rendering for each user's level of access. OIDC authorization, session management, SP 800-53 audit logging, and many other important-but-boring internal details are all done and in place in the latest version of Quine Enterprise.
API V2: the Integration Foundation - Version 2 of the REST API supports RBAC and standardizes on the Google API style guide. You can now transform ingested data with JavaScript before passing it to a Cypher ingest function (no extra microservice to transform streaming data in front of Quine!). Did some bad data make it into your stream? Handle it intelligently and without disruption using the new dead-letter queue.
Operating at Scale with Ease - With a new UI and dashboard, it's easy to see the status of your Quine cluster at a glance, provided you have the proper permissions. Need to add a new ingest stream or standing query? Just use the simplified multiple-choice template on the "Streams" page to set it up in seconds. Cluster stats and status are now neatly visualized on the dashboard.
Optimized for A.I. Coding Tools - If you're using A.I. tools to work with Quine, just point your agent at docs.thatdot.com or the OpenAPI spec built into the docs. Our world-class human-written and human-readable documentation has also been tuned for LLM consumption. Give it a try and you can one-shot your next Quine recipe!
You can see the full list of changes here. If you're upgrading a running Quine instance from a 1.x version, see the migration guide here.
Deep History
In 2022 we released Quine 1.0 as open source software at https://quine.io. A revolutionary advance for graphs+streaming, it was the result of many years of high-level R&D funded by DARPA, CrowdStrike, and other large organizations confronting challenges on the cutting edge of what's possible. Since that release, Quine's user base has expanded dramatically to include many of the world's largest banks and governments, who use Quine for critical applications inside their high-volume enterprise data pipelines. Operating at that scale also requires building the surrounding infrastructure to manage such large deployments and massive data volumes. The open source version of Quine has many new features and quality-of-life improvements, but many of the key capabilities in the 2.0 release are in support of our enterprise users.
Around the Corner
The Quine 2.0 release sets a foundation for many more exciting capabilities right around the corner: deeper A.I. integration, a revolutionary new indexing capability, automated query writing, and more integration sources. The 2.0 release is not just a major milestone; it's the starting gun for the next leg of this exciting race.

The Secret Ingredient in the Alphabet Soup of Cybersecurity

2025-03-04T00:00:00+00:00


This is the first in a series of blogs exploring how the Quine Streaming Graph analytics engine is the secret ingredient in the Alphabet Soup of cybersecurity, enabling faster, more accurate detection of complex threats without compromising on the type or volume of data analyzed, the fidelity of alerts, or response time.
The Dilemma of Data in Cybersecurity
As we all know, the letter combinations in cybersecurity continue to grow, sometimes falling out of view, floating just under the surface, and others rising to the top. These letter combinations include network protection (NDR, NTA, ID/PS), endpoint (XDR, FIM, EPP, EDR, HIPS), or cloud (CWPP, CSPM, and CNAPP). Despite their diversity, these solutions all face a shared challenge: the amount of data they need to analyze and how to go about it.
Including the correct information in the analysis process is a delicate balance - like that right balance of herbs and spices in our favorite meal. It is no simple task to determine which data to analyze and how to do it efficiently without the risk of false positives/negatives. The current approach is to look at it in subsets and cohorts, but never holistically. In some cases, this decision is warranted; the data is irrelevant - that ingredient simply does not go into our meal. However, this process frequently results in context being left on the proverbial cutting board, and the data so watered down it is useless.
A New Paradigm: Data Analysis Without Compromise
Imagine the following what-ifs: What if we didn’t have to exclude relevant data? What if you did not need to leave relevant data on the cutting board? What if we could analyze all data for any time - past, present, or future?
With thatDot’s Quine Streaming Graph, you can continuously analyze real-time and historical data at scale to identify complex patterns and enable your solution to trigger an action within milliseconds.
This enables product owners to reenvision current features and approaches—moving from periodic batch processing to real-time analytics. For cybersecurity vendors, this changes the game. Instead of relying on batch processing or overlooking key data for speed, you can achieve instant notifications to trigger mitigation and containment routines.
What’s Next?
There are various ways we intend to explore adding thatDot to cybersecurity solutions to see what we can cook up. Today, each of the following is either not done adequately or is only viable with lots of development time, custom code, and homegrown analysis pipelines:

Identifying attack paths
Triggering immediate response
Continuously enriching event data
Identifying even the most latent of patterns
Real-time as well as “point-in-time” visibility
Context-aware threat intelligence
Real-time MITRE TTP awareness
These problems can be solved in other ways; however, how long does it take to develop each successive detection pipeline? Other than time to market, what else are you giving up? Are you avoiding questions that are too complex, or inspecting only a tiny sliver of time? Or just settling for pseudo and “near” real-time analysis? Let’s explore Quine, drop some of those qualifiers, and make something great!
Learn More
Check out these resources:

Are You Ready for Low and Slow Auth Attacks Blog Post
Quine for cybersecurity and fraud use cases
Download Quine

Stream Processing World Meets Streaming Graph at Current 2024

2024-09-24T00:00:00+00:00

The thatDot team had a great time last week at Confluent’s big conference, Current 2024. Attendees and exhibitors alike loved the thatDot Frisbees. Our apologies to anyone who may have been hit. We spoke with multiple attendees and learned about the challenges the stream processing community faces when trying to do their jobs with KSQLDB or Flink in their data pipelines. We helped them understand how thatDot would fit into their architecture and how our approach to stream processing - thatDot Streaming Graph - can scale to meet their demands while solving deeper problems that challenge the key-value/relational models of other stream processors.

Will Flink fail at practical stream processing like Streams and KSQLDB did?
Some attendees mentioned that over the years Confluent has introduced two other technologies intended to provide stream processing in Kafka pipelines. The first technology introduced for event stream processing was Kafka Streams, but people were not satisfied with its capabilities. Then KSQLDB was the way to go. But that didn’t work quite as advertised, either.
Now Flink is the new technology touted to solve the problems of previous stream processing engines. Yet, Flink practitioners know that the complexity inherent in Flink operations necessitates a high level of expertise to run it. Out-of-memory errors due to making time windows too wide, or trying to join too many things across data streams were common problems people reported. We heard a repeated question from attendees about how long it will likely be until Confluent starts looking for the next stream processing technology.
As an industry, we keep doing the same thing over and over again and expecting a different result–the definition of insanity. By continuing to use the same relational key/value type of mindset to process data with every stream processor, we continue to run into the same problems.

Advantages of stream processors and graph data models
For years, people have understood the power of graph data to connect the dots and see the big contextual picture. But graph databases have the same problem as any other database: the data is no longer real-time streaming. It’s at rest. That inherently makes data analysis too slow for some of the most important and urgent actions a stream processor is needed for, like catching cybersecurity intrusions or stopping a fraudulent transaction.
Ryan Wright always says, "Answers now are always better than answers later." Graph analysis that can work with modern data volumes at stream processor speed is a paradigm shift. You get the answers to deep questions fast. Instead of finding out months or even days later that you were breached, or your company was ripped off, you can stop problems before they cost you.
Advantages of thatDot Streaming Graph
Some of the advantages that Current attendees told us they found most compelling about thatDot Streaming Graph included:

Simplicity at scale - Flink has to manage current state and do complex logic for fault tolerance such as checkpoints/save points. Streaming graph doesn’t require any of that. Dynamic graph technologies don’t require state management, and high availability in our stream processor is more automatic. thatDot is far simpler to use, even at high scale.
Unlimited joins - For a relational key/value data model like Flink uses, multiple joins are difficult and memory intensive. For thatDot’s graph data model, unlimited joins in stream processing are the normal way we do things.
Categorical analysis - Most analytic tools can only analyze numbers. This means if you want to analyze people, places, events, locations, etc., you have to convert that data into wide, sparse numeric data, rest it in a database, analyze it, then turn it back into categorical data to get a final, understandable, if a bit muddy, answer. Having to rest the data before analysis slows response time hugely, and even then, your answer is likely to be unclear and inaccurate. thatDot analyzes categorical data directly, right in the stream processor.
Time unbound analysis - An event stream processor taking unbounded data streams and chopping them into little time-bounded chunks in order to analyze them has always been a workaround in our opinion. thatDot analyzes the whole data stream as it flows by, with no time window limitations. Even data stored in a file or in a database can be joined with current data. Important points from six months ago can be joined with data from six milliseconds ago to complete a picture and answer an analytical question.
Ryan's talk on "Streaming Entity Resolution for Kafka"

Ryan Wright, our founder and CTO, did a very cool presentation at Current on entity resolution in stream processing which caught a lot of attention, especially from data engineers and anyone working toward real-time master data management. That’s live on the Current site now. Be sure to check it out: https://current.confluent.io/2024-sessions/streaming-entity-resolution-for-kafka-with-quine
To learn more, check out the thatDot Streaming Graph product page.
Or, try the Streaming Graph or Novelty free trial for yourself.
Get the handouts we gave to Current attendees.
And be sure to catch the thatDot team at Current 2025!


Streaming Graph Get Started

2024-07-23T00:00:00+00:00

It's been said that graphs are everywhere. Graph-based data models provide a flexible and intuitive way to represent complex relationships and interconnectedness in data. They are particularly well-suited for scenarios where relationships and patterns are important, but until recently, they have been confined to a handful of use cases – databases, chip design, information theory, AI – that all have one thing in common: the data in question is stored first and then processed, usually as a batch job.
In other words, the data in these use cases is at rest. However, what about data in motion, data in event-driven use cases that is constantly changing and being transmitted? As event-driven applications and operational intelligence scenarios, such as real-time monitoring, situational intelligence, and fraud detection, continue to expand rapidly, graph data models and the primary query language used for them, Cypher, are proving much more versatile than SQL and the best tools for the task.
Consider the challenge of extracting insights from a complex event stream. The stream may have high volume and velocity, require the correlation of events by context from multiple sources, contain meaningful event patterns, and have a short timeframe to identify, detect, and take action. Addressing these challenges requires efficient data processing, scalable infrastructure, and effective event modeling techniques in graph solutions and Cypher.
Graph databases are useful for batch processing a portion of a complex event stream to provide macro-level insights and metrics to understand events but not take action. The same concepts (patterns and algorithms) used in graph databases when the event stream is at rest can be applied to an event stream while in motion using a streaming graph like Quine, often directly reusing the Cypher written in the database. Here’s how Cypher addresses the challenges found in complex event streams:

Pattern Matching: Cypher excels at pattern matching, allowing you to detect patterns (sub-graphs) in the event stream. This is particularly useful for identifying sequences of events or detecting specific patterns, allowing you to efficiently filter and process relevant events based on their relationships and properties.
Event Correlation: You can define relationships between events and other entities, such as users, devices, or locations. This enables you to correlate events based on common attributes or shared relationships, often with high cardinality and a mix of categorical and numerical data, to identify patterns, anomalies, or complex dependencies.
Time-based Queries: Cypher provides temporal capabilities, allowing you to query and analyze events based on their timestamps or time intervals. You can filter events based on specific time ranges, compare temporal values, and perform time-based aggregations. This enables you to process time-dependent patterns, detect trends, and perform time window-based computations on the event stream.
Real-time Insights: You can continuously execute Cypher queries on an incoming event stream, allowing for dynamic analysis and near real-time decision-making. This enables you to monitor, detect patterns, and trigger actions based on the evolving stream of events.
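The event-correlation and pattern-matching ideas in the list above can be sketched in Cypher roughly as follows. This is a minimal illustration, not one of the published Quine recipes: the event shape (eventId, userId, deviceId, timestamp), the node labels, and the PERFORMED and FROM_DEVICE relationship names are all hypothetical. It assumes the first statement runs as a Quine ingest query, where $that holds the incoming record and idFrom() derives a deterministic node ID from its arguments, and that the timestamp is an ISO-8601 string.

```cypher
// Ingest query (sketch): turn one hypothetical login event into graph structure.
// Assumed record shape: {"eventId": "...", "userId": "...", "deviceId": "...", "timestamp": "..."}
MATCH (event), (user), (device)
WHERE id(event)  = idFrom('event',  $that.eventId)
  AND id(user)   = idFrom('user',   $that.userId)
  AND id(device) = idFrom('device', $that.deviceId)
SET event:event,   event.time = datetime($that.timestamp),
    user:user,     user.userId = $that.userId,
    device:device, device.deviceId = $that.deviceId
CREATE (user)-[:PERFORMED]->(event)-[:FROM_DEVICE]->(device)

// Separate query (sketch): correlate two different users through a shared device.
MATCH (u1:user)-[:PERFORMED]->(e1:event)-[:FROM_DEVICE]->(d:device)<-[:FROM_DEVICE]-(e2:event)<-[:PERFORMED]-(u2:user)
WHERE id(u1) <> id(u2)
RETURN DISTINCT strId(d) AS sharedDevice, u1.userId AS userA, u2.userId AS userB
```

Because idFrom() returns the same ID for the same inputs, every event that mentions a given user or device attaches to the same node, which is what allows the second query to join two events through the device they share.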
Event Pattern Detection
Specifying a pattern (sub-graph) to MATCH can identify specific sequences of events or combinations of events of interest. For example, when observing the efficiency of cache nodes in a CDN network, Cypher can easily identify when a series (10) of cache misses occurs and send an alert to the NOC to trigger an investigation.
The Cypher required to detect a MISS event only needs to identify the node types and relationships as a pattern.
MATCH (server1:server)<-[:TARGETED]-(event1 {cache_class:"MISS"})-[:REQUESTED]->(asset)<-[:REQUESTED]-(event2 {cache_class:"MISS"})-[:TARGETED]->(server2:server)
RETURN DISTINCT id(event1) AS event1
Then, additional Cypher processes the event to take action, recording it as a metric or sending an alert if the metric constraint is exceeded. This technique is demonstrated in the CDN Observability recipe.
An unexpected challenge
Cypher can respond to changes in the event stream in real time, allowing organizations to reduce the risk associated with a condition's duration before it is analyzed and addressed. For example, the Financial Risk Calculation recipe models market changes in real time so that organizations can provide adequate coverage for risk exposure while ensuring their regulatory compliance minimally affects their asset allocation. As basic patterns are matched, results are passed to business logic written in Cypher to generate an adjusted trading value, correlate (roll-up) trading events across the network, and trigger an alert when the trading system is out of compliance.
When a pattern match query detects an investment pattern, it triggers an output query to process the StandingQueryResult. For example, the result returned from an investment pattern in Cypher:
MATCH (investment:investment)<-[:HOLDS]-(desk:desk)<-[:HAS]-(institution:institution)
RETURN DISTINCT id(investment) AS id
triggers business logic in Cypher to generate a new property with a value based on the node's investment.class property.
SET investment.adjustedValue = CASE
  WHEN investment.class = '1' THEN investment.value
  WHEN investment.class = '2a' THEN investment.value * .85
  WHEN investment.class = '2b' AND investment.type = 9 THEN investment.value * .75
  WHEN investment.class = '2b' AND investment.type = 10 THEN investment.value * .5
END
The investment events are then correlated through a roll-up function for each investment type.
UNWIND [["1","adjustedValue1"], ["2a","adjustedValue2a"], ["2b","adjustedValue2b"]] AS stuff
WITH institution,investment,desk,stuff
WHERE investment.class = stuff[0]
CALL float.add(institution,stuff[1],investment.adjustedValue) YIELD result AS institutionAdjustedValueRollupByClass
CALL float.add(institution,"totalAdjustedValue",investment.adjustedValue) YIELD result AS institutionAdjustedValueRollup
CALL float.add(desk,stuff[1],investment.adjustedValue) YIELD result AS deskAdjustedValueRollupByClass
CALL float.add(desk,"totalAdjustedValue",investment.adjustedValue) YIELD result AS deskAdjustedValueRollup
SET institution.percentAdjustedValue2 = ((institution.adjustedValue2a + institution.adjustedValue2b)/institution.totalAdjustedValue) * 100,
    institution.percentAdjustedValue2b = (institution.adjustedValue2b/institution.totalAdjustedValue) * 100
Temporal Analysis
With Cypher, you can express temporal conditions, such as events occurring within a specific time window, events happening before or after certain events, or events falling into a particular time range. This enables temporal analysis of event streams, including trend analysis, time-based aggregations, and windowed computations.
For example, the temporal locality recipe looks for emails sent or received by cto@company.com within a four to six-minute sliding window. The pattern query matches each individual (sender)-[:SENT_MSG]->(message)-[:RECEIVED_MSG]->(receiver) pattern containing the CTO’s email address.
MATCH (n)-[:SENT_MSG]->(m)-[:RECEIVED_MSG]->(r)
WHERE n.email="cto@company.com" OR r.email="cto@company.com"
RETURN id(n) as ctoId, id(m) as ctoMsgId, m.time as mTime, id(r) as recId
The recipe then calculates the duration between the emails to generate a sub-graph containing messages that went to or from the CTO within the time window.
MATCH (n)-[:SENT_MSG]->(m)-[:RECEIVED_MSG]->(r), (thisMsg)
WHERE id(n) = $that.data.ctoId
  AND id(r) = $that.data.recId
  AND id(thisMsg) = $that.data.ctoMsgId
  AND id(m) <> id(thisMsg)
  AND duration("PT6M") > duration.between(m.time,thisMsg.time) > duration("P
CREATE (m)-[:IN_WINDOW]->(thisMsg)
CREATE (m)<-[:IN_WINDOW]-(thisMsg)
WITH n, m, r, "http://localhost:8080/#MATCH" + text.urlencode(' (n)-[:SENT_MSG]->(m)-[:RECEIVED_MSG]->(r) WHERE strId(n)="' + strId(n) + '"AND strId(r)="' + strId(r) + '" AND strId(m)="' + strId(m) + '" RETURN n, r, m') a
RETURN URL
Conclusion
Cypher is a powerful and expressive query language well-suited for processing complex event streams. Quine streaming graph enables Cypher developers to leverage graph techniques early when processing a complex event stream to aggregate and shape events, detect patterns for alerting and early feedback, and perform event normalization before entering the data warehouse.
Learn more and Try Quine
If you want to try Quine using your own data, here are some resources to help:
Learn more about Quine by visiting the Quine open source project.
Download Quine - JAR file | Docker Image | Github
Check out the Financial Risk Calculation recipe to see how Cypher is used for real-time rollups.
Check out demos and other videos at our YouTube channel.

Streaming Graph for Real-Time Risk Analysis at Data Connect in Columbus 2024

2024-07-23T00:00:00+00:00

After more than 25 years in the data management and analysis industry, I had a brand new experience. I attended a technical conference. No, that wasn’t the new thing. At many conferences, I’ve been surrounded by data scientists, business analysts, data engineers, mathematicians, developers, startup founders, CTOs, architects, and PhD students, made network connections, and listened to giants in the field, like the Chief of Information Management of the United Nations at this one. But, uniquely, at this one conference, Data Connect, organized by Women in Analytics, 9 out of 10 of those leaders in the field were women, and all the speakers were women or a gender minority.
It was a soul-filling feeling. Sometimes, it can feel isolating to be a woman in a technical field, but for 2 days, I was surrounded by smart, capable women encouraging each other and talking shop. I got a copy of Low Code AI signed by Dr. Gwendolyn Stripling, who was one of the coolest people we met there, and Unmasking AI by Joy Buolamwini, both of whom gave brilliant presentations.
Risky real-time risk analysis presentation
For my presentation, I talked about a way to do powerful risk analysis in real-time. Not too surprisingly, the method used thatDot Streaming Graph. What was surprising is that I went out of my comfort zone for this deeply technical audience; I did a live demo of the risk analysis recipe. Live demos are always a bit nerve-wracking at conferences, and having never done one before with thatDot tech, well … talk about risky.
The presentation defined risk analysis and pointed to the failures of Washington Mutual, the largest bank failure in US history, and Silicon Valley Bank last year, the second largest bank failure in US history. Both were due largely to poor risk management. Those are just two of the over 550 banks that have failed since the turn of the century.
Between a relaxation of government oversight and less than ideal risk calculation, we’re lucky our economy is still functioning. Since the government regulations aren’t my area, I focused on the problems with current risk analysis methods, mainly that they’re batch and often take 24 or more hours to complete. Even longer if you and your bank HQ aren’t in the same time zone. Since many trades or investments have a regulated time during which a bank can decide to accept the risk and approve the trade or not, usually 24 hours, slow batch processing can expose them far too much. Most financial institutions are shifting to graph analysis for the entity type of categorical analysis needed, but graph databases don’t scale well to the levels large banks require. Event stream processors scale just fine and are real-time by nature, but they have difficulty with the kind of deep graph analysis. So, you need something with powerful graph analytics at event stream processor speeds to get to real-time risk analysis. The risk analysis recipe uses simulated data, but does a good job of showing the speed of analysis and how it could be done. The presentation was well-received with one person coming up and telling me they thought it was the best presentation of the conference. Wow. Now, that’s a heck of a compliment considering the caliber of presenters. I’m looking forward to going to Data Connect again next year, and if you want to learn more about data analysis and data management, don’t miss it. The Power of Real-Time Entity Resolution with Ryan Wright 2024-07-08T00:00:00+00:00 Ever wondered why duplicate records keep slipping through your data streams? This September, thatDot's Founder and CEO Ryan Wright will be addressing this critical issue at Current. Data inconsistencies in Kafka streams, such as misspelled company names, users registering with different email addresses, or multiple bank accounts linked to the same person, can present significant challenges. 
These issues not only hinder the adoption of streaming data technologies but also impact organizations across the spectrum, from major banks to small startups. Recent advancements in open-source streaming graph tools, like thatDot's Streaming Graph powered by Quine Open Source, have made it easier to clean and resolve data in real-time. These tools offer powerful entity resolution at scale, even as data flows in motion and potentially out of order. Event Details Title: Streaming Entity Resolution for Kafka with Quine Dates: September 17, 2024 Time: Tue Sep 17, 1:30 PM - 1:40 PM CDT (10 Min) Register for the Event</a> Join us on Tuesday, September 17, from 1:30 PM to 1:40 PM CDT in Breakroom 5 for an illuminating lightning talk with Ryan Wright, as we delve into two cutting-edge approaches to real-time entity resolution using the Quine Open Source streaming graph</a>: Viewing Your Stream as a Graph: Leveraging event-triggered "standing queries" for real-time entity resolution.</li> AI-Powered Entity Resolution: Using historical stream data to enable AI-driven resolution with graph neural networks.</li> </ol> In both scenarios, you'll witness how a Kafka stream filled with messy data can be transformed into a clean, entity-resolved output. Don't miss this opportunity to learn how to enhance your data streams and drive better insights. Stay tuned for more updates on this event. For more details on the event and speakers, visit Current 2024 Speakers</a>. Cypher all the things! 2024-07-03T00:00:00+00:00 It's been said that graphs are everywhere</a>. Graph-based data models provide a flexible and intuitive way to represent complex relationships and interconnectedness in data. 
They are particularly well-suited for scenarios where relationships and patterns are important, but until recently, they have been confined to a handful of use cases – databases, chip design, information theory, AI – that all have one thing in common: the data in question is stored first and then processed, usually as a batch job. In other words, the data in these use cases is at rest. However, what about data in motion, data in event-driven use cases that is constantly changing and being transmitted? As event-driven applications and operational intelligence scenarios, such as real-time monitoring, situational intelligence, and fraud detection, continue to expand rapidly, graph data models and the primary query language used for them, Cypher, are proving much more versatile than SQL and the best tools for the task. Consider the challenge of extracting insights from a complex event stream. The stream may have high volume and velocity, require the correlation of events by context from multiple sources, contain meaningful event patterns, and have a short timeframe to identify, detect, and take action. Addressing these challenges requires efficient data processing, scalable infrastructure, and effective event modeling techniques in graph solutions and Cypher. Graph databases are useful for batch processing a portion of a complex event stream to provide macro-level insights and metrics to understand events but not take action. The same concepts (patterns and algorithms) used in graph databases when the event stream is at rest can be applied to an event stream while in motion using a streaming graph like Quine, often directly reusing the Cypher written in the database. Here’s how Cypher addresses the challenges found in complex event streams: Pattern Matching: Cypher excels at pattern matching, allowing you to detect patterns (sub-graphs) in the event stream. 
This is particularly useful for identifying sequences of events or detecting specific patterns, allowing you to efficiently filter and process relevant events based on their relationships and properties.</li> Event Correlation: You can define relationships between events and other entities, such as users, devices, or locations. This enables you to correlate events based on common attributes or shared relationships, often with high cardinality and a mix of categorical and numerical data, to identify patterns, anomalies, or complex dependencies.</li> Time-based Queries: Cypher provides temporal capabilities, allowing you to query and analyze events based on their timestamps or time intervals. You can filter events based on specific time ranges, compare temporal values, and perform time-based aggregations. This enables you to process time-dependent patterns, detect trends, and perform time window-based computations on the event stream.</li> Real-time Insights: You can continuously execute Cypher queries on an incoming event stream, allowing for dynamic analysis and near real-time decision-making. This enables you to monitor, detect patterns, and trigger actions based on the evolving stream of events.</li> </ul> Event Pattern Detection</h2> Specifying a pattern (sub-graph) to MATCH can identify specific sequences of events or combinations of events of interest. For example, when observing the efficiency of cache nodes in a CDN network, Cypher can easily identify when a series (10) of cache misses occur and send an alert to the NOC to trigger an investigation. The Cypher required to detect a MISS event only needs to identify the node types and relationships as a pattern. 
MATCH (server1:server)<-[:TARGETED]-(event1 {cache_class:"MISS"})-[:REQUESTED]->(asset)<-[:REQUESTED]-(event2 {cache_class:"MISS"})-[:TARGETED]->(server2:server) RETURN DISTINCT id(event1) AS event1</code></pre> Then, additional Cypher processes the event to take action, recording it as a metric or sending an alert if the metric constraint is exceeded. This technique is demonstrated in the CDN Observability</a> recipe. Graph-Based Event Correlation</h2> Cypher can respond to changes in the event stream in real time, allowing organizations to reduce the risk associated with a condition's duration before it is analyzed and addressed. For example, the Financial Risk Calculation</a> recipe models market changes in real time so that organizations can provide adequate coverage for risk exposure while ensuring their regulatory compliance minimally affects their asset allocation. As basic patterns are matched, results are passed to business logic written in Cypher to generate an adjusted trading value, correlate (roll up) trading events across the network, and trigger an alert when the trading system is out of compliance. When a pattern match query</a> detects an investment pattern, it triggers an output query</a> to process the StandingQueryResult</code>. For example, the result returned from this investment pattern in Cypher: MATCH (investment:investment)<-[:HOLDS]-(desk:desk)<-[:HAS]-(institution:institution) RETURN DISTINCT id(investment) AS id triggers business logic in Cypher that generates a new property whose value is based on the node's investment.class property. SET investment.adjustedValue = CASE WHEN investment.class = "1" THEN investment.value WHEN investment.class = "2a" THEN investment.value * .85 WHEN investment.class = "2b" AND investment.type = 9 THEN investment.value * .75 WHEN investment.class = "2b" AND investment.type = 10 THEN investment.value * .5 END</code></pre> The investment events are then correlated through a roll-up function for each investment type. 
UNWIND [["1","adjustedValue1"], ["2a","adjustedValue2a"], ["2b","adjustedValue2b"]] AS stuff WITH institution,investment,desk,stuff WHERE investment.class = stuff[0] CALL float.add(institution,stuff[1],investment.adjustedValue) YIELD result AS institutionAdjustedValueRollupByClass CALL float.add(institution,"totalAdjustedValue",investment.adjustedValue) YIELD result AS institutionAdjustedValueRollup CALL float.add(desk,stuff[1],investment.adjustedValue) YIELD result AS deskAdjustedValueRollupByClass CALL float.add(desk,"totalAdjustedValue",investment.adjustedValue) YIELD result AS deskAdjustedValueRollup SET institution.percentAdjustedValue2 = ((institution.adjustedValue2a + institution.adjustedValue2b)/institution.totalAdjustedValue) * 100, institution.percentAdjustedValue2b = (institution.adjustedValue2b/institution.totalAdjustedValue) * 100</code></pre>Temporal Analysis</h2> With Cypher, you can express temporal conditions, such as events occurring within a specific time window, events happening before or after certain events, or events falling into a particular time range. This enables temporal analysis of event streams, including trend analysis, time-based aggregations, and windowed computations. For example, the temporal locality</a> recipe looks for emails sent or received by cto@company.com within a four to six-minute sliding window. The pattern query matches each individual (sender)-[:SENT_MSG]->(message)-[:RECEIVED_MSG]->(receiver)</code> pattern containing the CTO’s email address. MATCH (n)-[:SENT_MSG]->(m)-[:RECEIVED_MSG]->(r) WHERE n.email="cto@company.com" OR r.email="cto@company.com" RETURN id(n) as ctoId, id(m) as ctoMsgId, m.time as mTime, id(r) as recId</code></pre> And then calculates the duration between the emails to generate a sub-graph containing messages that went to or from the CTO within the time window. 
MATCH (n)-[:SENT_MSG]->(m)-[:RECEIVED_MSG]->(r), (thisMsg) WHERE id(n) = $that.data.ctoId AND id(r) = $that.data.recId AND id(thisMsg) = $that.data.ctoMsgId AND id(m) <> id(thisMsg) AND duration("PT6M") > duration.between(m.time,thisMsg.time) > duration("PT4M") CREATE (m)-[:IN_WINDOW]->(thisMsg) CREATE (m)<-[:IN_WINDOW]-(thisMsg) WITH n, m, r, "http://localhost:8080/#MATCH" + text.urlencode(' (n)-[:SENT_MSG]->(m)-[:RECEIVED_MSG]->(r) WHERE strId(n)="' + strId(n) + '" AND strId(r)="' + strId(r) + '" AND strId(m)="' + strId(m) + '" RETURN n, r, m') AS URL RETURN URL</code></pre>Conclusion</h2> Cypher is a powerful and expressive query language well-suited for processing complex event streams. The Quine streaming graph enables Cypher developers to apply graph techniques early when processing a complex event stream: to aggregate and shape events, detect patterns for alerting and early feedback, and normalize events before they enter the data warehouse. Learn more and Try Quine</h2> If you want to try Quine using your own data, here are some resources to help: Learn more about Quine by visiting the Quine open source project</a>.</li> Download Quine - JAR file</a> | Docker Image</a> | GitHub</a></li> Check out the Financial Risk Calculation recipe</a> to see how Cypher is used for real-time rollups.</li> Check out demos and other videos at our YouTube channel</a>.</li> </ol> thatDot CEO Explains Streaming Graph to Cybersecurity Thought Leader 2024-07-02T00:00:00+00:00 Briefing Room on-demand webinar on the thatDot YouTube channel: The Unreasonable Effectiveness of Streaming Graph</a> thatDot founder and CEO Ryan Wright discussed the power of thatDot Streaming Graph and Novelty to detect the most well-hidden threats with the Bloor Group's Eric Kavanagh and Mark Lynd, who was ranked #1 global thought leader in cybersecurity by Thinkers360. 
With high-profile data breaches hitting the headlines every other day now, the way we're doing this is clearly a losing battle. Low and slow attacks like advanced persistent threats hiding in mountains of data are stealing whatever they want and many cyber professionals are just throwing up their hands and admitting defeat. DARPA funded thatDot technology development specifically to turn the tables on those threats. This webinar provides what you need to know to change the game to one where the attacker must be perfect to have a chance. Just one step out of line will get them caught. To quote Mark Lynd, "This is the holy grail." and "It takes us from reactive to proactive cybersecurity." Traditional graph data models offer depth but lack the immediacy required to outpace cybercriminals or the scale and processing speed needed to keep up with massive flows of information cyber professionals need to evaluate. With insightful questions from Mark to guide him, Ryan really goes deep on the power of this technology in the cybersecurity space. He provides some potent demonstrations of points like: The power of graph for relationship analytics.</li> Scaling and speed on direct graph analysis of categorical data providing real time threat detection.</li> Moving left so that cybersecurity analysis is done on data pipelines in real time.</li> Reducing false positives with context awareness for anomaly detection.</li> </ul> This powerful tech is useful for many things, from digital twins</a> to fraud detection</a>, but is particularly powerful in the threat detection and anomaly detection space for cybersecurity. Watch this exceptional video on the thatDot Youtube channel</a>. Learn for yourself how to bring graph-driven reasoning into the real-time nature of event-driven processing in the cybersecurity stream. Watch The Unreasonable Effectiveness of Streaming Graph</a>. 
Streaming Graph Processing on Categorical Data Enables Real-time Risk Calculation 2024-07-01T00:00:00+00:00 The failure of Silicon Valley Bank in 2023 exemplifies the severe consequences of not accurately assessing risk in a timely manner. Although nearly every financial institution prioritizes risk minimization, their methods for calculating risk often rely on detailed analysis of categorical data and relationships. Most existing algorithms, however, only handle static, numeric data. This requires transforming the data, typically through methods like one-hot encoding, into numerical formats that are bulky, sparse, and slow to process. After analysis, the data often needs to be converted back to its original categories, adding to the inefficiency. Current state-of-the-art solutions take hours to deliver insights. If we could perform this analysis earlier in the process, using the original categorical data as it streams in without modification, we could reduce the mean time to insight to seconds, potentially saving financial institutions significant amounts of money. This approach could also enable new capabilities, such as using graph NLP on streaming data to identify novel behaviors and detect anomalies like cyber-attacks before they impact systems. The combination of fast, in-line data processing engines like Flink or KsqlDB with graph algorithms and categorical analysis is exceptionally powerful. Join us to learn about a new open-source streaming intelligence system that revolutionizes risk analysis and other fast categorical data processing. Event Details: Title: Streaming Graph Processing on Categorical Data Enables Real-time Risk Calculation Date: July 12, 2024 Time: 10:45am - 11:20am ET Register for the Event</a> Why You Should Attend</h2> Attending Paige's speech at Data Connect 2024 is a must for anyone serious about staying at the forefront of data science and risk management. 
Paige will unveil groundbreaking techniques for real-time risk analysis, demonstrating how to cut mean time to insight from hours to seconds. This shift can save financial institutions substantial costs and enhance their ability to detect anomalies, including cyber-attacks, before they cause harm. Paige will explore the synergy of in-line data processing engines like Flink or KsqlDB with advanced graph algorithms and categorical analysis. By attending, you'll gain invaluable insights into innovative data processing methods that can revolutionize your organization's approach to risk and data management. Don't miss this opportunity to learn from a leading expert and enhance your strategic capabilities in the evolving data landscape. Akka to Pekko Migration for thatDot and Quine 2024-06-20T00:00:00+00:00 You don’t know what you’ve got till it’s gone. Musicians have sung this lament about relationships and the beauty of nature. It turns out to be true about open source software licenses as well. On September 7, 2022, Lightbend announced that they were changing the license for Akka from the open source Apache License 2.0 to the commercial Business Source License 1.1. This had major implications for Akka users. Operators of closed source services built using Akka were faced with a primarily financial dilemma about the cost of licensing compared to the cost of re-implementation. Authors of open source software depending on Akka had to re-evaluate their ability to remain open source themselves. At thatDot</a>, we found ourselves facing both of these challenges. thatDot publishes a streaming graph, Quine</a>, under an open source license. We also host SAAS services like Novelty for AWS that are closed source products built on top of Quine. To continue using new versions of Akka, we would have to re-evaluate Quine’s licensing model, and incur the cost of purchasing licenses from Lightbend for our SAAS services. 
Our immediate solution, like that adopted by many others, was to simply continue using the last version of Akka available under an open source license, version 2.6. This was a time-limited workaround though, since version 2.6 would eventually stop receiving security fixes. It also prevented us from using libraries that themselves required later versions of Akka for their own security fixes, or additional functionality. We needed an open source alternative. What we did</h2> Pekko is an open source fork of Akka hosted by the Apache Software Foundation. It provided us with a path forward that kept a core component connected to an active community without requiring extensive re-writing of our own code. It also gained support from important connector libraries built on top of Akka that released Pekko backed versions. Our migration required two main activities. The first was the modification of our own code that used Akka directly. The second was the replacement of all dependencies with the Pekko versions. The latter proved to be the more difficult one. Modifying our direct dependency on Akka was refreshingly straightforward. We had to replace all imports of akka packages with imports of org.apache.pekko packages, and the akka section of our config files with a pekko section. The bulk of this was accomplished with search and replace using regular expressions. The remaining pieces were found using simple (case-insensitive) searches for “akka”, and manually reviewing and editing the code or comments. For example, comments describing use of an Akka feature were modified, while those referring to discussions in Akka community forums to justify a decision were left unchanged. While this was slightly tedious, it wasn’t hard to work through. An unexpected challenge</h2> The real challenge was replacing libraries to remove all indirect dependencies on Akka. 
Replacing dependencies also required us to unwind the delicate set of indirect dependencies we had pinned to work around vulnerabilities. Migrating dependencies from Akka to Pekko can be done in three ways: Swapping in a drop-in replacement</li> Forking the library and replacing its usage of Akka with Pekko</li> Re-implementing the feature, possibly on a similar library with a Pekko version</li> </ul> In most cases, the community had Pekko equivalents that just worked after changing our build definition and import statements. In others, a Pekko version was not available, so we needed to use an alternative. These cases required us to make non-trivial changes to our code to re-implement the functionality. The community adoption of Pekko made our migration feasible. We only had to drop two libraries that didn’t have Pekko versions, and only lost one feature, Pulsar support. The Pulsar library we were using, pulsar4s, has since added a Pekko version. Benefits</h2> Migrating to Pekko: Allowed us to continue offering Quine with the same license.</li> Reduced the maintenance burden of overriding and testing indirect dependencies to avoid security problems.</li> Avoided extra cost for running our SAAS products.</li> Opened up our ability to continue leveraging new libraries and releases from the community.</li> </ul> Microservice Hell: The State of the Art in Streaming Services 2024-06-19T00:00:00+00:00 The State of the Art</h2> Data lives in many different places. Some of it could live in Apache Kafka, for example, while other bits of important related data could be sourced from server-sent events on a server somewhere. Maybe some of your data even lives in a text file that you need to stream in. Let’s quickly emulate the state of the art and see what it’s like to retrieve some data from Kafka. 
Here is a docker-compose.yaml file that we can use to stand up Kafka: version: '2' services: zookeeper: image: confluentinc/cp-zookeeper:latest environment: ZOOKEEPER_CLIENT_PORT: 2181 ZOOKEEPER_TICK_TIME: 2000 ports: - "2181:2181" kafka: image: confluentinc/cp-kafka:latest ports: - "9092:9092" environment: KAFKA_BROKER_ID: 1 KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181 KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092 KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 depends_on: - zookeeper</code></pre> Running docker-compose up -d</code> to stand up Kafka, we can now test and interact with it using a few bash commands. Let’s publish some data to our Kafka. First, let’s create a topic: docker exec -it kafka-kafka-1 kafka-topics \ --create \ --bootstrap-server localhost:9092 \ --replication-factor 1 \ --partitions 1 \ --topic my-family</code></pre> This command creates the my-family</code> topic. Let’s fill it in with some members of my family as an example. We can use the following bash command to begin publishing to the topic: docker exec -it kafka-kafka-1 kafka-console-producer \ --topic my-family \ --bootstrap-server localhost:9092</code></pre> And we can use this command to subscribe to the topic, and watch as data streams into Kafka: docker exec -it kafka-kafka-1 kafka-console-consumer \ --topic my-family \ --from-beginning \ --bootstrap-server localhost:9092</code></pre> Here’s what it looks like after inputting 7 different members of my family: ‍ In the above example, I submitted 7 strings of data, and I can see each being emitted from Kafka in my subscriber terminal. In order to work with this data though, we should work at a higher level of abstraction than the command line interface. We can’t transform the data very easily here. To build a data pipeline, we’ll need to harness a programming language so we can work this logic in. 
Let’s create a Scala service that prints the strings coming in from our my-family</code> topic, similar to what the kafka-console-consumer</code> was doing. import scala.concurrent.ExecutionContext import org.apache.pekko.actor.ActorSystem import org.apache.pekko.kafka.scaladsl.Consumer import org.apache.pekko.kafka.{ConsumerSettings, Subscriptions} import org.apache.pekko.stream.scaladsl.Sink import org.apache.kafka.clients.consumer.ConsumerConfig import org.apache.kafka.common.serialization.StringDeserializer import org.apache.kafka.clients.consumer.ConsumerRecord import org.apache.pekko.Done import org.apache.pekko.stream.scaladsl.Source import scala.concurrent.Future object KafkaTestMain extends App { implicit val actorSystem: ActorSystem = ActorSystem() implicit val ec: ExecutionContext = actorSystem.dispatcher val bootstrapServers = "localhost:9092" val topic = "my-family" val consumerSettings: ConsumerSettings[String, String] = ConsumerSettings(actorSystem, new StringDeserializer, new StringDeserializer) .withBootstrapServers(bootstrapServers) .withGroupId("group1") .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest") val source: Source[ConsumerRecord[String, String], Consumer.Control] = Consumer.plainSource(consumerSettings, Subscriptions.topics(topic)) val done: Future[Done] = source .map(record => record.value()) .runWith(Sink.foreach(println)) done.onComplete { case _ => actorSystem.terminate() } }</code></pre> This small Scala application subscribes to our Kafka my-family</code> topic and just prints the strings emitted. This small example shows that we can create bespoke software that subscribes to data streams, and works with the emitted data. But like I mentioned at the beginning of this post, our data can live in many different locations. If my use-case demanded it, I would need to write up more logic to stream in data from many different sources. The above example is the initial step. 
But many more steps are required to make even data from this single source ready for production. Scalability, Resilience, Maintainability</h2> Because we are streaming data in, and not batch processing our data, our service must be up 24/7. This requirement means the following: The microservice must be able to scale with the amount of data being ingested, tackling challenges like service discovery, network latency, and load balancing.</li> It must be resilient, capable of handling network outages, able to self-heal if a fault occurs, and able to restore from timely backups.</li> It must be able to rein in complexity, and allow developers to add new features based on stakeholder needs, with a careful eye on dependency management and rich documentation.</li> </ul> Applying all of this to our example of streaming data in from a single Kafka topic would require following best practices when implementing event-driven architecture patterns. Think of techniques like event sourcing, the saga pattern, and CQRS (Command Query Responsibility Segregation). Applying proven patterns to our service will result in a scalable, resilient, and maintainable piece of software, but doing this again for every additional source of data is tedious and error-prone. What we need is a new state of the art, allowing us to ingest data from event sources in a scalable, resilient, and easily maintainable way. Enter thatDot Streaming Graph, the New State of the Art</h2> Let’s do the exact same thing as above, but using thatDot Streaming Graph, the world’s first streaming graph data processor. We’ll point Streaming Graph at our Kafka, ingest data from our my-family</code> topic, and view our results. This time though, we won’t need to write any bespoke programs to handle how to ingest data. First, grab your own copy of Quine (the open source version of thatDot Streaming Graph) by clicking here</a> and downloading the JAR file. As of this blog post, I’m downloading v1.6.2 of Quine. 
To start it, run the following command: java -Dquine.store.type=in-memory -jar Downloads/quine-1.6.2.jar</code></pre> This will start Quine with an in-memory persistor, meaning it will not save any data on disk. This is great for testing, since it doesn’t leave anything on disk that has to be cleaned up later. Navigate to http://127.0.0.1:8080/</code> and you’ll see the Quine Exploration UI. The docs are interactive, so you can enter the code directly on the page that explains how it works. Quine can ingest data from multiple different sources, simultaneously. Just like in the previous example, let’s create an Ingest Stream, populating our graph with data from Kafka. We’ll use the following JSON to create an ingest stream from the my-family</code> topic. { "type": "KafkaIngest", "format": { "type": "CypherRaw", "query": "WITH text.utf8Decode($that) as name MATCH (n) WHERE id(n) = idFrom(name) SET n:Person, n.name=name" }, "topics": ["my-family"], "autoOffsetReset": "earliest", "bootstrapServers": "localhost:9092" }</code></pre> What does this JSON code do? First, we instruct Quine to create an Ingest Stream of type “KafkaIngest”. Next, we subscribe to the my-family</code> topic, and use the autoOffsetReset</code> option of earliest</code> to begin reading data from the beginning of the topic. Finally, we use a Cypher query to receive the raw bytes of data coming in from Kafka, decode those bytes as UTF-8 strings, and create a Person node on the graph for each name in the incoming data stream. Pasting in that JSON, giving the Ingest Stream a name (in this case family-ingest-stream</code>), and then clicking Send API Request, should result in a 200 return code, meaning we successfully created an ingest stream. If we list the Ingest Streams, we can see our named ingest, along with the ingestedCount</code> of 7 records. Returning to the Exploration UI, enter the basic MATCH (n) RETURN n</code> Cypher query to see the results of our ingest. 
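The same ingest stream can also be registered from a script instead of the interactive docs page. The following is a minimal Python sketch, not an official client: it rebuilds the JSON definition shown above and wraps it in an HTTP POST request. The endpoint path `/api/v1/ingest/{name}` and the helper names `build_kafka_ingest`, `ingest_request`, and `create_ingest` are assumptions made for illustration; check the path against the interactive docs of your Quine version before relying on it.

```python
import json
import urllib.request

# Assumption: Quine is running locally on its default port, as in the article.
QUINE_URL = "http://127.0.0.1:8080"


def build_kafka_ingest(topic, bootstrap_servers="localhost:9092"):
    """Return the Kafka ingest stream definition from the article as a dict."""
    return {
        "type": "KafkaIngest",
        "format": {
            "type": "CypherRaw",
            # Decode the raw Kafka bytes as UTF-8 and upsert one Person node per name.
            "query": (
                "WITH text.utf8Decode($that) AS name "
                "MATCH (n) WHERE id(n) = idFrom(name) "
                "SET n:Person, n.name = name"
            ),
        },
        "topics": [topic],
        "autoOffsetReset": "earliest",
        "bootstrapServers": bootstrap_servers,
    }


def ingest_request(name, definition, base_url=QUINE_URL):
    """Build (but do not send) the POST request that registers the ingest stream.

    Assumption: the endpoint path /api/v1/ingest/{name} matches the one shown
    in the interactive docs for your Quine version.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ingest/{name}",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_ingest(name, definition, base_url=QUINE_URL):
    """Send the request to a running Quine instance and return the HTTP status."""
    with urllib.request.urlopen(ingest_request(name, definition, base_url)) as response:
        return response.status
```

With Quine and Kafka running as described, calling `create_ingest("family-ingest-stream", build_kafka_ingest("my-family"))` sends the request. Keeping request construction separate from sending makes the payload easy to inspect or test without a live server.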
‍ We didn’t have to write up a custom microservice to load in data from this source. We just pointed Quine at our source, gave it a few parameters, and told it to start the ingest, transforming our data using Cypher. If we wanted to consume data from a file, we would similarly instruct Quine to ingest data from a file, and pass in a Cypher query to transform the data. Enhancing the Robustness of the Streaming Graph in Production Environments</h2> Much like our custom program designed for microservices, for production, it's imperative to focus on scalability, resilience, and maintainability. Optimizing for Scalability</h3> thatDot Streaming Graph, the commercial version of Quine, is engineered for horizontal expansion. Another pivotal feature is its implementation of backpressure, which dynamically regulates data processing speeds in alignment with the consuming service's capacity. This ensures Streaming Graph avoids overwhelming downstream services, no matter how high the system scales. No need for us to do anything to make that happen. Ensuring Resilience</h3> Streaming Graph's resilience is significantly bolstered by the backpressure mechanisms, safeguarding against data loss during unexpected downtimes by adjusting the data flow based on the consumer's current state. If the consumer can’t consume for a while, Streaming Graph pauses until it is ready. Streaming Graph is also designed to be self-healing. In the event of a cluster member failure, the system automatically delegates a hot-spare to jump in and assume the role of the downed member, ensuring uninterrupted data processing. This resilience is further enhanced by leveraging durable data storage solutions such as Cassandra or ClickHouse. Maximizing Maintainability</h3> thatDot Streaming Graph has the ability to ingest data from multiple sources, simultaneously, removing the need to create multiple bespoke services. 
You can ingest data from: Apache Kafka</li> AWS Kinesis</li> S3</li> Server Sent Events (SSE)</li> Websockets</li> And more</a></li> </ul> ‍ This also provides extra functionality, since you can resolve duplicates, and find relationships between multiple data streams in real time. Streaming Graph and Quine use the Cypher query language to transform all data sources, allowing a consistent experience when ingesting data sourced from different locations. Elevating Your Data Architecture: Discover the Power of thatDot Streaming Graph</h2> In software development, when complexity starts getting too high, refactoring to use a higher-level abstraction can be a winning strategy. The current state of the art with microservices and data pipelines is incredibly complex, demanding a higher level of abstraction. thatDot Streaming Graph is that higher level of abstraction. It can handle the problems of scalability and resilience automatically, so developers can focus on the logic that matters, transforming high-volume data into high-value insight. Click here</a> to download your own open source copy of Quine, and click here</a> to jump into our Discord to join other like-minded developers looking to solve the challenges of high-volume data pipelines at scale. ‍ Novelty Demo 2024-06-19T00:00:00+00:00 Novelty Tutorial</h2> </iframe> </div> This 12 min video demonstration walks through a Jupyter notebook powered scenario illustrating how to use thatDot Novelty to analyze CDN logs for anomalous activity. Click here</a> to download the CDN dataset for this example. ‍Download the Jupyter notebook</a> and try the demo yourself with an AWS instance</a> of thatDot Novelty. Demo Summary</h2> Novelty Score Endpoints</h3> The demo interacts with thatDot Novelty through its interactive REST API. 
You can stream observations</a> into thatDot Novelty using one of two API endpoints: Single observation: POST /api/v1/novelty/{context}/observe</code></li> Bulk observations: POST /api/v1/novelty/{context}/observe/bulk</code></li> </ul> After streaming in a batch of observations, you can rescore observations given the context of the entirety of the dataset using Novelty's read-only scoring endpoints: Single observation: POST /api/v1/novelty/{context}/read</code></li> Bulk observations: POST /api/v1/novelty/{context}/read/bulk</code></li> </ul> Novelty Score Results</h3> thatDot Novelty's Score Results response returns the observation score, along with additional useful information. Here is some of that data: observation: The observation that was streamed in to generate the result, as a list of string observation components</li> score: score between 0 and 1 representing the most novel component of this observation. 1 is highly novel, 0 is not novel at all; the mostNovelComponent</code> field contains more details on which component led to this result</li> mostNovelComponent: which component of the observation was the most novel</li> sequence: sequence number assigned to uniquely identify this observation as made within this context.</li> uniqueness: scaled measure of uniqueness for the observation as a whole; ranges between 0 (no uniqueness) and 1 (totally unique)</li> </ul> Important Points</h3> Unique does not mean novel. Sometimes, completely unique and unseen observations can be normal, as described in the demo when showing the normalcy of having completely unique IP addresses in a certain scenario</li> thatDot Novelty does not require training, but does take a bit of time, depending on the use case, to adapt to the data</li> </ul>
Release Announcement for thatDot Streaming Graph 1.6.1 with ClickHouse Persistor 2024-06-19T00:00:00+00:00 A new version of thatDot Streaming Graph has just been released with the brand new ability to persist data directly in ClickHouse! The new ability to have multiple namespaces is another huge advantage in multi-tenant situations. With new v1.6.1 enhancements, you can: ▪ Persist data in ClickHouse. ▪ No longer see hot spares in ingest monitor payloads. ▪ Manage different teams or customers with namespace management, with a single thatDot Streaming Graph instance that has independent graph interpreters. ▪ Integrate with Kafka more robustly with added support for arbitrary Kafka properties on WriteToKafka Standing Query output via the kafkaProperties field. ▪ Use multiple value standing queries more easily with the ability to trigger an event on arbitrarily-keyed property changes by using RETURN properties(n) rather than having to list all properties manually. ▪ Get simpler, more standard names for the NumberIteratorIngest fields startAtOffset and maximumPerSecond. If you were previously using the names startAt or throttlePerSecond, you will need to update your recipes and API calls. ▪ Improve favicon support on all platforms. ▪ Execute simple text queries in the single-line query bar on the Explore UI with SHIFT-ENTER. Fixes: Some error messages normally encountered during Cypher query compilation were lost when the previous version was migrated to Scala 2.13. Those error messages have been reintroduced. Be sure to check out thatDot Streaming Graph for more information and, if you already use it, update your local copy of thatDot Streaming Graph to v1.6.1 using your preferred method. Download - Streaming Graph for Data Pipelines.</a> Can Streaming Graphs Clean Up the Data Pipeline Mess? 
2024-06-18T00:00:00+00:00 In this article on Datanami, Alex Woodie discusses the problems with current event stream processing data pipelines, and the advantages a graph paradigm could bring to the table, with thatDot technology spotlighted. He talks about how thatDot's Ryan Wright found himself having to rebuild data pipeline infrastructure multiple times, and how brittle and difficult to maintain it could be. “The more data pipelines you build, the more they start looking like the same thing,” Wright says. “And you have to start wondering: How do we solve the higher-level question so we don’t have to keep rebuilding the same pipelines over and over again?” Learn more by reading the article "Can Streaming Graphs Clean Up the Data Pipeline Mess</a>?" on Datanami. Optimize Digital Twins to Real Time 2024-06-18T00:00:00+00:00 On RTInsights, thatDot's Rob Malnati writes about digital twins. This article provides a solid foundation as to what digital twins are, what they're used for, and how streaming graph technology can make them more effective. "As our world becomes increasingly connected, digital twins abstract and model almost everything to improve business operations, reduce risk, and enhance decision-making for better outcomes." "For digital twins to be truly useful, they must be able to drive actions – for example, issue alerts or power down equipment – the instant an issue emerges, perhaps even beforehand." Read more in the original article "Optimizing Digital Twins to Real Time</a>" on RTInsights. Stop Querying Your Data 2024-06-18T00:00:00+00:00 At the 2023 Knowledge Graph Conference in New York, Ryan Wright, CEO and Founder of thatDot, gave a presentation entitled: Streaming Graphs: Because We Cannot Afford to Query Anymore. Quine streaming graph can process millions of complex, multi-hop graph events per second. But what design decisions and tradeoffs went into making this possible? And why does it matter to data engineers and their day-to-day? 
Learn how Quine integrates with your event streaming pipeline (including Apache Kafka, Kinesis, Redpanda, and Apache Pulsar) and uses standing queries to generate results the instant a pattern in the data stream emerges. Why does this mean you don't have to keep querying your data? Check out the "Streaming Graphs: Because We Cannot Afford to Query Anymore" presentation recording to learn more.

Streaming graph analytics: ThatDot’s open-source framework Quine is gaining interest

2024-06-18T00:00:00+00:00

In this article on VentureBeat, George Anadiotis discusses what streaming graph analytics is and does, the power of Quine, the increasing interest in the concept of streaming graph, and the influx of thatDot funding from cybersecurity leader CrowdStrike.

"What do you get when you combine two of the most up-and-coming paradigms in data processing — streaming and graphs? Likely a potential game-changer, at least that’s what is being hinted at by the likes of DARPA and now CrowdStrike’s Falcon Fund, which are betting on ThatDot and its open-source framework Quine."

To learn more, read "Streaming graph analytics: ThatDot’s open-source framework Quine is gaining interest" on VentureBeat.

Understanding Batch VS Streaming Data

2024-06-18T00:00:00+00:00

In this article on InsideBigData, thatDot's Rob Malnati discusses the evolution of data architectures from batch toward a more real-time representation of the world. This new way of dealing with data is often essential as data processing demands change.

"Batch processing is, and will remain, enormously useful for many everyday tasks. However, for all its utility, batch processing is at odds with how the world works. Whether you are talking about financial transactions, social media feeds, or clicks on news sites, data is being generated continuously. It streams past. And once it is gone, your ability to act on it at the moment is also gone."

Read more at "Understanding Batch vs.
Streaming Data Processing As Enterprises Go Real Time" on InsideBigData.

Understanding the Scale Limitations of Graph Databases

2024-06-18T00:00:00+00:00

In this article on eWeek, thatDot's Rob Malnati discusses why it's difficult or even impossible to analyze really large datasets using graph databases. The difficulty is compounded by the modern need to respond to everything in real time.

"Much has changed since the emergence of the most recent generation of graph databases from a decade ago. Enterprises are dealing with previously unimaginable volumes of data to potentially query. That data enters and streams through the enterprise in a variety of channels, and enterprises want action on that information in real time."

To learn more, read "Understanding the Scale Limitations of Graph Databases" on eWeek.

What is Categorical Data?

2024-06-18T00:00:00+00:00

On Datanami, thatDot founder and CEO Ryan Wright helps define the nature of categorical data. This essential data type makes up about three quarters of all data an enterprise needs to analyze, yet is often not analyzed.

"Why would enterprises ignore an entire class of data? Especially when it is essential to high-priority use cases like personalization, customer 360, fraud detection and prevention, network performance monitoring, and supply chain management?"

To learn more, read "What is Categorical Data?" on datanami.com.

Quine Aims to Simplify Event Processing on Data in Motion

2024-06-17T00:00:00+00:00

On InfoQ, Sergio De Simone talks about the advantages of the streaming graph style of data processing, and of Quine open source software in particular.

"What sets Quine apart from other stream processing solutions, says thatDot, is a set of three design choices that lie at its foundations: a graph-structured data model, an asynchronous actor-based graph computational model, and standing queries."
Check out "Quine Aims to Simplify Event Processing on Data in Motion" on InfoQ to learn more.

ThatDot accelerates streaming data analytics with open source Quine

2024-06-17T00:00:00+00:00

On VentureBeat, Shubham Sharma writes about thatDot's announcement of Quine, open source software for streaming graph complex event processing. He discusses, among other things, the power of Quine to reduce the burden on developers of event stream processing data pipelines.

"It can eliminate batch processing, multi-level joins, and other time-consuming and outdated processes that drag down and stall analysis on streaming data. This way, data pipeline engineering teams can easily interpret high-volume event data streams, innovate and ship products faster and use the emerging Graph AI tools driving the next wave in machine learning."

Read more about how "ThatDot accelerates streaming data analytics with open source Quine" on VentureBeat.

thatDot launches Quine, a streaming graph engine

2024-06-17T00:00:00+00:00

On TechCrunch.com, Frederic Lardinois talks about the launch of Quine, an open source streaming graph complex event processing engine.

“We’ve developed the streaming graph to really target the kind of the problem in the industry right now — the rock and hard place that we all sit between,” Quine’s creator and thatDot CEO and co-founder Ryan Wright told me. “On one side, there’s huge volumes of data. For the last 10 years, big data has just become de rigueur, it’s a normal ordinary thing now and only getting bigger. But the other side of that is how do you interpret all that data?”

Read more about "thatDot launches Quine, a streaming graph engine" on TechCrunch.com.

thatDot Launches Streaming Graph Platform

2024-06-17T00:00:00+00:00

“Enterprise data engineering teams are confined to the limitations and tradeoffs of the previous generation of event processing frameworks like Flink.
They spend enormous time and effort building complicated event-driven architectures that only work on small time-windows of in-memory data and miss out on the bigger picture,” said Ryan Wright, the creator of Quine and founder/CEO of thatDot. “Quine can transform months of tedious data engineering into an afternoon’s work enabling data pipeline engineers to easily interpret high-volume event data streams, innovate and ship products faster, and to use the emerging Graph AI tools driving the next wave in machine learning.”

In an editorial for DBTA magazine, Stephanie Simone discusses the debut of Quine, open source software for graph-based complex event stream processing. She talks about how it combines graph data with streaming data processing technologies in a developer-friendly way.

Read more about "thatDot Launches Streaming Graph Platform" in the online DBTA magazine.

Authentication Fraud

2024-06-17T00:00:00+00:00

The Problem

Metered attacks that generate low-volume log-in attempts, from diverse IPs and across extended time frames, are designed to avoid the "3 strikes in 24 hours" business rules in authentication applications and the more complex analysis of log analytics / SIEM platforms. Batch solutions by definition cannot react until after a compromise has occurred, while other real-time solutions impose time windows: any data falling outside these rolling windows, no matter how important, is simply not processed. Either way, important patterns are missed and attempts succeed before you can stop them.

The Solution

Quine changes the status quo by continuously assessing newly arriving events against all known attack patterns. It identifies and tracks partial behavior matches across any time frame, and across billions or trillions of users, devices, and applications, until a behavior pattern is fully observed.
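The idea of holding a partial match open with no time window can be sketched in a few lines of Python. This is only an illustration of the concept described above, not Quine's standing query mechanism: the class name, the "failed logins from several distinct IPs" pattern, and the threshold are all hypothetical.

```python
from collections import defaultdict


class LowAndSlowDetector:
    """Track failed logins per account with no expiry window (hypothetical sketch)."""

    def __init__(self, distinct_ip_threshold=3):
        self.threshold = distinct_ip_threshold
        # account -> set of IPs that have failed to log in; never expires
        self.failed_ips = defaultdict(set)

    def observe(self, account, ip, success):
        """Record one login event; return True only when the pattern first completes."""
        if success:
            return False
        ips = self.failed_ips[account]
        already_complete = len(ips) >= self.threshold
        ips.add(ip)
        return not already_complete and len(ips) >= self.threshold


detector = LowAndSlowDetector(distinct_ip_threshold=3)
# Events may arrive days or months apart; the partial match is simply kept.
events = [
    ("alice", "10.0.0.1", False),
    ("alice", "10.0.0.2", False),
    ("alice", "10.0.0.3", False),
]
alerts = [detector.observe(*event) for event in events]
```

Because nothing is evicted by age, spacing attempts out over time does not reset the partial match; only the final event that completes the pattern raises an alert.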
Once an attack pattern is fully detected, events are generated immediately to trigger an investigation alert or an automated remediation workflow. Quine's continuous analysis of event streams means there are no time windows to manage, and thus no windows for attackers to engineer their attacks around. Quine also provides this extended time frame of analysis without incurring the cost of SIEM solutions, sifting through data from multiple sources to find and store only the patterns that matter: in this case, the ones that indicate a low and slow attack is underway.

Key Value Take Away
▪ Continuously track behavior patterns across billions/trillions of devices, users, and applications
▪ Provide analysts a complete record of historical actions by user, device, or application
▪ Operate on one domain/customer, or across domains/customers
▪ Cost effective vs. log analysis / SIEM data store quotas

Financial Fraud Detection

2024-06-17T00:00:00+00:00

The Problem

Financial fraud detection requires monitoring billions of transactions, devices, and users in real time for suspect behaviors, without false positives that alienate customers when service is denied in the middle of a foreign vacation or late-night business event.

The Solution

What is needed is a system that can do four things:
1. detect complex patterns of behavior
2. combine multiple sources and scale up to millions of events/sec
3. take the appropriate, user-specified action when patterns are detected
4. do all of this in real time

Quine can monitor device and user behavior over extended time periods to detect both expected exploit behaviors and novel threat actions. By including categorical data such as store names, item types or sizes, geo locations, device versions, and day of the week, Quine understands the full context of behavior, eliminating false positives.
Additionally, Quine alerts provide a comprehensive view of past and current behavior for a device or user as supporting data for investigations.

Key Value Take Away
▪ Behavior modeling for billions/trillions of users, devices, and transactions
▪ High-confidence risk scoring by leveraging the rich behavior context provided by categorical data analysis
▪ Human-understandable alert information to support analysts' investigations
▪ Cost effective at scale with on-premise licensing
▪ Integrates with existing Apache Kafka, AWS Kinesis, data lake, and API event sources.

Graph AI

2024-06-17T00:00:00+00:00

The Problem

Recent AI research is generating a growing number of graph AI techniques that take advantage of graph data relationships and the rich context they provide. However, production graph data pipelines lack the performance needed to deploy these new tools at scale.

Graph AI development promises significant advances for AI application to a range of use cases thanks to the rich data context available from a graph data model. Moving graph AI techniques from the lab to production scale, however, is a significant challenge due to the limited scaling performance of graph databases.

The Solution

Quine streaming graph provides a single platform for:
1. development of graph AI techniques, and
2. production deployment of your algorithms on high-volume data streams.

Quine even supports ingestion and transformation of multiple data and event sources as part of the solution, allowing data scientists to define these data operations in the lab and then migrate them "as is" to production-scale platforms run by operations.

Graph AI development in Quine is supported in multiple ways:
▪ Construct your AI logic as Cypher queries and apply them via REST API.
▪ Apply externally built algorithms as User Defined Functions. (Example)
▪ Create custom low-level messaging primitives and node behavior on Quine.
(Example)
▪ Use Quine standing queries as event-based triggers to update values on other nodes. A set of related nodes updating each other can perform the computation for, and maintain intermediate results of, algorithms on a graph.

Key Value Take Away
▪ A single platform to define ETL operations in the lab and production
▪ A single platform to define, test, and deploy graph AI techniques
▪ Build native graph AI techniques as primitives or using Quine's powerful standing query capabilities
▪ Import externally built user defined functions for use via Quine

Log Analysis

2024-06-17T00:00:00+00:00

The Problem

Monitoring systems composed of multiple services is typically done either by monitoring each service individually using its logs, or on an end-to-end basis that lacks visibility into the individual performance characteristics of each service. Root cause analysis is usually based on operations personnel's instinct and past experience, making automated remediation next to impossible for many use cases.

The Solution

With thatDot's streaming graph, logs and events from servers, operating systems, databases, applications, and clients are ingested in real time and assembled into a graph data model. The graph data model natively connects events with unlimited categorical classifications and calculated metrics to identify "alerts that matter" and instantly associate them with servers, VMs, containers, code versions, subnets, etc. This real-time, comprehensive view of the inter-relationships between services allows rapid assessment of root causes for operations investigations or automated remediation workflows.
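To make "assembled into a graph data model" more concrete, here is a minimal Python sketch that links log events to the hosts and code versions they mention, then walks the graph to find related services. It is an in-memory illustration only: the event fields (`service`, `host`, `version`), the node naming, and the sample data are assumptions, and it does not use any Quine API.

```python
from collections import defaultdict


def build_graph(events):
    """Assemble log events into an undirected adjacency map.

    Nodes are (kind, identifier) tuples; each event links its service
    node to the host node and the version node it mentions.
    """
    edges = defaultdict(set)
    for event in events:
        service = ("service", event["service"])
        for kind in ("host", "version"):
            other = (kind, event[kind])
            edges[service].add(other)
            edges[other].add(service)
    return edges


# Hypothetical log events.
events = [
    {"service": "checkout", "host": "vm-1", "version": "1.4.2"},
    {"service": "payments", "host": "vm-1", "version": "2.0.0"},
    {"service": "checkout", "host": "vm-2", "version": "1.4.2"},
]
graph = build_graph(events)

# Services that share a host with "checkout": two hops, no table joins.
checkout = ("service", "checkout")
neighbors = {
    node
    for host in graph[checkout]
    if host[0] == "host"
    for node in graph[host]
    if node != checkout
}
```

An alert on `checkout` can then be associated with co-located services by following edges, which is the kind of pivot the graph model is meant to make cheap.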
Key Value Take Away
▪ Identify issues that matter, in real time and at scale
▪ Graph data modeling eliminates the complexity of deeply nested joins
▪ NOC technicians can easily pivot data to understand issue impacts and root causes
▪ Automatic handling of out-of-order data arrival
▪ Entity resolution between log and event sources
▪ Integrates with existing Apache Kafka, AWS Kinesis, data lake, and API event sources

Real-time Blockchain Fraud Detection

2024-06-17T00:00:00+00:00

The Problem

Real-time linking of transactions, accounts, wallets, and blocks within and across blockchains is not possible with current solutions. Instead, the user must either rely on batch processing, which means results are out of date, or perform recursive lookups across table joins, which means unacceptable latency.

The Solution

Graph data structures are ideal for modeling the relationships described in blockchain events. Flows of cryptocurrency between accounts and wallets are ideal inputs for graph data modeling. Accounts, addresses, time references, devices, assets, transaction details, etc. are all examples of categorical data connected by relationships, and are therefore well suited to the nodes, edges, and properties of a graph data model. Most importantly, relationships between entities are first-class citizens in the data model, so the cost and complexity associated with table joins are entirely eliminated.

Quine easily ingests event feeds from multiple sources and creates a single unified view of activity on the blockchain(s). When fraudulent activity is reported, or when evidence of fraud emerges, Quine's standing queries instantly recognize the patterns and trigger an alert. The client can then not only detect the fraudulent behavior in real time but also block transactions before they can complete.
Of course, many fraud alerts are delivered well after a transaction occurs. In those cases, the Quine streaming graph can instantly provide a complete list of past transactions for a wallet or block to aid investigations or to block future transactions with related parties.

Graph is the ideal data model for blockchain relationship tracing. In the above screenshot from Quine's Exploration UI, you can see how Quine makes it easy to trace all the accounts that interact with an account engaged in fraudulent activities.

Watch thatDot's founder, Ryan Wright, demonstrate using Quine to tag fraudulent accounts and track transactions.

Key Value Take Away
▪ Sub-5 ms access to complete trace history
▪ Adapt to new blockchains rapidly with streaming ETL built in
▪ Real-time materialization of wallet, block, and transaction state
▪ On-premise software to deploy in your data center or cloud of choice
▪ Integrates with existing Apache Kafka, AWS Kinesis, data lake, and API event sources

Stateful Digital Twin

2024-06-17T00:00:00+00:00

The Problem

While digital twins and the emerging subcategory of asset graphs promise operators greater visibility into the relationships between IT assets and equipment under management, current approaches are more like snapshots of a point in the past. Events take place in real time, meaning the digital twin is almost always out of date, limiting its utility. Lack of visibility translates into delayed reactions to threats or failure modes. Digital twins are out of step with enterprises increasingly moving to real time.

The Solution

Quine streaming graph generates digital twins and asset graphs that are stateful and event-driven. Stateful digital twins reflect activities in the environments they model in real time.
Using Quine streaming graph, you can easily construct a digital twin that ingests high volumes of event data to detect complex and subtle threats or changes in conditions the instant data arrives, triggering alerts and remedial action in real time.

Key Value Take Away
▪ Use graph ETL to ingest cloud or system events from multiple systems.
▪ Automatically construct a digital twin that includes system elements and the relationships between them.
▪ Continuously update the digital twin to reflect ongoing changes in real time.
▪ Trigger events and alerts when specific conditions are identified in the twin.
▪ Query for any past state of the digital twin.

Streaming Graph ETL

2024-06-17T00:00:00+00:00

The Problem

Most ETL tools use the batch processing paradigm to find high-value patterns in large volumes of data. Whether the specific business application is fraud detection, cyber security, network observability, e-commerce, or ad targeting, batch processing translates into delay. Even if you are processing data in small batches, you are missing opportunities to react to events as they happen and shape outcomes in ways beneficial to your business.

A great example is insider trading. The cost of detecting someone who is about to execute an insider trade is much less than the cost of trying to unwind that trade later when batch processing picks it up. Even if the batch process runs every five minutes, that just means you'll find them sooner, not stop them. Ultimately, choosing batch over stream means costly reversal of transactions rather than stopping them in real time.

The Solution

Streaming ETL using Quine means not just knowing about events but acting on them as they occur. Use Quine's ingest queries to materialize event data as a graph, gaining a graph’s ability to express and query for complex relationships between seemingly unrelated data. Then use Quine’s standing queries to monitor for key patterns (e.g.
indicating a fraudulent transaction or cyber attack is underway) and take action when those patterns emerge. Quine’s graph ETL also makes it straightforward to process categorical data (everything from email addresses and model numbers to IP addresses and process IDs) that other systems ignore or try to encode. Use Quine Enterprise to scale your graph ETL to millions of events per second.

Key Value Take Away
▪ Use standing queries to detect patterns as they occur and take action
▪ Join data from multiple sources at scale
▪ Resolve entities across sources
▪ Mitigate out-of-order data arrival
▪ De-duplicate data
▪ Generate new events from data as it streams, in real time
▪ Integrates with existing Apache Kafka, AWS Kinesis, data lake, and API event sources.

Video Observability for Root Cause Analysis

2024-06-17T00:00:00+00:00

The Problem

Real-time video observability that can solve Quality of Experience (QoE) issues while live broadcast events are still playing requires the simultaneous monitoring of millions of data points. Video sessions flow across multiple systems, including origins, CDNs, manifest services, and players provided by multiple vendors. Relational database approaches to this complex log analysis at production scale run into cost constraints that prohibit comprehensive real-time operations for all but the highest-value broadcast events.

The Solution

Quine streaming graph ingests logs and events from clients, CDNs, origins, etc. in real time and materializes the data into a graph. The graph data model natively connects chunk QoE metrics with unlimited categorical classifications and calculated metrics to identify "alerts that matter to your audience" and instantly associate them with ASN, geo, client type, asset names, encoding formats, CDN cache server, origin server, etc.
This real-time, comprehensive view of the inter-relationships between services allows rapid assessment of root causes while live video streams are still playing.

Key Value Take Away
▪ Identify the QoE-impacting issues that matter, in real time and at scale
▪ Graph data modeling eliminates the complexity of deeply nested joins
▪ NOC technicians can easily pivot data to understand issue impacts and root causes
▪ Automatic handling of out-of-order data arrival
▪ Entity resolution between log and event sources
▪ Integrates with existing Apache Kafka, AWS Kinesis, data lake, and API event sources.

Novelty Technology

2024-06-14T00:00:00+00:00

Introduction: a New Approach to Anomaly Detection

Anomaly detection is a technique for finding important data. Decades of research have been spent on creating tools for anomaly detection with numeric data. But most data produced in the real world is not numbers: it is user names, identifiers, log statements, email addresses, URLs, access credentials, service names, file paths, timestamps, IP addresses, API paths, and a seemingly endless list of valuable data that is not a number. Non-numeric data is called “categorical data” and it has been mostly ignored by data analysis tools. So how could you find important categorical data with existing anomaly detection tools if they only work with numeric data?

Not One-Hot

The state of the art for using anomaly detection (or most other artificial intelligence techniques) on categorical data is to begin by converting the categorical data into numbers. There are standard ways of converting categorical data into numbers; the most common by far is “one-hot encoding.” Here is a list of 16 more. These techniques are all cumbersome and lossy in one form or another. Each of those 17 transformations requires a data scientist to bake in some interpretation, which future data may not align with.
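For readers unfamiliar with the one-hot encoding mentioned above, the following is a minimal Python sketch of it. The function name and the sample user names are illustrative assumptions; real pipelines typically use a library encoder, but the shape of the output is the same.

```python
def one_hot_encode(values):
    """Return (vocabulary, vectors) for a list of categorical values.

    Each distinct value gets its own dimension, so the width of every
    vector equals the number of distinct values seen.
    """
    vocabulary = sorted(set(values))
    index = {value: i for i, value in enumerate(vocabulary)}
    vectors = []
    for value in values:
        vector = [0] * len(vocabulary)
        vector[index[value]] = 1
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical user names observed in a log stream.
vocab, vectors = one_hot_encode(["alice", "bob", "alice", "carol"])

# One previously unseen value widens every vector by one dimension.
wider_vocab, wider_vectors = one_hot_encode(["alice", "bob", "alice", "carol", "dave"])
```

Note that the encoder has to see the full set of values before it can fix the vector width, and that the width grows with every new value.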
And critically, the standard approach, one-hot encoding, requires that you know the cardinality of the categorical data ahead of time, and that it remains very low. Each new value of the categorical data requires another dimension to be added to the matrix computations. Adding dimensions leads to a highly complex feature space that is ruinous for anomaly detection! In addition to larger matrices requiring more computation time, achieving useful results typically becomes impossible because of what has become known as “the curse of dimensionality”: as the number of dimensions used for anomaly detection increases, every data point starts looking like an anomaly in some way and the results are useless.

The Standard For a New Technique

When we set out to create a new way to detect important data directly using categorical data, we set our sights rather high. The standards we wanted to meet for this new technique were:
▪ Streaming / Online / Real-Time. Answers now are much better than answers in 15 minutes, or an hour, or tomorrow, or next week. To be useful in the greatest number of applications, we should be able to provide results immediately. Since the streaming use case is a superset of the batch use case, any tool that can provide streaming results can also provide batch results; but not the other way around.
▪ Unsupervised. Humans should not have to manually label the data to indicate what is and isn’t anomalous. Supervised AI techniques require manually creating training sets. That process is extremely laborious and time consuming, and often leads to bias encoded into the system. Instead, this system should train itself automatically from the data.
▪ High cardinality. One of the greatest challenges with categorical data is that it often has extremely high cardinality. A system designed to handle this should not cause a data or analysis explosion when some fields have an extremely high number of values.
Instead, the system should support constant-time processing and storage, incorporating new values and dimensions in the data on-the-fly, while still delivering real-time results.
▪ Distinguish unique from anomalous. Sometimes “new” is actually “normal.” When a data set regularly includes new unseen values, an anomaly detection system should take that into account. Instead of producing false alarms because data is unique, the system should learn from the context and the rest of the data seen so far to understand when unique data is actually typical.
▪ Learn behavioral patterns. Human behavior is complex, and system behavior can often compound to be even more complex. The ideal anomaly detection system would be able to learn idiosyncratic behaviors, applicable only in specific situations, and incorporate that learning into the final evaluation of data.
▪ Rank results in a total order. It is helpful to be told “yes” or “no” for whether data is anomalous. But in many real-world environments, we would also like to know how anomalous it is and how that compares to another anomaly we might be looking at. A strength of existing numerical methods is that they often give a total ranking of results. A new anomaly detection system should preserve this total ordering of anomaly scores but deliver those results immediately, with scores that are still totally ordered as the stream continues. In practice, if I’m already looking at a rather anomalous event, I want to be told if new data is even more anomalous.
▪ Give explanations. It’s great to reach a conclusion and learn that a piece of data is very novel and important. But it is much better to also understand WHY that datum is important. A system for categorical anomaly detection should embrace the goals of “Explainable AI” so that the people involved can understand its conclusions.
▪ Theoretically sound.
The Achilles heel of most AI systems today is that they are empirically tested but often lack a solid theoretical justification for the results they produce. The individual steps in modern methods are often well-established, but evaluating the fully-composed technique is limited to empirical measurement instead of theoretical grounding. In contrast, we believe that new techniques are more powerful if they are theoretically sound all the way to their core, and the results reflect that. A system that is theoretically sound produces more reliable results, especially when faced with unfamiliar data. Since anomaly detection is entirely about finding and explaining unfamiliar data, the soundness of the approach is of critical importance.

Theoretical Grounding

Terminology: Novel vs. Anomalous

We describe the system overall as an “anomaly detection system” because that is its most common use, and the best-known name in the industry. But what we actually compute is a continuous score describing how novel each piece of data is. More than just swapping words, this distinction between “novel” and “anomalous” is an important one. Novelty is an objective feature we can ascribe to the data, and evaluate on a continuous spectrum. Applying this system to a data set, a user can use the novelty scores to decide whether or not data is anomalous. This reflects a separate application-specific process of translating the continuous-valued novelty score into a binary-valued anomaly decision. Considering an example application in cyber-security, “anomalous” is often the term used to describe what an analyst would probably prefer to call “suspicious.” These terms get used interchangeably in practice, but with subtle subjective distinctions.
When approaching the theoretical justification for a new technique, we think it is important to deliberately use the term “novelty” for an objective feature of the data, determined before the results are interpreted as “anomalous” or “suspicious.” Thus, “novelty” is the best term for the theoretical determination arrived at by this system.

Probabilistic Graphical Models

A collection of data can be described with a graphical model. This is a representation of the data built by structuring nodes and edges in relationships that represent the core features of the data. When that graph is structured with historical facts and probabilistic information about the data being examined, it can provide a wealth of statistical information about the dataset as a whole, and each individual data observation.

Information Theory

That statistical information provides a probabilistic view of the data which allows us to measure the Information Content (also known as “Shannon Information” or “Self-Information”) represented in the data. Information Content is the basis for the novelty scores returned from thatDot Novelty Detector. This approach runs parallel to traditional AI techniques where Information Content is often the primary measure used in the cost function (or “loss function”) of many machine learning methods. In fact, to help machine learning researchers and engineers use this system for other purposes beyond just anomaly detection, we even return the Information Content in the result payload for each observation. Inside Novelty Detector, the Information Content is computed and then combined with other information from a complex graph built by the system.

A Dynamic Graph

thatDot Novelty Detector builds a dynamic graphical model of the dataset in real time. This is generally a challenging problem, often calling for a graph database or other complex tools.
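As a reference point for the Information Theory discussion above: the Information Content of an observation x with probability p(x) is I(x) = -log2 p(x), so rarer observations carry more bits. The following is a minimal Python sketch of that measure, assuming probabilities are estimated from simple running counts. That estimator and the class name are illustrative assumptions; the source does not describe how Novelty Detector estimates probabilities, and this is not its implementation.

```python
import math
from collections import Counter


class InformationContent:
    """Running estimate of Shannon self-information for categorical values."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score(self, value):
        """Count `value`, then return -log2 of its estimated probability (in bits)."""
        self.counts[value] += 1
        self.total += 1
        probability = self.counts[value] / self.total
        return -math.log2(probability)


ic = InformationContent()
# Hypothetical stream of service names: the rare value scores highest.
scores = [ic.score(name) for name in ["auth", "auth", "auth", "billing"]]
```

A value that has appeared in every observation so far scores 0 bits, while a first-time value in a stream of n observations scores log2(n) bits under this simple estimator.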
But graph databases and similar tools aren’t built for streaming data, and they end up crushed under the load of a high volume of streaming input data and the voluminous computation that results. To overcome this challenge, we built Novelty Detector on top of Quine. Quine is a streaming graph interpreter, capable of high-volume data processing and storage. Using Quine, we construct and maintain the graphical model that represents the incoming data. Quine records the historical facts and computes all the necessary probabilities and information measures needed to produce a score for each incoming data item. That graph interpreter can compare scores across the entire data context and explain what about the data was so novel. All of this is accomplished in real time so that streaming results are scored immediately as they flow through the system.

Conclusion

The final result is a high-throughput, low-latency, high-cardinality, categorical data analysis tool capable of scoring the novelty of all incoming data in real time. Streaming data dynamically updates a probabilistic graphical model to compute information content assessed holistically with the data context to provide a novelty score useful for finding anomalies and explaining the result. All together, thatDot Novelty Detector represents a breakthrough in the field of anomaly detection, with wide-ranging applications across industries.

You can try thatDot Novelty Detector for free on AWS right away.

Advanced Persistent Threat (APT) Detection

2024-06-14T00:00:00+00:00

The Problem

Discovering advanced persistent threats (APT) is, by design, akin to finding a needle in a haystack.
The threat actors behind APTs combine multiple tactics, techniques, and procedures (TTP) over extended periods of time to compromise and maintain access to their targets. The IBM Cost of Data Breach Report 2021 reported an average attacker dwell time of 212 days. APTs evade legacy security solutions which rely on time-batched loads of data that filter for Indicators of Compromise (IoC) by executing incremental actions spread across numerous systems at rates that exceed batch analysis size and time boundaries. APT detection requires a new approach. The Solution</h2> Matured within DARPA's Transparent Computing program specifically for the detection of APTs, Quine</a> and Novelty Detector</a> work together to efficiently uncover the aspects of advanced persistent threat detection. Quine’s graph data model uses categorical data</a> other systems ignore and excels at correlating individual events occurring in their billions/trillions across devices, software and services over any time period to find the behavior patterns (Indicators of Behavior or IoBs) that represent malicious activity. When patterns are detected, Novelty Detector can then apply its categorical anomaly detection techniques to identify when a string of related actions represents a novel/anomalous behavior, greatly reducing false positives. Quine Enterprise provides commercial support and licensing for clustered Quine and Novelty Detector. You can add real-time behavior-based APT detection to your stack at scale and with confidence. thatDot's core technology underpinning Quine and Novelty Detector was developed in partnership with DARPA. Read more about thatDot's origin and some examples of using Novelty Detector to detect data exfiltration</a> and credential theft</a>. 
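The batch-boundary problem described above can be illustrated with a toy sketch. This is not Quine's implementation or API; the three-step pattern, the event tuples, and the function names are all hypothetical, chosen only to show why a detector that evaluates fixed time windows can miss a slow-moving sequence that a detector holding partial-match state will catch.

```python
from collections import defaultdict

# Hypothetical three-step behavior pattern; the step names are illustrative only.
PATTERN = ("login", "privilege_change", "bulk_download")

# (day, host, action) events spread over months, as in a slow-moving intrusion.
EVENTS = [
    (1, "host-a", "login"),
    (3, "host-b", "login"),
    (95, "host-a", "privilege_change"),
    (210, "host-a", "bulk_download"),
]


def _contains_in_order(actions, pattern):
    """Return True if pattern occurs as an ordered subsequence of actions."""
    remaining = iter(actions)
    return all(step in remaining for step in pattern)


def batch_detect(events, window_days):
    """Look for the full pattern inside each fixed, non-overlapping window."""
    windows = defaultdict(lambda: defaultdict(list))
    for day, host, action in events:
        windows[day // window_days][host].append(action)
    hits = []
    for per_host in windows.values():
        for host, actions in per_host.items():
            if _contains_in_order(actions, PATTERN):
                hits.append(host)
    return hits


def incremental_detect(events):
    """Keep one partial-match cursor per host and advance it event by event."""
    progress = defaultdict(int)  # host -> number of pattern steps matched so far
    hits = []
    for _day, host, action in events:
        if action == PATTERN[progress[host]]:
            progress[host] += 1
            if progress[host] == len(PATTERN):
                hits.append(host)
                progress[host] = 0  # reset so the host can match again
    return hits


if __name__ == "__main__":
    print(batch_detect(EVENTS, window_days=30))  # no single window holds all steps
    print(incremental_detect(EVENTS))            # partial state survives across time
```

With 30-day windows no single batch contains all three steps, so the batch detector reports nothing, while the incremental detector carries its partial match forward until the final step arrives on day 210.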
Key Value Take Away</h2> Quine + Novelty Detector detect both known and emerging behavioral patterns in a single workflow.</li> Join multiple data sets to enable real-time identification of attack behaviors across domains</li> Identify behaviors over extended time periods using incremental streaming analysis (not batch)</li> Native support for categorical data</a> simplifies operations and provides human-readable alerts for analysts</li> STIX-compliant, real-time detection of Indicators of Behavior (IoBs) and generation of STIX message events</li> </ul> Real-Time IoB Threat Hunting 2024-06-14T00:00:00+00:00 The Problem</h2> Modern threat detection requires data – lots of data – typically from multiple sources. This brings with it a number of interesting data engineering challenges, especially when we want to materialize that data into a single view and execute analysis in a timely and cost-effective manner. Finding indicators of behavior (IoBs) in real time amplifies already significant challenges: processing enough of the right kind of data from multiple sources in a timely fashion is beyond the capability of most systems. The Solution</h2> Quine + Novelty Detector together cover all aspects of real-time, automated, behavior-based threat hunting: Quine is used to detect known patterns (STIX) and emit scripted playbook responses (CACAO), while Novelty Detector uses patented categorical anomaly detection techniques to identify emerging threat patterns that are eventually fed back into Quine as new IoB patterns. Quine Enterprise provides commercial support and licensing for both clustered Quine and Novelty Detector, so you can easily add real-time, behavior-based threat hunting to your stack. 
‍ Key Value Take Away</h2> Quine + Novelty Detector detect both known and emerging behavioral patterns in a single workflow.</li> STIX Compliant, real-time detection of Indicators of Behavior (IoBs) and generation of STIX message events</li> Joins multiple data sets to enable real-time identification of IoBs across domains</li> Automate STIX indicator additions via API, as well as CACAO Playbook event triggers for remediation</li> Native support for categorical data simplifies operations and provides human-readable alerts for analysts</li> </ul> Real-time AWS CloudTrail Threat Detection 2024-06-14T00:00:00+00:00 The Problem</h2> AWS CloudTrail logs are full of untapped information that can help reduce risk and improve event response times, especially when analyzed in context and in real time. A thatDot cyber security customer seeking to expand their offerings to include threat detection monitoring of AWS CloudTrail logs faced three challenges. They needed to: Reliably identify hard-to-detect insider and external threats using Indicators of Behavior (IoB) analysis</li> Generate highly informative alerts that low-tech customers could understand and act on</li> Shorten development cycles on new products</li> </ul> Typical use cases for their new product would include identifying both existing employees misusing credentials to access restricted resources and outsiders using valid but compromised credentials. This combines two of the toughest cyber-security challenges in the industry. The Solution</h2> Finding New Emerging Threat Behaviors, In Real-time (as attacks are happening)</h3> The team at thatDot solved the client's threat detection problem with the first modern threat-hunting stack to combine real-time identification of unknown or emerging threats. 
It pairs Novelty Detector, which finds those emerging threats, with an event processing system that can instantly identify known patterns and act on them (Quine Enterprise). Novelty Detector is a new graph AI technique built on the Quine streaming graph that uses categorical data from events (e.g. IP addresses, file names, file paths, API call types) in order to understand the context within which user and system actions take place. This rich context is used to evaluate behaviors in order to identify novel behaviors in real time, with a notably low incidence of false positives. Novelty Detector results are displayed as a graph, making them easy to understand and act on. (From VAST use case.</a>) Novelty Detector separates truly novel events from those that are unique but not a threat. When it comes to instantly identifying and acting on known threats, including ones previously detected by Novelty Detector and classified, the client used the Quine streaming graph. They used standing queries to monitor the graph for patterns indicative of malicious behavior. And because Quine is not limited by time windows, they were able to build a threat detection system that monitored for a broader range of threat behaviors than traditional complex event processing systems and XDRs allow. Quine is ideal for SaaS businesses. Quine Enterprise can ingest millions of events/second from multiple streams, combine them into a single graph view, detect patterns for known threat indicators, and act instantly to emit contextually rich alerts. Human-Readable Results</h3> Both Quine and Novelty Detector are based on the same knowledge graph technology, which makes use of categorical data. This means the data structures they create and output -- node objects, their properties, and the relationships between those objects -- are expressed in a familiar human-readable format (subject, predicate, object). As a result, alerts are easy to understand and immediately contextualized. 
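To make the (subject, predicate, object) shape concrete, here is a minimal sketch that flattens a tiny property graph into triples. It is an illustration only, not the output format of Quine or Novelty Detector; the node IDs, property names, and edge labels are hypothetical.

```python
def graph_to_triples(nodes, edges):
    """Flatten node properties and edges into (subject, predicate, object) triples.

    nodes: dict mapping a node ID to a dict of its properties
    edges: list of (source ID, edge label, target ID) tuples
    """
    triples = []
    for node_id, properties in nodes.items():
        for key, value in properties.items():
            triples.append((node_id, key, str(value)))
    for source, label, target in edges:
        triples.append((source, label, target))
    return triples


if __name__ == "__main__":
    # Hypothetical alert context: who did what, with which privileges.
    nodes = {
        "user:alice": {"department": "finance"},
        "role:admin": {"granted_days_ago": 2},
    }
    edges = [("user:alice", "ASSUMED", "role:admin")]
    for subject, predicate, obj in graph_to_triples(nodes, edges):
        print(subject, predicate, obj)
```

Each printed line reads as a short statement about one entity, which is the property that lets an analyst scan an alert without decoding numeric feature vectors.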
Knowing who did what when, whether or not they had the privileges to do so, how long they had those privileges, and similar contextual information -- all quite easy to generate with Quine and Novelty Detector -- means SOC/NOC analysts don't need to spend exorbitant amounts of time researching alerts. Fast Time To Market</h3> Quine Enterprise with Novelty Detector made development fast and straightforward. With both unknown and known threats covered, the client was able to quickly launch a threat detection product to round out their growing portfolio of cyber security products. Key Value Take Away</h2> Fewer false positives using shallow learning method that processes categorical data.</li> Profiles behavior (IoBs) instead of finding indicators of compromise (IoCs).</li> Contextually rich alerts in a human-friendly form make it easier for analysts to research and resolve.</li> Real-time processing of data means none of the delays of batch processing.</li> Scales to millions of events per second, making it suitable for fast-growing SaaS providers.</li> </ul> Webinar: Approach Zero False Positive Cyber Alerts 2024-06-10T00:00:00+00:00 We would like to invite you to an exclusive webinar featuring thatDot’s CEO, Ryan Wright, and The Bloor Group CEO, Eric Kavanagh, along with Top 5 Global Cybersecurity Thought Leader, Mark Lynd, as they discuss "The Unreasonable Effectiveness of Streaming Graph." This insightful discussion is a must-attend for anyone serious about cybersecurity, threat detection, and deep, real-time analytics. Event Details: Title: The Unreasonable Effectiveness of Streaming Graph Date: June 11, 2024 Time: 12 pm ET Register for Webinar</a> Why You Should Attend</h2> Cybersecurity experts face immense challenges with frequent data breaches dominating the headlines. Traditional anomaly detectors often fall short in identifying and neutralizing threats in real-time. 
They require constant human tweaking of threat signatures and sensitivity levels to avoid exhausting professionals with mountains of false positive alerts. What if you could build contextual awareness into the application? Join us to discover how thatDot Novelty, powered by the open-source technology Quine, is revolutionizing real-time threat detection and response. This cutting-edge technology combines event stream processing speed with a built-in AI that learns the contextual fingerprint of your data environment, and pinpoints problems automatically. Developed in a DARPA project, thatDot Novelty and thatDot Streaming Graph provide unparalleled capabilities in: Advanced Persistent Threat Detection</li> Insider Threat Detection</li> Attack Graph Analysis</li> Digital Twins</li> And many more critical use cases</li> </ul> Don’t miss this opportunity to learn from industry experts and gain a competitive edge in cybersecurity. Secure your spot today and be part of the future of threat detection and data analytics. We look forward to your participation. 4 Advantages to Streaming Analytics in Graph Form 2024-06-06T00:00:00+00:00 Existing software has forced people to choose between asking deep questions or getting their answers fast. Graph databases can do some pretty amazing things, but they’re not known for their analytical speed. If you want powerful fraud detection or cybersecurity threat detection, graph data analysis is a good choice. But if you need it fast, to turn that into fraud prevention and cybersecurity threat protection, graph databases are not a good choice. Streaming analytics makes sense to get the immediate level of speed that you need. But most event stream processing frameworks analyze data as if it were a standard row and column coming in one message/row at a time. That means they’re not ideal for things like fraud detection and cybersecurity that require complex relationship, pattern, and anomaly type analysis. 
Example in cybersecurity of being forced to choose between deep slow graph database analytics and fast shallow streaming analytics when what you need is fast deep analytics. You could do streaming analytics or you could do graph analytics, but not both. What you need to solve a lot of tough problems is event stream processing that sees the flowing data as a graph. That’s what thatDot does. It’s unique in the industry as far as I know. Being a pioneer can be a major problem because folks don’t know where you fit or what to compare you to. In particular, since there aren’t other options out there, why is this something you might need? Previous technology is like the guy who lost something, and looks where there’s adequate light, even though he knows it wasn’t lost there. It doesn’t really work, but making do with what you have is the only option. Event stream processing in graph form shines the light where you need it, in fast deep streaming analytics. Here are four advantages of doing graph analysis in an event stream that come immediately to mind: Shift analysis left for smart filtering. Because the data can be transformed and analyzed while it’s still flowing, you can intelligently filter out data you don’t want, even resolve duplicates from multiple data streams before dropping the masses of data into expensive databases. In the case of IoT, it can shift left all the way to the edge, and only push sensor notifications that matter. Instead of 1000 useless identical readings and three different readings, only the data that matter are sent on for analysis or action. </li> Ask deep relationship questions on the fly. The questions most event processors can ask do not dig deep into relationships and patterns simply because that’s not how they look at data. Finding the entities, properties, and analyzing the relationships to find important patterns on the fly is what streaming graph excels at. </li> Analyze categorical data without turning it into numeric data first. 
A lot of important data is categorical, such as IP addresses, names, and location information. State of the art without graph streaming analytics is to first convert that data into wide, sparse numerical information, then do analysis on that bloated numeric data, then turn it back into categorical data to return an answer. (All our tools only work on numbers, so let’s only look there.) thatDot Streaming Graph lets you analyze categorical data as categorical data, immediately. No delay while you land and fiddle with the data so your tools can analyze it. </li> Reduce mean time to value (MTTV). Get actionable answers immediately, sub-milliseconds. Landing the data in a graph database, cleaning and preparing the data, then finally doing some analyses and visualizing it takes time. </li> </ol> Doing graph analysis right in your event stream reduces the time to get answers from hours to seconds, or from seconds to milliseconds. This can mean the difference between fraud detection and fraud prevention, between finding out you were breached and catching a cybercriminal in the act. To learn more, check out https://www.thatdot.com/</a>. You’d be surprised at what is possible now. The Future of Modern Threat Hunting is Streaming Graph 2023-11-30T00:00:00+00:00 Towards a new model of threat hunting</h2> The continuous expansion of threat vectors and attack techniques requires a modern threat hunting architecture capable of large scale operations, real-time deep/complex event processing to identify Indicators of Behavior (IoB), and programmable automation to best leverage scarce SOC expertise. Central to the evolution from after-the-fact Indicators of Compromise (IoCs) to IoBs is the need to embrace an event driven architecture. Many industry initiatives aim to codify the intersection points between data sources, analysis systems, and remediation solutions. These efforts are centered around two characteristics that align with thatDot software in significant ways. 
A focus on behavior analysis - The evolution from the use of Indicators of Compromise (IoC) to Indicators of Behavior (IoB) has been driven by the desire to evolve from seeking static definitions of a completed attack (file# or an IP), to an understanding of how an attack happens. This change in perspective creates the opportunity to find attacks earlier, and with more flexibility. Use of graph data modeling - Representing behavior and relationships is a natural fit for graph data modeling techniques. Graph data structures are terrific at expressing the relationships between entities, which simplifies analysis and infrastructure, so much so that STIX Indicators and the Kestrel protocol assume the use of graph systems for their operation. Image source: available here</a>. New Standards Reduce Friction</h2> The cybersecurity industry is active on many fronts defining standards to smooth the frictions that exist between data sources, analysis engines, SIEMs, and automated response systems. Several of these standards include: STIX™ Indicators - Indicators convey specific observable patterns combined with contextual information intended to represent artifacts and/or behaviors of interest within a cyber security context. [Read more here.</a>] Kestrel - The Kestrel threat hunting language provides an abstraction for threat hunters to focus on the high-value and composable threat hypothesis development instead of specific realization of hypothesis testing with heterogeneous data sources, threat intelligence, and public or proprietary analytics. [Read more here.</a>] CACAO - Defines the schema and taxonomy for collaborative automated course of action operations (CACAO) security playbooks and how these playbooks can be created, documented, and shared in a structured and standardized way across organizational boundaries and technological solutions. 
[Read more here.</a>] These standards fit well with thatDot’s approach to a modern threat hunting stack, one that is powered by thatDot’s Quine streaming graph to detect and instantly alert on known patterns and that uses thatDot Novelty Detector to identify emerging threat behaviors in real time. Highly Scalable IoB Pattern Recognition</h2> The evolution from a reactive IoC threat hunting model to a real-time IoB-based approach requires a new set of technical capabilities along with the tools to deliver them. Fortunately, the advent of IoB threat hunting, new standards, and ground-breaking streaming graph technology are all emerging to meet the need. As shown below, thatDot’s open source Quine streaming graph aligns with the requirement to ingest multiple data streams and natively process IoBs encoded in a graph data model, then generate events that invoke predefined remediation actions. The workflow is as follows: Event sources are ingested from any common event stream queue, including Apache Kafka, AWS Kinesis, AWS SQS, or Apache Pulsar/DataStax Astra Streaming.</li> STIX-defined IoBs are loaded into Quine using Kestrel graph objects via API, or entered manually, as Quine standing queries</a>.</li> Quine standing queries continuously analyze newly arriving events for matches against IoB pattern definitions. Partial matches are identified and stored for any desired period of time to accommodate threat behaviors that occur incrementally over longer time frames.</li> Upon a full IoB pattern match, Quine generates a new event that is associated with a pre-defined CACAO Playbook action, for use by SOAR or analysts.</li> </ol> The Problems Quine Solves</h2> Quine solves some hard problems in this role. Let’s take a look at a few of the major points: Multiple Event Sources</h3> Modern threat detection requires data – lots of data – usually from multiple sources. 
This brings with it a number of interesting data engineering challenges, especially when we want to materialize that data into a single view and execute analysis in a timely and cost-effective manner. Combining threat intelligence, EDR, XDR, and cloud logs is an increasingly common requirement for building a baseline of behavior models against which real-time data is assessed for known and new threats. thatDot’s Quine streaming graph is a new and powerful software tool for resolving many of the data engineering challenges associated with handling volumes of data from multiple sources. Scale For Costs - Scale graph event processing from 1,000s to 1,000,000s of events per second on commodity cloud VMs, more efficiently than nested joins. Out-of-Order Data Arrival - Quine standing queries evaluate each newly arriving event as it arrives and store partial results until the data that completes the match arrives. Entity Resolution - Graph data models are known for leveraging the additional context gained by understanding the relationships between event data. Finding Threat Behaviors</h3> IoBs are patterns of behavior expressed as actions taken by users or systems. Identifying the end-to-end pattern of an IoB across events generated by disparate systems aligns well with the Quine graph data model. Quine evaluates every single newly arriving event for a partial or full match against defined IoB patterns. This incremental approach to evaluating data is paired with a highly efficient mechanism for persisting partial matches. The result is a threat detection solution that tracks millions or billions of suspect actions until there is a complete pattern match, at which point an event is generated to serve as an alert or to trigger an automated workflow. Incremental Evaluation Of Events For IoB Patterns Across Event Sources Image source: Quine Streaming Graph White Paper</a> (PDF) Automated Responses</h3> CACAO provides a graph-based data model. 
As such, CACAO implementations should implement protections against graph queries that can potentially consume a significant amount of resources and prevent the implementation from functioning in a normal way. Identifying Novel New Behaviors</h2> Of course, the most difficult part of threat hunting is identifying new threat vectors as near to the time when they first appear as possible. This is especially difficult since attackers are intentionally working to obscure their illicit behavior in large volumes of events. Systemic approaches that use traditional anomaly detection approaches have largely failed to detect sophisticated attacks without also identifying a significant number of false positives, forcing reliance upon manual human evaluations based on intuition and increasingly scarce security expertise. thatDot Novelty Detector brings a fresh approach to the problem of detecting illicit behavior. Novelty Detector is a new graph AI technique built on the Quine streaming graph. As such, Novelty Detector natively uses categorical data in events, such as IP addresses, file names, file paths, API call types etc. to fully understand the context of user and system actions. This rich context is used to evaluate behaviors via Information Theory analysis to identify novel new behaviors in real-time, with incredibly low incidence of false positives. Once a new novel behavior is evaluated, it can then be encoded as a new IoB and fed into an operating Quine streaming graph system for immediate use on newly arriving data, or applied to previous data if desired. Separately, Quine streaming graph and Novelty Detector software offer unique capabilities for organizations and service providers: real-time processing of categorical data to find known IoB patterns (Quine) and emerging new threat patterns (Novelty Detector). 
When combined as a single platform that uses industry standards for IoB definitions and intersystem communications, the result is a comprehensive modern threat hunting and remediation stack. thatDot Streaming Graph Delivers Scalable Threat Hunting</h2> Quine is available in both open source and enterprise (thatDot Streaming Graph) editions. However, Novelty Detector is available either in the AWS marketplace or under license as part of thatDot Streaming Graph</a>. Streaming Graph offers large organizations and managed security service providers (MSSPs) both the clustered, resilient version of Quine and Novelty Detector. It is meant for production applications where resilience, query performance, and throughput matter. Resilient clustering includes support for hot spares and distribution across multiple availability zones. We recently shared reproducible tests demonstrating both scale (thatDot Streaming Graph easily processed one million 4-node graph events/second) and resilience in the face of node failure. You can read about the tests here</a>. Try It Yourself</h2> If you want to try it on your own, here are some resources to help: Download Quine - JAR file</a> | Docker Image</a> | Github</a></li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data‍</a></li> Password Spraying Attack Detection - this recipe provides an example of detecting brute force attack patterns in authentication logs</li> </ol> Header image adapted from photo by Lianhao Qu</a> on Unsplash.</a> ‍ Monitoring Quine Streaming Graph using Grafana + InfluxDB 2023-06-06T00:00:00+00:00 Monitoring Data in Motion</h2> There has been a significant increase in the popularity of event streaming and stream processing applications/technologies within the data engineering community. 
With the accelerating growth of big data, IoT, and cloud computing, more organizations are facing the challenge of extracting actionable insights earlier in the event pipeline. For historical reasons, operational tools for monitoring, alerting, and diagnosing system issues are oriented toward data at rest. That doesn't mean they can't be just as useful for monitoring data in motion. It just means adjusting your monitoring regime to a streaming mindset. From Emerging Architectures for Modern Data Infrastructure - Andreessen Horowitz</a> NOTE Darker boxes indicate new or meaningfully changed categories since v1 of the architecture in 2020; lighter colored boxes indicate categories that have largely remained unchanged. Gray boxes are considered less relevant to this blueprint. A good example of a next-gen streaming infrastructure element is Quine. Quine is an event streaming technology designed to process graph-shaped event streams and produce high-value events in real time. In this blog post, we'll guide you through setting up Grafana backed by InfluxDB to monitor a Quine instance. We'll show you how to configure Quine to send data to InfluxDB, create a dashboard in Grafana to visualize this data, and use Grafana's powerful features to detect issues and anomalies in real time. By the end of this post, you'll have a solid understanding of how to monitor event stream pipelines using Grafana and InfluxDB, and you'll be equipped with the tools and knowledge needed to keep Quine running smoothly. Setting up Grafana and InfluxDB</h2> Grafana</a> is a tool that helps you visualize and understand operational metrics data. It lets you create visual dashboards to monitor and analyze data from sources across your data infrastructure. DevOps teams use Grafana metrics dashboards to make informed decisions. The observability subsystem for Quine is built for Grafana integration. Above is an example of my typical development and testing environment when working on a recipe</a>. 
The event sources and output sinks change depending on the scenario, but most of the time, I run Quine on my local host, configured to push metrics to InfluxDB and visualize the observations in Grafana. Using Docker containers makes it easy to configure and clean up my environment quickly. We need to do a little pre-work before launching the Docker containers. This is how I set up my environment using docker-compose</code>. You may do things differently based on how Docker is installed on your host. I like to keep docker-compose.yaml</code> files arranged inside their directories in a docker</code> directory that lives in $HOME</code>. This helps me keep things organized and makes sharing configs between my MacOS laptop and Ubuntu servers easy. I created a zip file</a> of my config to download and use with the blog post.
cd $HOME
wget https://quine-recipe-public.s3.us-west-2.amazonaws.com/quine-grafana-docker.zip
unzip quine-grafana-docker.zip
Archive: quine-docker.zip
  inflating: docker/cassandra/docker-compose.yaml
  inflating: docker/grafana/docker-compose.yaml
   creating: docker/grafana/grafana-provisioning/
   creating: docker/grafana/grafana-provisioning/datasources/
  inflating: docker/grafana/grafana-provisioning/datasources/datasource.yml
   creating: docker/grafana/grafana-provisioning/dashboards/
  inflating: docker/grafana/grafana-provisioning/dashboards/quine.json
  inflating: docker/grafana/grafana-provisioning/dashboards/dashboard.yaml</code></pre>
NOTE I included a docker-compose</code> file for Cassandra in the zip archive. I won't cover the Cassandra config in this article. The file is included as a reference if you choose to separate your persistent storage from the application to keep from competing for server resources. See the Cassandra Persistor</a> docs for a sample configuration file. You now have this directory structure in your $HOME</code> dir. 
docker
├── cassandra
│   └── docker-compose.yaml
└── grafana
    ├── docker-compose.yaml
    └── grafana-provisioning
        ├── dashboards
        │   ├── dashboard.yaml
        │   └── quine.json
        └── datasources
            └── datasource.yml</code></pre>
With Docker configured and the quine-docker.zip</code> files loaded on your virtualization host, it's time to start the containers so that they are ready to receive data from Quine. Change into the grafana</code> directory and start the InfluxDB/Grafana stack:
docker compose up -d</code></pre>
You should see something similar to this appear in your terminal window:
[+] Running 18/18
 ⠿ grafana Pulled 8.7s
   ⠿ f56be85fc22e Pull complete 2.8s
   ⠿ 9efeca377709 Pull complete 3.0s
   ⠿ b4608283f0dd Pull complete 3.5s
   ⠿ 94ba646ecfcd Pull complete 3.9s
   ⠿ 6730f2b3d4cf Pull complete 4.1s
   ⠿ 871e090050be Pull complete 4.4s
   ⠿ 03d60ad4c029 Pull complete 5.7s
   ⠿ baaa3e79bf5c Pull complete 7.6s
   ⠿ 01c0c058d3df Pull complete 7.7s
 ⠿ influxdb Pulled 9.6s
   ⠿ 918547b94326 Pull complete 7.4s
   ⠿ 5d79063a01c5 Pull complete 7.7s
   ⠿ a8e9798c2a3f Pull complete 7.8s
   ⠿ e8074b4fc936 Pull complete 8.5s
   ⠿ a913b4722330 Pull complete 8.5s
   ⠿ 9c8265b2cf7a Pull complete 8.6s
   ⠿ 9037f1aeb9df Pull complete 8.6s
[+] Running 4/4
 ⠿ Volume "grafana_grafana-storage" Created 0.0s
 ⠿ Volume "grafana_influxdb-storage" Created 0.0s
 ⠿ Container grafana-influxdb-1 Started 0.5s
 ⠿ Container grafana-grafana-1 Started 0.7s</code></pre>
Verify that the containers are running:
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
NAMES                STATUS         PORTS
grafana-grafana-1    Up 4 seconds   0.0.0.0:3000->3000/tcp
grafana-influxdb-1   Up 4 seconds   0.0.0.0:8086->8086/tcp</code></pre>
Congratulations! 🎉 InfluxDB and Grafana are running in separate containers and listening on their default ports. (The Cassandra compose file from the archive was not started in this walkthrough.) Configuring Quine to Send Metrics Data</h2> Enable metrics reporting in Quine via configuration parameters that can be passed as Java system properties with -D</code> or contained in a Quine configuration</a> file. 
Quine can report metrics to jmx</code>, csv</code>, influxdb</code>, and slf4j</code> for analysis. The jmx</code> metrics reporter is enabled by default.
java \
  -Xmx12G -Xms12G \
  -Dquine.metrics-reporters.1.type=influxdb \
  -Dquine.metrics-reporters.1.database=db0 \
  -Dquine.metrics-reporters.1.period=30s \
  -Dquine.metrics-reporters.1.host={container_host} \
  -jar quine-1.5.4.jar \
  -r wikipedia --force-config</code></pre>
A couple of things to note when passing configuration as system properties.
The -D</code> parameters must come before -jar</code></li>
When launching Quine with a recipe (-r</code>) you also have to pass --force-config</code></li>
</ul>
Alternatively, you can pass the following configuration stored in quine-metrics.conf</code> to Quine to accomplish the same thing. Create a quine-metrics.conf</code> file containing the HOCON configuration from the documentation</a>.
quine {
  # where metrics collected by the application should be reported
  metrics-reporters = [
    {
      # Report metrics to an influxdb (version 1) database
      type = influxdb

      # required by influxdb - the interval at which new records will
      # be written to the database
      period = 30

      # Connection information for the influxdb database
      database = db0
      scheme = http
      host = {container_host}
      port = 8086

      # Authentication information for the influxdb database. Both
      # fields may be omitted
      # user = admin
      # password = admin
    }
  ]
}</code></pre>
Then launch Quine, passing the configuration file on the command line.
java -Dconfig.file=quine-metrics.conf -jar quine-1.5.4.jar -r wikipedia --force-config</code></pre>
Quine Metrics</h2> Quine reports three classes of metrics: counters, timers, and gauges. TIP! When queried, the metrics summary</a> API endpoint reports the same metrics as a metrics reporter. Counters</h3> Quine uses counters to accumulate the number of times that events occur. Counters can return either a value or a histogram.
node.edge-counts.*</code>: Histogram-style summaries of edges per node</li>
node.property-counts.*</code>: Histogram-style summaries of properties per node</li>
shard.*.sleep-counters</code>: Count the lifecycle state of nodes managed by a shard</li>
</ul>
Timers</h3> Quine reports the elapsed time in milliseconds it takes to perform persistor operations.
persistor.get-journal</code>: Time taken to read and deserialize a single node's relevant journal</li>
persistor.persist-event</code>: Time taken to serialize and persist one message's worth of on-node events</li>
persistor.get-latest-snapshot</code>: Time taken to read (but not deserialize) a single node snapshot</li>
</ul>
Gauges</h3> Quine gauges report metrics as a value.
memory.heap.*</code>: JVM heap usage</li>
memory.total</code>: JVM combined memory usage</li>
shared.valve.ingest</code>: Number of current requests to slow ingest for another part of Quine to catch up</li>
dgn-reg.count</code>: Number of in-memory registered DomainGraphNodes</li>
</ul>
Create a Dashboard in Grafana</h2> A dashboard in Grafana contains a series of panels that provide an at-a-glance view of how Quine is performing.
Log into Grafana. The username and password for the container are admin:admin.</li>
Decide whether to change the default password or skip that step</li>
</ul>
If you launched Grafana using the docker-compose</code> files from the quine-docker.zip</code> file that I provided, you will see a dashboard called "Quine - Monitor a Recipe" in the lower left-hand corner of the Dashboards card. Click on that dashboard to open it. Initially, the dashboard will be empty. It will fill in as you run a recipe. Let's start Quine with the Wikipedia recipe and the quine-metrics.conf</code> file from above to get familiar with each visualization.
java -Dconfig.file=quine-metrics.conf -jar quine-1.5.4.jar -r wikipedia --force-config</code></pre>
Metrics will populate the dashboard after about 30 seconds once Quine is running. 
You may need to reload your browser to have Grafana pull all of the metrics from InfluxDB. Also, be sure to set the time range in the upper right corner of the dashboard to "Last 15 minutes" to ensure that you have a current time range selected to visualize. Your dashboard will begin to populate like this: A Grafana dashboard view for Quine running the Wikipedia ingest recipe. Hover over each graph in the dashboard to expose a "three-dot" menu in the upper right hand corner of the panel. Click on the menu and select "edit" to review how each visualization is configured. Some visualizations use the query builder, and some are written directly as an InfluxDB query. Please modify the dashboard to match your environment and satisfy your needs. What I've Learned Monitoring Quine</h2> Monitoring a streaming graph is similar to any other database, with a few additional key metrics to watch. Quine is backpressured, which means that the performance of the persistence subsystem affects the flow of events in the graph.</li> Java garbage collection impacts backpressure. It is normal for Quine ingest rates to fluctuate as Java manages the heap. Keep an eye on when your heap consumption approaches the max memory configured for Java. I've found the best performance when launching Quine with a 12G (-Xmx12G -Xms12G</code>) memory allocation pool.</li> </ul> Conclusion</h2> The metrics dashboard built into the Exploration UI is good for understanding how Quine is currently operating. However, monitoring the performance of a recipe or solution over time requires a DevOps tool like Grafana. This blog will get you up and running with a sample dashboard that replicates all of the gauges in the Exploration UI that you can modify to suit your needs. 
Calculate Risk and Optimize Asset Allocation in Real Time 2023-05-31T00:00:00+00:00 The Hidden Cost of Batch Processing for Financial Institutions</h2> The recent failures at financial institutions like First Republic Bank, Signature Bank, and even Silicon Valley Bank have brought issues of regulatory compliance and capital management to the forefront for both industry members and the wider public alike. One thing these events have exposed is that the financial industry largely relies on an approach to managing mandated operational risk capital requirements, batch processing, that is ill-suited to the direction both the market and compliance are heading. Operationally, batch processing is time-consuming, costly, and often must take place in constrained time windows between market close and open. The knock-on financial effect of the operation limitations of batch processing are more impactful: institutions are slow to react to changing market conditions, which can lead to over- or under-allocation of certain classes of funds. Real-time Risk Calculation and Asset Allocation</h2> Using Quine streaming graph, financial institutions can respond to market changes in real time, providing adequate coverage for risk exposure while ensuring compliance minimally affects asset allocation. At a high level, Quine accomplishes this by doing what it does best: combining multiple feeds in real-time to build hierarchical models of elements like markets, trading entities, risk classes, and asset values, that adjust in real time to changing market conditions. 
At a specific level, we have created a Quine recipe that demonstrates, in the context of regulatory monitoring requirements like the Basel III</a> Liquidity</a> Coverage Ratio (LCR)</a>, Net Stable Funding Ratio (NSFR)</a> and liquidity risk monitoring tools as described in https://www.bis.org/bcbs/basel3.htm</a>, the following: Calculating risk while taking into account complex interdependencies and rules.</li> Constantly recomputing liquidity-indexed risk to determine capital requirements relative to market conditions.</li> Normalizing multiple sources to calculate relative value of assets and roll up the results to determine near-real time liquidity in event liquidation is necessary.</li> </ul> A sample view from the graph this recipe generates. The recipe can be found here</a>. Quine Developer Site 2.0</h2> As part of our continued focus on improving the Quine developer experience, we’ve made significant changes to the Quine.io site. The most notable change is a total restructuring of the recipe pages to interleave code and contextual or documentary information. Recipe documentation now includes a full walkthrough of a recipe and an explanation of how the recipe works so that recipes can also act as training material. The new, more structured recipe page. Other changes include: Improved developer journey by separating tutorials (getting started</a>), technical docs</a>, and recipe docs</a> into their own sections of the site</li> Full release notes and release history included in the downloads</a> page</li> Direct links to Quine blog posts, events, and self-service demos are now on the info</a> page</li> </ul> You can still download and easily get started with Quine (hint, hint) and we’d love to hear your feedback</a> and add features you think might help you build great things with Quine. ‍ ‍ Create a Quine Icon Library with Python 2023-04-25T00:00:00+00:00 Have you ever wanted to add flair to a graph visualization but are unsure which icons Quine supports? 
In this blog, we explore a Python script that fetches valid icon names from the web, configures the Exploration UI, then creates a graph of icon nodes for reference. The script uses several popular Python libraries, including Requests, BeautifulSoup, and Halo, along with the /query-ui and /query/cypher API endpoints.

Environment

Before we start, we need to ensure that the following are available. The script uses the requests, beautifulsoup4, log_symbols, and halo libraries, which you can install using pip:

- Quine
- Python 3
- Requests library (pip install requests)
- BeautifulSoup library (pip install beautifulsoup4)
- Optional Halo library for operation visuals (pip install log-symbols halo)

Start Quine so that it is ready to run the script.

    java -jar quine-1.5.3.jar

The Script

The script begins by importing the required libraries:

    import requests
    import json
    from halo import Halo
    from log_symbols import LogSymbols
    from bs4 import BeautifulSoup

Build a list of icon names

We use the requests library to GET the webpage referenced in the Replace Node Appearances API documentation. Quine supports version 2.0.0 of the Ionicons icon set from the Ionic Framework. The page lists the 733 icons supported by Quine. A try...except block handles any errors that might occur during the request. If the request is successful, the script saves the HTML content of the page.

    try:
        url = "https://ionic.io/ionicons/v2/cheatsheet.html"
        response = requests.get(url)
        # Treat HTTP error statuses (4xx/5xx) as failures too
        response.raise_for_status()
        html = response.content
        print(LogSymbols.SUCCESS.value, "GET Icon Cheatsheet")
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)

Next, we use BeautifulSoup to parse the HTML content of the page to extract all of the icon names.
The soup.select method takes a CSS selector. Here, input.name finds all <input> elements with the class name and returns them as a list. We loop over that list later to read the value attribute of each tag. We output len(all_icons) to verify that we identified all of the icons.

    soup = BeautifulSoup(html, "html.parser")
    all_icons = soup.select("input.name")
    print(LogSymbols.SUCCESS.value, "Extract Icon Names:", len(all_icons))

Create Node Appearances

Now that we have the icon names, we can use them to create node appearances for the Quine Exploration UI. We'll use the json package to format the nodeAppearances data as JSON, and requests to replace the current nodeAppearances with a PUT to the /query-ui/node-appearances endpoint. We wrap the API call in try...except as before to handle any errors. Each entry in nodeAppearances has these fields:

- predicate: filter which nodes to apply this style
- size: the size of the icon in pixels
- icon: the name of the icon
- label: the label of the node

Note: Cypher does not allow dash (-) characters in node labels. We get around this by replacing all of the dashes with underscores in the node labels.

    nodeAppearances = [
        {
            "predicate": {
                "propertyKeys": [],
                "knownValues": {},
                "dbLabel": icon_name["value"].replace("-", "_")
            },
            "size": 40.0,
            "icon": icon_name["value"],
            "label": {
                "key": "name",
                "type": "Property"
            }
        } for icon_name in all_icons]

    json_data = json.dumps(nodeAppearances)

    try:
        headers = {"Content-type": "application/json"}
        response = requests.put(
            "http://localhost:8080/api/v1/query-ui/node-appearances",
            data=json_data,
            headers=headers)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
    print(LogSymbols.SUCCESS.value, "PUT Node Appearances")

Create Icon Nodes

Finally, our script creates icon nodes by sending a series of POST requests to the Quine /query/cypher endpoint.
For each icon name, a Cypher query creates the corresponding icon node and connects it to the appropriate group node. We use Halo to create a spinner while we POST the icon data to Quine.

    # The original listing does not show how the spinner is created;
    # this constructor call is an assumption.
    quineSpinner = Halo(text="POST Icon Nodes", spinner="dots")

    try:
        quineSpinner.start()
        for icon_name in all_icons:
            name = icon_name["value"]
            label = name.replace("-", "_")  # Cypher labels cannot contain dashes
            group = name.split("-", 2)
            if len(group) == 3:
                # Three-part names: prefix <- group <- icon
                query_text = (
                    f'MATCH (a), (b), (c) '
                    f'WHERE id(a) = idFrom("{group[0]}") '
                    f'  AND id(b) = idFrom("{group[1]}") '
                    f'  AND id(c) = idFrom("{name}") '
                    f'SET a:{group[0]}, a.name = "{group[0]}" '
                    f'SET b:{group[1]}, b.name = "{group[1]}" '
                    f'SET c:{label}, c.name = "{name}" '
                    f'CREATE (a)<-[:GROUP]-(b)<-[:GROUP]-(c)'
                )
            else:
                # Shorter names: prefix <- icon
                query_text = (
                    f'MATCH (a), (c) '
                    f'WHERE id(a) = idFrom("{group[0]}") '
                    f'  AND id(c) = idFrom("{name}") '
                    f'SET a:{group[0]}, a.name = "{group[0]}" '
                    f'SET c:{label}, c.name = "{name}" '
                    f'CREATE (a)<-[:GROUP]-(c)'
                )
            quineSpinner.text = query_text
            headers = {"Content-type": "text/plain"}
            response = requests.post(
                "http://localhost:8080/api/v1/query/cypher",
                data=query_text,
                headers=headers)
        quineSpinner.succeed("POST Icon Nodes")
    except requests.exceptions.Timeout as timeout:
        # fail() stops the spinner and shows the message
        quineSpinner.fail(f"Request Timeout: {timeout}")
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)

Running the script

At this point, we are ready to run the script and visualize the icons supported in Quine.

    python3 iconLibrary.py

The script updates the console as it moves through the blocks of code that we described above:

    ✔ GET Icon Cheatsheet
    ✔ Extract Icon Names: 733
    ✔ PUT Node Appearances
    ✔ POST Icon Nodes

Navigate to Quine in your browser and load all of the nodes that we just created into the Exploration UI. There are multiple ways to load all of the nodes in the UI; for this example, we use MATCH (n) RETURN n.
The Exploration UI will warn that you are about to render 787 nodes which is correct for all of the icons and grouping nodes generated by the script. Hit the OK button to view the graph. Note: If you already had Quine open in a browser before running the script, you will need to refresh your browser window to load the new nodeAppearances</code> submitted by the query in order for the nodes to render correctly. In our case, the nodes are jumbled when they are first rendered. Click the play button in the top nav to have Quine organize the graph. Our result produced the graph visualization of all supported icons below: Conclusion</h2> There you have it, a graph visualization using all of the icons Quine supports! This script can generate the nodeAppearances</code> graph and serve as a starting point if you are looking to automate fetching non-streaming data from websites to enrich streaming data stored in Quine. If you want to learn more about Quine or explore using other API libraries with Quine, check out the interactive REST API documentation available via the document icon in the left nav bar. The interactive documentation is a great place to submit API requests. Code samples in popular languages are quickly mocked up in the docs for use when experimenting with small projects like this yourself. You can download this script and try it for yourself in this GitHub Repo. ‍ Dynamic Duo: Quine & Novelty Detector for Insider Threats 2023-04-18T00:00:00+00:00 Adding Quine to the Insider Threat Detection Proof of Concept</h2> A lot has changed since we first posted</a> the Stop Insider Threats With Automated Behavioral Anomaly Detection blog post. Most significantly, thatDot released Quine, our streaming graph, as an open</a> source project just as the industry is recognizing the value of real-time ETL and complex event processing in service of business requirements. 
This is especially true in finance and cybersecurity, where minutes (seconds or even milliseconds) can mean the difference between disaster, survival or success.

Our goal, at the time, was to show how anomaly detection on categorical data could be used to resolve complex challenges utilizing an industry recognized standard benchmark dataset, which happened to be static. The approach we used then was to pre-process (batch) the VAST Insider Threat challenge dataset with Python, then ingest that processed stream of data with thatDot's Novelty Detector to identify the bad actor.

But with a new tool in our kit, we decided to see what would be involved in updating the workflow by replacing the Python pre-processing with Quine in front of Novelty Detector in our pipeline. This involved:

1. Defining the ingest queries required to consume and shape the VAST datasets; and
2. Developing a standing query to output the data to Novelty Detector for anomaly detection.

Data from the dataset is broken into three files:

- Employee to office and source IP address mapping in employeeData.csv

    ingestStreams:
      - type: FileIngest
        path: employeeData.csv
        parallelism: 61
        format:
          type: CypherCsv
          headers: true
          query: >-
            MATCH (employee), (ipAddress), (office)
            WHERE id(employee) = idFrom("employee", $that.EmployeeID)
              AND id(ipAddress) = idFrom("ipAddress", $that.IP)
              AND id(office) = idFrom("office", $that.Office)
            SET employee.id = $that.EmployeeID, employee:employee
            SET ipAddress.ip = $that.IP, ipAddress:ipAddress
            SET office.office = $that.Office, office:office
            CREATE (ipAddress)<-[:USES_IP]-(employee)-[:SHARES_OFFICE]->(office)

- Proximity reader data from door badge scanners in proxLog.csv

      - type: FileIngest
        path: proxLog.csv
        format:
          type: CypherCsv
          headers: true
          query: >-
            MATCH (employee), (badgeStatus)
            WHERE id(employee) = idFrom("employee", $that.ID)
              AND id(badgeStatus) = idFrom("badgeStatus", $that.ID, $that.Datetime, $that.Type, $that.ID)
            SET employee.id = $that.ID, employee:employee
            SET badgeStatus.type = $that.Type,
                badgeStatus.employee = $that.ID,
                badgeStatus.datetime = $that.Datetime,
                badgeStatus:badgeStatus
            CREATE (employee)-[:BADGED]->(badgeStatus)

- Network traffic in IPLog3.5.csv

      - type: FileIngest
        path: IPLog3.5.csv
        format:
          type: CypherCsv
          headers: true
          query: >-
            MATCH (ipAddress), (request)
            WHERE id(ipAddress) = idFrom("ipAddress", $that.SourceIP)
              AND id(request) = idFrom("request", $that.SourceIP, $that.AccessTime, $that.DestIP, $that.Socket)
            SET request.reqSize = $that.ReqSize,
                request.respSize = $that.RespSize,
                request.datetime = $that.AccessTime,
                request.dst = $that.DestIP,
                request.dstport = $that.Socket,
                request:request
            SET ipAddress.ip = $that.SourceIP, ipAddress:ipAddress
            CREATE (ipAddress)-[:MADE_REQUEST]->(request)

These ingests form a basic structure that looks like:

The ingest streams combine to create the essential graph structure.

Because we feed idFrom() deterministic, descriptive data, we have an intuitive scheme for identifying nodes, and we can query for them very efficiently (with sub-millisecond latency).

A quick query efficiently displays relevant properties from connected nodes.

Moving from Batch to Real Time Monitoring

While this is certainly an improvement from our previous workflow, it is still highly manual (i.e., having to explicitly query for the data we’re looking for). The promise of a Quine to Novelty Detector workflow is automation with real-time results. By ingesting the data in chronological order (as presented in the source files), we are able to easily match network events to the most recent associated proximity badge event in real time.
This is accomplished via standing query matches like:

    standingQueries:
      - pattern:
          query: >-
            MATCH (request)<-[:MADE_REQUEST]-(ipAddress)<-[:USES_IP]-(employee)-[:BADGED]->(badgeStatus)
            RETURN DISTINCT id(request) AS requestid
          type: Cypher
        outputs:
          print-output:
            type: CypherQuery
            query: >-
              MATCH (request)<-[:MADE_REQUEST]-(ipAddress)<-[:USES_IP]-(employee)-[:BADGED]->(badgeStatus)
              WHERE id(request) = $that.data.requestid
                AND badgeStatus.datetime<=request.datetime
              WITH max(badgeStatus.datetime) AS date, request, ipAddress
              MATCH (request)<-[:MADE_REQUEST]-(ipAddress)<-[:USES_IP]-(employee)-[:BADGED]->(badgeStatus)
              WHERE badgeStatus.datetime=date
              RETURN badgeStatus.type AS status, ipAddress.ip AS src, request.dstport AS port, request.dst AS dst

The question remains, “How do we share the standing query matches from Quine to Novelty Detector?” This can be done in a number of ways (all via standing query outputs) including, but not limited to:

1. Writing results to a file that Novelty Detector ingests;
2. Emitting webhooks from Quine to Novelty Detector; or
3. Publishing results to a Kafka topic to be ingested by Novelty Detector.

Although the first two choices will work, they are severely suboptimal. Consider a simple example of a single employee’s data:

Visualizing data from a single employee.

Writing the aggregate 115,434 matches to the filesystem would be done one record at a time (on each standing query match).

    andThen:
      type: WriteToFile
      path: behaviors.jsonl

Using webhooks suffers from the same issue as writing to a file, and adds latency from the HTTP transactions.

    andThen:
      type: PostToEndpoint
      url: http://localhost:8080/api/v1/novelty/behaviors/observe?transformation=behaviors

Ultimately, we settled on the third option as it most closely resembles production environments, and is the most performant.
andThen: type: WriteToKafka bootstrapServers: localhost:9092 topic: vast format: { type: JSON }</code></pre> The big question - did it work? Results from the Novelty Detector UI. Absolutely. The anomalous activity has been identified. Was it worthwhile? Sure, but… It Don’t Mean a Thing If It Ain’t Got That Real-Time Swing</h2> Although we were able to accomplish the same results with Quine in a single step this was still a batch processing-based exercise. The true value of a Quine to Novelty Detector pipeline is in the melding of complex event stream processing in Quine with shallow learning (no training data) techniques in Novelty Detector, providing an efficient solution for detecting persistent threats and unwanted behaviors in your network. This pattern, moving from batch processing, requiring heavy lifting and grooming of datasets, to real-time stream processing is one where Quine and Novelty Detector thrive. Try it Yourself</h2> If you'd like to try the VAST test case yourself, you can run Novelty Detector on AWS</a> with a generous free usage tier. Instructions for configuring Novelty Detector are available here. And the open source version of Quine is available for download here</a>. If you are interested there is also an enterprise version that offers clustering for horizontal scaling and resilience. And if you'd prefer a demo or have additional questions, check out Quine community slack</a> or send us an email</a>. ‍ idFrom(): the simple function that’s key to Quine streaming graph 2023-04-06T00:00:00+00:00 A simple concept at the core of a new way of processing data</h2> What’s a streaming graph? When we first released Quine streaming graph last year, we had to answer this question a lot. After all, a “streaming graph” had never existed before. As interest grew, we got pretty good at answering, usually something like this: Quine is a real-time event processor like Flink or ksqlDB. 
It consumes data from sources like Kafka and Kinesis, queries for complex patterns in event streams, and pushes results to the next hop in the streaming architecture the instant a match is made. However, unlike those venerable systems, Quine uses graph data structure. Hence, streaming graph. That seemed to work and, engineers being a curious lot, led inevitably to a second question: “How’s it different from a graph database?” That’s a fun question to answer, because it means we get to talk about idFrom()</code>. And explaining idFrom()</code> allows us to begin to unpack all the interesting architectural properties that make Quine uniquely well-suited for real-time complex event processing. "Big things have small beginnings." -- David from the film Prometheus (2012) Event-driven: what if we stopped querying databases?</h2> Unlike a graph database, which relies on an index to query for the existence of data in the graph, Quine uses idFrom()</code>, a custom Cypher function</a>. idFrom()</code> generates a unique node ID from a set of user-provided arguments – most commonly taken from the data in the event stream itself – which is then used in lieu of an index to locate and operate on a node and its properties. (We will get to the why in a bit but it will help first to look at how you use idFrom()</code>.) Say you want to analyze an event stream of edits from wikipedia to keep an eye out for edits made by specific authors to specific articles in specific databases. 
The JSON record (a pared-back version of the actual Wikipedia event feed used in the Wikipedia API recipe featured in our docs) might look like this:

    {
      "$schema": "/mediawiki/revision/create/1.1.0",
      "database": "wikidatawiki",
      "page_id": 83996749,
      "rev_id": 1869025669,
      "rev_timestamp": "2023-04-05T18:18:23Z",
      "performer": {
        "user_is_bot": true,
        "user_id": 6135162
      },
      "rev_parent_id": 1869025663
    }

To create the nodes in a continuous stream of records, you would use MATCH to declare the node names, then call the idFrom() function to generate unique node IDs based on the values in the JSON itself.

    MATCH (revNode), (pageNode), (dbNode), (userNode), (parentNode)
    WHERE id(revNode) = idFrom("revision", $that.rev_id)
      AND id(pageNode) = idFrom("page", $that.page_id)
      AND id(dbNode) = idFrom("db", $that.database)
      AND id(userNode) = idFrom("id", $that.performer.user_id)
      AND id(parentNode) = idFrom("revision", $that.rev_parent_id)

For now, we can skip adding properties to nodes, but it helps our discussion to complete this simple graph by adding relationships between the nodes:

    CREATE (revNode)-[:IN]->(dbNode),
           (revNode)-[:TO]->(pageNode),
           (userNode)-[:MADE]->(revNode),
           (parentNode)-[:NEXT]->(revNode)

Now, as each event streams in, Quine will create and connect nodes, forming the desired subgraph that looks like this:

You can see the same subgraph with node ID no longer concealed by the node labels:

Note the things you didn’t have to do to create this graph:

1. Query to find out whether the node already exists before writing to it
2. Consult a schema

Quine eliminates the need to check to see if the node exists before completing an operation. The deterministic nature of node IDs created using idFrom() means a value or combination of values passed to the function will always result in the same ID. It will either create a new node based on the value or, if that node already exists, update it.
In the latter case, because Quine is an event-sourced system, when Quine updates a node, it doesn’t need to look up if the node already exists. Quine appends the update to the existing node, preserving historical versions that can be retrieved using idFrom() with the at.time

idFrom() and CRUD operations: why Quine is so dang fast

Inasmuch as Quine uses a hash of a value to generate a node ID that is then used for CRUD operations, it bears a superficial similarity to NoSQL key-value stores. As long as you know either the ID or the value, it is dead simple to retrieve data from the graph. However, because of Quine’s in-memory graph structure, it is far more efficient and performant operating on patterns, ranges (e.g. time-ordered), or otherwise related data than key-value databases. Using the node ID to anchor the query, you specify the edges to traverse to find connected data.

This might be a query to retrieve a node’s properties using node ID (in this case, for revNode):

    MATCH (n)
    WHERE strId(n) = "8b290926-271c-3497-b5d6-e30fcf934a73"
    RETURN id(n), properties(n)

Which delivers these results:

If you don’t know a node’s ID, you can query for it using the node’s properties and the strId() function:

    MATCH (userNode:user {user_is_bot: true})
    RETURN DISTINCT strId(userNode)

But what about more complex queries, for example, a query that must retrieve multiple related objects? Key-value stores are famously inefficient in this scenario. But this is precisely where Quine’s architectural choices come in. Using an in-memory graph structure means you can query for any node in a subgraph, follow its edges, and produce one or more values.
For example, say you want to find all revisions where a bot made an update to the wikidatawiki database:

    MATCH (userNode:user {user_is_bot: true})-[:MADE]->(revNode:revision)-[:TO]->(pageNode:page)-[:IN]->(dbNode:db {database: "wikidatawiki"})
    RETURN DISTINCT id(revNode) as id, id(userNode) as id2

Either way, it starts with setting the node ID with idFrom(). And idFrom() makes Quine very, very fast.

Standing queries and querying data from the future with idFrom()

Standing queries persist inside the graph, monitoring the stream for specific patterns. They propagate throughout the graph without you ever having to issue a query again.

Standing queries persist, monitoring for matches.

Once matches are found, standing queries trigger actions using those results (e.g. report results, execute code, transform other data in the graph, publish data to another source). To do this, every standing query must have two parts: the pattern portion (what sub-graph you are matching for in the event stream) and the outputs portion (the action you wish to take).

Adapted from the recipe used in Getting Started, here’s a standing query that monitors for non-bot revisions to the enwiki database and outputs these events to the terminal:

    standingQueries:
      - pattern:
          query: |-
            MATCH (userNode:user {user_is_bot: false})-[:MADE]->(revNode:revision {database: "enwiki"})
            RETURN DISTINCT id(revNode) as id
          type: Cypher
        outputs:
          print-output:
            type: CypherQuery
            query: |-
              MATCH (n)
              WHERE id(n) = $that.data.id
              RETURN properties(n)
            andThen:
              type: PrintToStandardOut

Standing query matches printing to console.

Because standing queries persist in the graph, incrementally updating partial results as new data arrives, you are not just querying the past and present state, you are setting up queries for data yet to arrive.
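To make the idea of "querying data yet to arrive" concrete, here is a toy Python sketch. It is not how Quine implements standing queries; the class `ToyStandingQuery` and its methods are hypothetical names invented for this illustration. It only shows the general shape: a registered pattern keeps partial state and fires a callback the moment later data completes the match, regardless of the order in which the pieces arrive.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyStandingQuery:
    """Toy pattern: a non-bot user MADE a revision in the enwiki database.

    Hypothetical illustration only; Quine's real standing queries work differently.
    """
    on_match: Callable[[int], None]
    bot_flag: dict = field(default_factory=dict)  # user_id -> user_is_bot
    made: dict = field(default_factory=dict)      # rev_id -> user_id
    rev_db: dict = field(default_factory=dict)    # rev_id -> database
    emitted: set = field(default_factory=set)     # rev_ids already reported

    def _check(self, rev_id):
        # Fire once per revision, and only when every part of the pattern is known.
        user_id = self.made.get(rev_id)
        if (rev_id not in self.emitted
                and user_id is not None
                and self.bot_flag.get(user_id) is False
                and self.rev_db.get(rev_id) == "enwiki"):
            self.emitted.add(rev_id)
            self.on_match(rev_id)

    def set_user(self, user_id, is_bot):
        self.bot_flag[user_id] = is_bot
        # A user update may complete matches for revisions seen earlier.
        for rev_id, uid in self.made.items():
            if uid == user_id:
                self._check(rev_id)

    def set_revision(self, rev_id, database):
        self.rev_db[rev_id] = database
        self._check(rev_id)

    def add_made_edge(self, user_id, rev_id):
        self.made[rev_id] = user_id
        self._check(rev_id)


matches = []
sq = ToyStandingQuery(on_match=matches.append)

# The edge and the revision arrive first; the pattern is only partially satisfied.
sq.add_made_edge(user_id=7, rev_id=100)
sq.set_revision(rev_id=100, database="enwiki")
print(matches)  # [] because user 7's bot flag is still unknown

# The missing piece arrives later and completes the match.
sq.set_user(user_id=7, is_bot=False)
print(matches)  # [100]
```

The point of the sketch is that nobody re-issues a query when the last piece of data shows up; the pattern was registered once and the match is reported as a side effect of the update.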
And while idFrom()</code> is a key part of what makes standing queries possible, to really understand what makes Quine function so efficiently as a stream processor, we’ll need to dive into the actor-based, graph-shaped compute model. But that’s for a different post. Instead, I’ll leave you with a clever use of idFrom()</code> employed by developers at a SaaS company that uses Quine. Partitioning Key Spaces for a SaaS application using idFrom()</code></h2> Since you can generate a node ID by passing an arbitrary combination of values to idFrom()</code>, some Quine users with SaaS or internal multi-tenant applications have employed it to partition graphs by customer namespace or similar property. Sticking with the Wikipedia example, you could create distinct sub-graphs corresponding to each of the database types by adding $that.database</code> as an additional value determining each node ID: MATCH (revNode),(pageNode),(dbNode),(userNode),(parentNode) WHERE id(revNode) = idFrom("revision", $that.rev_id, $that.database) AND id(pageNode) = idFrom("page", $that.page_id, $that.database) AND id(dbNode) = idFrom("db", $that.database) AND id(userNode) = idFrom("id", $that.performer.user_id, $that.database) AND id(parentNode) = idFrom("revision", $that.rev_parent_id, $that.database)</code></pre> This creates a series of subgraphs partitioned by database and would allow you to be certain that if you query for data related to a specific database, you won’t inadvertently return data from others. And while the chance of key collision exists, it is vanishingly small</a>, making this approach suitable for use in multi-tenant SaaS applications. At any rate, this accomplished what the company wanted: a partitioned graph for data separation, all standing and ad hoc queries work the same across the entire graph, and the only real cost is the discipline of always using the compound key. Pretty clever. 
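The compound-key idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Quine's algorithm: `toy_id_from` is a hypothetical stand-in that derives a deterministic UUID from its arguments using the standard library's `uuid.uuid5`, and the `graph` dictionary stands in for node storage.

```python
import uuid


def toy_id_from(*values):
    """Hypothetical stand-in for idFrom(): same values in, same ID out.

    Uses uuid5 over a joined key; Quine's actual hashing scheme is not shown here.
    """
    key = "\x1f".join(repr(v) for v in values)
    return uuid.uuid5(uuid.NAMESPACE_OID, key)


graph = {}  # node_id -> properties; a toy substitute for the graph


def upsert(node_id, **props):
    # No existence check needed: the deterministic ID locates the node either way.
    graph.setdefault(node_id, {}).update(props)


events = [
    {"database": "enwiki", "page_id": 42},
    {"database": "wikidatawiki", "page_id": 42},
    {"database": "enwiki", "page_id": 42},  # repeat of the first event
]

for event in events:
    # Including the database in the key partitions pages by database.
    node_id = toy_id_from("page", event["page_id"], event["database"])
    upsert(node_id, page_id=event["page_id"], database=event["database"])

print(len(graph))  # 2: one page node per database, the repeat updated in place
```

Dropping the database argument from the key would collapse both databases' page 42 into a single node, which is exactly the cross-tenant mixing the compound key is meant to prevent.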
If any of this inspires you or piques your interest and you want to try Quine yourself, check out Getting Started</a> docs. ‍ Using Indicators of Behavior (IoB) Analysis for IoT data 2023-02-22T00:00:00+00:00 Data analysis based on Indicators of Behavior (or IoBs) has emerged as the new standard in real-time cybersecurity threat hunting. As data science practices and tooling have evolved to enable IoB analysis, we are finding that the identification of system and/or user behavior patterns, especially in real-time, extends beyond the cybersecurity domain to finance, e-commerce, and in particular, Internet of Things (IoT) use cases. The Shift to Critical IoT Data</h2> Self-driving cars, medical devices, and security monitoring are just a few examples of how IoT solutions are being applied to high stakes or critical use cases. As we leverage IoT for higher value use cases, it becomes apparent that they benefit enormously from real-time data analysis. Data from last week, yesterday, last night, or even an hour ago, doesn’t help achieve satisfactory outcomes when you’re using data to assist in navigating a car traveling 100 km/hr or monitoring an at-risk patient outside a closely monitored hospital setting. As IoT devices handle more critical operations we need more effective tools for monitoring, securing, and interpreting this data in real time. IoT Data Challenges</h2> A hallmark characteristic of IoT devices is their significant resource limitations: everything from processing power and storage to connectivity has historically been in short supply. The variability of natural environments, difficulty in upgrading firmware remotely, and intermittent communications compound the difficulty to manage networks of IoT devices in the way we’ve come to expect we can manage other large networks of connected devices.. 
Fortunately, thanks to LTE/5G and Moore’s law, IoT devices large and small increasingly have the resources to generate, process, and transmit more substantive data. The High Cost of Slow Data</h2> But this ramp up in the capacity of large numbers of connected devices just relocates the problem of processing data to the other end of the IoT network: to the data processing infrastructure. As data floods in, systems designed for intermittent trickles either can’t scale, cost too much, or, in an effort to solve both scale and cost, process data only in batches. When you add together the cost of proprietary device management software (licensing can cost as much as $1M/100,000 sensors) and data processing and storage infrastructure (e.g. SIEMs like Splunk are notoriously expensive), it is difficult to make a business case that supports the capital investment necessary to adopt and deploy IoT technology. ‍ Batch processing limited event data means opportunity costs. And while batch processing is fine for many tasks, finding and acting on sensor information when it can make a positive business impact doesn’t work if that data is processed only after it is too late to take action. Which begs the question: if use cases like preventative maintenance were the primary selling points of IoT, what’s the point? IoT is finally ready for IoB</h2> With more robust IoT devices collecting and transmitting richer data sets, there is now opportunity to apply real-time IoB detection and the threat detection and operational analysis it can deliver. With IoBs, security teams can watch for patterns of system and user behavior that indicate a notable event (cyber attack, upsell opportunity, or churn risk) is in process, rather than waiting to find evidence after the fact. When watching for IoBs , teams can build predictive models that limit negative impacts and capitalize on opportunities. 
As we look at increasingly complex IoT data we can apply this next generation of thinking to build state of the art analysis that improves the confidence and timeliness of IoT data analysis. Applying IoB Analysis to IoT Data</h2> As I described in my recent blog on the use of IoBs in modern Cyber Security Threat Hunting, IoBs draw upon the context of system and user behavior to more reliably identify attack threats. IoBs can as easily be applied to the behavior of “an about to fail” disc brake or “trending towards failure” heart valve. Importantly, the richer context behavior analysis provides significantly fewer false positives</a> than traditional analysis due to the added context provided by analyzing categorical data elements used to describe behaviors. Comparing Legacy Batch Processing of Metrics with Real-time IoB Analysis</h3> Monitoring Use Cases 1. Delivery Truck Monitoring Traditional Metrics Example (batch): Hourly or daily metrics on vehicle speed, idle time, and brake application.</li> </ul> </li> IoB Analytics Example (real time): By driver, identify patterns of rapid acceleration, turns, and braking indicating aggressive driving or an accident.</li> </ul> </li> </ul> 2. Utility Meter Monitoring Traditional Metrics Example (batch): Daily or even monthly batch analysis of usage to identify potential leaks or resource theft.</li> </ul> </li> IoB Analytics Example (real time): Real-time inputs for predictive modeling of shortages to avoid brownouts or identify leaks.</li> </ul> </li> </ul> 3. Device Security Monitoring Traditional Metrics Example (batch): Periodic metrics that report login attempts, source IP, time of day, bytes transferred.</li> </ul> </li> IoB Analytics Example (real time): Identify patterns of login success and failures, based on source IPs at different times of day over an extended time period.</li> </ul> </li> </ul> 4. 
Medical Device Monitoring Traditional Metrics Example (batch): Periodic metrics on patient temperature, glucose, heart rate, respiration.</li> </ul> </li> IoB Analytics Example (real time): Identify patterns of vital signs at different hours of the day to alert on deviations from expected patterns.</li> </ul> </li> </ul> IoB Analysis Requires New Tools</h2> The evolution from reactive batch processing of data to a real-time IoB-based approach requires a new set of technical capabilities along with the tools to deliver them. At minimum, a system must be able to: Easily combine streams of real-time and historical data.</li> Process both categorical and numerical variety.</li> Recognize and act on known patterns within this combination of categorical and numerical data, which typically requires a graph database or system that uses graph’s connected data structures.</li> Continuously analyze the constant flows of data for emerging behaviors, whether they signal threats or opportunities.</li> Update the system with newly identified threats and grow the number of IoBs for which to monitor.</li> </ol> The advent of ground-breaking streaming graph technology has emerged to meet the need: thatDot’s Quine streaming graph and Novelty Detector. thatDot’s open source Quine streaming graph aligns with the requirement to ingest multiple data streams of both categorical and numerical data. In fact, because it is a graph data processor, Quine is the only real-time event processor that works natively with categorical data, which makes it much easier to express IoBs as patterns. Because of Quine’s unique architecture, it can monitor streams for IoBs, detect their fingerprint patterns the instant they occur, and take immediate action, often in the form of predefined business rules or remediations. 
The workflow is as follows: Events are ingested from any common event stream queue, including Apache Kafka, AWS Kinesis, AWS SQS, or Apache Pulsar/DataStax Astra Streaming.</li> IoBs are defined in Quine as standing queries, the watchdog queries that monitor the streams for important patterns.</a></li> Quine analyzes newly arriving events for matches against IoB pattern definitions. Partial matches are identified and stored for any desired period of time to watch for behaviors that occur incrementally over longer time frames.</li> When Quine detects a full IoB pattern match, it generates a new event that is associated with a pre-defined set of business rules: sending alerts to a SOC or passing an upsell offer to a user are two examples.</li> All data is also passed through Novelty Detector</a>, a real-time shallow learning algorithm that identifies novel and notable behaviors that can be turned into IoBs if deemed necessary.</li> </ol> The data flow looks like this: The modern IoB workflow using Quine Enterprise. Streaming Graph Delivers End-To-End IoBs for IoT</h2> Quine is available in both open source and enterprise editions. However, Novelty Detector is available either in the AWS Marketplace or under license as part of thatDot Streaming Graph. Streaming Graph offers large organizations and connected device manufacturers both the clustered, resilient version of Quine and Novelty Detector. It is meant for production applications where resilience, query performance, and throughput matter. Resilient clustering includes support for hot spares and distribution across multiple availability zones. We recently shared reproducible tests demonstrating both scale (thatDot Streaming Graph easily processed one million 4-node graph events/second) and resilience in the face of node failure. You can read about the tests here</a>. 
Try It Yourself</h2> If you want to try it on your own, here are some resources to help: Download Quine – JAR file</a> | Docker Image</a> | Github</a></li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data‍</a></li> Novelty Detector for Log Analysis</a> – learn more about Novelty Detector in this blog showing how to use it to analyze system logs to detect novel anomalous behavior.</li> </ol> Original photo for header by Sander Weeteling</a> on Unsplash</a> Quine’s Real-time Temporal Event Sequencing Produces New Insights 2023-02-08T00:00:00+00:00 One of the fundamental advantages of Quine’s architecture compared to other complex event stream processing technologies, like Flink and ksqlDB, is that it is not constrained by time windows. We demonstrated the value of this capability in the “Are You Ready for Low and Slow Auth Attacks?” blog, where we demonstrated how you can use Quine to identify password spraying attacks that take place over extended periods, defeating legacy detection mechanisms constrained by time windowing. But what about cases where the sequence of events is critical to detecting and investigating interesting incidents? For example, when performing root cause analysis (RCA) for performance issues in a NOC or security incidents in a SOC, the temporal ordering of events is often as important as the events themselves. Event Sequencing in Real Time: A Streaming Graph Strength</h2> Event sequencing can provide key information for accurate and timely detection and analysis, even in the most complex cases where causality and temporal ordering are difficult to ascertain. The key is architecting a graph structure that can most effectively answer your questions and produce insights. 
In the case of a streaming graph solution like Quine, this means modeling the graph so queries can effectively traverse nodes and edges natively, which is always more efficient than path matching based on node properties like timestamps. This is because the relations between nodes (edges) are persisted in the nodes themselves. We like using an event sequencing technique to explicitly identify order based on a pattern match detected by one of Quine’s most powerful features, the standing query. (Standing queries</a> monitor streams for specified patterns, maintain partial matches, and execute user-specified actions the instant a full match is made.) We demonstrate this technique in the APT (Advanced Persistent Threat) Detection recipe (https://quine.io/recipes/apt-detection</a>) to create sequence edges as Quine ingests EDR (Endpoint Detection and Response) and network traffic logs while monitoring for an Indicator of Behavior (IoB) that matches malicious data exfiltration patterns. Our approach to this technique has four key components. Model a behavioral pattern as a subgraph</li> Develop Cypher to match the subgraph in the event stream</li> Encode the event sequence into the graph</li> Emit an alert containing a linkURL</code> to the subgraph inside the Quine Exploration UI</a></li> </ol> Our concern is not the timeframe of events (how quickly they happen). Rather, our focus is locating a specific sequence of events in order – WRITE->READ->SEND->DELETE</code> – regardless of the time interval across which the events occurred. A subgraph like the one below can model the data exfiltration event from the APT Detection recipe. An initial data subgraph produced by the APT Detection recipe. 
Based on the model, we develop Cypher to match the subgraph in the pattern match section</a> of a Quine standing query: MATCH (e1)-[:EVENT]->(f)<-[:EVENT]-(e2), (f)<-[:EVENT]-(e3)<-[:EVENT]-(p2)-[:EVENT]->(e4) WHERE e1.type = "WRITE" AND e2.type = "READ" AND e3.type = "DELETE" AND e4.type = "SEND" RETURN DISTINCT id(f) as fileId</code></pre> Next augment the subgraph in the standing query output</a> to overlay sequencing with the CREATE clause, adding NEXT edges between the key nodes: MATCH (p1)-[:EVENT]->(e1)-[:EVENT]->(f)<-[:EVENT]-(e2)<-[:EVENT]-(p2), (f)<-[:EVENT]-(e3)<-[:EVENT]-(p2)-[:EVENT]->(e4)-[:EVENT]->(ip) WHERE id(f) = $that.data.fileId AND e1.type = "WRITE" AND e2.type = "READ" AND e3.type = "DELETE" AND e4.type = "SEND" AND e1.time < e2.time AND e2.time < e3.time AND e2.time < e4.time CREATE (e1)-[:NEXT]->(e2)-[:NEXT]->(e4)-[:NEXT]->(e3)</code></pre> The transformed subgraph in Quine becomes this. There are three important things to note here: The synthetic NEXT</code> edges only exist after the standing query match creates them</li> The NEXT</code> edge labels enable us to efficiently traverse the WRITE->READ->SEND->DELETE</code> path with a simple Cypher query.</li> Temporal sequencing is even more difficult when dealing with multiple input sources. Imagine matching the WRITE->READ->SEND->DELETE</code> pattern where the write, read and delete events come from one source and the send from another. Quine makes it easy to combine multiple event sources</a>.</li> </ol> Once Quine identifies the event, we can explore the graph further with queries like the following: MATCH (n)-[:NEXT*]->(m) WHERE strId(n)="20b2059e-19c5-3ab6-b465-fe3593c45bc8" RETURN DISTINCT collect(m),n</code></pre> The final output -- the WRITE --> READ --> SEND --> DELETE</code> subgraph. As an alternative, use the /api/v1/query/cypher/nodes</code> API endpoint to build a dictionary of malicious file names. 
[ { "id": "f00ae947-3dd5-3c92-a84f-118b401c80f1", "hostIndex": 0, "label": "ID: f00ae947-3dd5-3c92-a84f-118b401c80f1", "properties": { "data": "/tmp/miscellaneous.data" } } ]</code></pre> You can even use quick queries</a> to follow the NEXT</code> edges in the exploration UI to find actions that occurred earlier in the event timeline. Results using quick queries to explore the final subgraph for root cause analysis. Event Sequencing Benefits From Planning</h2> Processing event streams using a streaming graph like Quine requires adjusting how you think about your data. For example, when the recipe we used in this post was first developed, it was focused on evaluating a single concern; find a specific subgraph. This required a simple plan for creating node IDs. In the original use case, the choice was to create nodes using idFrom()</code> in its most basic form id(event) = idFrom($that)</code>, which was completely reasonable at the time. Now, asking a more complex question, "Show me any process that interacts with a file named /tmp/miscellaneous.data</code> " is more difficult because the node ID namespace plan did not include using individual node parameters. This is something to keep in mind when you plan your streaming event graph! Temporal data doesn't always need to be tied to timestamps. Instead, you can use temporal categories – morning/afternoon/night, before/after, etc. Many use cases, like our data exfiltration scenario, are built from understanding the sequence of events as a subgraph. What temporal use cases do you have that could benefit from detection using graph analysis, and how long does it take to detect those patterns today? 
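The sequencing technique from this post can also be pictured outside of Cypher. The sketch below is a plain-Python illustration, not the recipe's implementation: `Event`, `build_next_edges`, and `follow_next` are hypothetical names, and the sketch assumes one event of each type and checks the complete WRITE, READ, SEND, DELETE ordering. It compares timestamps only to establish order, never to bound how far apart the events are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    type: str
    time: int  # any monotonically comparable timestamp


EXPECTED_ORDER = ["WRITE", "READ", "SEND", "DELETE"]


def build_next_edges(events):
    """Return NEXT edges (from_id, to_id) if the events occur in EXPECTED_ORDER, else []."""
    by_type = {e.type: e for e in events}
    if len(events) != len(EXPECTED_ORDER) or set(by_type) != set(EXPECTED_ORDER):
        return []
    ordered = [by_type[t] for t in EXPECTED_ORDER]
    # Only relative order matters; the gap between events can be arbitrarily long.
    if any(a.time >= b.time for a, b in zip(ordered, ordered[1:])):
        return []
    return [(a.event_id, b.event_id) for a, b in zip(ordered, ordered[1:])]


def follow_next(start_id, edges):
    """Walk the NEXT edges from a starting event, like MATCH (n)-[:NEXT*]->(m)."""
    successor = dict(edges)
    path = [start_id]
    while path[-1] in successor:
        path.append(successor[path[-1]])
    return path


DAY = 86_400
events = [
    Event("e3", "DELETE", 30 * DAY),
    Event("e1", "WRITE", 0),
    Event("e4", "SEND", 9 * DAY),
    Event("e2", "READ", 2 * DAY),
]
edges = build_next_edges(events)
print(follow_next("e1", edges))
```

Once the edges exist, reading the sequence back is a simple traversal rather than a comparison of timestamps across every candidate event.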
Try for Yourself</h2> If you want to try Quine using your own data, here are some resources to help: Download Quine JAR</a> | Docker Image</a> | GitHub</a></li> Start learning about Quine now by visiting the Quine open source project</a>.</li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data</a></li> APT Detection recipe</a> - this recipe, referenced above, demonstrates the ability of streaming graphs to process event data without time windows.</li> </ol> Graph Neural Networks for Quine 2023-01-21T00:00:00+00:00 Introduction</h2> “Which records are the most similar to this one?” That’s a straightforward question, but it hides some very thorny problems! Similar in what way? How do we measure similarity? How do we incorporate different attributes? How do we weigh different values? How do we incorporate similar relationships to other similar records? Believe it or not, solving the similarity problem is more manageable than answering those questions. This is a perfect job for machine learning! New research in the last few years has brought natural language processing (NLP) tools to graphs as Graph Neural Networks (GNNs). Graph A.I. is starting to leave the research lab and enable critical new use cases in the industry. From cybersecurity to fin-tech, to social networks, to ad placement, applications for graph A.I. are sweeping across industries. While demonstrations on small datasets have proven the effectiveness and value of graph A.I. techniques, operationalizing graph A.I. in large graphs or with streaming data has been a significant obstacle! This changes with the 1.5 release of Quine, which includes built-in functionality for many of the essential graph A.I. algorithms. This post shows how to quickly solve tricky questions like “similarity” using Quine’s new random walk graph algorithm feature. 
Take a Walk</h3> Random walks are often the central connection between graph-structured data and machine learning applications. A random walk is a string of data produced by starting at a graph node, following one of its edges randomly to reach another node, then following one of that node’s edges randomly to reach another node, and so on. Walks let us translate the possibly-infinite dimensions of graph data into linear strings we can feed to graph neural networks. Generating one random walk in Quine 1.5 can be done by calling a function in a Cypher query: MATCH (n) CALL random.walk(n, 10) YIELD walk RETURN id(n), walk LIMIT 1</code></pre> Or through an API call (if you know the ID of a node on which to start): curl --request GET --url http://localhost:8080/api/v1/algorithm/walk/b93eabe7-38d2-30fa-8f96-097d75eb1f50</code></pre>Take All the Walks</h3> The previous examples start from one node and return one random walk. But graph A.I. usually requires building many walks from every node in the graph. To support this, Quine includes an API that will generate all random walks for an entire graph, regardless of how large the graph is. Quine is a “streaming graph” that operates on continuous data streams. With an API call, you can direct Quine to stream all the random walks from every node in the graph to a file stored locally or in an S3 bucket. curl --request PUT --url http://localhost:8080/api/v1/algorithm/walk --header 'Content-Type: application/json' --data '{ "bucketName": "your-s3-bucket-name", "type": "S3Bucket" }'</code></pre>Remember When</h3> Most use cases for Quine include continuously-running data ingests that constantly modify the graph. To correctly generate a set of random walks, you need the graph to hold still while collecting random walk data. That’s hard to do when the graph is constantly being updated with every new record that streams in. Fortunately, it is straightforward to query a fixed graph in Quine using the historical query functionality. 
Include a historical timestamp with the at-time</code> parameter in your query to generate random walks from the graph as it was at that fixed historical moment. **The rest of the graph can keep changing, but the GNN will see walks from a consistent and fixed view of the graph.** curl --request PUT --url "http://localhost:8080/api/v1/algorithm/walk?at-time=$(date +%s000)" --header 'Content-Type: application/json' --data '{ "bucketName": "your-s3-bucket-name", "type": "S3Bucket" }'</code></pre>Graph Neural Networks</h2> Graph neural networks (GNNs) are a way to combine the power of graph-structured data and cutting-edge machine learning techniques. Research on GNNs is continuing rapidly, but several foundational techniques have become critical to modern applications of artificial intelligence on graph data. Node2Vec</h3> The Node2Vec algorithm is an important technique for applying some revolutionary machine learning techniques to graph data. The initial Node2Vec paper</a> by Grover and Leskovec appeared in 2016. It demonstrated how random walks on a graph behave like sentences in a corpus of natural language text. So by generating random walks, you could apply NLP (natural language processing) techniques and neural networks to graph data. The NLP technique used by Node2Vec is Word2Vec. In 2013, Mikolov et al. published</a> a landmark result showing that words from natural language can be “embedded” into a high-dimensional vector space. (Imagine plotting a point on a two-dimensional X-Y plane, but instead of 2 dimensions X and Y, put the dot in a plot with dozens, hundreds, or, in the case of Large Language Models (LLMs), thousands of dimensions!) The main take-away from this work is that word meanings can be learned by a computer and used mathematically. For instance, after training on English text, you can take the word “king”, subtract “man”, add “woman” and arrive at the word “queen”. 
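The "king minus man plus woman" arithmetic can be tried on a toy scale. The sketch below uses hand-made, hypothetical three-dimensional vectors (the dimensions stand for royalty, male, and female); real Word2Vec vectors are learned from text and are not interpretable this way. `TOY_VECTORS`, `cosine`, and `analogy` are names invented for this illustration.

```python
import math

# Hypothetical hand-made vectors: [royalty, male, female]. Not learned from data.
TOY_VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 0.9, 0.0],
    "apple": [0.1, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy(positive=["king", "woman"], negative=["man"], vectors=TOY_VECTORS)
print(result)
```

The input words are excluded from the candidates, which is also the convention most embedding libraries follow for analogy queries.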
Word2Vec works incredibly well for natural language; Node2Vec then took it a step further and applied this technique to graph data. Random walks on a graph are the link that allows us to use Word2Vec for Node2Vec. A random walk generates a string of data from the graph. That string of data is analogous to a sentence in natural language. Producing many random walks gives us data structured like a set of documents containing many sentences; this is the “corpus” that trains the NLP model. So if we can produce random walks from a graph, we can train a neural network to learn the meaning of the data in that graph. The random walk APIs in Quine 1.5+ allow a user to tune the random walks as described in the Node2Vec paper. The return</code> parameter determines how likely a walk is to return one step back where it came from (to “backtrack” to the previous node). The in-out</code> parameter determines whether a walk is more likely to explore the local region (“neighborhood”) around a node or travel far afield to explore corners of the graph far away. These parameters can tune the walks to learn different features of the graph and address different goals. GraphSAGE</h3> Whereas Node2Vec used random walks to generate a list of node IDs as the “corpus” to train Word2Vec, a richer approach was developed in the GraphSAGE</a> algorithm. The primary development in GraphSAGE is to include the ability to explore the local area of each node in a walk, then aggregate features from each of the nearby nodes back to the starting node. These features then get concatenated together and used in the learning process for representing nodes in a graph. This provides two big improvements vs. Node2Vec: 1.) it provides a lot more information for each node (Node2Vec uses only the node’s ID), and 2.) it learns a function which can be used to embed unseen nodes—making it much more useful in practical situations where we need to compare new nodes not present in the original training set. 
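The aggregate-then-concatenate step described for GraphSAGE can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses a plain mean as the aggregator and stops at the concatenation, whereas the published algorithm passes the concatenated vector through learned weights and a nonlinearity. The function names and feature values are hypothetical.

```python
def mean_aggregate(neighbor_features, dims):
    """Element-wise mean of the neighbors' feature vectors (zeros if there are none)."""
    if not neighbor_features:
        return [0.0] * dims
    count = len(neighbor_features)
    return [sum(column) / count for column in zip(*neighbor_features)]


def embed_node(own_features, neighbor_features):
    """Concatenate a node's own features with the mean of its neighbors' features."""
    aggregated = mean_aggregate(neighbor_features, len(own_features))
    return list(own_features) + aggregated


# A node that was never part of training can still be described,
# because the function depends only on features, not on a memorized node ID.
unseen_node = [1.0, 0.0]
its_neighbors = [[2.0, 4.0], [4.0, 0.0]]
print(embed_node(unseen_node, its_neighbors))
```

The point of the sketch is the second improvement listed above: the representation is computed by a function of features, so it applies to nodes that arrive after training.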
Embedding unseen nodes is important for streaming data workflows. It means that we can train a neural network on a consistent view of the graph at one moment in time, and then use that trained network to interpret new data (nodes) which stream in after the training. We can retrain the network any time we like, but we don’t have to skip over streaming data arriving in real-time. Every node can be embedded in real-time! Quine’s random walk generation includes the ability to define an aggregation query for each node encountered in a random walk. This can be used to explore the local neighborhood and/or aggregate multiple properties which get automatically folded into the walk. For example, the following Cypher query fragment is passed in to the query API parameter to collect different properties from nodes visited during a walk based on the node type: RETURN CASE WHEN "Movie" IN labels(thisNode) THEN thisNode.languages + [id(thisNode)] + thisNode.countries WHEN "Genre" IN labels(thisNode) THEN [id(thisNode), thisNode.genre] WHEN "Person" IN labels(thisNode) THEN [thisNode.born.year, id(thisNode), split(thisNode.bornIn, " ")[-1]] WHEN "Role" IN labels(thisNode) THEN id(thisNode) WHEN "User" IN labels(thisNode) THEN id(thisNode) WHEN "Rating" IN labels(thisNode) THEN thisNode.rating ELSE id(thisNode) END</code></pre>‍GNN Tutorial</h2> Streaming data and graph neural network techniques come together in Quine 1.5 to easily solve some powerful graph A.I. use cases. Let’s use an existing recipe to demonstrate how we can enrich it with graph neural networks. Let’s use the movieData recipe from here</a> and enrich it using a GNN to compute similarity for movies in that dataset. Workflow for using Quine to create the similarity data file. 
To get started, download the two data files to your working directory: https://quine-recipe-public.s3.us-west-2.amazonaws.com/movieData.csv</a> https://quine-recipe-public.s3.us-west-2.amazonaws.com/ratingData.csv</a> Download the latest Quine executable</a>. Run the movieData recipe to ingest the movie data and build a graph: java -jar quine-1.5.0.jar -r movieData --recipe-value movie_file=movieData.csv --recipe-value rating_file=ratingData.csv</code></pre> Let that run to ingest all the data and build the movieData graph. When complete, you should see the ingest counters stop changing and an output that looks like this: INGEST-4 status is completed and ingested 74090 INGEST-1 status is completed and ingested 74090 INGEST-2 status is completed and ingested 74090 INGEST-3 status is completed and ingested 74090 INGEST-5 status is completed and ingested 100005</code></pre> We’ll use a Node2Vec approach for this example. Let’s start by generating random walks: curl --request PUT --url "http://localhost:8080/api/v1/algorithm/walk?count=20&seed=foo&at-time=$(date +%s000)" --header 'Content-Type: application/json' --data '{ "type": "LocalFile" }'</code></pre> NOTE: This bash command computes the current timestamp (in seconds) inline with date +%s</code> and then appends 000</code> to turn it into a millisecond timestamp that is slightly in the past. This API call generates a string of node IDs by “walking” from every node in the graph, then saves a file to the local machine in the same working directory named something like: graph-walk-1674342158000-10x20-q0-1.0x1.0-foo.csv</code> NOTE: The file name is automatically derived from the parameters passed into the API call. If the API call includes a timestamp (in milliseconds), the filename will be the same if you issue the same API call again. If the API call includes both the at-time and seed parameters, then the file contents will also be the same each time. But if the file already exists, trying to generate it again will return an error. 
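If you would rather build the request from Python than from bash, the timestamp trick is easy to reproduce. The sketch below only constructs the URL and does not send anything. The endpoint and the `count`, `seed`, and `at-time` parameters come from the curl command above, while `now_millis` and `build_walk_url` are hypothetical helper names.

```python
import time
from urllib.parse import urlencode

WALK_ENDPOINT = "http://localhost:8080/api/v1/algorithm/walk"


def now_millis():
    """Whole seconds with three zeros appended, mirroring `date +%s000`."""
    return int(time.time()) * 1000


def build_walk_url(at_time_ms, count=20, seed="foo"):
    """Assemble the walk URL; reusing at_time_ms and seed makes the request repeatable."""
    query = urlencode({"count": count, "seed": seed, "at-time": at_time_ms})
    return f"{WALK_ENDPOINT}?{query}"


url = build_walk_url(1674342158000)
print(url)
```

Saving the timestamp you used, rather than recomputing it, is what lets you point a later request at the same fixed view of the graph.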
The first line of the file looks like this: 00006819-5fb1-310b-a4b6-e3bfbf600aa4,00006819-5fb1-310b-a4b6-e3bfbf600aa4,5f97b729-7435-3a44-823e-2e7217885d27,c9e28d23-5d1f-353a-b619-d1ec849d2583,c79ad95d-901d-3484-9f07-ae79c8efca39,ac799ac2-a21f-3b83-a827-31aace7c4211,c79ad95d-901d-3484-9f07-ae79c8efca39,0e415310-62ea-3bf6-b674-2ff5a9e84cee,985084e1-a9d5-353b-8080-175d2447929f,7a3c184f-451b-36ff-b005-feca7d8ee73b,985084e1-a9d5-353b-8080-175d2447929f,59cf0b79-23a1-36b8-96f6-ff6a4aa61e43</code></pre> The line is a series of node IDs generated from a random walk starting at node 00006819-5fb1-310b-a4b6-e3bfbf600aa4</code>.‍ *NOTE: the first value of each line of the file output of the random walk call will always be the ID of the starting node, even if the walk returns other values. If a node has no edges to walk to, you’ll see a much shorter line: just the same node ID repeated twice.*‍ The graph has 164,777 nodes. We instructed Quine to generate 20 random walks for each node. So the output file has 3,295,540 rows. $ wc -l graph-walk-1674342158000-10x20-q0-1.0x1.0-foo.csv 3295540 graph-walk-1674342158000-10x20-q0-1.0x1.0-foo.csv</code></pre>‍Graph Embedding</h3> With the random walks generated, we can use the output to train a neural network and load similarity data back into Quine (as a new ingest stream). 
The following Python code demonstrates how to train the neural net, create the embeddings, and compute similarity for each node in the graph: Download This Code</a>

import sys, json, requests
from gensim.models.word2vec import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from multiprocessing import cpu_count
from datetime import datetime

# Each line of the walk file becomes one "sentence" of node IDs.
file = sys.argv[1]
print(f"Reading in training data from: {file}")
line_count = 0
word_count = 0
data = []
with open(file, "r") as f:
    for line in f:
        l = line.strip().split(",")
        data.append(l)
        line_count += 1
        word_count += len(l)
print(f"Read in: {line_count} sentences with: {word_count} total words")

class EpochLogger(CallbackAny2Vec):
    '''Callback to log information about training'''
    def __init__(self):
        self.epoch = 1
    def on_epoch_begin(self, model):
        print(f"Iteration #{self.epoch} training started at: {datetime.now().strftime('%Y/%m/%d %H:%M:%S')}")
    def on_epoch_end(self, model):
        print(f"Iteration #{self.epoch} completed at: {datetime.now().strftime('%Y/%m/%d %H:%M:%S')}")
        self.epoch += 1

logger = EpochLogger()
print("Preparing dictionary and beginning model training...")
# sg=1 selects the skip-gram variant of Word2Vec.
model = Word2Vec(data, vector_size=16, window=5, min_count=0, sg=1, workers=cpu_count(), callbacks=[logger])
print("Training Complete.")

# The first value on each line is the walk's starting node.
ks = sorted(list(set([d[0] for d in data])))
print("Computing and saving similarities...")
with open(file + "-similarities.json", "w") as fd:
    for k in ks:
        d = { "target": k, "similarNodes": [{"id": x[0], "similarity": x[1]} for x in model.wv.most_similar(k)] }
        fd.write(json.dumps(d)+"\n")
print(f"Similarity save completed at: {datetime.now().strftime('%Y/%m/%d %H:%M:%S')}")

print("Saving the learned model...")
model.save(file + ".wvmodel")
print(f"Model save completed at: {datetime.now().strftime('%Y/%m/%d %H:%M:%S')}")

### Ingest similarities and create edges:
print(f"Beginning async ingest of similarity data from: {file}-similarities.json")
ingest_payload = {
    "type": "FileIngest",
    "path": file + "-similarities.json",
    "format": {
        "type": "CypherJson",
        "query": "UNWIND $that.similarNodes AS s MATCH (a), (b) WHERE id(a) = $that.target AND id(b) = s.id CREATE (a)-[:is_similar_to]->(b)"
    }
}
url = "http://localhost:8080/api/v1/ingest/similarities"
headers = {"Content-Type": "application/json"}
response = requests.request("POST", url, json=ingest_payload, headers=headers)
if response.status_code != 200:
    print(response.status_code)</code></pre>

Run this code with a command like (replace with your file name): python graph_embedding.py graph-walk-1674342158000-10x20-q0-1.0x1.0-foo.csv</code></pre> The script logs its progress as training runs. When it finishes, we have similarity data saved to the file graph-walk-1674342158000-10x20-q0-1.0x1.0-foo.csv-similarities.json</code>, which has single lines that pretty-print to look like this: { "target": "00006819-5fb1-310b-a4b6-e3bfbf600aa4", "similarNodes": [ { "id": "b221ceb5-9a0a-3c93-b0d0-870ab56ff924", "similarity": 0.9671398997306824 }, { "id": "97290657-1a1e-34c1-8bd1-3db0aefaf13b", "similarity": 0.957493245601654 }, { "id": "2fa7e55a-840e-37ec-bb30-8736e71d037b", "similarity": 0.9573205709457397 }, ... ] }</code></pre> The data in this file tells us, for each target</code> node, how similar it is to another node elsewhere in the graph, identified by the corresponding id</code>. We’ve saved the IDs of the 10 most similar nodes for each node in the graph to a file. Each pair of target</code> and similarNodes.id</code> generated in the ...similarities.json</code> file is a new edge we can create and name is_similar_to</code>. After saving the learned model, our Python script made an API call creating an ingest stream named similarities</code> to load the data into Quine. Once it’s complete we can visualize the data. 
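The similarities file can also be inspected offline. The sketch below reads records shaped like the JSON shown above, flattens them into one edge per `target` and `similarNodes.id` pair, and picks out the pairs where each node lists the other. The function names and the short sample IDs are hypothetical, chosen only for this illustration.

```python
import json


def similarity_edges(json_lines):
    """Flatten similarity records into (target, similar_id, similarity) tuples."""
    edges = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for similar in record["similarNodes"]:
            edges.append((record["target"], similar["id"], similar["similarity"]))
    return edges


def mutual_pairs(edges):
    """Return the pairs of nodes that each list the other as similar."""
    directed = {(a, b) for a, b, _ in edges}
    pairs = {tuple(sorted((a, b))) for a, b in directed if a != b and (b, a) in directed}
    return sorted(pairs)


# Placeholder records; real files hold one record per node with full node IDs.
sample_lines = [
    '{"target": "a", "similarNodes": [{"id": "b", "similarity": 0.9}, {"id": "c", "similarity": 0.8}]}',
    '{"target": "b", "similarNodes": [{"id": "a", "similarity": 0.9}]}',
]
edges = similarity_edges(sample_lines)
print(len(edges), mutual_pairs(edges))
```

To run it against the real output, pass an open file handle for the `...similarities.json` file in place of `sample_lines`.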
Check the status of the similarities ingest stream with an API call like: curl --request GET --url http://localhost:8080/api/v1/ingest/similarities { "name": "similarities", "status": "Completed", ...</code></pre>‍Seeing Similarities</h3> Our graph now has edges showing the results for the question we began with: “Which records are the most similar to this one?” Let’s see which movies are similar to each other: MATCH (n: Movie)-[:is_similar_to]->(m: Movie)-[:is_similar_to]->(n) RETURN n, m LIMIT 100</code></pre> And we get insightful results: Visualizing similarities returned by Quine. This graph is a lot of fun to explore! Looking at a few of the connections, we see that the GNN learned many useful similarities: The Hobbit trilogy appears in a connected triangle near the top.</li> The George Carlin (and Bill Hicks) pentagram is left of center.</li> Many sequels are related to their predecessors.</li> A pair of Bond movies appears just below the center (more if we didn’t limit results)</li> Movies in the same genre are much more likely to be similar.</li> …and many more! Your version might be a little different based on the randomness of the GNN training. Take a look and see what you find.</li> </ul> Similar movies will have similar genres, similar actors in similar roles, similar users providing similar ratings, similarities of similarities, and so on. The similarities also include the dataset’s other kinds of graph relationships. The entire graph is incorporated into the graph neural network learning process through random walk production. Want to explore more kinds of similarities? Try this query to explore similar actors: MATCH (n: Person)-[:is_similar_to]->(m: Person)-[:is_similar_to]->(n) RETURN n, m LIMIT 100</code></pre>‍Conclusion</h2> Machine Learning on graphs is an effective new technique applicable to all sorts of datasets—even many that don’t immediately need a graph. 
Representing JSON or CSV data as a graph lets us draw connections that help neural networks reach conclusions that would otherwise be impossible to detect or would require enormous manual analysis. This is only the beginning! In a future post, we’ll look at how to apply these graph neural network techniques to your live streaming data to constantly embed new nodes which weren’t part of the training set. Categorical Data: An Untapped Source of Real-Time Insights 2023-01-12T00:00:00+00:00 As we enter a period of uncertainty, it is important to get more value from existing investments, including event stream processing infrastructure. Categorical data represents a largely ignored and untapped resource capable of providing significant business impact. This post explores what categorical data is, why it has been ignored, and most importantly, the value it can unlock with minimal investment. Categorical data contains important insights</h2> There are two principal types of data: categorical and numerical</a>. Numerical data, as the name implies, refers to numbers or metrics (e.g. temperatures, counts, scores or ratings). Categorical data is everything else – colors, product models, addresses (IP and terrestrial), telephone numbers. What is more, categorical data can express the relationships between objects – an individual and their favorite color or the education distribution by postal code are two examples. Categorical data is vast and expressive, describing attributes of the real world which, for enterprises, can provide the holy grail of insights: understanding and even predicting behavior. Example: Eliminating False Positive Security Alerts Counting the frequency of an employee accessing a high-value service is easily described with numerical values: UserID, Service#, CountofAccessAttempts. 
Categorical data, on the other hand, provides a rich context that can be used to identify attackers with much higher confidence, and the categorical data is already in our logs: UserID, UserAgent, DeviceOS, ServerIP, FilePath, TimeofDay. This more complete context can eliminate false positives, creating significant ROI and happier analysts! So why is categorical data ignored?</h2> The simple answer is that today’s tools are not really designed to work with categorical data, particularly for real-time event processing. That has started to change in recent years with the emergence of knowledge graphs like Neo4J and JanusGraph; however, current graph databases can’t scale to handle real-time data volumes. Instead, enterprises must resort to encoding categorical data into numerical values so it can be processed with current event stream processing systems, a computationally expensive operation that obfuscates the very relationships between objects that make categorical data so valuable. Quine streaming graph for real-time behavioral insights</h2> Quine was built specifically to provide enterprises</a> with the ability to process huge volumes of categorical data in real time. Technically speaking, Quine can ingest millions of events per second</a>, render them as a graph that reveals connections between data, and produce actionable insights, all with sub-millisecond latency. Developers and data scientists query Quine using Cypher, the emerging standard for graph database query languages. Cypher makes it easy to create and detect complex patterns that indicate behaviors. Quine makes it possible to detect those patterns as they are emerging</a>, rather than later, after an event has taken place. Practically, this is the difference between anticipating and intervening before a customer abandons a shopping cart or an intruder compromises a key system, and learning about it later, when you query your data warehouse. 
Even then, because tools require categorical data to be encoded, predictive analytics may not succeed in detecting an issue, let alone providing enough information to defend against future occurrences. Drop-in solution for unlocking categorical data</h2> Quine is not just a high-performance graph database. Quine is also a complex event processor designed to consume from and publish to Apache Kafka, AWS Kinesis and SNS/SQS, and Pulsar. Quine is designed to work with your existing ETL pipeline</a>. This minimizes the cost to leverage existing infrastructure to unlock the value contained in unused categorical data. Quine ingests data, creates a graph, and publishes results to event processors. Quine uses ingest queries</a> to eliminate data silos and combine multiple event sources into a complete graph view. Standing queries</a> – queries that persist in the graph keeping a lookout for the patterns that matter most to your business – publish results to the next hop in the event stream processing infrastructure the instant a match is made. What this means in practical terms is that Quine can feed new, categorical data-derived insights into existing workflows, feeding dashboards, alerting analysts, and improving the quality of data that gets stored in company data lakes. Novelty Detector: Anomaly Detection using Categorical Data</h2> Quine streaming graph is built to drop into your existing infrastructure to look for known patterns across event streams. But what if you don’t know what to look for? How do you deal with new threats or learn new ways to improve customer experience? Built on Quine, Novelty Detector</a> is unlike previous generations of anomaly detectors. Novelty Detector combines a shallow learning algorithm developed by thatDot with streaming graph’s ability to process categorical data in real time, streaming out only the truly unique and anomalous events. 
Because Novelty Detector is a self-training algorithm, it delivers results fast and with no need for data science resources. Data engineers can simply direct data from existing event feeds into Novelty Detector and it will learn what is normal, what is unexpected but not important, and what is truly unique and requires further action. Novelty Detector identifies true anomalies using categorical data. And just like Quine itself, when a truly unique and anomalous event is detected, Novelty Detector pushes the results to the next hop in the workflow. Delivering Business Value with Categorical Data</h2> Streaming graph is ideal when one or more of the following criteria describe the problem you are trying to solve: You need to get answers and take actions on those answers sooner rather than later. Therefore, the delay built into batch processing of data has too high a cost.</li> You have high volumes of real-time event data that exceed the capacity of traditional graph databases.</li> The answers to your questions are spread across multiple sources that need to be combined to create a complete picture.</li> </ol> Use cases span a wide range of verticals, but the examples below should give you a good sense of the breadth of Quine streaming graph’s uses: Ethereum blockchain fraud detection demo video</a> - (Quine)</li> Streaming network monitoring to improve end-user experience (Quine)</li> Stopping data exfiltration</a> before it happens (Novelty Detector)</li> Pre-processing log files to reduce SIEM and data lake storage costs</a> or de-dupe and data cleanse</a> (Quine)</li> </ul> Return on investment</h2> Unlocking a new source of actionable insight needs to make business sense, which means it should have a measurable impact on outcomes. On the cost side, Quine has demonstrated the ability to scale well past existing graph solutions at extremely reasonable prices – processing 425K events/second for $13/hour on AWS. 
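To put that figure in per-event terms, here is a short worked calculation in Python. It uses only the two numbers quoted above (425K events/second and $13/hour) and assumes the rate is sustained for a full hour; the variable names are illustrative.

```python
# Figures quoted in the post; sustained throughput for a full hour is an assumption.
EVENTS_PER_SECOND = 425_000
COST_PER_HOUR_USD = 13.0

events_per_hour = EVENTS_PER_SECOND * 3600
cost_per_million_events = COST_PER_HOUR_USD / (events_per_hour / 1_000_000)
events_per_dollar = events_per_hour / COST_PER_HOUR_USD

print(f"{events_per_hour:,} events per hour")
print(f"${cost_per_million_events:.4f} per million events")
print(f"{events_per_dollar:,.0f} events per dollar")
```

Under that assumption the cost works out to a little under one cent per million events.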
Quine is easy to use and is distributed under either an open source</a> or a commercial license that includes support, clustering for horizontal scale and resilience, and access to Novelty Detector. We are happy to help you with your use case or feel free to explore Quine open source</a> yourself. Either way, you’ll find Quine the fastest, easiest way to unlock the potential of categorical data and deliver incremental business value without massive capital investment. ‍ Banner photo credit: Photo by Dennis Kummer</a> on Unsplash</a> ‍ Quine Streaming Graph: A Year in Open Source 2022-12-20T00:00:00+00:00 Becoming Developer Focused</h2> Since February, we spent much of our time – whether coding, writing docs, or talking to people – iteratively improving the developer experience: who is Quine built for, what jobs do they tell us they need to get done, what isn’t working for them currently, and what resources do they need to be successful. As more people started hearing about Quine, mainly through shared blogs, word of mouth and events, we received a lot of feedback that challenged very basic things like how we even talk about Quine. For example, we heard early that people didn’t really understand what a streaming graph was or where Quine fits in an event stream pipeline. Was it a graph database? What did it even do? All the important foundational stuff. We realized that we needed to simplify everything: we needed to create or update web pages, documentation and blog posts that were simple, clear, and developer-focused. Documentation underwent a transformation, with significant focus on the Getting Started and Concepts sections. Of the 32 blog posts published this year, 23 were either 100% or primarily focused on how engineers and architects can use or better understand the utility of the open source version of Quine. 
Highlights include:</h3> A six part series on ingesting data</a> into Quine</li> A deep dive into idFrom</a> and its implications for event streaming (idFrom is Quine’s strategy for maintaining state</a> for a potentially infinite stream of data; Quine deterministically generates known IDs from data)</li> A post explaining how to use Quine’s time series-like function</a>: reify-time</li> Data de-duping event streams with Kafka and Quine</a></li> Two popular blogs explaining the difference between categorical and numerical</a> data and why it matters in event processing</li> </ul> Strengthening the Community's Voice</h2> We also hired a dedicated developer relations director – Michael Aglietti – whose job is to engage with the community to understand how best we can serve it and guide it toward self-sustainment. Hiring Michael doesn’t mean we’ve delegated the community to one person and forgotten about it. Far from it. Primarily through Quine slack and events, everyone in the company has supported open source users, answering questions and providing feedback for PRs. As 2022 ends and 2023 begins, that focus is turning into something special: the first glimmers of a self-sustaining community are beginning to manifest. Community members have submitted code, developers are building production systems using Quine OSS, and people are far more likely to tell us what we can do to improve Quine than they were even four months ago. In 2023 we will continue to learn along with you to improve Quine. We can’t wait to see how you use streaming graph techniques in your modern data pipelines. Happy Holidays from the team at thatDot! Quine 1.4.0: Scale, Stability, Supernode Mitigation 2022-11-22T00:00:00+00:00 A Major Release Fast on the Heels of A Major Milestone</h2> Today marks the release of Quine 1.4.0 with significant improvements made to resource utilization and developer experience. 
This release impacts both the open source and enterprise versions of Quine and is not backwards compatible with previous versions. The development of Quine 1.4.0 and our recent landmark achievement in which Quine Enterprise processed one million graph events per second</a> are deeply intertwined events. We ran those tests using an early 1.4.0 release candidate and we incorporated learning and bug fixes from those tests into the final released version. Quine 1.4.0 contains much foundational work for the next leap forward in terms of performance and stability. Highlights of Changes impacting both Quine Community and Quine Enterprise</h2> (Full release notes for Quine 1.4.0)</a> Domain Graph Nodes (DGNs) - sometimes the highest-impact feature isn’t very glamorous, but it sets the table for more ambitious and high-visibility improvements. Such is the case with DGNs, which not only contribute immediate performance and resource utilization improvements but also lay the foundation for coming breakthroughs in supernode mitigation. (Supernodes, the bane of graph data models, are nodes with too many edges, which impacts memory usage and performance.) With the cluster of PRs associated with DGN, we rewrote the serialization, persistence, and message passing system used by DistinctId</code> standing queries. Instead of using substantial memory on every node that stores a component of a relevant standing query, these partial query objects are stored in a new top-level persistent entity called DomainGraphNode</code>. Impact: reduces memory footprint, prepares the way for supernode mitigation, does not support graphs created in Quine 1.3.2 or earlier. 
Evolution of supernodes during ingest visualized in Quine's Exploration UI.</a> Documentation Improvements – as part of our never-ending quest to make docs.quine.io as easy and clear to use as possible, we made significant improvements throughout, but most notably for: Getting Started</a> - reorganized and simplified all elements of onboarding Quine and Quine basics</li> Core Concepts - added Streaming Systems</a> with details on Apache Kafka and Kinesis</li> Components - expanded on the properties of components, and most importantly on the idFrom function</a>.</li> Recipe Documentation - expanded and reorganized across docs site.</li> </ul> Impact: better organization and simpler explanations of some of Quine’s notable core concepts. Usability and performance improvements to reify.time</code> - With Quine 1.4.0, the reify.time</code> function now yields only the finest-granularity reified period node whereas reify.time</code> previously returned an array with all time nodes. With this adjustment, reify.time</code> makes it easier to write well-behaved queries that would otherwise tend to turn large period time nodes into supernodes. For example, the time node for "2022" would tend to become a supernode. That is avoided by changing reify.time</code> such that the Cypher query calling it does not inadvertently operate on every period in the hierarchy, and instead, only operates on the time node for the smallest period. Existing Cypher scripts will be syntactically compatible. However, the behavior will change, specifically impacting the YIELD</code> clause following CALL reify.time</code>. Previous behavior was that the YIELD</code> clause would emit every time node in the period hierarchy created by reify.time</code>. The new behavior is that the YIELD</code> block will emit only for a single time node. Impact: better query ergonomics, reduces likelihood a supernode is created, does not support reify.time</code> queries created for Quine 1.3.2 or earlier. 
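The behavior change is easier to see with a small model. The following Python sketch is not Quine code and does not call reify.time</code>; it only mimics which period labels a caller would receive before and after 1.4.0, based on the description above. The period names, label formats, and function names are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical period names ordered from coarsest to finest.
PERIODS = ["year", "month", "day", "hour", "minute"]
FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
}


def period_labels(ts, periods):
    """Return (period, label) pairs for ts, ordered coarsest to finest."""
    ordered = [p for p in PERIODS if p in periods]
    return [(p, ts.strftime(FORMATS[p])) for p in ordered]


def yielded_before_1_4(ts, periods):
    """Model of the old behavior: every period in the hierarchy is emitted."""
    return period_labels(ts, periods)


def yielded_since_1_4(ts, periods):
    """Model of the new behavior: only the finest-granularity period is emitted."""
    return period_labels(ts, periods)[-1:]


ts = datetime(2022, 11, 22, 14, 30)
old = yielded_before_1_4(ts, ["year", "month", "day"])
new = yielded_since_1_4(ts, ["year", "month", "day"])
print(old)
print(new)
```

In the old model, any edge created from the yielded results would also attach to the "2022" period, which is how a year node drifts toward becoming a supernode. In the new model, only the day-level period is handed back to the query.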
Added text.urlencode</code> and text.urldecode</code> Cypher functions – These handy functions are especially useful for standing query results. They perform URL encoding or decoding of a string used as an HTTP POST body or URL component. This is particularly useful to create the #-string query linking to a specific view in the Quine Exploration UI. Impact: standing query outputs are now more readily usable in other, downstream applications and for demos. Added atomic "count" return value to incrementCounter</code> procedure - The core functionality of incrementCounter</code> takes advantage of Quine’s unique computational model to keep a perfectly-synchronized counter on a single node. The old VOID version (i.e., returning nothing) ensured the counter was updated correctly but provided no way to access the uniquely incremented counter value. We added a "count" yielded value to the procedure that can be used via the standard syntax for Cypher procedures (see https://s3.amazonaws.com/artifacts.opencypher.org/openCypher9.pdf, page 122). For example, CALL incrementCounter(myNode, "counter") YIELD count AS updatedCount</code> will increment a property named "counter" on the node referred to as myNode</code>, then add a variable to the query context called updatedCount</code>, containing the new value of that property. Of particular note is that multiple query executions running in parallel will get unique values returned from the procedure's yielded "count". An example of how to use the new version: MATCH (n) WHERE id(n) = idFrom(-1) CALL incrementCounter(n, 'prop') YIELD count RETURN id(n), count</code> This increments the "prop" counter on the node with id idFrom(-1)</code> and returns the count</code>. Multiple invocations of this query, even in parallel, will all return a unique count</code> value. Impact: counters are more flexible, and can be read under high parallelism. 
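The uniqueness guarantee described for incrementCounter</code> can be illustrated with a plain Python analogy. This sketch is not how Quine implements the procedure (Quine relies on its own node-level computational model); it uses a lock as a stand-in to show the same contract: the increment and the read happen as one step, so parallel callers never see the same value. The Counter</code> class is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """Hypothetical stand-in for a counter property kept on a single node."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Increment and read under one lock, so each caller gets a distinct value.
        with self._lock:
            self._value += 1
            return self._value


counter = Counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    counts = list(pool.map(lambda _: counter.increment(), range(1000)))

print(len(set(counts)), max(counts))
```

A version that only incremented, without returning the new value from inside the same atomic step, would correspond to the old VOID behavior: the counter stays correct, but callers cannot learn which value they produced.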
Standing queries can now use idFrom</code> - Previously, ingest and ad hoc queries could use idFrom, Quine’s hash-based high-performance alternative to indices used for node lookup and retrieval. Standing queries can now use idFrom</code>-based ID constraints, provided that all arguments to the idFrom</code> are literal values. An example of a standing query using idFrom</code>: MATCH (n) WHERE id(n) = idFrom('my', 'special', 1, 'node') RETURN DISTINCT id(n) AS specialNodeId</code> will match on exactly 1 node: the node with the id idFrom('my', 'special', 1, 'node')</code>. Impact: brings standing queries, whether using MultipleValues or DistinctId, closer to the full range of functionality available in ad-hoc queries. Added support for decoding steps during ingest (base64, zlib, and gzip) – This one is pretty simple and self-explanatory</a>, but we draw your attention to it as it means you can ingest compressed event data from Kafka and Kinesis. Impact: better performance, supports gzip, zlib, and base64 input compression. Enterprise-focused Improvements</h2> Improvements to Quine Enterprise focused on improved cluster management and querying cluster state. Added Cypher function clusterPosition()</code> to get the executing member's position – You can now get the index of the executing cluster member. When combined with locIdFrom(clusterPosition(), "prop1", "prop2")</code>, the node will have a unique hash on the current host. Cluster-position aware locIdFrom</code> function – The locIdFrom</code> function now accepts Quine cluster position integers as its first argument rather than a partition. To map a partition to a position integer, the new kafkaHash</code> function may be used. For example, locIdFrom(kafkaHash("india"), "West Bengal", 12345)</code>. Note: ‘QuineIds’ allocated by Quine 1.3.2 and earlier may have inconsistent mappings in Quine 1.4.0 and later. 
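The idea behind idFrom</code>-style lookups, deriving a node ID deterministically from values instead of consulting an index, can be sketched in a few lines of Python. This is a conceptual illustration only: the id_from</code> helper, the namespace, and the use of UUIDv5 over a JSON encoding are assumptions, and the IDs it produces will not match the ones Quine generates.

```python
import json
import uuid

# Hypothetical namespace for this sketch; Quine's actual hashing scheme differs.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "idfrom-sketch.invalid")


def id_from(*values):
    """Derive a stable UUID from an ordered list of values.

    The same arguments always produce the same ID, so a writer and a later
    reader can both compute it without any lookup table.
    """
    canonical = json.dumps(list(values), separators=(",", ":"))
    return uuid.uuid5(SKETCH_NAMESPACE, canonical)


first = id_from("my", "special", 1, "node")
second = id_from("my", "special", 1, "node")
other = id_from("my", "special", 2, "node")
print(first == second, first == other)
```

Because the ID depends only on literal values, a standing query constrained this way refers to exactly one node, which is consistent with the requirement above that all arguments be literals.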
Extended support for bloom filter-optimized persistence to clusters – Previously Quine used a bloom filter to help decide if a node already exists. When unsure, Quine would have to query the persister. The bloom filter was disabled for Quine Enterprise because in the case of a cluster, when a hot spare joined the cluster, it would have to rebuild its bloom filter (taking minutes), thus causing the cluster performance to degrade while waiting for the host to join. This change allows a host to join the cluster, build its bloom filter in the background while always hitting the persister early on. Once the bloom filter is loaded, then the optimization can be utilized. It effectively allows a host to join fast, keep the cluster healthy, and the cost is that the new host will be a bit slower until the bloom filter is available. Improvements resulting from One Million Events/Second testing</h2> Scaling to 1M+ events/second and demonstrating recovery from various failure scenarios If you are interested in bug fixes and improvements yielded while processing high volume event streams with Quine, here’s a quick list pulled from release notes: Enriched logging in edge cases involving shard resolution</li> Simplified node wakeup protocol: edge cases involving simultaneous request to sleep and wake should now be more efficient</li> Cassandra persister batched writes now respect configured timeout and consistency options</li> Singleton-snapshot and journals may now be enabled at the same time</li> Improved shutdown behavior in failsafe case</li> Node edge and property counts will now be correctly reflected in the metrics dashboard</li> Enterprise: Improved cluster stability when cluster members experience temporary disconnections</li> </ul> Getting Started</h2> If you want to try Quine using your own data, here are some resources to help: Download Quine 1.4.0 - JAR file</a> (263MB)| Docker Image</a> | Github</a></li> Start learning about Quine now by visiting the Quine open 
source project</a>.</li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data‍</a></li> CDN Cache Efficiency Recipe</a> - this recipe provides more ingest pattern examples</li> </ol> And if you require 24 x7 support or have a high-volume use case and would like to try the Quine Enterprise, please contact us. You can also read more about Streaming Graph here</a>. ‍ ‍ Header image: Photo by Nina Luong</a> on Unsplash</a> Additional image: Photo by Ricardo Gomez Angel</a> on Unsplash</a> ‍ Why Digital Twins Need to Go Real Time 2022-11-15T00:00:00+00:00 (This post is modified from a version that ran in RT</a> Insights</a> Oct 13, 2022.) Quine streaming graph was built to analyze event streams in real time and drive event pipelines. Now our users are coupling them with digital twins and asset graphs to create accurate, up-to-the-second views of their infrastructure. The potential of real-time digital twins</h2> When things go wrong, our instinct is to retrace our process to pinpoint the problem. This process can be slow and frustrating. So, imagine if AI could step in and flag the misstep with data in real time. What if you could pinpoint the exact place — a blockage in a manufacturing process, a small interruption in a logistics chain that could bring deliveries to a halt or the failure of a piece of critical cloud infrastructure, for example? What if you could combine real-time alerting and diagnostics with your digital twin? As our world becomes increasingly connected, digital twins are being used to abstract and model almost everything to improve business operations, reduce risk, and enhance decision-making for better outcomes. They provide greater context to challenges by creating clear relationships and streamlining workflows as a virtual representation of the real world, including physical objects, processes, relationships, and behaviors. 
Even though they serve a valuable function as part of the enterprise technology toolkit, digital twins are not a technology per se but a specification for the structure and use of real-time data. But this concept is still relatively new, and as data is moving rapidly, how do we maximize the outcomes and role of digital twins? Seeing digital doubles: An introduction</h2> IBM defines digital twins as “a virtual representation of an object or system that spans its lifecycle, is updated from real-time data, and uses simulation, machine learning, and reasoning to help decision-making.” In short, they can connect digital and physical items through data. According to McKinsey, “by 2025, smart workflows and seamless interactions among humans and machines will likely be as standard as the corporate balance sheet, and most employees will use data to optimize nearly every aspect of their work.” Realistically, that’s only a couple of years away, meaning we have a short time to get this right. Currently, digital twins are helping revolutionize engineering, security, eCommerce, supply chain issues, and manufacturing to ensure better outcomes across industries. Embrace and enhance</h3> Until recently, digital twins were used to simulate real-world processes rather than interact with the world in real time. Either synthetically generated or previously captured data was run (and rerun) in controlled scenarios. As a design and diagnostic aid in product lifecycle management, digital twins have proven enormously helpful (think NASA engineers during the Apollo 13 mission for the early use of a twin). Digital twins for industrial processes. But as enterprises increasingly feel the pressure to replace offline batch processing of event data with real-time event processing, digital twins will need to go real time to remain relevant and valuable. 
This means moving the digital twin out of legacy databases, and in particular, graph databases, and into systems capable of processing potentially vast amounts of data, usually arriving via data pipeline software like Apache Kafka</a> or Spark, the instant it arrives. Such systems, streaming graphs</a>, have evolved in recent years and combine the complex event processing capabilities of Flink and ksqlDB with the powerful data structures popularized by Neo4J, a traditional graph database</a>. Going real time, in this case, means more than just processing data as it streams in, though. For digital twins to be truly useful, they must be able to drive actions — for example, issue alerts or power down equipment — the instant an issue emerges, perhaps even beforehand. Build a real-time, streaming asset graph</h3> New streaming graph systems have evolved to embed query logic and compute resources in line with data flows. They act like nets stretched across the data stream to capture interesting patterns as they race past and trigger workflows where instant matches occur. If our goal is to streamline and make our data processing more precise, digital twins need to be graph and real time. The best way to do this is by embracing and engaging with streaming graph, which combines event stream processing with the ability to query graph data. The volume of events being handled in the physical world must be translated into tangible and usable data. By making digital twins real time and graph, we can take the training wheels off this area of AI/ML and allow it to run at its highest potential for maximum business impact. ThatDot makes the only technology that combines event processing with graph (graph data pipelines is a simple way to describe it) and, as such, is tailor made for digital twins. Build Your Own Digital Twin with thatDot</h2> thatDot software is available in both open source Quine and commercial Streaming Graph</a>. You can try it yourself. 
Learn how to ingest your own data and build a streaming graph that can detect all sorts of changes and problems in real time. Try Streaming Graph</a> free for yourself.</li> Learn more about thatDot Streaming Graph</a>.</li> Join the Quine Discord Community</a> and get help from thatDot engineers and community members.</li> Check out the Ingest Data into Quine</a> blog series covering everything from ingest from Kafka</a> to ingesting .CSV data</li> Download open source Quine – JAR file</a> | Docker Image</a> | Github</a></li> </ol> What is the difference between batch and stream processing? 2022-10-25T00:00:00+00:00 Is it better to fix a problem now or later?</h2> The typical answer when someone describes the difference between batch processing</a> and stream processing</a> is that batch data is collected, stored for a period of time, and processed and put to use at regular intervals (e.g. payroll, bank statements) while streaming data is processed and put to use as close to the instant it is generated (think of alerts from sensor data). While accurate, this answer fails to capture why the difference is important and why companies are moving decisively toward stream processing architectures. We experience the world as a constant stream of events. We make decisions by comparing this stream of information to our experiences and memories. We perceive and react to threats or recognize and seize opportunities. And often reacting in a timely fashion is rewarding – we avoid the snake bite or grab the best seat at the movie theater. Stream processing more closely reflects this very human mode of experience. Enterprises ingest as many streams of information as they can handle, look for patterns in the data that represent threats or opportunities as it flows past, and when said patterns emerge, they act. The cost of not acting could be a data breach or a lost revenue opportunity. 
Batch processing still works well when you need to process huge amounts of data and the results can be delivered at regular intervals. But if recent trends hold, more of these jobs will move to streaming because companies can’t accept the hidden cost of batch any longer and remain competitive. Counting the Cost of Not Acting</h2> A great example is insider trading. The cost of detecting someone who is about to execute an insider trade is now much less than the cost of trying to unwind that trade later when batch processing picks it up. Even if the batch process runs every five minutes, that just means you’ll find them sooner, not stop them. Ultimately stream vs. batch will show up in the balance sheet and the stock price. The one potential argument against streaming is that it might not handle the amount of data as cost effectively as batch handles. However, with the advent of systems like Kafka, Flink, and their cloud analogues, such cases are getting rare. Quine Stream Graph for ETL Pipelines</h2> We build Quine to not just detect emerging patterns of interest in high volumes of data but to act on the results with sub-millisecond latency. Practically speaking, this means finding evidence of a password spray attack or streaming CDN service interruptions when they are technical issues and before they become business issues. Quine consumes event data from one or more streams originating in Kafka, Kinesis, or data lakes, uses a graph data structure to materialize the often complex relationships between events that evidence important system or user behavior. Quine uses standing queries to trigger actions like sending alerts or updating machine learning models the instant such patterns become apparent. Far from acting as a passive filter, Quine actually drives the workflow. And Quine scales to meet the needs of modern enterprises, as this test demonstrating Quine’s ability to process and alert on one million events/second demonstrates. 
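As a rough illustration of acting "the instant such patterns become apparent," here is a small Python sketch of an incremental detector for a password-spray-like pattern. It is plain Python, not a Quine standing query, and every name in it (SprayDetector</code>, the event fields, the threshold) is a hypothetical chosen for the example.

```python
from collections import defaultdict


class SprayDetector:
    """Fire a callback as soon as one source IP fails logins for enough distinct users."""

    def __init__(self, threshold, on_match):
        self.threshold = threshold
        self.on_match = on_match
        self._users_by_ip = defaultdict(set)
        self._alerted = set()

    def observe(self, event):
        # State is updated per event, so a match is reported immediately
        # rather than at the next scheduled batch run.
        if event["outcome"] != "failure":
            return
        ip = event["source_ip"]
        users = self._users_by_ip[ip]
        users.add(event["user"])
        if len(users) >= self.threshold and ip not in self._alerted:
            self._alerted.add(ip)
            self.on_match(ip, sorted(users))


alerts = []
detector = SprayDetector(threshold=3, on_match=lambda ip, users: alerts.append((ip, users)))

events = [
    {"source_ip": "203.0.113.7", "user": "alice", "outcome": "failure"},
    {"source_ip": "203.0.113.7", "user": "bob", "outcome": "failure"},
    {"source_ip": "198.51.100.2", "user": "alice", "outcome": "success"},
    {"source_ip": "203.0.113.7", "user": "carol", "outcome": "failure"},
    {"source_ip": "203.0.113.7", "user": "dave", "outcome": "failure"},
]
for event in events:
    detector.observe(event)

print(alerts)
```

The alert is raised on the fourth event in the list, the moment the third distinct user fails from the same address. A batch job scanning the same log would find the same pattern only when it next runs.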
When to Use Quine</h2> Batch processing is great for jobs where response time doesn’t matter. And batch processing tools have been around for a long time, so you have your choice. But for jobs where the cost of not knowing, and therefore not acting, is unacceptable, Quine is ideal. For use cases like financial fraud detection, video observability, and manufacturing process management using a digital twin, Quine streaming graph is really the only choice. Getting Started</h2> If you want to try Quine using your own data, here are some resources to help: Download Quine – JAR file</a> | Docker Image</a> | Github</a></li> Start learning about Quine now by visiting the Quine open source project</a>.</li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data</a></li> CDN Cache Efficiency Recipe</a> – this recipe provides more ingest pattern examples</li> </ol> And if you require 24x7 support or have a high-volume use case and would like to try Quine Enterprise, please contact us. You can also read more about Quine Enterprise here. Special thanks for the image used in this post to Amritanshu Sikdar</a> on Unsplash.</a> See Quine in Action: 3 Live Demos Showing Graph ETL Use Cases 2022-10-20T00:00:00+00:00 Nothing Beats a Live Demo</h2> Over the last three weeks, we've been fortunate enough to deliver presentations at events hosted by DataStax (makers of Apache Cassandra), Confluent (makers of Apache Kafka), and the PDX Video Tech Meetup (sponsored by AWS Elemental). Each video includes a live demo showing Quine in action and includes ways for you to follow along and go further. Enjoy! DataStax Hands-On Workshop: Password Spray Detection</h2> I joined the team at DataStax to demonstrate Quine graph ETL in action using the Password Spray Detection recipe. In this hands-on workshop, we cover how to use Quine with AstraDB, DataStax's Cassandra-as-a-Service DB. 
If you recall, we used the Cassandra persistor for our performance tests where Quine broke 1 million events/second (read the blog describing the reproducible tests here).

You can access the Github repo with the recipe and the excellent and comprehensive README here: https://github.com/datastaxdevs/workshop-streaming-graph-quine

Confluent Current22 Demo: Advanced Persistent Threat Use Case with Apache Kafka

Ryan Wright (@rrwright) delivered a bite-sized demonstration of how to use Quine to detect APT attacks. At 10 minutes, this demo packs a lot of useful information into a lightning talk format and points to many of the features that make streaming graph ETL an essential tool for cybersecurity solutions.

Slides and Kafka resources are available here, including more info on how to add Quine graph ETL to Apache Kafka data pipelines.

PDX Vid Tech Meetup: Real-time Video CDN Root Cause Analysis

Rob Malnati (@robmalnati) and Allan Konar (@7evenbridges) presented a live demonstration using Quine to ingest, transform, sessionize, and analyze log data from CloudFront, AWS Elemental, and Mux client APIs. Video QoE/S issues were identified in real time and root cause analysis notifications were generated automatically. This example uses AWS Kinesis for the event stream feed.

We will be publishing the recipe for this soon, but a related recipe, CDN Cache Efficiency, is available now to try. You can also read about Kinesis integration here.

Download and Try

If you want to try it on your own logs, here are some resources to help:

1. Getting Started Guide
2. Download Quine - JAR file | Docker Image | Github
3. Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka to ingesting .CSV data

And please don't hesitate to sign up for the Quine community Slack.
There are lively discussions and it is a great place to get fast answers to pressing questions. You can find me there or on Twitter (@michaelaglietti). Thanks!

Great Quine Community Events for October 2022

2022-10-03T00:00:00+00:00

October 2022 Events: Meet the Quine Team, Learn About Streaming Graph

Our thatDot team is focused on creating sessions to educate and inform key audiences about streaming graph and Quine while helping developers get the most out of their data. If you will be at one of these events (either online or in-person) and would like to schedule a meeting, or if you’re interested in the topics below but won’t be able to attend, reach out at info@thatdot.com. We can connect you afterward with our team of experts who are presenting.

Our team will be at the following events in October (and November):

Current: The Next Generation of Kafka Summit, Ryan Wright, CEO and Founder
Austin, TX, October 4-5
Join the first-ever data streaming industry event at Current 2022: The Next Generation of Kafka Summit. You’ll be able to immerse yourself in all things real-time data with peers, industry analysts, expert speakers, and more. Current captures the fast-moving data streaming movement, bringing this broad community together for learning, sharing, and networking to help you unlock the value from data.
- Session: Build a Streaming Graph Pipeline on Kafka with Quine

PDX Video Tech Meetup, Rob Malnati, COO, and Allan Konar, Solution Architect
Portland, OR, October 5
Sponsored by AWS Elemental, this networking event will be held in downtown Portland at Lucky Labrador Beer Hall and will bring local industry peers together for video tech talks.
- Session: Analyze This! - Real-time Video Root Cause Analysis Using CMCD (read more)

DataStax Workshop, Michael Aglietti, Director of Developer Relations
Virtual, Wednesday, October 19, 08:00 am PT / 16:00 GMT / 17:00 CET / 20:30 IST
DataStax events are great venues for networking with colleagues, learning from real-world DataStax and Apache Cassandra™ use cases, and discovering new approaches to thrive in the new decade of open-source, scale-out, and cloud-native data in-person and online.
- Workshop Session: Real-time Graph ETL for Modern Data Pipelines with Quine and Cassandra

Reactive Summit 2022, Ryan Wright, CEO and Founder
Detroit, MI, October 25
Reactive Summit is where application architects and developers learn and collaborate on the latest Reactive patterns and projects for building distributed systems using Serverless, Cloud Native Design, Reactive programming, Reactive systems, Reactive Streams, event-sourcing, microservices, and more.
- Session: Streaming Graphs, Because We Can't Afford to Query Any More

And a teaser of what is coming in November:

DataStax Cassandra Day Workshops (sessions take place the same day in two locations):
- Seattle, Ryan Wright, CEO and Founder
- Houston, Michael Aglietti, Director, Developer Relations
Seattle, WA, and Houston, TX, November 10
DataStax events are great venues for networking with colleagues, learning from real-world DataStax and Apache Cassandra™ use cases, and discovering new approaches to thrive in the new decade of open-source, scale-out, and cloud-native data in-person and online. We’ll have specific session info soon, so watch this space!

Why You Should Attend

Cybersecurity experts face immense challenges with frequent data breaches dominating the headlines. Traditional anomaly detectors often fall short in identifying and neutralizing threats in real time.
They require constant human tweaking of threat signatures and sensitivity levels to avoid exhausting professionals with mountains of false positive alerts. What if you could build contextual awareness into the application?

Join us to discover how thatDot Novelty, powered by the open-source technology Quine, is revolutionizing real-time threat detection and response. This cutting-edge technology combines event stream processing speed with a built-in AI that learns the contextual fingerprint of your data environment and pinpoints problems automatically. Developed in a DARPA project, thatDot Novelty and thatDot Streaming Graph provide unparalleled capabilities in:

- Advanced Persistent Threat Detection
- Insider Threat Detection
- Attack Graph Analysis
- Digital Twins
- And many more critical use cases

Don’t miss this opportunity to learn from industry experts and gain a competitive edge in cybersecurity. Secure your spot today and be part of the future of threat detection and data analytics. We look forward to your participation.

Scaling Quine Streaming Graph to Process 1 Million Events/Second

2022-09-27T00:00:00+00:00

Note: If you want to reproduce this test, we have published the test details on Github so that you can understand and run it yourself.

Solving the Unsolvable: Graph that Scales Past 1 Million Events/Second

This is not a blog post about benchmarking Quine streaming graph. This is a post with an operational focus that explains how Quine solves the previously unsolvable: scaling graph data processing past a million events per second. In conventional terms, that means millions of simultaneous writes and multi-node graph traversals per second, an unprecedented achievement. The tests this post covers also demonstrate Quine Enterprise's resilience in the face of common failure scenarios.

Most importantly, this blog is about the new use cases for graph this performance makes possible.
Finding relationships within categorical data is graph's strong point. Doing so at scale, as Quine now makes possible, has significant implications for cyber security, fraud detection, observability, logistics, e-commerce, and really any use case that is well suited to graph and must process high-velocity data in real time.

Scaling to 1M+ events/second and demonstrating recovery from various failure scenarios

tl;dr

Our tests delivered the following results:

- 1M events/second processed for a 2 hour period
  - 1M+ writes per second
  - 1M 4-node graph traversals (reads) per second
  - 21K results (4-node pattern matches) emitted per second
- 190 commodity hosts plus 1 hot spare running Quine Enterprise
  - We found this 190-host configuration to be significantly cheaper than the 140-host cluster initially tested.
- 48 storage hosts using Apache Cassandra persistor
- 3 hosts for Apache Kafka

What is Quine?

For those of you new to Quine, the simplest way to describe it is “real-time graph ETL”. Quine streaming graph combines the graph data structure and persistence of graph databases (e.g. Neo4J) with the streaming properties of systems like Flink. Drop Quine into a streaming system between two Apache Kafka or Kinesis instances and start materializing and querying your real-time events as a graph.

There’s a lot more to Quine of course, so if you are interested in how it works (asynchronous actor model, caching strategies, "all nodes exist," and more), check out the docs. In particular, see the comparison to standard database systems.

Quine Operational Profiling

The goal of this test is to demonstrate a high volume of sustained ingest that is resilient to cluster node failure in both Quine and the persistor, using commodity infrastructure, and to share performance results along with details of the test for those interested in either reproducing results or running Quine in production.
Infrastructure Used

Quine Cluster

- Number of Hosts: 191
- Host Type: n2-custom-8-16384
  - 8 vCPU, 16GB Intel Cascade Lake
  - Max JVM heap set to 12GB
  - 190 cluster size with 1 hot spare
- Cost: $28.73/hour

Cassandra Persistor Cluster

- Number of Hosts: 48
- Host Type: n2d-custom-16-131072
  - 16 vCPU, 128GB AMD Rome
  - 1 x 375 GB local SSD each
  - TTL: 15 minutes on snapshots and journals tables (to control disk costs in testing)
- Cost: $21.07/hour

Kafka

- Number of Hosts: 3
- Host Type: n2-standard-4
  - 4 vCPU, 16GB RAM
  - Preloaded with 8 billion events (sufficient for a sustained 2-hour ingest at 1 million events per second)
  - 420 partitions
- Estimated Cost: Part of the data pipeline, not estimated

Infrastructure Update

Our initial testing provisioned 141 c2-standard-30 hosts. However, as we proceeded with further testing, we made an important discovery. By deploying a higher number of smaller n2-custom-8-16384 hosts, we achieved the same overall performance while significantly reducing our monthly costs. Using 191 smaller hosts proved to be more cost-effective than the initial setup with 141 larger hosts. This adjustment allows us to maintain optimal performance while ensuring budget efficiency.

Regarding the Cassandra persistor layer’s settings, we set a TTL of 15 minutes and a replication factor of 1 in order to manage quota limits and spending on cloud infrastructure. This does not fit every possible use case, but it is fairly common. Other scenarios which are more data-storage oriented will often increase the replication factor and/or TTL. In those variations, maintaining the 1 million events/sec processing rate would require increasing the number of Cassandra hosts or disk storage, both of which are budgetary concerns more than technical concerns.
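The infrastructure figures above lend themselves to a quick back-of-the-envelope check. The sketch below uses only numbers stated in this post, and assumes the listed hourly costs are totals for each cluster (the post does not say whether they are per host). Kafka is left out because its cost was not estimated.

```python
# Figures taken from the post.
EVENTS_PER_SECOND = 1_000_000
TEST_DURATION_S = 2 * 60 * 60            # sustained 2-hour ingest
PRELOADED_EVENTS = 8_000_000_000         # events preloaded into Kafka
ACTIVE_QUINE_HOSTS = 190                 # excludes the hot spare

# Assumption: these are whole-cluster hourly costs, not per-host costs.
QUINE_COST_PER_HOUR = 28.73
CASSANDRA_COST_PER_HOUR = 21.07

# Events needed to sustain the target rate for the whole test.
events_needed = EVENTS_PER_SECOND * TEST_DURATION_S
headroom_events = PRELOADED_EVENTS - events_needed

# Average load carried by each active Quine host.
events_per_host_per_second = EVENTS_PER_SECOND / ACTIVE_QUINE_HOSTS

# Compute cost per billion events processed (Quine + Cassandra only).
combined_cost_per_hour = QUINE_COST_PER_HOUR + CASSANDRA_COST_PER_HOUR
billions_of_events_per_hour = EVENTS_PER_SECOND * 3600 / 1e9
cost_per_billion_events = combined_cost_per_hour / billions_of_events_per_hour
```

Under those assumptions, a two-hour run needs 7.2 billion events, so the 8 billion preloaded events leave 800 million of headroom, and each active host averages a little over 5,000 events per second.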
This cluster configuration was meant to demonstrate that high-volume graph processing is possible. In a later post we'll describe how to optimize the cluster to achieve these results and minimize infrastructure costs.

The Test

The plan is set out below, with each action labeled and the results explained. Events are clearly marked by sequence # on the Grafana screen grabs below the table.

A few notes on the test:

- A script is used to generate events.
- Host failures are manually triggered.
- We used Grafana for the results (and screenshots).
- We pre-loaded Kafka with enough events to sustain one million events/second for two hours.
- A Cassandra cluster is used for persistent data storage. The Cassandra cluster is intentionally not over-provisioned to accommodate compaction (a common strategy) so that the effects of database maintenance on the ingest rate can be demonstrated.
- The cluster is run in a Kubernetes environment.

Sequence Actions, Expected Results, and Actual Results Overview

Sequence 1
- Action: Start the Quine cluster and begin ingest from Kafka.
- Expected Result: The ingest rate increases and settles at or above 1 million events per second.
- Actual Result: Observed.

Sequence 2
- Action: Let Quine run for 40 minutes to establish a stable baseline.
- Expected Result: Quine does not fail and maintains a baseline ingest rate at or above 1 million events per second.
- Actual Result: Observed.

Sequence 3
- Action: Kill a Quine host.
- Expected Result: Quine ingest is not significantly impacted. The hot spare steps in to recover quickly, and Kubernetes replaces the killed host, which becomes a new hot spare.
- Actual Result: Observed at 17:47. The hot spare recovered quickly, and the ingest rate was not impacted.

Sequence 4
- Action: Perform Cassandra persistor maintenance.
- Expected Result: Cassandra regularly performs maintenance. Quine experiences this as increased latency and should backpressure the ingest to maintain stability during database maintenance.
- Actual Result: From 17:55 - 18:15, the ingest rate is reduced as a corresponding increase in latency is measured above 1ms across all nodes from the Cassandra persistor.

Sequence 5
- Action: Kill two Quine hosts.
- Expected Result: Observe the following sequence: hot spare recovers one host, while the whole cluster suspends ingest due to being degraded. Kubernetes replaces killed hosts, the first replaced host recovers the cluster, and the second replaced host becomes the new hot spare.
- Actual Result: Observed from 18:18 - 18:25. Due to Kubernetes, the impact was not visible. However, the expected sequence was confirmed in the logs.

Sequence 6
- Action: Stop and resume a Quine host for about 1 minute to inject high latency.
- Expected Result: Quine detects the host is no longer available, boots it from the cluster, and the hot spare steps in to recover. When the rejected host resumes, it learns it was removed from the cluster, shuts down, is restarted by Kubernetes, and becomes the new hot spare.
- Actual Result: Observed from 18:41 - 18:46. No impact on ingest rate, as the back-pressured ingest was for a single host in the cluster and the recovery happened quickly.

Sequence 7
- Action: Stop and resume a Cassandra persistor host for about 1 minute to inject high latency.
- Expected Result: Quine back pressures ingest until Cassandra persistor has recovered.
- Actual Result: Observed from 18:47 - 18:54. Due to replication factor = 1, ingest was impacted until Cassandra persistor recovered. Ingest then resumed to > 1M events per second.

Sequence 8
- Action: Kill a Cassandra persistor host.
- Expected Result: Quine suspends ingest until Cassandra persistor recovers with a new host.
- Actual Result: Observed from 18:54 - 19:10. The host was recovered quickly due to Kubernetes, and ingest briefly recovered to 1M events per second by 18:58 (only a few minutes).

Sequence 9
- Action: Perform Cassandra persistor maintenance.
- Expected Result: Cassandra regularly performs maintenance. Quine experiences this as increased latency and should backpressure the ingest to maintain stability during database maintenance.
- Actual Result: From 17:55 - 18:15, the ingest rate is reduced as a corresponding increase in latency is measured above 1ms across all nodes from the Cassandra persistor.

Sequence 10
- Action: Let Quine consume the remaining Kafka stream.
- Expected Result: Observe the Quine hosts drop to zero events per second (not all at once).
- Actual Result: Observed from 19:10 - 19:35. Around the time Cassandra persistor latency was returning to 1ms and ingest returned to 1M events per second, the pre-loaded ingest stream began to run out on some hosts. For the following 20 minutes, hosts exhausted their partitions in the stream.

The Results

Figure 1: Overall Ingest Rate

As you can see from the overall ingest rate results:

- #1 shows an initial peak of 1.25M events/sec
- #2 Quine settles into a steady ingest rate > 1 million events/sec
- #3 Quine recovers nicely after a single node is killed
  - Quine settles into a steady ingest rate > 1 million events/sec
- #s 4 and 9 show Cassandra maintenance events (see Cassandra Latency - Figure 3)
- #5 Quine has no problem with two-node failure events.

We observed that a persistor node high-latency event (7) has a more marked impact on performance than either a Quine node failure (5) or an outright failure of a persistor node (8). In the case of a clear failure, Kubernetes is quick to replace the node, allowing ingest to resume. When a persistor node is non-responsive but not clearly down, Quine’s response is to back pressure ingest until the node is recovered. An alternate variation on this test could use more persistor machines to stabilize ingest rates during maintenance events.

Figure 2: Per Host Ingest Rate - Quine Only

The individual Quine node ingest graphs indicate when individual nodes are offline and reinforce the observation that Quine Enterprise’s cluster resilience allows for smooth operation during high-volume ingest, even in the face of a Quine node shutdown or failure. Quine’s overall performance more closely tracks persistor performance, which makes the persistor an area of operational focus for anyone planning a production deployment.

Figure 3: Cassandra Persistor Latency

The median query latency for the Cassandra cluster during this test was < 1 ms. Even during/following persistor shutdown (8) or node failure (7), cluster latency stayed < 1.5 ms. Events at (1), (5), and (8) all reflect increased latency for single nodes.
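The back-pressure behavior described in these results can be illustrated with a small, generic sketch. This is not Quine's implementation; it is a toy producer and consumer joined by a bounded queue, where a slow "persistor" automatically slows the "ingest" side instead of letting work pile up. The names `ingest`, `persist`, and `max_in_flight` are hypothetical.

```python
import asyncio


async def ingest(queue: asyncio.Queue, events) -> None:
    """Producer: put() suspends whenever the bounded queue is full."""
    for event in events:
        await queue.put(event)
    await queue.put(None)  # sentinel: no more events


async def persist(queue: asyncio.Queue, write_latency_s: float, stats: dict) -> None:
    """Consumer: simulates a persistor whose writes take write_latency_s."""
    while True:
        stats["max_depth"] = max(stats["max_depth"], queue.qsize())
        event = await queue.get()
        if event is None:
            break
        await asyncio.sleep(write_latency_s)  # simulated storage write
        stats["written"] += 1


async def run(n_events: int, max_in_flight: int, write_latency_s: float) -> dict:
    queue = asyncio.Queue(maxsize=max_in_flight)
    stats = {"max_depth": 0, "written": 0}
    await asyncio.gather(
        ingest(queue, range(n_events)),
        persist(queue, write_latency_s, stats),
    )
    return stats


stats = asyncio.run(run(n_events=50, max_in_flight=4, write_latency_s=0.001))
```

Raising `write_latency_s` in this sketch lowers the effective ingest rate without losing events or growing the queue, which mirrors the pattern reported in sequences 4, 7, and 9.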
Standing Queries and 1 Million 4-node traversals per second

Figure 4: Standing Query Results (events emitted/second)

The purpose of running any complex event processor, Quine included, is to detect and act on high-value events in real time. This could mean detecting indications of a cyber attack, or video stream buffering, or identifying e-commerce upsell opportunities at check out. This is where Quine really excels.

Standing queries are a unique feature of Quine. They monitor streams for specified patterns, maintaining partial matches and executing user-specified actions the instant a full match is made. Actions can include anything from updating the graph itself by creating new nodes or edges to writing results out to Kafka or Kinesis, or posting results to a webhook.

In this test, Quine standing queries monitored for specific 4-node patterns requiring a 4-node traversal every time an event was ingested. Traditional graph databases slow down ingest when performing multi-node traversal. Not Quine.

Quine’s ability to sustain high-speed data ingest together with simultaneous graph analysis is a revolutionary new capability. Not only did Quine ingest more than 1,000,000 events per second, it analyzed all that data in real time to find more than 20,000 matches per second for complex graph patterns. This is a whole new world!
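To make the idea of matching a 4-node pattern on every ingested event concrete, here is a toy incremental matcher. It is an illustration of the general technique only, not how Quine evaluates standing queries: the `ChainMatcher` class is hypothetical, it keeps everything in memory, and it only looks for directed chains of the form a -> b -> c -> d.

```python
from collections import defaultdict


class ChainMatcher:
    """Incrementally finds directed 4-node chains a -> b -> c -> d."""

    def __init__(self):
        self.out_edges = defaultdict(set)  # node -> successors
        self.in_edges = defaultdict(set)   # node -> predecessors
        self.emitted = set()               # chains already reported

    def add_edge(self, src, dst):
        """Record one edge and return only the chains it newly completes."""
        if dst in self.out_edges[src]:
            return []  # duplicate edge, nothing new can appear
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)

        candidates = []
        # New edge is the first hop: src -> dst -> c -> d
        for c in list(self.out_edges[dst]):
            for d in list(self.out_edges[c]):
                candidates.append((src, dst, c, d))
        # New edge is the middle hop: a -> src -> dst -> d
        for a in list(self.in_edges[src]):
            for d in list(self.out_edges[dst]):
                candidates.append((a, src, dst, d))
        # New edge is the last hop: a -> b -> src -> dst
        for b in list(self.in_edges[src]):
            for a in list(self.in_edges[b]):
                candidates.append((a, b, src, dst))

        new_matches = []
        for chain in candidates:
            # Require four distinct nodes and report each chain once.
            if len(set(chain)) == 4 and chain not in self.emitted:
                self.emitted.add(chain)
                new_matches.append(chain)
        return new_matches


matcher = ChainMatcher()
first = matcher.add_edge("a", "b")   # partial match only
second = matcher.add_edge("c", "d")  # still partial
third = matcher.add_edge("b", "c")   # completes a -> b -> c -> d
repeat = matcher.add_edge("b", "c")  # duplicate edge
```

Each new edge is checked only against its immediate neighborhood, so the work per event depends on local connectivity and not on the total size of the graph.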
Summary Results

Resource Usage and Performance Metrics Overview

Quine Host Metrics
- GB RAM used per Quine host: 12 GB
- CPU% used per Quine host: 60%

Cassandra Persistor Node Metrics
- CPU% used per Cassandra persistor node: 80%+

Performance Metrics
- Overall Ingest Event Records/Second: >1,000,000
- Standing Query Results/Second: 21,000/sec
- Average Persistor Latency: 1 ms
- Data Storage Disk Space Used (Cassandra): 70 GB/host

Why Quine Hitting 1 Million Events/Sec Matters

Since its release in 2007 at the start of the NoSQL revolution, Neo4J has proven conclusively the value of graph to connect and find complex patterns in categorical data. The graph data model is indispensable to everything from fraud detection to network observability to cybersecurity. It is used for recommendation engines, logistics, and XDR/EDR.

But not long after NoSQL hit the scene, Kafka kicked off the movement toward real-time event processing. Soon, event processors like Flink, Spark Streaming, and ksqlDB brought the ability to process live streams. These systems relied on less-expressive key-value stores or slower document and relational databases to save intermediate data.

Quine is the graph analog and is important because now you can do what graph is really good at: finding complex patterns across multiple streams of data using not just numerical but categorical data.

Quine makes all the great graph use cases viable at high volumes and in real time.

Next Steps

If you want help planning your own test, or you would like to try Quine Enterprise, please contact us. You can also read more about Streaming Graph here.
Or you can start learning about Quine now by visiting the Quine open source project. We have a Slack channel where folks can ask questions and we are always up for a call.

Streaming Graph ETL: Real-time Video Observability Simplified

2022-09-06T00:00:00+00:00

A Live Event Stream, a CDN, and a Manifest Services Provider Walk into a Bar

Video observability, or the end-to-end monitoring of complex video streaming architectures, entails some of the most challenging aspects of data engineering. A live event will usually traverse three and sometimes more partner systems on its path from origin to the end user. Until recently, no single service provider in this chain of delivery had access to performance metrics of upstream or downstream providers, making diagnosing and resolving issues more difficult.

But as standards begin to emerge for data sharing between partners, a new challenge has emerged: how to combine enormous amounts of high cardinality and high dimensionality data, formatted inconsistently, into a single cohesive view that can be acted on in real time?

Quine is a streaming graph processor that provides a unique solution to all of these challenges by combining graph data modeling (e.g., Neo4J) with highly efficient event stream processing (e.g., Flink or ksqlDB).

End-to-end video observability is complicated by the many platforms and services a single stream traverses.

Video Observability Is Hard

End-to-end video observability is challenging for many reasons:

- Multiple services from multiple vendors: No one vendor operates the entire platform. Critical-path services include origins, trans/encoders, manifest services, entitlement services, ad services, CDNs, and video players. Each element generates its own logs, with different formats for user, client, and session IDs and time stamps, and each focuses on a different part of the platform.
- Many Dimensions to Track: Even when these logs are combined and somehow synchronized, you encounter high data dimensionality: device hardware configurations, device software versions, client IPs, server IPs, video player versions, video asset versions, etc.
- High Data Cardinality: Subnets and IPs, country/state/city designations, time stamps, and the combination of these with all the above dimensions.
- Categorical Data: Most log data are non-numeric (URL strings, classifications, asset titles, IP addresses, etc.) and, while valuable, this data is often discarded. Encoding these values as numbers is very difficult to manage and not always useful.
- Scale: Live video events generate significant volumes of data within very short time periods, as live broadcast events start for millions of viewers.
- Real-time events need real-time fixes: The time to fix a video streaming issue is when the issue is ruining the user experience, especially for live streams.

Multiplying Complexity

Any one of the above reasons presents a significant barrier to correctly detecting and diagnosing network issues. Take this entire matrix of possible data combinations and operational challenges together and you are faced with a significant challenge to model and analyze the end-to-end behavior of live events in time frames suitable for remediation. Additionally, the costs of legacy log analysis tools are prohibitive at scale.

As a result, most broadcasters monitor individual elements of the video delivery workflow and use intuition to link element behavior to the end-to-end system. Sometimes, this doesn’t work out so well.
What Data Structure For Video Observability?

As described in an earlier blog, Defining Video Observability, combining event data from each component of the video delivery workflow is required to understand the contribution of each component to the end-to-end video delivery experience. Connecting the logs of origins, manifest services, and CDNs provides several meaningful benefits. It lets you:

- understand the impact of one system component on the complete system and other components.
- build and measure KPIs that align with user experience.
- prioritize issues based on their impact to user experience.
- identify root causes of identified problems and issues.

Assembling log and event data into a holistic end-to-end view allows operators at any point in the experience stack to quickly identify an issue’s root cause and enables real-time remediation and automation. Without a representation of the entire system, it is quite common to incorrectly diagnose causes, wasting time and good will with operations staff, vendors, and customers.

The emerging CMCD standard represents a more efficient mechanism for matching CDN and video player client data to correlate video stream viewer experience with the CDNs that deliver the video streams. I'll dig deeper into the specifics of CMCD in a future post.

Solving Multi-Source Data Ingestion Challenges

Synthesizing a unified view from multiple event streams in real time presents operational challenges as well as some of the data-specific problems discussed above. When millions of devices are connecting to global CDNs with thousands of POPs in dozens of countries, data will almost certainly arrive out of order. As partners, event types, and platforms change, schemas must be able to react without downtime. And, perhaps most importantly of all, detecting patterns that indicate issues and acting on them instantaneously can be a challenge for most databases.
Quine, which combines a graph data model with the real-time capabilities of event stream processing, is built to solve these operational issues.

Streaming Graph Efficient Analysis

Log event dimensionality and cardinality are a critical challenge in video observability. Near-endless combinations of data, as shown below, require hundreds of tables in traditional RDBMS systems. Joins that connect data from multiple tables are compute intensive, and the cost of these joins increases with the number of tables. Querying across tables is particularly expensive when there is a “fan out” of one table to subsidiary tables, as shown with “Asset” in the image.

Graph data modeling offers an alternate approach, storing dimensions that would be rows in tables as nodes and describing the relationships between nodes as edges. This model makes associating a video playout event with the CDN, user, asset, geography, etc., a very low-cost action. When applied to video observability data at scale, the efficiency of graph is significant compared to traditional RDBMS operations.

Graph data structures provide an excellent alternative to relational models for real-time analysis.

Importantly, once data is stored in the streaming graph, we can define new KPIs that encompass insights from each element of the video delivery workflow. Calculating a continuous state for end-to-end latency, or tracing the current state of CDN or asset health for a specific geo/ASN/CDN, becomes trivial, even though this represents hundreds of thousands, or even millions, of separate values.

Categorical Data

Categorical data (content titles, email addresses, process IDs, IP addresses) is incredibly valuable for root cause analysis and, amazingly enough, frequently ignored by enterprises. Quine greatly expands the utility of log data and the effectiveness of log analytics by processing non-numerical data in its natural, categorical form.
The avoidance of one-hot encoding simplifies data management and reduces computation needs, while making the system more human-friendly to operate.

Knowing When to Act on Streaming Data

A significant advantage of the Quine streaming graph is its ability to generate actions in real time as data arrives. It does this using a feature unique to Quine: standing queries. Think of standing queries as a sort of filter placed inline with the event stream, watching for any event data that is part of a pattern of interest, for example, a series of events that suggest an issue with a POP’s network or client connectivity. As new events are ingested into the graph, standing queries update this partial match, waiting and watching for a full match to occur.

Traditional systems must continuously query to see if a full match has occurred. This is an expensive operation and introduces delays. With Quine, when a full match occurs, action is instantaneous.

Possible actions can include anything from sending alerts to other systems (via Kafka, Kinesis, HTTP POST, and more) to updating the graph data itself. Either way, by acting in real time, Quine can be the difference between anticipating and avoiding an issue and trying to fix it once it has already occurred.

Out-of-Order Data Handling

In a distributed system of global scale, events do not always arrive in the order they were created. Systems that have dropped off the network can send event data when they reconnect seconds, minutes, hours, or even days later. Quine solves this by maintaining partial matches to queries, adding to the graph as data arrives and triggering actions like alert messages when a complete match is made. The order, and the interval between events in a pattern, do not matter. For example, the creation of a “user video session” will complete as soon as the periodic client beacons, CDN logs for video chunks, origin server logs, and manifest files all arrive.
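The “user video session” example above can be sketched in a few lines. This is a simplified illustration of order-independent partial matching, not Quine's standing query engine; the `SessionAssembler` class and the four part names are hypothetical stand-ins for the log sources named in the text.

```python
# The four kinds of records a session needs before it counts as complete.
REQUIRED_PARTS = frozenset({"client_beacon", "cdn_log", "origin_log", "manifest"})


class SessionAssembler:
    """Keeps a partial match per session and fires once all parts have arrived."""

    def __init__(self, on_complete):
        self.partial = {}        # session_id -> set of parts seen so far
        self.completed = set()   # sessions that already fired
        self.on_complete = on_complete

    def observe(self, session_id, part):
        """Record one arriving record; return True if it completed the session."""
        if part not in REQUIRED_PARTS or session_id in self.completed:
            return False
        seen = self.partial.setdefault(session_id, set())
        seen.add(part)
        if seen != REQUIRED_PARTS:
            return False  # still a partial match, keep waiting
        del self.partial[session_id]
        self.completed.add(session_id)
        self.on_complete(session_id)
        return True


completed_sessions = []
assembler = SessionAssembler(completed_sessions.append)

# Records arrive interleaved and in no particular order.
arrivals = [
    ("s1", "origin_log"),
    ("s2", "client_beacon"),
    ("s1", "manifest"),
    ("s1", "client_beacon"),
    ("s2", "cdn_log"),
    ("s1", "cdn_log"),  # last missing part for s1
]
results = [assembler.observe(session_id, part) for session_id, part in arrivals]
```

Because the state is a set of parts and not a sequence, neither the arrival order nor the gap between records affects when the session completes.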
An Example of Real-time Root Cause Analysis

The combination of all these streaming graph capabilities produces a system well suited for ingestion of logs of events characterized by highly dimensional, categorical data from multiple systems or sources, as well as the evaluation of this data for outage or service degradation conditions in real time.

Consider an example using client, CDN, and origin logs where the goal is to identify and track patterns of events suggestive of performance issues that could lead to service degradations, and to issue specific, actionable alerts when the number of these events (which I call KPIs here) exceeds a user-defined threshold.

After ingesting events into Quine, standing queries will continuously evaluate arriving data for patterns of service failure or degradation. When these “issue causes” are identified for any new data, high-level KPIs (e.g. count of failure events for a server or Geo/ASN) will roll up the individual events to assess the significance of issues. When KPIs indicate a significant issue is occurring, the root cause definition is already known and made available to upstream systems for automated remediation, or published to NOC ticket management systems.

The figure above illustrates the events Quine monitors for and, in this case, a real-time alert (in red) that reports client-observed re-buffering at significant enough volume to warrant investigation, and with identification that the issue is related to a CDN edge server in Tampa that is serving users on the AT&T ISP (in orange). The alert that is issued provides the information an operator would need to understand and take action on an issue. Standing queries can publish this alert, and/or the raw event data that contributed to the KPI threshold being met, to Kafka, Kinesis, an API, or even Slack, whatever fits the desired workflow.
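The KPI roll-up described in this example can be approximated with a simple counter keyed by the suspected cause. The sketch below is illustrative only: the `KpiRollup` class, the field names, and the threshold of 3 are assumptions, and a real deployment would publish the alert to Kafka, Kinesis, or similar instead of appending to a list.

```python
from collections import Counter


class KpiRollup:
    """Counts issue events per (cause, edge server, ISP) and alerts past a threshold."""

    def __init__(self, threshold, on_alert):
        self.threshold = threshold
        self.counts = Counter()
        self.alerted = set()     # keys that have already raised an alert
        self.on_alert = on_alert

    def record(self, cause, edge_server, isp):
        key = (cause, edge_server, isp)
        self.counts[key] += 1
        # Alert once, the first time the count exceeds the threshold.
        if self.counts[key] > self.threshold and key not in self.alerted:
            self.alerted.add(key)
            self.on_alert({
                "cause": cause,
                "edge_server": edge_server,
                "isp": isp,
                "count": self.counts[key],
            })


alerts = []
rollup = KpiRollup(threshold=3, on_alert=alerts.append)

# Hypothetical issue events already identified by upstream pattern matching.
for _ in range(5):
    rollup.record("rebuffering", "tampa-edge-1", "AT&T")
for _ in range(2):
    rollup.record("rebuffering", "denver-edge-4", "AT&T")
```

The alert carries the cause, server, and ISP with it, so whoever receives it does not have to run a separate investigation to find out where the problem sits.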
Without a graph data structure, combining all this categorical and numerical data into a single materialized view and quickly traversing connections to detect completed patterns would not be possible. However, unlike graph databases, Quine is designed to process high volumes of event data and trigger alerts in real time. All this adds up to more reliable stream delivery, more revenue, satisfied advertisers, and most importantly of all, happy customers. Try Quine Streaming Graph Yourself</h2> If you want to try it on your own logs, here are some resources to help: Download Quine - JAR file</a> | Docker Image</a> | Github</a></li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data</a></li> CDN Cache Efficiency Recipe</a> - this recipe provides more ingest pattern examples</li> </ol> Are You Ready for Low and Slow Auth Attacks? 2022-08-23T00:00:00+00:00 Preventing Authentication Attacks In Real Time</h2> Authentication attacks come in many forms, each using different strategies with distinct, often difficult to detect, characteristics. Detecting password spraying attacks is particularly difficult due to the deliberately low frequency of authentication attempts, the number of services probed, and the extended time period across which attempts are made. Detecting and preventing password spraying attacks in real time is impossible with current solutions. I’ll take a look at why this is and how Quine changes the game. Low and Slow Attacks: Brute Force, A Little Bit At A Time</h2> In the past, brute force attacks have been synonymous with easy-to-spot bursts of machine-driven activity designed to overwhelm defenses. But as attackers gain sophistication, they have found ways to reduce their profile while still harnessing the power of automation.
Low and slow attacks use automation to spread authentication attempts over days, weeks and months, in addition to distributing the attempts across a network of target systems, from a range of source IPs. Based on MITRE definitions of brute force attacks</a>, Password Spraying, Password Guessing, and Credential Stuffing attacks all leverage metered activity to probe password systems so slowly that failed attempts go undetected by legacy time window-based “lock out” business rules. Why Low and Slow Attacks Work</h2> Volumetric brute force password attack strategies can often be detected due to their size and velocity using typical statistical analysis mechanisms. Password spraying attacks take a very different approach, probing multiple accounts for commonly used or compromised passwords. These attacks attempt to stay under the threshold that would trigger “3 strikes and you’re locked out” rules typically used by authentication applications. Of course, current authentication attack prevention measures do not stop with rules defined in authentication applications. Logs, often from multiple systems (e.g. firewalls, DNS, and web authentication logs), are typically processed by log/SIEM analysis solutions which perform more complex analysis, including analysis of multiple data sets concurrently or across longer time periods. SIEMs, however, are by definition not analyzing data in real time and their use is limited by the volume of data they retain for active analysis, and specifically by the costs to retain and process that data. Detection Time Frames vs. Low and Slow Attack Behavior Patterns Real-time application rulesets don’t have the context gained from looking at long time periods of data or from data sourced from other systems.
Batch-based log/SIEM analysis tools can perform more complex analytics but are not in the real-time flow of authentications, meaning you may not find out about successful attacks until hours or days later, and they make it prohibitively expensive to incorporate the extended time frames of data needed to find low and slow attack behaviors. The tradeoffs with current approaches are stark: either impose time windows to process events in real time and reduce cost, or sacrifice real-time responsiveness to store and process data over a greater time interval at great expense. In either case, it isn’t clear you’ll be able to prevent all or even some low and slow attacks. That’s what makes this attack vector so insidious. Low Cost, Real-time Analysis without Time Windows</h2> With low and slow attack strategies exploiting the limited time window visibility of existing application and log analysis solutions, new detection and response tools are needed. These tools need to: support detection of attack behavior patterns in logs from multiple systems, over extended periods of time, while being,</li> cost aligned with the large data retention needs of active extended time window monitoring.</li> </ol> Low and Slow Attack Detection Requires a New Tool ROI Paradigm Cost-effective complex log analysis on enterprise or service provider scale requires a new approach. Streaming Graph Makes Windowless Pattern Detection Cost Effective, Real Time</h2> The open source Quine Streaming Graph offers a new approach to the complex behavior analysis necessary for the detection of password spraying and other low and slow attacks (including advanced persistent threats, or APTs).
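To make the time-window blind spot concrete, here is a small Python sketch. It is purely illustrative and is not how Quine is implemented: the thresholds, the function names, and the sample data are all assumptions. A windowed "three strikes" rule never fires on a spray of one failure per day, while a windowless tally per source does.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60   # hypothetical lockout window
MAX_FAILURES = 3           # hypothetical "3 strikes" rule


def windowed_lockouts(failures):
    """Accounts with MAX_FAILURES failed logins inside one sliding window."""
    by_account = defaultdict(list)
    locked = set()
    for ts, account, _source in sorted(failures):
        # Forget failures that have aged out of the window.
        recent = [t for t in by_account[account] if ts - t < WINDOW_SECONDS]
        recent.append(ts)
        by_account[account] = recent
        if len(recent) >= MAX_FAILURES:
            locked.add(account)
    return locked


def windowless_sources(failures, min_accounts=3):
    """Sources that have failed against many distinct accounts, over any span."""
    accounts_by_source = defaultdict(set)
    for _ts, account, source in failures:
        accounts_by_source[source].add(account)
    return {s for s, accts in accounts_by_source.items() if len(accts) >= min_accounts}


DAY = 24 * 60 * 60
# One failed login per day, each against a different account, from one source.
spray = [(day * DAY, f"user{day}", "203.0.113.7") for day in range(5)]
```

With this sample, `windowed_lockouts(spray)` is empty while `windowless_sources(spray)` contains the spraying address. The catch for conventional tools is that the windowless tally has to be retained indefinitely, which is the cost problem the following sections address.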
Two key Quine innovations are of particular interest in this context: standing queries and partial match tracking over extended time windows. Standing queries are queries that live in the streaming graph and continuously filter new data for query matches in real time. They find low and slow behaviors across high volumes of logs from multiple systems and extended time periods, using graph query definitions that have proven much more efficient than traditional RDBMS query logic.</li> Partial match tracking across in-memory and persistent storage, at scale, allows Quine to retain possibly interesting incomplete matches until the moment when a complete match occurs. Deferring storage of high volumes of partial matches to inexpensive persistent storage addresses the cost issues associated with traditional log analysis systems, while Quine still operates in the real-time workflow when attacks are occurring to minimize the impact of a breach.</li> </ol> And Quine eliminates time windows without incurring the cost of SIEM solutions, sifting through data from multiple sources to find and store only the patterns that matter: in this case, the ones that indicate a low and slow attack is underway. Learn more or try Quine yourself</h2> Quine is available in both open source and enterprise editions. You can try it yourself. Learn how to ingest your own data and build a streaming graph that can detect all sorts of attacks in real time. Join Quine Community on Discord</a> and get help from thatDot engineers and community members.</li> Download Quine - JAR file</a> | Docker Image</a> | Github</a></li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka to ingesting .CSV data</li> Try the Ethereum Fraud Detection recipe</a> - this recipe showcases ingest and standing query patterns that you may find helpful.</li> </ol> Photo credit: Karl Ibri @karlibri</a> What's the difference between Categorical and Numerical Data?
2022-08-10T00:00:00+00:00 According to a 2020 Microstrategy survey, 94% of enterprises report data and data analytics are crucial to their growth strategy. And yet, surprisingly, as much as 73% of the data that enterprises collect is never used, including a vast majority of what is termed “categorical data.” Why would enterprises ignore an entire class of data? Especially when it is essential to high-priority use cases like personalization, customer 360, fraud detection and prevention, network performance monitoring, and supply chain management? The simple answer is that using categorical data with today’s tools is complex, and most data scientists aren’t trained to use it. Figuring out how to use categorical data will help companies solve complex problems that have long evaded them. And they’ll be able to do so with data they already have. Here’s a look at categorical data, why it’s hard to wrangle, and how it could be useful. Categorical Data 101</h3> There are two main types of data: categorical and numerical. Numerical data, as the name implies, refers to numbers. Categorical data is everything else. Categorical data is non-numerical information that is divided into groups. As its name suggests, categorical data describes categories or groups. Some examples of categorical data could be: A list of most popular baby names;‍</li> Census data, such as citizenship, gender, and occupation;</li> ID numbers, phone numbers, and email addresses;</li> Brands (Audi, Mercedes-Benz, Kia, etc.).</li> </ul> In some instances, categorical data can be both categorical and numerical. For example, weather can be categorized as either “60% chance of rain,” or “partly cloudy.” Both mean the same thing to our brains, but the data takes a different form. The Challenges of Categorical Data</h3> The same thing that makes categorical data so powerful makes it challenging. 
While it is easy for you and me to tell the relative difference between a dog and a plane versus a dog and a cat, doing so computationally is not so straightforward. To express the difference between two pieces of categorical data, one must use graph-based analytical tools or have a background in graph theory. This is why “knowledge graphs” have been a recent hot topic. Since graph tools are not so widespread in today’s enterprise and academic landscape, data scientists instead fall back on the statistical techniques they know and for which there are ready tools. Most machine learning algorithms can only handle numerical data. They can count instances of categorical data with real but limited utility. The other alternative is turning categorical data into numeric values using one of several encoding techniques. These techniques all tend to be slow and produce poor results – even making some goals impossible, like anomaly detection. Using categorical data comes with another challenge: high cardinality. Cardinality refers to the number of possible values for a particular category. For example, the cardinality of a list of all models of iPhone ever made is a relatively manageable 34. On the other hand, a list of serial numbers for all 2.2 billion iPhones sold since production began represents a high-cardinality data set. The size and complexity of traditional analytical approaches spiral quickly out of control with high-cardinality data. Additionally, almost all tools for turning categorical values into numbers (like one-hot encoding) require a fixed set of possible values known in advance. As some high-cardinality data values are unknown, this poses a problem since those tools cannot represent data they have never seen. With all these challenges, you can begin to understand why enterprises end up ignoring categorical data altogether. 
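The fixed-vocabulary limitation of one-hot encoding is easy to see in a minimal, hand-rolled Python sketch. The function names here are hypothetical and the code is not taken from any particular library; it only illustrates the general technique.

```python
def fit_vocabulary(values):
    """Assign each distinct category a column index, in first-seen order."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def one_hot(value, vocab):
    """Encode one value as a list of 0s with a single 1; fails on unseen values."""
    if value not in vocab:
        raise KeyError(f"unseen category: {value!r}")
    vector = [0] * len(vocab)
    vector[vocab[value]] = 1
    return vector


vocab = fit_vocabulary(["Audi", "Kia", "Audi", "Mercedes-Benz"])
encoded = one_hot("Kia", vocab)
```

Each encoded vector is as wide as the vocabulary, so a column with billions of distinct serial numbers yields vectors billions of entries wide, and a value that was not in the vocabulary at fit time (say, a brand that appears later) cannot be encoded at all.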
So, What Can You Do with Categorical Data?</h3> The enormous and unrealized value of categorical data for enterprises resides in its ability to represent the relationships between values in a way humans can readily understand and express. These relationships can include all the properties associated with an object – I am tall, blonde, married, and have two children – or the relationship between two objects – I wrote this article, and you are reading this article. You can use categorical data to efficiently group and connect classes of objects; for example, you can show all tall, blonde, married authors and the readers of their articles organized by geographic area and hobby. In doing so, you can uncover some unique insight and analysis. When you combine this “relationship thinking” with a computer’s ability to process enormous amounts of data, the astonishing power of categorical data becomes apparent. The Strengths of Graph Technology</h3> With the emergence of graph technology in recent years, enterprises can finally represent these relationships directly. A graph is built of nodes and edges; you can picture this with circles for nodes and arrows for edges that connect nodes. The node-edge-node pattern connects two categorical values (nodes) by a relationship represented by the edge. This is a natural way to represent data because that node-edge-node pattern corresponds perfectly to the subject-predicate-object pattern at the core of a natural human language. So anything you can say in words can be represented naturally in a graph. Then we can analyze the relationships between the values by following the connections between categorical data in a graph. Graph data structures connect information in a way that resembles the way we speak and think. The challenge of using categorical data is like having a pantry of canned food and no can opener. There’s food there, but you have no tools to access it. 
Instead of looking at the same data with the same approach, the next generation of streaming graph data tools needs to make categorical data more accessible and usable. We already see the success of categorical data as the key to improving anomaly detection in cybersecurity. But it’s only now that the tools for using this data to solve challenging problems are becoming available. thatDot Software for Categorical Data Processing</h2> thatDot streaming graph software is built specifically for categorical data. It combines a graph data structure (like Neo4J or TigerGraph) with the performance and scale of event processing systems like Flink and Spark Streaming. thatDot Novelty, built on thatDot Streaming Graph, is the first anomaly detection system to use categorical data, making it uniquely powerful. thatDot Streaming Graph is powered by Quine open source software. You can try it yourself either by downloading Quine or starting a Streaming Graph free trial. Learn how to ingest your own categorical data and build a streaming graph that can detect all sorts of attacks in real time. Try Streaming Graph</a> free for yourself.</li> Learn more about thatDot Streaming Graph</a>.</li> Join the Quine Discord Community</a> and get help from thatDot engineers and community members.</li> Download open source Quine – JAR file</a> | Docker Image</a> | Github</a></li> </ol> ‍ Blog Posts on Related Topics</h3> Stop Insider Threats With Automated Behavioral Anomaly Detection</a></li> Network Log Analysis Using Categorical Anomaly Detection</a></li> New to Quine’s Novelty Detector: Visualizations and Enhancements</a>‍</li> </ul> This article, in a slightly altered form, first appeared in Datanami</a> on July 25th, 2022. 
Photo by JJ Ying</a> on Unsplash</a> Kafka data deduping made easy using Quine's idFrom function 2022-07-27T00:00:00+00:00 Using Quine with Kafka as Source and Sink to Process Categorical Data</h2> Quine streaming graph is specifically designed to find high-value patterns in high-volume event streams, consuming data from APIs, data lakes, and most commonly, event stream processing systems. Quine is complementary to systems like Flink and ksqlDB, both of which are quite powerful but do not make it easy to connect and find complex patterns in categorical data</a>. A streaming system like Kafka allows developers to divide their monolithic applications into manageable components while addressing resilience and scalability needs. Switching to real-time event processing does not come without tradeoffs, however. Duplicate messages are common in streaming systems, and duplicate events will inevitably show up in a Kafka stream, especially at scale. Quine natively addresses duplicate and out-of-order data issues in streaming data pipelines. The Problem: Message Duplication Causes Multiple Negative Effects</h2> In a high-volume data pipeline, duplicate messages are unavoidable. The duplication of events is often the necessary side effect of guaranteeing that data is successfully delivered. The traditional solution is for a consumer application to record what it's seen recently and drop any event that is already processed. But event duplication can become a major challenge as your streaming system scales across multiple partitions. Stream consumers are usually distributed on different machines to help the system scale, making it difficult to quickly share knowledge of which events have already been processed. Each Kafka partition typically has its own consumer. If that consumer fails to process an event for any reason, then when it resumes it will request events starting from an earlier offset in Kafka.
The result is that duplicate events will get sent downstream to other applications. Processing events multiple times can cause inconsistencies within the facts that your application logic depends on. The effect is wrong analytic insights; or worse, your application performs the wrong actions. Here are a few of the common approaches for managing duplicate events in a streaming system. Allow duplicate messages to occur. Maybe processing duplicate events is not a problem in your system. However, most of the time this is not the case.</li> Perform deduplication in a database. This approach starts off fine until your DB won’t scale. It is common for this to turn into a batch processing approach that defeats the reason that you decided to develop a streaming system in the first place.</li> Create a deduplication service. Call out from your streaming system to look up an event (or event ID) to see if it has already been processed. This is the natural evolution of option #2 which turns into its own expensive and painful service to manage.</li> Change your business logic or requirements to allow idempotent processing. If none of the previous options are appealing, you might try to alter your algorithms or your goals so that processing a duplicate message will have no effect. This is often impossible.</li> </ol> A Better Solution: locate nodes in the graph with idFrom(…)</code></h2> Duplicate data delivery is one of the main problems Quine is built to solve. To understand how Quine solves this problem, let's first understand two of Quine's fundamental design concepts: In Quine, streaming event processing is performed by graph nodes backed by actors</a> scaled across any number of servers.</li> Quine behaves as if all nodes already exist.</li> </ol> Each event that Quine processes operates on a specific set of nodes in the graph. With traditional static graphs, your application must ensure that each node is created exactly once—and this becomes a big performance drain. 
Quine behaves as if all nodes exist already, but are not yet filled with data or connected to any other nodes. You don’t have to worry about “creating nodes” twice because all possible nodes exist already. There will always be exactly one right place to handle each message, if only it can be found… To find the node responsible for each message, Quine has a built-in function called idFrom(…). idFrom</code> takes data from the incoming event and deterministically turns it into a unique node ID in the graph. idFrom</code> is entirely deterministic. Given the same arguments, idFrom</code> will always return the same node ID. This is similar to a “consistent hashing” approach used for other purposes, but in this case, Quine returns a well-formed node ID instead of a hash. Node IDs are user-configurable, so they can take many forms, but by default node IDs will be UUIDs. See the documentation on idProviders for more information on idFrom</code> and alternate options for node ID types. Once we know the ID of a node in the graph, that node will handle processing the event and deduplicating future events. So if the same event is received by Quine twice, idFrom</code> will return the same nodeId</code> each time. Since Quine only saves to disk the changes to each node, the duplicate event becomes a no-op. The practical effect of this is that using idFrom</code> will resolve duplicate events in the stream automatically. So you can go back to building your application instead of micromanaging the event stream delivery guarantees. Using idFrom</code> within ingest stream queries is standard practice, even when a node is expected to show up repeatedly in the successive events. Take, for example, the Wikipedia page ingest</a> recipe. The ingest stream query refers to a dbNode</code> for each database where a page-create</code> event belongs. 
ingestStreams:
  - type: ServerSentEventsIngest
    url: https://stream.wikimedia.org/v2/stream/page-create
    format:
      type: CypherJson
      query: |-
        MATCH (revNode) WHERE id(revNode) = idFrom("revision", $that.rev_id)
        MATCH (dbNode) WHERE id(dbNode) = idFrom("db", $that.database)
        MATCH (userNode) WHERE id(userNode) = idFrom("id", $that.performer.user_id)
        SET revNode = $that, revNode.type = "rev"
        SET dbNode.database = $that.database, dbNode.type = "db"
        SET userNode = $that.performer, userNode.type = "user"
        WITH *, datetime($that.rev_timestamp) AS d
        CALL create.setLabels(revNode, ["rev:" + $that.page_title])
        CALL create.setLabels(dbNode, ["db:" + $that.database])
        CALL create.setLabels(userNode, ["user:" + $that.performer.user_text])
        CALL reify.time(d, ["year", "month", "day", "hour", "minute"]) YIELD node AS timeNode
        CREATE (revNode)-[:at]->(timeNode)
        CREATE (revNode)-[:db]->(dbNode)
        CREATE (revNode)-[:by]->(userNode)</code></pre>
Let's take a closer look at line two of the query. Notice that even when starting with an empty Quine system, we begin by MATCHing the dbNode</code>. We don’t create it because it already exists. We MATCH it with a WHERE constraint on its ID using idFrom</code>:
MATCH (dbNode) WHERE id(dbNode) = idFrom("db", $that.database)</code></pre>
Using idFrom</code>, Quine calculates the node ID using a combination of the string "db" and the value of the database</code> field passed in from the event: $that</code>. idFrom</code> will always return the same node ID when given the same arguments. NOTE: It's good practice to prefix the arguments to idFrom()</code> with a descriptive name for the type of values being passed in. This effectively creates a namespace that guards against accidental collisions on the ID that gets created. If another field coincidentally had the same value as $that.database</code>, prefixing it with a string ensures that the same value from different types doesn’t accidentally refer to the same node when it shouldn’t.
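The two ideas above, deterministic IDs and type prefixes as namespaces, can be mimicked in plain Python with the standard library's name-based UUIDs. This is only an analogy: `id_from`, `SKETCH_NAMESPACE`, and `handle_event` are hypothetical names for this sketch, and Quine's actual ID derivation and storage logic may differ.

```python
import uuid

# Arbitrary namespace for this sketch; Quine's real ID derivation may differ.
SKETCH_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def id_from(*parts):
    """Deterministically map the given values to a UUID (a stand-in for idFrom)."""
    # repr() keeps 1 and "1" distinct; the unit separator keeps parts apart.
    key = "\x1f".join(repr(p) for p in parts)
    return uuid.uuid5(SKETCH_NAMESPACE, key)


graph = {}  # node ID -> properties


def handle_event(event):
    """Apply an event to its node; return True only if the node changed."""
    node_id = id_from("db", event["database"])
    new_props = {"database": event["database"], "type": "db"}
    if graph.get(node_id) == new_props:
        return False  # duplicate delivery: nothing to write
    graph[node_id] = new_props
    return True


first = handle_event({"database": "enwiki"})
second = handle_event({"database": "enwiki"})  # redelivered duplicate
```

Here `first` is `True` and `second` is `False`: the redelivered event resolves to the same node and changes nothing. The `"db"` prefix also means `id_from("db", "enwiki")` and `id_from("user", "enwiki")` resolve to different nodes even though the values coincide.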
If we query the top five most connected database nodes, it reveals that idFrom</code> deterministically calculated node IDs thousands of times over a short period while processing the Wikipedia page-create</a> event stream.
❯ curl -s -X "POST" "http://0.0.0.0:8080/api/v1/query/cypher" \
  -H 'Content-Type: text/plain' \
  -d $'MATCH (n) WHERE n.type = "db" MATCH (n)-[r]-() RETURN DISTINCT n.database, count(r) ORDER BY count(r) DESC LIMIT 5' \
  | jq .</code></pre>
This produces the following results:
Database        Count
commonswiki     2953
wikidatawiki    1883
enwiki          790
ruwiki          144
enwiktionary    139</code></pre>
Using idFrom</code> to calculate the nodeId</code> tells us exactly where in the graph that message should be handled, whether it’s the first or thousandth time we’ve referred to that node. The processing on each node will only apply updates if the data actually needs updates. So when duplicate messages are routed to the same node, the second message behaves as a no-op and causes no troublesome side effects. idFrom</code> is a powerful tool that makes complex streaming data easier to reason about in a graph and is the foundation for developing with the Quine streaming graph. Just like Kafka, Quine is Open Source</h2> If you are using Kafka and have issues with duplicate data, Quine’s a great solution. Quine is open source so trying it out is as simple as downloading it and connecting it to Kafka.
Here’s a list of resources to get you started: Download Quine - JAR file</a> | Docker Image</a> | Github</a></li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka to ingesting .CSV data</li> Apache Log Recipe</a> - this recipe provides more ingest pattern examples</li> Join Quine Community on Discord</a> and get help from thatDot engineers and community members.</li> </ol> Save Big on SIEM Storage Costs Using Quine's Semantic ETL 2022-07-21T00:00:00+00:00 The High Cost of Storing Low Value Data</h2> The high cost of SIEM has given rise to countless articles and dozens of companies</a> promoting strategies or products to reduce monthly bills, with some claiming 50-90% reductions. While the 50-90% number seems a little overblown and is sure to be met with skepticism (enterprises tend to take a “better to store it and pay the price than regret we didn’t later” approach, especially when the data may have compliance implications[1]), the appeal is easy to understand. I took a look at the current methods for reducing SIEM costs and compared them to what graph ETL using Quine can accomplish, all while considering the impact on data fidelity. The State of Stream Pre-Processing: Random, Destructive, and Only Somewhat Effective</h2> Legacy event log pre-processing offerings typically employ one or more of six basic strategies to reduce the amount of data stored in the SIEM: Sample data</li> Filter out fields</li> Filter out events</li> De-duplicate</li> Aggregate/roll-up</li> Re-route some data to cheaper alternatives for cold storage (e.g., Logstash or Amazon S3)</li> </ol> These solutions also usually include the ability to set rules that refine system behavior by data source or event type – for instance, sampling one in five events from a log of failed authentication attempts but one in twenty events from an Apache access log.
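A per-source sampling rule of the kind just described might look like the following Python sketch. The rule table, source names, and `sample` function are hypothetical and do not correspond to any specific pre-processing product; the point is only to show what such a rule does to the data.

```python
SAMPLE_EVERY = {"auth_failure": 5, "apache_access": 20}  # hypothetical rules


def sample(events, rules=SAMPLE_EVERY):
    """Keep every Nth event per source; sources without a rule are kept whole."""
    seen = {}
    kept = []
    for event in events:
        source = event["source"]
        n = rules.get(source, 1)
        seen[source] = seen.get(source, 0) + 1
        if seen[source] % n == 0:
            kept.append(event)
    return kept


events = (
    [{"source": "auth_failure", "id": i} for i in range(100)]
    + [{"source": "apache_access", "id": i} for i in range(100)]
)
kept = sample(events)
```

Of the 200 input events, 25 survive (20 authentication failures and 5 access log lines). Which ones survive depends only on their position in the stream, not on their content, so the one failed login that mattered is as likely to be dropped as any other.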
It is important to note that stream pre-processing can only be applied to each stream and each record individually. Since many modern event processing use cases (not just SIEM but those for machine learning and e-commerce) depend on combining multiple data sources to model complex events, the single-stream approach means storing duplicate data from each stream required to connect them later (in SQL terms, these data are the keys used to join the various data sets once they are stored). "We were paying for 600 [GB] to 700 GB per day with Splunk, which meant we were lousy co-workers to our IT group, because we had to tell them, 'Send us this field, not that field,' and limit the data ingestion severely," said John Gerber, principal cybersecurity analyst at Reston, Va., systems integrator SAIC. -- from Elastic SIEM woos enterprises with cost savings</a> </blockquote> As the quote above makes clear, some approaches also require lots of operational intervention, meaning delays for analysts and data scientists and an overall increase in cost of ownership. The more important limitation is that these approaches cannot determine the value of the data they discard. They either throw data away or, in the case of aggregation, reduce fidelity</a>. All data is considered to have the same value. Quine’s approach is different: it turns high volumes of low-value data into low volumes of high-value data. Instead of storing data in Splunk or a similar system and then determining value, Quine can evaluate data as it arrives and make choices to store or discard based on the problem you are trying to solve. Quine Ingest Queries: Semantic ETL for High Value Data</h2> At the heart of how Quine processes data are two query types: ingest and standing queries (more on the latter below).[2] Quine uses ingest queries to consume event data and construct your streaming graph database.
Ingest queries perform real-time ETL on incoming streams, combining multiple data sources (for example from multiple Kafka topics, Kinesis streams, data from databases, live feeds via APIs</a>) into a single streaming graph, eliminating the need to keep duplicate data around for joins. Using Quine’s ingest ETL, you can join all the data, eliminating cross-data stream duplicates. That accounts for some incremental data reduction over existing methods, which along with the other five strategies (all of which Quine supports) means Quine offers superior savings on your SIEM costs. But more than just deduplicating data, joining streams lets you draw conclusions early about what makes some data more valuable than other data. Quine’s real power, however, is its ability to apply a semantic filter to your data to find patterns made up of multiple events. And it does so as data streams in. Save Only the Patterns That Matter</h2> Ingest queries make it easy to organize the high value, often complex, patterns in data into graph structures. These patterns are characterized by the relationships between multiple events. In a practical sense, you are shaping the data into a form that anticipates the analysis you will perform downstream in your SIEM. Quine can join, interpret, and trim away any data not relevant to the answers. What you end up creating in your graph-ETL are subgraphs, or patterns of two or more nodes and connecting edges. Here are a few real world examples from the Quine community: Find and store all instances where there have been attempts (both successful and failed) to log into the accounts of members of the executive team from multiple IP addresses</li> </ol> A subgraph for monitoring authentication fraud attempts. Find and store all instances where multiple processes in different office locations are sending messages to the same IP address</li> </ol> A subgraph for monitoring processes and the IP addresses to which they write.
In both of these examples, the test for what you keep and what you discard is based on what might possibly be important, on what matters to your business. What if data takes time to become interesting?</h2> One challenge processing streaming data – especially when event data arrives from many networked sources – is that it can arrive late or out of order, obscuring what would otherwise be an interesting pattern. Consider the examples above. What if the login attempts in example one were spread out over days or even weeks? What if log events from several locations in example two (above) were delayed for several hours or started at different times? Quine handles this late arriving data (as well as out of order data) using standing queries. Standing queries persist in the graph, storing partial matches and triggering actions when a full match occurs. Standing queries live inside the graph and automatically propagate the incremental results computed from both historical data and incoming streaming data. Once matches are found, standing queries trigger actions using those results (e.g., execute code, transform other data in the graph, publish data to another system like Apache Kafka or Kinesis). The implication for SIEM storage reduction is that Quine can temporarily retain possibly interesting incomplete patterns until a match occurs. It is neither discarded nor taking up costly space in your SIEM. Then, at the instant the match occurs, it is sent along to the SIEM system for regular processing. If a match doesn’t occur within a useful period, the data can be discarded automatically. Want to go further? Consider bypassing your SIEM altogether and sending alerts and data directly to your SOC or NOC’s dashboards, analysts, or data science team as it arrives and matches occur. But that’s for the next blog post. Until then, try out Quine’s graph ETL on your own log data. It is open source and easy to get started with. 
Who knows, it might just save you a few million dollars. Help Getting Started</h2> If you want to try it on your own logs, here are some resources to help: Download Quine - JAR file</a> | Docker Image</a> | Github</a></li> Check out the Ingest Data into Quine blog series covering everything from ingest from Kafka</a> to ingesting .CSV data</li> Apache Log Recipe</a> - this recipe provides more ingest pattern examples</li> Join Quine Community Slack</a> and get help from thatDot engineers and community members.</li> </ol> --- [1] Set aside that there are better, cheaper alternatives for this specific use case (using an expensive SIEM provider for it is a bit like renting a penthouse apartment for all your junk instead of a storage locker). The fact remains: companies aren’t going to get rid of a certain amount of their data, no matter what. [2] If you are interested in a deeper technical understanding of Quine's architecture, try our white paper. Drive Streaming Event Workflows with Standing Queries 2022-07-06T00:00:00+00:00 Standing Queries: Turning Event-Driven Data into Data-Driven Events</h2> Quine's superpower is the ability to store and execute business logic within the graph as a query. That query can then operate directly on data as it streams in. We call this type of query a standing query. A standing query incrementally matches some graph structure while new data is ingested into the graph. Quine’s special design makes this process extremely fast and efficient. When a full pattern match is found, a standing query takes action. A standing query is defined in two parts: a pattern and an output. The pattern defines what we want to match, expressed in Cypher using the form MATCH …</code> WHERE …</code> RETURN …</code>. The output defines the action(s) to take for each result produced by the RETURN</code> in the pattern query. The result of a standing query output is passed to a series of actions which process the output.
This output can be logged, passed to other systems (via Kafka, Kinesis, HTTP POST, and more), or used to perform additional actions like running new queries or even rewriting parts of the graph: whatever logic your application needs. How nodes match patterns</h2> Each node in Quine is backed by an actor, which makes each graph node act like its own little CPU. Actors function as lightweight, single-threaded logical computation units that maintain state and communicate with each other by passing messages. The actor model enables you to execute a standing query that is stored in the graph and remembered automatically. When you issue a DistinctId</code> standing query, the query is broken into individual steps that can be tested one at a time on individual nodes. Quine stores the result of each successive decomposition of a query (smaller and smaller queries) internally on the node issuing that portion of the query. The previous node's query is essentially a subscription to the next node's status as either matching the query or not. An actor associated with each node performs incremental computation. Any change in the next node's pattern match state results in a notification to the querying node. In this way, a complex query is relayed through the graph, where each node subscribes to whether or not the next node fulfills its part of the query. When a complete match is made, or unmade, the chain is notified with results and an output action is triggered. Info: There are two pattern match modes, DistinctId and MultipleValues. The pattern query must take the form MATCH … WHERE … RETURN …</code>. When the mode is DistinctId</code>, the pattern query's RETURN</code> must also be DISTINCT</code>. Creating a standing query</h2> The first step to making a standing query is determining the graph pattern you want to watch for.
You may have deployed Quine in your data pipeline to perform a series of tasks to isolate data, implement a specific feature, or monitor the stream to find a specific pattern in real time. In any case, Quine will implement your logic using Cypher. Let's demonstrate this concept using Quine's built-in synthetic data generator, which was introduced in v1.3.0. Say that you need to relate every number on a number line to the result of dividing it by 10 using integer division</a> (where dividing always returns a whole number; the remainder is discarded).

ingestStreams:
  - format:
      query: |-
        WITH gen.node.from(toInteger($that)) AS n,
             toInteger($that) AS i
        MATCH (thisNode), (nextNode), (divNode)
        WHERE id(thisNode) = id(n)
          AND id(nextNode) = idFrom(i + 1)
          AND id(divNode) = idFrom(i / 10)
        SET thisNode.i = i,
            thisNode.prop = gen.string.from(i)
        CREATE (thisNode)-[:next]->(nextNode),
               (thisNode)-[:div_by_ten]->(divNode)
      type: CypherLine
    type: NumberIteratorIngest
    ingestLimit: 100000</code></pre>

This creates a graph with 100000 nodes and a shape that we can use for our example.

Numbers divisible by 10 using integer division

In the example above, I want to count the unique times that a pattern like the one visualized above occurs in a sample of 100000 numbers. A key to our pattern is the existence of the prop property on a node, which is generated by the gen.string.from()</code> function. The complete recipe</a> is in the Quine repo if you want to follow along. To detect a pattern in our data, we can write a Cypher query in the pattern</code> section:

standingQueries:
  - pattern:
      query: |-
        MATCH (a)-[:div_by_ten]->(b)-[:div_by_ten]->(c)
        WHERE exists(c.prop)
        RETURN DISTINCT id(c) as id
      type: Cypher
    outputs:
      count-1000-results:
        type: Drop</code></pre>

It is looking for a number which is the ten-divisor of another number which is also the ten-divisor of a number in the graph.
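The arithmetic behind that pattern can be checked outside Quine. The following is a minimal sketch in plain Python, not Quine code: it models only the div_by_ten edges as a dictionary and assumes the number iterator starts at 0. The names INGEST_LIMIT, div_by_ten, and matching_roots are made up for this illustration.

```python
# Plain-Python model of the recipe's div_by_ten edges; no Quine involved.
INGEST_LIMIT = 100_000  # same value as ingestLimit in the recipe

# Assumption: the iterator yields 0, 1, 2, ... and each ingested number i
# gets one outgoing div_by_ten edge to i // 10 (integer division).
div_by_ten = {i: i // 10 for i in range(INGEST_LIMIT)}


def matching_roots(edges):
    """Return the distinct c for which some path a -> b -> c exists."""
    roots = set()
    for a, b in edges.items():
        c = edges.get(b)  # None when b was never ingested (no outgoing edge)
        # "c in edges" stands in for exists(c.prop): only ingested nodes
        # have the prop property set.
        if c is not None and c in edges:
            roots.add(c)
    return roots


roots = matching_roots(div_by_ten)
print(len(roots), min(roots), max(roots))  # prints: 1000 0 999
```

Under these assumptions the distinct roots are exactly the numbers 0 through 999, which is the same count the recipe reports when the ingest completes.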
In other words, it is looking for one of the first 1000 nodes created by our "number iterator" ingest.

❯ java -jar quine -r sq-test.yaml
Graph is ready
Running Recipe Standing Query Test Recipe
Using 1 node appearances
Using 11 quick queries
Running Standing Query STANDING-1
Running Ingest Stream INGEST-1
Quine web server available at http://0.0.0.0:8080
INGEST-1 status is completed and ingested 100000
 | => STANDING-1 count 1000</code></pre>

This example simply counts how many matches are detected, using the standing query output</code> variant type: Drop</code>.

Standing query result output: driving workflows</h2>

Say that instead of just counting the number of times that the pattern matches, we need to output the match for debugging or inspection. We can replace the Drop</code> output with a CypherQuery</code> that uses the matched result and then prints information to the console. When issuing a DistinctId</code> standing query, the result of a match is a payload that looks like:

{
  "meta": {
    "isPositiveMatch": true,
    "resultId": "2a757517-1225-7fe2-0d0e-22625ad3be37"
  },
  "data": {
    "a.id": 45110,
    "a.prop": "YH32SISr",
    "b.id": 4511,
    "b.prop": "fqx8aVAU",
    "c.id": 451,
    "c.prop": "61mTZqH8"
  }
}</code></pre>

This payload includes the ID of the node that initially matched in the data</code> field, so we can write a new Cypher query, triggered by this match, to fetch additional information:

MATCH (a)-[:div_by_ten]->(b)-[:div_by_ten]->(c)
WHERE id(c) = $that.data.id
RETURN a.i, a.prop, b.i, b.prop, c.i, c.prop</code></pre>

The MATCH</code> portion looks similar to our standing query, but this time we're not monitoring the graph; we're fetching data from the three-node pattern rooted at (c)</code>. Replacing the count-1000-results</code> output with inspect-results</code> from below would accomplish just that.
inspect-results:
  type: CypherQuery
  query: |-
    MATCH (a)-[:div_by_ten]->(b)-[:div_by_ten]->(c)
    WHERE id(c) = $that.data.id
    RETURN a.i, a.prop, b.i, b.prop, c.i, c.prop
  andThen:
    type: PrintToStandardOut</code></pre>

The outputs stage of a standing query is where you can express your business logic and put Quine to work for you in your data pipeline. Take some time to review all of the possible output types in our API documentation</a> located on the quine.io website.

Modifying standing queries</h2>

Modify a Standing Query Output</h3>

You need to notify Quine of changes whenever you modify the outputs</code> section of an existing standing query. The Quine API supports two methods on the /api/v1/query/standing/{standing-query-name}/output/{standing-query-output-name}</code> endpoint: DELETE</code> removes an existing output and POST</code> adds a new output to an existing standing query. From above, let's change the original standing query output type from Drop</code> to a new CypherQuery</code> that outputs the matches to the console. We will use two API calls to accomplish the change. Delete the existing output:

curl --request DELETE \
  --url http://0.0.0.0:8080/api/v1/query/standing/STANDING-1/output/count-1000-results \
  --header 'Content-Type: application/json'</code></pre>

Create the new output:

curl --request POST \
  --url http://0.0.0.0:8080/api/v1/query/standing/STANDING-1/output/inspect-results \
  --header 'Content-Type: application/json' \
  --data '{
    "type": "CypherQuery",
    "query": "MATCH (a)-[:div_by_ten]->(b)-[:div_by_ten]->(c) WHERE id(c) = $that.data.id RETURN a.i, a.prop, b.i, b.prop, c.i, c.prop",
    "andThen": {
      "type": "PrintToStandardOut"
    }
  }'</code></pre>

Propagate a New Standing Query</h3>

When a new standing query is registered in the system, it is automatically registered only on new nodes (or old nodes that are loaded back into the cache).
This behavior is the default because proactively setting the standing query on all existing data might be quite costly, depending on how much historical data there is. So Quine defaults to the most efficient option. However, sometimes there is a need to actively propagate standing queries across all previously ingested data as well. You can use the API to request that Quine propagate a new standing query to all nodes in the existing graph. Here's how the request looks in curl</code>. The URL is quoted so that the shell does not try to interpret the ?</code> in the query string.

curl --request POST \
  --url 'http://0.0.0.0:8080/api/v1/query/standing/control/propagate?include-sleeping=true' \
  --header 'Content-Type: application/json'</code></pre>

Review the in-product API documentation via the Quine web interface for additional code snippets.

Conclusion</h2>

In this blog post, we looked at the different types of standing queries that you can create in Quine. A standing query is a powerful tool for data processing because it allows you to express your business logic as part of your data pipeline. We also looked at how you can modify an existing standing query output type and propagate a new standing query across the graph. Quine is open source if you want to explore standing queries for yourself using your own data. Download a precompiled version or build it yourself from the Quine Github</a> codebase. Have a question, suggestion, or improvement? I welcome your feedback! Please drop into Quine Slack</a> and let me know. I'm always happy to discuss Quine or answer questions.

Quine Streaming Graph 1.3.0: Focus on Usability, Query Performance

2022-07-06T00:00:00+00:00

Performant Pagination at Scale, Improved Querying and User Docs, Advanced Recipes</h2> It is hard to believe we released Quine 1.2.0 only six weeks ago, especially when I look at the work that has gone into not just Quine but also documentation, how-to blogs and example recipes.
Indeed, 1.3.0 cements a pattern that has emerged since we released Quine as an open source project in February: releases made up of a few features needed to achieve performance at scale and loads of smaller usability improvements. In this release we vastly improved pagination performance inside our Cypher compiler, overhauled the API documentation, made journals the default when running recipes, improved Cypher query support, and made a number of small but consequential changes to the system’s logging behavior. In addition, we’ve migrated the documentation to its own site to make it easier to make community contributions and keep docs in sync with releases, added three new recipes, and made substantial updates to one of the favorites. The common theme throughout: usability and performance. Pagination in Quine Streaming Graph</h2> As part of our work to make all aspects of the system perform predictably and well at throughput rates of hundreds of thousands or even millions of events per second, we have undertaken some plumbing upgrades. To give you an idea of the engineering involved, check out this blog post</a> about the three most common pagination approaches (or I can save you time and tell you it explains page, point, and keyset-based pagination). We combined aspects of all three in our approach. The other notable work focused on usability and community enablement. Quine and API documentation plus Improved Usability</h2> We switched to the Stoplight Elements framework to make API documentation</a> easier to access and migrated from quine.io/docs to docs.quine.io. These are not huge changes in themselves, but together they ensure docs are more accessible for the community to modify and never lag releases. Ingest Streams from Kafka and other Sources</h3> We also completed five blog posts on ingesting streams, ranging from simple CSV files to internet feeds to Kafka integration.
Real-time Graph Analytics for Kafka Streams with Quine</a></li> Building a Quine Streaming Graph: Ingest Streams</a></li> Ingesting data from the Internet into Quine Streaming Graph</a></li> Ingesting From Multiple Data Sources into Quine Streaming Graph</a></li> Ingest and Analyze Log Files Using Streaming Graph</a></li> </ul> Live event stream, log, and network observability recipes</h3> And of course when we wrote an explainer on ingesting and processing log files, we couldn’t resist a recipe that uses Quine logs as the source. We all know that consuming, parsing, and visualizing Java log output is a huge challenge, one that lacks a widely available solution. We think Quine might be an answer. Use the Quine Log Recipe</a> as a baseline, then modify the regular expression inside the ingest stream Cypher query to fit your logs. Quine processing its own logs. In addition to the Quine Java log ingest recipe, we’ve created a recipe showing how to ingest and build a streaming graph from a feed of IMDB</a> movie data. (For anyone really interested in log processing, there’s also an Apache web logs analytics</a> recipe.) Rounding out the trio of new recipes is a fun one: Ethan’s Pi Day recipe</a>, which uses Quine to calculate Pi using Leibniz’s formula. On the topic of observability and root cause analysis, the CDN Cache Efficiency</a> recipe got a major update: Moved shaping the graph from standing queries into the ingest stream.</li> Updated code to reflect Cypher best practices.</li> Added quick queries to perform efficiency calculations.</li> Optimized the manifestation of nodes.</li> Added client device nodes.</li> Increased the data sample size.</li> </ul> Quine Synthetic Data Generator</h3> With Quine v1.3.0 we also introduced a powerful series of built-in synthetic data Cypher functions. The synthetic data functions can be used within ingest streams to create booleans, bytes, floats, integers, strings, or nodes.
This allows you to generate streaming synthetic data that can be used for testing or development purposes. Search for gen.</code> to check out how to use the functions on the Cypher Functions</a> page of docs.quine.io</a>. Next Up</h2> Quine is open source if you want to explore standing queries for yourself using your own data. Download a precompiled version</a> or build it yourself from the Quine Github</a> codebase. Have a question, suggestion, or improvement? I welcome your feedback! Please drop into Quine Slack</a> and let me know. I'm always happy to discuss Quine or answer questions.

Release Notes:

Release Quine 1.3.0

Features:
- Added a pagination (SKIP/LIMIT) optimizer to the Cypher query engine for historical queries with no unaliased values (#1822)
- Enabled journals by default when running a recipe (#1814)
- Added support for using the Stoplight Elements interactive documentation behind an authentication proxy (#1781)

Bugfixes:
- Fixed an issue where waking up a node would not correctly re-register its standing queries, potentially resulting in dropped results (#1830)
- Fixed an issue where Cypher subqueries could be executed with too many variables in scope (#1821)
- Fixed an issue where some Cypher constructs (notably: variable-length relationship patterns) could be executed with too many variables in scope (#1821)
- Fixed a documentation rendering issue for Standing Query Outputs (#1815)
- Renamed the metric "persistors.snapshot-sizes" to "persistor.snapshot-sizes" for consistency (#1788)
- Fixed the behavior of DISTINCT during Cypher query execution, making it work correctly with SKIP and/or LIMIT (#1777)

Misc:
- Simplified startup log messages (#1831)
- Updated some error messages to use the correct name for DistinctId Standing Queries (#1796)
- Improved UX for API-issued historical queries near the present time (#1786, #1789)
- Removed logback-config logging library: to configure logging, use standard logback.xml (#1754)
- Added timestamps to node journal events in debug.node and node debug APIs (#1741)
- Removed StandingQueryPattern.Graph API (#1795)
- Improved distribution of randomly-generated partitioned IDs (#1801)
- Documented metrics endpoint in openapi specification (#1792)
- Added peephole optimization for property value comparison (#1783)
- Refactored to simplify DomainGraphBranch representation (#1771)

Updates:
- rocksdbjni to 7.3.1 (#1825)
- msgpack-core to 0.9.2 (#1824)
- cats-core to 2.8.0 (#1826)
- metrics to 4.2.10 (#1823)
- scala-library to 2.12.16
- sbt-paradox to 0.10.2 (#1809)
- sbt-scalafix to 0.10.1 (#1808)
- scala-java-time to 2.4.0 (#1798)

==== Quine Enterprise Additions ====

Release Quine Enterprise 1.3.0

Misc:
- Removed hydrolix persistor (#1739)

Updates:
- proguard-base to 7.2.2
- scala-logging to 3.9.5 (#1776)
- classgraph to 4.8.147 (#1784)

==== Quine.io / docs.thatdot.com: Probably not in release notes ====
* 1471b8201 Fixed typo in Kinesis section (#1829)
* e216d1475 Resolve left nav issue on docs page (#1819)
* 60fc048b5 updated the social link to a community invite (#1816)
* 72fe04a2d Added 3d data tutorial (#1806)
* 01061a434 initial quine log recipe commit (#1802)
* 96dba566b Added the movieData recipe. (#1787)
* 6330c807f (query-manager-fiddling) 1.2-docs-bugFix (#1758)
* b98b791d0 Refactor site to use - instead of _ in urls (#1772)</code></pre>

Understanding the Scale Limitations of Graph Databases

2022-07-05T00:00:00+00:00

A New Kind of Database: Using Graph Models to Unlock Categorical Data</h2> Graph databases and models have been around for well over a decade, and are among the most impactful technologies to emerge from the NoSQL</a> movement. Graph data models are natively designed to focus on the relationships within and between data</a>, representing this data as nodes connected by edges. As such, the graph model is strikingly similar to the way humans often think and talk.
The node-edge-node pattern in a graph corresponds directly to the subject-predicate-object pattern common to languages like English. So, if you’ve ever used mind-mapping technology or diagrammed ideas on a whiteboard, you’ve created a graph. A critical advantage of graph databases is their ability to express relationships between categorical data</a> (any non-numerical value e.g. email addresses, colors, models of cars, or geographic locations). This is not possible otherwise without using encoding methods that destroy much of the value of this data, and explains why most categorical data (and 73% of all data</a>) is simply ignored by enterprises. Graph databases allow you to explore the relationships between data types. Graph data models have become part of the standard toolkit for data scientists applying artificial intelligence</a> (AI) to everything from fraud detection and manufacturing control systems to recommendation engines and customer 360s. Given this broad applicability, it’s no surprise Gartner believes that graph database technologies will be used</a> in more than 80% of data and analytics innovations, including real-time event streaming, by 2025. But as adoption accelerates</a>, limitations and challenges are emerging. And one of the most significant limitations graph databases face is their inability to scale. Volume and Velocity of Modern Data Generation</h2> Much has changed since the emergence of the most recent generation of graph databases from a decade ago. Enterprises are dealing with previously unimaginable volumes of data to potentially query. That data enters and streams through the enterprise in a variety of channels, and enterprises want action on that information in real time. Original graph designs couldn’t have imagined today’s sheer volume of data or the computation power needed to put that data to work. And it’s not just the volume of data dragging graph databases down. It’s the velocity of that data. 
Graph databases are great at allowing you to make connections, but they don't scale. While graph databases can excel at computation on moderately-sized sets of data at rest, they get especially siloed</a> and suffer significant tradeoffs when real-time actions on streaming data</a> are desired. Streaming is actively moving data; it constantly arrives from diverse sources. And enterprises want to act upon it immediately in event-processing pipelines because when certain events are not caught quickly, as they happen, the opportunity to act disappears. For example, security incidents, transaction processing (such as fraud or credit validations), and automated</a> machine-to-machine actions. Anomalies and patterns need to be recognized with AI and ML algorithms that can automate (or at least escalate) an action. And that recognition needs to occur before an automated action can proceed. Graph databases were simply never built for this scenario. They are typically restricted to hundreds or thousands of events per second. But today’s enterprises need to be able to process a velocity of millions of events per second and, in some advanced use cases, tens of millions. There’s a hard limit both on how quickly graph systems can process data and on how much complexity (like how many hops in the query) they can handle. Because of those limits, graph systems often don’t get used. Since graph systems don’t get used, data engineering teams have no option other than to recreate the graph database-like functionality spread throughout their microservices</a> architecture. The Rise of Custom Data Pipeline Development</h2> These workarounds to query the event streams in real time require significant effort. Developers typically turn to event stream processing systems like Flink and ksqlDB, which make it possible, but not easy, to use familiar SQL</a> query syntax to query the event streams. 
It’s not uncommon for enterprises to have teams of data engineers developing extensive and complex micro service architectures for months or years to get up to the scale and speed needs of streaming data. However, these systems tend to lack the expressive query structures needed to find complex patterns in streams efficiently. Event stream processing systems like Apache Kafka and Kinesis created a new, event-driven architecture. As noted, to operate at the volume and velocity that enterprises require, these systems have had to make tough tradeoffs that lead to significant limitations. For example, time windows can restrict a system’s ability to connect events that do not arrive within a narrow time interval (often measured in seconds or minutes). This means that rather than providing some critical insight or business value, an event is instead simply ignored if it arrives even seconds too late. Even with costly limitations like time windows, event stream processing systems have been successful. Many can even scale to process millions of events per second—but with significant effort and limitations that fail to deliver the full power of graph data models. Quine Streaming Graph Was Created to Meet Demand</h2> The demand for insights from instant event data streams and the value they deliver has never been higher. As adoption accelerates, businesses should expect to see new data infrastructure emerge to eliminate many of the scale struggles that can hold back the power of graph database models. That's why we created Quine streaming graph. Quine solves the problem of scalable graph databases that can process millions of events per second</a>. Quine’s unique approach combines graph data and streaming technologies into a modern, developer-friendly open source software package. For the first time, teams can process categorical data in real time without resorting to encoding methods. 
Developers and data pipeline engineers use Quine to rapidly build high volume, real-time, complex event processing workflows at scale, especially if they are using Kafka</a> or Kinesis</a>. A handful of Quine queries can replace months of development time and millions in costs, eliminating batch processing, multi-level joins, time windows, and other time-consuming and outdated processes that drag down and stall analysis on streaming data. Next Steps</h3> And if you want to try Quine yourself, you can download</a> it here. To get started, try the Ethereum Blockchain Fraud Detection, Wikipedia Ingest</a> or Apache Log Analytics</a> recipes for different use cases for streaming graph. If you have questions or want to check out the community, join Quine slack</a> or visit our Github</a> page. ‍ Note: A version of this post was previously published in eWeek on May 26th, 2022. Photo Credits:</h4> Header image: by JJ Ying</a> on Unsplash</a> Photo 1: by Alina Grubnyak</a> on Unsplash</a> Photo 2: by John Barkiple</a> on Unsplash</a> Photo 3: by israel palacio</a> on Unsplash</a> thatDot appreciates the work of these artists and the fact they've made their excellent work available for use. Network Log Analysis Using Categorical Anomaly Detection 2022-06-24T00:00:00+00:00 The distributed nature of modern virtualized software architectures has created added complexity in the networking stack, making it difficult to attribute behavior to any single service. Instrumenting services will give you insight into activity within the service, but doesn’t provide the entire picture. What’s missing is insight into the communication behaviors that happen between two logical hosts. In an attempt to better expose this area I found a dataset containing over 200m network connection summary records from the open source Zeek</a> network monitoring service. 
Each Zeek log contains a number of fields, including the originating host and the responding host, along with summary fields for connection state and connection history. A record converted to CSV looks like this:

1331902125.080000, CIp1er3EKU2WUebCDe, 192.168.202.94, 52307, 192.168.23.100, 445, tcp, -, 10.550000, 4803, 3174, SF, -, 0, ShADdaFf, 32, 6475, 27, 4590, (empty)

The metrics available in those records help inform standard monitors such as bandwidth (bytes received, bytes sent). Analysis of only the available metrics, however, ignores significant information encoded into the categorical elements of the log. This includes the hosts’ IP addresses and the summary abbreviations for connection state (SF) and connection history (ShADdaFf). For connection state, the entire field maps to a description. For connection history, each character maps to a different activity within the TCP lifecycle. Capital letters indicate originating server requests and lowercase letters indicate responding server responses. Using thatDot Novelty Detector’s data transformation API, I was able to build a simple function to manipulate the raw logs into something more useful. The function is responsible for: Mapping abbreviations to their corresponding definitions for easier understanding.</li> Separating the activity for sending and receiving hosts.</li> Creating the ordered data observation for submission to the API.</li> </ul> This function was then stored as a transformation that could be applied to all incoming data. Data Transformation Map</h2> With the transformation in place, I was able to ingest the records and build a tree to visualize the connection history, ultimately giving us insight into a general fingerprint of conversation behavior. Once the system has recognized the fingerprint, it will begin to highlight connection paths that have deviated from normal behavior.
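The three responsibilities of the transformation function can be sketched as follows. This is an illustrative stand-in written in Python, not the function stored in Novelty Detector: the transformation API itself is not shown in this post, the lookup tables cover only a handful of Zeek codes, and the order of values in the resulting observation is an assumption. All names (CONN_FIELDS, CONN_STATE, HISTORY, transform) are hypothetical.

```python
# Field order of a Zeek conn.log record, matching the 20-column sample above.
CONN_FIELDS = [
    "ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
    "proto", "service", "duration", "orig_bytes", "resp_bytes",
    "conn_state", "local_orig", "missed_bytes", "history",
    "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes",
    "tunnel_parents",
]

# Partial lookup tables (assumption: only a few common codes are listed).
CONN_STATE = {
    "SF": "normal establishment and termination",
    "S0": "connection attempt seen, no reply",
    "REJ": "connection attempt rejected",
}
HISTORY = {"s": "SYN", "h": "SYN-ACK", "a": "ACK",
           "d": "data", "f": "FIN", "r": "RST"}


def transform(line):
    """Turn one CSV conn record into an ordered list of categorical values."""
    values = (value.strip() for value in line.split(","))
    record = dict(zip(CONN_FIELDS, values))
    history = record["history"]
    # Uppercase letters come from the originator, lowercase from the responder.
    # Characters that are neither (such as "^") are dropped in this sketch.
    originator = [HISTORY.get(ch.lower(), ch) for ch in history if ch.isupper()]
    responder = [HISTORY.get(ch, ch) for ch in history if ch.islower()]
    return [
        record["id.orig_h"],
        record["id.resp_h"],
        record["proto"],
        CONN_STATE.get(record["conn_state"], record["conn_state"]),
        "orig:" + ">".join(originator),
        "resp:" + ">".join(responder),
    ]


sample = ("1331902125.080000, CIp1er3EKU2WUebCDe, 192.168.202.94, 52307, "
          "192.168.23.100, 445, tcp, -, 10.550000, 4803, 3174, SF, -, 0, "
          "ShADdaFf, 32, 6475, 27, 4590, (empty)")
observation = transform(sample)
print(observation)
```

For the sample record, the originator side decodes to SYN, ACK, data, FIN and the responder side to SYN-ACK, data, ACK, FIN. Keeping the two directions as separate values is one way to let each side of the conversation contribute its own branch of the tree.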
Visualization Of Communication Patterns</h2> The principal reason for using thatDot’s Novelty Detector for this analysis, however, is to surface the “novel” data from amongst the volumes of “normal” data. This sampled scatter plot does a nice job of identifying the highly novel network conversations. The items highest on the X axis are the most Novel observations, which may or may not also be Unique in the data. It is always interesting to see when Unique data, shown via the coloring, is NOT Novel. Differentiating such “false-positive” events is a significant benefit of including categorical data in our analysis. Example Observation Detail Visualization</h2> From this scatter plot we click through to one of the observations with a high novelty score, which leads us to the tree below, showing us that completing a handshake connection is abnormal for these two hosts. It is much more typical for these connections to time out. Observation Detail Visualization</h2> This same mechanism is useful for a range of use cases: Real-time DDoS detection, such as TCP half-open</a> (SYN flood) attacks.</li> Public-private host communications: determining which hosts are trying to connect and why (protocol, port, etc.).</li> New protocol use between known hosts.</li> New hosts successfully communicating with known hosts.</li> </ul> In summary, this turns out to be a useful tool for enriching existing telemetry data to aid in discovery, remediation and automation. thatDot Novelty Detector</h2> thatDot Novelty Detector is the first general-use application designed for finding anomalies in real time in data sets that include categorical data. Available as an application for deployment in any cloud or data center, thatDot Novelty Detector exposes an API that scores submitted observations for their “novelty”, enabling real-time anomaly detection with fewer false positives than traditional threshold-based metric analysis.
Reducing False Positive Alerts With Contextual Anomaly Detection 2022-06-24T00:00:00+00:00 Too many false positives!</h3> Traditionally, monitoring alerts are produced by comparing metrics against thresholds to identify behavior outside the norm. This approach of metrics-based alert definitions often generates too many false positives that lead to wasted human time and effort or, worse yet, loss of confidence and ignoring alerts as general practice! Efforts to improve alert quality typically lead to devising more granular alerts. This approach leads to improved alerting for specific conditions, but introduces significant complexity in alert definitions and their associated maintenance as dimensionality increases. Machine learning approaches often crumble under the same “curse of dimensionality” that humans feel: when looking at hundreds of alerts, no person or machine can find the true anomalies. Dynamic threshold definitions that accommodate historically observed trends such as time-of-day or seasonal variations are helpful, but still limit us to looking for the problems we know to expect. What we all want are high-confidence alerts that identify truly anomalous events as they occur in real time, from a system that learns and adapts to our data immediately. A New Approach: Use Categorical Data Categorical data is composed of the strings of information included in our logs and events: file names, IP addresses, HTTP status codes, geographical information, etc. Including categorical data in our monitoring analysis provides a greatly expanded context from which to evaluate application and network performance logs. As much as 80% of the information in our logs and events is categorical data. Why not include it in our monitoring? Doing so lets us reduce the false positives that often overwhelm the people monitoring these systems, and also lets us explain WHY an alert was generated.
Not Everything New Is Anomalous The additional context gained by incorporating categorical dimensions of data provides a significant benefit in rapidly identifying unique data, identified as having high “surprise” value in our system, as well as recognizing anomalous data as separate from unique values. thatDot Novelty Detector learns a fingerprint for the data it observes, so that it can tell when “new” is actually just “normal”. High cardinality is a normally expected condition of many data types. User agents, IP addresses, and file names are all examples of data that can have many values. Shown below are two examples that illustrate the value of context for differentiating unique vs anomalous data. The above example shows the identification of a highly unique observation in a CDN log monitoring data set. The scatter plot of the data uses color to indicate the “surprise” or uniqueness of each observation, while the left-hand scale of the scatter plot indicates thatDot’s anomaly score for each observation. The tree to the right is from thatDot Exploration UI and shows the context of the observation. It has both high Surprise and Anomaly scores, being the first observation of the FUJIFILM ISP out of 800,294 observations. In this second example we see an observation in the scatter plot that is yellow, indicating high “surprise” or uniqueness, but this observation receives a low anomaly score from thatDot. thatDot’s Exploration UI tree shows that observing a unique Server IP value under the Spectrum ISP is not anomalous, despite this IP being seen for the first time, as the context of previous data has taught the system that new client IP values are a usual occurrence for the Spectrum ISP. Alerts With Fewer False Positives Utilizing the additional context provided by including categorical data in our anomaly detection can significantly improve the quality of our alerting.
When we have high confidence in our ability to separate the real signal from the noise, users save the time they historically spent chasing false positives, and they get back time to build more automation into their support processes.

thatDot Novelty

thatDot Novelty is the first general-use application designed for finding anomalies in real time in data sets that include categorical data. Available as an application for deployment in any cloud or data center, thatDot Novelty exposes an API that scores submitted observations for their “novelty”, enabling real-time anomaly detection with fewer false positives than traditional threshold-based metric analysis.

Read more about Novelty and access the Novelty free trial here.

Where Quine Streaming Graph Fits In Kafka-Based Data Pipelines

2022-06-22T00:00:00+00:00

The Answer to the Common Question: What is Quine or Streaming Graph?

“Quine is a real-time streaming graph that fits perfectly between two Kafka instances.”

This is the most common answer I give whenever a data engineer asks “What is Quine?”

As an answer, it works remarkably well. The reason it works is simple: everyone knows what Kafka does, even if they don’t run it themselves in production (which is rare).

It is also a heck of a lot pithier than:

“Quine is an open source stream processing application with a graph data model designed to ingest high volumes of event data from sources like Kafka or Kinesis and process them in real time using Cypher or Gremlin. The results of those queries can then be used to update the graph itself, can be stored in another database or data warehouse, or can be output back into the Kafka or Kinesis-based data pipeline. thatDot Streaming Graph is the commercial, distributed, high-scale version.”

While accurate, this isn’t exactly a conversation starter like the shorter description.
Inline Graph Analytics

One of the first “a-ha’s” when we talk to data engineers operating real-time data pipelines is that while Quine shares much in common with graph databases (data is represented as nodes and edges, nodes have properties, and you can query it using the two most common graph query languages), it is radically different in one specific way: it runs inline with your stream, becoming another part of the data pipeline.

Unlike graph databases, which are static stores accumulating data and are therefore essentially an off-ramp from the data stream, Quine doesn’t divert the flow of data through the system. Graph databases cannot process data inline in real time.

This is not meant to be a slight on graph databases. They just weren’t built from the ground up to exist inline with Kafka-driven data streams.

Quine runs inline with the data flow to process data into a real-time graph.

Ingesting Kafka data to build a Real-time Streaming Graph

As the diagram above implies, Quine ingests data from Kafka and turns it into a dynamic streaming graph. In the fifth installment of his Ingesting Data into Quine blog series, Michael Aglietti covers the how’s and why’s in detail so I won’t delve too much deeper here. I will make a few points:

1. Streaming Graph scales with Kafka – Quine is designed to process streaming data and turn it into a graph without slowing down the flow of data through the system. A single node of Quine can ingest and process thousands of events per second hosted on a commodity server. A thatDot Streaming Graph cluster can process millions of events per second with tens of thousands of simultaneous queries.
2. Quine takes advantage of Kafka’s ability to regulate the stream – as anyone who has operated a production system knows, things don’t always go smoothly. Perhaps a host in a Streaming Graph cluster fails and throughput slows as a hot spare comes online.
In that case, Streaming Graph counts on Kafka’s ability to handle back pressure. But that’s not all. Streaming Graph itself is also back-pressured. If Quine or Streaming Graph is busy with a resource-intensive task downstream, or possibly waiting for the durable storage to finish processing, it will back pressure the ingest stream so that it does not overwhelm other components.

Inline means Not Just Ingest but Output

If Quine were just a highly write-optimized graph data processor, it would be pretty remarkable. But inline means keeping the data flowing through the pipeline. It means Quine is not just a sink in Kafka terms but a high-velocity source. And this is where Quine is truly unique.

If ingest streams represent the sink side of Quine, standing queries turn Quine into a source. The way standing queries work is that they persist at all times on all nodes in the graph, accumulating partial matches as data flows through and triggering an action when a complete match is made. Think of them as a net you stretch across the data stream that is designed to catch only specific data patterns.

Once a match is made, the standing query triggers an action which can include executing an arbitrary piece of code, updating the graph itself, writing the results out to a database, or publishing data right back out to a Kafka topic. And it can do this all with sub-millisecond latency.

It’s at this point in a call that the record scratch sound effect interrupts the conversation and one of the engineers on the call is like, “Hold up… I don’t believe you.”

(If “Quine lives between two Kafka streams” is what we repeat most often on calls, “I don’t believe you” is what engineers we are talking to most often say.)

Standing queries turn the whole idea of querying a database on its head. They are far more equivalent to the continuous queries of an event stream processor. They work because Quine is built on an asynchronous actor model.
That is, every node in the graph also has an actor associated with it capable of performing discrete compute tasks and sending messages to other nodes. The Quine technical white paper digs into this all in depth if you are interested.

What is important about standing queries is they allow Quine and Streaming Graph to not just ingest high volumes of data but process the data and then send it out to continue its journey through the data pipeline. No off-ramps. No slow downs.

When real-time really does mean real-time

By way of conclusion, let’s revisit the statement that kicked off this post:

“Quine is a real-time streaming graph that lives between two Kafka instances.”

Graph analysis is incredibly powerful, especially when it comes to maximizing the value of categorical data. Graphs allow you to express relationships between objects in a direct and natural way that is both human readable and performant. Use cases like XDR, financial fraud detection, authentication attacks, insider trading prevention, or network observability and root cause analysis, would all benefit tremendously if they could apply a graph model to their data.

So why don’t they? The single biggest reason – and another thing we hear on calls all the time – is that graph databases can’t process the data fast enough. People end up batch processing data, which is the opposite of real time.

Quine is real-time graph processing that sits inline with your Kafka-based data pipeline and detects complex patterns the instant they emerge. Drop Quine in between two Kafka instances and you will discover a whole new dimension to your data.

And if you don’t believe me, that’s okay. We’re used to it.

Next Steps and Further Reading

thatDot Streaming Graph is the commercial, distributed cluster scale version. Try it out in the Free Trial.

Quine is open source if you want to try it for yourself. Download a precompiled version or build it yourself from the codebase Quine Github.
Or drop into the Quine Discord Community. We're always happy to discuss Quine or answer questions. And if you have a question, suggestion, or improvement, Contact Us.

And if you’re interested in learning more about building a streaming graph from various ingest sources, check out previous installments in this blog series:

- Real-time Graph Analytics for Kafka Streams with Quine
- Building a Quine Streaming Graph: Ingest Streams
- Ingesting data from the internet into Quine Streaming Graph
- Ingesting From Multiple Data Sources into Quine Streaming Graph
- Ingest and Analyze Log Files Using Streaming Graph

Ingest How-To: Real-time Graph Analytics for Kafka Streams with Quine

2022-06-20T00:00:00+00:00

Quine adds Real-time ETL for Kafka-based Event Streams

Kafka is the tool of choice for data engineers when building streaming data pipelines. Adding Quine into a Kafka-centric data pipeline is the perfect way to introduce streaming analytics to the mix. Adding business logic directly into an event pipeline allows you to process high-value insights in real time.

Quine also allows you to add processing of categorical data, which makes up a vast majority of the data your business generates, yet is often overlooked or discarded.

Simple Streaming Pipeline for ETL

Consider this straightforward, minimum viable streaming pipeline.

A simple streaming pipeline with Quine ingesting Kafka streaming data

In this simple pipeline, Vector will produce events (`dummy_log` lines) once a second and stream them into a Kafka topic (`demo-logs`), where an ingest stream from Quine will transform the log events into a streaming graph.

Setting up Vector

Start by installing Vector in your environment. My examples use macOS and may need slight modifications to work correctly in your environment.
I installed Vector with `brew install vector`, which includes a sample `vector.toml` config in `/opt/homebrew/etc/vector`. I extended the sample Vector config to build our pipeline.

Run Vector to get a feel for the events that Vector emits.

    ❯ vector -c /opt/homebrew/etc/vector/vector.toml

Vector generates dummy log lines from a built-in demo_logs source. The log lines are transformed in Vector using parse_syslog and emitted as a JSON object.

    {
      "appname": "Karimmove",
      "facility": "lpr",
      "hostname": "some.com",
      "message": "Take a breath, let it go, walk away",
      "msgid": "ID416",
      "procid": 9207,
      "severity": "debug",
      "timestamp": "2022-06-14T15:34:11.936Z",
      "version": 2
    }

Once Vector is emitting log entries, we need to connect that output to Kafka by adding a Kafka sink element to the vector.toml file.

    # Stream parsed logs to kafka
    [sinks.to_kafka]
    type = "kafka"
    inputs = [ "parse_logs" ]
    bootstrap_servers = "127.0.0.1:9092"
    key_field = "quine"
    topic = "demo-logs"
    encoding = "json"
    compression = "none"

Local Kafka Instance to use with Quine

Kafka is the next step in the pipeline. I set up a single-node Kafka cluster in Docker. There are more than enough examples on the internet of how to set up a Kafka cluster in Docker, so please set up the cluster in a way that fits your environment. My cluster uses a docker-compose file that launches version 7.1.1 of the Zookeeper and Kafka containers.

Start the Kafka cluster and create a topic called demo-logs.

Note: I had to run the docker compose up command a couple of times before both the Zookeeper and Kafka containers launched cleanly. Make sure the containers fully load at least once before including the `-d` option to run them in detached mode.
    ❯ docker compose up -d
    ❯ docker exec kafka kafka-topics --bootstrap-server kafka:9092 --create --topic demo-logs

Use kcat to verify the Kafka cluster is up and that the `demo-logs` topic was configured.

Quine Config

Ok, let's get Quine configured and ready to receive the log events from Kafka via an ingest stream. We can start with a simple ingest stream that takes each demo log line and creates a node.

    ingestStreams:
      - type: KafkaIngest
        topics:
          - demo-logs
        bootstrapServers: localhost:9092
        format:
          type: CypherJson
          query: |-
            MATCH (n)
            WHERE id(n) = idFrom($that)
            SET n.line = $that

Launch the Pipeline

Let's launch Vector and Quine to get the pipeline moving.

Launch Vector using the modified vector.toml configuration.

    ❯ vector -c vector.toml

Launch Quine by running the Kafka Pipeline recipe.

    ❯ java -jar quine-x.x.x.jar -r kafka_pipeline.yaml

And verify that we see nodes generated in Quine.

    Quine app web server available at http://0.0.0.0:8080
    | => INGEST-1 status is running and ingested 18

Congratulations! 🎉 Your pipeline is operating!

Improving the Ingest Query

The ingest query that I started with is pretty basic. Using `CALL recentNodes(1)`, let's take a look at the newest node in the graph and see what the query produced.

    ❯ ## Get Latest Node
    curl -s -X "POST" "http://0.0.0.0:8080/api/v1/query/cypher" \
      -H 'Content-Type: text/plain' \
      -d "CALL recentNodes(1)" \
      | jq '.'
    {
      "columns": [
        "node"
      ],
      "results": [
        [
          {
            "id": "9fde7ef4-c5ec-35f1-ae5f-619bd9ab7d5c",
            "labels": [],
            "properties": {
              "line": {
                "appname": "benefritz",
                "facility": "uucp",
                "hostname": "make.de",
                "message": "#hugops to everyone who has to deal with this",
                "msgid": "ID873",
                "procid": 871,
                "severity": "emerg",
                "timestamp": "2022-06-14T19:58:16.463Z",
                "version": 1
              }
            }
          }
        ]
      ]
    }

The ingest query creates nodes using `idFrom()`, populates them with the properties that it received from Kafka, and doesn't create any relationships. We can make this node more useful by giving it a label and removing parameters that are not interesting to us. Additionally, using `reify.time()`, I can associate the node with a `timeNode` to stitch together events that occur across the network in time.

Analyzing the sample data

Quine has a web-based graph explorer that really comes to life once you have a handle on the shape of the streaming data. But I am starting from the beginning with a bare-bones recipe. When I start pulling apart a stream of data, I find that using the API to ask a few analytical questions serves me well.

I'll use the `/query/cypher` endpoint to get a feel for the shape of the sample data streaming from Kafka. I don't recommend doing a full node scan on a mature streaming graph, but my streaming graph is still young and small.

Using my REST API client of choice, I POST a Cypher query that returns the metrics (counts) for parameters that are interesting.

That's a lot of JSON results to review; let's take this over to a Jupyter Notebook to continue the analysis. My REST API client includes a Python snippet tool that makes it really easy to move directly into code without having to start from scratch. In Jupyter, within a few cells, I had the JSON response data loaded into a Pandas DataFrame and an easy-to-review textual visualization of what the sample data contains.
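The tallying step described above can be approximated without a notebook. What follows is a minimal sketch, assuming a response shaped like the `recentNodes` output shown earlier (a `columns` list plus a `results` list of rows, with each node carrying its log fields under `properties.line`). The helper name `tally_fields` and the sample payload are illustrative assumptions, not part of Quine or of the original notebook, and the sketch uses `collections.Counter` from the standard library in place of Pandas.

```python
import json
from collections import Counter


def tally_fields(response, fields):
    """Count how often each value appears for the given log-line fields.

    `response` is a dict shaped like the /query/cypher output shown earlier:
    {"columns": ["node"], "results": [[{"properties": {"line": {...}}}], ...]}
    """
    counts = {field: Counter() for field in fields}
    for row in response.get("results", []):
        for node in row:
            if not isinstance(node, dict):
                continue  # skip scalar columns; only node objects carry properties
            line = node.get("properties", {}).get("line", {})
            for field in fields:
                if field in line:
                    counts[field][line[field]] += 1
    return counts


# Hypothetical payload standing in for a real API response.
sample = json.loads("""
{
  "columns": ["node"],
  "results": [
    [{"id": "a", "labels": [], "properties": {"line": {"facility": "uucp", "severity": "emerg"}}}],
    [{"id": "b", "labels": [], "properties": {"line": {"facility": "lpr", "severity": "debug"}}}],
    [{"id": "c", "labels": [], "properties": {"line": {"facility": "lpr", "severity": "emerg"}}}]
  ]
}
""")

if __name__ == "__main__":
    for field, counter in tally_fields(sample, ["facility", "severity"]).items():
        print(field, dict(counter))
```

Counting in the client like this is only reasonable while the graph is small, for the same reason a full node scan is discouraged on a mature streaming graph.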
I let the pipeline run while I developed simple visualizations of the metrics. Right away, I could see that the sample data Vector produces is random and uniformly distributed across all of the parameters in the graph. And after 15000 log lines, the sample generation exhausted all permutations of the data.

Conclusions and Next Steps

I learned a lot about streaming data while setting up this pipeline. Vector is a great tool that allows you to stream log files into Kafka for analysis. Add a Quine instance on the other side of Kafka, and you are able to perform streaming analytics inside a streaming graph using standing queries.

- Use the same workflow to develop an understanding of streaming data that you do for data at rest
- Perform streaming analysis by connecting Quine to your Kafka cluster
- Use Cypher ingest queries to form the graph within a Quine ingest stream.

Quine is open source if you want to run this analysis for yourself. Download a precompiled version or build it yourself from the codebase Quine Github. I published the recipe that I developed at `https://quine.io/recipes`.

Have a question, suggestion, or improvement? I welcome your feedback! Please drop into Quine Slack and let me know. I'm always happy to discuss Quine or answer questions.
Further Reading

And if you're interested in learning more about building a streaming graph from various ingest sources, check out previous installments in this blog series:

- Building a Quine Streaming Graph: Ingest Streams
- Ingesting data from the internet into Quine Streaming Graph
- Ingesting From Multiple Data Sources into Quine Streaming Graph
- Ingest and Analyze Log Files Using Streaming Graph

Modernizing ETL For Cloud

2022-06-13T00:00:00+00:00

Quine Streaming Graph: A New Approach to ETL for Cloud

Cloud architectures enable and encourage a new level of integration with third-party systems and data sources to deliver the enriched and personalized services our users and customers are looking for. Today’s data-driven services place significant new demands on our data pipelines in terms of scale, agility, and flexibility.

Recent data pipeline evolution has focused on improving the efficiency of existing ingestion workflows, but what we really need is to rethink the objective of data pipelines and let the needed form follow. If our purpose is to drive event-driven architectures, train AI algorithms, and filter big data for valuable data, then the real objective of a modern data pipeline is to assemble, distill, and publish only the most relevant data needed to better inform and monitor our software infrastructure.

We don’t want big lakes of data; we want small streams of high-value data.

Identifying, processing and packaging high-value data requires a lot from our data pipelines.
Unlike ETL of the past, which operated within a limited and deterministic scope between a source and a sink, the cloud-era requires a much broader set of functions:

- Rapid adoption of an ever-growing set of unstandardized data sources
- Accommodating a range of ingestion methods, including files, APIs, and webhooks
- Ingestion of data from a global footprint of partners and sources
- Producing custom formats of data for consumption by our applications
- Real-time processing of data
- Simple management of ongoing ingestion and publication changes
- Ease of use in support of a widening audience of less technical data users

This is a big ask.

The recent trend towards Data Lakes, Data Warehouses, Data Lake Houses, etc. has solved for some inefficiencies in data pipeline processing by concentrating data operations to avoid duplication of effort and data storage. These solutions, however, do not remove the complexity of downstream processing that is needed to make our data more valuable in terms of timeliness, relevance or insight. Data lakes push data pipeline complexity “underwater”; they do not eliminate it.

Data lakes move data pipeline operational complexity “underwater”

Newer, real-time ETL solutions such as Apache Kafka combined with thatDot's open source streaming graph Quine, however, promise a more “cloud-centric” approach to data pipeline engineering. These solutions combine multi-modal distributed data ingestion with real-time data transformation and computation, in the data ingestion process itself. The ability to operate on data as it is ingested provides significantly more efficient and simplified data operations, while expanding the range of functions available.

This approach of adding computation to data ingestion also brings a significant advantage in terms of distilling value from our data, turning big data into smart data, before it gets to our applications!
This can be especially useful in use cases such as feeding data into ML/AI solutions or for reducing data volume passed to downstream applications.

Embedding compute with ingestion is efficient and delivers real-time ETL

The tight integration of data operations capabilities directly with data streams ingestion delivers the wide range of capabilities needed to deliver on our “modern data pipeline” requirements.

- Efficient – a single system to orchestrate global data ingestion, transformation and publication of data in real time, operated using common tooling and methodologies
- “Cloud” Data Ingestion – graceful accommodation of API and webhook integrations, distributed data ingestion, per-source configurable ingestion adaptors
- Real-time ETL – data is operated upon as it is ingested, combined with historical data as needed from any time window, and directly published to downstream systems
- Out-Of-Order-Data-Handling – Data is processed correctly no matter what order, no matter when it comes in.
- Event Multiplexing – Decompose strings, CSV and JSON data into atomic elements that can be individually transformed and reassembled into custom data for use by downstream services
- Customizable Publication – Extensible operation by individual work groups, allowing them to define data format and transformation operations with common tools
- Manageability & Usability – Cloud-friendly system deployment and management with common tools and methodologies, and a single system path to explore for debugging

It is fantastic when new technologies allow us to increase speed and function, while also reducing complexity. The combination of compute functions with data ingestion provides a new way to meet business requirements, bringing a new level of agility and efficiency to increasingly complex data pipelines.
From Streaming Graph Theory to Practice

We've published a series of how-to blogs that take you step-by-step through the ETL process using Quine's ingest feature. Together with Quine Docs, these blogs will show you how to process high volumes of data with an intelligent, actor-based ETL system that can drive workflows.

- Building a Quine Streaming Graph: Ingest Streams
- Ingesting data from the internet into Quine Streaming Graph
- Ingesting From Multiple Data Sources into Quine Streaming Graph
- Ingest and Analyze Log Files Using Streaming Graph

Next Steps

And if you want to try Quine yourself, you can download it here. To get started, try the Ethereum Blockchain Fraud Detection, Wikipedia Ingest, or Apache Log Analytics recipes for different ingest stream examples.

If you have questions or want to check out the community, join Quine Slack or visit our GitHub page.

Ingest and Analyze Log Files Using Streaming Graph

2022-06-07T00:00:00+00:00

Processing Machine Logs with Streaming Graph

You know we had to get here eventually. I'm looking into all of the ways that Quine can connect to and ingest streaming sources. Last time I covered ingest from multiple sources, a Quine strength. Next up is my old friend, the log file.

Log files are a structured stream of data that can be parsed using regular expressions. Log lines are emitted at all levels of an application. The challenge is that they are primarily islands of disconnected bits of the overall picture. Placed into a data pipeline, we can use Quine to combine different types of logs and use a standing query to match interesting patterns upstream of a log analytics solution like Splunk or Sumo Logic.

Log Line Structure

Processing log files can quickly become as messy as the log files themselves.
I think that it's best to approach a log file like any other data source and take the time to understand the log line structure before asking any questions.

Quine is an application that produces log lines, and just like many other applications, the structure of the log lines follows a pattern. The log line pattern is defined in Scala, making it very easy for us to understand what the log line contains.

    pattern = "%date %level [%mdc{akkaSource:-NotFromActor}] [%thread] %logger - %msg%n%ex"

Quine Log RegEx

Each Quine log line was assembled using the pre-defined pattern. This presents a perfect opportunity to use a regular expression, reverse the pattern, and build a streaming graph.

NOTE: The regex link in the example below uses the log output from a Quine Enterprise cluster. Learn more about the Streaming Graph and other products created by thatDot. The regular expression will work for both Streaming Graph and Novelty.

I developed a regular expression that reverses the log line and returns the log elements for use by the ingest query of the ingest stream. I also published a recipe that uses the regular expression to parse Quine log lines on Quine.io.

    (^\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2},\d{3}) # date and time string
    (FATAL|ERROR|WARN|INFO|DEBUG) # log level
    \[(\S*)\] # actor address
    \[(\S*)\] # thread name
    (\S*) # logging class
    - # the log message
    ((?:(?!^[0-9]{4}(?:-[0-9]{2}){2}(?:[^|\r?\n]+){3}).*(?:\r?\n)?)+)

Quine Log Ingest Stream

In my previous article, I connected to a `CSV` file using the `CypherCsv FileIngest` format so that Quine could break the rows of data stored in the file back into columns. The `CypherLine FileIngest` format allows us to read each line into the `$that` variable and process it through a Cypher query.
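As a quick sanity check before the pattern goes into an ingest query, the expression can be exercised on its own. The following is a minimal sketch in Python, assuming a single-line simplification of the expression above with the message group reduced to `(.*)`. The helper name `parse_quine_log_line`, the field names, and the sample line are illustrative assumptions; the sample was written to fit the pattern and is not taken from a real Quine log. Python's `re` module is used here only for convenience, so small dialect differences from the engine Quine uses are possible.

```python
import re

# Single-line form of the Quine log pattern; the message group is simplified to (.*).
QUINE_LOG_RE = re.compile(
    r"(^\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2},\d{3}) "  # 1: date and time string
    r"(FATAL|ERROR|WARN|INFO|DEBUG) "                    # 2: log level
    r"\[(\S*)\] "                                        # 3: actor address
    r"\[(\S*)\] "                                        # 4: thread name
    r"(\S*) - "                                          # 5: logging class
    r"(.*)"                                              # 6: log message
)

# Illustrative field names, one per capture group.
FIELDS = ("timestamp", "level", "actor", "thread", "logger", "message")


def parse_quine_log_line(line):
    """Return the captured fields as a dict, or None when the line does not match."""
    match = QUINE_LOG_RE.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# Hypothetical log line written to fit the pattern.
SAMPLE_LINE = (
    "2022-06-07 10:15:30,123 INFO [NotFromActor] [main] "
    "example.Logger - Example message (with details)"
)

if __name__ == "__main__":
    print(parse_quine_log_line(SAMPLE_LINE))
    print(parse_quine_log_line("not a log line"))
```

The group numbering used here (0 for the whole match, 1 through 6 for the captured fields) is the same numbering the ingest query that follows relies on when it indexes into the match result.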
    ingestStreams:
      - type: FileIngest
        path: $in_file
        format:
          type: CypherLine
          query: |-
            // Quine log pattern "%date %level [%mdc{akkaSource:-NotFromActor}] [%thread] %logger - %msg%n%ex"
            WITH text.regexFirstMatch($that, "(^\\d{4}-\\d{2}-\\d{2} \\d{1,2}:\\d{2}:\\d{2},\\d{3}) (FATAL|ERROR|WARN|INFO|DEBUG) \\[(\\S*)\\] \\[(\\S*)\\] (\\S*) - (.*)") as r
            WHERE r IS NOT NULL
            // 0: whole matched line
            // 1: date time string
            // 2: log level
            // 3: actor address. Might be inside of `akka.stream.Log(…)`
            // 4: thread name
            // 5: logging class
            // 6: Message
            WITH *, split(r[3], "/") as path, split(r[6], "(") as msgPts
            WITH *, replace(COALESCE(split(path[2], "@")[-1], 'No host'),")","") as qh
            MATCH (actor), (msg), (class), (host)
            WHERE id(host) = idFrom("host", qh)
              AND id(actor) = idFrom("actor", r[3])
              AND id(msg) = idFrom("msg", r[0])
              AND id(class) = idFrom("class", r[5])
            SET host: Host, host.address = split(qh, ":")[0], host.port = split(qh, ":")[-1], host.host = qh,
                actor: Actor, actor.address = r[3], actor.id = replace(path[-1],")",""), actor.shard = path[-2], actor.type = path[-3],
                msg: Message, msg.msg = r[6], msg.type = split(msgPts[0], " ")[0], msg.level = r[2],
                class: Class, class.class = r[5]
            WITH *
            CALL reify.time(datetime({date: localdatetime(r[1], "yyyy-MM-dd HH:mm:ss,SSS")})) YIELD node AS time
            CREATE (actor)-[:sent]->(msg),
                   (actor)-[:of_class]->(class),
                   (actor)-[:on_host]->(host),
                   (msg)-[:at_time]->(time)

The ingest stream definition:

- Reads Quine log lines from a file
- Parses each line with regex
- Creates host, actor, message, and class nodes
- Populates the node properties
- Relates the nodes in the streaming graph
- Anchors the message with a relationship to a time node from 'reify.time'

Configuring Quine Logs

Ok, let's run this recipe and see how it works. By default, the log level in Quine is set to WARN.
We can increase the log level in the configuration or pass in a Java system configuration property when we launch Quine.

NOTE: Set the log level in Quine (or Quine Enterprise) via the `thatdot.loglevel` configuration option.

Setting Log Level in Configuration

Start by getting your current Quine configuration. The easiest way to get the configuration is to start Quine and then `GET` the configuration via an API call.

    ❯ curl --request GET \
      --url http://0.0.0.0:8080/api/v1/admin/config \
      --header 'Content-Type: application/json' \
      > quine.conf

Edit the `quine.conf` file and add `"thatdot":{"loglevel":"DEBUG"}` before the `quine` object.

    ❯ jq '.' quine.conf
    {
      "thatdot": {
        "loglevel": "DEBUG"
      },
      "quine": {
        "decline-sleep-when-access-within": "0",
        "decline-sleep-when-write-within": "100ms",
        "dump-config": false,
        "edge-iteration": "reverse-insertion",
        "id": {
          "partitioned": false,
          "type": "uuid"
        },
        "in-memory-hard-node-limit": 75000,
        "in-memory-soft-node-limit": 10000,
        "labels-property": "__LABEL",
        "metrics-reporters": [
          {
            "type": "jmx"
          }
        ],
        "persistence": {
          "effect-order": "memory-first",
          "journal-enabled": true,
          "snapshot-schedule": "on-node-sleep",
          "snapshot-singleton": false,
          "standing-query-schedule": "on-node-sleep"
        },
        "shard-count": 4,
        "should-resume-ingest": false,
        "store": {
          "create-parent-dir": false,
          "filepath": "quine.db",
          "sync-all-writes": false,
          "type": "rocks-db",
          "write-ahead-log": true
        },
        "timeout": "2m",
        "webserver": {
          "address": "0.0.0.0",
          "enabled": true,
          "port": 8080
        }
      }
    }

Now, restart Quine and include the `config.file` property.

    java -Dconfig.file=quine.conf -jar quine-x.x.x.jar > quineLog.log

`DEBUG` level log lines will stream into the `quineLog.log` file.

Passing Log Level at Runtime

Another slightly more straightforward way to enable Quine logs is to pass in a Java system configuration property.
Here's how to start Quine and enable logging from the command line.

    java -Dthatdot.loglevel=DEBUG -jar quine-x.x.x.jar > quineLog.log

`DEBUG` level log lines will stream into the `quineLog.log` file.

Ingesting Other Log Formats

You can easily modify the regex I developed for Quine log lines above to parse similar log output, like that found in *nix-based system files or other Java applications.

Standard-ish Java Log Output

Depending on the `log level`, Java emits a lot of information into logs. This ingest stream handles application log lines from most Java applications. Sometimes the log entry itself spans multiple lines.

      - type: FileIngest
        path: $app_log
        format:
          type: CypherJson
          query: |-
            WITH *, text.regexFirstMatch($that.message, '^(\\d{4}(?:-\\d{2}){2}(?:[^]\\r?\\n]+))\\s+?\\[(.+?)\\]\\s+?(\\S+?)\\s+(.+?)\\s+\\-\\s+((?:(?!^\\d{4}(?:-\\d{2}){2}(?:[^|\\r?\\n]+){3}).*(?:\\r?\\n)?)+)') AS r
            WHERE r IS NOT NULL
            CREATE (log {
              timestamp: r[1],
              component: r[2],
              level: r[3],
              subprocess: r[4],
              message: r[5],
              type: 'log'
            })
            // Create hour/minute buckets per event
            WITH *
            WHERE r[1] IS NOT NULL
            CALL reify.time(datetime({date: localdatetime(r[1], "yyyy-MM-dd HH:mm:ss,SSS")}), ["hour","minute"]) YIELD node AS timeNode
            // Create edges for timeNodes
            CREATE (log)-[:at]->(timeNode)

Ubuntu 22.04 LTS Syslog

If you're developing distributed applications, you will most likely need a regular expression that parses the Ubuntu `/var/log/syslog` file. First, you need to edit `/etc/rsyslog.conf` and comment out the line that selects the traditional timestamp format, so that rsyslog emits the high-precision RFC 3339 timestamps that the regular expression below expects.

    #
    # Use traditional timestamp format.
    # To enable high precision timestamps, comment out the following line.
    # $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

The log line format is: `%timestamp:::date-rfc3339% %HOSTNAME% %app-name% %procid% %msgid% %msg%n`

      - type: FileIngest
        path: $syslog
        format:
          type: CypherLine
          query: |-
            WITH text.regexFirstMatch($that, '^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d*?\\+\\d{2}:\\d{2}|Z).?\\s(.*?)(?=\\s).?\\s(\\S+)\\[(\\S+)\\]:\\s(.*)') AS s
            WHERE s IS NOT NULL
            CREATE (syslog {
              timestamp: s[1],
              hostname: s[2],
              app_name: s[3],
              proc_id: s[4],
              message: s[5],
              type: 'syslog'
            })
            // Create hour/minute buckets per event
            WITH *
            WHERE s[1] IS NOT NULL
            CALL reify.time(datetime({date: localdatetime(s[1], "yyyy-MM-dd'T'HH:mm:ss.SSSSSSz")}), ["hour","minute"]) YIELD node AS timeNode
            // Create edges for timeNodes
            CREATE (syslog)-[:at]->(timeNode)

MySQL Error Log

If you're working on a web application that's been around for a while, it's probably sitting on top of a MySQL database. The traditional-format MySQL log messages have these fields:

    time thread [label] [err_code] [subsystem] msg

For example:

    2022-04-14T06:55:26.961757Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Socket: /var/run/mysqld/mysqlx.sock

Add these log entries to your streaming graph for analysis too.

      - type: FileIngest
        path: $sqlerr_log
        format:
          type: CypherLine
          query: |-
            WITH text.regexFirstMatch($that, '^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{6}Z)\\s(\\d)\\s\\[(\\S+)\\]\\s\\[(\\S+)\\]\\s\\[(\\S+)\\]\\s(.*)') AS m
            WHERE m IS NOT NULL
            CREATE (sqllog {
              timestamp: m[1],
              thread: m[2],
              label: m[3],
              err_code: m[4],
              subsystem: m[5],
              message: m[6],
              type: 'sqllog'
            })
            // Create hour/minute buckets per event
            WITH *
            WHERE m[1] IS NOT NULL
            CALL reify.time(datetime({date: localdatetime(m[1], "yyyy-MM-dd'T'HH:mm:ss.SSSSSSz")}), ["hour","minute"]) YIELD node AS timeNode
            // Create edges for timeNodes
            CREATE (sqllog)-[:at]->(timeNode)

Conclusion

Streaming data comes from all kinds of sources.
With Quine, it's easy to convert that data stream into a streaming graph. Quine is open source if you want to run this analysis for yourself. Download a precompiled version or build it yourself from the codebase Quine GitHub</a>. I published the recipe that I developed at https://quine.io/recipes/quine-log-recipe</a>. The page has instructions for downloading the quineLog.log</code> files and running the recipe. Have a question, suggestion, or improvement? I welcome your feedback! Please drop in to Quine Slack</a> and let me know. I'm always happy to discuss Quine or answer questions. Ingesting From Multiple Data Sources into Quine Streaming Graph 2022-06-02T00:00:00+00:00 Building a Streaming Graph from Multiple Sources</h2> As part of the ongoing series in which I explore different ways to use the ingest stream to load data into Quine, I want to cover one of Quine's specialities: building a streaming graph from multiple data sources. This time, we'll work with CSV data exported from IMDb to answer the question: "Which actors have acted in and directed the same movie?" The CSV Files Usually, if someone says that they have data, most likely it's going to be in CSV</code> format or pretty darn close to it. (Or JSON</code>, but that is another blog post.) In our case, we have two files filled with data in CSV</code> format. Let's inspect what's inside. File 1: movieData.csv</a> The movieData.csv</code> file contains records for actors, movies, and the actor's relationship to the movie. Conveniently, each record type has a schema, flattened into rows during export. Should we separate the data back into discrete files and then load them? No, we can set up separate ingest streams to act on each data type in the file. Effectively, we will separate the "jobs to do" into Cypher queries and stream in the data. File 2: ratingData.csv</a> Our second file, ratingData.csv</code>, is very straightforward. It contains 100,000 rows of movie ratings. 
Adding the ratings</code> data into our model completes our discovery phase for the supplied data. Original implied schema of IMDb data. The CypherCsv Ingest Stream</h2> The Quine API documentation defines the schema</a> of the File Ingest Format ingest stream for us. The schema is robust and accommodates CSV, JSON, and line file types. Please take a moment to read through the documentation. Be sure to select type: FileIngest -> format: CypherCsv using the API documentation dropdowns. I define ingest streams to transform and load the movie data into Quine. Quine ingest streams behave independently and in parallel when processing files. This means that we can have multiple ingest streams operating on a single file. This is the case for the movieData.csv file because there are several operations that we need to perform on multiple types of data. Movie Rows</h2> The first ingest stream that I set up addresses the Movie rows in the movieData.csv file. There are 9,125 movies in the data set. I create two kinds of nodes from each Movie row using an ingest query: Movie and Genre. I store all of the movie data as properties in the Movie node. 
WITH $that AS row MATCH (m) WHERE row.Entity = 'Movie' AND id(m) = idFrom("Movie", row.movieId) SET m:Movie, m.tmdbId = row.tmdbId, m.imdbId = row.imdbId, m.imdbRating = toFloat(row.imdbRating), m.released = row.released, m.title = row.title, m.year = toInteger(row.year), m.poster = row.poster, m.runtime = toInteger(row.runtime), m.countries = split(coalesce(row.countries,""), "|"), m.imdbVotes = toInteger(row.imdbVotes), m.revenue = toInteger(row.revenue), m.plot = row.plot, m.url = row.url, m.budget = toInteger(row.budget), m.languages = split(coalesce(row.languages,""), "|"), m.movieId = row.movieId WITH m,split(coalesce(row.genres,""), "|") AS genres UNWIND genres AS genre WITH m, genre MATCH (g) WHERE id(g) = idFrom("Genre", genre) SET g.genre = genre, g:Genre MERGE (m:Movie)-[:IN_GENRE]->(g:Genre)</code></pre> Quine passes each line to the ingest stream via the variable $that</code>, which I alias as row</code>. When the row.Entity</code> value is Movie</code>, the MATCH</code> selects the node whose id</code> equals the value returned by the idFrom()</code> function. SET</code> is used to give the node a label and to store metadata as node properties. Each movie row has a pipe |</code> delimited list of genres in the genres</code> column. I split the column value apart and create a Genre node for each genre in the list, labeled and containing the genre as a property. Finally, the Movie</code> node is related to the Genre</code> node with MERGE</code>. Person Rows</h2> The second ingest stream addresses the Person</code> rows in the same way I did for the Movie</code> rows. There are 19,047 person records in the movieData.csv</code> file. 
WITH $that AS row MATCH (p) WHERE row.Entity = "Person" AND id(p) = idFrom("Person", row.tmdbId) SET p:Person, p.imdbId = row.imdbId, p.bornIn = row.bornIn, p.name = row.name, p.bio = row.bio, p.poster = row.poster, p.url = row.url, p.tmdbId = row.tmdbId, p.born = CASE row.born WHEN "" THEN null ELSE datetime(row.born + "T00:00:00Z") END, p.died = CASE row.died WHEN "" THEN null ELSE datetime(row.died + "T00:00:00Z") END</code></pre> The ingest query in this ingest stream matches when the row.Entity</code> is Person</code>, selects a node using the idFrom()</code> function, and stores the Person metadata in node properties. Join Rows</h2> Looking at the rows that have Join</code> in the Entity</code> column leads me to believe that the data in this CSV</code> file originated from a relational database. There are two types of joins in the file, acting and directing, distinguished by the value of row.Work</code>. The ingest queries below process them. Acted In WITH $that AS row WITH row WHERE row.Entity = "Join" AND row.Work = "Acting" MATCH (p) WHERE id(p) = idFrom("Person", row.tmdbId) MATCH (m) WHERE id(m) = idFrom("Movie", row.movieId) MATCH (r) WHERE id(r) = idFrom("Role", row.tmdbId, row.movieId, row.role) SET r.role = row.role, r.movie = row.movieId, r.tmdbId = row.tmdbId, r:Role MERGE (p:Person)-[:PLAYED]->(r:Role)<-[:HAS_ROLE]-(m:Movie) MERGE (p:Person)-[:ACTED_IN]->(m:Movie)</code></pre> Acting join rows create relationships between Person, Role, and Movie nodes. There are two paths created from the Person nodes. The first path (p)-[:PLAYED]->(r)<-[:HAS_ROLE]-(m)</code> establishes the relationship between actors (Person) and the roles they have played as well as the roles in a movie (Movie). A second path is formed that directly relates an actor to movies they acted in. 
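The join query above never searches for a Person or Movie node. It recomputes each node's ID with idFrom() from the same values that the Person and Movie ingest streams used, so independent streams meet at the same nodes without coordinating. The Python sketch below illustrates only that idea. The id_from helper is a hypothetical stand-in built on uuid5, not Quine's actual algorithm, and the sample values are made up rather than taken from movieData.csv.

```python
import uuid


def id_from(*parts):
    """Hypothetical stand-in for idFrom(): the same inputs always give the same ID.

    This is an illustration only and does not reproduce Quine's algorithm.
    """
    # repr() plus a separator keeps ("ab", "c") distinct from ("a", "bc").
    key = "\x1f".join(repr(p) for p in parts)
    return uuid.uuid5(uuid.NAMESPACE_OID, key)


# Made-up sample values, not rows from the real CSV files.
person_from_person_row = id_from("Person", "tmdb-1")  # computed by the Person stream
person_from_join_row = id_from("Person", "tmdb-1")    # computed by the Acted In stream
movie_from_join_row = id_from("Movie", "tmdb-1")      # same value, different namespace
role = id_from("Role", "tmdb-1", "movie-1", "Lead")   # composite key for a Role node

print(person_from_person_row == person_from_join_row)  # True
print(person_from_person_row == movie_from_join_row)   # False
```

The first argument plays the part of a namespace, which is why a Person and a Movie that happen to share a raw identifier still land on different nodes.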
Directed WITH $that AS row WITH row WHERE row.Entity = "Join" AND row.Work = "Directing" MATCH (p) WHERE id(p) = idFrom("Person", row.tmdbId) MATCH (m) WHERE id(m) = idFrom("Movie", row.movieId) MERGE (p:Person)-[:DIRECTED]->(m:Movie)</code></pre> The Directed ingest query matches join rows and creates a path relating directors with the movies they have directed. Ratings</h2> WITH $that AS row MATCH (m) WHERE id(m) = idFrom("Movie", row.movieId) MATCH (u) WHERE id(u) = idFrom("User", row.userId) MATCH (rtg) WHERE id(rtg) = idFrom("Rating", row.movieId, row.userId, row.rating) SET u.name = row.name, u:User SET rtg.rating = row.rating, rtg.timestamp = toInteger(row.timestamp), rtg:Rating MERGE (u:User)-[:SUBMITTED]->(rtg:Rating)<-[:HAS_RATING]-(m:Movie) MERGE (u:User)-[:RATED]->(m:Movie)</code></pre> The last ingest query processes rows from the ratingData.csv</code> file. The query creates User and Rating nodes, then relates them together. Running the Recipe</h2> As my project progressed, I developed a Quine recipe</a> to load my CSV</code> files and perform the analysis. Running the recipe requires a couple of Quine options to pass in the locations of the CSV</code> files and an updated configuration setting. java \ -Dquine.in-memory-soft-node-limit=30000 \ -jar ../releases/latest -r movieData \ --recipe-value movie_file=movieData.csv \ --recipe-value rating_file=ratingData.csv</code></pre> Ingesting the CSV</code> files produces the following data set in Quine: The data model in Quine for the IMDb data. The orange Movie and Person nodes are created directly from the Entity</code> column in movieData.csv</code>. The User node is from ratingData.csv</code> and the green nodes were derived from data stored within an entity row. The ActedDirected</code> relationship is built by the standing query in the recipe. Answering the Question</h2> Getting all of this data into Quine was only part of the challenge. 
Remember the question that we were asked: "Which actors have acted in and directed the same movie?" Quine is a streaming graph; if we were to connect the ingest streams to a streaming source, rather than CSV</code> files, the standing query</a> inside of the recipe that I developed would answer the question for movies in the past as well as movies in the future. Our standing query matches when the pattern is complete: an actor (Person</code>) both ACTED_IN</code> and DIRECTED</code> the same movie. MATCH (a:Movie)<-[:ACTED_IN]-(p:Person)-[:DIRECTED]->(m:Movie) WHERE id(a) = id(m) RETURN id(m) as movieId, m.title as Movie, id(p) as personId, p.name as Actor</code></pre> When the standing query completes a match, it processes the movie id</code> and person id</code> through the output</a> query and actions. standingQueries: - pattern: type: Cypher mode: MultipleValues query: |- MATCH (a:Movie)<-[:ACTED_IN]-(p:Person)-[:DIRECTED]->(m:Movie) WHERE id(a) = id(m) RETURN id(m) as movieId, m.title as Movie, id(p) as personId, p.name as Actor outputs: set-ActedDirected: type: CypherQuery query: |- MATCH (m),(p) WHERE strId(m) = $that.data.movieId AND strId(p) = $that.data.personId MERGE (p:Person)-[:ActedDirected]->(m:Movie) log-actor-director: type: WriteToFile path: "ActorDirector.jsonl"</code></pre> My standing query creates a new ActedDirected</code> relationship between the Person and Movie nodes, then logs the relationship. Four hundred ninety-one actors acted in and directed the same movie in our data set. { "data": { "Actor": "Clint Eastwood", "Movie": "Unforgiven", "movieId": "4a6d64c8-9c90-3362-b443-4d2e7b2fb9d1", "personId": "4638a820-3b68-3fc7-9fa7-341e876b701e" } }</code></pre>Conclusion</h2> Phew, we made it through! And we learned a lot along the way. 
CSV data is streamed into Quine</li> Quine can read from external files and streaming providers</li> You can ingest multiple streams at once, movies and reviewers, and combine them into one streaming graph</li> Always separate ingest queries using the jobs-to-be-done framework</li> </ul> Quine is open source if you want to run this analysis for yourself. Download a precompiled version or build it yourself from the codebase Quine GitHub</a>. I published the recipe that I developed at https://quine.io/recipes</a>. The page has instructions for downloading the CSV</code> files and running the recipe. Have a question, suggestion, or improvement? I welcome your feedback! Please drop in to Quine Slack</a> and let me know. I'm always happy to discuss Quine or answer questions. Real-time Blockchain Monitoring is Hard without A Streaming Graph 2022-05-31T00:00:00+00:00 The Challenges of Finding Fraud on the Blockchain</h2> Blockchain-based technology growth has been explosive, with over 10,000 cryptocurrencies alone available to rapidly growing consumer and commercial user bases. Real-time governance and compliance techniques are needed to ensure confidence in the space if these currencies are to be embraced as alternatives to fiat currencies. The combination of new technology, well-established user expectations for real-time transactions, and rapidly evolving regulations demands new tools to handle the complexities of these distributed and pseudonymized</a> systems. Detecting, tracing, and mitigating fraud across blockchains relies on many of the same practices used by more traditional banking systems: modeling user behaviors, watching for suspicious transactions relative to known exploits or typical behavior patterns (often termed "know your customer" or KYC</a>), and rapid action to limit fraudulent transactions (e.g., from hacked accounts), ideally in real-time. 
The use of pseudonymity practices (e.g., the use of private addresses), however, requires new data analysis techniques and mechanisms to maximize the contextual value of the data that is available to identify fraud while minimizing the impact of false positives and investigative overhead on customers and business operations. Given that user identity data is more limited in crypto, it becomes necessary to maximize the use of available information about the interactions of accounts and wallets to identify and trace money laundering. Fortunately, the blockchains underlying cryptocurrencies are essentially append-only ledgers that provide a complete history of such interactions and can be readily analyzed. That is, so long as tools suited to modeling and querying these relationships are available. Relational Databases Work at Low Transaction Volumes</h3> Modeling and monitoring relationships between event data can be done at low volume using legacy relational database tools. Tables are built to represent the relationships between an address and its transactions, which are joined with tables about the addresses and their accounts, which are then joined with tables about blocks, which are joined with tables from other blockchains… Such “nested joins” and the use of foreign keys to relate tables together are the state of the art today, but they are computationally expensive and slow to return query responses. Reducing queries to small blocks of time has been the modus operandi of the industry. However, the use of “time windowing,” a well-established practice, limits the data used to make decisions; it is the antithesis of the context enrichment we are seeking when we analyze the relationships between blockchain event logs. Using Categorical Data to Process the Blockchain</h3> Enter graph technology. Graph data structures are ideal for modeling the relationships described in blockchain events. Flows of cryptocurrency between accounts and wallets are ideal inputs for graph data modeling. 
Accounts, addresses, time references, devices, assets, transaction details, etc. are all examples of categorical data</a> connected by relationships and are therefore ideal to be represented as the nodes, edges, and properties provided in a graph data model. Most importantly, the graph data model makes the relationships between entities first-class citizens in the data model, so the costs and complexity associated with table joins are entirely eliminated. Graph is the ideal data model for blockchain relationship tracing. Knowing that a graph data model aligns well with blockchain event data, we then need to confront the well-known performance limitations of graph databases. Graph databases are still databases, and queries that traverse multiple degrees of relationship dramatically impact database performance. Unfortunately, this leads developers to once again fall back to batch processing with time-limited windows of data. While graph is more efficient than relational databases in modeling relationships, it still lacks the throughput needed for real-time fraud detection and mitigation use cases. What is needed is a system that combines the graph data model with an event processing architecture that provides fast enough throughput to cost-effectively perform deep graph traversal queries across the complete history of one or more blockchains' events. Achieving such performance maximizes the contextual value of available event data by looking at real-time and historical transactions, while acting fast enough to challenge new transactions in real time. Enter Quine Streaming Graph for Fraud Detection</h3> This is where Quine streaming graph comes in. Quine is designed to process high volumes of event stream data in real time in order to detect complex and sometimes subtle patterns like the sort that might indicate fraud. 
Quine scales to tens of thousands of events per node, and can easily handle blockchain transaction volumes in real time, while also providing access to the complete trace of activities through and across blockchains. Quine can simultaneously consume both streaming data sources (Kafka</a> and Kinesis</a>) and static sources stored in databases and data lakes in order to build an integrated graph data model. If you are interested in trying Quine yourself, download</a> it here and try it with the Ethereum tag propagation</a> recipe. If you have questions or want to check out the community, join Quine Slack</a> or visit our GitHub</a> page. Blog header photo credit: Photo by Héctor J. Rivas</a> on Unsplash</a> Ingesting data from the internet into Quine Streaming Graph 2022-05-24T00:00:00+00:00 The previous article in this series (Quine Ingest Streams</a>) introduced ingest streams and the basic structure for creating them. In this article, I go deeper, exploring the ingest query and its role in the ingest stream. A quick review of ingest streams: An ingest stream connects Quine to data producers.</li> Ingest streams use backpressure to avoid becoming overloaded.</li> Data is transformed by the ingest query into a streaming graph.</li> Using idFrom</code> allows us to act as if all nodes in the graph already exist.</li> Ingest streams are created either by API calls or Recipes.</li> </ul> For this article, we use the built-in wikipedia</code> recipe as a starting point. Defining an Ingest Stream</h2> The wikipedia page ingest</a> recipe defines an ingest stream that receives updates from the mediawiki.page-create</code> event stream</a>. 
Here's a copy of the ingest stream from the recipe: ingestStreams: - type: ServerSentEventsIngest url: https://stream.wikimedia.org/v2/stream/page-create format: type: CypherJson query: |- MATCH (revNode) WHERE id(revNode) = idFrom("revision", $that.rev_id) MATCH (dbNode) WHERE id(dbNode) = idFrom("db", $that.database) MATCH (userNode) WHERE id(userNode) = idFrom("id", $that.performer.user_id) SET revNode = $that, revNode.type = "rev" SET dbNode.database = $that.database, dbNode.type = "db" SET userNode = $that.performer, userNode.type = "user" WITH *, datetime($that.rev_timestamp) AS d CALL create.setLabels(revNode, ["rev:" + $that.page_title]) CALL create.setLabels(dbNode, ["db:" + $that.database]) CALL create.setLabels(userNode, ["user:" + $that.performer.user_text]) CALL reify.time(d, ["year", "month", "day", "hour", "minute"]) YIELD node AS timeNode CALL incrementCounter(timeNode, "count") CREATE (revNode)-[:at]->(timeNode) CREATE (revNode)-[:db]->(dbNode) CREATE (revNode)-[:by]->(userNode)</code></pre> This ingest stream has three elements: type</code>, url</code>, and format</code>. The type declaration for an ingest stream establishes the structure for the ingest stream object definition. This ingest stream is a ServerSentEventsIngest</code> stream. Reviewing the ServerSentEventsIngest</code> schema documentation</a> from the API docs provides us with the schema that we need to follow for the ingest stream definition. NOTE The schema definition will default to File Ingest Stream when first opened. Be sure to click on the down arrow 🔽 next to File Ingest Stream and select Server Sent Events Stream from the drop down to view the correct schema. Here's the schema for a ServerSentEventsIngest</code> Quine Server Ingest Stream Schema The structure of the ServerSentEventsIngest</code> stream is pretty straightforward. 
type</code> specifies the schema type for the ingest stream format</code> defines what the ingest stream will do with each line it receives type</code> identifies the line format in the stream query</code> defines the Cypher ingest query parameter</code> names the parameter that stores the current datum url</code> defines the connection URL for the data producer parallelism</code> and maximumPerSecond</code> tune the bandwidth for the ingest stream and when to apply backpressure Wikipedia page-create</code> Data</h2> A quick aside: we need to understand the data that we are working on before we start pulling the ingest query apart. Here's a sample page-create</code> JSON object to review. View more samples by visiting the Wikipedia event streams</a> page, selecting the mediawiki.page-create</code> stream, then clicking the green "Stream" button. { "$schema": "/mediawiki/revision/create/1.1.0", "meta": { "uri": "https://en.wikipedia.org/wiki/Established_population", "request_id": "85b7bd4b-23a5-4c20-84a1-d89430c21f6c", "id": "8a34f1c0-a276-4a2b-ae2e-305f8822011c", "dt": "2022-05-20T16:43:34Z", "domain": "en.wikipedia.org", "stream": "mediawiki.page-create", "topic": "eqiad.mediawiki.page-create", "partition": 0, "offset": 231788500 }, "database": "enwiki", "page_id": 70828723, "page_title": "Established_population", "page_namespace": 0, "rev_id": 1088883819, "rev_timestamp": "2022-05-20T16:43:33Z", "rev_sha1": "d9uoc7gw3cj3ejhs8ihvsi61hp54icq", "rev_minor_edit": false, "rev_len": 82, "rev_content_model": "wikitext", "rev_content_format": "text/x-wiki", "performer": { "user_text": "Invasive Spices", "user_groups": [ "extendedconfirmed", "*", "user", "autoconfirmed" ], "user_is_bot": false, "user_id": 40272459, "user_registration_dt": "2020-09-30T23:11:08Z", "user_edit_count": 9319 }, "page_is_redirect": true, "comment": "#REDIRECT [[Naturalisation (biology)]] {{R cat shell| {{R from related 
topic}} }}", "parsedcomment": "#REDIRECT <a href=\"/wiki/Naturalisation_(biology)\" title=\"Naturalisation (biology)\">Naturalisation (biology)</a> {{R cat shell| {{R from related topic}} }}", "rev_slots": { "main": { "rev_slot_content_model": "wikitext", "rev_slot_sha1": "d9uoc7gw3cj3ejhs8ihvsi61hp54icq", "rev_slot_size": 82, "rev_slot_origin_rev_id": 1088883819 } } }</code></pre> Take a moment to get familiar with the page-create</a> schema from the Wikipedia API documentation. The sample object is a bit messy for us to really see what is going on, so let's clean it up a bit. Showing just the keys from the object with jq</code> makes it much easier to plan our ingest query. ❯ jq '. | keys' /tmp/data.json [ "$schema", "comment", "database", "meta", "page_id", "page_is_redirect", "page_namespace", "page_title", "parsedcomment", "performer", "rev_content_format", "rev_content_model", "rev_id", "rev_len", "rev_minor_edit", "rev_sha1", "rev_slots", "rev_timestamp" ]</code></pre> The mediawiki recipe is an example use case for the reify.time</code> user function. It creates temporal nodes in the graph and relationships with the page-create</code> nodes based on the rev_timestamp</code>. To demonstrate the reify.time</code> function, our ingest query creates revision nodes, db nodes, and user nodes that are related to each other and to their representative time nodes. To learn more about creating time-series nodes in Quine, read about time reification here.</a> The Ingest Query</h2> The ingest query is the workhorse of the ingest stream. Each datum, the page-create</code> object in this case, is processed by the ingest query. The query is written in Cypher and is responsible for parsing data, creating nodes, storing data, and setting relationships in the streaming graph. First, the ingest query selects the nodes we want using MATCH</code> and WHERE</code>. The node id</code> is assigned using the idFrom</code> function. 
MATCH (revNode) WHERE id(revNode) = idFrom("revision", $that.rev_id) MATCH (dbNode) WHERE id(dbNode) = idFrom("db", $that.database) MATCH (userNode) WHERE id(userNode) = idFrom("id", $that.performer.user_id)</code></pre> Notice that we pass two parameters to the idFrom</code> function. The first parameter establishes a unique namespace for the id</code> to avoid collisions. The second parameter, in the first MATCH</code>, is the rev_id</code> from the page-create</code> object. The result from idFrom</code> is a deterministic UUID for each node. Next, we store the rev</code>, db</code>, and user</code> values as properties in the respective nodes and label each node for clarity in the graph explorer. Quine parses the ingested line and stores the results in a variable, $that</code>. You can retrieve values from the ingested datum using dot notation as $that.<attribute></code>. SET revNode = $that, revNode.type = "rev" SET dbNode.database = $that.database, dbNode.type = "db" SET userNode = $that.performer, userNode.type = "user" CALL create.setLabels(revNode, ["rev:" + $that.page_title]) CALL create.setLabels(dbNode, ["db:" + $that.database]) CALL create.setLabels(userNode, ["user:" + $that.performer.user_text])</code></pre> There is quite a bit going on in the next simple line, specifically the use of WITH *. Let's take a moment to understand why we chose to use this pattern. A WITH</code> clause changes the scope of data available to the rest of the query. If you explicitly list each node in the data and accidentally omit a variable, it's lost for the remainder of the query, and you can get unexpected errors. Using the glob ensures that all nodes and variables are at your disposal in the ingest query. WITH *, datetime($that.rev_timestamp) AS d</code></pre> The ingest query makes a CALL</code> to the reify.time</code> function to create a new timeNode</code>. The resulting node is based on the year, month, day, hour, and minute of the rev_timestamp</code>. 
It also increments the count</code> property of the timeNode</code>. CALL reify.time(d, ["year", "month", "day", "hour", "minute"]) YIELD node AS timeNode CALL incrementCounter(timeNode, "count")</code></pre> Finally, the ingest query creates the relationships between nodes in the graph. CREATE (revNode)-[:at]->(timeNode) CREATE (revNode)-[:db]->(dbNode) CREATE (revNode)-[:by]->(userNode)</code></pre> Now, let's run the recipe to see how the ingest query builds out the graph in Quine. With the latest Quine jar file downloaded from Quine.io</a>, start the recipe from the command line. ❯ java -jar quine-x.x.x.jar -r wikipedia</code></pre> The recipe includes a standing query that outputs nodes to the terminal as they arrive. You should see activity quickly after launching the recipe. Before the graph gets too large, open Quine explorer (http://0.0.0.0:8080) and run the time nodes stored query. Each of the time nodes was created by the ingest query using the timestamp in the page-create</code> object. We call these synthetic nodes. Synthetic nodes are useful when looking for abstract patterns between loosely related nodes. In this case, they show which updates were done during a particular time bucket. Quine Exploration UI - Time Buckets Using the API, let's inspect the ingest</code> stream using the ingest endpoint. 
❯ http GET http://0.0.0.0:8080/api/v1/ingest Content-Type:application/json HTTP/1.1 200 OK Content-Encoding: gzip Content-Type: application/json Date: Fri, 20 May 2022 20:06:01 GMT Server: akka-http/10.2.9 Transfer-Encoding: chunked { "INGEST-1": { "settings": { "format": { "parameter": "that", "query": "MATCH (revNode) WHERE id(revNode) = idFrom(\"revision\", $that.rev_id)\nMATCH (dbNode) WHERE id(dbNode) = idFrom(\"db\", $that.database)\nMATCH (userNode) WHERE id(userNode) = idFrom(\"id\", $that.performer.user_id)\nSET revNode = $that, revNode.type = \"rev\"\nSET dbNode.database = $that.database, dbNode.type = \"db\"\nSET userNode = $that.performer, userNode.type = \"user\"\nWITH *, datetime($that.rev_timestamp) AS d\nCALL create.setLabels(revNode, [\"rev:\" + $that.page_title])\nCALL create.setLabels(dbNode, [\"db:\" + $that.database])\nCALL create.setLabels(userNode, [\"user:\" + $that.performer.user_text])\nCALL reify.time(d, [\"year\", \"month\", \"day\", \"hour\", \"minute\"]) YIELD node AS timeNode\nCALL incrementCounter(timeNode, \"count\")\nCREATE (revNode)-[:at]->(timeNode)\nCREATE (revNode)-[:db]->(dbNode)\nCREATE (revNode)-[:by]->(userNode)", "type": "CypherJson" }, "parallelism": 16, "type": "ServerSentEventsIngest", "url": "https://stream.wikimedia.org/v2/stream/page-create" }, "stats": { "byteRates": { "count": 1354157, "fifteenMinute": 1552.6927122874843, "fiveMinute": 1398.959143968717, "oneMinute": 1099.4731678954581, "overall": 1448.3578957557581 }, "ingestedCount": 914, "rates": { "count": 914, "fifteenMinute": 1.0510781922502073, "fiveMinute": 0.9474472912218986, "oneMinute": 0.7431750446830565, "overall": 0.9775815796950665 }, "startTime": "2022-05-20T19:50:26.494025Z", "totalRuntime": 934608 }, "status": "Running" } }</code></pre> The ingest stream defined via the recipe is named INGEST-1</code> and is currently running. Info Did you know that you can make API calls directly from the embedded API documentation? 
Select the page icon (📄) from the left nav inside of Quine Explorer, then navigate to the API endpoint that you want to exercise. Adjust the API call as needed, and press the blue "Send API Request" button. Pausing the stream via the API is done via the ingest/{name}/pause endpoint</code>. ❯ http PUT http://tow-mater:8080/api/v1/ingest/INGEST-1/pause Content-Type:application/json HTTP/1.1 200 OK Content-Encoding: gzip Content-Type: application/json Date: Fri, 20 May 2022 20:09:27 GMT Server: akka-http/10.2.9 Transfer-Encoding: chunked { "name": "INGEST-1", "settings": { "format": { "parameter": "that", "query": "MATCH (revNode) WHERE id(revNode) = idFrom(\"revision\", $that.rev_id)\nMATCH (dbNode) WHERE id(dbNode) = idFrom(\"db\", $that.database)\nMATCH (userNode) WHERE id(userNode) = idFrom(\"id\", $that.performer.user_id)\nSET revNode = $that, revNode.type = \"rev\"\nSET dbNode.database = $that.database, dbNode.type = \"db\"\nSET userNode = $that.performer, userNode.type = \"user\"\nWITH *, datetime($that.rev_timestamp) AS d\nCALL create.setLabels(revNode, [\"rev:\" + $that.page_title])\nCALL create.setLabels(dbNode, [\"db:\" + $that.database])\nCALL create.setLabels(userNode, [\"user:\" + $that.performer.user_text])\nCALL reify.time(d, [\"year\", \"month\", \"day\", \"hour\", \"minute\"]) YIELD node AS timeNode\nCALL incrementCounter(timeNode, \"count\")\nCREATE (revNode)-[:at]->(timeNode)\nCREATE (revNode)-[:db]->(dbNode)\nCREATE (revNode)-[:by]->(userNode)", "type": "CypherJson" }, "parallelism": 16, "type": "ServerSentEventsIngest", "url": "https://stream.wikimedia.org/v2/stream/page-create" }, "stats": { "byteRates": { "count": 1653281, "fifteenMinute": 1530.565647994232, "fiveMinute": 1428.2092910910662, "oneMinute": 1488.2104624440235, "overall": 1448.444229804896 }, "ingestedCount": 1117, "rates": { "count": 1117, "fifteenMinute": 1.0361739604926652, "fiveMinute": 0.96669545913622, "oneMinute": 1.0032209384426753, "overall": 0.9786067989220232 }, 
"startTime": "2022-05-20T19:50:26.494025Z", "totalRuntime": 1141066 }, "status": "Paused" }</code></pre> Notice that the updates in your terminal window stopped and the INGEST-1</code> ingest stream has a status of "Paused"</code>. Restart the stream with a PUT</code> to the /ingest/{name}/start</code> endpoint. Updates will resume in your terminal window and the ingest stream status will return to "Running"</code>. Conclusion</h2> We are just getting warmed up with ingest streams! This post walked through a simple ingest stream and ingest query to read server-sent events (SSE) from the Wikipedia streaming events service. Next up in the series is Ingesting CSV data, where we will go over how Quine streams in data that is stored in a CSV file. I welcome your feedback! Drop in to Quine Slack</a> and let me know what you think. I'm always happy to discuss Quine or answer questions. Building a Quine Streaming Graph: Ingest Streams 2022-05-21T00:00:00+00:00 Quine Ingest Streams</h2> Quine is optimized to process high volumes of data in motion and then stream out high-quality insights in real-time. The ingest stream is where a streaming graph starts. It connects to data producers, transforms the data, then populates a streaming graph to be analyzed by standing queries. Quine streaming graph combines multiple sources to detect high value patterns. Let's get under the hood to understand how ingest streams work. Quine is fundamentally a stream-oriented data processor that uses a graph data model. This provides optimal integration with streaming data producers and consumers such as Kafka and Kinesis. Quine builds on this streaming foundation to provide batch-like capabilities by converting data stored in files to streaming data to load into the graph. Ingest Stream Concepts</h2> What is an Ingest Stream? An ingest stream connects a data source to Quine and prepares the emitted data for the streaming graph. 
Within the ingest stream, an ingest query, written in Cypher, updates the streaming graph nodes and edges as data is received. Backpressuring Ingest Streams Inevitably, when streaming data producers outpace consumers, the consumer will become overwhelmed. In Quine, as an ingest stream begins to get more data than it can process, it manages the dataflow to avoid becoming overwhelmed using "backpressure." A backpressured system does not buffer; instead, it causes upstream producers to *not* send data at a rate greater than the consumer can process. The problem with buffering is that a buffer will eventually run out of space. And then what? The system must decide what to do when the buffer is full: drop new results, drop old results, crash the system, or backpressure. Backpressure is a protocol</a> defining how to send a logical signal UP the stream with information about the downstream consumer's readiness to receive more data. That backpressure signal follows the same path as data moving downstream, but in reverse. If downstream is not ready to consume, then upstream does not send. Quine uses a reactive stream implementation of backpressure, Akka Streams</a>, built on top of the actor model to ensure that the ingestion and processing of streams are resilient. Info Curious about the operational challenge associated with reactive streams? Read the Reactive Manifesto</a> to understand the problems faced by every streaming processor in a high-volume data pipeline. Including asynchronous, non-blocking backpressure is the only method to ensure that all data from a high-volume stream is processed without data loss or processing delays. All Nodes Exist With a graph data model, nodes are the primary unit of data, much like a "row" is the primary unit of data in a relational database. However, unlike traditional graph data systems, a Quine user never has to create a node directly. Instead, the system functions as if all nodes exist.
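One way to picture "all nodes exist" is a lookup that never fails. The sketch below is only an illustration in Python, not how Quine is implemented: `Graph` and `Node` are hypothetical names, and a plain dictionary stands in for the persistence layer. Asking for any ID returns a node, and only nodes that have received data take up space.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node is an ID plus whatever properties have been set on it."""
    node_id: str
    properties: dict = field(default_factory=dict)

    @property
    def is_empty(self) -> bool:
        return not self.properties


class Graph:
    """Behaves as if every node already exists (hypothetical illustration)."""

    def __init__(self) -> None:
        # Only nodes that have received data are stored here.
        self._stored = {}

    def get(self, node_id: str) -> Node:
        # Never raises: an unknown ID yields an empty node rather than an error.
        return self._stored.get(node_id, Node(node_id))

    def set_property(self, node_id: str, key: str, value: object) -> None:
        # The first write is what makes a node worth storing.
        node = self._stored.setdefault(node_id, Node(node_id))
        node.properties[key] = value

    def stored_count(self) -> int:
        return len(self._stored)


graph = Graph()
untouched = graph.get("never-seen-before")  # no explicit create step needed
graph.set_property("rev-42", "type", "rev")
```

Reading a node that has never been written costs nothing to store, which is why a caller never needs a separate "create node" step before setting data on it.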
Quine represents every possible node as an existing "empty node" with no interesting history. As data streams into the system, the node becomes interesting, and Quine creates a history for the node. We added an idFrom</code> function to Cypher that takes any number of arguments and deterministically produces a node ID from that data. This is similar to a consistent-hashing strategy, except that the ID produced from this function is always an ID that conforms to the type chosen for the ID provider. You will use idFrom</code> in the ingest query part of every ingest stream that you create. For example, the absolute minimum ingest query to load incoming data into the graph is simply a wrapper around the idFrom</code> function. MATCH (n) WHERE id(n) = idFrom($that) SET n.line = $that</code></pre> Historical Versioning Each node in the graph records all of its historical changes over time. When a node's properties or edges are changed, the change event and timestamp are saved to an append-only log for that particular node. This historical log can be replayed up to any desired moment in time, allowing for the system to quickly answer questions using the state of the graph as it was in the past. This is a technique known as Event Sourcing</a>, applied individually to each node. Syntax and Structure</h2> The first step when defining an ingest stream is to understand the overall shape of your data. This includes identifying the data elements necessary for standing queries to use in a MATCH</code>. An ingest query is defined by setting a type</code> described by the API documentation</a>. Quine supports eight types of ingest streams. Each type has a unique form and requires a specific structure to configure properly. For example, constructing an ingest stream via the /api/v1/ingest/{name}</code> API</a> endpoint to read data from standard in and store each line as a node looks similar to the example below. 
{ "type": "StandardInputIngest", "format": { "type": "CypherLine", "query": "MATCH (n) WHERE id(n) = idFrom($that) SET n.line = $that" } }</code></pre> Quine natively reads from standard-in, passing each line into a Cypher query as: $that</code>. A unique node ID is generated using idFrom($that)</code>. Then, each line is stored as the line</code> property of a node in the streaming graph. Info When creating an ingest stream via the API, you are given the opportunity to give the stream a meaningful name. For example, you can name the above ingest stream standardIn to make it easier to reference in your application. Alternatively, when you create an ingest stream via a recipe, Quine automatically assigns a name to each stream using the format INGEST-#, where the first ingest stream defined in the recipe is INGEST-1 and subsequent ingest streams are named in order with # counting up. Here is the same ingest stream defined in a Quine Recipe</a>. ingestStreams: - type: StandardInputIngest format: type: CypherLine query: |- MATCH (n) WHERE id(n) = idFrom($that) SET n.line = $that</code></pre>Ingest Stream Reporting</h2> Inspecting Ingest Streams via the API Quine exposes a series of API endpoints that enable you to monitor and manage ingest streams while in operation. The complete endpoint definitions are available in the API documentation. List all running ingest streams</a></li> Look up a running ingest stream</a></li> Pause an ingest stream</a></li> Unpause an ingest stream</a></li> Cancel a running ingest stream</a></li> </ul> Let's take a look at the information available from the INGEST-1</code> ingest stream from the Ethereum Tag Propagation</a> Recipe. Start the recipe. ❯ java -jar quine-x.x.x.jar -r ethereum</code></pre> List the ingest streams started by the Ethereum recipe using the /api/v1/ingest</code> endpoint. ❯ curl -s "http://localhost:8080/api/v1/ingest" | jq '.
| keys' [ "INGEST-1", "INGEST-2" ]</code></pre> The Ethereum recipe creates two ingest streams; INGEST-1</code> and INGEST-2</code>. Now, view the ingest stream stats using the /api/v1/ingest/INGEST-1</code> endpoint. ❯ curl -s "http://localhost:8080/api/v1/ingest/INGEST-1" | jq { "name": "INGEST-1", "status": "Running", "settings": { "format": { "query": "MATCH (BA), (minerAcc), (blk), (parentBlk)\nWHERE\n id(blk) = idFrom('block', $that.hash)\n AND id(parentBlk) = idFrom('block', $that.parentHash)\n AND id(BA) = idFrom('block_assoc', $that.hash)\n AND id(minerAcc) = idFrom('account', $that.miner)\nCREATE\n (minerAcc)>-[:mined_by]-(blk)-[:header_for]->(BA),\n (blk)-[:preceded_by]->(parentBlk)\nSET\n BA:block_assoc,\n BA.number = $that.number,\n BA.hash = $that.hash,\n blk:block,\n blk = $that,\n minerAcc:account,\n minerAcc.address = $that.miner", "parameter": "that", "type": "CypherJson" }, "url": "https://ethereum.demo.thatdot.com/blocks_head", "parallelism": 16, "type": "ServerSentEventsIngest" }, "stats": { "ingestedCount": 57, "rates": { "count": 57, "oneMinute": 0.045556443551085735, "fiveMinute": 0.06175571100053622, "fifteenMinute": 0.04159128290271318, "overall": 0.07659077758191643 }, "byteRates": { "count": 78451, "oneMinute": 62.49789862393008, "fiveMinute": 84.92629746711795, "fifteenMinute": 57.22987512826503, "overall": 105.41446006900763 }, "startTime": "2022-05-17T18:56:08.161500Z", "totalRuntime": 744041 } }</code></pre> Reporting on Ingest Stream progress using a Status Query When creating an ingest query via a recipe, you can add a status query that runs continuously. For example, the status query below prints the information for each graph node, and a link to the visualization in the web UI. statusQuery: cypherQuery: MATCH (n) RETURN count(n)</code></pre>Ingest Stream Blog Series</h2> This is just the beginning. There's lots more to cover. Over the next few weeks, we will cover the most common ingest streams in separate blog posts. 
Ingesting data from an Internet Source</a></li> Ingesting Multiple Sources/CSV Data</a></li> Ingesting Log Files</a></li> Ingesting Data from Kafka</a></li> Quine in a Data Pipeline</li> </ul> Try Adding Ingest Data To Quine Yourself</h3> And if you want to try Quine yourself, you can download</a> it here. And in addition to the Ethereum recipe, take a look at the Wikipedia Ingest</a> and Apache Log Analytics</a> recipes for different ingest stream examples. If you have questions or want to check out the community, join Quine slack</a> or visit our Github</a> page. Time Series Streaming Graph and other Quine 1.2.0 Highlights 2022-05-10T00:00:00+00:00 Time As Categorical Data and other Improvements</h2> Last week saw the release of Quine 1.2.0 and with it, some notable new features, several new recipes, and a cluster of performance related improvements that include two important changes impacting backwards compatibility. All in all, Quine Streaming Graph 1.2.0 is a significant update and worth a detailed look. Time Series Streaming Graph</h3> Reification of time – easily the most immediately useful feature of 1.2.0, reify.time</code> is a new custom Cypher procedure you can use to generate a structured representation of timestamp data. Think buckets of days, hours, minutes, and seconds represented as nodes. Does this make Quine a time series database now? More like a time series streaming graph. reify.time</code> makes it straightforward to create and connect events to time-based nodes and execute time-related data analysis via standing queries</a>. For time-series use cases when you're looking for real-time insights, the ability to treat time as categorical data and render it as nodes on the graph is pretty powerful. And if you want to quickly try it yourself, we’ve added a new Wikipedia Ingest</a> recipe. Here’s the reify.time</code> documentation page</a> so you can learn more. 
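To make "time as categorical data" concrete, here is a minimal Python sketch of the bucketing idea. It is not the `reify.time` implementation: `time_buckets`, the `FORMATS` table, and the bucket key format are assumptions made for illustration, and a `Counter` stands in for counter properties on time nodes. The period names are the ones passed to `reify.time` in the Wikipedia ingest query.

```python
from collections import Counter
from datetime import datetime, timezone

# Truncation format per period. The key format is an assumption for this sketch.
FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
}


def time_buckets(ts, periods):
    """Return one (period, bucket_key) pair per requested period."""
    return [(p, ts.strftime(FORMATS[p])) for p in periods]


counts = Counter()
events = [
    datetime(2022, 5, 20, 19, 50, 26, tzinfo=timezone.utc),
    datetime(2022, 5, 20, 19, 50, 59, tzinfo=timezone.utc),
    datetime(2022, 5, 20, 19, 51, 3, tzinfo=timezone.utc),
]
for ts in events:
    # Each bucket plays the role of a time node; bumping its count is loosely
    # analogous to calling incrementCounter on that node.
    for bucket in time_buckets(ts, ["year", "month", "day", "hour", "minute"]):
        counts[bucket] += 1
```

Because every event in the same minute maps to the same bucket key, events connect to a shared node for that minute, and the coarser buckets (hour, day, and so on) aggregate them further.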
Cypher Feature Support</h3> Cypher subqueries - with the introduction of the CALL {}</code> command, Quine now supports subqueries. Since Neo4j has the best Cypher documentation in the biz, here’s a link to their Cypher Manual page</a> on CALL {}</code>. OpenCypher and Java 11 - starting with 1.2.0, Quine has switched to OpenCypher, which requires Java 11 or later. Previous versions required Java 8 or later. REST API and Storage Enhancements</h3> REST API UPGRADE (DOCS) - we are preparing to move from the standard Swagger stacked REST API documentation to Stoplight’s Elements framework. Check it out on the REST API documentation page. If you have opinions or feedback, we’d love to hear it. Look for it in Quine itself in a near-term release. Switch to Stoplight Elements for improved API Docs Legibility Added quine.persistence.effect-order – This configuration option</a> is particularly useful for Cassandra users who want to use Quine+Cassandra as their database of record. quine.persistence.effect-order</code> can be set to either memory-first</code> or persistor-first</code>. The latter option, persistor-first</code>, is meant to support this use case directly. In the presence of system failure, with persistor-first</code> Quine will have durably stored updates before anything else occurs (e.g. triggering standing queries). Supernode performance enhancements - supernodes, or nodes with LOTS of edges, can negatively impact performance. As part of an ongoing effort to mitigate that impact, we have added two features to improve performance: Improved serialization for nodes with extremely large edge and/or property counts in the persistence backend. Large property counts don’t technically make a node a supernode, but the performance problems are similar.
thatDot platform users should note that a similar change was made to Quine cluster member behavior.</li> Nodes with an extremely large edge and/or property count can now be accessed via the Literal Operations REST APIs</li> </ol> Storage Format Change – data stored in Quine 1.1.2 and earlier can not be used in Quine 1.2.0 without migration. Additional Improvements - storage format in addition to improved serialization for nodes with large edge and/or property counts, updates to Quine 1.2.0 storage included: Calling the debug API on a node in a historical query now only includes journal events up to the time of the historical query</li> Rename Cassandra store options insert-time</code> and select-timeout</code> to write-timeout</code> and read-timeout</code>, respectively (#1733)</li> Bugfix: Setting snapshot-singleton=true</code>, snapshot-schedule=on-node-update</code>, and journal-enabled=false</code> no longer causes the most recent event on a node to be dropped</li> </ul> Download</a> and try Quine 1.2.0 today.</h3> thatDot, makers of Quine, Announces CrowdStrike Falcon Fund Investment 2022-04-27T00:00:00+00:00 We are both excited and proud to announce that CrowdStrike, through their Falcon Fund, have made an investment in thatDot. This news comes on the heels of the release of Quine</a>, our open source streaming graph software. The investment serves not only to validate our vision for what Quine can be, but the concrete progress we’ve made in executing on that vision. The vision we are executing on is both simple and audacious: we see engineers and data scientists using Quine as the central hub for high volume, real-time, complex event processing workflows at scale. A handful of Quine queries can replace months of development time and millions in costs, eliminating the need to build complex microservices architectures that drag down and stall analysis on streaming data. 
It is rare that such a revolutionary advance occurs in such an important infrastructure category. Investment from CrowdStrike and others will help thatDot scale more rapidly to meet the many applications of the platform across a large variety of use cases. Or, as Michael Sentonas, Chief Technology Officer at CrowdStrike said: “CrowdStrike and thatDot share a commitment to bringing speed and efficiency to data pipeline development teams through real-time, critical analysis of telemetry. The thatDot platform unlocks value for these teams, enabling them to understand and act upon massive amounts of data quickly and confidently.” </blockquote> For those of you who’ve already joined the Quine community</a>, thanks for helping us reach this important milestone. For those of you who are new to the community and to streaming graph, we want to welcome you to the adventure</a>. ‍ Your Graph DB Won’t Scale? Stop Querying it. 2022-04-14T00:00:00+00:00 </h2> It is almost inevitable when talking to data engineers or scientists about Quine streaming graph that they start ticking off all the graph databases they’ve already tried and how vastly different each was to operate. That’s not surprising. The tidy category name of graph database obscures what is in fact a pretty diverse set of database technologies. From purpose-built graph databases (Neo4J and TigerGraph) to triple stores (AWS Neptune), and from distributed graphs tightly coupled to column stores (Titan) to multi-model document stores with graph wrappers (ArrangoDB and OrientDB), you’ll find widely divergent</a> underlying data structures, indexing and storage architectures, performance profiles, and target use cases. 
They have two traits in common that always come up in these calls, though: they all behave like classic databases in that you must proactively query them to see if data is available and they’ve proven unable to scale</a> beyond what are today relatively modest workloads, especially increasingly common event stream processing workloads. Those two traits are in fact profoundly interconnected. Querying a database until you get the answer you need is such a deeply ingrained pattern when it comes to detecting patterns in data that we don’t even question it. It is also grossly inefficient. Compute resources are spent polling for data that is either not in the database or has been in the database for some amount of time. The only way to know if the data is available is to issue the query. Applying this model to event streams of even moderate volume (e.g. 1,000 events/sec) only compounds the problem, rendering graph databases incapable of delivering results in anything close to a reasonable timeframe. [link to benchmarking whitepaper] So when we describe how Quine can scale up to thousands of events/second per node, handle out of order and late arriving data, and doesn’t rely on time windows, people actually tell us they don’t believe us. That’s impossible, they say. How? Our answer? Stop querying your data. It sounds absurd or deliberately provocative, but this is exactly the design choice Quine makes. Quine is a streaming graph. It combines characteristics of complex event processing software (consuming high volume event streams from Kafka and Kinesis) with some of the defining aspects of graph databases. ‍ The evolution of Quine streaming graph. ‍ Quine supports the Cypher</a> and Gremlin</a> query syntax, its data structure is defined by nodes, edges, and properties of nodes, and it performs best when you structure your graph for the questions you need answered. 
Quine’s approach to finding complex patterns and relationships in event data, and the scale at which Quine operates, differs fundamentally. Quine doesn’t query the database. When is a query not a query? When it is a Quine Standing Query.</h2> The standing query</a> is what makes Quine different. And after spending several hundred words telling you not to query your data if you want to scale, I must admit the name might be misleading. But 1) naming is hard and 2) let’s focus on the standing part for now. ‍ A standing query is like a filter on streams of event data. You issue it once, it propagates into the graph and then…you wait. As events stream in, standing queries keep track of incremental changes in the graph state. At the instant a match is made, the standing query springs into action, triggering an arbitrary operation</a> you’ve specified – updating node properties, adding edges, writing out to a Kafka topic or a database. ‍ Traditional DB query patterns vs. standing queries in Quine streaming graph ‍How is this possible without expending tremendous compute resources? The key innovation that makes standing queries possible – and therefore allows Quine to scale to millions of events per second — is that Quine combines a graph data model with an asynchronous actor-based</a> compute model built on the same graph. In Quine, every node both stores data and can instantiate an actor as needed to send and receive messages and perform arbitrary computation on these messages. Actors are similar to threads in that they are computationally efficient, can be loaded on-demand and are reclaimed when no longer needed. Going back up a level, this design makes possible the incremental compute necessary for a standing query. As events are ingested, actors on nodes responsible for a standing query detect incremental changes to the graph state that matter to them and pass messages up a tree-like hierarchy of nodes responsible for coordinating the standing query. 
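The "issue it once, then wait" behavior can be sketched in a few lines of Python. This is a toy, single-process illustration and not Quine's actor-based implementation: `StandingQuery`, its pattern (a node holding a set of required properties), and the sample property names are all hypothetical.

```python
from collections import defaultdict


class StandingQuery:
    """Fires a callback the first time a node holds every required property."""

    def __init__(self, required, on_match):
        self.required = set(required)
        self.on_match = on_match
        self.seen = defaultdict(dict)  # node_id -> properties observed so far
        self.matched = set()           # nodes that have already fired

    def on_event(self, node_id, key, value):
        # Only the node touched by this event is re-examined; nothing is polled.
        props = self.seen[node_id]
        props[key] = value
        if node_id not in self.matched and self.required.issubset(props):
            self.matched.add(node_id)
            self.on_match(node_id, dict(props))


alerts = []
sq = StandingQuery({"login_failed", "password_reset"},
                   lambda node_id, props: alerts.append(node_id))

# Events may arrive in any order and interleaved across nodes.
sq.on_event("user-1", "password_reset", True)
sq.on_event("user-2", "login_failed", True)
sq.on_event("user-1", "login_failed", True)  # completes the pattern for user-1
```

The work done per event is proportional to the change, not to the size of the graph, and the action runs at the moment the last missing piece arrives.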
Again, all this is done with lightweight actors and no compute is expended except when incremental matches are made. Unlike with graph databases</a> [PDF; see section 6.3 Complex Queries], the patterns standing queries filter for can be quite complex without increasing query latency. In fact, the notion of query latency doesn’t really exist. A match is made when the data necessary for the match is present. No resources are expended until that happens. Then, and only then, is an action triggered. Standing queries aren’t the only reason Quine processes high-volume event data so effectively. Its use of semantic caching – also a byproduct of a graph-based data and compute model – and division of read and write concerns between an in-memory graph structure and write-optimized persistors</a> both contribute to its ability to scale. But it is the decision to not actively query the data that unlocks the performance at scale necessary for event-driven applications. Graph databases remain a great choice for many uses. And the graph query syntax they mainstreamed is highly effective for finding deeper, more valuable relationships between events. But as event-driven data grows in both volume and importance, graph needs to evolve away from the old database patterns. Learn more or try Quine Streaming Graph</h2> If you are interested in learning more about Quine’s architecture and design choices, or if you want to try Quine for yourself, visit Quine.io</a> for docs and downloads. ‍ 9 Events, 90 Days - See Quine Streaming Graph in Action 2022-04-10T00:00:00+00:00 9 Events, 90 Days. Whew!!!</h2> It has been an exciting couple of months since the launch of Quine, our open source software. The thatDot executive team has been connecting with developers, data engineers and other in-person and virtually to show how they can turn high volume data into high value data. 
During the rest of the year, our team will be at additional events - including meetups and conferences - across the US to talk all things data, streaming graph and Quine. If you will be at one of these conferences and would like to schedule a meeting contact us. Don't see something you can attend? Drop us a line at info@thatdot.com</a> and we'll add you to our mailing list and find a way to connect. And if you want to connect with other engineers and data scientists exploring Quine, join the Quine community slack channel</a>. Our team will be at the following events: ODSC East 2022</a>, Ryan Wright, CEO and Founder</h3> Boston, MA, April 19-21 ODSC – Open Data Science Conference – is the largest applied data science conference, essential for anyone who wants to connect to the data science community and contribute to the open source applications it uses every day. Sessions: Quine: A Streaming Graph for Event-Driven Data Pipelines</a></li> Noiseless Anomaly Detection with Streaming Graph A.I.</a></li> </ul> The Knowledge Graph Conference</a>, Ryan Wright, CEO and Founder</h3> New York City, May 2-6 The Knowledge Graph Conference is emerging as the premiere source of learning around knowledge graph technologies. We believe knowledge graphs are an underutilized yet essential force for solving complex societal challenges like climate change, democratizing access to knowledge and opportunity, and capturing business value made possible by advances in AI. 
Session: We will update as soon as the schedule is released.</li> </ul> Past Quine Streaming Graph Events</h2> For those who will not be at the events listed above, we have the presentation available to view to learn more about Quine: Recordings:</h3> Let’s Talk Data with Joe Reis https://www.youtube.com/watch?v=1P-iHaAPs4g&t=4s</a></li> Scala Love https://www.youtube.com/watch?v=RF_-ETNVr4Y&list=PLBqWQH1MiwBTMk9HV-RNN7sQpB9ZPi_Az&index=12</a></li> Global Big Data Conference https://www.youtube.com/watch?v=2-APGoQ8QnI</a>‍</li> ODSC Webinar https://www.youtube.com/watch?v=kpzjLTDhkoE</a></li> </ul> Presentation:</h3> ‍Introducing Quine: A streaming graph for modern data pipelines</a></li> </ul> ‍ ‍ Key Concepts to Help You Get Started With Streaming Graph 2022-04-06T00:00:00+00:00 Getting Started with Streaming Graph</h2> When I started at thatDot six weeks ago, I began keeping a journal of what I learned about Quine, and this new category of software, streaming graph. Journaling helps me organize my thoughts and it is both useful and fun to look in retrospect to see how my ideas and understandings evolved. It is also a great way to distill ideas and share concepts that I find challenging, exciting, or both And that’s what I hope this post can do for you: give you a good starting frame of reference for streaming graph while instilling in you the same sense of excitement I feel after six weeks of working with Quine. Start with the questions you need answered.</h2> Before using Quine ask yourself: do I know what question am I trying to answer? If you don’t, and you just want to load as much data as possible into a database and start exploring, Quine offers no great advantage over graph databases. But if there are patterns of events that you want to watch out for and, when detected, upon which you want to take action , Quine is the right choice. 
Consider a few examples: you want to detect, then block, fraudulent blockchain transactions as well as identify in real time all parties who transact with the source.</li> you are collecting sensor readings and real-time environmental data and want to combine them with historical readings and maintenance data in order to anticipate and head off costly outages.</li> </ul> In both cases the approach is the same: you start by identifying the patterns -- the collection of events related in a specific way to one another -- that indicate a suspect transaction or imminent system failure and what actions should be taken once that pattern emerges. These patterns help to determine which data to represent as nodes, which as properties of nodes, and which as edges expressing the relationship between nodes. Getting this right is the key to building an efficient streaming graph and is the essence of graph data modeling. Because graphs are based on the same subject-predicate-object structure we use to communicate, I personally find expressing data models as graphs straightforward. Quine streaming graph is a lot like graph databases except when it is not.</h2> Experience with graph databases and graph query languages (especially Cypher) will make getting started with Quine relatively easy. In fact, most of your Cypher queries will just work on Quine</a>. This ease of getting started can lead you to mistake Quine for a traditional graph database. Quine is much more than a graph database. Quine was built for a specific purpose – to apply the graph data model so you can detect complex relationships within event streams in real time. In its simplest form, Quine’s usage pattern is: consume streaming data from Kafka or Kinesis, and shape it into a graph. Then, create queries that find complex patterns among the relationships inside the graph, and trigger some arbitrary action, including modifying the graph itself or writing back out to Kafka</a> or Kinesis</a>.
The key here is finding patterns within streams of data and taking action on the matches in real-time. If analogies are your thing, the difference between Quine streaming graph and a graph database is the difference between setting a net across a raging torrent to catch what flows through it and casting and recasting a net into a lake until you catch something interesting. If Quine behaved like a legacy graph database, you’d need to checkpoint the stream, ingest a sampling of data, then query it and subsequent checkpoints continuously until a match was made and only then could you take action. Not only are you expending resources with needless queries, but you are also introducing a delay between when a pattern is complete and action is taken. Instead, using standing queries (more on this later) in Quine, you can trigger an action the instant a match occurs. This has profound implications on the design and performance of your solution, allowing you to fine tune the graph structure and queries to zero in on the patterns in event data that matter most to you. Or in other words, you still need to develop the questions to ask of your data to solve or, avoid your business concerns. When and how to work on data: ingest and standing queries.</h2> Quine operates on data using two contexts - ingest querie</a>s and standing queries</a>. Together, these control the majority of the interactions you’ll have with the streaming graph. Standing queries, because they are such a unique and powerful feature of Quine, tend to get a lot of attention. But I’ve learned that the ingest query, while not as flashy, plays a critical role in building an efficient streaming graph. Think of ingest queries like an ETL processor. They connect to one or more data sources, classify data as nodes, properties of nodes, and may even create edges, then load the data into the graph. The ingest query sets the structure that your questions will take. 
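The ETL comparison can be made concrete with a small Python sketch of the "shape" step: one raw event goes in, nodes with properties and edges come out. Nothing here is Quine code. `id_from` is a stand-in for Quine's `idFrom` built on `uuid5` (an assumption, chosen only because it is deterministic), `shape_event` is a hypothetical helper, the field names are borrowed from the Wikipedia page-create ingest query, and the sample values are made up.

```python
import uuid


def id_from(*parts):
    """Stand-in for idFrom: the same inputs always produce the same ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_OID, "|".join(map(str, parts))))


def shape_event(event):
    """Split one raw event into nodes (with properties) and edges."""
    rev = id_from("revision", event["rev_id"])
    db = id_from("db", event["database"])
    user = id_from("id", event["performer"]["user_id"])
    nodes = {
        rev: {"type": "rev", "rev_id": event["rev_id"]},
        db: {"type": "db", "database": event["database"]},
        user: {"type": "user", **event["performer"]},
    }
    # Edges are (from_id, label, to_id) triples.
    edges = [(rev, "db", db), (rev, "by", user)]
    return nodes, edges


# Made-up sample event for illustration only.
event = {
    "rev_id": 1001,
    "database": "enwiki",
    "performer": {"user_id": 7, "user_text": "example-user"},
}
nodes, edges = shape_event(event)
```

Because the IDs are derived from the data itself, a second event that mentions the same database or user lands on the same node, which is what lets separate events connect into one graph.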
Standing queries take over from ingest queries to look for patterns as they emerge from the event streams. The instant a match is made, standing queries trigger actions. Standing queries can send data to an external system (e.g., writing out to Kafka) or perform actions on the graph itself, like creating new nodes, updating properties, or creating and updating edges. Event driven data in. Data driven data out. The best way to start learning about the role each query type plays is to look at recipes. The Apache Log recipe</a> provides a good example of the ingest query in action, extracting data using regex to create two node types (log, which represents an HTTP request and verb, which represents the HTTP Method). At the same time, it connects the log nodes to their associated HTTP Method using the verb edge. It gives you a great sense of how the structure of the graph maps to the sorts of questions you’d want to ask.</li> The CDN Cache Efficiency recipe</a> uses a relatively simple ingest query and showcases the power of the standing query to transform the graph, incrementing counters when there are cache misses and classifying network elements based on reliability.</li> </ul> ‍Putting it all together</h2> With all this in mind – how streaming graph differs from graph DBs, the need to start with the question and work backward, and using recipes to understand the role of ingest and standing queries – pulling down, dissecting, and then modifying one of the advanced recipes is the next logical next step. I learned a lot exploring the Ethereum Tag Propagation recipe</a>. It uses a live data feed from the Ethereum blockchain and both the ingest and standing queries are robust examples. The Ethereum recipe and the recipes mentioned above are all available as open source on Quine.io</a>. Interacting with streaming data in Quine opened my eyes to the possibilities of streaming graph solutions. Give it a try yourself. 
Quine is easy to get up and running from the download</a> page. Let me know how you do once you’ve had a chance to experiment. Reach out on the Quine community slack channe</a>l, I’m @allan. Did you figure out something really cool using Quine? Share your work with the community through Github</a>. Submit a pull request with your recipe and a short description. ‍ The Evolution To Streaming Graph from Graph Databases 2022-03-22T00:00:00+00:00 After a decade of growth in large scale data repositories -- databases, data warehouses, data lakes, even data lake houses -- a perceptible shift</a> is underway in internet architecture toward event-driven programming</a>. This shift is being spurred on by a wide range of factors including consumer demand for real-time responsiveness, businesses seeking to personalize the customer experience to maximize engagement, and a shared desire for more effective security protections. Sources of data to deliver on these objectives is a solved issue. The inexorable digitization of commerce, media and social experiences has turned every customer interaction into an ever-growing stream of event data from which companies wish to distill actionable insights in real time</a>.. Extracting high confidence insights through the joining of multiple data sources to expand the context of understanding, however, remains a challenge. Using Databases to Query Event Streams: Nobody Wins‍ Together, the enormous scale of data and the requirement to process it in real time poses a huge challenge to companies embracing this new event-driven paradigm</a>. Event streaming solutions like Kafka and Kinesis have quickly evolved to organize and operationalize data streams. However, once companies have turned their data into event streams, they encounter a second problem: how to query these streams in real time. Databases have proven ill-suited for this task. 
Processing such big data volumes to extract valuable insights has made it necessary to break the endless stream of events into batches just so existing databases can join data together and apply computation to it. This means waiting for batches of data to load, a process that regularly takes hours. And then you are only querying these batches, or snapshots, of data. This is neither streaming nor real-time. It is straightforward to translate RDBMS data into graph data structures. Quine’s property graph structure encodes relationships as edges. The limitations of databases manifest in the enterprise as complex systems composed of various libraries and services</a> tied together with custom code deployed as microservices to cross the chasm between high-throughput event streaming platforms and the lower-throughput but higher value operations of databases. Data engineers have been quite clever in developing sophisticated solutions to deal with these challenges. The typical solution is a complex set of microservices that allows for agile adaptation to new constraints as they surface, but results in an application stack that is difficult to support and which requires expert consultation to manage or change. This complexity inevitably leads to a rebuild every 18-24 months: a poor investment and solution. Event Stream Processing Addresses only Half the Problem To query the event streams in real time, developers had to turn to event stream processing systems like Flink and Spark. They make it possible for developers to use familiar SQL syntax to query the event streams. And while this involved some tradeoffs like time windows</a>, in many regards event stream processing systems have been quite successful. Many can scale to process millions of events per second. But these systems are designed for the relational data model, which while pervasive in industry, lacks the expressive query structures needed to find complex patterns in streams.
And it is in the detection of these complex patterns that the real power of event stream processing lies.

So what are we left with? Graph databases, which are well suited for finding complex relationships within large data sets at rest, were not designed for real-time event data streams. They simply can’t keep up. And event stream processing systems lack the ability to query for complex relationships, let alone to do so without resorting to time windows.

Enter Streaming Graph

There is a better way: the streaming graph. Streaming graph is a variety of “streaming database,” an emerging class of software designed specifically to process infinite streams of data. Quine streaming graph brings together the scalability of event stream processing systems with the ability to query for complex relationships offered by graph databases, instead of trying to engineer around the shortcomings of graph databases.

Quine’s streaming graph technology includes some important innovations:

Uses a graph data model to understand the relationships between data natively, without the need for joins, nested joins and foreign key management
Continuous and incremental application of queries to newly arriving data, eliminating the need for time windowing
Distributes and parallelizes read and write operations to deliver high-throughput, low-latency queries at scale

Quine streaming graph offers teams a simple, drop-in solution capable of ingesting high-volume Kafka or Kinesis data streams with sub-millisecond query performance, even when the query involves deep graph traversal. Quine also eliminates time windows and makes it possible to handle out-of-order and late-arriving data, a common limitation of other event stream processing systems. Streaming graph fills the architectural gap that exists between high-volume event streaming and high-value graph database computation.
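To make the idea of continuous, incremental queries without time windows more concrete, here is a minimal Python sketch. It is not how Quine is implemented; the class name `IncrementalMatcher`, the event kinds, and the sample events are all hypothetical. It only illustrates the behavior described above: partial matches are kept as state, so a pattern completes whenever its last piece arrives, regardless of arrival order or elapsed time.

```python
from collections import defaultdict

# Hypothetical pattern: a "login" and a "download" event for the same user.
REQUIRED = {"login", "download"}


class IncrementalMatcher:
    """Keeps partial matches per key and fires once the pattern is complete."""

    def __init__(self, required):
        self.required = frozenset(required)
        self.seen = defaultdict(set)  # key -> event kinds seen so far
        self.fired = set()            # keys that have already been reported

    def on_event(self, key, kind):
        """Process one event; return True only when it completes the pattern."""
        if kind not in self.required:
            return False
        self.seen[key].add(kind)
        if key not in self.fired and self.seen[key] == self.required:
            self.fired.add(key)
            return True
        return False


matcher = IncrementalMatcher(REQUIRED)
# Events arrive out of order; no window boundary decides whether they match.
events = [("alice", "download"), ("bob", "login"), ("alice", "login")]
results = [matcher.on_event(user, kind) for user, kind in events]
print(results)
```

Because the partial state lives with the key rather than inside a window, a late "login" for alice still completes the match that an earlier "download" started.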
The combination of a native understanding of data relationships with high-volume event processing provides a new tool for realizing the real-time use cases event-driven programming is meant to enable. Graph AI techniques can now come out of the lab and drive the next generation of recommendations, root-cause analysis, fraud and security threat detection in production.

Learn more about Quine streaming graph, available both in open source and enterprise editions, at www.thatdot.com

Let's Talk Streaming Graph! (with demo)

2022-03-16T00:00:00+00:00

TGIF! Let's Talk (streaming graph) Data

On Friday, 11 March Ryan Wright sat down with Joe Reis and Matt Housley, CEO and CTO respectively of Ternary Data, to discuss Quine, the open source streaming graph. What followed was an hour of remarkably incisive questions and well-informed discussion. In addition to the video, we've pulled out the transcript of some particularly good bits. If you like this, give Joe and Matt's show a follow.

Excerpt One 0:46

Joe (Ternary): I'm sorry, what the heck is a streaming graph?

Ryan (thatDot, Quine): So it's kind of like, imagine the unholy love child of Apache Kafka and Neo4J... something sort of like a graph database, but aimed at high volume event stream processing. The goal is really to interpret what does that data stream mean? Like the data in that high volume event stream. What does it mean? And that's what we build big micro service architectures for.

Quine is this standalone application meant to help make that process a whole lot easier. So the idea is basically you can consume that event data, form it into a graph, because the graph is so expressive, and so powerful, to look at and relate data to each other and analyze it, except that it's really fast. And that's really been the linchpin for the database mindset: [people think] graph DBs are cool, but they're too slow.
So Quine is trying to change that and bring fast streaming graphs that trigger action to the world of Event Stream Processing.

Excerpt Two 6:16

Matt (Ternary): It's interesting to me, too, that you're, you're solving these very hard graph problems. But you guys also decided to adopt a real time analysis model. And by that, I mean, people complain a lot about micro batch. Micro batches is often a perfectly good approach. But you're saying basically, that not only can you process this graph, but you're not doing it in like ten second micro batches, you're actually taking each event and saying, Okay, this event arriving triggers or this piece, this node triggers some kind of analysis that I can see things happening almost immediately. What kind of latency are we talking for that processing to happen?

Ryan: Yeah, oh, great question. So we've, we've measured it, and to kind of set the stage a little bit, the kind of thing that we're typically looking for is, look for a graph pattern that is maybe four or five nodes connected in a certain kind of way. And so each node in that graph, you might think of as one row in a relational table, you know, there's an equivalence there to say these are the same kinds of things. And so when you traverse an edge, you're doing a join between tables to say...this row here is connected to that row in that table over there. And so it's the kind of situation that is usually join, join, join, join, join, there's your answer. And so to do lots of those (joins) is just untenable. And so the reason to go in the direction of a graph is because it gives us instead of having to join tables, we just get the small little units of a node. And so we can hop across four, five, six, ten, fifty, you know, any number of nodes, in a much more efficient way than if we're joining tables together.
And so what we've done and measured for some of our applications is: look for patterns that are four or five nodes, and then measure the latency that it takes to compute that in the single digit microseconds. So something like 8,000 or 9,000 nanoseconds has been some of the fastest stuff that we do. Which to anybody who's worked in this space, you should have the response that is: 'No way, I don't believe it.'

Joe: [laughing] I don't believe it.

Quine Demo

There's also a demo if you're interested, starting at the 40:21 mark. If you want to learn more, check out the Quine.io docs and never hesitate to jump into the Quine Community Slack channel.

Recipe for Streaming Graph Success

2022-03-10T00:00:00+00:00

Quine Recipes Make Getting Started Easy

In the world of infrastructure software, there is a certain cachet associated with standing up and operating vast, complicated systems. Like tearing down and rebuilding a motor or supercharging your 3D printer, the challenge appeals to the engineer’s mind. At least until the third or fourth time you have to redeploy one of those complex systems because of a particularly pernicious gremlin. That’s when you start asking yourself (usually at around 3 am the night before launch) why someone can’t make a distributed system that is both easy to deploy and designed to scale up to production workloads.

This is the exact question that led to the creation of Quine streaming graph and, more recently, the introduction of Quine recipes.

Intro to Quine Streaming Graph Recipes

A recipe, in simple terms, is a (YAML or JSON) document containing all the information Quine needs to execute any batch or streaming data processing task. It is referenced when invoking Quine and is often used for modeling, development and testing on local systems.
Here’s an example of a recipe that creates a graph by ingesting each line in "$in_file" as a graph node with property "line":

version: 1
title: Ingest
contributor: The thatDot Team
summary: Ingest input file lines as graph nodes
description: Ingests each line in "$in_file" as graph node with property "line".
ingestStreams:
  - type: FileIngest
    path: $in_file
    format:
      type: CypherLine
      query: |-
        MATCH (n)
        WHERE id(n) = idFrom($that)
        SET n.line = $that
standingQueries: [ ]
nodeAppearances: [ ]
quickQueries: [ ]
sampleQueries: [ ]

Pretty simple. But don’t underestimate the extensibility of recipes. Using the same simple template as this recipe, you can configure Quine to ingest and process multiple event streams, build highly connected graphs, and set up standing queries that do everything from handling out-of-order and late-arriving data to writing results back into the graph or out to Kafka topics.

And once you’ve constructed your recipe, everyone on your team has a handy reference for what you’ve built. This is especially useful for recurring tasks, like log processing (see the Apache Access Log recipe), or for teams that are growing or that want to maintain continuity as people come and go. Did I mention that you can embed comments in your recipes?

Recipes are also a great way to contribute back to the community. For example, community member Alok Aggarwal contributed a recipe for calculating CDN cache efficiency that is already among the most popular on the Quine site.

Best part: once you’re satisfied with the recipe, it can be pushed to a production system via the Quine RESTful API.
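To see what the Ingest recipe's Cypher query does to each line, the following Python sketch mimics it with a plain dictionary standing in for the graph. This is an illustration only: `id_from` is a hypothetical stand-in built on SHA-256 and is not Quine's actual idFrom algorithm, and `ingest_lines` is an invented helper name.

```python
import hashlib


def id_from(*values):
    """Stand-in for idFrom: equal inputs always yield the same ID.

    Assumption: this hashing scheme is for illustration, not Quine's own.
    """
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


def ingest_lines(lines, graph=None):
    """Mimic: MATCH (n) WHERE id(n) = idFrom($that) SET n.line = $that."""
    graph = {} if graph is None else graph
    for line in lines:
        that = line.rstrip("\n")
        # The node is looked up by its computed ID, or created if absent.
        node = graph.setdefault(id_from(that), {})
        node["line"] = that
    return graph


graph = ingest_lines(["GET /index.html", "GET /about.html", "GET /index.html"])
print(len(graph))
```

Because the node ID is derived from the line's content, ingesting the same line twice touches the same node instead of creating a duplicate.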
Anatomy of a Quine Streaming Graph Recipe:

To develop a recipe that is executable from a command line, you may use the following YAML template as a starting point:

version:
title:
contributor:
summary:
description:
ingestStreams: []
standingQueries: []
nodeAppearances: []
quickQueries: []
sampleQueries: []
statusQueries: []

The first five items – version number, title, author (you, the contributor), as well as summary and description (optional but nice to have) – are pretty self-explanatory. If you plan to submit the recipe to Quine.io, the optional fields should be filled in to provide the community with context for your recipe, and any details such as data source and output formats.

The next two sections, ingestStreams and standingQueries, define your recipe’s behavior.

Ingesting and Modeling Data in the Streaming Graph

The first query type we will build is an Ingest Stream. Information in the Ingest Stream provides everything Quine needs to find and consume data in order to build a streaming graph.

Quine was specifically designed to handle the demands of high-volume streaming data. You can use recipes to ingest from Kafka, Kinesis, and SNS/SQS. In addition to event streaming sources, you may ingest data from files, named pipes, and stdin.

In the simple Ingest example above, FileIngest indicates the source is a file, and the CypherLine format passes each line to the Cypher query as $that. Data ingested from a file is read into the system line by line and, from a functional perspective, behaves just like a stream when consumed. The only difference is that a file is automatically read as fast as the system can handle, whereas a stream may be rate limited by the incoming data.

Standing Queries: Quine’s Superpower

Now that we have data ingested into the graph, we should do something with it (although you don’t have to).
Let’s set up a standing query. Standing queries persist in the graph, waiting until a query condition is matched, and then trigger an action (e.g., updating the graph, executing code, or writing to Kafka). Standing queries are definitely worth mastering.

With every standing query, we have to provide two things. First, we need to provide the pattern for the system to match, then describe the action we want it to take in the form of a query output.

The simple example we started with doesn’t include a standing query, so let’s take a look at the one from another recipe (the Ethereum recipe), which propagates the tainted flag along outgoing transaction paths:

standingQueries:
  - pattern:
      query: |-
        MATCH (tainted:account)<-[:from]-(tx:transaction)-[:to]->(otherAccount:account), (tx)-[:defined_in]->(ba:block_assoc)
        WHERE tainted.tainted IS NOT NULL AND NOT EXISTS (ba.orphaned)
        RETURN id(tainted) AS accountId, tainted.tainted AS oldTaintedLevel, id(otherAccount) AS otherAccountId
      type: Cypher
      mode: MultipleValues
    outputs:
      propagate-tainted:
        query: |-
          MATCH (tainted), (otherAccount)
          WHERE tainted <> otherAccount
            AND id(tainted) = $that.data.accountId
            AND id(otherAccount) = $that.data.otherAccountId
          WITH *, coll.min([($that.data.oldTaintedLevel + 1), otherAccount.tainted]) AS newTaintedLevel
          SET otherAccount.tainted = newTaintedLevel
          RETURN strId(tainted) AS taintedSource, strId(otherAccount) AS newlyTainted, newTaintedLevel
        type: CypherQuery
        andThen:
          type: PrintToStandardOut

Optional Recipe Elements to Customize the Experience

Now that you’ve ingested and are querying data, you can use the remaining parameters to customize the user experience.
nodeAppearances: [] customizes the web exploration UI
quickQueries: [] adds queries to node context menus in the web exploration UI
sampleQueries: [] customizes the sample queries listed in the web UI
statusQueries: [] specifies a Cypher query to be executed and reported to the recipe user

Try the Ethereum Tag Propagation Recipe

If you are interested in learning more about recipes, there’s no better way than to try one for yourself. I recommend the Ethereum Tag Propagation recipe because it uses actual live data and its use case – detecting tainted transactions on the blockchain – is more relevant by the day.

And if you are interested in creating your own recipe, here are some additional reference resources to get you started:

Recipe Reference
Cypher Language Reference
Writing and Contributing to the Community

Thanks for taking the time to read this and bon appétit!

Computing Recursive Rollups in a Kafka Event Streaming Pipeline

2022-02-24T00:00:00+00:00

Streaming Graph Combines Graph DBs with Stream Event Processing

Quine's graph-based streaming event architecture simplified my processing codebase. I tried pre-aggregating rollup data using a Kafka Streams application and found that solutions like Kafka Streams KTables were not well suited to, or a natural fit for, my data set. So my team fell back on relational database (RDBMS) path table patterns to represent the hierarchical data. With the RDBMS, we recomputed the rollups using complex recursive function queries through the path-table structure each time our UI needed to display that data. Using Quine, I replaced the complex queries with succinct Cypher queries for a stream-computed rollup value that updates at each underlying event change.

The Cypher query language used in Quine encourages treating relationships as a primary quality of your data.
It provides a rich set of features for constraining queries across multi-step relationships. The depth constraints on the relationship allowed me to replace those recursive SQL functions with an n-depth relationship expression. As I will show later, I was able to match all tiers of my hierarchical data with a single expression.

In my use case, I have a hierarchical graph of metadata that groups sets of pass/fail Requirements. Event data carrying pass/fail results relate directly to leaf nodes in the hierarchical graph. Complete sets of events are produced for multiple subjects, and we need to be able to provide the percent pass vs. fail for each subject at every level of the grouping hierarchy.

Inserting Quine into my Kafka streaming pipeline is done by simply consuming from topics that are already inputs in Kafka. When deployed, Quine is configured to ingest data from my existing results topic and my grouping data topic. The new outputs from Quine are subscribed to by the streams recording service and written to the database.

Ingesting the Data

In my system, when the grouped Requirement data is updated, it is streamed out over a Kafka topic. Systems being evaluated for satisfaction of these requirements produce Result events that are disembodied from the grouping context. They also flow in on a Kafka topic.

I started by populating the Quine graph with my hierarchical data. From my perspective, the easiest approach is to first pre-process the groups and leaf nodes into content suited for streaming input. So, a JSON document that has nested data structures can be flattened into expressions of nodes and their relationships. This upfront transformation can easily be performed by something like a Kafka Streams application.
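The flattening step described above can be prototyped outside Kafka Streams. The Python sketch below is one possible version, not the author's actual transformation: the function name `flatten_groups` is hypothetical, the field names (`groups`, `items`, `name`, `group`, `type`) follow the sample records in this post, and any other fields a real record carries are ignored here.

```python
def flatten_groups(node, parent="root"):
    """Yield one flat event per group and requirement, naming its parent group."""
    for group in node.get("groups", []):
        yield {"name": group["name"], "group": parent, "type": "group"}
        # Requirements ("items") hang directly off this group.
        for item in group.get("items", []):
            yield {"name": item["name"], "group": group["name"], "type": "requirement"}
        # Recurse into nested groups, passing this group down as the parent.
        yield from flatten_groups(group, parent=group["name"])


document = {
    "groups": [
        {"name": "group-1", "groups": [
            {"name": "group-1-1", "items": [
                {"name": "requirement-1.1-a"},
                {"name": "requirement-1.1-b"},
            ]},
        ]},
    ]
}

events = list(flatten_groups(document))
for event in events:
    print(event)
```

Each emitted event carries only a name, a type, and the name of its parent group, which is all the ingest query needs to build the has_parent edges.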
I transform something like this document:

{
  "groups": [
    {
      "name": "group-1",
      "groups": [
        {
          "name": "group-1-1",
          "items": [
            { "name": "requirement-1.1-a", … },
            { "name": "requirement-1.1-b", … }
          ]
        },
        …

Into this list of events that express the parent group relationship:

{ "name": "requirement-1.1-a", "group": "group-1-1", "type": "requirement", … }
{ "name": "requirement-1.1-b", "group": "group-1-1", "type": "requirement", … }
{ "name": "group-1-1", "group": "group-1", "type": "group", … }
{ "name": "group-1", "group": "root", "type": "group", … }

When the data flows into Quine, it is consumed by two ingest queries. I used a different query for each value of type in my input JSON. The query is expressed as JSON data that you POST to the /api/v1/ingest/<name> REST endpoint. You can set <name> to a unique value by which the ingest query can be identified. For example, my query path was: /api/v1/ingest/test_groups.

{
  …,
  "format": {
    "query": "
      WITH * WHERE $that.type = 'group'
      MATCH (g) WHERE id(g) = idFrom('group', $that.name)
      OPTIONAL MATCH (parent) WHERE id(parent) = idFrom('group', $that.group)
      CREATE (g)-[:has_parent]->(parent)
      SET g = $that, g:Group
    ",
    "type": "CypherJson"
  }
}

Note: In production, I ingest my data from the Kafka streams. However, for experimentation I used the FileIngest type of ingest query. This allowed rapid exploration of Quine without having to force events to be sent through Kafka. In that case, you would add fields like: type: FileIngest, path: /json/file/to/load.json

This ingest query matches all incoming records where the JSON record referenced by $that has a field named type with a value of group. The MATCH clause binds a node that we will refer to as node variable g. The WHERE clause says the id of the node g must be equal to the result of the idFrom function.
The idFrom function creates a unique, reproducible ID from the sequence of values given. In this case we include our node type and record name. You do not have to worry about whether the node already exists. If it does, g will reference that node. If it does not, g will materialize a node for the computed id. The SET part later assigns all the fields in the incoming record as properties of the node, updating g if it previously existed.

This ingest query also creates a relationship between node g and the parent node identified by the group field in my incoming records. The OPTIONAL MATCH creates a secondary node query, exactly like the one for node g, but this time the values passed into idFrom represent the parent node. Then we declare a relationship using CREATE that says node g relates to the parent node, and we name that relationship has_parent.

Quine adds the idFrom function to the Cypher query language, along with the implicit materialization of nodes in match statements, to facilitate building the graph from streaming sources without requiring your application to be concerned with the order of data insertion. If we know the expected identifier for the node that we are creating a relationship to, then it will materialize. It just will not have the attributes until a specific record adds them.

I repeated this pattern to also ingest the requirement records. At this point my hierarchical data is represented in the Quine graph.

Next, I have event records that have relationships to a Subject and a Requirement.
This one is a little more complicated:

{
  …,
  "format": {
    "query": "
      MATCH (result), (requirement), (subject)
      WHERE id(result) = idFrom('result', $that.subjectId, $that.requirementName)
        AND id(subject) = idFrom('subject', $that.subjectId)
        AND id(requirement) = idFrom('requirement', $that.requirementName)
      CREATE (subject)-[:results]->(result)-[:requirements]->(requirement)
      SET result = {status: $that.status, timestamp: $that.timestamp}, result:Result,
          subject = {subjectId: $that.subjectId}, subject:Subject
    ",
    "type": "CypherJson"
  }
}

The above ingest query matches three different nodes as the data flows in, based on identifying data in the event record. It then assigns the event data to the subject and result nodes and creates two relationships: one from subject to result and one from result to requirement.

The graph may look something like this now:

Graph visualization using the Quine web UI.

Streaming data out of Quine

My goal was to have the rollup computations performed in Quine and emitted as streaming events when any of the result nodes change. I record those in our relational database for the web service’s API to retrieve with simple queries.

To emit the data, we create a Standing Query. This query identifies changes that match and can trigger additional queries to fetch data and emit it to one of several outputs, such as a log file or a Kafka topic.

The first part of the Standing Query includes the pattern of data that triggers further behavior. Normally, a Standing Query is triggered only when new items join the set of results matched by the query. Setting the includeCancellations option to true requests that the Standing Query also be triggered for nodes that stop matching the criteria.
This works to detect any substantive change in my data set, as I have a status field that is either PASS or FAIL.

{
  "includeCancellations": true,
  "pattern": {
    "query": "
      MATCH (s :Subject)-[:results]->(res :Result)-[:requirements]->(req :Requirement)-[:has_parent]->(g :Group)
      WHERE res.status = 'PASS'
      RETURN DISTINCT id(res) as id
    ",
    "type": "Cypher"
  },
  …
}

Next, we want to do something in response to finding these results that change their status. Namely, I want to aggregate the total and passing counts for each level in the graph of groups for each subject that has results. So, we will add an output to the standing query:

"outputs": {
  "createGroupResult": {
    "type": "CypherQuery",
    "query": "
      MATCH (subject :Subject)-[:results]->(res :Result)-[:requirements]->(:Requirement)-[:has_parent*]->(group :Group)
      WHERE id(res) = $that.data.id
      MATCH (subject)-[:results]->(allres :Result)-[:requirements]->(:Requirement)-[:has_parent*]->(group :Group)
      RETURN id(subject) as subject_id, id(group) as group_id,
        sum(CASE allres.status WHEN 'PASS' THEN 1 ELSE 0 END) AS pass,
        count(allres) as total
    ",
    …

That creates another CypherQuery that consumes the result IDs that changed and computes the rollups needed. It first matches on the subject related through that result, and with the relationship expression -[:has_parent*]-> it matches all the groups in the parent hierarchy. Then it performs another match constrained to that subject and each of those groups, where it finds allres (all the results in between) and uses the aggregation functions sum and count to produce the pass and total values.

That little ‘*’ asterisk replaces the need we had for recursive query functions in my RDBMS, which are complex and difficult to work with in SQL. In a graph, walking an n-length chain of relationships is a natural fit.
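For readers who want to check their intuition about what the rollup query computes, here is a small Python sketch of the same aggregation over an in-memory hierarchy. It is only a model of the logic, not how Quine evaluates the query: the `HAS_PARENT` mapping, the sample results, and the function names are made up for illustration, and the walk in `ancestors` plays the role of the [:has_parent*] expression.

```python
from collections import defaultdict

# Child -> parent edges, mirroring has_parent in the graph (sample data is made up).
HAS_PARENT = {
    "requirement-1.1-a": "group-1-1",
    "requirement-1.1-b": "group-1-1",
    "group-1-1": "group-1",
}


def ancestors(name):
    """Walk has_parent edges upward, one hop at a time, until the chain ends."""
    while name in HAS_PARENT:
        name = HAS_PARENT[name]
        yield name


def rollup(results):
    """Map (subject, group) to [pass, total] for every group above each result."""
    totals = defaultdict(lambda: [0, 0])
    for subject, requirement, status in results:
        for group in ancestors(requirement):
            counts = totals[(subject, group)]
            counts[0] += 1 if status == "PASS" else 0
            counts[1] += 1
    return dict(totals)


totals = rollup([
    ("subject-1", "requirement-1.1-a", "PASS"),
    ("subject-1", "requirement-1.1-b", "FAIL"),
])
print(totals)
```

Every result contributes to each group above it, so both group-1-1 and group-1 end up with one pass out of two results for subject-1.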
With an RDBMS it is often necessary to predict this sort of query requirement and denormalize the data on insertion to accommodate efficient retrieval. Otherwise it requires fairly complex queries. When solving this with an RDBMS you are creating a bespoke, poor man’s graph data store on top of models optimized for completely different purposes. However, you do so without any of the tools inherent in Quine and the Cypher query language that allow working with the data efficiently.

For future queries, I wanted to record these aggregations back into the graph. So before chaining in an external output to the Kafka stream, I add another CypherQuery to the chain:

"andThen": {
  "type": "CypherQuery",
  "query": "
    MATCH (subject :Subject), (group :Group)
    WHERE id(subject) = $that.data.subject_id AND id(group) = $that.data.group_id
    MATCH (gr) WHERE id(gr) = idFrom('group_result', id(subject), id(group))
    CREATE (subject)-[:group_results]->(gr)-[:has_group]->(group)
    SET gr = {pass: $that.data.pass, total: $that.data.total}, gr:GroupResult
    RETURN subject.subjectId as subjectId, group.id as groupId, group.name as name, gr.pass as pass, gr.total as total
  ",
  …

This matches the group and subject from the previous output IDs, and then materializes a GroupResult that holds the pass and total values with relationships to that group and subject. Finally, it defines the return expression, which will be a JSON object containing subjectId, groupId, name, pass, and total fields. We send them on to an external output by adding an andThen to the previous CypherQuery.

Note: for testing purposes, much like you can ingest from files, you can output to a file with a WriteToFile in the output chain: andThen:{type:WriteToFile, path: rollups.json}

A Win-Win Situation

My organization has chosen to use Kafka streaming to produce our event data flows, as well as to build our web services APIs on top of traditional RDBMS data stores for skill set purposes.
By emitting my rollup values and storing them in the RDBMS, I am effectively materializing the view required to optimize answering the data questions our API authors encounter.

Working with the graph provides a more flexible approach than simulating a graph in an RDBMS with recursive queries. Ramping up on the Cypher query language was quick, as there are concise training materials in video form on various learning websites. Learning to ingest the data, compute the rollups and egress the data from Quine took me about the same amount of clock time as a team of people spent expressing the recursive queries and tuning them in the RDBMS.

The biggest issue is that, as a rule, with an RDBMS you must optimize your data insertion model for your data retrieval needs. This can lead to substantial changes for little questions. With Quine’s graph model, it is much easier to make incremental changes to the event data models without rewriting the whole pipeline.

Additionally, by feeding the rollups back into the Quine graph, they are available as inputs to further standing queries. It now becomes easy to create further events when the rollups change, such as raising a red flag if one of the subjects scores too low for one of the top-tier groups.

Courses I used to ramp up on the Cypher query language:

NoSQL: Neo4j and Cypher (Part: 1-Beginners) https://www.udemy.com/course/neo4j_beginners1/
NoSQL: Neo4j and Cypher (Part: 2-Intermediate) https://www.udemy.com/course/neo4j_intermediate/

Announcing Open Source Release of Quine Streaming Graph

2022-02-23T00:00:00+00:00

The World's First Streaming Graph for Categorical Data

We at thatDot are pleased to announce today the release of Quine, our streaming graph for event processing, as an open source project. Quine’s unique approach combines graph data and streaming technologies into a modern, developer-friendly open source software package.
For the first time, teams can process categorical data in real time without resorting to encoding methods.

Developers and data pipeline engineers use Quine to rapidly build high-volume, real-time, complex event processing workflows at scale. A handful of Quine queries can replace months of development time and millions in costs, eliminating batch processing, multi-level joins, time windows, and other time-consuming and outdated processes that drag down and stall analysis on streaming data.

“Enterprise data engineering teams are confined to the limitations and tradeoffs of the previous generation of event processing frameworks like Flink. They spend enormous time and effort building complicated event-driven architectures that only work on small time-windows of in-memory data and miss out on the bigger picture,” said Ryan Wright, the creator of Quine and Founder/CEO of thatDot. “Quine can transform months of tedious data engineering into an afternoon’s work, enabling data pipeline engineers to easily interpret high-volume event data streams, innovate and ship products faster, and use the emerging Graph AI tools driving the next wave in machine learning.”

Community Created, Pre-built Recipes for Common Workflows

Early access launch partners, community members, and contributors have already created pre-built application functions called “recipes” to package up valuable use cases for one-click operation. These include:

Blockchain Real-time Tag Propagation - Ingests Ethereum and propagates dirty money tags to trace money laundering.
CDN Cache Efficiency Analysis - Continuously monitors CDN logs to materialize cache efficiency by PoP, Geography, and ASN, generating alerts and tracing root cause.
Apache Server Log Observability - Ingests Apache server events and observes event lineage between services.

“thatDot’s Quine is a powerful new tool for anyone building event-driven applications.
Standing queries let us match complex patterns as data arrives as well as query the past shape of data without the restriction of time windows.”
Roy Hodgman, Data Science Manager, Rapid7

Quine is freely available on GitHub and directly from the Quine Community website.

Quine is part of thatDot’s portfolio of event processing solutions. Elements of thatDot’s solutions were instrumental to DARPA’s cybersecurity research for insider threat detection and stopping Advanced Persistent Threats (APTs).

Register for the upcoming conferences, webinars or recordings showcasing Quine:

Open Data Science Conference (ODSC) Webinar (March 3, 2022)
Global Big Data & AI Conference (March 18-19, 2022)
PDXDataEngineering Meetup (March 10, 2022)

AWS Names thatDot's Novelty Detector As A Containers Anywhere Partner

2021-12-08T00:00:00+00:00

Bringing cloud-based data management into the enterprise data center, where much enterprise data still lives, is now simpler than ever with AWS Containers Anywhere, and thatDot is excited to be a launch partner of the new AWS Marketplace for Containers Anywhere program.

As expected, AWS Re:Invent 2021 delivered great training content and insights into where the Cloud industry is going. The emphasis on AI and ML enablement was a clear topic of focus, reflecting the ever wider adoption of technology that leverages data for better business outcomes. With AWS Marketplace for Containers Anywhere, enterprises can leverage leading technologies within the bounds of their own data centers.

thatDot Novelty Detector

Existing anomaly detection techniques rely on numerical data and threshold analysis, which breaks down in the face of high data dimensionality and produces high volumes of false positives. thatDot Novelty Detector uses categorical data to build a comprehensive behavioral fingerprint of your data.
This deep contextual understanding eliminates false positives and explains WHY an anomaly was identified, making it immediately actionable.

Popular uses of thatDot Novelty Detector by AWS users include Detecting Stolen Credential Use in AWS CloudTrail Logs and Detecting Data Exfiltration in AWS CloudTrail Logs.

If you are looking to deploy thatDot Novelty Detector in any AWS region, you can follow these instructions documented in the thatDot documentation site.

Stop Insider Threats With Automated Behavioral Anomaly Detection

2021-10-13T00:00:00+00:00

Introduction: A Murky and Labor-Intensive Challenge

Finding a malicious employee is one of the toughest cyber-security challenges in the industry: someone who has been deliberately given access to sensitive information but violates that trust and secretly steals private data to give to a third party. Finding such a threat has always been a murky and labor-intensive challenge. Until now.

Behavioral Anomaly Detection

We’ve developed a new technique for measuring how unusual each behavior is. Unlike traditional approaches, this innovation uses non-numeric categorical data directly, instead of trying to convert it to numbers first. We can analyze that data in real time to score each behavioral event and explain how unusual it is. As a result, it’s very easy to automate the analysis of behavioral data and get powerful results in real time.

VAST Insider Threat Dataset

A standard benchmark dataset for insider threat detection is the publicly available VAST Insider Threat challenge dataset. Originally released in 2009, it remains a good example of the data available to, and the challenge facing, cyber-security professionals today who are trying to detect and stop a malicious insider.

This post will use the VAST dataset for the “mini-challenge” #1: “MC1 – Badge and Network Traffic” to demonstrate automated insider threat detection.
For comparison, the previously accepted solution to the problem can be found here</a>; it relied on laborious manual exploration.

The Embassy Scenario</h2>
The data describes the workplace environment of 60 employees at an embassy. There are 30 offices, each shared by two people. A classified space is separated from the offices. There are two doors: one provides access between the outside and the office area, and the other provides access from the office space to the classified space. Employees are expected to scan their badges every time they enter from the outside, and every time they enter and leave the classified space. Each office holds the desktop computers assigned uniquely to each employee who shares that office. The last octet of each computer's IP address corresponds to the employee ID.

Source Data [download</a>]</h2>
Data from the dataset</a> is broken into three files: employee data in employeeData.csv</code>, proximity data from door badge scanners in proxLog.csv</code>, and network traffic in IPLog3.5.csv</code>. Some sample records from each file (below) show their shape. Note: the original data includes a USER WARNING</code> that this is Synthetic Data</code>. This field appears on each record of the employee and network traffic files, but is deleted from this point on for brevity.
Data is read from each .csv</code> file and parsed into Python dictionaries, also parsing out timestamps into datetime</code> objects: employeeData = [] ipLog = [] proxLog = [] # Reading employee data with open("employeeData.csv", "r") as fd: for l in csv.DictReader(fd): del l["USER_WARNING"] employeeData.append(l) # Reading IP log data with open("IPLog3.5.csv", "r") as fd: for l in csv.DictReader(fd): del l["USER WARNING"] l["AccessTime"] = datetime.fromisoformat(l["AccessTime"]) ipLog.append(l) # Reading proximity log data with open("proxLog.csv", "r") as fd: for l in csv.DictReader(fd): l["Datetime"] = datetime.fromisoformat(l["Datetime"]) proxLog.append(l)</code></pre> Looking at the first three sample records from each file helps build a sense of what this data looks like: pprint(employeeData[:3]) len(employeeData)</code></pre>[{"EmployeeID": "0", "IP": "37.170.100.0", "Office": "0"}, {"EmployeeID": "1", "IP": "37.170.100.1", "Office": "0"}, {"EmployeeID": "2", "IP": "37.170.100.2", "Office": "1"}] 60 records</code></pre>pprint(proxLog[:3]) len(proxLog)</code></pre>[ { Datetime: datetime.datetime(2008, 1, 1, 7, 28), ID: "44", Type: "prox-in-building", }, { Datetime: datetime.datetime(2008, 1, 1, 8, 31), ID: "44", Type: "prox-in-classified", }, { Datetime: datetime.datetime(2008, 1, 1, 9, 23), ID: "38", Type: "prox-in-building", }, ]; 10162 records</code></pre>pprint(ipLog[:3]) len(ipLog)</code></pre>[ { DestIP: "37.170.100.200", ReqSize: 7063, RespSize: 49591, Socket: "80", SourceIP: "37.170.100.38", }, { AccessTime: datetime.datetime(2008, 1, 1, 9, 43, 8, 861000), DestIP: "37.157.76.124", ReqSize: 5171, RespSize: 434285, Socket: "80", SourceIP: "37.170.100.38", }, { AccessTime: datetime.datetime(2008, 1, 1, 9, 47, 41, 282000), DestIP: "37.170.30.250", ReqSize: 32818, RespSize: 182798, Socket: "25", SourceIP: "37.170.100.38", }, ]; 115414 records</code></pre>Behavioral Data</h2> The proximity data in proxLog</code> and network traffic data in ipLog</code> both 
represent different kinds of behavioral data. Though they are separate files, each type of data is an aspect of the same behavioral information we’re trying to analyze. They can be combined by adding the relevant employee’s status from the proxLog</code> to each record from ipLog</code>:

# Badge events for each employee, sorted by time
sorted_emp_prox = {}
for emp_id in [e["EmployeeID"] for e in employeeData]:
    sorted_emp_prox[emp_id] = sorted([x for x in proxLog if x["ID"] == emp_id], key=lambda x: x["Datetime"])

def emp_state_at_time(employee, dt):
    # Return the most recent badge state recorded strictly before dt
    last = None
    for e in sorted_emp_prox[employee]:
        if e["Datetime"] < dt:
            last = e
        else:  # e["Datetime"] >= dt; events are sorted, so we can stop here
            break
    return last["Type"] if last is not None else "unknown"

emp_for_ip = {e["IP"]: e["EmployeeID"] for e in employeeData}
behaviors = [[emp_state_at_time(emp_for_ip[x["SourceIP"]], x["AccessTime"]), x["SourceIP"], x["Socket"], x["DestIP"]] for x in ipLog]
pprint(behaviors[:5])</code></pre>

We’ve added proximity badge status to each network record, and then trimmed those records down to only the four values relevant for our automated detection:

[ ["prox-in-building", "37.170.100.38", "80", "37.170.100.200"], ["prox-in-building", "37.170.100.38", "80", "37.157.76.124"], ["prox-in-building", "37.170.100.38", "25", "37.170.30.250"], ["prox-in-building", "37.170.100.38", "80", "37.116.192.39"], ["prox-in-building", "37.170.100.38", "80", "10.24.74.254"], ... ]; // 115414 records</code></pre>

Automated Detection</h2>
With the combined proximity+network data, we can dump that out as a .csv</code> file and use the built-in “Getting Started” page to easily feed this to thatDot’s Novelty Detection system. Loading data trains the system and scores each result as it streams in. Streaming results are often more difficult to produce but more useful than batch results.
However, since this scenario is a batch of data and meant to be analyzed as a batch, we can use a quick Python script to make a second pass that feeds the data to the novelty scoring system in a read-only fashion (does not update the model) to get results where each is informed by the entire batch: e = read_results("behaviors", behaviors, 10000)</code></pre>Read: 10,000 Rate: 52,631 / second Read: 10,000 Rate: 38,910 / second Read: 10,000 Rate: 56,497 / second Read: 10,000 Rate: 43,668 / second Read: 10,000 Rate: 55,248 / second Read: 10,000 Rate: 56,497 / second Read: 10,000 Rate: 47,169 / second Read: 10,000 Rate: 35,087 / second Read: 10,000 Rate: 49,019 / second Read: 10,000 Rate: 46,296 / second Read: 10,000 Rate: 56,818 / second Read: 5,414 Rate: 40,103 / second</code></pre> And plot the results: px.scatter(limit([x for x in e if x["score"] > 0.5]), title="Scores after observing all data (sampled)", y="score", x="sequence", marginal_y="violin").show()</code></pre> ‍ More novel, high scoring results show up higher on the plot. We see from this sampling of results (not every single data point is shown), that there is a distinct group of nodes which show up with very high scores around 0.94. Novelty Detector has found a small number of clearly unusual behaviors. Looking at the observation results shows that each of these events was network traffic to the same destination IP address: 100.59.151.133</code>. What’s more, these connections were usually made from different source IP addresses, but while the respective employee responsible for that computer was badged into the classified space. Who Done It?</h2> Novelty Detector successfully found the needle in the haystack! 
Eighteen records in the network traffic log correspond with traffic sent to this suspicious IP address:

# All top-scoring observations share one destination IP (100.59.151.133)
suspicious_ip = list(set([x["observation"][3] for x in e if x["score"] > 0.9]))[0]
suspicious_ip_traffic = [x for x in ipLog if x["DestIP"] == suspicious_ip]
pprint(suspicious_ip_traffic[:3])</code></pre>[{"AccessTime": datetime.datetime(2008, 1, 8, 17, 1, 33, 1000), "DestIP": "100.59.151.133", "ReqSize": 8889677, "RespSize": 12223, "Socket": "8080", "SourceIP": "37.170.100.31"}, {"AccessTime": datetime.datetime(2008, 1, 10, 14, 27, 12, 238000), "DestIP": "100.59.151.133", "ReqSize": 6543216, "RespSize": 22315, "Socket": "8080", "SourceIP": "37.170.100.31"}, {"AccessTime": datetime.datetime(2008, 1, 10, 16, 1, 53, 956000), "DestIP": "100.59.151.133", "ReqSize": 8543125, "RespSize": 12312, "Socket": "8080", "SourceIP": "37.170.100.16"}] 18 records</code></pre>

So who is the guilty party? Using the behavioral data about employee proximity, we can put together the set of people who were available to access the compromised computer during each of the 18 illicit data transmissions:

def available_emp(atTime):
    # Employees who were in the office area (not in the classified space) at atTime
    a = []
    available = ["prox-out-classified", "prox-in-building"]
    for e in employeeData:
        if emp_state_at_time(e["EmployeeID"], atTime) in available:
            a.append(e["EmployeeID"])
    return a

opportunities = [available_emp(x["AccessTime"]) for x in suspicious_ip_traffic]
[len(x) for x in opportunities]</code></pre>

At each of the relevant times, between 38 and 60 employees were available.

[52, 43, 49, 42, 42, 38, 53, 60, 47, 54, 52, 48, 47, 44, 44, 56, 45, 45]</code></pre>

But who was available during ALL times?

# Intersect the available sets across all 18 events
had_opportunity = set(opportunities[0])
for o in opportunities:
    had_opportunity = had_opportunity.intersection(o)
had_opportunity</code></pre>{"27", "30"}</code></pre>

Only two employees! Maybe they are colluding. But if not, one of them would be a witness if their officemate’s computer was compromised.
Looking at the traffic records, we can focus in on suspects who wouldn’t be caught by an officemate in the room with the compromised computer:

def not_caught_by_officemate(suspects, suspicious_events):
    still_suspect = []
    # Source IPs of the computers that sent the suspicious traffic
    sent_traffic = [x["SourceIP"] for x in suspicious_events]
    for s in suspects:
        suspect_office = [e["Office"] for e in employeeData if e["EmployeeID"] == s][0]
        officemate_ip = [e["IP"] for e in employeeData if e["Office"] == suspect_office and e["EmployeeID"] != s][0]
        # Keep the suspect only if their officemate's computer was one of the senders
        if officemate_ip in sent_traffic:
            still_suspect.append(s)
    return still_suspect

not_caught_by_officemate(had_opportunity, suspicious_ip_traffic)</code></pre>{"30"}</code></pre>

Which leaves only Employee #30 as the guilty party. Employee #30 had opportunity during each event and was not observable by the officemate.

Bonus: Real-Time Automated Detection</h2>
Novelty Detector doesn’t have to be used only on static data. In fact, it’s meant to run on live streaming data and to produce results immediately, in real time. What would the experience be like for a person monitoring this data live as it came in? Could they have stopped the malicious insider sooner? This image shows the “Results” section of the Novelty Detector tool, which updates live with all the recent results. The plot shows the scores as they were produced in real time; higher dots correspond with more novel data. The colors in this plot indicate how unique each behavioral observation was (dark red means we had never seen that specific observation before). You can see from the x-axis that it only shows the most recent results. Another plot on this page shows the highest scoring results overall. Since these are the real-time streaming results, each observation is scored immediately without any knowledge of what data will arrive after it. This system uses past data to understand how novel new data is, so at the beginning of the stream, there is a lot of novel data.
Those early high scores are genuinely novel at that moment, given what little has been seen so far, but with knowledge of the whole dataset, we know that is not enough data to form a representative sample of what is yet to come. After about 25k observations in this dataset, the novelty scoring system has automatically learned what this data looks like. Notice that no data labeling is required! Simply turn on the system and let it observe a representative sample of your data. Ignore the early results while it is automatically training to fit the data. In a real-world scenario, this system would likely be running long before the insider began their malicious activity, so the early tuning would have long since been accomplished. A security analyst observing this stream of results would regularly see low-scoring results they could easily ignore during the course of normal business. But the moment the malicious insider sneaked onto an employee’s computer, it would produce the first top-scoring result. The observer would have a clear signal about the first occurrence of malicious activity: in this case, when Employee #30 used their officemate’s computer while #31 was in the classified space. That warning would come with enough information for an observer to understand the context of what is happening, why it is unusual, and where to go RIGHT NOW to catch the malicious insider red-handed! If the observer missed this opportunity or needed to wait for more evidence, subsequent malicious activity on other computers would give the authorities additional opportunities and evidence.

This Is Just the Beginning…</h2>
Novelty detection based on categorical data is a major new innovation in cyber-security threat detection. It comes out of years of DARPA-funded R&D with some of the world’s foremost experts. As shown here, it’s a game-changer for insider threat detection. But it doesn’t stop here!
This tool can be used in a wide variety of other domains to measure in real time how unusual each and every piece of data is. Our blog includes other examples like detecting stolen credential use</a>, or data exfiltration in the cloud</a>. While anomaly detection algorithms have been around for decades, there has never been an effective way to apply them to categorical data… until now. The first of its kind, thatDot’s Novelty Detector is currently being used to detect real-time cyber security threats, reduce the cost of data analysis tools (e.g. by analyzing only the unusual data), find fraudulent activity, detect stolen credentials, analyze log data, audit financial transactions, and much more.

Try Novelty</h2>
Experience the power of thatDot's Novelty</a>: sign up for our free trial</a> today and see the difference for yourself. Or get in touch with us via the request a demo</a> page to talk to a solution architect.

Quine Streaming Graph Scales to 1.1 Trillion Log Events per Month

2021-09-21T00:00:00+00:00

Just the numbers, please</h2>
We achieved a sustained rate of recording 425,000 records per second into our streaming graph</li>
We ran 64 thatDot Quine Enterprise hosts on AWS c5.2xlarge instances (the most important feature being the 8 CPU cores per host)</li>
This was supported by 15 Apache Cassandra hosts on m6gd.4xlarge instances</li>
The total cost was less than $13/hour to run at full scale using reserved instances</li>
</ul>

Why spend the time?</h2>
thatDot's enterprise version of Quine is designed to be a web-scale stream processing and data store solution. We’re always developing with an eye toward performance and wanted to demonstrate the linear scaling the system is built to deliver. The architecture of Quine takes some interesting twists on traditional data management systems. For example, each node in a Quine datastore is capable of performing its own computation.
This makes Quine hugely parallel, meaning that in practice, we can just about always make use of every part of every core on a host machine. Furthermore, we take advantage of deterministic entity resolution to ensure that regardless of how much data Quine has already ingested, each additional record takes roughly the same amount of time to process, analyze, and store.

Starting the scaling effort</h2>
To start, we set a goal of processing 250,000 records per second. If you ask me, that’s a lot of data. If you ask our lead Sales Engineer Josh, he’ll just laugh and say, “I’ve seen worse”. Regardless, it’s a nice round number that’s easy to talk about and do math with. Great for benchmarking. For our dataset, we generated a simulated log of process creation events, like you might find reported by an intrusion detection system. Each record had 9 fields of varying types. Our plan was to create a node in Quine for each process event, creating a property on the node for each field, and linking each process to its parent via an edge. Each record was protobuf-encoded, to keep the [de]serialization simple but nontrivial, and to reduce how large the Kafka topic’s storage would need to be. We started small, with just 4 hosts, each a c5.2xlarge. We got to a moderate, but respectable, 30,000 records per second ingested, from Kafka to Cassandra. That’s 7,500 records per second per host. Once we had played around a bit, we doubled the cluster size to 8 hosts. Accordingly, our ingest rate nearly doubled, to 56,000 nodes created per second (about 7,000 per host).

The big clusters</h2>
With the Quine application smoothly performing on 8 hosts, we decided to quadruple the cluster size to 32 hosts, expecting to see our pattern of linear scaling continue. Unfortunately, instead of the roughly 220,000 records per second we expected, we saw only 144,000. Strangely, the 144,000 figure wasn’t stable: the cluster constantly fluctuated in ingest rate.
We were momentarily baffled, until we realized that in our excitement, we had forgotten to increase Cassandra’s size proportionally to the new Quine cluster size. The Quine graph will back-pressure when some system components, like the Cassandra data storage layer, cannot keep up. Back-pressuring keeps the overall system stable instead of overwhelming the now-underpowered data storage layer. The result was a disappointing-but-oddly-beautiful initial performance graph. Once we realized that the Cassandra JVM was scrambling to keep up with the memory pressure of a 32-host Quine Enterprise cluster, we had no trouble increasing the Cassandra cluster to use more hosts, and bigger hosts. By kicking Cassandra up from using 9 2xlarge instances to 15 4xlarge, we had more than enough capacity to run Quine at-scale. We hit our 220,000 records per second hypothesis without any trouble. The cluster smoothly maintained its pattern of linear scaling, ingesting a more or less constant 220,000 events per second, creating nodes and edges. We were so close to hitting our goal of 250,000 records per second. We probably could have added just a few more hosts to reach the goal. After all, we had every indication that every host added to the cluster would add around 7,000 records per second to the cluster’s ingest rate. We did not add just a few more hosts. We doubled the cluster size instead, hitting an EC2 resource limit imposed by AWS, but once clear of that hurdle it was smooth sailing and everything was working as hoped! We kicked off the ingest on each of the 64 clustered thatDot Quine hosts, and immediately, it was clear we’d far exceeded our goal. Linear Scaling!</h2> Linear scaling to 64 hosts had been achieved. Our initial target of 250,000 records per second was exceeded, and we maintained an ingest rate of over 425,000 log events per second, each creating a node with several properties, and most creating at least one edge to connect them into a graph. 
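As a quick sanity check on scale, the sustained ingest rate can be converted to a monthly total. This is a minimal sketch: the 30-day month is an assumption (the post does not state how the month was counted), and the function name is illustrative.

```python
# Figures reported in the post
SUSTAINED_RATE_PER_SECOND = 425_000   # log events per second across the cluster
QUINE_HOSTS = 64

# Assumption (not stated in the post): a 30-day month
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_month(rate_per_second, days=DAYS_PER_MONTH):
    """Total events ingested over `days` at a constant per-second rate."""
    return rate_per_second * SECONDS_PER_DAY * days


monthly_total = events_per_month(SUSTAINED_RATE_PER_SECOND)
per_host_rate = SUSTAINED_RATE_PER_SECOND / QUINE_HOSTS

print(f"{monthly_total:,} events per {DAYS_PER_MONTH}-day month")
print(f"{per_host_rate:,.0f} events per second per host")
```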
That’s over 1.1 trillion events a month. The work continues</h2> Of course, we’re not done. We’re always working to explore what can be done with stream processing systems at large scales, and we’re excited to offer our products to support others in their explorations. ‍ ‍ Data Exfiltration Detection in AWS CloudTrail Logs Using Categorical Data 2021-07-20T00:00:00+00:00 Use Categorical Data to Detect and Prevent Data Exfiltration from AWS CloudTrail logs</h2> In our previous blog, Identifying stolen credential use in AWS CloudTrail logs with high confidence using categorical anomaly detection, we discussed the power of graph machine learning to analyze the categorical data</a> in AWS CloudTrail logs to identify novel or anomalous behaviors. We then compared thatDot Novelty Detector</a> findings to the results of traditional statistical anomaly detection tools. This article builds on our previous analysis to further investigate the use of categorical anomaly detection to identify multi-stage exploit campaigns in AWS CloudTrail logs. About AWS CloudTrail and Data Exfiltration</h2> AWS CloudTrail</a> monitors calls to AWS services and delivers detailed logs, providing a complete audit of management calls, with optional inclusion of data calls. To detect attacks effectively, you will need both, but the resulting high volume of log events creates a glut of data, making it very difficult to detect a behavioral attack like data exfiltration. Data exfiltration can be an exploit on its own, or as is increasingly seen, it can be one step in a larger ransomware campaign. Attackers recognize the obfuscating effect of high volume data movement, and use the complexity of the logs to hide their activity by moving slowly and carefully. This challenge requires an anomaly detection system capable of understanding the shared characteristics of ‘novel’ behaviors while ignoring the simpler sequential and time stamped information often relied upon by other methods. 
Data Exfiltration from AWS — an Example</h2> Monitoring data transfer activity for data theft is a true “needle in a haystack” problem. A storage layer like S3 is in a continuous state of change with data being ingested, exported, updated, duplicated, moved, and expired. With storage being a fundamental service, monitoring user or system behavior at any scale can be challenging. Rulesets designed to limit undesired activity lack the granularity required to avoid unintended interruptions in business functions, and more granular rules are incredibly difficult to manage. Administrators need a better way to highlight system and user behaviors that stand out from the norm. In the case described below a hacker has initiated a series of steps to steal S3 data. Preventing Exfiltration using Graph-based Techniques</h2> If your security monitoring system is configured to monitor CloudTrail logs and uses thatDot Novelty Detector, you can detect data theft quickly. Novelty Detector generates an observation for each CloudTrail event. Each observation receives a novelty score indicating how relevant the observation is and how much it warrants your attention. High novelty doesn’t immediately trigger a security event. However, the system will identify what makes the observation novel, providing the unique insight necessary to speed manual or automated categorization. As observations stream into thatDot Novelty Detector, it generates real time plots showing it learning both system and user behavior and learning what is ‘normal’ for the environment. The plots below provide a graphical representation of the observed behaviors with scores for normal behaviors gradually dropping over time to form a down-and-to-the-right curve. Observations that stand out from the norm receive high novelty scores and appear higher on the chart. Looking at a plot of the most recent data, we see Novelty Detector has detected a series of observations with particularly high scores. 
The high-scoring observations are for a particular user, raul, who initiated a series of AWS API calls:
GetCallerIdentity – To validate S3 access for the credentials</li>
CreateSnapshot – To create a data snapshot of the target data using Account ID and Volume ID</li>
ModifySnapshotAttribute – To modify the snapshot permissions to allow export of data</li>
At this point the data snapshot is then “pulled” from an external account</li>
DeleteSnapshot – To delete the snapshot and cover their tracks</li>
</ol>
These activities present as novel because the user, raul, does not typically execute these API calls. Novelty Detector not only identifies the activity as novel, but provides the critical insight that raul’s execution of these actions is the most novel element. Novelty Detector retains the CloudTrail event ID and event time, enabling analysts to navigate to the actual CloudTrail events to investigate the details. Most importantly, it forwards the observation and score to a security monitoring system for action. One of the most compelling aspects of Novelty Detector’s identification of raul’s data exfiltration is the multi-stage view of all of the activities associated with this exploit. In our first blog, we showed how thatDot Novelty Detector was able to identify indicators of stolen credential use. When combined with our data exfiltration results, we are able to see the entire sequence of events: Stage 1, initial credential validity testing and broader probing of associated permissions, and Stage 2, the series of calls that indicate data exfiltration. Identifying multi-stage campaigns in this way provides the audit trail needed to trace the entire intrusion and thus understand the related consequences. When we see this pattern and the multiple novel events generated by raul, it’s obvious that raul’s credentials are being used in multiple, highly unusual operations.
In response, for this one user, we can immediately force a logout and password reset while also immediately identifying impacted data. This remediation limits the disruption to the impacted user and gives our analysts a comprehensive view of the breach impact.

Summary</h2>
thatDot’s categorical analysis delivers the capability to generate high-value, high-confidence alerts, so analysts can quickly find true anomalies, judge them for maliciousness, easily trace the comprehensive impact of an exploit, and act accordingly. By looking beyond just numbers, thatDot filters out the noise and finds anomalies in the richer set of categorical data, including values like usernames, hostnames, file paths, URLs, process names and more. These benefits mean that a security analyst will focus on high-value, high-likelihood alerts, leading to a 10x increase in productivity. Good security analysts are hard to find, and their limited time and the burnout common in the industry make it crucial to use their time wisely. If you’d like to see this in action or learn more from the team that is pioneering categorical data analysis at thatDot, get in touch with us.

The Known Security Challenge of the Unknown

2021-07-07T00:00:00+00:00

Lacking Categorical Data, Enterprises are Vulnerable to Cyber Attacks</h2>
Destructive attack campaigns like WannaCry, NotPetya, or even the Mirai DDoS family succeed because they integrate new techniques or new hardcoded credentials to access and victimize their targets. Once used, though, they rapidly lose much of their sting as they are understood and mitigated. Organizations patch exploitable vulnerabilities, change credentials, and block known-hostile traffic and executables with application firewalls and gateways. Traditional security is most effective when called upon to identify and block what it can understand. Unfortunately, profit motive and system complexity breed a seemingly endless stream of new threats.
Some are simply trivially reconstituted versions of older attacks, like malware, and some are ingeniously constructed, like multi-component credential theft and ransomware campaigns. There will always be latency between the arrival of a new threat and the corresponding protection that will recognize and disrupt it. To address this gap, security practitioners and vendors have long attempted to identify new threats by first learning what good traffic or artifacts look like, then identifying what’s new and applying some type of logic to decide if that new event or artifact is good or bad. This has been done with machine learning-based analysis of executable objects to identify malware and through network-based anomaly detection to find hostile behavioral patterns. Both approaches have struggled with a poor signal-to-noise ratio, requiring additional effort and expertise to distill the real threats from the vagaries of a dynamic environment. These approaches are often used in tandem to minimize the likelihood of false negatives. Resulting security events can be ascribed to one of two detection techniques: Signature Detection or Anomaly Detection Signature Detection</h4> When a pattern is available in an object or in a series of activities, signatures can be used to describe and then identify what is known. This creates a low number of false positives, but it puts the user on an unending treadmill of effort to research, identify, and upgrade new protection with near continuous updates. Anomaly Detection</h4> For anomaly detection to work, a baseline needs to be established or learned. After this, new behaviors, connections, users, or services, will be surfaced as concerns for resolution by security analysts. This will create a low number of false negatives but will generate unmanageable quantities of false positives in any dynamic or user-facing environment. Anomalous events happen all the time, making them poor providers of conclusive data. 
So, in short, signatures are too specific and time-delayed while anomalies are too general and time-consuming. Enter the Impact of Novelty Detector</h2> When discovering and describing anomalous events, much of the event context is commonplace. It is the combination of elements that make an event an anomaly. The characteristic of an element that creates an anomalous event is that element’s novelty. This more granular attribute of an event can be analyzed to disambiguate the troubling from the simply unusual. As an example, think of an anomaly that is created by an unexpected file access. That operation will be characterized by multiple elements, including the user, user IP address, target file, target file system, time of day, network, geography, filetype, user role, and others which together form a behavior context. If the access is anomalous because it is a first-time file access (the filename is novel), that class of detection will generate a flood of false positives because the operation is new, but the context is quite common. In contrast, consider a case where the characteristic that makes the event anomalous is the user. If multiple events are flagged and a user’s ID is the novel characteristic, this is more concerning. The context of the detection is a pattern of file accesses made unusual because the user doesn’t ordinarily access those files. Add to this a novel geography, time of day, or user/role combination, and that novelty suddenly transforms a first-time access into a likely security event. Novelty data defines a path to evolving anomaly analysis from high volume, low confidence detections to low volume, high confidence events </blockquote> Collecting and training on novelty data defines a path to evolving anomaly analysis from high volume, low confidence detections to low volume, high confidence events. It represents a deeper level of modeling that simulates the type of second-order investigation and clarification typically performed by analysts. 
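To make the file-access example concrete, here is a toy sketch of the idea that *which* element is novel matters more than the mere fact that something is new. It is not thatDot's algorithm: the scoring rule, the function names (`first_seen_rate`, `novel_fields`), and the sample data are all assumptions made for illustration. A field that routinely produces never-before-seen values (such as filenames) earns a low surprise score when it is novel, while a field that rarely does (such as the user) earns a high one.

```python
def first_seen_rate(history, field):
    """Fraction of past events whose value for `field` had not appeared before."""
    seen, firsts = set(), 0
    for row in history:
        if row[field] not in seen:
            firsts += 1
            seen.add(row[field])
    return firsts / len(history)


def novel_fields(history, event):
    """Map each field of `event` holding a never-seen value to a surprise score.

    The score is 1 - first_seen_rate: novelty in a field that is usually
    novel is unsurprising (near 0); novelty in a stable field is near 1.
    """
    scores = {}
    for field, value in event.items():
        if all(row[field] != value for row in history):
            scores[field] = 1.0 - first_seen_rate(history, field)
    return scores


# Hypothetical history: two users who each keep touching new files
history = [
    {"user": "alice", "file": "report_q1.doc"},
    {"user": "alice", "file": "report_q2.doc"},
    {"user": "bob", "file": "notes_a.txt"},
    {"user": "bob", "file": "notes_b.txt"},
    {"user": "alice", "file": "report_q3.doc"},
    {"user": "bob", "file": "notes_c.txt"},
]

# A first-time file access by a known user: new, but in a very common way
new_file = novel_fields(history, {"user": "alice", "file": "report_q4.doc"})

# A known file accessed by a never-seen user: new in an uncommon way
new_user = novel_fields(history, {"user": "mallory", "file": "report_q1.doc"})

print(new_file)
print(new_user)
```

In this toy data every historical event introduced a new filename, so a new filename carries no surprise, whereas only two of six events introduced a new user, so a new user stands out.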
The savings in time, alert fatigue, and missed events, make novelty analysis a foundational improvement for threat and active attack detection. To learn more about novelty and its revolutionary impact on threat detection visit the Novelty Detector</a> overview page. ‍ Find Stolen Credentials Use in AWS CloudTrail Logs using Quine Graph 2021-04-24T00:00:00+00:00 Using Categorical Data to Secure Cloud Infrastructure</h2> (Note: This is part one of a two part series. If you find this blog interesting, make sure and check out Stop Data Exfiltration in AWS CloudTrail Logs With Categorical Data</a>.) The move to the cloud represents new challenges for enterprise security teams. Systems are more distributed and the impact of credential theft is greater than ever. Running your services in a public cloud vendor like AWS requires you to monitor and detect attacks in real-time, but how do you do that without drowning in the noise? Existing tools can highlight statistical anomalies, but are limited to counts and thresholds and have been shown to produce an unacceptable rate of false positives. Novelty Detector processes categorical data</a> to uncover real threats and reduce false positives. About AWS CloudTrail</h2> AWS CloudTrail</a> monitors calls to AWS services and delivers detailed logs, providing a complete audit of management calls, with optional inclusion of data calls. To detect attacks effectively, you will need both, but the resulting high volume of log events creates a glut of data, making it very difficult to detect a behavioral attack like those leveraging stolen credentials. Stolen credentials provide access to sensitive resources, and an attack will tailor its activities to make those actions look like normal usage. In addition, attackers recognize the obfuscating effect of high traffic volume, and use the complexity of the logs to hide their activity by moving slowly and carefully. 
This challenge requires that anomaly detection be capable of understanding the shared characteristics of novel behaviors and to ignore the simpler sequential and time stamped information that is often relied upon by other methods. In the example below, we have configured CloudTrail to monitor management, data, and Insights events. We have configured thatDot Novelty Detector to read the “trail” of CloudTrail events as they are written to s3. Stolen Credential Use—an Example</h2> Let’s say through a dark web scan, you’ve become aware that many of your staff’s credentials may have been compromised. It might be through a company partner, a service provider, or the breach of a service popular in your industry. To address these newly exposed credentials, a common brute force remediation is to force a logout and password reset for every employee. This is obviously extremely disruptive and can create unintended interruptions and delays in business functions. By doing this, you have only changed the passwords, and have not detected a source, purpose, or target of a credential theft attack. A motivated adversary may have already installed keylogging or browser-subverting capabilities that will capture any new password and be able to gather other authenticating credentials. If this has happened, in spite of your efforts, you will still have no idea whether an attack is coming. In the case described below, the compromise is real and a hacker has begun to probe the AWS account. She finds that she can successfully scan S3 buckets and creates a script to try other high-value services such as Service Discovery and ELB and runs the script later that evening. Enter: thatDot</h2> If your security monitoring system is configured to monitor CloudTrail logs and uses thatDot Novelty Detector</a>, you can detect the attack quickly. 
Novelty Detector generates an observation for each CloudTrail event, and these observations have a novelty score that indicates how relevant it is and how much it warrants your attention. Novelty doesn’t immediately trigger a security event, but the system will identify what makes the observation novel, and this provides the unique insight to speed manual or automated categorization. At this point we can show the real-time plots in Novelty Detector to show it learning both system and user behavior and then seeing everything as normal in the form of a down-and-to-the-right curve. Looking at the data below, which shows the most recent observations, we see that Novelty Detector detected an observation with a particularly high score. User raul accessed three different AWS resources for the first time, and then three other AWS resources a few minutes later. The activities are novel, but the critical insight is that raul’s execution of these actions is the element that is most novel. Novelty Detector retains the CloudTrail event ID and event time, enabling us to navigate to the actual CloudTrail events and investigate the details. Most importantly, it forwards the observation and score to our security monitoring system for action. ‍ Seeing this pattern and the multiple novel events generated by raul, it’s obvious that raul’s credentials are being used in multiple, highly unusual, operations. In response, for this one user, we can immediately force a logout and password reset. This remediation limits the disruption only to the impacted user. Because we’ve identified the attacked services, we can follow up by reviewing the services’ access logs for other signs of suspicious activity. How is thatDot Different?</h2> Let’s compare this to watching for credential misuse through AWS CloudTrail Insights</a>, which many use to find anomalies in CloudTrail events. 
CloudTrail Insights is based on traditional anomaly detection techniques, which, in this case, watch for a simple variance in the number of API calls. This method does not highlight the attack, because the attacker’s volume of events is too low. It did, however, highlight a number of other events that are clearly false positives. As an example, one of the CloudTrail Insights events shows that API calls to create a network interface on EC2 increased over a period of time. But this event is neither new nor novel. Every EC2 instance allows the creation of up to 15 network interfaces, and each of these can have one or more separate IP addresses. This is necessary for deploying services with multiple SSL certificates, and for many other purposes.
These types of false positives create two problems for security analysts. First, the additional volume of data forces extra work onto the analysts to recategorize and eliminate the false positives. Second, real events that may be detected are buried within the stream of mixed information, creating the alert fatigue and missed events that plague overworked security analysts. Using thatDot Novelty Detector creates a limited and high confidence set of events, allowing analysts to identify, work, and resolve the most relevant events.
To find what’s truly novel you will often need all of the context of an event. In this demonstration case, we’ve configured thatDot Novelty Detector to focus on each user’s use of different operations on all services at various times of day. This context is also crucial for skills-based routing: getting the incident information to the right team who can follow up with timely verification and remediation.
Alert on What Matters</h2>
Normally, you don’t watch Novelty Detector in real time. Your security monitoring system has integrations that alert you to the anomalies that you care about.
In this example, we set up a PagerDuty integration with a score threshold of >0.99, so we were notified only of this urgent anomaly. For Slack, we set the threshold to >0.95, so we received six other observations of interest that we can tackle as time permits.
Summary</h2>
These benefits mean that a security analyst will focus on high-value, high-likelihood alerts, leading to a 10x increase in productivity. Good security analysts are hard to find, and their limited time and the burnout common in the industry make it crucial to use their time wisely. thatDot’s categorical analysis delivers the capability to generate these high value, high confidence alerts, so that analysts can quickly find true anomalies, judge them for maliciousness, and filter out the noise.
thatDot finds anomalies in a richer set of data including categorical values like usernames, hostnames, file paths, URLs, process names and more; not just numbers. Better tools for examining more kinds of data find more true anomalies and filter out results that are unsurprising numeric outliers. In CloudTrail, other anomaly detection solutions miss the fact that raul’s account was used to try to access several new services because the context, the link between operation requests and a single user, went unexamined. This contextual awareness is the categorical difference.
If you’d like to see this in action or learn more, contact the team that is pioneering categorical analysis at thatDot.

What Is Categorical Data? Comparing it to Numerical Data for Analytics

2021-04-05T00:00:00+00:00

Two kinds of Data: Categorical and Numerical</h2>
Data comes in two flavors: Numeric and Categorical. Numeric data is easy: it’s numbers. Categorical data is everything else. As the name suggests, categorical data is information that comes in categories, which means each instance of it is distinct from the others. Names are an example of categorical data, and my name is distinct from your name.
On the unlikely chance that your name is the same as mine, I’m sure our government-issued ID numbers, phone numbers, and email addresses are distinct, and those are also categorical data.
Examples of Numeric and Categorical Data</h2>
<table>
<thead>
<tr><th>Numeric</th><th>Categorical</th></tr>
</thead>
<tbody>
<tr><td>Rate: 27 events/second</td><td>Name: Mary Shelley</td></tr>
<tr><td>Score: 0.91237</td><td>IP Address: 192.168.1.100</td></tr>
<tr><td>Clicks (counts): 2,743</td><td>File path: C:\Windows\System32\notepad.exe</td></tr>
<tr><td>Money: $19.79</td><td>Sentiment: cautiously optimistic</td></tr>
<tr><td>Temperature: 72° F</td><td>Address: 10 Downing Street</td></tr>
<tr><td>Age: 27 years old</td><td>Zip code: 97214</td></tr>
<tr><td>Weight: 165 lbs.</td><td>Email: info@thatDot.com</td></tr>
<tr><td>Distance: 127 miles</td><td>Flavor: Umami</td></tr>
<tr><td>Location: 45.5209, -122.6778</td><td>Location: 421 SW 6th Avenue, Portland, OR</td></tr>
<tr><td>Color: #1f4c7c</td><td>Color: blue</td></tr>
<tr><td>Angle: 91°</td><td>Angle: obtuse</td></tr>
<tr><td>Weather: 60% chance of rain</td><td>Weather: Partly Cloudy</td></tr>
<tr><td>Time: 1617212687 (Unix time)</td><td>Time: Wednesday at 10:44 am</td></tr>
</tbody>
</table>
As you can see from the examples at the bottom of this table, some kinds of data can be represented both as numeric data and categorical data. The type of information conveyed is different in each case, but this illustrates that there is often a reasonable relationship or translation between the two.
In fact, many times when a person is trying to use numeric data, they implicitly convert it into categorical data, at least mentally. If I offered to put either $5,000,791 or $5,000,792 into your bank account, you probably wouldn’t spend much time arguing about which deposit should be made. The amounts are not categorically different. Your brain still says “$5 million.” Their difference doesn’t matter as much as the fact that they are just very big compared to the small $5 you might occasionally find on the sidewalk.
Categorical data is often directly interpretable by humans—and often more of a challenge to interpret with computers. While numeric data is produced by measuring—and you can usually divide them (at least conceptually) into smaller parts as much as you want (remember the plan in Office Space</a> to collect fractions of a penny?)—categorical data is counted or referenced, not measured. It is often something you can point at or refer to linguistically. Each dot in a plot has a numeric position, but even if two dots have the same position the dots themselves are distinct because “this dot” is not “that dot.” So in short, we might just say that: categorical data is what numeric data is about. What can you do with categorical data?</h2> Ignore It</h3> Unfortunately, the most common strategy is just to ignore the categorical data. Log data often holds a wealth of information about categorical values, but because of its volume and lack of tooling, most of that data sits unused in log archives on the vague hope that, if a human is ever forced to look at this data by some future algorithm, the human will be able to read the categorical information and understand it directly. While the numerical data is processed by common analysis tools, the categorical data is ignored in the hope that numeric data happens to contain the answers that will be needed in the future. Count It</h3> If you don’t ignore categorical data, then by far the most common thing to do with it is to count the values. Entire tech stacks—and even entire companies—have been built around counting how many times each categorical value is seen. It is often very useful to know how frequent some values are. Rare values can be insightful. Common values can help you understand your data better. The word-count problem has become the de facto “hello, world” style example when getting started with stream processing tools like in this example from Apache Spark</a>. 
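As a minimal sketch of the counting strategy, the snippet below tallies how often each categorical value appears and then lists the rare ones. The records, field names, and the `max_count` cutoff are hypothetical and exist only for illustration; this is not how any particular product counts values.

```python
from collections import Counter

# Hypothetical log records; the field names and values are illustrative only.
events = [
    {"user": "alice", "service": "s3"},
    {"user": "alice", "service": "s3"},
    {"user": "bob", "service": "ec2"},
    {"user": "alice", "service": "ec2"},
    {"user": "carol", "service": "s3"},
]


def count_values(records, field):
    """Count how many times each categorical value of `field` appears."""
    return Counter(r[field] for r in records if field in r)


def rare_values(counts, max_count=1):
    """Return the values seen at most `max_count` times, sorted by name."""
    return sorted(value for value, n in counts.items() if n <= max_count)


user_counts = count_values(events, "user")
rare_users = rare_values(user_counts)
```

Note that the counter needs one entry per distinct value, which is where the cost of this approach comes from as the number of distinct values grows.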
Turn it into Numbers</h3>
If you try to do something more sophisticated than simple counting, then the next most common approach is to use one of a handful of techniques to try to represent the categorical data as numeric data. While counting is often the domain of data engineers (and much harder and more interesting than it looks</a>), data scientists usually try to reach further; one-hot encoding</a> is the most common technique. More complex approaches try to “embed” the categorical data into a high-dimensional vector space. If successfully trained, this process will put similar values close to each other, and dissimilar values farther away. Embedding techniques can accomplish almost miraculous results in some specialized contexts (Word2Vec</a> is still an astounding result, eight years later!), but these techniques require huge amounts of data, expertly trained, in a batch process ahead of time, so they cannot be used on data previously unseen.
Connect It</h3>
The hidden value of categorical data lies in its potential relationship to other values. Sophisticated embedding techniques can approximate these relationships, but a more natural approach is to represent the relationships directly. The last 10 years have seen emerging graph technologies that do exactly that. A graph is built of nodes and edges; you can picture this with circles for nodes and arrows for edges. The Node-Edge-Node pattern connects two categorical values (as nodes) by a relationship represented by the edge. This is a very natural way to represent data because that Node-Edge-Node pattern corresponds perfectly to the Subject-Predicate-Object pattern at the core of natural language.
With categorical values represented as nodes in a graph, a wealth of information can be represented or discovered by analyzing the structure of that graph. Knowledge Graphs</a> can concisely represent the domain expertise of large groups, and can lead to new discoveries simply by connecting what we already know.
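To make the node-edge-node idea concrete, here is a small sketch that stores facts as (subject, predicate, object) triples and answers a question by following edges. The facts, edge labels, and helper names are all hypothetical, and a real graph database would do far more than this in-memory index.

```python
from collections import defaultdict

# Hypothetical facts, each one a (subject, predicate, object) triple.
triples = [
    ("mary", "works_at", "acme"),
    ("raj", "works_at", "acme"),
    ("acme", "located_in", "portland"),
]


def build_graph(facts):
    """Index edges by source node: node -> list of (edge_label, target_node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in facts:
        graph[subject].append((predicate, obj))
    return graph


def follow(graph, start, labels):
    """Follow a sequence of edge labels from `start`; return the nodes reached."""
    frontier = {start}
    for label in labels:
        frontier = {
            target
            for node in frontier
            for edge_label, target in graph.get(node, [])
            if edge_label == label
        }
    return frontier


graph = build_graph(triples)
# No single triple says where mary works geographically; connecting two does.
mary_city = follow(graph, "mary", ["works_at", "located_in"])
```

The two-hop lookup is a tiny instance of finding something new by connecting what is already known.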
A connected graph represents the ideal data representation for flexible/schema-less data structures which can also be computed on easily. Why is it hard to use categorical data?</h2> Bias Toward Numeric Analysis</h3> Analysis tools focus almost exclusively on numeric data. Relationships among categorical values can be profoundly useful, but they are difficult to quantify. Since they are hard to quantify, it’s hard to show that one analysis is obviously better than another. Graph Theory is a powerful discipline in mathematics focused on exactly that issue, but it is usually only taught at the graduate level and to very few practitioners overall. As a result, most data analysts and data scientists limit their work to the quantitative tools they studied in statistics. The industry as a whole is very limited in the tools available for working with categorical data. High Cardinality</h3> “Cardinality” refers to the number of possible values that might occur for a particular category. The cardinality of states in the U.S.A. is fifty. A value with high cardinality is one where an inconveniently large number of different values show up in the data—like all possible street addresses in the U.S.A. High cardinality becomes a challenge for some of the strategies mentioned above—like counting—because high cardinality requires maintaining a very large number of counters. When you are interested in the relationship between multiple values with high cardinality, that usually means maintaining separate counters for every possible combination of values. The size and complexity of this approach spirals out of control very quickly with high cardinality data! A related challenge for working with high-cardinality categorical data is when you don’t know all possible values ahead of time. 
You might call this a problem of “increasing cardinality.” Almost all of the tools for turning categorical values into numbers (like one-hot encoding and embedding techniques) require a fixed set of possible values, known in advance. These tools are not able to represent data they have never seen. How can we better use categorical data?</h2> Depending on the organization, the first step is to stop ignoring categorical data and start making use of it. More than half of the data collected by enterprise companies is never used! It’s collected, stored, and paid for… but never used. Most of that unused data is categorical and can contain critical information to solve otherwise intractable problems. The tendency to ignore categorical data and instead use numeric data simply because current tools are built for numbers leaves many problems unsolved. The industry behaves like the drunk person looking for their keys under a lamppost because that’s where the light is. Our Tools Need to Evolve</h3> The development of graph databases over the last 10 years has been a major step forward in making use of categorical data. Tools like Neo4j</a> have blazed this trail and proven the value of the graph model for getting to value in small categorical datasets. As the world moves inexorably toward high-volume and real-time stream processing, this powerful graph data model needs to be supported by the next generation of high-volume stream processing tools. Apache Kafka</a> is an incredibly powerful tool for delivering event streams, and tools like Quine</a> are being used to join those event streams together and process them through a streaming graph engine to produce a more intelligent real-time stream as output. Categorical data is also proving to be the long-elusive key to improving anomaly detection in challenging domains like cybersecurity. 
Our company has developed a streaming novelty detector for categorical data</a> which is able to produce real-time novelty scores, assessments, and explanations through behavioral fingerprinting. This system has been shown to accurately assess the novelty of categorical data in cybersecurity event streams and reduce false positives by 99%. The next generation of streaming data tools are making categorical data more accessible and usable. This long-neglected and underused class of data is already being collected by virtually all enterprise companies. It’s only now that the tools for using this data to solve challenging problems are becoming available. New to Quine's Novelty Detector: Visualizations and Enhancements 2021-02-02T00:00:00+00:00 Adding Capabilities to Novelty Detector</h2> Since the launch of thatDot’s real-time Novelty Detector for Categorical data</a> in November, we have received numerous feature requests for additional data exploration and data transformation capabilities. We are excited to announce the addition of these key functions in the latest release, available now from AWS Marketplace.</a> Data Exploration Capabilities:</h3> While the primary output of thatDot Novelty Detector is our Novelty Score API response payload, numerous users shared that they found value in the data exploration tools we use in our demos. This is especially useful when iterating on new use cases or digging into the details of specific anomaly events. To better support these requirements we have added the following to Novelty Detector. Data Distribution Plots</h4> Data Distribution Plots are a sampled view of API responses that provide visual insight into score distribution and rapid identification of your most anomalous observations. Plots combine Sequence, Novelty Scores, Uniqueness Scores, and score distribution and display different ranges of observations, including long term history, recent observations and high-scoring events. 
Plots feature significant interactivity, including continuous updates with new observations, drill-down to smaller data populations, and click-through to any single observation. Lastly, the entire Plots page can be rendered for each anomaly context you have configured in thatDot Novelty Detector. Example Data Distribution Plot</h4> Observation Detail Visualizations</h4> Observation Detail Visualizations are used to discover “why” an observation has been scored as it was, revealing the root cause of a score. They are accessed by clicking on any data point in your Data Distribution Plots, or by querying for the sequence number of a particular observation directly in the thatDot Discovery UI. Observation details show the relational context of each data element in the observation and a count of the number of times that value has been observed in the context of the data element preceding it. Clicking on any data element allows you to expand the tree to see the range of values observed in the data set. Example Observation Detail Visualization</h4> Data Transformation Functions:</h3> Quickly transforming or reordering data elements to experiment and refine your anomaly detection efforts was a top request by users. We are excited to share that users may now define data transformations using javascript, removing the need for external data preprocessing and allowing rapid iteration of new anomaly detection scenarios. Available to all users, the data transformation API supports a range of operations that can be applied against both batch and streaming data. 
<ul>
<li>Decomposing strings into components, which is particularly useful for decomposing directory paths or user agents</li>
<li>Concatenation of fields into a single aggregate value</li>
<li>Encoding numerical values as strings, such as converting metrics into good/poor/bad buckets or turning timestamps into time-of-day descriptions such as morning, mid-day, evening and night</li>
<li>Data filtering to remove data not needed for your model</li>
<li>Data obfuscation including data hashing or masking</li>
<li>Data reordering to assess the impact of different data contexts</li>
</ul>
thatDot’s built-in Data Transformation Functions allow users to rapidly modify their observations, greatly increasing the pace of model testing and iteration.
We at thatDot are excited to share these new updates with you and welcome additional feedback and feature requests. As noted above, the latest release of thatDot Novelty Detector for Categorical Data is available now from AWS Marketplace</a> or you can contact thatDot.com</a> for a demo.
thatDot Novelty Detector</h4>
thatDot Novelty Detector is the first general-use application designed for finding anomalies in real-time in data sets that include categorical data. Available as an application for deployment in any cloud or data center, thatDot Novelty Detector exposes an API that scores submitted observations for their “novelty,” enabling real-time anomaly detection with fewer false positives than traditional threshold-based metric analysis.

The World’s First Real-time Novelty Detector For Categorical Data

2020-12-02T00:00:00+00:00

An Industry First: Novelty Detection for Categorical Data in Real-Time</h2>
thatDot is excited to share the general availability of the world’s first system for real-time categorical anomaly detection.
Data Engineers, Developers, and Data Scientists can now generate a real-time score showing the novelty of any categorical data</a>, greatly extending the science of anomaly detection beyond the mainstay of time-series numerical analysis and opening up zettabytes of data for new insights. Most of the big data collected globally is not numerical data, so traditional tools don’t apply. File names, email addresses, postal addresses, demographic groupings, IP addresses, given names, and other identifiers are all examples of categorical data that cannot be natively processed by existing anomaly detection technology. thatDot’s Novelty Detector is expanding the frontier of what data can be analyzed in real-time for anomalous signals.
Turn High-Volume Data into High-Value Data</h2>
“Big data comes with a curse: most of it is useless, but you can’t tell which data is valuable and which is mundane. We’re changing that. Early users of thatDot’s Novelty Detector have quickly made critical discoveries and found new value in both existing datasets and incoming streams of new data. Unlocking the insight in real-time data streams is helping our customers accelerate product development and operational issue analysis, benefiting both the revenue and cost sides of their business,” said Ryan Wright, Founder and CEO of thatDot, Inc.
Built upon Quine’s stateful streaming data engine, thatDot’s Novelty Detector easily scales the dynamic graphical models used to find true anomalies and explains why they stand out. The broad contextual information available to our graph processing engine dramatically reduces false positives so users don’t waste time and resources on unneeded verification and issue resolution.
Use Case: Real-time Access Log Fingerprinting</h2>
Cloud services are powerful and ubiquitous, but each service is used differently, by different users in different places, and for different reasons. How can you tell if one of those users’ credentials has been compromised?
You’d need a system to “learn” what is normal for each service, for each user, in each location. Creating that training data would be nearly impossible. thatDot Novelty Detector trains itself (no training data required!) and immediately flags the compromised usage in real-time. By using the values from log data, [Service name, REST API endpoint, User ID, Country, City], and any other relevant information like time-of-day or specific service information, anomalous access patterns become immediately apparent. And with the context-aware explanations, thatDot Novelty Detector will tell you what was so unusual about that anomaly.
thatDot Categorical Anomaly Detection</h2>
Categorical data provides insight into user behavior, system flows, and config changes, all in real time.
Simple to Use</h2>
thatDot Novelty Detector is available as a container for rapid deployment in common container management platforms such as Kubernetes or AWS Elastic Container Service (ECS). Turn it on and thatDot Novelty Detector offers a simple REST API to ingest data (as a continual stream or as a batch) and return a novelty score for each data observation. Together with the summary score, a valuable set of additional information explaining why that observation is anomalous is delivered with every data observation.
No tuning, training, or setting hyper-parameters is needed with thatDot Novelty Detector. It is ready to run immediately, allowing rapid use all the way from research projects to large scale industrial applications. In support of data exploration and presentation, a graphical visualization of the observed data is provided via the included user interface.
Free to Try</h2>
thatDot Novelty Detector is available now on the AWS Marketplace</a> or you can contact thatDot.com for information about custom deployments. All users receive a free tier of usage on either platform, and discounted annual commitment pricing is available for high-volume use.
For media inquiries, please contact Robert Malnati, thatDot, Inc., info@thatdot.com</a>

Draw Connections to Build Insights

2020-09-02T00:00:00+00:00

In the first post in this series</a>, we introduced the term “3D Data” as a mnemonic and a way to think about streaming data processing that incrementally builds toward human-level data questions with the power for deep contextual explanations (e.g. “Is my system running well?”). In this post, we dive deeper into the first “D”: Draw Connections, to further explore the benefits for data analysis and the groundbreaking result for streaming computation.
Connections for Data Structure</h2>
Data is related to other data, and there is value in drawing those connections explicitly; that’s the premise. In practice, this means using edges in a graph structure to encode some of the data[1]. A Node-Edge-Node pattern in a graph corresponds to a Subject-Predicate-Object pattern in natural language. So when we talk about creating edges between nodes, it’s best to think of that edge as a predicate. Data used for edges usually comes from two sources: 1.) values given in the data itself, and 2.) tacit knowledge about what the data means.
Data values as connections</h2>
Let’s assume JSON objects are the input data format for a stream processing system. Every new object is a new piece of data to be considered by the system. If a NoSQL document store is used, then you can simply save the object as a new item in your document store. However, if you do that, you will need to traverse the object to understand how its values are related to other objects in the store[2].
An alternative data representation would treat the JSON object as a node in a graph. The key/value pairs in the JSON object become properties on a node. But with a graph structure available, we have a new data modeling choice: an option to take a property from the JSON data and encode it as an edge to another node.
The JSON key is used to create the edge label and the value is stored in the node on the other end of the edge[3]. The choice of which properties to pull out into separate nodes is a data modeling choice. So is the decision to use a single node for all occurrences of the value (i.e. from multiple different JSON objects) vs. a separate node for each occurrence of the same value.
This method becomes especially interesting and powerful when applying it to nested JSON objects. Instead of each nested object being treated opaquely as a blob, the inner JSON object is the definition for a new collection of properties on a different node, with the edge labelled with the nested object’s key. This process can occur recursively as much as needed.
Tacit knowledge as connections</h2>
In addition to the obvious data being the source of connections between data elements, we have often found it useful to reify other assumed knowledge into specific nodes in the graph and connect them to other data. A trivial example might be to create nodes to represent the buckets of time that are relevant to the problem (days of the week, morning/afternoon/evening/night, every second, etc.), and then connect the associated data with those time buckets. Creating a node to represent an item of tacit knowledge provides a location to store other data for relevant conclusions. Drawing conclusions from that data will be discussed more in the next post in this 3D Data series.
There are many kinds of tacit knowledge, and many kinds of data we often use but don’t literally represent. Choosing insightful examples is very application-specific, but considering the kinds of answers for which the data is being used is often illustrative. In our experience, there are often intermediate objects like “user” or “session” which are the subject of qualitative questions (e.g. “Did all users have a good experience?”) that are easily overlooked when considering data representations.
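A minimal sketch of this modeling choice follows. The function name, the ID scheme, the `edge_keys` parameter, and the sample event are all hypothetical; it assumes one shared node per promoted value, and it also turns each nested object into its own node by recursion. It illustrates the idea only and is not Quine's ingest API.

```python
import json


def json_to_graph(obj, node_id, edge_keys, nodes=None, edges=None):
    """Turn a JSON object into property-graph nodes and edges.

    - Scalar values stay as properties on the node `node_id`.
    - Keys listed in `edge_keys` become edges to a node holding the value
      (one shared node per distinct value, which is a modeling choice).
    - Nested objects become their own node, linked by an edge labelled
      with the nested object's key.
    """
    nodes = {} if nodes is None else nodes
    edges = [] if edges is None else edges
    props = nodes.setdefault(node_id, {})
    for key, value in obj.items():
        if isinstance(value, dict):
            child_id = f"{node_id}/{key}"
            edges.append((node_id, key, child_id))
            json_to_graph(value, child_id, edge_keys, nodes, edges)
        elif key in edge_keys:
            target_id = f"{key}:{value}"
            nodes.setdefault(target_id, {"value": value})
            edges.append((node_id, key, target_id))
        else:
            props[key] = value
    return nodes, edges


# Hypothetical input event.
event = json.loads(
    '{"event_id": "e1", "user": "raul", "bytes": 512,'
    ' "source": {"ip": "10.0.0.5", "country": "US"}}'
)
nodes, edges = json_to_graph(event, "event:e1", edge_keys={"user"})
```

Feeding a second event with the same `user` value through the same `nodes` and `edges` collections would attach it to the existing `user:raul` node, which is what connects otherwise separate events.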
These intermediate objects are easy to overlook because they aren’t the direct subject of any single piece of data but instead are the object or concept that is behind either the data or the questions. Reifying these objects makes them available for computation, as a place to store intermediate answers, and core components in meaningful explanations. Connections for Computation</h2> The data modeling techniques described above are often enlightening and useful. They can be applied across a broad collection of technologies. Relational databases or document stores can simulate graphs, with some extra computation. Joining two tables in a relational database is akin to traversing across the edges in a graph. This works well for batch operations; but we built thatDot Quine specifically to operate on streaming data. So instead of using extra computation to simulate a graph, we turn this on its head and use less computation over a graph to get iterative results we can produce in real-time. The result of iterative processing on a native graph means that we can make use of a technique called “semantic caching.” Semantic caching is a technique that uses the structure of the data to inform how computation should be performed. While this topic deserves its own separate discussion, we leave a mention here as a pointer to the deeper computational motivation for drawing connections in data. For those who can’t wait, we touch on this topic and other related concepts in the technology overview section, and our solutions team is always ready to discuss applications to your problem space. Both for data modeling and stream processing, the first step for realizing 3D Data is the same: Draw connections between data. You already have the data. Pulling out edges from that data and encoding other aspects you already know is a brilliant way to get started building powerful real-time answers with context to human-level questions. 
[1] We are assuming a property-graph model where nodes in the graph are distinguished with IDs and contain a set of key-value pairs called “properties”, and edges connect exactly two nodes and have an edge label with an optional direction. Variations on the property-graph model are in wide use in differing contexts. Sometimes edges themselves are allowed to have properties as well as nodes, but in our model, we do not assume that is always true.
[2] Most object stores will index some values in these objects so they can be found more easily. This is an important step, but does not change the structure of the data stored. The need to carefully choose what to index becomes a critical consideration.
[3] This is very similar to how the W3C RDF spec represents what are often stored as node properties. https://www.w3.org/TR/rdf-schema/#ch_properties

The Three D’s of Graph Data
2020-07-06T00:00:00+00:00

Putting Data In Perspective with Streaming Graph
Africa is bigger than Greenland. A LOT bigger! You already knew that, but you also know that the world is round. When you try to show a globe on a flat map, you have to distort the original shape of some parts of the globe. Depending on where you start, different parts of the map will be distorted by different degrees. This is why a Mercator Projection of the Earth shows Africa and Greenland appearing about the same size. If you live in Europe or the United States, your region probably seems about right. If you only need to focus on a small area, flat maps are very useful! However, when you need a global perspective, they are impossibly broken.

Mercator Projection, Courtesy of Daniel R. Strebe

We face the same problem when using data to understand complex systems. Data comes from one source, one vendor, or one engineer’s guess about what needs to be seen and is convenient to collect.
When it works well, it gives us something like a flat map that is helpful to understand a small area that is part of a complex system. If you try to use that perspective to understand the system as a whole, the end result is grossly distorted!

This problem emerges in virtually all kinds of complex systems. CDN logs can help shed light on real-time video delivery across the internet, but dramatically obscure the quality of experience each of a million viewers is having. Monitoring file access logs can help understand security events in a live computer system, but does little to reveal injected shell code. Counting Twitter followers might show you who the “influencers” are, but tells you little about the swelling revolution.

The tendency of data engineers and analysts is to convert the complexity of these data sources into many “metrics” that show counts over time for dozens of possible measurements. This converts each digital signal into one compound analog chart, and the hope is that important events will result in a spike or a dip in one of those charts… and that the analyst will be able to explain what the spike means when it recurs. This approach is cumbersome, error prone, and requires constant vigilance by experts.

But there is a better way. Understanding a complex system from its data requires leaving the simplified world of flat data metrics and building a 3D model of the system’s data: use data to build a globe, instead of a map. This three-dimensional model will allow you to see the system from every angle and correctly view the whole world.

We have distilled this process down into three main principles for building 3D data.

1. Draw connections between data.
Data should come from many different sources. Pull it into the same system so that you can connect elements in one set to the others. You will not have all possible data at the beginning, so you cannot know what the ideal data schema is.
Accept that and use a system which will let you draw connections between data items over time and create new associations that weren’t obvious at the beginning. There are many ways to connect the dots. Allow for them all to be represented in the data; you won’t always know which are most valuable until the end.

2. Define new levels of data.
Deliberately build new data elements. Build them out of other data elements. Decide what a pattern of events means, and create a new data element to represent that meaningful object. Be sure to connect the new objects to the data they were produced from as well as any other data that produces a possibly useful connection. Arrange any items which have a useful order; record every order that could be useful. Build beyond a single level; after considering what new object is made of many lower-level details, do it again and consider how many of those objects can be combined to make an even more meaningful object. Building up at each level will result in fewer data elements at that level, but each of them is more meaningful and closer to the actual questions you want to answer. Building up from low-level data into increasingly abstract data levels is what produces the 3D data needed to answer complex questions.

3. Drill down to answer questions.
3D data directly answers human-level questions at its highest points. Literally. There is a piece of data, built from other data, that says “yes” or “no” to a question like: “Is this video stream high quality?” If the answer is “no,” drill down and look at the data used to build this last level to start explaining why. If you need more detail, keep drilling down to lower levels of data. Answering new questions is a matter of defining new data combinations that build toward literally providing that answer. There isn’t a privileged perspective or one single “right” way to look at the data. Build every useful abstraction toward every question you want to answer.
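Principles 2 and 3 can be sketched in a few lines of Python: each higher-level element keeps links to the data it was built from, so a "yes" or "no" answer can be drilled into. This is only an illustrative sketch, not how thatDot's system is implemented. The `Element` class, the event types, and the `max_rebuffers` threshold are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One data element; `sources` links it to the lower-level data it was built from."""
    kind: str
    value: object
    sources: list = field(default_factory=list)


def build_session(session_id, events):
    """Level 1: combine raw player events into one session element."""
    rebuffers = sum(1 for e in events if e.value["type"] == "rebuffer")
    return Element("session", {"id": session_id, "rebuffers": rebuffers},
                   sources=list(events))


def is_high_quality(session, max_rebuffers=1):
    """Level 2: a yes/no answer element (the threshold is an assumed example value)."""
    ok = session.value["rebuffers"] <= max_rebuffers
    return Element("high_quality", ok, sources=[session])


def drill_down(element, depth=0):
    """Walk from an answer down through the data used to build it."""
    yield depth, element
    for source in element.sources:
        yield from drill_down(source, depth + 1)


events = [Element("event", {"type": t}) for t in ("start", "rebuffer", "rebuffer", "end")]
answer = is_high_quality(build_session("s1", events))
trail = [(depth, element.kind) for depth, element in drill_down(answer)]
```

Here `answer.value` is the literal answer to "Is this video stream high quality?", and `trail` lists the session and the raw events beneath it, which is the drill-down path for explaining a "no".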
This view of 3D Data builds upon current practices and is also revolutionary in its power! We believe this so strongly that we’ve built a streaming data processing system to do 3D Data processing in real-time on a massive scale. At thatDot, our mission is to turn high-volume data into high-value data. This framework for 3D Data processing is how we do it.

Defining Video Observability
2020-06-16T00:00:00+00:00

Streaming Graph Means Real-time Root Cause Analysis
Imagine if the next time your video streaming operations dashboard-of-choice warns you that 100 users experienced video start failures in the last minute, it took only one click to see each of those sessions and learn that they are all related to a corrupted file for a single bit rate of an iOS-specific encoding of an asset on one of your CDNs. Better yet, your platform triggers a workflow to have this specific bit rate version of the asset re-encoded and uploaded to the CDN. Video observability can make this a reality.

We enjoyed Mux’s recent blog, “What Is Video Observability”, and thought to expand on Steve Lyons’s discussion. General observability solutions such as Splunk and DataDog ingest and aggregate CDN and origin logs as well as metrics from video specialists like Mux Data to provide dashboards that help support video services. Operations teams overlay charts in these dashboards in their search for context, and then, based on intuition, dive into individual systems to explore granular event data to debug issues and answer cause-and-effect questions.

Traditional Monitoring-Based Video Observability
This aggregated monitoring scales well, but shortchanges our ability to understand cause and effect, complicates debugging, and forces our understanding of the system into functional components instead of a natural or logical human view.
What if, instead, we preserved all the granular data while also building abstracted views of our end-to-end platform to enable a new paradigm in acting on our data? To accommodate this new model, we need to keep the raw event data but also transform it into sessions and then new composite metrics that span multiple elements of our platform for more holistic analysis and alerting. To achieve this definition of video observability we need to rethink our data pipeline as follows.

Video Observability Solution Elements
Video streams are what our audiences are experiencing and paying for, so we should assess our platform’s performance in terms that relate to their experience. Once sessionized, our observability provides an inter-related view of the streaming platform as a single system, allowing us to see the cause and effect between packagers, CDNs, asset ID, players, etc. This holistic view allows us to ask questions about the platform in more natural terms independent of the components.
– “What is the root cause of this rebuffering event?”
– “Is one CDN providing a better user experience than another?”
– “What part of the streaming platform is causing latency in our video streams?”
– “Are we delivering good quality video to users of the new Android OS?”

Granular and Sessionized Video Observability
Video observability enables us to see the abstract system experience, as seen by our audience, while preserving the granular details we need for rapid discovery of the root causes of issues in our platform. When we have this high-confidence view of system element inter-relations we can streamline work processes and implement automation. Video observability brings opportunities to deliver better customer service by reducing MTTR while also helping to free up our operations and engineering staff from time spent in “war rooms” inferring root causes. Adopting this broader video observability definition supports our entire business.
Developers can leverage this same insight to understand the impact of changes to individual components on the entirety of the system. Architects can use granular data to perform historical analysis without having to rebuild models from logs. Product managers can directly see the QoE benefit of new investments or lower-cost infrastructure substitutions. In all cases, understanding the relationships between the logs, events and metrics in terms that relate to our services delivered enables such insights so we can take well-informed action.

Introducing thatDot
2020-05-26T00:00:00+00:00

That word… “that.” It’s how you point with a word. What better way to express what you mean than by directly pointing? “What you mean” is the whole point. What you mean is where you spend your time; that’s why we’re here.

We started thatDot after decades of building streaming data systems to extract value from data. Over and over again, the experience has taught us: there needs to be something better than building bespoke backend systems every time, a product to make getting value from data at scale as easy as “that.”

Data arrives a piece at a time. What we’re looking for is never so simple as just one item among many. The pieces we need are buried in a sea of data we don’t. We need to put the data together and see how it all relates to find the meaningful piece. The world is swimming in data, but to piece it together into that one most meaningful event… well that’s everything.

thatDot’s mission is to produce meaning from data. Meaningful data is valuable data. Our aim is to turn high-volume data into high-value data.

For the last six years, our team has been working to create a fundamentally new technology aimed at getting value from data easily at large scale. With major funding from DARPA, we have been able to approach these fundamental problems from a fresh perspective. The result is a new technology built from the ground up to solve the challenges of modern data processing in the enterprise.
A new perspective has brought new capabilities which enable radical new results. Our technology unifies the storage and computational models for high-volume streaming enterprise data. As a result, each piece of data can perform its own computation, much like the neural network in the brain both stores information and relays signals used for complex reasoning. Combined with a powerful representation of time from end-to-end, thatDot users trigger new actions from the data itself.

We’re working with some of the world’s most remarkable companies to bring insight and understanding to inhuman amounts of data. The time it takes to conceive and build a new data pipeline, to test a data hypothesis, or add a product feature is often measured in months. thatDot is turning this into hours or minutes. And today’s question, which overnight batch processing would answer tomorrow, can now be answered in less time than it takes to click the checkout button.

The last few decades have seen nearly every company become a software company, and every software company has become a data company. Understanding that data well enough and fast enough to take action currently takes a small army of highly skilled software engineers, and the process is complicated and error prone, leaving many desired capabilities nothing but a distant dream. At thatDot, we believe a new product in this space can liberate engineering teams from crippling complexity, deliver previously unachievable capabilities, and bring enterprise companies back to building their businesses.

Ryan Wright
Founder & CEO
thatDot, Inc.

thatDot Raises Funding To End Microservices Complexity
2020-05-26T00:00:00+00:00

May 26, 2020, Portland, OR. thatDot announces a $2M funding round led by Oregon Venture Fund (OVF), with participation by Hale Capital Partners and Galois, Inc.
Leveraging years of DARPA-funded R&D, thatDot is the creator of thatDot, an enterprise software solution for the real-time discovery and navigation of data relationships in highly connected data, such as monitoring, security and commerce event streams. thatDot can ingest billions of events while building a rich relationship graph, identifying correlations, isolating anomalies, and triggering workflows, to unlock the value of big data in real time. thatDot powers use cases such as real-time root cause analysis, online video observability, streaming anomaly detection, data lineage, fraud detection, and application security tracing, reporting, and alerting.

Solving for Microservice Complexity
As a foundational element of an event-driven software architecture, thatDot accelerates new service development and improves user experience for enterprise organizations by unlocking the value of their big data in real time.

“Big data applications are predominately built on microservices architectures that require scarce technical talent and significant operational overhead. It’s been the only way to build highly scalable applications and services, until now. thatDot’s simplified event stream processing and analysis capabilities are a revolutionary step forward: it’s how big data applications will be built in the future,” said Nick Wade of Oregon Venture Fund.

Revolutionary Technology
“It is exciting to see this revolutionary technology, developed to satisfy the forward-looking requirements from DARPA, reach the market,” said Rob Wiltbank, CEO at Galois. “thatDot unlocks the value of streaming big data, combining high volume capabilities with automated intelligent data analysis, it will dramatically change back-end software development.”

thatDot’s distributed stream-processing fabric ingests and stores event data, combined with a semantic graph layer to accelerate complex queries over large amounts of data, spread over broad time spans.
This unique combination of technology enables several critical capabilities:
– Real-time pattern recognition in streaming data
– Real-time anomaly detection in streaming categorical data
– Complete data change tracking, for every update made to data and easy historical queries
– Unified “Standing Queries” that instantly trigger custom actions on complex stream patterns
– Low code usability

Repeat Entrepreneurs
thatDot is led by repeat technology entrepreneurs from companies such as Citrix, Urban Airship, Motorola, Janrain and Cedexis.

“Raising this foundational round of capital during Covid-19 is a strong endorsement of thatDot’s value proposition and market momentum,” said Ryan Wright, Founder and CEO of thatDot. “The world is realizing that microservices are too complex and expensive to manage and orchestrate, and thatDot offers a compelling new way to program back-end operations directly from event data streams.”

For more information, please visit www.thatdot.com to explore our solutions or sign up for periodic updates.

thatDot

Quine 2.0 Released!

The Secret Ingredient in the Alphabet Soup of Cybersecurity

Stream Processing World Meets Streaming Graph at Current 2024

Advantages of thatDot Streaming Graph
Some of the advantages the attendees of Confluent told us they found most compelling about thatDot Streaming Graph included:

Ryan's talk on "Streaming Entity Resolution for Kafka"

Streaming Graph Get Started

Streaming Graph for Real-Time Risk Analysis at Data Connect in Columbus 2024

The Power of Real-Time Entity Resolution with Ryan Wright

Cypher all the things!

Learn more and Try Quine
If you want to try Quine using your own data, here are some resources to help:

thatDot CEO Explains Streaming Graph to Cybersecurity Thought Leader

Streaming Graph Processing on Categorical Data Enables Real-time Risk Calculation

Akka to Pekko Migration for thatDot and Quine

Microservice Hell: The State of the Art in Streaming Services

Maximizing Maintainability
thatDot Streaming Graph can ingest data from multiple sources simultaneously, removing the need to create multiple bespoke services. You can ingest data from:

Novelty Demo

Novelty Tutorial

Novelty Score Results
thatDot Novelty's Score Results response returns the observation score, along with additional useful information. Here is some of that data:

Release Announcement for thatDot Streaming Graph 1.6.1 with ClickHouse Persistor

Can Streaming Graphs Clean Up the Data Pipeline Mess?

Optimize Digital Twins to Real Time

Stop Querying Your Data

Streaming graph analytics: ThatDot’s open-source framework Quine is gaining interest

Understanding Batch VS Streaming Data

Understanding the Scale Limitations of Graph Databases

What is Categorical Data?

Quine Aims to Simplify Event Processing on Data in Motion

ThatDot accelerates streaming data analytics with open source Quine

thatDot launches Quine, a streaming graph engine

thatDot Launches Streaming Graph Platform

Authentication Fraud

Financial Fraud Detection

The Problem
Financial fraud detection requires monitoring billions of transactions, devices and users in real-time for suspect behaviors without false positives that alienate customers when service is denied in the middle of a foreign vacation or late-night business event.

The Solution
What is needed is a system that does four things:

Graph AI

Log Analysis

Real-time Blockchain Fraud Detection

Key Value Take Away
– Sub-5ms access to complete trace history
– Adapt to new blockchains rapidly with streaming ETL built in
– Real-time materialization of wallet, block and transaction state
– On-premise software to deploy in your data center or cloud of choice

Stateful Digital Twin

Streaming Graph ETL

Video Observability for Root Cause Analysis

Novelty Technology

Advanced Persistent Threat (APT) Detection

Real-Time IoB Threat Hunting

Real-time AWS CloudTrail Threat Detection

The Solution

Webinar: Approach Zero False Positive Cyber Alerts

4 Advantages to Streaming Analytics in Graph Form

The Future of Modern Threat Hunting is Streaming Graph

The Problems Quine Solves
Quine solves some hard problems in this role. Let’s take a look at a few of the major points:

Monitoring Quine Streaming Graph using Grafana + InfluxDB

Quine Metrics
Quine reports three classes of metrics: counters, timers, and gauges.

TIP!
When queried, the metrics summary API endpoint reports the same metrics as a metrics reporter.