Stream processing systems and microservices are complicated! They’re complicated because they rely on many people to come up with complex solutions to a vast array of deep problems from several domains. Stream processing and microservices combine the hardest problems from the database world with the hardest problems in distributed systems. Each of these domains are vast with a long list of difficult challenges because of the interrelating of many fundamental or mathematical constraints. These fundamental constraints are unavoidable—they will surface in any system that stores data and has more than one processor.
Add to this even more complications from yet another dimension—time. Systems are built to compute the current answer, but it’s common to receive requests for past moments. Who changed this value, when, and what was it previously? Which timestamp should be canonical—the client’s or the server’s? —microservice A or microservice B? When it takes a few seconds or even minutes to process a query, can you really afford to give up visibility of the current moment? When is the right moment to take action after a complex data pattern has finally been built from my streaming data? Each of these time-related concerns further complicate the questions of data storage and stream processing.
But what if we could start again? —building a system from the ground up which confronts each of these fundamental problems honestly and deeply, but from a holistic perspective and with modern goals in mind. Plan for streaming data and real-time responses. Plan for distributed systems and granular data storage. Plan for complex questions on historical values and rich responses with real-time constraints. After six years of research and development, and with major funding from DARPA, this is exactly what the team at thatDot has done. We call it: thatDot Connect.
thatDot Connect bridges the worlds of streaming distributed systems and databases. On one hand, this system feels like a database but can actually handle being at the center of an enterprise-scale streaming workload. The system works as a massive distributed system computing real-time answers, but also stores historical data and makes it as easily accessed as a database. Let’s explain how…
Main design concepts
The master abstraction
The primary data model for thatDot is a graph. A graph consists of nodes connected to each other by edges. The edges have a direction and a label, and connect exactly two nodes. This graph organization of the data is very powerful—and a natural representation for how humans often think about data. The Node-Edge-Node pattern in a graph corresponds directly to the Subject-Predicate-Object pattern of the English language (and others). Each node in the graph can contain any number of properties. Properties are a set of Key/Value pairs—like a JSON object or a Python dictionary.
With this property graph model, all the same data that could be represented (with additional processing) in a relational database has a direct representation in a graph. A node represents a row in a table. Attributes or columns for that row correspond to properties on the node. Edges that connect nodes correspond to foreign keys pointing to rows in another table. Since there aren’t separate tables dividing the data, each node has a unique ID. This ID is like a primary key in a single table—it is always unique and provides fast access to that specific data.
Connected data: relationships are the key
The obvious value of some stored data is the literal content of that data. What number does it represent? What name is stored there? For decades, systems have focused on making use of this value in the data. As we’ve gained more data, the natural attempts to extract value from that data shift the focus to questions about how one piece of data is related to another. How data relates to other data is paradoxically difficult to understand in relational data stores. Relating data to other data requires expensive join operations across tables—often iterating through huge numbers of rows each time. A graph model puts the data relationships at the same level as the data values by encoding relationships as edges. Discovering relationships among data items is as simple as traversing across an edge. In a sense, joins are precomputed (when it’s very cheap) so that calculating data relationships becomes a trivial task.
In thatDot Connect, the data model is also the computational model. Computation is implemented as a native graph interpreter. A query, data, or an instruction is dropped into the graph and relayed through the network of nodes and edges to compute the appropriate answer or trigger the action desired. As explained below, this process is highly parallel, fully asynchronous, and profoundly powerful.
Unified data representation and computation: Actors
The graph interpreter at the heart of thatDot Connect is built on top of an old idea: the Actor Model, first described by Carl Hewitt in 1973. An actor is a lightweight, single-threaded process that encapsulates state and communicates with the outside world only through message passing. An actor has a mailbox where it receives messages. An actor’s computational role is only to process the messages that arrive in its mailbox. Actors are very inexpensive, and it is common to have anywhere from hundreds to millions of actors inside one actor system—on a single machine or clustered across any number of machines. The actor system handles the scheduling of actors on a CPU or thread pool. Actors can be spread across a cluster of computers transparently.
At a first approximation, each node in the dataset gets backed by one or more actors on demand. When computation needs to be done related to one particular node in the graph data set, the actor that represents that node is sent the appropriate messages. The system overall manages how many and which actors are live and ready to participate in computation vs. which actors save data on disk and have no current presence in-memory. Flexibility in this aspect allows the system to operate very quickly—taking full advantage of the semantic caching described below—or to operate on low-power machines with a very small resource footprint.
Since thatDot’s graph interpreter is implemented in an actor system, and all communication is done by message-passing into mailboxes, the entire interpreter operates in a fully asynchronous fashion. This allows for very efficient scheduling of computation on a work-stealing fork-join thread-pool, and no blocking during IO operations.
Actors are scheduled for computation concurrently with other actors. This makes the overall system computation highly parallel. Under heavy load, the scheduler can very naturally utilize all available threads or CPU cores by scheduling multiple separate actors to process their messages concurrently. Actor scheduling can be done on a highly-efficient “work-stealing” fork-join thread pool, taking advantage of all CPU resources available on the machine.
The choice of a graph representation for data and computation lets thatDot Connect very efficiently use the connections present in the data as meaningful information for determining which data to keep active in memory. The result is a remarkably high cache hit rate. With actors representing nodes in the graph, effecting the changes to a node and triggering updates to the rest of the system gets timed together so that continuing computation occurs at the optimal moment.
Treat time as a first-class citizen
The most relevant view of data is usually a view into the current state of the data. But it is becoming increasingly important to be able to answer historical queries about what the data used to be. thatDot Connect was created with time represented as a first-class citizen from the ground up. This makes querying historical data very easy, and also has significant advantages for managing complexity related to time in a massive distributed system.
Balance: schema-less flexibility with schema-full data structure
A graph structure of data and computation provides the ideal sweet spot between: a.) flexibility in the schema, so that you don’t always have to know the shape of the data ahead of time before the system is initially set up, and b.) ability to compute on the data reliably and add dynamic constraints to guarantee that some data changes are disallowed in the appropriate ways. This capability depends on the graph data model as well as the computation done on a node-level.
Entity resolution built in
Data often has duplicates. Sometimes data is slightly different, but is meant to refer to the same entity. thatDot Connect is able to build in this entity resolution out of the box. Custom ingest functions allow users to define which characteristics make an entity unique, and then guarantee that entity resolution will occur on future ingests.
High throughput connected data
Graph data systems have been around for at least a decade. These systems have chronically suffered from slow speeds related to the data storage and processing models imported from the prior half-century. Because of the fresh design of thatDot’s technology aiming to fill the void between databases and stream processing systems, we have been able to achieve remarkable streaming performance while achieving incredibly high throughput while building the graph model. We can now see data in its fully-connected form at modern enterprise scales.
Query the present (in real-time!)
A stream processing system must compute the desired answers in real-time as data streams by. Achieving this usually requires limiting the complexity of the questions that can be asked—because complex questions could require old data no longer available, or fetching it and computing the answer would slow the system down to an unacceptable degree. With thatDot Connect complex questions can be resolved in real-time alongside new data streaming in at high volume.
Query the past
Since thatDot Connect saves the full event log of all historical data, getting to historical answers is a trivial exercise. Replaying only the relevant nodes, thatDot Connect provides a fully-versioned graph so that absolutely any historical moment can be queried just as easily as the present. Just run your query with the desired timestamp.
Query the future
thatDot Connect has the unique ability to set a “standing query.” Analogous to a standing wave, this standing query lives inside the system, immediately propagating partial query results to the right places automatically while new data is streaming in. This capability allows a user to issue a standing query, then trigger any desired action every single time the query has a new result. This capability is a game-changer for backend data processing. Through setting several standing queries, new functionality in complex back-end systems can be achieved in mere minutes.
Out of order data-handling
To operate quickly, stream processing needs to occur on several machines with computation done in parallel. Parallel processing always guarantees that some ordering is lost. The order is given up to get the speed of parallel processing. But this can often affect the correctness of streaming systems which rely on the order of events to preserve a correct interpretation of the data. Getting data in the correct order is vital for finding patterns in many microservices systems.
thatDot Connect has the unique capability to issue queries that live inside the data and automatically propagate their results at the perfect moment. This makes concern about out-of-order data processing vanish. A query is issued and propagated through the graph to complete the correct pattern regardless of what order the data arrives in. Even if data arrives entirely backwards, thatDot will provide the correct answers immediately when the relevant data has arrived.
High-level interface: “Find that, do this.”
Business logic is the goal. Your business logic is your business—that’s why your software systems are created: to accomplish those business goals. It’s time to deprecate the old pattern of chopping up business logic into unnatural and unrecognizable pieces just so they can be implemented by microservices engineers.
At thatDot, we set out to create an interface that allows our customers to focus on their business logic on their own terms. Leave the mechanics for how to execute that logic to the graph interpreter. Instead, a user just defines 1.) what do you want to find: “find that”, and 2.) what do you want to do when you’ve found it: “do this.” The mechanics of highly efficient stream processing are generated automatically by our compilers and executed by our interpreter. “find that do this”
This pattern is easily to visualize and profoundly powerful—especially when a user starts combining several standing queries and doesn’t have to worry about out-of-order data processing. It is powerful enough to be a complete tool for programming back-end data processing systems, and simple enough to keep the business logic clearly in focus.
Customizable data storage
Data storage tools are a rather personal choice. We built thatDot Connect to give you that choice alongside all the other capabilities described here. Our data storage components are swappable, so you can use the existing data storage tools you prefer. Whether that’s relational databases, No-SQL stores, on-machine or off-machine or in the cloud, spinning-disk- or SSD-optimized, flat or structured—you choose. thatDot Connect integrates with several leading storage vendors and tools, with the possibility for additional customization to pull together several options, or integrate with other legacy systems.
Your company will never stop processing data. Your data isn’t going to shrink. There aren’t going to be fewer data sources in the future. Data will never stop. If your current strategy isn’t assuming infinite data, then you are committing to rebuild your data pipeline in 6, or 12, or 18 months yet again. It’s time to step off the treadmill.
thatDot Connect is a streaming data platform built literally for infinite data sets. Our storage systems can live off-machine and use managed cloud storage to grow with infinite data. Meanwhile, our streaming technology essentially uses the connections of new data as an index into older data. Partial query results are stored with archived data so that when a new piece of data is connected to that old data, the partial pattern from months or years ago is maintained and immediately pulled back in to continue the pattern from where it left off.
This is all done without forcing arbitrary windows of time or excluding some data because it is too old or wasn’t within a certain small time range. All data gets included if needed, regardless of how old it is. This is achieved by using edges—the connections from one piece of data to another—as the trigger for when to continue processing a pattern on older data. When new data completes a pattern that includes very old data, it is all pulled together in the newly completed result now available.
Trigger action in real-time
Finding complex patterns in real-time is a profound capability, but we’ve taken the next step as well: trigger action. When a standing query is defined, the user can choose from a menu of actions to take. These actions can be as simple as logging an event, or publishing it to a message queue. They can call back in to the system to update or annotate the data (which then helps when composing multiple standing queries together). Actions can call out to external systems, passing to them the relevant data from the pattern just detected. Actions can even be configured to execute custom code or implement complex algorithms. These actions help our users program and integrate their back-end systems.