What’s the difference between Categorical and Numerical Data?

By thatDot

According to a 2020 MicroStrategy survey, 94% of enterprises report that data and data analytics are crucial to their growth strategy. And yet, surprisingly, as much as 73% of the data that enterprises collect is never used, including the vast majority of what is termed “categorical data.”

Why would enterprises ignore an entire class of data, especially when it is essential to high-priority use cases like personalization, customer 360, fraud detection and prevention, network performance monitoring, and supply chain management? The simple answer is that using categorical data with today’s tools is complex, and most data scientists aren’t trained to use it.

Figuring out how to use categorical data will help companies solve complex problems that have long evaded them. And they’ll be able to do so with data they already have. Here’s a look at categorical data, why it’s hard to wrangle, and how it could be useful.

Categorical Data 101

There are two main types of data: categorical and numerical. Numerical data, as the name implies, refers to quantities that can be measured or counted. Categorical data is everything else, including numbers that serve only as labels.


Categorical data is non-numerical information that is divided into groups.

As its name suggests, categorical data describes categories or groups. Some examples of categorical data could be:

  • A list of most popular baby names;
  • Census data, such as citizenship, gender, and occupation;
  • ID numbers, phone numbers, and email addresses;
  • Brands (Audi, Mercedes-Benz, Kia, etc.).

In some instances, the same information can be recorded in either numerical or categorical form. For example, a weather forecast can be expressed as “60% chance of rain” (numerical) or as “partly cloudy” (categorical). Both convey much the same thing to a human reader, but the data takes a different form.

The Challenges of Categorical Data

The same thing that makes categorical data so powerful makes it challenging. While it is easy for you and me to tell the relative difference between a dog and a plane versus a dog and a cat, doing so computationally is not so straightforward. To express the difference between two pieces of categorical data, one must use graph-based analytical tools or have a background in graph theory. This is why “knowledge graphs” have been a recent hot topic. Since graph tools are not so widespread in today’s enterprise and academic landscape, data scientists instead fall back on the statistical techniques they know and for which there are ready tools.
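To make the dog, cat, and plane comparison concrete, here is a minimal Python sketch. It assumes a tiny, hand-built “is a” hierarchy; the categories, the `IS_A` table, and the helper names `build_neighbors` and `hops_between` are invented for illustration and do not come from any particular graph product. The difference between two values is approximated as the number of hops between them in that hierarchy.

```python
from collections import deque

# Hypothetical "is a" hierarchy, invented for illustration only.
IS_A = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "thing",
    "plane": "vehicle",
    "vehicle": "thing",
}

def build_neighbors(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    neighbors = {}
    for child, parent in is_a.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    return neighbors

def hops_between(start, goal, neighbors):
    """Breadth-first search: fewest edges between two values, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

NEIGHBORS = build_neighbors(IS_A)
dog_cat = hops_between("dog", "cat", NEIGHBORS)      # dog -> mammal -> cat
dog_plane = hops_between("dog", "plane", NEIGHBORS)  # path runs through "thing"
print(dog_cat, dog_plane)
```

In this toy hierarchy a dog is two hops from a cat and five hops from a plane, which matches the intuition described above. Real graph tooling would use richer relationships than a single “is a” link, so treat the hop count as a stand-in for more sophisticated measures.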

Most machine learning algorithms can only handle numerical data. They can count instances of categorical values, which has real but limited utility. The alternative is to turn categorical data into numeric values using one of several encoding techniques. These techniques all tend to be slow and produce poor results, even making some goals, like anomaly detection, impossible. Using categorical data comes with another challenge: high cardinality.

Cardinality refers to the number of distinct values that a particular categorical field can take. For example, the cardinality of a list of all models of iPhone ever made is a relatively manageable 34. On the other hand, a list of serial numbers for all 2.2 billion iPhones sold since production began represents a high-cardinality data set.
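As a small illustration of the idea, the Python sketch below measures cardinality on a handful of made-up device records. The records, the field names, and the `cardinality` helper are assumptions for the example, not real data.

```python
from collections import Counter

# Hypothetical device records, invented for illustration.
records = [
    {"model": "iPhone 12", "serial": "SN-0001"},
    {"model": "iPhone 12", "serial": "SN-0002"},
    {"model": "iPhone 13", "serial": "SN-0003"},
    {"model": "iPhone 12", "serial": "SN-0004"},
    {"model": "iPhone 13", "serial": "SN-0005"},
]

def cardinality(rows, field):
    """Number of distinct values observed for one field."""
    return len({row[field] for row in rows})

model_cardinality = cardinality(records, "model")    # few distinct values
serial_cardinality = cardinality(records, "serial")  # one value per device

# Counting instances works well for the low-cardinality field.
model_counts = Counter(row["model"] for row in records)
print(model_cardinality, serial_cardinality, model_counts)
```

The model field stays small no matter how many devices are added, while the serial field grows by one with every new record. Counting instances per model is informative; counting instances per serial number tells you almost nothing.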

The size and complexity of traditional analytical approaches spiral quickly out of control with high-cardinality data. Additionally, almost all tools for turning categorical values into numbers (like one-hot encoding) require a fixed set of possible values known in advance.

Because many values in a high-cardinality data set are not known in advance, this poses a problem: those tools cannot represent data they have never seen. With all these challenges, you can begin to understand why enterprises end up ignoring categorical data altogether.
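The fixed-vocabulary limitation can be seen in a hand-rolled version of one-hot encoding. This is a minimal sketch in plain Python, not the implementation of any specific library. The helper names `fit_vocabulary` and `one_hot` are hypothetical, and “Volvo” is an invented example of a value that was not present when the vocabulary was fixed.

```python
def fit_vocabulary(values):
    """Fix the set of known categories, in a stable sorted order."""
    return sorted(set(values))

def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the value's position."""
    if value not in vocabulary:
        raise ValueError(f"unseen category: {value!r}")
    return [1 if known == value else 0 for known in vocabulary]

brands = ["Audi", "Kia", "Mercedes-Benz", "Audi"]
vocab = fit_vocabulary(brands)
encoded_kia = one_hot("Kia", vocab)

# A value that was not in the data when the vocabulary was fixed.
try:
    one_hot("Volvo", vocab)
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(vocab, encoded_kia, unseen_ok)
```

Each encoded vector is as long as the vocabulary, so the width of the encoding equals the cardinality of the field. That is manageable for a few car brands and unworkable for billions of serial numbers. The unseen brand cannot be encoded at all without rebuilding the vocabulary.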

So, What Can You Do with Categorical Data?

The enormous and unrealized value of categorical data for enterprises resides in its ability to represent the relationships between values in a way humans can readily understand and express. These relationships can include all the properties associated with an object – I am tall, blonde, married, and have two children – or the relationship between two objects – I wrote this article, and you are reading this article.

You can use categorical data to efficiently group and connect classes of objects; for example, you can show all tall, blonde, married authors and the readers of their articles organized by geographic area and hobby. In doing so, you can uncover some unique insight and analysis. When you combine this “relationship thinking” with a computer’s ability to process enormous amounts of data, the astonishing power of categorical data becomes apparent.

The Strengths of Graph Technology

With the emergence of graph technology in recent years, enterprises can finally represent these relationships directly. A graph is built of nodes and edges; you can picture this with circles for nodes and arrows for edges that connect nodes. The node-edge-node pattern connects two categorical values (nodes) by a relationship represented by the edge.

This is a natural way to represent data because the node-edge-node pattern corresponds closely to the subject-predicate-object pattern at the core of natural human language. So most things you can say in words can be represented naturally in a graph. We can then analyze the relationships between values by following the connections between categorical data in the graph.
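The node-edge-node pattern can be sketched in a few lines of Python as a list of (subject, predicate, object) triples. The people, articles, predicate names, and helper functions below are all hypothetical, chosen to echo the tall, blonde author example from earlier. This is a toy in-memory structure, not how a production graph system stores data.

```python
# Hypothetical facts, invented for illustration, as (subject, predicate, object).
triples = [
    ("alice", "wrote", "article-1"),
    ("alice", "has_hair", "blonde"),
    ("alice", "has_height", "tall"),
    ("bob", "wrote", "article-2"),
    ("bob", "has_height", "tall"),
    ("carol", "read", "article-1"),
    ("dave", "read", "article-1"),
    ("dave", "read", "article-2"),
]

def subjects(predicate, obj):
    """All nodes with an outgoing `predicate` edge pointing at `obj`."""
    return {s for s, p, o in triples if p == predicate and o == obj}

def objects(subject, predicate):
    """All nodes reached from `subject` along a `predicate` edge."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def readers_of_authors(authors):
    """Follow author -wrote-> article <-read- reader connections."""
    readers = set()
    for author in authors:
        for article in objects(author, "wrote"):
            readers |= subjects("read", article)
    return readers

tall_blonde = subjects("has_height", "tall") & subjects("has_hair", "blonde")
readers = readers_of_authors(tall_blonde)
print(sorted(tall_blonde), sorted(readers))
```

Each triple reads like a short sentence, and the query is answered by walking edges rather than by converting values to numbers. New values, such as a new reader or article, are added as new triples without redefining a fixed vocabulary.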


Graph data structures connect information in a way that resembles the way we speak and think.

The challenge of using categorical data is like having a pantry of canned food and no can opener. There’s food there, but you have no tools to access it. Instead of looking at the same data with the same approach, the next generation of streaming graph data tools needs to make categorical data more accessible and usable.

We already see the success of categorical data as the key to improving anomaly detection in cybersecurity. But it’s only now that the tools for using this data to solve challenging problems are becoming available.

thatDot Software for Categorical Data Processing

thatDot streaming graph software is built specifically for categorical data. It combines a graph data structure (like Neo4j or TigerGraph) with the performance and scale of event processing systems like Flink and Spark Streaming. thatDot Novelty, built on thatDot Streaming Graph, is the first anomaly detection system to use categorical data, making it uniquely powerful. thatDot Streaming Graph is powered by Quine open source software. You can try it yourself either by downloading Quine or starting a Streaming Graph free trial. Learn how to ingest your own categorical data and build a streaming graph that can detect all sorts of attacks in real time.

  1. Try Streaming Graph free for yourself.
  2. Learn more about thatDot Streaming Graph.
  3. Join the Quine Discord Community and get help from thatDot engineers and community members.
  4. Check out the Ingest Data into Quine blog series, covering everything from ingesting from Kafka to ingesting CSV data.
  5. Download open source Quine – JAR file | Docker Image | GitHub

This article, in a slightly altered form, first appeared in Datanami on July 25th, 2022. Photo by JJ Ying on Unsplash.