In my last post, we started digging into the re-emergence of complex event processing and event-driven architectures, enabled by the latest generation of stateful stream processors such as Spark Streaming, Samza, Kafka Streams, Apache Flink, and Google Dataflow. Today let's start to develop a clear line of sight into the underlying conceptual space, to help us make good event-related design decisions going forward.
If you take a hard look at how data, how information, is actually used by people, there is a moment, frequently just-in-time, when the data becomes knowledge: when some particular meaning is extracted, filtered, assembled, or aggregated, depending on how the data is structured, from the storage and transmission mechanism, whether that is the stream, the database, or the file system.
Consider a traditional relational database: Aurora, or Oracle, or DB2. While there is some broad meaning captured in the database structure, more or less clearly depending on how normalized it is, the knowledge it contains is described in the queries that are used against it.
Every shop has a host of such queries. Few, in my experience, formally catalog and reuse them. More’s the pity, because they represent the key to understanding.
What’s really going on? Time to climb the stairs to the top of the ivory tower to get a better view.
There is some domain of knowledge. That knowledge is about some set of related entities and objects. Its fountain is the instrumentation of the domain, the observation of the domain via that instrumentation, and the reporting of what was observed: events, the creation, change, and termination of entities and objects and their relationships. For entities and objects that exist before the instrumentation, observation, and reporting have begun, transmitting knowledge about the domain requires that the complete state of a given entity or object, a snapshot of its attributes, a 'state zero', be explicitly acquired.
(I started to recap my 'real things versus virtual things' soapbox speech here, but didn't feel like writing it again, preferring to roll on with the main thread; I have a meeting in 12 minutes : ). For now, in a slightly different context, it is accessible in this post on the fundamental problem blockchain addresses.)
There are a host of ways of acquiring, persisting, and communicating knowledge about a domain. If we assume accuracy and completeness for now, and set aside all that we build around achieving them, the acquisition, persistence, and communication of knowledge is largely a function of synthesizing and exposing it in an interface between agents, in terms of the ontology of the domain: the conceptual framework in which the entities and objects and their relationships are defined, and in which the valid state transition rules are captured. If the format in which knowledge is communicated and stored cannot express the ontology, the knowledge loses coherence.
Let's return to our relational database query example. Part of the ontology is captured in the tables' structures: a Person has a name, a biological sex, a gender, a height, a weight, and so on. Part of the ontology is captured in the relationships among tables: a Person has another Person, a biological male, who is their biological father, and a different Person, a biological female, who is their biological mother. Part of it is captured in uniqueness constraints on columns or sets of columns, e.g. there can be only one Gibson guitar with a given serial number. And part of the ontology may be captured in database insert, update, and delete triggers that only permit valid state changes according to the rules of the ontology.
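To make that concrete, here is a minimal sketch of how those layers might be captured in SQL DDL. The table and column names are hypothetical, and trigger syntax varies enough by vendor that the state-transition rule is only noted in a comment:

-- Ontology in table structure: a Person and its attributes.
CREATE TABLE PERSON (
    PersonId           INT PRIMARY KEY,
    Name               VARCHAR(200) NOT NULL,
    BiologicalSex      CHAR(1) CHECK (BiologicalSex IN ('M', 'F')),
    Gender             VARCHAR(50),
    HeightCm           DECIMAL(5,1),
    WeightKg           DECIMAL(5,1),
    -- Ontology in relationships among tables: parents are other Persons.
    -- (That the father is biologically male is a rule the foreign key
    -- alone cannot express; an insert/update trigger would enforce it.)
    BiologicalFatherId INT REFERENCES PERSON (PersonId),
    BiologicalMotherId INT REFERENCES PERSON (PersonId)
);

-- Ontology in uniqueness constraints: only one Gibson guitar
-- with a given serial number.
CREATE TABLE GUITARS (
    GuitarId     INT PRIMARY KEY,
    Make         VARCHAR(50) NOT NULL,
    SerialNumber VARCHAR(20) NOT NULL,
    CONSTRAINT UQ_MakeSerial UNIQUE (Make, SerialNumber)
);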
A query made against the database is looking for some particular piece of knowledge, e.g. “what were all the Gibson guitars produced January 10th, 1978 in Gibson’s Kalamazoo factory?”
Today, the agent who writes the query has to know how the knowledge is 'encoded' in the domain, and explicitly craft the query in light of that knowledge. The explicit mapping from ontology to persistence structure is very rarely done, making the black art of knowing how to query a given database to find something out a valuable commodity in most organizations. (At Cambia there are a handful of people with such deep knowledge of the primary operational system for the health plans. As you might imagine, they are much in demand. And as data science and, in turn, AI have established a presence, those experts have tended to gravitate to those teams.)
The person writing the guitar query would have to know that from 1977 until the Kalamazoo factory closed, Gibson serial numbers were 8 digits long, encoding the year, day, and factory according to this scheme: YDDDYRRR, where RRR is the factory ranking/plant designation number. And they would have to know that the number for Kalamazoo was between 001 and 499 inclusive. So their query might end up looking something like this:
SELECT *
FROM GUITARS
WHERE Make = 'Gibson'
  AND SerialNumber LIKE '70108___'
  AND CAST(RIGHT(SerialNumber, 3) AS INT) BETWEEN 1 AND 499;
The string parsing and casting functions frequently come in slightly different flavors from database to database, so they'd have to know that, too.
Of course, the database may have been structured in a more normalized way, so that the manufacture date and location were derived from the serial number when the record was inserted. Or a custom function or two may have been created to extract the date and factory from the serial number. A 'lookup' table may have been created to capture the mapping from factory number to factory, with 499 rows mapping some number to 'Kalamazoo'. Etc., etc.
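Here is a minimal sketch of that more normalized variant, again with hypothetical names. The facts encoded in the serial number are extracted once, at insert time, so the query can be asked in terms of the ontology rather than the encoding:

-- Lookup table mapping factory numbers to factories;
-- numbers 001 through 499 would all map to 'Kalamazoo'.
CREATE TABLE FACTORIES (
    FactoryNumber INT PRIMARY KEY,
    FactoryName   VARCHAR(50) NOT NULL
);

-- A variant of the GUITARS table with the derived facts stored.
CREATE TABLE GUITARS (
    GuitarId        INT PRIMARY KEY,
    Make            VARCHAR(50) NOT NULL,
    SerialNumber    VARCHAR(20) NOT NULL,
    ManufactureDate DATE NOT NULL,  -- derived from the serial number on insert
    FactoryNumber   INT NOT NULL REFERENCES FACTORIES (FactoryNumber)
);

-- The same question, now asked in the terms of the domain:
SELECT g.*
FROM GUITARS g
JOIN FACTORIES f ON f.FactoryNumber = g.FactoryNumber
WHERE g.Make = 'Gibson'
  AND g.ManufactureDate = DATE '1978-01-10'
  AND f.FactoryName = 'Kalamazoo';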
The point is, the knowledge structure is captured in the ontology. That ontology exists whether we write it down or not. The mapping to it from a given database, or file format, or JSON schema, exists whether or not we capture it explicitly.
In most shops today, the ontology and the mappings from given implementations to it exist only in the minds of the users. To the extent their common dream is inconsistent, they do not have a shared understanding of the domain.
So, finally, back to events. A meaningful event is akin to a meaningful query. Both represent the extraction and communication of some particular knowledge of a domain. Just as a set of such queries exists in every shop, and should be formally captured and cataloged as one of a company’s most valuable data resources, a set of such event definitions should be captured and cataloged.
And both should be explicitly tied to an overarching ontology for the domain.
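As a sketch of what such a cataloged, ontology-tied event definition might look like, persisted here as a SQL table with hypothetical names (a schema registry entry would serve the same purpose):

-- A GuitarManufactured event: the observation that a Guitar
-- entity came into existence, expressed in the domain's own
-- terms rather than in the serial-number encoding.
CREATE TABLE GUITAR_MANUFACTURED_EVENTS (
    EventId         INT PRIMARY KEY,
    ReportedAt      TIMESTAMP NOT NULL,  -- when the observation was reported
    Make            VARCHAR(50) NOT NULL,
    SerialNumber    VARCHAR(20) NOT NULL,
    ManufactureDate DATE NOT NULL,
    FactoryNumber   INT NOT NULL REFERENCES FACTORIES (FactoryNumber)
);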
Next time we will look at how our events play into our overall data and application architectures.
Stay tuned.