At Any Event, or at Every Event?

Still on the trail of conceptual clarity around events. In the last post in this – this is the third, so let’s say ‘series’ – we discussed the difference between knowledge and data. I used the familiar-to-most query against a relational database to illustrate the difference. Let’s return to that example for a moment.

The number of possible queries you can make against a particular database is not infinite, but thanks to combinatorial explosion it is such a huge number for most databases that it may as well be.

Consider a SQL Server database with only two columns, A and B, each of the ‘bigint’ data type. Possible values in each column range from -2^63 to 2^63 - 1, so each column can hold 2^64 distinct values. Given a simple query of the pattern SELECT * WHERE A=x AND B=y, we could make (feel free to comment on and correct any simple math error I am about to make : ) 2^64 × 2^64 = 2^128 queries using all possible discrete values of x and y. That is roughly 3.4 × 10^38 – a 39-digit number. Running a billion distinct queries every second, you would need nearly a trillion times the current age of the universe to get through them. By comparison, a recent estimate of the number of particles in the universe – quarks, gluons, everything – is a number around 81 digits long; our two-column equality queries don’t get there, but allow range predicates or IN lists and the space of possible queries dwarfs even that almost immediately.
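A few lines of Python, just to sanity-check that arithmetic (the billion-queries-per-second rate is an arbitrary yardstick picked for scale, not a benchmark):

```python
# Sanity check on the query-count arithmetic above.
DISTINCT_BIGINTS = 2 ** 64           # values one bigint column can hold
queries = DISTINCT_BIGINTS ** 2      # one equality query per (x, y) pair = 2^128

print(f"distinct equality queries: {queries:.2e} ({len(str(queries))} digits)")

# Arbitrary assumption: we can somehow run a billion distinct queries per second.
AGE_OF_UNIVERSE_SECONDS = 4.35e17    # ~13.8 billion years
multiples_of_universe_age = queries / 1e9 / AGE_OF_UNIVERSE_SECONDS
print(f"at 1e9 queries/sec: ~{multiples_of_universe_age:.0e} times the age of the universe")
```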

The range of meaningful queries is, of course, much, much smaller. That strikes me as a function of the ontological model the database is physicalizing – the query space is bounded by the point at which it tips into nonsense. (Which makes me think of ‘A Confederacy of Dunces’ for no reason obvious to me…)

And the range of meaningful queries we need to use in practice is but a small subset of all meaningful queries.

So why do we build elaborate distributed systems that move entire data sets around to expose any possible query to interfaces whose consumers may only use a dozen? That is like laying out an elaborate smorgasbord every day for a couple of customers who only eat the Swedish meatballs and maybe nibble a little gravlax with mustard-dill sauce: that’s a lot of wasted pickled herring, Jansson’s Temptation, julskinka and glögg.

The reason, of course, is caching for performance and availability. Our operational systems of record may not be as available as we want the interface to be. They may not scale easily or well under additional load. And our plumbing to move the query and reply around in near-real-time may not be up to the task. We also frequently take the opportunity to scrub and buff the data till it’s shiny, and to combine data sets from different sources on common dimensions.

Massive data virtualization appliances from dedicated companies such as Denodo and Actifio, do-everything vendors such as IBM, Oracle, and Informatica, and application and integration-ware vendors such as JBoss and Tibco (moving into the complementary data space) try to solve this in a big-play way.

But in the health care business – despite the economy having been strong – the low-to-mid eight-figure hurdle for that big play is just too high.

And assembling enterprise solutions from massive Lincoln Logs like that runs counter to the evolutionary architecture approach we are realizing – on a path Neal Ford blazed – is our best sustainable long-term bet.

So a vendor says ‘data virtualization platform’ and we (the ‘architects’ we, not the ‘Cambia’ we) tend, for better or worse, to give them the bum’s rush.

What do we do instead of massive virt? We micro virt. We introduce services as a layer of abstraction, and do whatever kind of read caching is required for that service to meet the interface’s SLAs.
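Here is a minimal sketch of what that micro-virt pattern can look like – a read-through cache in front of a narrow service facade. The names (the member fetcher, the TTL, the in-process dict) are illustrative assumptions, not our actual implementation:

```python
import time

class ReadThroughCache:
    """Tiny in-process read-through cache; swap in Redis/memcached as the SLA demands."""
    def __init__(self, loader, ttl_seconds=300):
        self._loader = loader        # function that hits the operational system of record
        self._ttl = ttl_seconds
        self._entries = {}           # key -> (value, expires_at)

    def get(self, key):
        value, expires_at = self._entries.get(key, (None, 0.0))
        if time.time() < expires_at:
            return value             # cache hit: no load on the system of record
        value = self._loader(key)    # cache miss: fetch and remember
        self._entries[key] = (value, time.time() + self._ttl)
        return value

def fetch_member_from_system_of_record(member_id):
    """Hypothetical loader: a SQL query or vendor API call against the operational system."""
    ...

member_cache = ReadThroughCache(fetch_member_from_system_of_record, ttl_seconds=60)

def get_member(member_id):
    """The one narrow query this interface's consumers actually need."""
    return member_cache.get(member_id)
```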

And as our operational systems are upgraded, or replaced with newer technology, they are becoming more scalable and available, making it more viable to tap them directly. So we sometimes find we can plumb our services layer as facades directly over the operational system (also useful since, as we all know, the Vendor King has no clothes).

And, as our internal integration plumbing is replaced with new technologies pioneered in the world of Big Data, we are increasingly able to shuttle queries – and even large query results – to and fro.

What does this all mean for events? Do we want to use our event plumbing just for the near-real-time propagation of state, destined for some comprehensive cache of data local to some any-query interface? Or do we want to communicate knowledge?

The answer is ‘both’.

The stream of ‘normalized’ raw events – scrubbed and buffed and minty-fresh – is used as the fountain of operational truth, serving to continuously refresh microservices, forward caches, data lakes, data marts, data warehouses.
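In sketch form, that refresh loop is very simple. This is transport-agnostic on purpose – in practice the events arrive over whatever streaming plumbing we run, and the ‘cache’ below stands in for a microservice read store, a mart, or a lake partition; the field names are assumptions:

```python
# Each normalized raw event upserts (or removes) one entity in the local read model.
forward_cache = {}   # stand-in for a microservice read store, data mart, lake partition...

def apply_raw_event(event):
    """Refresh the local copy of operational state from one normalized event."""
    entity_key = (event["entity_type"], event["entity_id"])
    if event.get("deleted"):
        forward_cache.pop(entity_key, None)
    else:
        current = forward_cache.get(entity_key, {})
        forward_cache[entity_key] = {**current, **event.get("attributes", {})}

def consume(stream):
    """Whatever the transport (broker consumer, bus subscriber), the loop is the same."""
    for event in stream:
        apply_raw_event(event)
```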

The result? Up-to-date, consistent data available across the enterprise.

That doesn’t mean up-to-date, consistent _knowledge_ available across the enterprise.

Communicating knowledge requires ingesting the ‘cloud’ of raw events produced by our discrete-but-relatable operational streams, and, using our ontological mapping, inferring meaningful events from that cloud.

Those business-meaningful events – there is a new health plan member, a claim is now ready to pay, a member has aged into Medicare – can be used to directly trigger and inform business processes.
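As a sketch of what that inference can look like (the event shapes, the ‘member’ entity type, and the age-65 rule are illustrative assumptions; the real rules would come from the ontological mapping):

```python
from datetime import date

MEDICARE_AGE = 65   # illustrative threshold; the real rule lives in the ontological model

def infer_meaningful_events(raw_event):
    """Turn one raw member-record event into zero or more business-meaningful events."""
    meaningful = []
    if raw_event.get("entity_type") != "member":
        return meaningful

    attrs = raw_event.get("attributes", {})
    if attrs.get("enrollment_status") == "new":
        meaningful.append({"type": "NewHealthPlanMember",
                           "member_id": raw_event["entity_id"]})
    birth_date = attrs.get("birth_date")
    if birth_date and _age_in_years(birth_date) >= MEDICARE_AGE:
        meaningful.append({"type": "MemberAgedIntoMedicare",
                           "member_id": raw_event["entity_id"]})
    return meaningful

def _age_in_years(birth_date: date) -> int:
    today = date.today()
    return today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day))
```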

Making those meaningful events consistent with meaningful queries demands deriving the logic for both from a common source: definitions of meaningful subsets of the overarching ontology.
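One way to picture that common source (a deliberately toy sketch – the subset definition, operators, and attribute names are all assumptions): the same declarative definition drives both the query predicate and the event test, so the two cannot drift apart.

```python
# One declarative definition of a meaningful subset of the ontology...
MEDICARE_ELIGIBLE = {
    "entity": "member",
    "conditions": [("age", ">=", 65), ("enrollment_status", "=", "active")],
}

# ...drives the query side...
def to_sql_where(subset):
    return " AND ".join(f"{col} {op} {val!r}" for col, op, val in subset["conditions"])

# ...and the event side.
_OPS = {">=": lambda a, b: a >= b, "=": lambda a, b: a == b}

def matches(subset, attributes):
    return all(_OPS[op](attributes.get(col), val)
               for col, op, val in subset["conditions"])

print(to_sql_where(MEDICARE_ELIGIBLE))
# -> age >= 65 AND enrollment_status = 'active'
print(matches(MEDICARE_ELIGIBLE, {"age": 67, "enrollment_status": "active"}))
# -> True
```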

Stay tuned.
