Your Knowledge Strategy: And Magic Filled the Air

In my last post, ‘You Don’t Need a Data Strategy, You Need A Knowledge Strategy,’ I painted a high-level sketch of why we need knowledge strategies and the key features they need to address. I skipped ahead pretty quickly, rambling on <the time is now to sing my song /Zep> to cover a breadth of ideas.

Today I’d like to circle back and tie up a few things more neatly, both to buttress my arguments and to help make the suggested approach more cohesive. Let’s dive a little deeper into provenance, sources of truth, aggregatability, and knowledge reuse.

Provenance Management

The oldest adage in software, even older than ‘Go To is Bad’, is ‘Garbage In, Garbage Out.’ There is a limit to how much the quality of data can be improved after its initial creation.

In a traditional staged data management system (raw -> scrubbed -> aggregated -> enhanced -> consumed), we scrub the data after we land it ‘raw’. In that stage we can use knowledge of well-known domains to validate and standardize its contents.

For example, for addresses we can map the address to the standard post office format, and kick it out if we can’t. We can go further and determine whether an address is actually mailable, and kick it out or flag it if it’s not. We can go even further and determine whether a physical address actually exists and whether there is a business or residence there, and flag it or kick it out if not.
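Here is a minimal sketch of what that scrubbing stage might look like, in Python. The validation checks are hypothetical stand-ins for calls to a postal standardization or address verification service; the flag-or-kick-out logic is the point:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScrubResult:
    address: dict
    accepted: bool
    flags: list = field(default_factory=list)

def standardize(address: dict) -> Optional[dict]:
    """Hypothetical stand-in for a post-office-format standardization call."""
    street = address.get("street", "").strip().upper()
    return {**address, "street": street} if street else None

def is_mailable(address: dict) -> bool:
    """Hypothetical stand-in for a deliverability check against a vendor API."""
    return len(address.get("zip", "")) == 5

def scrub_address(raw: dict) -> ScrubResult:
    # Kick it out if we cannot map it to the standard format at all.
    std = standardize(raw)
    if std is None:
        return ScrubResult(raw, accepted=False, flags=["not standardizable"])
    result = ScrubResult(std, accepted=True)
    # Flag, rather than reject, records that fail the softer checks.
    if not is_mailable(std):
        result.flags.append("not mailable")
    return result

print(scrub_address({"street": " 123 main st ", "zip": "97201"}))
```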

For personal identity information we can validate with LexisNexis or Experian or another vendor whether the person exists, and whether the associated demographics are correct, and flag the record or kick it out if not.

The problem with all that is well-known: the flag it or kick it out part. Either we suffer the dataset being incomplete, or we have to fix the errors.

And fixing errors is problematic – not only is it expensive, it is time-consuming. Generally it means going back to the system of record, making the fix there, and percolating the data back through and into your data management system.

A further complexity is that doing that not only changes the data for you, it changes it for any other consumers who receive the data before you do. So they all have to be able to accommodate such retroactive changes. Even more challenging is that business logic, and business decisions, may have been based on that incorrect data. The more time that passes before you catch the errors, and the further you are from the system of record, the greater the chance that the ripples from the fix rock your operational boat.

Obviously getting it right the first time is critical. Just as obvious is that, absent that, the earlier we get it right the better.

Provenance Management is the art of getting data right the first time.

For Provenance Management we should consider two kinds of data: data about things in the world, and data about virtual things. (As you may know, drawing this distinction is a common theme across my posts in this blog.)

For a real thing, our data is a ‘mini-me’ of the thing, a snapshot of some subset of its attributes at points in time. We find out the attributes’ values by first instrumenting the real thing: there must be a way of qualifying each attribute. That instrumentation must be used by an observer. That observer must then make a report about the attributes and communicate it over a channel. A receiver on that channel must then either pass it on to another channel, or, if they are the terminus, find out if there is an existing record about the thing-in-the-world, and, if there is, update it, else create a new record. If we are interested in the identity of the thing (say it is a Stradivarius violin versus a generic concrete block meeting some spec) we must establish its identity, and preserve a pointer to that identity.

If we are sophisticated, we make our record of it time-variant: before each update we preserve a record of the state of the thing before the update. If we are not, we overwrite the record of the thing.
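As a minimal sketch of the sophisticated, time-variant option, here is one way to preserve each prior state rather than overwrite it. The field names are purely illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TimeVariantRecord:
    current: dict
    history: list = field(default_factory=list)  # prior states, oldest first

    def update(self, new_attributes: dict) -> None:
        # Preserve the state of the thing as we knew it before the update.
        self.history.append({
            "as_of": datetime.now(timezone.utc),
            "state": dict(self.current),
        })
        self.current = {**self.current, **new_attributes}

record = TimeVariantRecord(current={"street": "123 Main St", "city": "Portland"})
record.update({"street": "456 Oak Ave"})
print(record.current, len(record.history))
```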

Provenance is akin to the chain-of-custody of evidence in a police procedural:

  • What was the time of the observation?
  • Was the observation and reporting triggered by a change in the thing observed, or was it commissioned to be done once, or on some periodic schedule?
  • If triggered by a change, did the instrumentation encompass the agent of change, their intent, and the tool they used, as well as the thing itself?
  • What was the quality of the instrumentation used at the time of use?
  • Who or what was the observer?
  • What were their qualifications at the time?
  • What was their state at the time of the observation?
  • How did they record the observation?
  • What system of categorization did they apply in recording it?
  • To the extent the categories are well-known, what validation, if any, did they do on the reported observation based on that knowledge?
  • When did they report it?
  • What communication tool did they use to report it?
  • What was its state at the time of reporting?
  • What channel was used for the communication?
  • What or who was the recipient?
  • How was the delivery of the report guaranteed?
  • How was the integrity of the report guaranteed?
  • Was the recipient the end of the communication chain, or did they pass it on in turn on another channel using another communication tool, in which case rinse and repeat previous questions?
  • If the end of the chain, what tool did they use to persist the report?
  • To the extent the attributes were from well-known categories, what validation did they do, if any?
  • If tracking identity of the object, how was the identity with an existing record resolved?

Explicitly managing provenance means using the meta-data from answers to some or all of these questions to make a judgement about the quality of the data.
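As a sketch of what explicitly managed provenance meta-data could look like, here is a toy record capturing a few of the questions above, with an equally toy quality judgement. The fields and the scoring are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ProvenanceRecord:
    observed_at: datetime
    observer: str                     # who or what made the observation
    instrument: str                   # how the thing was instrumented
    channel: str                      # how the report reached us
    validated_against: Optional[str]  # well-known-domain validation applied, if any
    integrity_checked: bool           # e.g. checksum or signature on the report

def quality_score(p: ProvenanceRecord) -> float:
    """Toy judgement: the more of the chain-of-custody questions we can
    answer affirmatively, the more we trust the data."""
    answers = [
        p.observed_at is not None,
        bool(p.observer),
        bool(p.instrument),
        bool(p.channel),
        p.validated_against is not None,
        p.integrity_checked,
    ]
    return sum(answers) / len(answers)

report = ProvenanceRecord(
    observed_at=datetime(2024, 1, 5, tzinfo=timezone.utc),
    observer="member, by phone",
    instrument="self-report",
    channel="call center application",
    validated_against="postal standardization",
    integrity_checked=False,
)
print(quality_score(report))  # roughly 0.83
```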

Even though the overall conceptual model of provenance still applies, in any given circumstance not all of the questions can be asked and answered.

For example, say a member reports an address change to a customer service rep on the phone.
The entire provenance path might start when they signed the real estate contract purchasing the house: the instrumentation is their eyes, the record their memory, and the channel the cellphone they use to call the customer service rep and tell them the news.

The rep calls up their record, validates and confirms their identity, and enters the new address.

How confident are we at that point in the address? Maybe they got the street name mixed up, or the zip code. Maybe they won’t move in for two more weeks, or haven’t even closed, but forgot to tell us that part.

We can increase our confidence in the quality of the new information if that data entry tool will do the kind of address validation I described above – is it mailable, is there a house there versus a business, is there any other public record validating their association with the new address, and so on.

We can increase our confidence in the quality of the information if confirmation comes in from a different provenance chain – say, for example, we get a post office change of address form. When a customer does a change of address at the post office, USPS requires them to submit a credit card number. The change is free, but they put through, then roll back, a small transaction to validate that the billing address on the card matches their current mailing address.
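A back-of-the-envelope way to think about that boost: if each independent chain has some probability of being right, the chance that all of them are wrong shrinks quickly. A toy sketch, with made-up probabilities and the simplifying assumption that the chains really are independent:

```python
def combined_confidence(*chain_confidences: float) -> float:
    """Toy 'noisy OR': confidence that at least one independent chain
    is giving us the truth."""
    p_all_wrong = 1.0
    for c in chain_confidences:
        p_all_wrong *= (1.0 - c)
    return 1.0 - p_all_wrong

# The member's phone report alone, versus the phone report plus a
# USPS change-of-address confirmation.
print(combined_confidence(0.85))        # 0.85
print(combined_confidence(0.85, 0.90))  # 0.985
```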

For virtual things, such as orders, there may be two kinds of records. Since the thing itself is virtual, its master form may be as an electronic record agreed upon by all parties. (Therein lies the power of distributed ledgers – see this post.) If there is not an agreed-upon-by-all-parties canonical location for the thing itself, provenance is obviously critically broken from the start.

The other kind of record of a virtual thing is just like a record of a real thing – it is a snapshot of the attributes of the virtual thing taken at points in time.

The canonical (or canonical-ish) record may serve both functions, which causes no end of confusion in the land of data. And that leads us to the next topic to tie up, ‘Single Sources of Truth’.

Single Sources of Truth


The source of truth as we know it about a thing is first and foremost a function of where we write it, not where we read it.

As we have seen, for real things the truth is in the world, for virtual things in the canonical version.

The ‘truth’ as we deal with it in our records is not a binary thing. Of course our record of a thing either has fidelity to the actual thing, or it doesn’t. It is objectively true or false. But consider this. Our record is made up of attribute snapshots taken at points in time, communicated along some provenance chain. The attributes of the thing itself may have changed since the last update. The provenance chain may have introduced error. The ‘truthiness’ </Colbert> – the accuracy and currency of the attributes – is a function of the quality of our provenance chains and the frequency with which observations are reported. Once written, a record never gets ‘more true’ unless it is updated, or it is validated by another provenance chain.

Consider Master Data Management. We put in an MDM system because we have more than one place where we are writing the same kind of record – say a person record. In our MDM system we match records about the same individual (a huge topic in its own right) as well as we can, then (assuming we have a ‘co-existence’, also known as ‘hybrid’, MDM solution) we infer a ‘best version of the truth’ record, sometimes called a ‘golden’ record, from all the records we have matched as being about the same individual.

What does ‘best’ mean in ‘best version of the truth’? How do we infer it from the matched records? We call it the application of ‘survivorship’ rules. But what is really happening? We are simply comparing the provenance chains for each of the attributes as best we can. Since we don’t typically manage provenance explicitly, we don’t have much provenance meta-data to use in such a comparison, so in most cases we fall back to ‘most recent wins’. Some MDM systems, such as Informatica, have elaborate survivorship mechanisms, but they are arbitrary tools based on arbitrary assumptions, such as ‘record quality curves’ that fall off with age at some arbitrary rate.
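Here is a sketch of what that fallback looks like in practice. With no richer provenance meta-data to compare, survivorship collapses to attribute-level ‘most recent wins’; the record shapes are illustrative:

```python
from datetime import date

def survive(matched_records: list[dict]) -> dict:
    """Toy attribute-level survivorship: for each attribute, keep the value
    from the most recently updated record that has one ('most recent wins')."""
    golden: dict = {}
    # Walk oldest to newest so later (fresher) values overwrite earlier ones.
    for record in sorted(matched_records, key=lambda r: r["updated_at"]):
        for attr, value in record.items():
            if attr != "updated_at" and value is not None:
                golden[attr] = value
    return golden

records = [
    {"name": "Pat Smith", "street": "123 Main St", "phone": None,
     "updated_at": date(2023, 3, 1)},
    {"name": "Patricia Smith", "street": None, "phone": "555-0100",
     "updated_at": date(2024, 6, 1)},
]
print(survive(records))
# {'name': 'Patricia Smith', 'street': '123 Main St', 'phone': '555-0100'}
```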

In a fully operationalized MDM system we might have ‘feedback’ loops that update all of the source systems with those newly-minted best versions of the truth. Trying to automate that frequently runs afoul of business rules for each of those operational domains governing how updates can be made, and by whom, and the best we can do is inform the teams managing and using those systems when their records appear to be out-of-date or inaccurate.

So is the MDM system a ‘source of truth’? Not really. Its golden records are sophisticated best guesses at the truth. But they are not operational truth. Contracts around data don’t encompass golden records. We can’t update them directly. If we send out marketing mailers to golden record addresses, and some come back ‘return to sender’, where do we fix it?

We can use the golden records as axes for analytics, especially probabilistic analytics, which may be tolerant of a certain amount of ‘noise’ in the records.

We might use the linkage capabilities of an MDM system to enable operational navigation across systems by linked entity, but that navigation is difficult to make transparent given the probabilistic nature and accuracy of the MDM linkage, especially in the context of privacy laws and operational-domain-specific contracts. If you are a large provider chain, and a patient logs into your portal, do you give them transparent access to their records all across the chain? The answer from CommonWell-style interoperability solutions these days is a firm ‘No’: not without explicit confirmation by the linked system.

So where does confusion come in about sources of truth? Primarily because by ‘truth’ we mean ‘accurate, complete, and aggregatable.’ In our legacy designs, those three elements were frequently not present until somewhere downstream from the systems of record.

Which – discussing linkage was a nice lead-in – brings us to the third topic to tie up a little today: aggregatability. (‘Aggregability’ means ‘tendency to aggregate’ – think platelets clumping into clots; ‘aggregatability’ means ‘the ability to be aggregated’ – think Tinkertoys®.)

Aggregatability

Aggregatability for data is the ability to join data records together based on attributes or sets of attributes they have in common.
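In the simplest case that is literally a join on a shared key. A minimal sketch, in plain Python rather than SQL, with made-up records:

```python
def join_on(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Aggregate two record sets by matching on an attribute they share."""
    right_by_key = {r[key]: r for r in right}
    return [
        {**left_rec, **right_by_key[left_rec[key]]}
        for left_rec in left
        if left_rec[key] in right_by_key
    ]

members = [{"member_id": "M1", "name": "Pat Smith"}]
claims = [{"member_id": "M1", "claim_total": 240.00}]
print(join_on(members, claims, key="member_id"))
# [{'member_id': 'M1', 'name': 'Pat Smith', 'claim_total': 240.0}]
```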

There are a number of subtleties here. What you end up with can vary in its meaning and utility. That is because, as I have pointed out in previous posts here, some aspects of meaning may be captured in the structures of the data we are aggregating, and in the structure of the resulting aggregate.

For example, relational databases contain more meaning in their structures than key-value stores do, so more of the aggregation has to happen ‘outside’ of the database for key-value stores.

Aggregation of database data can happen statically, in the structure, dynamically, in stored procedures or queries, or somewhere in between, in ‘materialized views’.

In the Kimball world of star-schema datamarts, data can be aggregated across schemas using dimensions they share in common, called ‘conformed dimensions.’ The resulting sets of marts are sometimes called ‘constellations’. (‘Snowflake’ schemas are variants where the dimensions, as opposed to the fact tables, are themselves normalized into miniature relational models.)

Data aggregation in applications used to happen primarily in the database – the app would either get a connection and make a direct SQL call, or the developer would make and call a stored proc.

With the rise of object-oriented programming we got object-relational mapping tools such as TopLink and Hibernate that modularized and subordinated the aggregation. With the emergence of distributed apps, such as N-tier apps, the distributed app frameworks such as Java EE and .NET brought their own persistence abstractions such as JPA and ADO.NET.

Now, with the rise of service-oriented architectures, especially microservice architectures, while there is some aggregation happening ‘inside’ services, a lot has to be done ‘outside’. This demands the use of an aggregatable data model across related services, such as the consistent use of ‘location’ in the Google Maps API. This need to do aggregation dynamically in the client apps is obviously problematic at times, leading to the creation of aggregate services – close cousins of materialized views – and services such as those based on GraphQL that support dynamic queries at runtime. Getting large aggregate datasets out of APIs requires either chunking them up or streaming them through WebSockets or gRPC or some such.
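Here is a sketch of the chunking option: pulling a large aggregate result out of an API page by page rather than in one response. The endpoint, parameters, and response shape are hypothetical; the paging pattern is the point:

```python
from typing import Iterator
import requests  # third-party; pip install requests

def fetch_all(base_url: str, page_size: int = 500) -> Iterator[dict]:
    """Stream a large result set out of a (hypothetical) paged API, one record at a time."""
    page = 0
    while True:
        resp = requests.get(base_url, params={"page": page, "size": page_size}, timeout=30)
        resp.raise_for_status()
        records = resp.json().get("records", [])
        if not records:
            return
        yield from records
        page += 1

# for provider in fetch_all("https://api.example.com/providers"):
#     ...aggregate with attribution data here...
```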

Another subtlety around aggregatability comes from the temporal and causal consistency of the data.

If we had a system where every create, update or delete action on a record took place in a distributed transaction encompassing every place in the enterprise we wanted to persist a copy of the record or a subset of it, we would be good to go.

I hear you laughing and groaning in equal parts. Yes, a ridiculous, impractical example, and yes, this is well-traveled territory.

Martin Kleppmann does an outstanding job of laying out the issues around managing distributed data in his Designing Data-Intensive Applications, which is, I think, already a classic reference despite being only a couple of years old.

Even without going deep into theory and fine-grained considerations such as unreliable clocks, it is obvious that if the semantics of the aggregated dataset need or assume its temporal or causal consistency we need to ensure that is true.

From a horseshoes and hand grenades perspective, sources that are maintained by periodic batch jobs, or worse, daisy-chained periodic batch jobs, may not be naively aggregated with each other, or with a source that is maintained in near-real time, without our paying serious attention to the relative timing – especially if the data we are seeking to aggregate is causally connected.
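A coarse-grained guard in that spirit: before aggregating, compare the sources’ last-refresh times against a tolerance appropriate to the question being asked. The threshold here is an arbitrary assumption:

```python
from datetime import datetime, timedelta, timezone

def safe_to_aggregate(source_refreshed_at: dict[str, datetime],
                      max_skew: timedelta = timedelta(hours=24)) -> bool:
    """Return False if the sources' refresh times are too far apart to combine naively."""
    times = list(source_refreshed_at.values())
    return max(times) - min(times) <= max_skew

now = datetime.now(timezone.utc)
print(safe_to_aggregate({
    "provider_master_api": now,
    "attribution_batch": now - timedelta(days=10),  # daisy-chained weekly batch
}))  # False: don't naively overlay these two sources
```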

Winging a couple of examples:

Say we are working in the Medicare HMO space, where patient/members have to be attributed – assigned – to particular providers as their primary care-givers. We (not ‘Cambia we’ – ‘imaginary we’) have a daisy-chained batch job feeding a system where we execute the complex rules around attribution and make initial attribution decisions once a week, which we then have to send to the providers themselves for evaluation and confirmation. It generally takes another couple of weeks before the loop closes.

We also have a provider master – a provider management system where we maintain provider demographics, affiliations, network status, and so on. That system has an API in front of it.

Consider our consumer-facing phone app. We want to show our members the HMO providers whose practices are in proximity to their home, workplace, and probable commute routes. Now we want to overlay their attribution status. If we use current provider information from the provider master and try to aggregate it with data from the attribution system, obviously there could be cracks wide enough to drop entire practices through.

Or let’s say we have a batch-driven data management solution that is maintaining a datamart used to inform a next-best-health action recommendation engine. The datamart is ‘fresh’ once a day.

A member goes to the doctor in the morning, where they are diagnosed with an allergy to nuts. They go to lunch, pull out the consumer-facing phone app, which makes a next-best action recommendation of eating more fruit and nuts. Clang.

I’m sure there are more telling examples – and, in healthcare, more serious ones. The point is, how do we do this with even a coarse grain of confidence?

The most accessible way to do this given today’s technology is to have a near-real-time event stream from the systems of record broadcast across the enterprise, making near-real-time updates to all of those distributed caches, databases, datamarts, data lakes, and data warehouses.

That gets us in the ballpark.

Now, if we are channeling all the events that might be causally related through a common event pipe, such as a stateful streaming solution, we also give ourselves the opportunity to introduce more advanced approaches to temporal and causal coordination, such as version vectors.
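One such approach is the version vector, which lets each downstream copy tell whether it has seen everything another copy has seen, or whether the two have diverged. A minimal sketch, with illustrative node names:

```python
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if version vector `a` has seen at least everything `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict[str, int], b: dict[str, int]) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a is newer"
    if dominates(b, a):
        return "b is newer"
    return "concurrent"  # the copies diverged; causal order alone can't resolve it

cache = {"provider-master": 12, "attribution": 3}
mart = {"provider-master": 12, "attribution": 2}
print(compare(cache, mart))  # "a is newer"
```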

Knowledge Reuse

The inference of knowledge from data is, from the perspective I’ve outlined here, a two-part function: aggregation to bring together a meaningful data set, and filtering to bring the knowledge contained in that data set into relief against the noise. That aggregation may be a deterministic function of the data itself, or may be a probabilistic inference. The filtering may be a simple ‘WHERE’ clause, or some complex, multi-stage evolution.

Regardless, the definitions of combined aggregation and filtering are always in terms of the ontology of the domain – the concepts, entities, their relationships, and invariant rules for their evolution.

We need to capture, persist, manage, and reuse those definitions across the enterprise.
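One way to make those definitions reusable is to register them declaratively, expressed in terms of the domain ontology rather than any one database’s schema, so that any consumer can apply the same aggregation and filtering. A sketch, with structure and names that are assumptions rather than a prescription:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KnowledgeDefinition:
    """A reusable definition: how to aggregate, then how to filter, in domain terms."""
    name: str
    entities: list[str]                # ontology entities it draws on
    join_key: str                      # shared attribute used to aggregate
    predicate: Callable[[dict], bool]  # the 'WHERE clause', as a function

REGISTRY: dict[str, KnowledgeDefinition] = {}

def register(defn: KnowledgeDefinition) -> None:
    REGISTRY[defn.name] = defn

register(KnowledgeDefinition(
    name="members_with_open_care_gaps",
    entities=["Member", "CareGap"],
    join_key="member_id",
    predicate=lambda row: row.get("gap_status") == "open",
))

# Any consumer can now look up the definition and apply it to its own aggregated rows.
defn = REGISTRY["members_with_open_care_gaps"]
rows = [{"member_id": "M1", "gap_status": "open"},
        {"member_id": "M2", "gap_status": "closed"}]
print([r for r in rows if defn.predicate(r)])
```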

Business decisions, and their automation in business rules systems, are procedural – if this scenario occurs then take this action among the alternatives.

Knowledge triggers decisions, informs them, and identifies and qualifies actions and their consequences.

We need knowledge strategies first, then data strategies to enable them.

Stay tuned.
