You Don’t Need a Data Strategy, You Need a Knowledge Strategy

In my post “Throw Off Those 1960’s Data Strategy Shackles” I suggested that the days of the Inmon/Kimball-style data strategy for analytics are past, sketched out a couple of notes on making events the atoms of our data strategies, and promised a longer answer to what our future states should look like.  I’m continuing down that road today, though this may turn into part 2 of N, which seems par for the course for me on this blog – it is becoming more a serial essay vehicle than a pithy posting one.  Part 3 is here.

In discussing key architecture factors in your enterprise data strategy, you have likely touched on, among other things:

  • the nature and meaning of ‘single source of truth’;
  • operational data versus analytical data;
  • batch versus streaming ingestion;
  • the need for a ‘semantic layer’ to facilitate aggregation across data sets;
  • and the management of entity identity.

Those are, to me, the trunk, ears, tail, and tusks of an elephant in the room:  you don’t need a data strategy.  First you need a knowledge strategy supporting both operations and analytics.

Knowledge comes at that moment of human interaction with data when we learn something new that we are able to fit into our conceptual framework of how things are, and how they work:  when you query a database and learn how many of your patients live in urban versus rural areas; or when you learn from a model you have trained that there is a strong correlation between a customer’s per-visit time on your health tips website and improvements in their morbidity.

In business, we primarily use knowledge in business processes.   In a knowledge industry such as healthcare – we’re not making Chevys, we’re making healthy – business processes are decisions that must be made regularly, plumbed by communication.

Those decisions range from simple, deterministic ones we make the same way every time given the same circumstances, to difficult, complex ones we make in the face of incomplete information based on intuition and experience.

The power of the analytics and the AI revolution is twofold:  the number and types of decisions we can automate is increasing; and the intelligence of our human decision-makers is being augmented.

The nature of the business processes built from expert decisions is also evolving.  The long-running, slowly changing processes that are the legacy of Frederick Winslow Taylor, Henry Ford, and Peter Drucker are giving way in the post-industrial economy to something much different.  

Fixed sets of tasks are giving way to rapidly changing activities.  Complex decisions are distributed across the enterprise.  The line between operations and analytics is blurring. Processes span the gulf between businesses.  Massive multi-year projects are giving way to shorter ones with limited and specific objectives.   The agile approach is being adopted by business, not just by IT.

Business processes are not all at the same level; there is a ‘Russian doll’ of related processes:  at the center, the actual executing processes, enclosed by process management processes, enclosed by process engineering processes, enclosed by process strategy processes.

Our decision-makers, and our automated decision processes, at each of those levels need knowledge of different kinds. 

To enable these ‘knowledge moments’ we need to make the information available to them accurate and timely, meaning it has fidelity to the things it is about and arrives when it is needed; consistent, meaning the information available to all of our decision-makers is the same; and aggregatable, meaning a decision-maker can combine sets of information as needed to create a larger palette of information from which to derive knowledge.

Historically, data automation happened in the back office first.  There were no computers at the front lines of business – the mainframes were off in temperature-controlled rooms by themselves surrounded by high priests of computing.   At the front lines, interacting with customers, making sales, paper was used.

So the process of getting data into the hands of the decision-makers involved data entry, where all of the information on all of that paper was punched and typed into the computers; data mastering, where the inevitable errors in the hand-entered data were discovered and corrected; and reporting, where the data was aggregated both across data sets and across time, and knowledge moments happened.  That knowledge was then distributed slowly back to the front lines, back to the process managers, back to the process engineers, back to the process strategists.

With the desktop computer revolution, software moved to the front lines.  But data quality rules were difficult to enforce in practice at the front lines, and were seen as a source of customer and user friction.

So our front-line software systems permitted the entry of ‘dirty’ data – after all, we were scrubbing data in the back office anyway – or so the rationale went.

And those same systems were built with data models that were difficult to evolve, so companies took advantage of those lax quality controls to bend the models to new needs, for example by ‘overloading’ data fields to mean something different than originally intended.

With the emergence of commercial off-the-shelf (COTS) solutions, we bought large canned software packages to serve different lines of business, or different business needs.   They came with their own models and databases for people, and businesses, and transaction types.

These packages became the center of our operational business universe, their specific data models permeating our business environments, their vendors king.

Because of their tightly-coupled internal architectures we could not replace those models and databases with external ones.   So we ended up with data about consumers, for example, in a dozen different applications, in a dozen slightly different forms, each set maintained by discrete processes.

To deal with that confusion we bought or built a new kind of tool, the Master Data Management (MDM) system, to find and link together data about the same entities, such as consumers, across all the systems where their data lived.   These systems have three levels of maturity:  ‘linkage’, where they simply create and maintain pointers to the common data across systems; ‘hybrid’, where in addition to the links they infer and maintain a ‘best version of the truth’ about each entity based on ‘survivorship’ rules that evaluate which data is ‘best’; and ‘transactional’, the practically unachievable holy grail, where the MDM system becomes the enterprise’s operational system of record for that entity – back to where we would have started had we been building all of our software from scratch today.
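
To make the ‘hybrid’ level concrete, here is a minimal sketch of how survivorship rules might assemble a ‘best version of the truth’ from consumer records already linked across a few systems.  The source systems, field names, rankings, and rules are hypothetical illustrations, not any particular MDM product’s behavior.

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical ranking of source systems: lower number wins ties.
SOURCE_RANK = {"ehr": 1, "crm": 2, "billing": 3}

# One record per source system, all already linked to the same consumer.
linked_records = [
    {"source": "crm", "updated": date(2019, 3, 2),
     "name": "Jon Smith", "phone": "555-0100", "email": None},
    {"source": "ehr", "updated": date(2019, 6, 14),
     "name": "Jonathan Smith", "phone": None, "email": "jon@example.com"},
    {"source": "billing", "updated": date(2018, 11, 30),
     "name": "J. Smith", "phone": "555-0199", "email": "jon@example.com"},
]

def survive(field: str, records: List[Dict[str, Any]]) -> Any:
    """Pick the 'best' value for one field: the most recently updated record
    with a non-null value wins; source rank breaks ties."""
    candidates = [r for r in records if r.get(field) is not None]
    if not candidates:
        return None
    candidates.sort(key=lambda r: (r["updated"], -SOURCE_RANK[r["source"]]),
                    reverse=True)
    return candidates[0][field]

best_version = {f: survive(f, linked_records) for f in ("name", "phone", "email")}
print(best_version)
# {'name': 'Jonathan Smith', 'phone': '555-0100', 'email': 'jon@example.com'}
```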

And the paths for getting knowledge to decision-makers at each Russian doll process level – process execution, process management, process engineering, and process strategy – split.

We created massive data warehouses to hold our information.  We followed the approaches of two sages, Bill Inmon and Ralph Kimball, as different as Coke and Pepsi, and created elaborate mechanisms which were the networked-software equivalent of the old mainframe processes.   We ‘ingested’ the raw data from our various operational systems.  We ‘cleaned’ the data, making sure that each data element was of the appropriate type, and the data sets were internally consistent.  We linked the data together, in part with our MDM systems.  And we ‘denormalized’ the data, putting it into a physical form to facilitate large scale aggregation of it across time and common values for our reporting.

That was – is – a time-consuming process.  Worse, getting data ingested from various systems and temporally coordinating it forced us to the least common denominator across the ingestion paths, so our warehouses were rarely even daily-fresh; weekly or monthly was more typical.

Because of the time lags involved, no knowledge could be dynamically surfaced at the process execution level.   So we crafted one-off visibility into specific knowledge sets for critical processes.

As I implied in ‘Throw Off Those 1960’s Data Strategy Shackles’, many organizations are still building classic Kimball/Inmon-style solutions.   At first blush they may look different, since newer technologies are being used.  But they are functionally equivalent: ingest, scrub, aggregate, enhance, persist.  But that’s ok, they might say, because they intend to manage building and maintaining them better than they did The Last Time, because they will be in the Cloud, and immutable, and trivially scalable, and handsome, and strong, and powerful, and wise.

That is like improving the village water supply by replacing the well’s rusty metal bucket – itself the replacement for the wooden bucket before it, which rotted – with a new plastic bucket.  You are still going to be turning the same crank and pulling up water a bucket at a time.

What should you do instead?

You need to build something evolvable – that has emerged as perhaps the fundamental tenet of software architecture in the last decade. 

You need to build something that supports all Russian-doll levels of process, not just strategy, with accurate, consistent, aggregatable data.

You need to build something that can flex rapidly to support the accelerating speed of business today.   Supporting periodic analytics, no matter how sophisticated, is not enough of an accelerator.

You need to build something that supports the fine-grained explicit management of individual privacy through data tracking and controls.

You need to build something that will enable you to tie together not only discrete data sets, but the discrete and distributed systems themselves, including future distributed ledgers living ‘in between’ your business and others’.

You need to build something that supports consistency across the blurring line of operations and analytics, something where your phone app can call an operational API and an analytics API and combine the results with aplomb.
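
As a toy illustration of that blurring line, here is a sketch of a client combining an operational API call with an analytics API call about the same member.  The endpoints and payloads are hypothetical; the point is only that when both answers are expressed in the same ontology-defined terms, the client can merge them without a translation layer.

```python
import requests

BASE = "https://api.example-health.test"   # hypothetical endpoints
member_id = "M12345"                       # identifier defined in the shared ontology

# Operational view: the member's current care plan.
care_plan = requests.get(f"{BASE}/operations/members/{member_id}/care-plan").json()

# Analytical view: the latest risk score inferred for the same member.
risk = requests.get(f"{BASE}/analytics/members/{member_id}/risk-score").json()

# Because both payloads use the same ontology-defined member identity and
# attribute names, combining them is a dictionary merge, not a mapping exercise.
member_view = {"member_id": member_id, **care_plan, **risk}
print(member_view)
```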

What would that look like? At a high level, it might look something like this:

  • Create and maintain an enterprise ontology (a ‘semantic dictionary’ describing entities, concepts, and their relationships and lifecycles) and ‘expressions’ of it in common implementation forms (Avro, Parquet, etc.);
  • Expand the ontology with a catalog of specific knowledge inferences – the equivalent of the common database queries of today that most data analysts have but few companies manage as a resource;
  • Aggressively manage data provenance in your operational systems to provide the most timely and accurate data possible to your operational API and Event layer;
  • Build APIs in front of all operational systems;
  • Instrument all operational systems to publish events;
  • Apply fundamental quality rules from your ontology in your API and Event layer while translating the data into an implementation format tagged to the ontology, making the data aggregatable across APIs and events on common attributes according to the relationships defined in the ontology (see the first sketch after this list);
  • Create privacy policies, rules whose variables may be filled from your ontology, and mechanisms in your API and Event layer to implement them dynamically at the point of consumption;
  • Tag data in your API and Event layer with metadata about its source and ownership, which is then carried through your systems as needed to meet future privacy-related obligations you can’t meet with your automated policies per se;
  • Update your data caches and data marts with subscribers to the operational events, making them consistent, accurate and up to date;
  • Create a stream of ‘business meaningful’ events by applying complex event patterns to the normalized event stream;
  • Trigger business processes with meaningful events;
  • Trigger real-time analytical pipelines with meaningful events, and publish the new knowledge that results;
  • Publish periodic analytical inferences sourced from the data marts as events;
  • Spin up new data marts ‘on demand’ from the Event layer, which can replay events back to any arbitrary past time, up to the first events captured (‘event sourcing’; see the second sketch after this list);
  • Have applications consistently acquire operational data from your APIs and event subscriptions rather than from direct database queries;
  • Publish observations of data via APIs and event subscriptions as events in their own right;
  • Operationally integrate your MDM system to enable dynamic linking across operational systems and notifications of inconsistent operational data.
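
To make a couple of those bullets more concrete, here is a first, minimal sketch of an operational system publishing an event through an API/Event layer that applies basic quality rules from the ontology and tags the payload with the ontology term it expresses.  Everything here – the ontology fragment, the rules, the in-memory ‘broker’ – is a hypothetical stand-in for whatever schema registry, validation, and messaging infrastructure you actually use.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical fragment of the enterprise ontology: an event type, its
# required attributes, and simple quality rules expressed as predicates.
ONTOLOGY = {
    "healthplan.Member.AddressChanged": {
        "required": ["member_id", "postal_code", "effective_date"],
        "rules": {
            "postal_code": lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit(),
            "effective_date": lambda v: datetime.fromisoformat(v) <= datetime.now(),
        },
    }
}

published = []  # stand-in for a real event broker / durable log

def publish(event_type: str, payload: dict) -> dict:
    """Validate a payload against the ontology, tag it, and publish it."""
    spec = ONTOLOGY[event_type]
    missing = [f for f in spec["required"] if f not in payload]
    if missing:
        raise ValueError(f"{event_type}: missing required attributes {missing}")
    for field, rule in spec["rules"].items():
        if not rule(payload[field]):
            raise ValueError(f"{event_type}: attribute '{field}' fails quality rule")
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,              # the ontology tag
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    published.append(json.dumps(event))
    return event

publish("healthplan.Member.AddressChanged",
        {"member_id": "M12345", "postal_code": "60601",
         "effective_date": "2019-08-01"})
```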

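And here is a second, equally small sketch of the ‘event sourcing’ idea behind spinning up data marts on demand: a mart is just a subscriber that folds the event log into its own read model, so a new mart can be built by replaying the log from any point in time forward.  The event shapes and the mart’s structure are, again, hypothetical.

```python
from collections import defaultdict
from datetime import date

# A durable, ordered log of ontology-tagged business events (stand-in data).
event_log = [
    {"occurred_on": date(2019, 1, 5), "event_type": "Member.Enrolled",
     "payload": {"member_id": "M1", "region": "urban"}},
    {"occurred_on": date(2019, 2, 9), "event_type": "Member.Enrolled",
     "payload": {"member_id": "M2", "region": "rural"}},
    {"occurred_on": date(2019, 3, 1), "event_type": "Member.Moved",
     "payload": {"member_id": "M1", "region": "rural"}},
]

def build_membership_mart(events, since=date.min):
    """Replay events from 'since' forward into a simple read model:
    current region per member, plus member counts by region."""
    region_by_member = {}
    for e in sorted(events, key=lambda e: e["occurred_on"]):
        if e["occurred_on"] < since:
            continue
        if e["event_type"] in ("Member.Enrolled", "Member.Moved"):
            region_by_member[e["payload"]["member_id"]] = e["payload"]["region"]
    counts = defaultdict(int)
    for region in region_by_member.values():
        counts[region] += 1
    return {"region_by_member": region_by_member, "members_by_region": dict(counts)}

# Spin up the mart 'on demand' from the full history...
print(build_membership_mart(event_log))
# {'region_by_member': {'M1': 'rural', 'M2': 'rural'}, 'members_by_region': {'rural': 2}}
# ...or replay from an arbitrary point in time forward.
print(build_membership_mart(event_log, since=date(2019, 2, 1)))
```
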
What would that get you?

  • Each level of process, from execution to management to engineering to strategy, would have consistent, timely data supporting knowledge for its decision-makers.
  • You would have a growing catalog of knowledge, not just data.
  • Data from process triggers, API calls, and analytical inferences would be cohesive and aggregatable according to the overarching ontology, creating broad information palettes for the inference of new knowledge.
  • New processes, new data marts, new analytics approaches that emerged could be readily supported with consistent, accurate, timely data.
  • The sophisticated, modern privacy management that would result would enable you to comply with existing and coming legislation and consumer expectations.

You obviously can’t get all the way there in one jump.   And this is just a sketch of a knowledge strategy.  You need to look at each category of data (e.g. party data, product data, agreement data, transaction data, analytical data, reference data) discretely to discover its unique factors.  And you need to do knowledge process decision analysis to surface the knowledge moments you need to support.  See my series – ok, one post so far, but I intend a series – on the nature of knowledge work (https://healthitarchitect.com/2019/07/18/the-nature-of-knowledge-work/) for suggestions for that path.

Next up in this, the ‘knowledge strategy’ series : ), I will try to bust out a picture or two to illustrate.

Stay tuned.