James Dixon's Blog

Union of the State – A Data Lake Use Case

Many business applications are essentially workflow applications or state machines. This includes CRM systems, ERP systems, asset tracking, case tracking, call center, and some financial systems. The real-world entities (employees, customers, devices, accounts, orders etc.) represented in these systems are stored as a collection of attributes that define their current state. Examples of these attributes include someone’s current address or number of dependents, an account’s current balance, who is in possession of laptop X, which documents for a loan approval have been provided, and the date of Fluffy’s last Feline Distemper vaccination.
State machines are very good at answering questions about the state of things. They are, after all, machines that handle state. But what about reporting on trends and changes over the short and long term? How do we do this? The answer for this is to track changes to the attributes in change logs. These change logs are database tables or text files that list the changes made over time. That way you can (although the data transformation is ugly) rewind the change log of a specific field across all objects in the system and then aggregate those changes to get a view over time. This is not easy to do and assumes that you have a change log. Typically, change logs only exist for the main fields in an application. There might only be change logs on 10-20% of the fields. So if you suddenly have an impulse so see how a lesser attribute has changed over time you are out of luck. It is impossible because that information is lost.
This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart. This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a Data Lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the Data Lake for new questions.
A Data Lake can also be used to solve the problems of history and trending for workflow applications and state machines. What if these applications write their initial state into the Data Lake and then also write the change of every attribute in there as well? While we are at it, let’s log all the application events coming from the user interface tier as well. From the application’s perspective this is a low-latency fire and forget scenario.
Now we have the initial state of the application’s data and the changes to of all of the attributes, not just the main/traditional fields. We can apply this approach to more than one application, each with its own Data Lake of state logs, storing every incremental change and event. So now we have the state of every field of (potentially) every business application in an enterprise across time. We have the “Union of the State”.
With this data we have the ability to rewind the Union of the State to any point in time. What are the potential use cases for the Union of the State?
Enterprise Time Machine
Suppose something happened a few weeks ago. Decisions were made. Things changed. But exactly what, when, and why? With an Enterprise Time Machine you can rewind the complete state of every major application to any point in time and then step forward event by event, click by click, change by change, at the millisecond level if things happened that quickly. For an e-commerce vendor this means being able to know for any specified millisecond in the past how many shopping carts where open, what was in them, which transactions were pending, which items were being boxed, or in transit, what was being returned, who was working, how many customer support calls were queued and how many were in progress. In different domains such as financial services or healthcare, the applications and attributes are different but the ability is the same.
In order to reconstruct the state at any point in time we need to load the initial snapshot into a repository and then update the attributes of each object as we process the logs, event by event, until we get to the point in time that we are interested in. A NoSQL store such as MongoDB , HBase, or Cassandra should work well as the repository. This process could be optimized by adding regular snapshots of the whole state into the Data Lake so that we don’t have to process from the very beginning every time. For a detailed analysis you could rebuild the state to a particular point in time and then process forwards in increments of any size. This way the situation of a device failure that led to a catastrophic cascade of events can be re-created and examined millisecond by millisecond.
Since we can re-create the state at any point in time we can do trending and historical analysis of any and every attribute over any time period, at any time granularity we want.
When user interface events are logged as well as the attribute changes you have the ability to know not only who changed what information, but also who looked at it. Who was aware of the situation? Why did Bob open a particular record every few hours and cancel out without making changes? This requires the History Machine described above.
One of the main tasks in a predictive exercise is to work out which attributes are predictive of your target variable and which ones are not. This can be impossible to do when you only have 10% of your attributes logged. Maybe the minor attributes are the predictive ones. Now you have all of them. This requires the trending facility described above.
Doug Moran, a co-founder of Pentaho and product manager for its Big Data products, sees many predictive applications for this kind of data. This includes the ability to derive a model from replays of previous events and use it to prescribe ways to influence the current situation to increase the likelihood of a desired outcome. For example, this could include replaying all previous shopping cart events for a user currently on an e-commerce site to derive a predictive model that prescribes a way to influence their current purchase in a positive way.
“Dixon’s Union of the State idea gives the Data Lake idea a positive mission besides storing more data for less money,”
said Dan Woods, an IT Consultant to buyers and vendors and CEO of Evolved Media, who has written about the Data Lake for several years.
“Providing the equivalent of a rewind, pause, forward remote control on the state of your business makes it affordable to answer many questions that are currently too expensive to tackle. Remember, you don’t have to implement this vision for all data for it to provide a new platform to answer difficult questions with minimal effort.”
How could this be done?
  • Let the application store it’s current state in a relational or No-SQL repository. Don’t affect the operation of the operational system.
  • Log all events and state changes that occur within the application. This is the tricky part unless it is an in-house application. It would be best if these events and state changes were logged in real time, but this is sometimes not ideal. Maybe SalesForces or SugarCRM will offer this level of logging as a feature. Dump this data into a Data Lake using a suitable storage and processing technology such as Hadoop.
  • Provide the ability to rewind the state of any and all attributes by parallel processing of the logs.
  • Provide the facilities listed above using technologies appropriate of each use case (using the rewind capability).

The plumbing and architecture for this is not simple and Dan Woods points out that there are databases like Datomic that provide capabilities for storing and querying state over time. But a solution based on a Data Lake has the same price, scalability, and architectural attributes as other big data systems.