James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Pile-On: Dan Woods “Lessons From The First Wave Of Hadoop Adoption”


Dan Woods put out a nice piece yesterday on his Forbes blog titled “Lessons From The First Wave Of Hadoop Adoption”.

I agree with him that the insights and advantages of Big Data solutions need to be described in ways other than technology. I’m going to add on to his insights.

1. It’s about more than big data. It’s a new platform.

Yes, it is a new platform.  That means it’s different than the old ones. The fact that you can do some things cheaper than you could before is not the main idea. A bigger story is that some things that were economically not possible before, now are. But the main idea is that this is a new platform, with new capabilities, that needs to fit into your existing data architecture.

2. Don’t get rid of your data warehouse

I completely agree. Big Data technology is a new tool with new characteristics. Using it to replace a Data Warehouse technology that is finely tuned for that use case is not a great idea. Don’t listen to the “Hadoop will replace every database within x years” crowd. No database has managed to replace every database. No database ever will because the variety of the use cases is too large.

3. Think about your data supply chain

Since a Big Data system needs to fit in with everything you currently have and operate, integration is a significant priority. Understand that with Big Data you can build a Big Silo, but a Big Silo is as bad as a small silo (just a lot bigger). You should not be required to pump all your data from every system into Hadoop to get value from it. Design your data architecture carefully; the implications and fallout of getting it right or wrong are significant.

4. It’s complicated

Yes it is. It’s also not cheap to do it well. Sure you can download a lot of open source software and prototype or prove your ideas without a lot of upfront outlay. But putting it into production is a production. Expect that.

Written by James

January 27, 2015 at 5:01 pm

Union of the State – A Data Lake Use Case


Many business applications are essentially workflow applications or state machines. This includes CRM systems, ERP systems, asset tracking, case tracking, call center, and some financial systems. The real-world entities (employees, customers, devices, accounts, orders etc.) represented in these systems are stored as a collection of attributes that define their current state. Examples of these attributes include someone’s current address or number of dependents, an account’s current balance, who is in possession of laptop X, which documents for a loan approval have been provided, and the date of Fluffy’s last Feline Distemper vaccination.
State machines are very good at answering questions about the state of things. They are, after all, machines that handle state. But what about reporting on trends and changes over the short and long term? How do we do this? The answer is to track changes to the attributes in change logs. These change logs are database tables or text files that list the changes made over time. That way you can (although the data transformation is ugly) rewind the change log of a specific field across all objects in the system and then aggregate those changes to get a view over time. This is not easy to do and assumes that you have a change log. Typically, change logs only exist for the main fields in an application. There might only be change logs on 10-20% of the fields. So if you suddenly have an impulse to see how a lesser attribute has changed over time you are out of luck. It is impossible because that information is lost.
This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart. This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a Data Lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the Data Lake for new questions.
A Data Lake can also be used to solve the problems of history and trending for workflow applications and state machines. What if these applications write their initial state into the Data Lake and then also write the change of every attribute in there as well? While we are at it, let’s log all the application events coming from the user interface tier as well. From the application’s perspective this is a low-latency fire and forget scenario.
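As a rough illustration of the logging side, the sketch below compares an object's old and new attributes and appends one record per changed attribute to an append-only log. The record layout (`ts`, `entity`, `attr`, `old`, `new`), the function names, and the in-memory list standing in for the Data Lake are all assumptions made for the example, not part of any particular product.

```python
import json
import time


def diff_attributes(old, new):
    """Yield (attr, old_value, new_value) for every attribute that changed."""
    for attr in sorted(set(old) | set(new)):
        if old.get(attr) != new.get(attr):
            yield attr, old.get(attr), new.get(attr)


def log_changes(log, entity_id, old, new, ts=None):
    """Append one JSON record per changed attribute to the log.

    `log` is a plain list here (hypothetical stand-in); in practice it would
    be a file or a message queue that feeds the lake.
    """
    ts = time.time() if ts is None else ts
    for attr, before_value, after_value in diff_attributes(old, new):
        record = {
            "ts": ts,
            "entity": entity_id,
            "attr": attr,
            "old": before_value,
            "new": after_value,
        }
        log.append(json.dumps(record))


lake_log = []
before = {"address": "1 Main St", "dependents": 2}
after = {"address": "9 Oak Ave", "dependents": 2, "phone": "555-0100"}
log_changes(lake_log, "customer-42", before, after, ts=1421900000.0)
for line in lake_log:
    print(line)
```

Only the attributes that actually changed are written, so the unchanged `dependents` field produces no record, while the newly added `phone` field is logged with an old value of null.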
Now we have the initial state of the application’s data and the changes to all of the attributes, not just the main/traditional fields. We can apply this approach to more than one application, each with its own Data Lake of state logs, storing every incremental change and event. So now we have the state of every field of (potentially) every business application in an enterprise across time. We have the “Union of the State”.
With this data we have the ability to rewind the Union of the State to any point in time. What are the potential use cases for the Union of the State?
Enterprise Time Machine
Suppose something happened a few weeks ago. Decisions were made. Things changed. But exactly what, when, and why? With an Enterprise Time Machine you can rewind the complete state of every major application to any point in time and then step forward event by event, click by click, change by change, at the millisecond level if things happened that quickly. For an e-commerce vendor this means being able to know for any specified millisecond in the past how many shopping carts were open, what was in them, which transactions were pending, which items were being boxed, or in transit, what was being returned, who was working, how many customer support calls were queued and how many were in progress. In different domains such as financial services or healthcare, the applications and attributes are different but the ability is the same.
In order to reconstruct the state at any point in time we need to load the initial snapshot into a repository and then update the attributes of each object as we process the logs, event by event, until we get to the point in time that we are interested in. A NoSQL store such as MongoDB, HBase, or Cassandra should work well as the repository. This process could be optimized by adding regular snapshots of the whole state into the Data Lake so that we don’t have to process from the very beginning every time. For a detailed analysis you could rebuild the state to a particular point in time and then process forwards in increments of any size. This way the situation of a device failure that led to a catastrophic cascade of events can be re-created and examined millisecond by millisecond.
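The replay described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: snapshots and changes are held in sorted in-memory lists rather than a NoSQL store, the tuple layouts and the name `state_at` are invented for the example, and a change carrying the same timestamp as a snapshot is assumed to be already reflected in that snapshot.

```python
import bisect


def state_at(snapshots, changes, when):
    """Rebuild {entity: {attr: value}} as of time `when`.

    snapshots: list of (ts, state) sorted by ts; the first is the initial state.
    changes:   list of (ts, entity, attr, new_value) sorted by ts.
    """
    # Start from the latest snapshot taken at or before `when`.
    snap_times = [ts for ts, _ in snapshots]
    i = bisect.bisect_right(snap_times, when) - 1
    if i < 0:
        raise ValueError("no snapshot at or before the requested time")
    snap_ts, snap_state = snapshots[i]
    state = {entity: dict(attrs) for entity, attrs in snap_state.items()}

    # Replay only the changes after that snapshot, up to and including `when`.
    for ts, entity, attr, value in changes:
        if ts <= snap_ts:
            continue
        if ts > when:
            break
        state.setdefault(entity, {})[attr] = value
    return state


snapshots = [
    (0, {"laptop-X": {"holder": "Alice"}}),
    (100, {"laptop-X": {"holder": "Carol"}}),  # periodic snapshot
]
changes = [
    (10, "laptop-X", "holder", "Bob"),
    (50, "laptop-X", "holder", "Carol"),
    (120, "laptop-X", "holder", "Dave"),
]

print(state_at(snapshots, changes, 30))   # replayed from the initial snapshot
print(state_at(snapshots, changes, 110))  # served from the later snapshot
```

The periodic snapshot is what keeps the second call cheap: it skips every change at or before the snapshot time instead of replaying from the beginning.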
Since we can re-create the state at any point in time we can do trending and historical analysis of any and every attribute over any time period, at any time granularity we want.
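To make the "any time granularity" point concrete, here is a small sketch that samples one attribute at a regular interval from its change log. The function name `sample_attribute`, the `(ts, new_value)` tuple layout, and the sample numbers are hypothetical and chosen only for illustration.

```python
def sample_attribute(initial, changes, start, end, step):
    """Return [(t, value)] for t = start, start + step, ... up to end.

    initial: the attribute's value before any logged change.
    changes: list of (ts, new_value) sorted by ts.
    """
    samples = []
    value = initial
    i = 0
    t = start
    while t <= end:
        # Apply every change that happened at or before this sample time.
        while i < len(changes) and changes[i][0] <= t:
            value = changes[i][1]
            i += 1
        samples.append((t, value))
        t += step
    return samples


balance_changes = [(5, 120), (12, 80), (31, 200)]
trend = sample_attribute(100, balance_changes, start=0, end=40, step=10)
print(trend)
```

Changing `step` changes the granularity of the trend without touching the underlying log, which is the property the paragraph above relies on.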
When user interface events are logged as well as the attribute changes you have the ability to know not only who changed what information, but also who looked at it. Who was aware of the situation? Why did Bob open a particular record every few hours and cancel out without making changes? This requires the History Machine described above.
One of the main tasks in a predictive exercise is to work out which attributes are predictive of your target variable and which ones are not. This can be impossible to do when you only have 10% of your attributes logged. Maybe the minor attributes are the predictive ones. Now you have all of them. This requires the trending facility described above.
Doug Moran, a co-founder of Pentaho and product manager for its Big Data products, sees many predictive applications for this kind of data. This includes the ability to derive a model from replays of previous events and use it to prescribe ways to influence the current situation to increase the likelihood of a desired outcome. For example, this could include replaying all previous shopping cart events for a user currently on an e-commerce site to derive a predictive model that prescribes a way to influence their current purchase in a positive way.
“Dixon’s Union of the State idea gives the Data Lake idea a positive mission besides storing more data for less money,” said Dan Woods, an IT Consultant to buyers and vendors and CEO of Evolved Media, who has written about the Data Lake for several years. “Providing the equivalent of a rewind, pause, forward remote control on the state of your business makes it affordable to answer many questions that are currently too expensive to tackle. Remember, you don’t have to implement this vision for all data for it to provide a new platform to answer difficult questions with minimal effort.”
How could this be done?
  • Let the application store its current state in a relational or NoSQL repository. Don’t affect the operation of the operational system.
  • Log all events and state changes that occur within the application. This is the tricky part unless it is an in-house application. It would be best if these events and state changes were logged in real time, but this is sometimes not ideal. Maybe Salesforce or SugarCRM will offer this level of logging as a feature. Dump this data into a Data Lake using a suitable storage and processing technology such as Hadoop.
  • Provide the ability to rewind the state of any and all attributes by parallel processing of the logs.
  • Provide the facilities listed above using technologies appropriate to each use case (using the rewind capability).


The plumbing and architecture for this is not simple and Dan Woods points out that there are databases like Datomic that provide capabilities for storing and querying state over time. But a solution based on a Data Lake has the same price, scalability, and architectural attributes as other big data systems.

Written by James

January 22, 2015 at 4:43 am

AWS and Your Crown Jewels


These are my thoughts on a recent Dan Woods (Forbes) post titled “Will Companies Ever Move Their Crown Jewels to Amazon Web Services?”.

My short answer is yes, because otherwise Jeff Bezos (the Founder of Amazon) has failed. As of today Bezos is worth $28.8 bn and #16 on Forbes list of powerful people. I’m guessing he’s the kind of guy who doesn’t like to fail.

Jeff Bezos explains his vision in this 10-year-old TED talk: https://www.ted.com/talks/jeff_bezos_on_the_next_web_innovation

He spends the first 7 minutes comparing the Internet bubble to the California gold rush and then moves onto an analogy comparing the internet today with the electricity industry 100 years ago.

I admire the time and energy he spends on his analogies. He looked into different ones and compared them to find the best one. Good analogies are hard to find. The best ones sound obvious when you hear them but can be hard to come up with. The Bee Keeper analogy for open source software took me months of iterations based on years of experience to come up with. It sounds fairly obvious when you hear it, but there was no analogy before it to help understand the model.

If an analogy is good enough, it will allow you to infer additional knowledge. If you follow Bezos’ electricity analogy and look at the history of electricity adoption, you can draw inferences about the adoption of cloud computing (with some generalizations):

  • Before the introduction of electricity supply as a commodity service, any large company needing electricity had its own electricity generators.
  • Before the introduction of cloud-based computing with utility pricing, any large company needing computing had its own data center.
  • Who were the first people to join the electricity grid? Small companies and residences without prior electrical supply.
  • Who were the first people to use cloud computing? Small companies and individuals without data centers.
  • Who were the last people to join the electricity grid? Large companies with their own power sources.
  • Who will be the last people to migrate to cloud computing? Large companies with their own data centers.

Looking in the press you can see that the majority of the anti-cloud talk comes from larger enterprises.

However, most companies that have started in the last five years are evolving cloud-based infrastructures. As a start-up you typically have desktop-based applications for accounting, HR, CRM etc. As you grow it makes sense to move to hosted solutions like NetSuite, Salesforce, SugarCRM etc. As you add more and more hosted solutions the cost and headache of installing and maintaining on-premise solutions looks less and less attractive.

So today’s generation of small companies, which will become large companies in the future, have four classes of applications:

  • Domain-specific desktop applications
  • Generic applications with small scale usage (e.g. project planning)
  • Generic applications that will grow and become cloud based (payroll, CRM, or accounting in a small company)
  • Cloud-based applications

If companies, as Dan suggests, are using services other than Amazon for critical applications, then Amazon is failing in its mission due to operational issues. Jeff Bezos is not likely to let that continue for long.

Written by James

November 12, 2014 at 6:00 pm

Posted in Uncategorized

Extending Pentaho Analyzer and CDF


Here is a sneak peek at some of the things I’ll be showing at Pentaho World sessions later this week (11am Oct 10th).

[Screenshots: floorplan, GPS trails and stationary indicators, heatmap, and four further untitled screenshots]

Written by James

October 6, 2014 at 8:32 pm


Data Lakes Revisited


It seems it’s time to revisit the Data Lake after 4 years. Here’s my original post on it and a couple of video presentations.

There are lots of people using the term these days, with some variety in their definitions and the stories they are telling.

I give credit to Dan Woods at Forbes for being the first to pick up on the idea http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture

What I’d like to address today is (somewhat negative) commentary by Barry Devlin at TechTarget (http://searchbusinessanalytics.techtarget.com/feature/Data-lake-muddies-the-waters-on-big-data-management) and Andrew White and Nick Heudecker at Gartner (http://www.gartner.com/newsroom/id/2809117). In both these cases the statements they make are not wrong, yet not really right. Let’s take a little history tour, using some YouTube videos from 2010, to discover why. I call out the main points below.

Pentaho Hadoop Series Part 1: Big Data Architecture


  • 3:00. A data lake consists of a single source of data. Not distilled (pre-aggregated).
  • 3:25. Most companies only have one source of data that meets the criteria.
  • 4:30. You store all the data because you don’t know in advance all the questions that you will need to ask of it.
  • 6:00. The problem with data marts and data warehouses is that the pre-aggregation limits the questions that can be asked.
  • 6:45. By using a data lake, the institutional data marts and data warehouses can be populated with feeds of aggregations from the data lake, but ad-hoc questions can also be answered.
  • 8:00. A data lake does not replace a database, data mart, or data warehouse. At least not yet, and certainly not in 2010.

Summary: A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.

Pentaho Hadoop Series Part 5: Big Data and Data Warehouses


  • 0:15. Can you use a Big Data solution as a data warehouse? Yes.
  • 0:22. Should you? Probably not.
  • 0:30. A large amount of data from one or two systems is not a data warehouse – it’s a big data mart at best.
  • 1:30. The difference between a data warehouse and a data lake.
  • 4:15. What if you really, really want to use a big data system for your data warehouse? Then you have a water-garden that is populated from data lakes.

Summary: A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.

I chose the term “Data Lake” carefully and paid attention to the analogy and the metaphor. But today some of the people using the term are not using as much care or attention.

Barry Devlin answers a self-imposed question “What is a data lake?” by stating:

“In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications.”

As is clear from the videos above, that was not the original definition of a Data Lake. Not at all. He’s talking about a Water Garden, which is significantly different. I agree with Devlin that the idea of putting all enterprise data into Hadoop (or any other data store) is not a viable option (at least right now). You should use the best tool for the job. Use a transactional database for transactional purposes. Use an analytic database for analytic purposes. Use Hadoop or MongoDB when they are the best tool for the situation. For the foreseeable future the IT environment is, and will be, a hybrid one with many different data stores.

Devlin objects to the term “Data Lake”, whereas I object to his definition of it. It’s incorrect. The underlying issue is that people are using the term inappropriately and inaccurately. More on that later.

I also have some issues with Gartner’s take on Data Lakes (http://www.gartner.com/newsroom/id/2809117).

Their report makes statements like:

“By its definition, a data lake accepts any data, without oversight or governance.”

Who says there is no oversight? By its (original) definition it only accepts data from a single source so “any” is clearly wrong.

“The fundamental issue with the data lake is that it makes certain assumptions about the users of information”.

Who says it makes assumptions? How can a collection of data make assumptions? This makes no sense.

“And without metadata, every subsequent use of data means analysts start from scratch”.

Who says there is no metadata? Now who’s making assumptions? In all cases, Gartner is making these statements only so that they can immediately refute them. Why not state that “Data lakes are pink with purple spots”, and then follow it up with the observation that color makes no sense in this context? Somewhere in all of this the main point has been lost. You store the raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.
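A toy Python example may help show why granularity matters. The event fields, the sample values, and the `aggregate` helper are all made up for illustration; the point is only that a pre-aggregated mart answers the question it was built for, while the raw events can answer a question nobody planned for.

```python
from collections import defaultdict

# Raw, most-granular events as they might sit in the lake (hypothetical data).
raw_events = [
    {"day": "2014-09-01", "region": "East", "product": "A", "amount": 10},
    {"day": "2014-09-01", "region": "West", "product": "B", "amount": 25},
    {"day": "2014-09-02", "region": "East", "product": "B", "amount": 5},
    {"day": "2014-09-02", "region": "East", "product": "A", "amount": 20},
]


def aggregate(events, *keys):
    """Sum `amount` grouped by any combination of event fields."""
    totals = defaultdict(int)
    for event in events:
        totals[tuple(event[k] for k in keys)] += event["amount"]
    return dict(totals)


# The planned question: a data mart pre-aggregated by region.
mart_by_region = aggregate(raw_events, "region")

# A new question, asked later: sales by product and day. It cannot be derived
# from mart_by_region, but it can be computed from the raw events.
by_product_and_day = aggregate(raw_events, "product", "day")

print(mart_by_region)
print(by_product_and_day)
```

Once the data has been reduced to `mart_by_region`, the product and day dimensions are gone; keeping the raw events is what leaves the second aggregation possible.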

So, some people are misusing the term and applying it to things that maybe make little sense to use as a production architecture. Oh well. The majority of people using the term “Data Warehouse” at Big Data conferences are misusing that term too. They are referring to (at best) a large Data Mart, or (at worst) just a really large flat file, and in most cases not a real Data Warehouse. Confusing, yes. Annoying, yes. Worth spending time and energy on? No, not really.

Barry Devlin is welcome to fight a battle against the term “Data Lake”. Good luck to him. But if he doesn’t like it he should come up with a better idea.

If we’re going to fight pointless uphill battles against terminology, I wouldn’t pick this one, even though it’s my term getting misused. These are the top 3 terms on my hit-list.

  1. “Big Data”. The median Hadoop deployment is 10 nodes with an amount of data that qualifies as “small” even by 1990s Data Warehousing standards. Many people throw the NoSQL data stores in the Big Data bucket, and their median deployments are even smaller. I think “Scalable” technology is much more interesting than “Big” technology. Oracle’s Exadata box is “Big”, but not suitable for most solutions as it does not scale down. Some technologies don’t scale up. A technology that can cost-effectively scale, and maintain constant performance, with data volumes varying from small to hugely massive is a great technology because you will never have to migrate. I agree that vast volumes and velocities bring interesting problems to light, but these are edge cases. Let’s focus on scalable technology, otherwise we are going to end up copying video standards that have gone from VGA to WHUXGA (Wide-Hex-Ultra-Extended-VGA). I’m not making that up, that’s a real video resolution. Unless we are careful, we are going to go from Big Data, to Super-Big Data, to Extended-Super-Big Data, to Quad-Super-Extended-Big Data, to Wide-Hex-Ultra-Extended-Big Data. Give me one thing that goes from Tiny-Bit-Mini-Reduced-Small Data to Wide-Hex-Ultra-Extended-Big Data and call it “Scalable Data” please.
  2. NoSQL. How come every major NoSQL data source has added (or is adding) some kind of structured query language? With JDBC/ODBC drivers on the way? Why? Because SQL was never the problem. A data store without query capabilities? Is the data there or not? Probably eventually. It’s “Schrödinger’s Commit” (a little geek humor for CAP theorists). If SQL was not the problem, what was? Schemas and performance were the real problems, and SQL was the scapegoat. But when we add SQL to NoSQL, do those negate each other? Are we left with a null? I propose we retroactively call these data stores “NoSchema” or “Technologies Now with SQL but Formerly Known as NoSQL”, or “Recently SQL’ed”, or maybe “A Little Late to the SQL Party But We Love Them Anyway”.
  3. “Free as in Speech vs. Free as in Beer”. Richard Stallman has had some great ideas. Using this phrase to describe free software was not one of them. Where is all this free beer he speaks of? Beer takes money and time (and time=money) to produce. “Free speech” is something not uniformly available in all countries, and thus not easy to translate. However everyone understands freedom or liberty. Many people are confused by this phrase. So I (pointlessly) suggest we use this: free as in gratis vs free as in liberty.

Written by James

September 25, 2014 at 4:43 am


Review of Prezi presentation tool


I have been playing around with Prezi (http://prezi.com/), the online presentation tool. It’s a cool thing that lets you create presentations that are visually different from PowerPoint/Keynote. Like all these tools it will let you create bad presentations very quickly. If you want to create something compelling and appealing it will take planning and thought. Looking at the presentations on their site, most people go way overboard on the zooming and rotation, and the result is confusing and disorienting.


A cool tool but the design environment is very constraining and frustrating. Great for educational and light usage but not really suitable for large-scale or every day scenarios.


  • Zoom/Pan/Rotate: These give a new alternative to the standard Powerpoint feel.
  • Parallax Background: A nice effect. They call it 3D but it’s really just parallax of the background image.
  • Simple to Use: It’s easy to create simple presentations.
  • Cost: If you don’t mind all your presentations being public you can use the free version. The paid versions are not cheap in the long run.


  • Still Linear: They say it is non-linear and 3D but it is not. It is a zoom-able 2D canvas and can only create one path through the presentation with no branches or loops. It’s a linear flow through a 2D space. You can jump to different parts of the path if you can see them on the screen but constructing a truly non-linear path is clunky.
  • Basic Editing: No ability to directly set the size or rotation of objects using a properties editor. It’s hard to get objects exactly the same size and shape and rotation. It has a small color palette with no color gradients.
  • Z-Ordering: Prezi supports z-ordering but there is no way to control it. If you want to change the z-order of an object you have to copy/delete/paste the object and any that overlap it until you have the right ordering.
  • Text Editing: Text controls are too rudimentary for a presentation tool. Text has a very small color palette with no way to set RGB values except at the theme level – you have to go to CSS editing to get better control. No way to stretch text except proportionally. To get good control over text you need to create text objects in Inkscape or Illustrator and import them into Prezi.
  • Designer: Selecting objects and basic navigating can be extremely frustrating. A toolbar for basic operations would be really helpful.
  • Animations/Build: Only one – build (appear). No build-outs. If you want to combine builds with overlapping frames some things are hard/impossible as you cannot control which frames “own” which objects.
  • Transitions: Only three – slide, rotate, zoom. You cannot choose which; Prezi chooses for you based on the arrangement of your slides. No ability to set the speed of the transition; you have to rely on the side effects of the slide arrangements. In auto-play mode you can only set the same timing for every transition in the deck, not for each individual transition.
  • Viewer controls: On a computer you use the left/right arrow keys to move, and once you get to the end you can right-click/control-click to rewind. On a tablet you cannot swipe, you click on the left/right edges which means you cannot put click-able objects close to the edges (to jump to other parts of the path). Also on tablets there is no rewind option, making it awkward for demo/booth usage.
  • No Save-As: Seriously? If you are about to embark on major modifications to a presentation you have no way to manage backups. You have to download a copy locally and then modify it and push it back to the server to overwrite your changes.
  • Vector Graphic Support: For a graphical tool that supports zooming and rotation, vector graphics are important. The only import vector formats Prezi supports are PDF and SWF – you cannot use SVG or AI or EPS. Using vector graphics that include transparency is really hard.
  • No Ability to Turn Off Transitions: Online meeting tools like Webex, Netmeeting etc will have major problems with the transition animation. There is no way to remove transitions and just jump between the frames.

Written by James

March 14, 2014 at 6:09 pm


12 Days of Visualizations – Sun Burst


Today we are launching our 12 Days of Visualizations program: http://events.pentaho.com/12days-of-Big-Data-Visualizations.html

We are going to release a few new visualizations every week over the holiday period. You can drop these visualizations into a Pentaho BA server and they will appear on the charting menu in Analyzer.

The first one that we are releasing is a Sun Burst chart. This chart is based on the Protovis sun burst chart – http://mbostock.github.com/protovis/ex/sunburst.html

The Sun Burst chart we created can be used in a couple of ways. Firstly it can be used as a multi-level pie chart. This sun burst shows how the sales in three territories break down into sales of product lines within those territories, and then how product line sales compare by year:

[Screenshot: Sun Burst chart of sales by territory, product line, and year]

This effect is achieved by using a color gradient for the outer ring that is based on the chart palette color of the inner rings, and by sorting the segments in each ring into descending order. When you compare the sun burst above with the pie chart below, you can see how much more information the sun burst provides.
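One way such an effect could be produced is sketched below in Python. This is not the code behind the Pentaho chart: the function names, the blend-toward-white shading, and the `max_lighten` parameter are assumptions chosen only to illustrate the two ideas in the paragraph above, sorting each ring's segments in descending order and deriving the outer ring's shades from the parent segment's palette color.

```python
def lighten(hex_color, fraction):
    """Blend a '#rrggbb' color toward white (0 = unchanged, 1 = white)."""
    channels = [int(hex_color[i:i + 2], 16) for i in (1, 3, 5)]
    blended = [round(c + (255 - c) * fraction) for c in channels]
    return "#{:02x}{:02x}{:02x}".format(*blended)


def outer_ring(parent_color, children, max_lighten=0.6):
    """Sort child segments by value, largest first, and shade them from the
    parent's color (largest) toward a lighter tint (smallest)."""
    ordered = sorted(children.items(), key=lambda kv: kv[1], reverse=True)
    steps = max(len(ordered) - 1, 1)
    return [
        (name, value, lighten(parent_color, max_lighten * i / steps))
        for i, (name, value) in enumerate(ordered)
    ]


# Hypothetical product-line sales within one territory.
segments = outer_ring("#336699", {"Trains": 10, "Classic Cars": 50, "Ships": 30})
for name, value, color in segments:
    print(name, value, color)
```

The largest segment keeps the parent's own color and smaller segments get progressively lighter tints, so the outer ring stays visually tied to the inner segment it belongs to.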


You can choose to use a common color gradient on the outer ring so that it is easier to compare the items on that ring. In this example a blue gradient has been used for the outer ring. Regardless of which territory a city is in, the shade of blue it is colored in can be used to compare it with other cities.

[Screenshot: Sun Burst chart with a common blue gradient on the outer ring]

In this chart a red/yellow/green gradient has been used. Here the levels of the chart are year, quarter, and month so the data has not been sorted. The data for this chart is overtime costs so the gradient has been reversed to show larger overtime costs in red, and smaller ones in green.

[Screenshot: Sun Burst chart of overtime costs by year, quarter, and month, with a red/yellow/green gradient]

You can find out more about this chart here: http://wiki.pentaho.com/display/COM/Sunburst

Written by James

December 14, 2012 at 5:39 pm

