James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Archive for October 2010

Pentaho, Hadoop, and Data Lakes

leave a comment »

Earlier this week, at Hadoop World in New York,  Pentaho announced availability of our first Hadoop release.

As part of the initial research into the Hadoop arena I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

  • 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
  • The source of the data is typically a single application or system.
  • The data is typically sub-transactional or non-transactional.
  • There are some known questions to ask of the data.
  • There are many unknown questions that will arise in the future.
  • There are multiple user communities that have questions of the data.
  • The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past the standard way to handle reporting and analysis of this data was to identify the most interesting attributes, and to aggregate these into a data mart. There are several problems with this approach:

  • Only a subset of the attributes are examined, so only pre-determined questions can be answered.
  • The data is aggregated so visibility into the lowest levels is lost

Based on the requirements above and the problems of the traditional solutions we have created a concept called the Data Lake to describe an optimal solution.

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

Written by James

October 14, 2010 at 4:06 pm

Follow

Get every new post delivered to your Inbox.

Join 692 other followers