James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Pentaho, Hadoop, and Data Lakes


Earlier this week, at Hadoop World in New York, Pentaho announced the availability of our first Hadoop release.

As part of the initial research into the Hadoop arena, I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

  • 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
  • The source of the data is typically a single application or system.
  • The data is typically sub-transactional or non-transactional.
  • There are some known questions to ask of the data.
  • There are many unknown questions that will arise in the future.
  • There are multiple user communities that have questions of the data.
  • The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past the standard way to handle reporting and analysis of this data was to identify the most interesting attributes, and to aggregate these into a data mart. There are several problems with this approach:

  • Only a subset of the attributes is examined, so only pre-determined questions can be answered.
  • The data is aggregated, so visibility into the lowest levels is lost.

Based on the requirements above and the problems with the traditional solutions, we have created a concept called the Data Lake to describe an optimal solution.

If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
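
To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is only an illustration, not Pentaho's or Hadoop's implementation: the event records, the field names (user, page, browser, ms), and the helpers build_data_mart and query_lake are all hypothetical. The mart keeps one pre-selected attribute in aggregated form, while the lake keeps every raw record so that a question nobody planned for can still be asked later.

```python
from collections import defaultdict

# Hypothetical raw events from a single application (made-up data for illustration).
RAW_EVENTS = [
    {"user": "ann", "page": "/home", "browser": "firefox", "ms": 120},
    {"user": "bob", "page": "/home", "browser": "chrome", "ms": 95},
    {"user": "ann", "page": "/cart", "browser": "firefox", "ms": 310},
    {"user": "cat", "page": "/cart", "browser": "chrome", "ms": 280},
]


def build_data_mart(events):
    """Data mart approach: keep only a pre-selected attribute (page), aggregated to hit counts."""
    mart = defaultdict(int)
    for event in events:
        mart[event["page"]] += 1
    return dict(mart)


def query_lake(events, predicate):
    """Data lake approach: answer an ad hoc question by scanning the raw records."""
    return [event for event in events if predicate(event)]


# The known question (hits per page) is answered by the mart.
mart = build_data_mart(RAW_EVENTS)

# A question that arises later (slow requests from Firefox) needs attributes
# the mart discarded, so it can only be answered from the raw records.
slow_firefox = query_lake(
    RAW_EVENTS,
    lambda e: e["browser"] == "firefox" and e["ms"] > 200,
)
```

The mart is smaller and faster for the question it was built for, but the browser and timing attributes are gone from it. At real volumes the scan in query_lake would be a distributed job rather than a list comprehension.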

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

Written by James

October 14, 2010 at 4:06 pm

11 Responses


  1. […] fill the lake, and various users of the lake can come to examine, dive in, or take samples,” said Pentaho CTO James Dixon, creator of the term ‘Data […]

  2. […] for the term, ‘Data Lake.’ He first wrote about the Data Lake concept on his blog in 2010, Pentaho, Hadoop and Data Lakes. After the numerous interpretations and feedback, he revisited the concept and definition here: […]

  3. […] 2010, James Dixon introduced the concept of the Data Lake, and his idea has gained traction ever since. Dixon’s Data Lake is a style of data warehouse […]

  4. […] an enterprise, but I should mention that isn’t how it was originally intended. The term was coined by James Dixon in 2010, when he did that he intended a data lake to be used for a single data source, multiple data […]

  5. […] James Dixon, who identifies himself as the Chief Geek at Pentaho, coined the term data lake and describes it this way: […]

  6. […] Data Lake was introduced in 2010 by James Dixon, CTO of Pentaho, in connection with the launch of their first Big Data solution on […]

  7. […] five years since Pentaho’s CTO, James Dixon coined the now-ubiquitous term data lake in his blog. His metaphor contrasted bottled water which is cleansed and packaged for easy consumption with the […]

  8. […] between the data warehouse and the Data Lake. The term Data Lake first appeared in October 2010 on the blog of James Dixon, CTO of Pentaho, a company specializing in Business Intelligence. In his first approach to the concept, […]

  9. […] term data lake was first used in 2010 in a blog by James Dixon, the CTO of Business Intelligence specialist Pentaho. He described it […]

  10. […] has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and […]
