James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Pentaho, Hadoop, and Data Lakes

with 15 comments

Earlier this week, at Hadoop World in New York, Pentaho announced the availability of our first Hadoop release.

As part of the initial research into the Hadoop arena I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

  • 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
  • The source of the data is typically a single application or system.
  • The data is typically sub-transactional or non-transactional.
  • There are some known questions to ask of the data.
  • There are many unknown questions that will arise in the future.
  • There are multiple user communities that have questions of the data.
  • The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:

  • Only a subset of the attributes is examined, so only pre-determined questions can be answered.
  • The data is aggregated, so visibility into the lowest levels is lost.
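To make these two problems concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the sample events, the field names (`day`, `page`, `browser`, `ms`), and the helper functions `build_data_mart` and `count_by`. It is not how Pentaho or Hadoop work internally; it only shows what is lost when raw records are reduced to a pre-chosen aggregate.

```python
from collections import Counter

# Hypothetical raw, sub-transactional events from a single application.
raw_events = [
    {"day": "2010-10-11", "page": "/home", "browser": "Firefox", "ms": 120},
    {"day": "2010-10-11", "page": "/home", "browser": "Chrome", "ms": 340},
    {"day": "2010-10-11", "page": "/docs", "browser": "Firefox", "ms": 95},
    {"day": "2010-10-12", "page": "/home", "browser": "Chrome", "ms": 410},
]


def build_data_mart(events):
    """Aggregate to the pre-selected attributes (day, page) only.

    The browser and timing fields are dropped, and individual
    events can no longer be distinguished.
    """
    return Counter((e["day"], e["page"]) for e in events)


def count_by(events, attribute):
    """Answer a question chosen later by scanning the raw events."""
    return Counter(e[attribute] for e in events)


mart = build_data_mart(raw_events)

# A known question: views per page per day. The mart answers this.
print(mart[("2010-10-11", "/home")])

# A question nobody anticipated: views per browser.
# The mart has no browser attribute, so only the raw events can answer it.
print(count_by(raw_events, "browser"))
```

The mart stays small and fast for the questions it was designed for, but any question involving an attribute left out of the aggregate requires going back to the raw records.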

Based on the requirements above and the problems with the traditional solutions, we have created a concept called the Data Lake to describe an optimal solution.

If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

Written by James

October 14, 2010 at 4:06 pm

15 Responses


  1. […] fill the lake, and various users of the lake can come to examine, dive in, or take samples,” said Pentaho CTO James Dixon, creator of the term ‘Data […]

  2. […] for the term, ‘Data Lake.’ He first wrote about the Data Lake concept on his blog in 2010, Pentaho, Hadoop and Data Lakes. After the numerous interpretations and feedback, he revisited the concept and definition here: […]

  3. […] 2010, James Dixon introduced the concept of the Data Lake, and his idea has gained traction ever since. Dixon’s Data Lake is a style of data warehouse […]

  4. […] an enterprise, but I should mention that isn’t how it was originally intended. The term was coined by James Dixon in 2010, when he did that he intended a data lake to be used for a single data source, multiple data […]

  5. […] James Dixon, who identifies himself as the Chief Geek at Pentaho, coined the term data lake and describes it this way: […]

  6. […] Data Lake was introduced in 2010 by James Dixon, CTO of Pentaho, in connection with the launch of their first Big Data solution on […]

  7. […] five years since Pentaho’s CTO, James Dixon coined the now-ubiquitous term data lake in his blog. His metaphor contrasted bottled water which is cleansed and packaged for easy consumption with the […]

  8. […] between the data warehouse and the Data Lake. The term Data Lake first appeared in October 2010 on the blog of James Dixon, CTO of Pentaho, a company specializing in Business Intelligence. In his first approach to the concept, […]

  9. […] term data lake was first mentioned in 2010 in a blog by James Dixon, the CTO of Business Intelligence specialist Pentaho. He described it […]

  10. […] has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and […]

  11. […] It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and […]

  12. […] seem simple: securely store all your data in a raw format and apply a schema on read. Indeed, the first description of a data lake compared it to a ‘large body of water in a more natural state’, whereas a data […]

  13. […] term “data lake” came from a blog post composed by James Dixon, CTO of Pentaho. Dixon wrote the post in 2010 when trying to distinguish […]

  14. […] (relatively) inexpensive computer hardware for storing ‘big data.’ ” The term was invented by James Dixon of Pentaho to describe the vast data repositories used in modern Big Data […]

    Peaxy

    February 28, 2017 at 8:43 pm

