Pentaho, Hadoop, and Data Lakes

Earlier this week, at Hadoop World in New York, Pentaho announced the availability of our first Hadoop release.

As part of the initial research into the Hadoop arena, I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

80-90% of companies are dealing with structured or semi-structured data (not unstructured).
The source of the data is typically a single application or system.
The data is typically sub-transactional or non-transactional.
There are some known questions to ask of the data.
There are many unknown questions that will arise in the future.
There are multiple user communities that have questions of the data.
The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:

Only a subset of the attributes is examined, so only pre-determined questions can be answered.
The data is aggregated, so visibility into the lowest levels is lost.
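
To make these two problems concrete, here is a minimal Python sketch. The records, field names, and both functions are hypothetical and invented for illustration; none of them come from Pentaho or Hadoop. It builds a small aggregate that keeps one pre-selected attribute, then asks a question that only the detailed records can answer.

```python
from collections import Counter

# Hypothetical raw event records, standing in for the detailed data at its lowest level.
raw_events = [
    {"user": "a", "page": "/home", "browser": "firefox", "ms": 120},
    {"user": "b", "page": "/home", "browser": "chrome", "ms": 340},
    {"user": "a", "page": "/cart", "browser": "firefox", "ms": 95},
    {"user": "c", "page": "/cart", "browser": "chrome", "ms": 410},
]


def build_mart(events):
    """Aggregate to page-view counts, keeping only the pre-selected 'page' attribute."""
    return Counter(e["page"] for e in events)


def slow_requests_by_browser(events, threshold_ms):
    """A question nobody planned for: which browsers see slow responses?"""
    return Counter(e["browser"] for e in events if e["ms"] > threshold_ms)


mart = build_mart(raw_events)
print(dict(mart))  # page-view counts only; browser and timing are gone

# The mart cannot answer the new question, but the raw records can.
print(dict(slow_requests_by_browser(raw_events, 300)))
```

Once the data has been reduced to page counts, the browser and response-time attributes no longer exist in the aggregate, so the second question can only be answered by going back to the detailed records.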

Based on the requirements above and the problems with the traditional solutions, we have created a concept called the Data Lake to describe an optimal solution.

If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

Written by James

October 14, 2010 at 4:06 pm

Posted in Business Intelligence, Datawarehousing, Hadoop, open source

15 Responses


[…] fill the lake, and various users of the lake can come to examine, dive in, or take samples,” said Pentaho CTO James Dixon, creator of the term ‘Data […]

Big Data and Data Lakes: A New Generation of the Data Warehouse | Formtek Blog

July 15, 2014 at 3:02 pm

[…] https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/ […]

Data Lakes Revisited | James Dixon's Blog

September 25, 2014 at 4:43 am

[…] for the term, ‘Data Lake.’ He first wrote about the Data Lake concept on his blog in 2010, Pentaho, Hadoop and Data Lakes. After the numerous interpretations and feedback, he revisited the concept and definition here: […]

Union of the State – A Data Lake Use Case | Pentaho Business Analytics Blog

January 22, 2015 at 6:41 pm

[…] 2010, James Dixon introduced the concept of the Data Lake, and his idea has gained traction ever since. Dixon’s Data Lake is a style of data warehouse […]

Data Lake vs. Data Warehouse: Which is Right for Healthcare?

February 17, 2015 at 7:33 pm

[…] an enterprise, but I should mention that isn’t how it was originally intended. The term was coined by James Dixon in 2010, when he did that he intended a data lake to be used for a single data source, multiple data […]

Bliki: DataLake | ..:: Frog in the box ::..

June 16, 2015 at 9:37 am

[…] James Dixon, who identifies himself as the Chief Geek at Pentaho, coined the term data lake and describes it this way: […]

The Data Lake: A More Balanced Perspective | The Cyberista Says

July 6, 2015 at 2:01 pm

[…] Data Lake was introduced in 2010 by James Dixon, CTO of Pentaho, in connection with the launch of their first Big Data solution on […]

Data Lake vs. Datavarehus | NextBridge Group

September 1, 2015 at 8:21 am

[…] five years since Pentaho’s CTO, James Dixon coined the now-ubiquitous term data lake in his blog. His metaphor contrasted bottled water which is cleansed and packaged for easy consumption with the […]

Turning Your Data Lake into a Streamlined Data Refinery | Ashnik

February 18, 2016 at 3:02 am

[…] between the data warehouse and the Data Lake. The term Data Lake first appeared in October 2010 on the blog of James Dixon, CTO of Pentaho, a company specializing in Business Intelligence. In his first approach to the concept, […]

Qu’est-ce qu’un Data Lake à l’heure du Big Data ? | E-media, the Econocom blog

March 23, 2016 at 8:57 am

[…] term data lake was first mentioned in 2010 in a blog by James Dixon, the CTO of Business Intelligence specialist Pentaho. He described it […]

Wat is een Data Lake in het Big Data landschap? | E-media, het Econocom blog

April 25, 2016 at 9:26 am

[…] has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and […]

Data Lakes: Safe Way to Swim in Big Data? |

May 11, 2016 at 4:21 pm

[…] // It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and […]

Ventana Research: Data Lakes: Safe Way to Swim in Big Data? : MSR Communications

May 17, 2016 at 6:00 pm

[…] seem simple: securely store all your data in a raw format and apply a schema on read. Indeed, the first description of a data lake compared it to a ‘large body of water in a more natural state’, whereas a data […]

Introducing the Data Lake Solution on AWS – Cloud Data Architect

November 30, 2016 at 5:20 pm

[…] term “data lake” came from a blog post composed by James Dixon, CTO of Pentaho. Dixon wrote the post in 2010 when trying to distinguish […]

AWS | Cleaning Up Your Data Lake | Relus Cloud

February 8, 2017 at 9:13 pm

[…] (relatively) inexpensive computer hardware for storing ‘big data.’ ” The term was invented by James Dixon of Pentaho to describe the vast data repositories used in modern Big Data […]

Peaxy

February 28, 2017 at 8:43 pm


James Dixon’s Blog