Pentaho, Hadoop, and Data Lakes
Earlier this week, at Hadoop World in New York, Pentaho announced the availability of our first Hadoop release.
As part of the initial research into the Hadoop arena, I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:
- 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
- The source of the data is typically a single application or system.
- The data is typically sub-transactional or non-transactional.
- There are some known questions to ask of the data.
- There are many unknown questions that will arise in the future.
- There are multiple user communities that have questions of the data.
- The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.
In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:
- Only a subset of the attributes is examined, so only pre-determined questions can be answered.
- The data is aggregated, so visibility into the lowest levels is lost.
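To make these two problems concrete, here is a minimal Python sketch. The event records, field names, and function names are all hypothetical, invented for illustration only; they are not part of any Pentaho or Hadoop API. The sketch assumes a data mart that keeps only page views per day and page.

```python
from collections import Counter

# Hypothetical raw event records, as they might arrive from a single application.
raw_events = [
    {"day": "2010-10-12", "page": "/home", "browser": "firefox", "ms": 120},
    {"day": "2010-10-12", "page": "/home", "browser": "chrome", "ms": 340},
    {"day": "2010-10-12", "page": "/buy", "browser": "firefox", "ms": 95},
    {"day": "2010-10-13", "page": "/home", "browser": "chrome", "ms": 210},
]


def build_mart(events):
    """Aggregate to page views per (day, page); every other attribute is dropped."""
    return Counter((e["day"], e["page"]) for e in events)


def views_by_browser(events):
    """A question nobody planned for: it needs an attribute the mart never kept."""
    return Counter(e["browser"] for e in events)


mart = build_mart(raw_events)

# The pre-determined question is easy to answer from the mart.
print(mart[("2010-10-12", "/home")])

# The new question can only be answered from the raw events, because the
# mart's keys contain just (day, page) and the individual rows are gone.
print(views_by_browser(raw_events))
```

Once only `mart` is retained, neither the `browser` attribute nor the individual response times can be recovered from it, which is the loss the two bullets above describe.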
Based on the requirements above and the problems with the traditional solution, we have created a concept called the Data Lake to describe an optimal solution.
If you think of a data mart as a store of bottled water – cleansed, packaged, and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine it, dive in, or take samples.
For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture