James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Data Lakes Revisited

with 9 comments

It seems it’s time to revisit the Data Lake after 4 years. Here’s my original post on it and a couple of video presentations.

There are lots of people using the term these days and some variety in their definitions and the stories they are telling:

I give credit to Dan Woods at Forbes for being the first to pick up on the idea http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture

What I’d like to address today is (somewhat negative) commentary by Barry Devlin at TechTarget (http://searchbusinessanalytics.techtarget.com/feature/Data-lake-muddies-the-waters-on-big-data-management) and Andrew White and Nick Heudecker at Gartner (http://www.gartner.com/newsroom/id/2809117). In both these cases the statements they make are not wrong, yet not really right. Let’s take a little history tour, using some YouTube videos from 2010, to discover why. I call out the main points below.

Pentaho Hadoop Series Part 1: Big Data Architecture

https://www.youtube.com/watch?v=tR_yLsr87Uk

  • 3:00. A data lake consists of a single source of data. Not distilled (pre-aggregated).
  • 3:25. Most companies only have one source of data that meets the criteria.
  • 4:30. You store all the data because you don’t know in advance all the questions that you will need to ask of it.
  • 6:00. The problem with data marts and data warehouses is that the pre-aggregation limits the questions that can be asked.
  • 6:45. By using a data lake, the institutional data marts and data warehouses can be populated with feeds of aggregations from the data lake, but ad-hoc questions can also be answered.
  • 8:00. A data lake does not replace a database, data mart, or data warehouse. At least not yet, and certainly not in 2010.

Summary: A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.
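To make the points above concrete, here is a minimal Python sketch. It is an illustration only, not taken from the videos: the event records, field names, and function names are all hypothetical. It shows raw, granular events from a single source feeding a pre-aggregated data-mart style summary, while the same raw events can still answer a question that the summary cannot.

```python
from collections import defaultdict

# Hypothetical raw events from ONE source system (a web server log, say).
# Nothing is distilled: every event keeps its full granularity.
RAW_EVENTS = [
    {"day": "2010-10-01", "user": "ann", "page": "/home", "ms": 120},
    {"day": "2010-10-01", "user": "ann", "page": "/pricing", "ms": 340},
    {"day": "2010-10-01", "user": "bob", "page": "/home", "ms": 95},
    {"day": "2010-10-02", "user": "bob", "page": "/pricing", "ms": 80},
    {"day": "2010-10-02", "user": "ann", "page": "/home", "ms": 110},
]


def daily_page_views(events):
    """Pre-aggregated feed of the kind a data mart might be loaded with.

    The user and response-time attributes are dropped here, so the mart
    can only answer questions about views per day and page.
    """
    counts = defaultdict(int)
    for event in events:
        counts[(event["day"], event["page"])] += 1
    return dict(counts)


def slowest_page_per_user(events):
    """An ad-hoc question nobody planned for.

    It needs the user and response time of each event, so it can be
    answered from the raw events but not from daily_page_views().
    """
    worst = {}
    for event in events:
        current = worst.get(event["user"])
        if current is None or event["ms"] > current[1]:
            worst[event["user"]] = (event["page"], event["ms"])
    return {user: page for user, (page, _) in worst.items()}


mart_feed = daily_page_views(RAW_EVENTS)
ad_hoc_answer = slowest_page_per_user(RAW_EVENTS)
```

The mart feed stays useful for the questions it was designed for. The raw events remain available for everything else.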

Pentaho Hadoop Series Part 5: Big Data and Data Warehouses

https://www.youtube.com/watch?v=1CG01JmKp2Y

  • 0:15. Can you use a Big Data solution as a data warehouse? Yes.
  • 0:22. Should you? Probably not.
  • 0:30. A large amount of data from one or two systems is not a data warehouse – it’s a big data mart at best.
  • 1:30. The difference between a data warehouse and a data lake.
  • 4:15. What if you really, really want to use a big data system for your data warehouse? Then you have a water-garden that is populated from data lakes.

Summary: A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.
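The distinction can be sketched in a few lines of Python. This is a hypothetical example, not an implementation from the videos: the two lakes, their field names, and the ID mapping are all assumptions made up for illustration. Each lake holds data from one source. The moment you join across them you need something neither lake contains on its own, namely a mapping between their identifiers.

```python
from collections import defaultdict

# Two hypothetical lakes, each fed by a single source system.
WEB_LAKE = [
    {"visitor_id": "v1", "page": "/pricing"},
    {"visitor_id": "v2", "page": "/home"},
    {"visitor_id": "v1", "page": "/docs"},
]
SUPPORT_LAKE = [
    {"customer": "C-100", "issue": "billing"},
    {"customer": "C-200", "issue": "login"},
]

# Assumed cross-system metadata: which web visitor is which customer.
# Neither lake carries this, so it has to be built and maintained separately.
ID_MAP = {"v1": "C-100"}


def pages_seen_by_ticket_holders(web, support, id_map):
    """Join across both lakes: pages viewed by customers who filed a ticket."""
    customers_with_tickets = {ticket["customer"] for ticket in support}
    pages = defaultdict(list)
    for row in web:
        customer = id_map.get(row["visitor_id"])  # None when unmapped
        if customer in customers_with_tickets:
            pages[customer].append(row["page"])
    return dict(pages)


joined = pages_seen_by_ticket_holders(WEB_LAKE, SUPPORT_LAKE, ID_MAP)
```

Visitor v2 drops out of the result because no mapping exists for it. That extra layer of shared keys and metadata is what separates a Water Garden from a set of independent lakes.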

I chose the term “Data Lake” carefully and paid attention to the analogy and the metaphor. But today some of the people using the term are not using as much care or attention.

Barry Devlin answers a self-imposed question “What is a data lake?” by stating:

“In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications.”

As is clear from the videos above, that was not the original definition of a Data Lake. Not at all. He’s talking about a Water Garden, which is significantly different. I agree with Devlin that the idea of putting all enterprise data into Hadoop (or any other data store) is not a viable option (at least right now). You should use the best tool for the job. Use a transactional database for transactional purposes. Use an analytic database for analytic purposes. Use Hadoop or MongoDB when they are the best tool for the situation. For the foreseeable future the IT environment is, and will be, a hybrid one with many different data stores.

Devlin objects to the term “Data Lake”, whereas I object to his definition of it. It’s incorrect. The underlying issue is that people are using the term inappropriately and inaccurately. More on that later.

I also have some issues with Gartner’s take on Data Lakes (http://www.gartner.com/newsroom/id/2809117).

Their report makes statements like:

“By its definition, a data lake accepts any data, without oversight or governance.”

Who says there is no oversight? By its (original) definition it only accepts data from a single source so “any” is clearly wrong.

“The fundamental issue with the data lake is that it makes certain assumptions about the users of information”.

Who says it makes assumptions? How can a collection of data make assumptions? This makes no sense.

“And without metadata, every subsequent use of data means analysts start from scratch”.

Who says there is no metadata? Now who’s making assumptions? In all cases, Gartner is making these statements only so that they can immediately refute them. Why not state that “Data lakes are pink with purple spots”, and then follow it up with the observation that color makes no sense in this context? Somewhere in all of this the main point has been lost. You store the raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.

So, some people are misusing the term and applying it to things that maybe make little sense to use as a production architecture. Oh well. The majority of people using the term “Data Warehouse” at Big Data conferences are misusing that term too. They are referring to (at best) a large Data Mart, or (at worst) just a really large flat file, and in most cases not a real Data Warehouse. Confusing, yes. Annoying, yes. Worth spending time and energy on? No, not really.

Barry Devlin is welcome to fight a battle against the term “Data Lake”. Good luck to him. But if he doesn’t like it he should come up with a better idea.

If we’re going to fight pointless uphill battles against terminology, I wouldn’t pick this one, even though it’s my term getting misused. These are the top 3 terms on my hit-list.

  1. “Big Data”. The median Hadoop deployment is 10 nodes with an amount of data that qualifies as “small” even by 1990s Data Warehousing standards. Many people throw the NoSQL data stores into the Big Data bucket, and their median deployments are even smaller. I think “Scalable” technology is much more interesting than “Big” technology. Oracle’s Exadata box is “Big”, but not suitable for most solutions as it does not scale down. Some technologies don’t scale up. A technology that can cost-effectively scale and maintain constant performance, with data volumes varying from small to hugely massive, is a great technology because you will never have to migrate. I agree that vast volumes and velocities bring interesting problems to light, but these are edge cases. Let’s focus on scalable technology, otherwise we are going to end up copying video standards that have gone from VGA to WHUXGA (Wide-Hex-Ultra-Extended-VGA). I’m not making that up; that’s a real video resolution. Unless we are careful, we are going to go from Big Data, to Super-Big Data, to Extended-Super-Big Data, to Quad-Super-Extended-Big Data, to Wide-Hex-Ultra-Extended-Big Data. Give me one thing that goes from Tiny-Bit-Mini-Reduced-Small Data to Wide-Hex-Ultra-Extended-Big Data and call it “Scalable Data” please.
  2. NoSQL. How come every major NoSQL data source has added (or is adding) some kind of structured query language? With JDBC/ODBC drivers on the way? Why? Because SQL was never the problem. A data store without query capabilities? Is the data there or not? Probably eventually. It’s “Schrödinger’s Commit” (a little geek humor for CAP theorists). If SQL was not the problem, what was? Schemas and performance were the real problems, and SQL was the scapegoat. But when we add SQL to NoSQL, do those negate each other? Are we left with a null? I propose we retroactively call these data stores “NoSchema” or “Technologies Now with SQL but Formerly Known as NoSQL”, or “Recently SQL’ed”, or maybe “A Little Late to the SQL Party But We Love Them Anyway”.
  3. “Free as in Speech vs. Free as in Beer”. Richard Stallman has had some great ideas. Using this phrase to describe free software was not one of them. Where is all this free beer he speaks of? Beer takes money and time (and time=money) to produce. “Free speech” is something not uniformly available in all countries, and thus not easy to translate. However, everyone understands freedom or liberty. Many people are confused by this phrase. So I (pointlessly) suggest we use this: free as in gratis vs free as in liberty.

Written by James

September 25, 2014 at 4:43 am

Posted in Uncategorized

9 Responses

  1. Hi James, very good post.

    Can Data Lakes be understood as Operational Data Stores (ODS) with added capabilities? The difference I see between ODSs and Data Lakes is that the former don’t manage unstructured/semi-structured data and don’t keep all history. I saw this link http://www.splicemachine.com/applications/operational-data-lake/ which impressed on me how new terms are being created and used.
    Cheers,

    Carlos

    October 23, 2014 at 1:19 am

    • That is a good question. A Data Lake can be used to store device data and log data, which is at a much lower level of granularity than an ODS. An ODS is more about the storage of current state. Device data goes deeper than that because it often contains information about events which may or may not result in state changes.

      An Operational Data Lake seems to be an improvement over a traditional ODS, however this is one of many use cases for a Data Lake architecture.

      James

      October 23, 2014 at 1:33 am

      • Hi James,

        I’m a BI architect & currently busy with Big Data POC & studies.
        In that context I started looking deeper into the “Data Lake” wave….

        I introduced myself in the subject via the white paper of Booz Allen Hamilton; which you’ve listed & tagged as kind of a “wrong” interpretation of your concept.

        Can you please tell us more on your opinion about their approach to the Data Lake?

        If I understood you correctly, one source = one data lake; in their approach they propose a single big data table to host all the data from any source, with some metadata that can be added as we learn about the data….

        Thanks.

        Kaaoiass

        November 12, 2014 at 1:48 pm

  2. Hi Kaaoiass,

    The Booz Allen Hamilton paper is good. The fact that they are putting data from multiple data sources into what they call a “Data Lake” is a minor change to the original definition. But it leads to confusion about the model because not all of the data is necessarily equal when you do that, and metadata becomes much more of an issue.

    In practice these conceptual differences won’t make much, if any, impact when it comes to the implementation. If you have two data sources your architecture, technology, and capabilities probably won’t differ much whether you consider it to be one data lake or two.

    James

    November 12, 2014 at 5:33 pm

    • Thanks James, but what worries me about this kind of implementation is the fact that absolutely all source data resides in a single big table (Accumulo); that can be a single point of failure, right?

      Reading their papers we get the idea that they have created the concept of the Data Lake.

      How do you see the linkage & traceability? In their approach they have pointers to the original files… is that enough? I don’t think so; metadata should also document those aspects….

      kaaoiass

      November 13, 2014 at 8:14 am

      • I don’t recommend using Hadoop as the repository for all your operational systems. Just put the data you need to into Hadoop. If you have data from different sources, you won’t put them into a single table. They will be in different tables/files/directories. If you want to start joining the data or doing lineage, then yes, you will need metadata as well.

        James

        November 13, 2014 at 6:41 pm

  3. Agree with you; I don’t see the Data Lake as the Enterprise “All-Data-In” space; however, it’s challenging to know what data to put in… the whole idea is to provision the data lake first & then see what data can be used for what purpose.

    The Booz Allen Hamilton approach suggests using one big table for all data sources; that’s why I asked your opinion on that…

    To me the issue is not really the Data Lake as such but other aspects related to Data Governance in general and Data Management in particular.

    Anyhow, I would like to thank you for your very valuable inputs.

    kaaoiass

    November 17, 2014 at 11:10 am

  4. […] Pentaho co-founder and CTO, James Dixon is who we have to thank for the term, ‘Data Lake.’ He first wrote about the Data Lake concept on his blog in 2010, Pentaho, Hadoop and Data Lakes. After the numerous interpretations and feedback, he revisited the concept and definition here: Data Lakes Revisited. […]

  5. […] Dixon followed up on his 2010 post this fall with some interesting arguments about how we define “data lake” and certain other terms. Our view is that data lakes are an […]
