Data Lakes Revisited
It seems it’s time to revisit the Data Lake after 4 years. Here’s my original post on it and a couple of video presentations.
Lots of people are using the term these days, with some variety in their definitions and in the stories they are telling.
I give credit to Dan Woods at Forbes for being the first to pick up on the idea http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture
What I’d like to address today is (somewhat negative) commentary by Barry Devlin at TechTarget (http://searchbusinessanalytics.techtarget.com/feature/Data-lake-muddies-the-waters-on-big-data-management) and Andrew White and Nick Heudecker at Gartner (http://www.gartner.com/newsroom/id/2809117). In both these cases the statements they make are not wrong, yet not really right. Let’s take a little history tour, using some YouTube videos from 2010, to discover why. I call out the main points below.
Pentaho Hadoop Series Part 1: Big Data Architecture
- 3:00. A data lake consists of a single source of data. Not distilled (pre-aggregated).
- 3:25. Most companies only have one source of data that meets the criteria.
- 4:30. You store all the data because you don’t know in advance all the questions that you will need to ask of it.
- 6:00. The problem with data marts and data warehouses is that the pre-aggregation limits the questions that can be asked.
- 6:45. By using a data lake, the institutional data marts and data warehouses can be populated with feeds of aggregations from the data lake, but ad-hoc questions can also be answered.
- 8:00. A data lake does not replace a database, data mart, or data warehouse. At least not yet, and certainly not in 2010.
Summary: A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.
Pentaho Hadoop Series Part 5: Big Data and Data Warehouses
- 0:15. Can you use a Big Data solution as a data warehouse? Yes.
- 0:22. Should you? Probably not.
- 0:30. A large amount of data from one or two systems is not a data warehouse – it’s a big data mart at best.
- 1:30. The difference between a data warehouse and a data lake.
- 4:15. What if you really, really want to use a big data system for your data warehouse? Then you have a Water Garden that is populated from data lakes.
Summary: A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.
I chose the term “Data Lake” carefully and paid attention to the analogy and the metaphor. But today some of the people using the term are not using as much care or attention.
Barry Devlin answers a self-imposed question “What is a data lake?” by stating:
“In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications.”
As is clear from the videos above, that was not the original definition of a Data Lake. Not at all. He’s talking about a Water Garden, which is significantly different. I agree with Devlin that the idea of putting all enterprise data into Hadoop (or any other data store) is not a viable option (at least right now). You should use the best tool for the job. Use a transactional database for transactional purposes. Use an analytic database for analytic purposes. Use Hadoop or MongoDB when they are the best tool for the situation. For the foreseeable future the IT environment is, and will be, a hybrid one with many different data stores.
Devlin objects to the term “Data Lake”, whereas I object to his definition of it. It’s incorrect. The underlying issue is that people are using the term inappropriately and inaccurately. More on that later.
I also have some issues with Gartner’s take on Data Lakes (http://www.gartner.com/newsroom/id/2809117).
Their report makes statements like:
“By its definition, a data lake accepts any data, without oversight or governance.”
Who says there is no oversight? By its (original) definition it only accepts data from a single source so “any” is clearly wrong.
“The fundamental issue with the data lake is that it makes certain assumptions about the users of information”.
Who says it makes assumptions? How can a collection of data make assumptions? This makes no sense.
“And without metadata, every subsequent use of data means analysts start from scratch”.
Who says there is no metadata? Now who’s making assumptions? In all cases, Gartner is making these statements only so that they can immediately refute them. Why not state that “Data lakes are pink with purple spots”, and then follow it up with the observation that color makes no sense in this context? Somewhere in all of this the main point has been lost. You store the raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.
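To make that main point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the event fields, the values, and the aggregate helper are hypothetical, and a real lake would hold far more data in something like Hadoop rather than in a list. It only shows the shape of the argument: a pre-aggregated feed answers the question it was built for, while the raw data can still answer a question nobody thought of in advance.

```python
from collections import defaultdict

# Hypothetical raw events from ONE source system (for example, an order log).
# Field names and values are made up for this sketch.
RAW_EVENTS = [
    {"day": "2010-06-01", "region": "EU", "product": "A", "amount": 10.0},
    {"day": "2010-06-01", "region": "US", "product": "A", "amount": 20.0},
    {"day": "2010-06-01", "region": "EU", "product": "B", "amount": 5.0},
    {"day": "2010-06-02", "region": "US", "product": "B", "amount": 7.5},
]


def aggregate(events, *keys):
    """Sum 'amount' grouped by the given keys (illustrative helper)."""
    totals = defaultdict(float)
    for event in events:
        totals[tuple(event[k] for k in keys)] += event["amount"]
    return dict(totals)


# The feed a data mart might receive: revenue by day and region,
# chosen up front. The product dimension is gone from this result.
daily_by_region = aggregate(RAW_EVENTS, "day", "region")

# An ad-hoc question asked later: revenue per product.
# It cannot be derived from daily_by_region, but the raw data answers it.
by_product = aggregate(RAW_EVENTS, "product")

print(daily_by_region)
print(by_product)
```

The same raw events feed both results, which is the arrangement described in the videos: the marts and warehouses get their aggregations from the lake, and the lake stays available for everything else.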
So, some people are misusing the term and applying it to things that maybe make little sense to use as a production architecture. Oh well. The majority of people using the term “Data Warehouse” at Big Data conferences are misusing that term too. They are referring to (at best) a large Data Mart, or (at worst) just a really large flat file, and in most cases not a real Data Warehouse. Confusing, yes. Annoying, yes. Worth spending time and energy on? No, not really.
Barry Devlin is welcome to fight a battle against the term “Data Lake”. Good luck to him. But if he doesn’t like it he should come up with a better idea.
If we’re going to fight pointless uphill battles against terminology, I wouldn’t pick this one, even though it’s my term getting misused. These are the top 3 terms on my hit-list.
- “Big Data”. The median Hadoop deployment is 10 nodes with an amount of data that qualifies as “small” even by 1990s Data Warehousing standards. Many people throw the NoSQL data stores in the Big Data bucket, and their median deployments are even smaller. I think “Scalable” technology is much more interesting than “Big” technology. Oracle’s Exadata box is “Big”, but not suitable for most solutions as it does not scale down. Some technologies don’t scale up. A technology that can cost-effectively scale, and maintain constant performance, with data volumes varying from small to hugely massive is a great technology because you will never have to migrate. I agree that vast volumes and velocities bring interesting problems to light, but these are edge cases. Let’s focus on scalable technology, otherwise we are going to end up copying video standards that have gone from VGA to WHUXGA (Wide-Hex-Ultra-Extended-VGA). I’m not making that up, that’s a real video resolution. Unless we are careful, we are going to go from Big Data, to Super-Big Data, to Extended-Super-Big Data, to Quad-Super-Extended-Big Data, to Wide-Hex-Ultra-Extended-Big Data. Give me one thing that goes from Tiny-Bit-Mini-Reduced-Small Data to Wide-Hex-Ultra-Extended-Big Data and call it “Scalable Data” please.
- “NoSQL”. How come every major NoSQL data source has added (or is adding) some kind of structured query language? With JDBC/ODBC drivers on the way? Why? Because SQL was never the problem. A data store without query capabilities? Is the data there or not? Probably eventually. It’s “Schrödinger’s Commit” (a little geek humor for CAP theorists). If SQL was not the problem, what was? Schemas and performance were the real problems, and SQL was the scapegoat. But when we add SQL to NoSQL, do those negate each other? Are we left with a null? I propose we retroactively call these data stores “NoSchema” or “Technologies Now with SQL but Formerly Known as NoSQL”, or “Recently SQL’ed”, or maybe “A Little Late to the SQL Party But We Love Them Anyway”.
- “Free as in Speech vs. Free as in Beer”. Richard Stallman has had some great ideas. Using this phrase to describe free software was not one of them. Where is all this free beer he speaks of? Beer takes money and time (and time=money) to produce. “Free speech” is something not uniformly available in all countries, and thus not easy to translate. However, everyone understands freedom or liberty. Many people are confused by this phrase. So I (pointlessly) suggest we use this: free as in gratis vs. free as in liberty.