It seems it’s time to revisit the Data Lake after 4 years. Here’s my original post on it and a couple of video presentations.
There are lots of people using the term these days and some variety in their definitions and the stories they are telling:
I give credit to Dan Woods at Forbes for being the first to pick up on the idea http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture
What I’d like to address today is (somewhat negative) commentary by Barry Devlin at TechTarget (http://searchbusinessanalytics.techtarget.com/feature/Data-lake-muddies-the-waters-on-big-data-management) and Andrew White and Nick Heudecker at Gartner (http://www.gartner.com/newsroom/id/2809117). In both these cases the statements they make are not wrong, yet not really right. Let’s take a little history tour, using some YouTube videos from 2010, to discover why. I call out the main points below.
Pentaho Hadoop Series Part 1: Big Data Architecture
- 3:00. A data lake consists of a single source of data. Not distilled (pre-aggregated).
- 3:25. Most companies only have one source of data that meets the criteria.
- 4:30. You store all the data because you don’t know in advance all the questions that you will need to ask of it.
- 6:00. The problem with data marts and data warehouses is that the pre-aggregation limits the questions that can be asked.
- 6:45. By using a data lake, the institutional data marts and data warehouses can be populated with feeds of aggregations from the data lake, but ad-hoc questions can also be answered.
- 8:00. A data lake does not replace a database, data mart, or data warehouse. At least not yet, and certainly not in 2010.
Summary: A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.
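To make the points from the video concrete, here is a minimal sketch in Python. The event records, field names, and the `aggregate` helper are all hypothetical, invented for illustration; they are not taken from the video or from any Pentaho or Hadoop API. It shows why keeping the undistilled data matters: a pre-aggregated feed can populate a data mart, but only the raw records can answer a question nobody thought of in advance.

```python
from collections import defaultdict

# Hypothetical raw events from a single source; the fields are illustrative only.
RAW_EVENTS = [
    {"day": "2010-06-01", "region": "EMEA", "product": "A", "amount": 120.0},
    {"day": "2010-06-01", "region": "APAC", "product": "B", "amount": 80.0},
    {"day": "2010-06-02", "region": "EMEA", "product": "B", "amount": 50.0},
    {"day": "2010-06-02", "region": "EMEA", "product": "A", "amount": 30.0},
]


def aggregate(events, keys):
    """Sum the 'amount' field, grouped by the given key fields."""
    totals = defaultdict(float)
    for event in events:
        group = tuple(event[key] for key in keys)
        totals[group] += event["amount"]
    return dict(totals)


# Feed for a data mart: pre-aggregated by day, so region and product are gone.
daily_mart = aggregate(RAW_EVENTS, ["day"])

# An ad-hoc question asked later: only the raw events can answer it.
by_region_product = aggregate(RAW_EVENTS, ["region", "product"])

print(daily_mart)
print(by_region_product)
```

The daily mart cannot be un-aggregated to recover sales by region and product, which is the limitation of pre-aggregation described at 6:00.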
Pentaho Hadoop Series Part 5: Big Data and Data Warehouses
- 0:15. Can you use a Big Data solution as a data warehouse? Yes.
- 0:22. Should you? Probably not.
- 0:30. A large amount of data from one or two systems is not a data warehouse – it’s a big data mart at best.
- 1:30. The difference between a data warehouse and a data lake.
- 4:15. What if you really, really want to use a big data system for your data warehouse? Then you have a water-garden that is populated from data lakes.
Summary: A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.
I chose the term “Data Lake” carefully and paid attention to the analogy and the metaphor. But today some of the people using the term are not using as much care or attention.
Barry Devlin answers a self-imposed question “What is a data lake?” by stating:
“In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications.”
As is clear from the videos above, that was not the original definition of a Data Lake. Not at all. He’s talking about a Water Garden, which is significantly different. I agree with Devlin that the idea of putting all enterprise data into Hadoop (or any other data store) is not a viable option (at least right now). You should use the best tool for the job. Use a transactional database for transactional purposes. Use an analytic database for analytic purposes. Use Hadoop or MongoDB when they are the best tool for the situation. For the foreseeable future the IT environment is, and will be, a hybrid one with many different data stores.
Devlin objects to the term “Data Lake”, whereas I object to his definition of it. It’s incorrect. The underlying issue is that people are using the term inappropriately and inaccurately. More on that later.
I also have some issues with Gartner’s take on Data Lakes (http://www.gartner.com/newsroom/id/2809117).
Their report makes statements like:
“By its definition, a data lake accepts any data, without oversight or governance.”
Who says there is no oversight? By its (original) definition it only accepts data from a single source so “any” is clearly wrong.
“The fundamental issue with the data lake is that it makes certain assumptions about the users of information”.
Who says it makes assumptions? How can a collection of data make assumptions? This makes no sense.
“And without metadata, every subsequent use of data means analysts start from scratch”.
Who says there is no metadata? Now who’s making assumptions? In all cases, Gartner is making these statements only so that they can immediately refute them. Why not state that “Data lakes are pink with purple spots”, and then follow it up with the observation that color makes no sense in this context? Somewhere in all of this the main point has been lost. You store the raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.
So, some people are misusing the term and applying it to things that maybe make little sense to use as a production architecture. Oh well. The majority of people using the term “Data Warehouse” at Big Data conferences are misusing that term too. They are referring to (at best) a large Data Mart, or (at worst) just a really large flat file, and in most cases not a real Data Warehouse. Confusing, yes. Annoying, yes. Worth spending time and energy on? No, not really.
Barry Devlin is welcome to fight a battle against the term “Data Lake”. Good luck to him. But if he doesn’t like it he should come up with a better idea.
If we’re going to fight pointless uphill battles against terminology, I wouldn’t pick this one, even though it’s my term getting misused. These are the top 3 terms on my hit-list.
- “Big Data”. The median Hadoop deployment is 10 nodes with an amount of data that qualifies as “small” even by 1990s Data Warehousing standards. Many people throw the NoSQL data stores in the Big Data bucket, and their median deployments are even smaller. I think “Scalable” technology is much more interesting than “Big” technology. Oracle’s Exadata box is “Big”, but not suitable for most solutions as it does not scale down. Some technologies don’t scale up. A technology that can scale cost-effectively and maintain constant performance, with data volumes varying from small to hugely massive, is a great technology because you will never have to migrate. I agree that vast volumes and velocities bring interesting problems to light, but these are edge cases. Let’s focus on scalable technology; otherwise we are going to end up copying video standards that have gone from VGA to WHUXGA (Wide-Hex-Ultra-Extended-VGA). I’m not making that up; that’s a real video resolution. Unless we are careful, we are going to go from Big Data, to Super-Big Data, to Extended-Super-Big Data, to Quad-Super-Extended-Big Data, to Wide-Hex-Ultra-Extended-Big Data. Give me one thing that goes from Tiny-Bit-Mini-Reduced-Small Data to Wide-Hex-Ultra-Extended-Big Data and call it “Scalable Data” please.
- NoSQL. How come every major NoSQL data source has added (or is adding) some kind of structured query language? With JDBC/ODBC drivers on the way? Why? Because SQL was never the problem. A data store without query capabilities? Is the data there or not? Probably eventually. It’s “Schrödinger’s Commit” (a little geek humor for CAP theorists). If SQL was not the problem, what was? Schemas and performance were the real problems, and SQL was the scapegoat. But when we add SQL to NoSQL, do those negate each other? Are we left with a null? I propose we retroactively call these data stores “NoSchema” or “Technologies Now with SQL but Formerly Known as NoSQL”, or “Recently SQL’ed”, or maybe “A Little Late to the SQL Party But We Love Them Anyway”.
- “Free as in Speech vs. Free as in Beer”. Richard Stallman has had some great ideas. Using this phrase to describe free software was not one of them. Where is all this free beer he speaks of? Beer takes money and time (and time=money) to produce. “Free speech” is something not uniformly available in all countries, and thus not easy to translate. However, everyone understands freedom or liberty. Many people are confused by this phrase. So I (pointlessly) suggest we use this: free as in gratis vs free as in liberty.
I have been playing around with Prezi (http://prezi.com/), the online presentation tool. It’s a cool thing that lets you create presentations that are visually different from Powerpoint/Keynote. Like all these tools it will let you create bad presentations very quickly. If you want to create something compelling and appealing it will take planning and thought. Judging by the presentations on their site, most people go way overboard on the zooming and rotation, and the result is confusing and disorienting.
A cool tool but the design environment is very constraining and frustrating. Great for educational and light usage but not really suitable for large-scale or every day scenarios.
- Zoom/Pan/Rotate: These give a new alternative to the standard Powerpoint feel.
- Parallax Background: A nice effect. They call it 3D but it’s really just parallax of the background image.
- Simple to Use: It’s easy to create simple presentations.
- Cost: If you don’t mind all your presentations being public you can use the free version. The paid versions are not cheap in the long run.
- Still Linear: They say it is non-linear and 3D but it is not. It is a zoom-able 2D canvas and can only create one path through the presentation with no branches or loops. It’s a linear flow through a 2D space. You can jump to different parts of the path if you can see them on the screen but constructing a truly non-linear path is clunky.
- Basic Editing: No ability to directly set the size or rotation of objects using a properties editor. It’s hard to get objects exactly the same size and shape and rotation. It has a small color palette with no color gradients.
- Z-Ordering: Prezi supports z-ordering but there is no way to control it. If you want to change the z-order of an object you have to copy/delete/paste the object and any that overlap it until you have the right ordering.
- Text Editing: Text controls are too rudimentary for a presentation tool. Text has a very small color palette with no way to set RGB values except at the theme level – you have to go to CSS editing to get better control. No way to stretch text except proportionally. To get good control over text you need to create text objects in Inkscape or Illustrator and import them into Prezi.
- Designer: Selecting objects and basic navigating can be extremely frustrating. A toolbar for basic operations would be really helpful.
- Animations/Build: Only one – build (appear). No build-outs. If you want to combine builds with overlapping frames some things are hard/impossible as you cannot control which frames “own” which objects.
- Transitions: Only three – slide, rotate, zoom. You cannot choose which; Prezi chooses for you based on the arrangement of your slides. There is no ability to set the speed of a transition, so you have to rely on the side effects of the slide arrangement. In auto-play mode you can only set the same timing for every transition in the deck, not for each individual transition.
- Viewer controls: On a computer you use the left/right arrow keys to move, and once you get to the end you can right-click/control-click to rewind. On a tablet you cannot swipe, you click on the left/right edges which means you cannot put click-able objects close to the edges (to jump to other parts of the path). Also on tablets there is no rewind option, making it awkward for demo/booth usage.
- No Save-As: Seriously? If you are about to embark on major modifications to a presentation you have no way to manage backups. You have to download a copy locally and then modify it and push it back to the server to overwrite your changes.
- Vector Graphic Support: For a graphical tool that supports zooming and rotation, vector graphics are important. The only import vector formats Prezi supports are PDF and SWF – you cannot use SVG or AI or EPS. Using vector graphics that include transparency is really hard.
- No Ability to Turn Off Transitions: Online meeting tools like Webex, Netmeeting etc will have major problems with the transition animation. There is no way to remove transitions and just jump between the frames.
Today we are launching our 12 Days of Visualizations program: http://events.pentaho.com/12days-of-Big-Data-Visualizations.html
We are going to release a few new visualizations every week over the holiday period. You can drop these visualizations into a Pentaho BA server and they will appear on the charting menu in Analyzer.
The first one that we are releasing is a Sun Burst chart. This chart is based on the Protovis sun burst chart – http://mbostock.github.com/protovis/ex/sunburst.html
The Sun Burst chart we created can be used in a couple of ways. Firstly it can be used as a multi-level pie chart. This sun burst shows how the sales in three territories break down into sales of product lines within those territories, and then how product line sales compare by year:
This effect is achieved by using a color gradient for the outer ring that is based on the chart palette color of the inner rings, and by sorting the segments in each ring into descending order. When you compare the sun burst above with the pie chart below, you can see how much more information the sun burst provides.
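The coloring and sorting described above can be sketched roughly as follows. This is only an illustration in Python, not the code behind the Pentaho chart or the Protovis example: the function names, the `max_lighten` parameter, the blend-toward-white approach, and the sample values are all assumptions.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of ints to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def lighten(color, fraction):
    """Blend a color toward white: 0 leaves it unchanged, 1 gives white."""
    return rgb_to_hex(
        tuple(round(c + (255 - c) * fraction) for c in hex_to_rgb(color))
    )


def outer_ring(parent_color, children, max_lighten=0.6):
    """Sort child segments in descending order of value and shade them
    along a gradient that starts at the parent's palette color."""
    ordered = sorted(children.items(), key=lambda item: item[1], reverse=True)
    steps = max(len(ordered) - 1, 1)
    return [
        (name, value, lighten(parent_color, max_lighten * index / steps))
        for index, (name, value) in enumerate(ordered)
    ]


# Hypothetical product-line sales for one territory colored '#336699'.
segments = outer_ring("#336699", {"Classic Cars": 300, "Trains": 50, "Ships": 120})
for name, value, color in segments:
    print(name, value, color)
```

The largest segment gets the parent color itself and smaller segments get progressively lighter tints, so the outer ring reads as belonging to its inner segment.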
You can choose to use a common color gradient on the outer ring so that it is easier to compare the items on that ring. In this example a blue gradient has been used for the outer ring. Regardless of which territory a city is in, the shade of blue it is colored in can be used to compare it with other cities.
In this chart a red/yellow/green gradient has been used. Here the levels of the chart are year, quarter, and month so the data has not been sorted. The data for this chart is overtime costs so the gradient has been reversed to show larger overtime costs in red, and smaller ones in green.
You can find out more about this chart here: http://wiki.pentaho.com/display/COM/Sunburst
ZDNet reports on a Forrester survey that finds 5 out of 6 developers are using or deploying open source.
In the survey they found that 7% of developers are using open source business intelligence tools such as Pentaho.
The United States Department of Labor states that, in 2010, there were 913,100 software developers in the USA alone.
7% of 913,100 means about 64,000 developers using open source business intelligence software. Nice.
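For anyone who wants to check the arithmetic, here is the calculation in Python. The variable names are mine, and the estimate assumes the survey percentage applies evenly to the Department of Labor headcount, which the survey itself does not claim.

```python
# Figures quoted above: 913,100 US software developers, 7% from the survey.
developers_in_usa = 913_100
percent_using_open_source_bi = 7

# Integer arithmetic avoids floating-point rounding noise.
estimate = developers_in_usa * percent_using_open_source_bi // 100

print(f"{estimate:,}")  # 63,917, which rounds to about 64,000
```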
My thoughts on the whole Emily White/stealing music topic:
When she says she only bought 15 albums, I think she is talking about physical CDs. I think she did buy some of her music online. But she clearly states that she ripped music from the radio station and swapped mix CDs with her friends, and she makes it sound like she thinks this is not stealing.
Don’t Blame iTunes
Many people who complain about artists’ income blame Apple and iTunes. Yes, iTunes propagated the old economic splits and percentages into the digital world. But Apple did not create those splits; they were agreed upon in contracts between the labels/producers and the artists. What iTunes did was to provide an alternative digital distribution medium to Napster. Apple saved artists from the prospect of getting no revenue at all. People who attack and boycott iTunes thinking that they are helping artists are deluded.
It’s Not Just Music
This whole debate also extends to movies, books, news commentary, and software – anything that can be digitally copied. In each of these arenas, the players and economic distribution are different, but the consequence of not paying is the same. If we all behaved this way, ultimately, there would be no books or movies either. So how does this relate to proprietary software, open source software, and free software?
Just like companies that publish books, music, movies etc, proprietary software companies were the gatekeepers. They decided what software was created and made available. When the hardware and software becomes available at the consumer level, independent producers spring up. This happened with freeware software for PCs. The internet enables the distribution of the software, and methods of collecting payment. The costs of creating books, music, and movies have dropped dramatically because of the hardware and software now available. But, if no-one pays for the content created the proprietary software companies will go out of business.
Open source and free software are other ways for creating and distributing software, the difference being that these rely on software (source and binaries) being easy to copy. Don’t steal Microsoft’s BI software and use it without permission. Use our open source BI software – we want you to.
Free software requires that the software, and all software that is built upon it, be ‘free’. In this case ‘free’ means you can freely modify it, distribute it, and build upon it, and you give others those same rights. You can still charge for the software, but it makes no sense to (given the rights you give to your ‘customer’).
The ideals of the Free Software Foundation (FSF) are based on the notion that when you think of something or invent something, it belongs to the world; you don’t own it. This is a wonderful idea; however, most of the world, including many industries, jobs, and professions, is based on the opposite principle – if you create it, you own it. To my mind I have fewer rights under the FSF view of the world: I don’t have the right to my own ideas.
Because of the freedoms that the Free Software Foundation believes in, they are against Digital Rights Management (DRM) software. DRM tries to protect the rights of artists, producers, and distributors of artistic content. In order to protect these rights, software is needed that is proprietary. If the DRM source code was open, it would make it easy for hackers to decode the content and remove the copy protection. So the Free Software Foundation is taking up the fight against DRM, calling it ‘Digital Restrictions Management’ (http://www.fsf.org/campaigns/drm.html). They call it this because, they say, DRM takes away your right to steal other people’s inventions. If you support DRM-free software, you are choosing to fight against musicians, authors, actors.
The Open Source movement takes a pragmatic approach on this topic. When you have an idea, it is yours. You can choose to do whatever you want with your invention. If it is a software invention, and you choose to put it into open source, that’s great. If you choose not to, that’s fine too, because it is yours. Open Source allows hybrid models – where a producer can decide to put some of their software into open source but not all of it (open core or freemium model). This model enables a software producer to provide something of value to people who would not have paid for anything anyway (this includes geographies and economies where the producer would not sell anyway). These people are willing participants and contributors in other ways. The producer also gets to sell whatever software products it wants.
For some creative areas, if no-one pays for any content anymore, the creators will disappear eventually, and there will be no more content. But what happens if no-one pays for software anymore?
Proprietary software dies eventually, unless its vendors switch to service models.
The majority of people contributing to open source/free software today are IT developers. There are two main types here: creating/extending/fixing software in the course of getting their project finished, or sponsored contributors. IT is where the majority of software developers are today, so IT/enterprise/business software is safe.
The software that would be at most risk would be software that is created by smaller software companies. Particularly software that has large up-front development costs. Games. The first, and maybe only, software segment to die would be the big-budget, realistic, immersive, loud video games. Who cares most about these games? The same demographic that is stealing all the music.
I say let Generation OMG copy and steal everything they want. All the really cool and fun careers will evaporate. Lots of the stuff they love (movies, music, games) will disappear. After they have spent a decade texting each other about how sucky everything is, they will grow up and have to re-create these industries. Hopefully with better economic structures than the current ones.
I’m at the MongoNYC conference in New York today, where Pentaho is a sponsor. 10gen have done a great job with this event, and they have 1,000 attendees at the event.
We just announced a strategic partnership between 10gen and Pentaho. From a technical perspective the integration between MongoDB and Pentaho means:
- No Big Silos. Data silos are bad. Big ones are no better. Our MongoDB ETL connectors for reading and writing data mean you can integrate your MongoDB data store with the rest of your data architecture (relational databases, hosted applications, custom applications, etc).
- Live reporting. We can provide desktop and web-based reports directly on MongoDB data.
- Staging. We can provide trending and historical analysis by staging snapshots of MongoDB aggregations in a column store.
I’m looking forward to working with 10gen to integrate some of their new aggregation capabilities into Pentaho.
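The staging idea in the list above can be illustrated with a toy sketch. Everything here is hypothetical: `aggregate_by_status` stands in for an aggregation run against the live data store, and `ColumnStore` is a few lines of plain Python rather than a real column store or any Pentaho or MongoDB API. The point is only the pattern: take a dated snapshot of an aggregation at intervals, append it to a separate store, and query that store for trends.

```python
from collections import Counter
from datetime import date


def aggregate_by_status(documents):
    """Stand-in for an aggregation on the live store: count documents per status."""
    return Counter(doc["status"] for doc in documents)


class ColumnStore:
    """Toy column-oriented table: one list per column, rows aligned by index."""

    def __init__(self):
        self.columns = {"snapshot_date": [], "status": [], "count": []}

    def append_snapshot(self, snapshot_date, aggregation):
        """Append one row per aggregated group, stamped with the snapshot date."""
        for status, count in sorted(aggregation.items()):
            self.columns["snapshot_date"].append(snapshot_date)
            self.columns["status"].append(status)
            self.columns["count"].append(count)

    def trend(self, status):
        """Return (snapshot_date, count) pairs for one status, in insert order."""
        rows = zip(
            self.columns["snapshot_date"],
            self.columns["status"],
            self.columns["count"],
        )
        return [(day, count) for day, row_status, count in rows if row_status == status]


# Two hypothetical daily states of the live collection.
day_one = [{"status": "open"}, {"status": "open"}, {"status": "closed"}]
day_two = [{"status": "open"}, {"status": "closed"}, {"status": "closed"}]

store = ColumnStore()
store.append_snapshot(date(2012, 5, 22), aggregate_by_status(day_one))
store.append_snapshot(date(2012, 5, 23), aggregate_by_status(day_two))

print(store.trend("open"))
```

The live store only ever knows its current state; the staged snapshots are what make historical comparison possible.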