Dan Woods put out a nice piece yesterday on his Forbes blog titled "Lessons From The First Wave Of Hadoop Adoption".
I agree with him that the insights and advantages of Big Data solutions need to be described in ways other than technology. I’m going to add on to his insights.
1. It’s about more than big data. It’s a new platform.
Yes, it is a new platform. That means it’s different than the old ones. The fact that you can do some things cheaper than you could before is not the main idea. A bigger story is that some things that were economically not possible before, now are. But the main idea is that this is a new platform, with new capabilities, that needs to fit into your existing data architecture.
2. Don’t get rid of your data warehouse
I completely agree. Big Data technology is a new tool with new characteristics. Using it to replace a Data Warehouse technology that is finely tuned for that use case is not a great idea. Don’t listen to the “Hadoop will replace every database within x years” crowd. No database has managed to replace every database. No database ever will because the variety of the use cases is too large.
3. Think about your data supply chain
Since a Big Data system needs to fit in with everything you currently have and operate, integration is a significant priority. Understand that with Big Data you can build a Big Silo, but a Big Silo is as bad as a small silo (just a lot bigger). You should not be required to pump all your data from every system into Hadoop to get value from it. Design your data architecture carefully; the implications and fallout of getting it right or wrong are significant.
4. It’s complicated
Yes it is. It’s also not cheap to do it well. Sure you can download a lot of open source software and prototype or prove your ideas without a lot of upfront outlay. But putting it into production is a production. Expect that.
“Dixon’s Union of the State idea gives the Data Lake idea a positive mission besides storing more data for less money,”
“Providing the equivalent of a rewind, pause, forward remote control on the state of your business makes it affordable to answer many questions that are currently too expensive to tackle. Remember, you don’t have to implement this vision for all data for it to provide a new platform to answer difficult questions with minimal effort.”
- Let the application store its current state in a relational or No-SQL repository. Don’t affect the operation of the operational system.
- Log all events and state changes that occur within the application. This is the tricky part unless it is an in-house application. Ideally these events and state changes would be logged in real time, but that is not always practical. Maybe Salesforce or SugarCRM will offer this level of logging as a feature. Dump this data into a Data Lake using a suitable storage and processing technology such as Hadoop.
- Provide the ability to rewind the state of any and all attributes by parallel processing of the logs.
- Provide the facilities listed above using technologies appropriate to each use case (using the rewind capability).
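The rewind idea in the steps above can be sketched in a few lines. This is a minimal, in-memory illustration: the event fields and the `state_at` helper are my own invention, not part of any product, and in a real Data Lake the log would live in Hadoop and be replayed in parallel rather than in a single loop.

```python
from datetime import datetime

# Each event records one state change: (timestamp, entity, attribute, new value).
event_log = [
    (datetime(2014, 1, 1), "cust-42", "status", "prospect"),
    (datetime(2014, 3, 5), "cust-42", "status", "customer"),
    (datetime(2014, 6, 9), "cust-42", "region", "EMEA"),
    (datetime(2014, 9, 2), "cust-42", "status", "churned"),
]

def state_at(log, entity, as_of):
    """Rewind: rebuild an entity's attributes as they stood at a point in time."""
    state = {}
    for ts, ent, attr, value in sorted(log):
        if ent == entity and ts <= as_of:
            state[attr] = value  # later events overwrite earlier ones
    return state

# The customer's state in mid-2014, before the churn event was logged:
print(state_at(event_log, "cust-42", datetime(2014, 7, 1)))
# {'status': 'customer', 'region': 'EMEA'}
```

Pause and fast-forward fall out of the same mechanism: they are just `state_at` evaluated at different points in time, which is why keeping every event (rather than only current state) is what makes the remote control possible.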
The plumbing and architecture for this is not simple and Dan Woods points out that there are databases like Datomic that provide capabilities for storing and querying state over time. But a solution based on a Data Lake has the same price, scalability, and architectural attributes as other big data systems.
These are my thoughts on a recent Dan Woods (Forbes) post titled "Will Companies Ever Move Their Crown Jewels to Amazon Web Services?".
My short answer is yes, because otherwise Jeff Bezos (the founder of Amazon) has failed. As of today Bezos is worth $28.8 bn and is #16 on Forbes’ list of powerful people. I’m guessing he’s the kind of guy who doesn’t like to fail.
Jeff Bezos explains his vision in this 10-year-old TED talk: https://www.ted.com/talks/jeff_bezos_on_the_next_web_innovation
He spends the first 7 minutes comparing the Internet bubble to the California gold rush and then moves on to an analogy comparing the internet today with the electricity industry 100 years ago.
I admire the time and energy he spends on his analogies. He looked into different ones and compared them to find the best one. Good analogies are hard to find. The best ones sound obvious when you hear them but can be hard to find. The Beekeeper analogy for open source software took me months of iterations based on years of experience to come up with. It sounds fairly obvious when you hear it, but there was no analogy before it to help understand the model.
If an analogy is good enough, it will allow you to infer additional knowledge. If you follow Bezos’ electricity analogy and look at history, you can draw additional insights. Looking at the history of electricity adoption, we can draw inferences about the adoption of cloud computing (with some generalizations):
- Before the introduction of electricity supply as a commodity service, any large company needing electricity had its own electricity generators.
- Before the introduction of cloud-based computing with utility pricing, any large company needing computing had its own data center.
- Who were the first people to join the electricity grid? Small companies and residences without prior electrical supply.
- Who were the first people to use cloud computing? Small companies and individuals without data centers.
- Who were the last people to join the electricity grid? Large companies with their own power sources.
- Who will be the last people to migrate to cloud computing? Large companies with their own data centers.
Looking in the press you can see that the majority of the anti-cloud talk comes from larger enterprises.
However, most companies that have started in the last five years are adopting cloud-based infrastructures. As a start-up you typically have desktop-based applications for accounting, HR, CRM etc. As you grow it makes sense to move to hosted solutions like NetSuite, Salesforce, SugarCRM etc. As you add more and more hosted solutions, the cost and headache of installing and maintaining on-premise solutions looks less and less attractive.
So today’s generation of small companies, which will become large companies in the future, have four classes of applications:
- Domain-specific desktop applications
- Generic applications with small scale usage (e.g. project planning)
- Generic applications that will grow and become cloud based (payroll, CRM, or accounting in a small company)
- Cloud-based applications
If companies, as Dan suggests, are using services other than Amazon for critical applications, then Amazon is failing in its mission due to operational issues. Jeff Bezos is not likely to let that continue for long.
It seems it’s time to revisit the Data Lake after 4 years. Here’s my original post on it and a couple of video presentations.
There are lots of people using the term these days and some variety in their definitions and the stories they are telling:
I give credit to Dan Woods at Forbes for being the first to pick up on the idea http://www.forbes.com/sites/ciocentral/2011/07/21/big-data-requires-a-big-new-architecture
What I’d like to address today is (somewhat negative) commentary by Barry Devlin at TechTarget (http://searchbusinessanalytics.techtarget.com/feature/Data-lake-muddies-the-waters-on-big-data-management) and Andrew White and Nick Heudecker at Gartner (http://www.gartner.com/newsroom/id/2809117). In both these cases the statements they make are not wrong, yet not really right. Let’s take a little history tour, using some YouTube videos from 2010, to discover why. I call out the main points below.
Pentaho Hadoop Series Part 1: Big Data Architecture
- 3:00. A data lake consists of a single source of data. Not distilled (pre-aggregated).
- 3:25. Most companies only have one source of data that meets the criteria.
- 4:30. You store all the data because you don’t know in advance all the questions that you will need to ask of it.
- 6:00. The problem with data marts and data warehouses is that the pre-aggregation limits the questions that can be asked.
- 6:45. By using a data lake, the institutional data marts and data warehouses can be populated with feeds of aggregations from the data lake, but ad-hoc questions can also be answered.
- 8:00. A data lake does not replace a database, data mart, or data warehouse. At least not yet, and certainly not in 2010.
Summary: A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse.
Pentaho Hadoop Series Part 5: Big Data and Data Warehouses
- 0:15. Can you use a Big Data solution as a data warehouse? Yes.
- 0:22. Should you? Probably not.
- 0:30. A large amount of data from one or two systems is not a data warehouse – it’s a big data mart at best.
- 1:30. The difference between a data warehouse and a data lake.
- 4:15. What if you really, really want to use a big data system for your data warehouse? Then you have a water-garden that is populated from data lakes.
Summary: A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.
I chose the term “Data Lake” carefully and paid attention to the analogy and the metaphor. But today some of the people using the term are not applying the same care or attention.
Barry Devlin answers a self-imposed question “What is a data lake?” by stating:
“In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications.”
As is clear from the videos above, that was not the original definition of a Data Lake. Not at all. He’s talking about a Water Garden, which is significantly different. I agree with Devlin that the idea of putting all enterprise data into Hadoop (or any other data store) is not a viable option (at least right now). You should use the best tool for the job. Use a transactional database for transactional purposes. Use an analytic database for analytic purposes. Use Hadoop or MongoDB when they are the best tool for the situation. For the foreseeable future the IT environment is, and will be, a hybrid one with many different data stores.
Devlin objects to the term “Data Lake”; I object to his definition of it, which is incorrect. The underlying issue is that people are using the term inappropriately and inaccurately. More on that later.
I also have some issues with Gartner’s take on Data Lakes (http://www.gartner.com/newsroom/id/2809117).
Their report makes statements like:
“By its definition, a data lake accepts any data, without oversight or governance.”
Who says there is no oversight? By its (original) definition it only accepts data from a single source so “any” is clearly wrong.
“The fundamental issue with the data lake is that it makes certain assumptions about the users of information”.
Who says it makes assumptions? How can a collection of data make assumptions? This makes no sense.
“And without metadata, every subsequent use of data means analysts start from scratch”.
Who says there is no metadata? Now who’s making assumptions? In all cases, Gartner is making these statements only so that they can immediately refute them. Why not state that “Data lakes are pink with purple spots”, and then follow it up with the observation that color makes no sense in this context? Somewhere in all of this the main point has been lost. You store the raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.
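The point about granularity can be made concrete with a toy example (the record fields here are invented for illustration): because the raw grain was never thrown away, any roll-up can be decided at query time, whereas a pre-aggregated mart can only answer the questions it was built for.

```python
from collections import defaultdict

# Raw, granular records as a Data Lake would store them.
orders = [
    {"city": "Boston", "product": "trains", "year": 2010, "amount": 120.0},
    {"city": "Boston", "product": "cars",   "year": 2011, "amount": 340.0},
    {"city": "Lyon",   "product": "trains", "year": 2010, "amount": 80.0},
    {"city": "Lyon",   "product": "trains", "year": 2011, "amount": 60.0},
]

def aggregate(rows, key_fields, value_field="amount"):
    """Ad-hoc roll-up by any combination of fields - chosen at query time."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        totals[key] += row[value_field]
    return dict(totals)

# Any grouping is possible because the grain survived:
print(aggregate(orders, ["city"]))
# {('Boston',): 460.0, ('Lyon',): 140.0}
print(aggregate(orders, ["product", "year"]))  # a new question, same raw data
```

A mart that had pre-aggregated by city could never answer the product-by-year question; the lake answers both from the same raw records.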
So, some people are misusing the term and applying it to things that perhaps make little sense as a production architecture. Oh well. The majority of people using the term “Data Warehouse” at Big Data conferences are misusing that term too. They are referring to (at best) a large Data Mart, or (at worst) just a really large flat file, and in most cases not a real Data Warehouse. Confusing, yes. Annoying, yes. Worth spending time and energy on? No, not really.
Barry Devlin is welcome to fight a battle against the term “Data Lake”. Good luck to him. But if he doesn’t like it he should come up with a better idea.
If we’re going to fight pointless uphill battles against terminology, I wouldn’t pick this one, even though it’s my term getting misused. These are the top 3 terms on my hit-list.
- “Big Data”. The median Hadoop deployment is 10 nodes with an amount of data that qualifies as “small” even by 1990s Data Warehousing standards. Many people throw the NoSQL data stores into the Big Data bucket, and their median deployments are even smaller. I think “Scalable” technology is much more interesting than “Big” technology. Oracle’s Exadata box is “Big”, but not suitable for most solutions as it does not scale down. Some technologies don’t scale up. A technology that can cost-effectively scale, and maintain constant performance, with data volumes varying from small to massive is a great technology because you will never have to migrate. I agree that vast volumes and velocities bring interesting problems to light, but these are edge cases. Let’s focus on scalable technology, otherwise we are going to end up copying video standards that have gone from VGA to WHUXGA (Wide-Hex-Ultra-Extended-VGA). I’m not making that up, that’s a real video resolution. Unless we are careful, we are going to go from Big Data, to Super-Big Data, to Extended-Super-Big Data, to Quad-Super-Extended-Big Data, to Wide-Hex-Ultra-Extended-Big Data. Give me one thing that goes from Tiny-Bit-Mini-Reduced-Small Data to Wide-Hex-Ultra-Extended-Big Data and call it “Scalable Data” please.
- NoSQL. How come every major NoSQL data store has added (or is adding) some kind of structured query language? With JDBC/ODBC drivers on the way? Why? Because SQL was never the problem. A data store without query capabilities? Is the data there or not? Probably eventually. It’s “Schrödinger’s Commit” (a little geek humor for CAP theorists). If SQL was not the problem, what was? Schemas and performance were the real problems, and SQL was the scapegoat. But when we add SQL to NoSQL, do those negate each other? Are we left with a null? I propose we retroactively call these data stores “NoSchema” or “Technologies Now with SQL but Formerly Known as NoSQL”, or “Recently SQL’ed”, or maybe “A Little Late to the SQL Party But We Love Them Anyway”.
- “Free as in Speech vs. Free as in Beer”. Richard Stallman has had some great ideas. Using this phrase to describe free software was not one of them. Where is all this free beer he speaks of? Beer takes money and time (and time = money) to produce. “Free speech” is something not uniformly available in all countries, and thus not easy to translate. However, everyone understands freedom or liberty. Many people are confused by this phrase. So I (pointlessly) suggest we use this instead: free as in gratis vs. free as in liberty.
I have been playing around with Prezi (http://prezi.com/), the online presentation tool. It’s a cool thing that lets you create presentations that are visually different from PowerPoint/Keynote. Like all these tools it will let you create bad presentations very quickly. If you want to create something compelling and appealing it will take planning and thought. Looking at the presentations on their site, most people go way overboard on the zooming and rotation and the result is confusing and disorienting.
A cool tool, but the design environment is very constraining and frustrating. Great for educational and light usage but not really suitable for large-scale or everyday scenarios.
- Zoom/Pan/Rotate: These give a new alternative to the standard Powerpoint feel.
- Parallax Background: A nice effect. They call it 3D but it’s really just parallax of the background image.
- Simple to Use: It’s easy to create simple presentations.
- Cost: If you don’t mind all your presentations being public you can use the free version. The paid versions are not cheap in the long run.
- Still Linear: They say it is non-linear and 3D but it is not. It is a zoomable 2D canvas that can only have one path through the presentation, with no branches or loops. It’s a linear flow through a 2D space. You can jump to different parts of the path if you can see them on the screen, but constructing a truly non-linear path is clunky.
- Basic Editing: No ability to directly set the size or rotation of objects using a properties editor. It’s hard to get objects exactly the same size and shape and rotation. It has a small color palette with no color gradients.
- Z-Ordering: Prezi supports z-ordering but there is no way to control it. If you want to change the z-order of an object you have to copy/delete/paste the object and any objects that overlap it until you have the right ordering.
- Text Editing: Text controls are too rudimentary for a presentation tool. Text has a very small color palette with no way to set RGB values except at the theme level – you have to go to CSS editing to get better control. There is no way to stretch text except proportionally. To get good control over text you need to create text objects in Inkscape or Illustrator and import them into Prezi.
- Designer: Selecting objects and basic navigating can be extremely frustrating. A toolbar for basic operations would be really helpful.
- Animations/Build: Only one – build (appear). No build-outs. If you want to combine builds with overlapping frames some things are hard/impossible as you cannot control which frames “own” which objects.
- Transitions: Only three – slide, rotate, zoom. You cannot choose which; Prezi chooses for you based on the arrangement of your slides. There is no ability to set the speed of a transition; you have to rely on the side effects of the slide arrangements. In auto-play mode you can only set the same timing for every transition in the deck, not for each individual transition.
- Viewer controls: On a computer you use the left/right arrow keys to move, and once you get to the end you can right-click/control-click to rewind. On a tablet you cannot swipe, you click on the left/right edges which means you cannot put click-able objects close to the edges (to jump to other parts of the path). Also on tablets there is no rewind option, making it awkward for demo/booth usage.
- No Save-As: Seriously? If you are about to embark on major modifications to a presentation you have no way to manage backups. You have to download a copy locally, modify it, and push it back to the server, overwriting the original.
- Vector Graphic Support: For a graphical tool that supports zooming and rotation, vector graphics are important. The only import vector formats Prezi supports are PDF and SWF – you cannot use SVG or AI or EPS. Using vector graphics that include transparency is really hard.
- No Ability to Turn Off Transitions: Online meeting tools like Webex, NetMeeting, etc. will have major problems with the transition animation. There is no way to remove transitions and just jump between the frames.
Today we are launching our 12 Days of Visualizations program: http://events.pentaho.com/12days-of-Big-Data-Visualizations.html
We are going to release a few new visualizations every week over the holiday period. You can drop these visualizations into a Pentaho BA server and they will appear on the charting menu in Analyzer.
The first one that we are releasing is a Sun Burst chart. This chart is based on the Protovis sun burst chart – http://mbostock.github.com/protovis/ex/sunburst.html
The Sun Burst chart we created can be used in a couple of ways. Firstly, it can be used as a multi-level pie chart. This sun burst shows how the sales in three territories break down into sales of product lines within those territories, and then how product line sales compare by year:
This effect is achieved by using a color gradient for the outer ring that is based on the chart palette color of the inner rings, and by sorting the segments in each ring into descending order. When you compare the sun burst above with the pie chart below, you can see how much more information the sun burst provides.
You can choose to use a common color gradient on the outer ring so that it is easier to compare the items on that ring. In this example a blue gradient has been used for the outer ring. Regardless of which territory a city is in, the shade of blue it is colored in can be used to compare it with other cities.
In this chart a red/yellow/green gradient has been used. Here the levels of the chart are year, quarter, and month so the data has not been sorted. The data for this chart is overtime costs so the gradient has been reversed to show larger overtime costs in red, and smaller ones in green.
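The value-to-shade mapping described above can be sketched in a few lines. This is my own illustration of the general technique (linear interpolation between palette stops), not Pentaho's actual implementation; the color endpoints are invented, and the red/yellow/green gradient is collapsed to two stops for brevity.

```python
def lerp_color(c1, c2, t):
    """Linearly interpolate between two RGB colors; t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))

def shade_for(value, lo, hi, start, end):
    """Map a measure onto a gradient between two palette colors."""
    t = (value - lo) / (hi - lo) if hi != lo else 0.0
    return lerp_color(start, end, t)

# Reversed gradient: small overtime costs shade green, large ones shade red.
GREEN, RED = (0, 153, 0), (204, 0, 0)

print(shade_for(5, 0, 100, GREEN, RED))   # near the green end
print(shade_for(95, 0, 100, GREEN, RED))  # near the red end
```

The per-territory variant works the same way, except the `start` color is taken from the inner ring's palette entry so each territory's outer segments shade within their parent's hue.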
You can find out more about this chart here: http://wiki.pentaho.com/display/COM/Sunburst