Hadoop | James Dixon's Blog

Archive for the ‘Hadoop’ Category

Next Big Thing: The Data Explosion

leave a comment »

These are the main points from The Data Explosion section of my Next Big Thing In Big Data presentation.

The data explosion started in the 1960s
Hard drive sizes double every 2 years
Data expands to fill the space available
Most data types have a natural maximum granularity
We are reaching the natural limit for color, images, sound and dates
Business data also has a natural limit
Organizations are jumping from transactional level data to the natural maximum in one implementation
As an organization implements a big data system, the volume of its stored data “pops”
Few companies are popping so far
The explosion of hype exceeds the data explosion
The server, memory, storage, and processor vendors do not show a data explosion happening

Summary

The underlying trend is an explosion itself
The explosion on the explosion is only minor so far
Popcorn Effect – it will continue to grow

Please comment or ask questions

Written by James

June 23, 2015 at 5:50 pm

Posted in Business Intelligence, commercial open source, Hadoop, Hive, open source, Uncategorized

Pile-On: Dan Woods “Lessons From The First Wave Of Hadoop Adoption”

leave a comment »

Dan Woods put out a nice piece yesterday on his Forbes blog titled “Lessons From The First Wave Of Hadoop Adoption“.

I agree with him that the insights and advantages of Big Data solutions need to be described in ways other than technology. I’m going to add on to his insights.

1. It’s about more than big data. It’s a new platform.

Yes, it is a new platform. That means it’s different than the old ones. The fact that you can do some things cheaper than you could before is not the main idea. A bigger story is that some things that were economically not possible before, now are. But the main idea is that this is a new platform, with new capabilities, that needs to fit into your existing data architecture.

2. Don’t get rid of your data warehouse

I completely agree. Big Data technology is a new tool with new characteristics. Using it to replace a Data Warehouse technology that is finely tuned for that use case is not a great idea. Don’t listen to the “Hadoop will replace every database within x years” crowd. No database has managed to replace every database. No database ever will because the variety of the use cases is too large.

3. Think about your data supply chain

Since a Big Data system needs to fit in with everything you currently have and operate, integration is a significant priority. Understand that with Big Data you can build a Big Silo, but a Big Silo is as bad as a small silo (just a lot bigger). You should not be required to pump all your data from every system into Hadoop to get value from it. Design you data architecture carefully, the implications and fallout of getting it right or wrong are significant.

4. It’s complicated

Yes it is. It’s also not cheap to do it well. Sure you can download a lot of open source software and prototype or prove your ideas without a lot of upfront outlay. But putting it into production is a production. Expect that.

Written by James

January 27, 2015 at 5:01 pm

Posted in Business Intelligence, Datawarehousing, Hadoop, open source

Union of the State – A Data Lake Use Case

with 6 comments

Many business applications are essentially workflow applications or state machines. This includes CRM systems, ERP systems, asset tracking, case tracking, call center, and some financial systems. The real-world entities (employees, customers, devices, accounts, orders etc.) represented in these systems are stored as a collection of attributes that define their current state. Examples of these attributes include someone’s current address or number of dependents, an account’s current balance, who is in possession of laptop X, which documents for a loan approval have been provided, and the date of Fluffy’s last Feline Distemper vaccination.

–

State machines are very good at answering questions about the state of things. They are, after all, machines that handle state. But what about reporting on trends and changes over the short and long term? How do we do this? The answer for this is to track changes to the attributes in change logs. These change logs are database tables or text files that list the changes made over time. That way you can (although the data transformation is ugly) rewind the change log of a specific field across all objects in the system and then aggregate those changes to get a view over time. This is not easy to do and assumes that you have a change log. Typically, change logs only exist for the main fields in an application. There might only be change logs on 10-20% of the fields. So if you suddenly have an impulse so see how a lesser attribute has changed over time you are out of luck. It is impossible because that information is lost.

–

This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a data mart. This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a Data Lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the Data Lake for new questions.

–

A Data Lake can also be used to solve the problems of history and trending for workflow applications and state machines. What if these applications write their initial state into the Data Lake and then also write the change of every attribute in there as well? While we are at it, let’s log all the application events coming from the user interface tier as well. From the application’s perspective this is a low-latency fire and forget scenario.

–

Now we have the initial state of the application’s data and the changes to of all of the attributes, not just the main/traditional fields. We can apply this approach to more than one application, each with its own Data Lake of state logs, storing every incremental change and event. So now we have the state of every field of (potentially) every business application in an enterprise across time. We have the “Union of the State”.

–

With this data we have the ability to rewind the Union of the State to any point in time. What are the potential use cases for the Union of the State?

–

Enterprise Time Machine

Suppose something happened a few weeks ago. Decisions were made. Things changed. But exactly what, when, and why? With an Enterprise Time Machine you can rewind the complete state of every major application to any point in time and then step forward event by event, click by click, change by change, at the millisecond level if things happened that quickly. For an e-commerce vendor this means being able to know for any specified millisecond in the past how many shopping carts where open, what was in them, which transactions were pending, which items were being boxed, or in transit, what was being returned, who was working, how many customer support calls were queued and how many were in progress. In different domains such as financial services or healthcare, the applications and attributes are different but the ability is the same.

–

In order to reconstruct the state at any point in time we need to load the initial snapshot into a repository and then update the attributes of each object as we process the logs, event by event, until we get to the point in time that we are interested in. A NoSQL store such as MongoDB , HBase, or Cassandra should work well as the repository. This process could be optimized by adding regular snapshots of the whole state into the Data Lake so that we don’t have to process from the very beginning every time. For a detailed analysis you could rebuild the state to a particular point in time and then process forwards in increments of any size. This way the situation of a device failure that led to a catastrophic cascade of events can be re-created and examined millisecond by millisecond.

–

Trending

Since we can re-create the state at any point in time we can do trending and historical analysis of any and every attribute over any time period, at any time granularity we want.

–

Compliance

When user interface events are logged as well as the attribute changes you have the ability to know not only who changed what information, but also who looked at it. Who was aware of the situation? Why did Bob open a particular record every few hours and cancel out without making changes? This requires the History Machine described above.

–

Predictive

One of the main tasks in a predictive exercise is to work out which attributes are predictive of your target variable and which ones are not. This can be impossible to do when you only have 10% of your attributes logged. Maybe the minor attributes are the predictive ones. Now you have all of them. This requires the trending facility described above.

–

Doug Moran, a co-founder of Pentaho and product manager for its Big Data products, sees many predictive applications for this kind of data. This includes the ability to derive a model from replays of previous events and use it to prescribe ways to influence the current situation to increase the likelihood of a desired outcome. For example, this could include replaying all previous shopping cart events for a user currently on an e-commerce site to derive a predictive model that prescribes a way to influence their current purchase in a positive way.

–

“Dixon’s Union of the State idea gives the Data Lake idea a positive mission besides storing more data for less money,”

said Dan Woods, an IT Consultant to buyers and vendors and CEO of Evolved Media, who has written about the Data Lake for several years.

“Providing the equivalent of a rewind, pause, forward remote control on the state of your business makes it affordable to answer many questions that are currently too expensive to tackle. Remember, you don’t have to implement this vision for all data for it to provide a new platform to answer difficult questions with minimal effort.”

–

Architecture

How could this be done?

Let the application store it’s current state in a relational or No-SQL repository. Don’t affect the operation of the operational system.
Log all events and state changes that occur within the application. This is the tricky part unless it is an in-house application. It would be best if these events and state changes were logged in real time, but this is sometimes not ideal. Maybe SalesForces or SugarCRM will offer this level of logging as a feature. Dump this data into a Data Lake using a suitable storage and processing technology such as Hadoop.
Provide the ability to rewind the state of any and all attributes by parallel processing of the logs.
Provide the facilities listed above using technologies appropriate of each use case (using the rewind capability).

–

The plumbing and architecture for this is not simple and Dan Woods points out that there are databases like Datomic that provide capabilities for storing and querying state over time. But a solution based on a Data Lake has the same price, scalability, and architectural attributes as other big data systems.

Written by James

January 22, 2015 at 4:43 am

Posted in Business Intelligence, Datawarehousing, Hadoop, open source, Uncategorized

Pentaho’s Big Data Release

leave a comment »

This week at Pentaho we announced a major Big Data release, including:

Open sourcing of our of big data code
Moving Pentaho Data Integration to the Apache license
Support for Hbase, Cassandra, MongoDB, Hadapt
And numerous functionality and performance improvements

What does this mean for the Big Data market, for Pentaho, and for everyone else?

We believe you should use the best tool for each job. For example you should use Hadoop or a NoSQL database where those technologies suit your purposes, and use a high performance columnar database for the use cases they are suited to. Your organization probably has applications that use traditional databases, and likely has a hosted application or two as well. Like it or not, if you have a single employee that has a spreadsheet on their laptop, you have a data architecture that includes flat files. So every data architecture is a hybrid environment to some extent. To solve the requirements of your business, your IT group probably has to move/merge/transform data between these data stores. You may have an application or two that has no external inputs or outputs, and no integration points with other applications. There is a word for these applications – silos. Silos are bad. Big data is no different. A big data store that is not integrated with your data architecture is a Big Silo. Big Silos are just as bad as regular silos, only bigger.

So when you add a big data technology to your organization, you don’t want it to be a silo. The big data capabilities of Pentaho Data Integration enable you to integrate your big data store into the rest of your data architecture. If you are using any of the big data technologies we support you can move data into, and out of these data stores using a graphical environment. Our data integration capabilities also extend to traditional databases, columnar databases, flat files, web services, hosted applications and more. So you can easily integrate your big data application into the rest of your data architecture. This means your big data store is not a silo.

For Pentaho, the big data arena is a strategic one. These are new technologies and architectures so all the players in this space are starting from the same place. It is a great space for us because people using these technologies need tools and capabilities that are easy for us to deliver. Hadoop is especially cool because all of our tools and technologies are pure Java and are embeddable, so we can execute our engines within the data nodes and scale linearly as your data grows.

For everyone else our tools continue to provide great bang for the buck for ETL, reporting, OLAP, predictive analytics etc. Now we also lower the cost, time, and skills sets required to investigate big data solutions. For any one application you can divide the data architecture into two main segments: client data and server data. Client data includes things like flat files, mobile app data, cookie data etc. Server data includes transactional/traditional databases and big data stores. I don’t see the server-side as all or nothing. It could be all RDBMS, all big data store, 50/50, or any mix of the two. It’s like milk and coffee. You can have a glass of milk, a cup of coffee, or variations in between with different amounts of milk or coffee. So you can consider an application that only uses a traditional database today to be an application that currently utilizes 0% of its potential big data component. So every data architecture exists on this continuum, and we have great tools to help you if you want to step into the big data world.

If you want to find out more:

Visit http://community.pentaho.com/BigData which has downloads, how-tos, and other resouces
Connect with the community on irc.freenode.net ##pentaho;
Join the Pentaho Big Data technical developer mailing list to be notified about future big data product updates and related events.
Attend the techcast on Thursday February 9th to learn more about Pentaho Kettle for Big Data, watch a live demo and hear how you can get involved. Register now at http://www.pentaho.com/resources/events/20120209-pentaho-kettle-webinar/
Hands-on training FREE for attendees at the 2012 Strata Conference in Santa Clara, California. Sign-up for our how-to training session (http://strataconf.com/strata2012) on February 28th during the ‘Tuesday Tutorials.’ Register with Pentaho’s 20 percent discount code: str12sd20 <https://en.oreilly.com/strata2012/public/register> .

Written by James

February 2, 2012 at 9:56 pm

Posted in Business Intelligence, Datawarehousing, Hadoop, Hive, open source

Tagged with Cassandra, Hadoop, HBase, Hive, MongoDB, Pentaho

More Hadoop in New York City

leave a comment »

Yesterday was fun. First I met with a potential customer looking to try Hadoop for a big data project.

Then I had a lengthy and interesting chat with Dan Woods. Amongst other things Dan runs http://www.citoresearch.com/ and also blogs for Forbes. We talked about Pentaho’s history and our experiences so far with the commercial open source model. We also talked about Hadoop and big data and about the vision and roadmap of our Agile BI offering.

Next I met with Steve Lohr who is a technology reporter for the New York Times. We talked about many topics including the enterprise software markets and how open source is affecting them. We also talked about Hadoop, of course.

Next was a co-meet-up of the New York Predictive Analytics and No-SQL groups where I presented decks about Weka and Hadoop, separately and together. There were lots of interesting questions and side discussions earlier. By the time we finished all these topics a blizzard was going on out side. Cabs were nowhere to be seen so Matt Gershoff of Conductrics was kind enough to lead me via the subway to the vicinity of my hotel.

Written by James

January 27, 2011 at 4:01 pm

Posted in Agile BI, Business Intelligence, commercial open source, Hadoop, Hive

Big Data in New York City

leave a comment »

I’m having an interesting time in NYC this week. I had to retrieve my snowboarding jacket out of the attic for this trip. It’s snowing right now, which is better than the sleet forecast for later. So far I’ve met with a few Big Data customers and prospects and presented at the New York Hadoop User Group. Our hybrid database/Hadoop data lake architecture always gets a good reception and our ability to run our data integration engine within the Hadoop data nodes impresses people.

Being the first Business Intelligence vendor to bring reporting and ETL to the Hadoop space sets us apart from all the other vendors. We have so much recognition in this space that I’ve spoken to a few people in the last month who thought we were ‘THE’ visualization and data transformation provider for Hadoop and didn’t connect to other data sources.

This afternoon I’m meeting with reporters and columnists from a couple of different publications to chat about Big Data / Hadoop stuff. Tonight I’m presenting at the New York Predictive Analytics Meetup to talk about Hadoop from an analysis perspective.

Written by James

January 26, 2011 at 5:20 pm

Posted in Agile BI, Business Intelligence, commercial open source, Hadoop, open source

Meetups and Pentaho Summit(s) coming up in January

with one comment

It’s going to be a busy month.

January 19th and 20th is our Global 2011 Summit in San Francisco. I have three sessions Pentaho for Hadoop, Extending Pentaho’s Capabilities, and an Architecture Overview. So I’m creating and digging up some new sample plug-ins and extensions. I’m also going to take part in an Q&A session with the Penaho architects since Julian Hyde (Mondrian), Matt Casters (PDI/Kettle), Thomas Morgner (Pentaho Reporting) will all be there. Who should attend?

CTOs, architects, product managers, business executives and partner-facing staff from System Integrators and Resellers, as well as Software Providers with a need to embed business intelligence or data integration software into your products.

We usually have customers and prospects attending our summits as well.

We are also having an architect’s summit that same week to work on our 2011 technology road-map. That should be a lot of fun.

The week after that I’ll be in New York presenting at the NYC Hadoop User Group on Tuesday, January 25 and the NYC Predictive Analytics Meetup on Wednesday January 26th.

Written by James

January 5, 2011 at 9:20 pm

Posted in Agile BI, Business Intelligence, commercial open source, Hadoop, Hive, open source

Pentaho, Hadoop, and Data Lakes

with 15 comments

Earlier this week, at Hadoop World in New York, Pentaho announced availability of our first Hadoop release.

As part of the initial research into the Hadoop arena I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

80-90% of companies are dealing with structured or semi-structured data (not unstructured).
The source of the data is typically a single application or system.
The data is typically sub-transactional or non-transactional.
There are some known questions to ask of the data.
There are many unknown questions that will arise in the future.
There are multiple user communities that have questions of the data.
The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past the standard way to handle reporting and analysis of this data was to identify the most interesting attributes, and to aggregate these into a data mart. There are several problems with this approach:

Only a subset of the attributes are examined, so only pre-determined questions can be answered.
The data is aggregated so visibility into the lowest levels is lost

Based on the requirements above and the problems of the traditional solutions we have created a concept called the Data Lake to describe an optimal solution.

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture

Written by James

October 14, 2010 at 4:06 pm

Posted in Business Intelligence, Datawarehousing, Hadoop, open source

Pentaho & Hadoop Webinar Tomorrow (June 9th)

with one comment

I’ll be presenting an informal webinar tomorrow to the Pentaho community about Pentaho’s Hadoop initiative.

I’m planning to cover:

Use cases for Hadoop and what needs Pentaho can solve
The risks and costs of Hadoop implementations
Architecture of Hadoop/Hive with and without Pentaho
Pentaho’s Hadoop roadmap
Demo of selected integration points
Cover some FAQs

When:

June 9, 2010 10:00 am, Eastern Daylight Time (New York, GMT-04:00)
June 9, 2010 2:00 pm, (Reykjavik, GMT)
June 9, 2010 4:00 pm, Europe Summer Time (Paris, GMT+02:00)
June 10, 2010 12:00 am, Australia Eastern Standard Time (Sydney, GMT+10:00)

Where:

http://wiki.pentaho.com/display/COM/Community+Technical+WebEx

Written by James

June 8, 2010 at 6:33 pm

Posted in Business Intelligence, Hadoop, Hive, open source

Pentaho and IBM Hadoop Announcements

with 5 comments

Last week, on the same day, both Pentaho and IBM made announcements about Hadoop support. There are several interesting things about this:

IBM’s announcement is a validation of Hadoop’s functionality, scalability and maturity. Good news.
Hadoop, being Java, will run on AIX, and on IBM hardware. In fact, Hadoop hurts the big iron vendors. Hadoop also, to some extent competes with IBM’s existing database offerings. But their announcement was made by their professional services group, not by their hardware or AIX groups. For IBM this is a services play.
IBM announced their own distro of Hadoop. This requires a significant development, packaging, testing, and support investment for IBM. They are going ‘all in’, to use a poker term. The exact motivation behind this has yet to be revealed. They are offering their own tools and extensions to Hadoop, which is fair enough, but this is possible without providing their own full distro. Only time will show how they are maintaining their internal fork or branch of Hadoop and whether any generic code contributions make it out of Big Blue into the Hadoop projects.
IBM is making a play for Big Data, which, in conjunction with their cloud/grid initiatives, makes perfect sense. When it comes to cloud computing, the cost of renting hardware is gradually converging with the price of electricity. But with the rise of the cloud, an existing problem is compounded. Web-based applications generate a wealth of event-based data. This data is hard enough to analyze when you have it on-premise, and it quickly eclipses the size of the transactional data. When this data is generated in a cloud environment, the problem is worse: you don’t even have the data locally, and moving it will cost you. IBM is attempting a land-grab: cloud + Hadoop + IBM services (with or without IBM hardware, OS, and databases). They are recognizing the fact that running apps in the cloud and storing data in the cloud are easy: but analyzing that data is harder and therefore more valuable.

Pentaho’s announcement, was similar in some ways, different in others:

Like IBM, we recognize the needs and opportunities.
Technology-wise, Pentaho has a suite of tools, engines and products that are a much better suited for Hadoop integration, being pure Java and designed to be embedded
Pentaho has no plans to release our own distro of Hadoop. Any changes we make to Hadoop, Hive etc will be contributed to Apache
And lastly, but no less importantly, Pentaho announced first. 😉

When it comes to other players:

Microsoft is apparently making Hadoop ready for Azure, but is Hadoop currently is not recommended for production use on Windows. It will be interesting to see how these facts resolve themselves.
Oracle/Sun has the ability to read from the Hadoop file system and has a proprietary Map/Reduce capability, but no compelling Hadoop support yet. In direct conflict with the scale-out mentality of Hadoop, in a recent Wired interview Larry Ellison talked about Oracle’s new hardware

The machine costs more than $1 million, stands over 6 feet tall, is two feet wide and weighs a full ton. It is capable of storing vast quantities of data, allowing businesses to analyze information at lightening fast speeds or instantly process commercial transactions.

HP, Dell etc are probably picking up some business providing the commodity hardware for Hadoop installations, but don’t yet have a discernible vision.

Interesting times…

Written by James

May 27, 2010 at 3:33 am

Posted in Business Intelligence, commercial open source, Datawarehousing, Hadoop, Hive, open source

James Dixon’s Blog

Archive for the ‘Hadoop’ Category

Next Big Thing: The Data Explosion

Pile-On: Dan Woods “Lessons From The First Wave Of Hadoop Adoption”

1. It’s about more than big data. It’s a new platform.

2. Don’t get rid of your data warehouse

3. Think about your data supply chain

4. It’s complicated

Union of the State – A Data Lake Use Case

Pentaho’s Big Data Release

More Hadoop in New York City

Big Data in New York City

Meetups and Pentaho Summit(s) coming up in January

Pentaho, Hadoop, and Data Lakes

Pentaho & Hadoop Webinar Tomorrow (June 9th)

Pentaho and IBM Hadoop Announcements

Twitters

My Other Stuff

Pages

Archives

Meta

Categories