James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Archive for May 27th, 2010

Pentaho and IBM Hadoop Announcements

with 5 comments

Last week, on the same day, both Pentaho and IBM made announcements about Hadoop support. There are several interesting things about this:

  • IBM’s announcement is a validation of Hadoop’s functionality, scalability and maturity. Good news.
  • Hadoop, being Java, will run on AIX, and on IBM hardware. In fact, Hadoop hurts the big iron vendors. Hadoop also, to some extent competes with IBM’s existing database offerings. But their announcement was made by their professional services group, not by their hardware or AIX groups. For IBM this is a services play.
  • IBM announced their own distro of Hadoop. This requires a significant development, packaging, testing, and support investment for IBM. They are going ‘all in’, to use a poker term. The exact motivation behind this has yet to be revealed. They are offering their own tools and extensions to Hadoop, which is fair enough, but this is possible without providing their own full distro. Only time will show how they are maintaining their internal fork or branch of Hadoop and whether any generic code contributions make it out of Big Blue into the Hadoop projects.
  • IBM is making a play for Big Data, which, in conjunction with their cloud/grid initiatives, makes perfect sense. When it comes to cloud computing, the cost of renting hardware is gradually converging with the price of electricity. But with the rise of the cloud, an existing problem is compounded. Web-based applications generate a wealth of event-based data. This data is hard enough to analyze when you have it on-premise, and it quickly eclipses the size of the transactional data. When this data is generated in a cloud environment, the problem is worse: you don’t even have the data locally, and moving it will cost you. IBM is attempting a land-grab: cloud + Hadoop + IBM services (with or without IBM hardware, OS, and databases). They are recognizing the fact that running apps in the cloud and storing data in the cloud are easy: but analyzing that data is harder and therefore more valuable.

Pentaho’s announcement, was similar in some ways, different in others:

  • Like IBM, we recognize the needs and opportunities.
  • Technology-wise, Pentaho has a suite of tools, engines and products that are a much better suited for Hadoop integration, being pure Java and designed to be embedded
  • Pentaho has no plans to release our own distro of Hadoop. Any changes we make to Hadoop, Hive etc will be contributed to Apache
  • And lastly, but no less importantly, Pentaho announced first. 😉

When it comes to other players:

  • Microsoft is apparently making Hadoop ready for Azure, but is Hadoop currently is not recommended for production use on Windows. It will be interesting to see how these facts resolve themselves.
  • Oracle/Sun has the ability to read from the Hadoop file system and has a proprietary Map/Reduce capability, but no compelling Hadoop support yet. In direct conflict with the scale-out mentality of Hadoop, in a recent Wired interview Larry Ellison talked about Oracle’s new hardware

The machine costs more than $1 million, stands over 6 feet tall, is two feet wide and weighs a full ton. It is capable of storing vast quantities of data, allowing businesses to analyze information at lightening fast speeds or instantly process commercial transactions.

  • HP, Dell etc are probably picking up some business providing the commodity hardware for Hadoop installations, but don’t yet have a discernible vision.

Interesting times…

Written by James

May 27, 2010 at 3:33 am