James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Archive for May 19th, 2010

What Agile BI is not…

I have a late entry for Pentaho’s ‘What is Agile BI?’ competition.

Agile BI is everything that is wrong with this: ‘Baker Hughes Deploys SAP BusinessObjects Solutions in Less Than One Year’

This is the title of a session that was run yesterday at the SAP/SapphireNow conference in Orlando. That’s old-school, right there.

Thanks to @markmadsen and @wherescape on Twitter for giving me the twit-up on this.

Written by James

May 19, 2010 at 5:28 pm

Pentaho and Hadoop: Big Data + Big ETL + Big BI = Big Deal

Earlier today Pentaho announced support for Hadoop – read about it here.

There are many reasons we are doing this:

  • Hadoop lacks graphical design tools – Pentaho provides pluggable design tools.
  • Hadoop is Java – Pentaho’s technologies are Java.
  • Hadoop needs embedded ETL – Pentaho Data Integration is easy to embed.
  • Pentaho’s open source model enables us to provide technology with great price/performance.
  • Hadoop lacks visualization tools – Pentaho has those.
  • Pentaho provides a full suite of ETL, Reporting, Dashboards, Slice ‘n’ Dice Analysis, and Predictive Analytics/Machine Learning.

The thing is, Pentaho is the only technology that satisfies all of these points in combination.

You can see some of the upcoming integration points in the demo video. They are only a few of the many we are going to deliver.

Most recently I’ve been working on integrating the Pentaho suite with the Hive database. This enables desktop and web-based reporting, integration with the Pentaho BI platform components, and integration with Pentaho Data Integration. Between these use cases, hundreds of different components and transformation steps can be combined in thousands of different ways with Hive data. To get there I had to modify the Hive JDBC driver, and we’ll be working with the Hive community to get these changes contributed. They are the minimum needed to get some of the Pentaho technologies working with Hive, and for now they live in a local branch of the Hive codebase. More specifically, they are a ‘SHort-term Rapid-Iteration Minimal Patch’ fork – a SHRIMP Fork.
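To make the JDBC side of this concrete, here is a minimal sketch of how a reporting tool might read Hive data through plain JDBC. It is not Pentaho’s code: the `jdbc:hive://` URL shape and the `org.apache.hadoop.hive.jdbc.HiveDriver` class name are what I’d expect from the Hive driver of this era, and the host, port, `products` table and `HiveJdbcSketch` class are made-up placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class HiveJdbcSketch {

    // Assumed URL shape for the Hive JDBC driver; host, port and database are placeholders.
    static String hiveUrl(String host, int port, String database) {
        return "jdbc:hive://" + host + ":" + port + "/" + database;
    }

    // Runs a query and prints the first column of each row, as a simple report might.
    static void printFirstColumn(String url, String sql) throws SQLException {
        try (Connection connection = DriverManager.getConnection(url, "", "");
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                System.out.println(rows.getString(1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed driver class name; needs the Hive JDBC jar on the classpath to load.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Hypothetical table and column, used only for illustration.
        printFirstColumn(hiveUrl("localhost", 10000, "default"),
                "SELECT name FROM products LIMIT 10");
    }
}
```

Because everything above goes through the standard `java.sql` interfaces, any tool that already speaks JDBC can in principle treat Hive as one more data source, which is why the driver is the natural place for the changes.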

Technically, I think the most interesting Hive-related feature so far is the ability to call an ETL process within a SQL statement (as a Hive UDF). This enables all kinds of complex processing and data manipulation within a Hive SQL statement.
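As a rough illustration of the idea, here is a sketch of the shape such a UDF could take. It is an assumption-laden stand-in, not the Pentaho implementation: a real Hive UDF of this kind would be a public class extending `org.apache.hadoop.hive.ql.exec.UDF`, which is left out so the sketch compiles without Hive, and the `transformation` function is a hypothetical placeholder for an embedded ETL process.

```java
import java.util.function.Function;

// Sketch only: a real Hive UDF would be public and extend
// org.apache.hadoop.hive.ql.exec.UDF.
class EtlUdfSketch {

    // Hypothetical stand-in for an embedded ETL transformation: one value in, one value out.
    private final Function<String, String> transformation;

    // Hive instantiates UDFs through a no-argument constructor; this one
    // wires in a trivial placeholder transformation (trim, then upper-case).
    EtlUdfSketch() {
        this(value -> value.trim().toUpperCase());
    }

    EtlUdfSketch(Function<String, String> transformation) {
        this.transformation = transformation;
    }

    // Classic Hive UDF convention: the engine looks for a method named "evaluate".
    public String evaluate(String input) {
        if (input == null) {
            return null; // pass SQL NULL through untouched
        }
        return transformation.apply(input);
    }

    public static void main(String[] args) {
        EtlUdfSketch udf = new EtlUdfSketch();
        System.out.println(udf.evaluate("  widget "));
    }
}
```

Once a class like this is registered in Hive with `CREATE TEMPORARY FUNCTION` under some name (say a hypothetical `clean_name`), a query could call it inline, for example `SELECT clean_name(name) FROM products`, and the ETL logic would run for each row the query touches.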

There are many more Hadoop-related ETL and BI features and tools to come from Pentaho. It’s gonna be a big summer.

Written by James

May 19, 2010 at 7:49 am