James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Archive for the ‘Hive’ Category

Next Big Thing: The Data Explosion

leave a comment »

These are the main points from The Data Explosion section of my Next Big Thing In Big Data presentation.

  • The data explosion started in the 1960s
  • Hard drive sizes double every 2 years
  • Data expands to fill the space available
  • Most data types have a natural maximum granularity
  • We are reaching the natural limit for color, images, sound and dates
  • Business data also has a natural limit
  • Organizations are jumping from transactional level data to the natural maximum in one implementation
  • As an organization implements a big data system, the volume of its stored data “pops”
  • Few companies are popping so far
  • The explosion of hype exceeds the data explosion
  • The server, memory, storage, and processor vendors do not show a data explosion happening


  • The underlying trend is an explosion itself
  • The explosion on the explosion is only minor so far
  • Popcorn Effect – it will continue to grow

Please comment or ask questions


Written by James

June 23, 2015 at 5:50 pm

Pentaho’s Big Data Release

leave a comment »

This week at Pentaho we announced a major Big Data release, including:

  • Open sourcing of our of big data code
  • Moving Pentaho Data Integration to the Apache license
  • Support for Hbase, Cassandra, MongoDB, Hadapt
  • And numerous functionality and performance improvements

What does this mean for the Big Data market, for Pentaho, and for everyone else?

We believe you should use the best tool for each job. For example you should use Hadoop or a NoSQL database where those technologies suit your purposes, and use a high performance columnar database for the use cases they are suited to. Your organization probably has applications that use traditional databases, and likely has a hosted application or two as well. Like it or not, if you have a single employee that has a spreadsheet on their laptop, you have a data architecture that includes flat files. So every data architecture is a hybrid environment to some extent. To solve the requirements of your business, your IT group probably has to move/merge/transform data between these data stores. You may have an application or two that has no external inputs or outputs, and no integration points with other applications. There is a word for these applications – silos. Silos are bad. Big data is no different. A big data store that is not integrated with your data architecture is a Big Silo. Big Silos are just as bad as regular silos, only bigger.

So when you add a big data technology to your organization, you don’t want it to be a silo. The big data capabilities of Pentaho Data Integration enable you to integrate your big data store into the rest of your data architecture. If you are using any of the big data technologies we support you can move data into, and out of these data stores using a graphical environment. Our data integration capabilities also extend to traditional databases, columnar databases, flat files, web services, hosted applications and more. So you can easily integrate your big data application into the rest  of your data architecture. This means your big data store is not a silo.

For Pentaho, the big data arena is a strategic one. These are new technologies and architectures so all the players in this space are starting from the same place. It is a great space for us because people using these technologies need tools and capabilities that are easy for us to deliver. Hadoop is especially cool because all of our tools and technologies are pure Java and are embeddable, so we can execute our engines within the data nodes and scale linearly as your data grows.

For everyone else our tools continue to provide great bang for the buck for ETL, reporting, OLAP, predictive analytics etc. Now we also lower the cost, time, and skills sets required to investigate big data solutions. For any one application you can divide the data architecture into two main segments: client data and server data. Client data includes things like flat files, mobile app data, cookie data etc. Server data includes transactional/traditional databases and big data stores. I don’t see the server-side as all or nothing. It could be all RDBMS, all big data store, 50/50, or any mix of the two. It’s like milk and coffee. You can have a glass of milk, a cup of coffee, or variations in between with different amounts of milk or coffee. So you can consider an application that only uses a traditional database today to be an application that currently utilizes 0% of its potential big data component. So every data architecture exists on this continuum, and we have great tools to help you if you want to step into the big data world.

If you want to find out more:



Written by James

February 2, 2012 at 9:56 pm

More Hadoop in New York City

leave a comment »

Yesterday was fun. First I met with a potential customer looking to try Hadoop for a big data project.

Then I had a lengthy and interesting chat with Dan Woods. Amongst other things Dan runs http://www.citoresearch.com/ and also blogs for Forbes. We talked about Pentaho’s history and our experiences so far with the commercial open source model.   We also talked about Hadoop and big data and about the vision and roadmap of our Agile BI offering.

Next I met with Steve Lohr who is a technology reporter for the New York Times. We talked about many topics including the enterprise software markets and how open source is affecting them. We also talked about Hadoop, of course.

Next was a co-meet-up of the New York Predictive Analytics and No-SQL groups where I presented decks about Weka and Hadoop, separately and together. There were lots of interesting questions and side discussions earlier. By the time we finished all these topics a blizzard was going on out side. Cabs were nowhere to be seen so Matt Gershoff of Conductrics was kind enough to lead me via the subway to the vicinity of my hotel.

Written by James

January 27, 2011 at 4:01 pm

Meetups and Pentaho Summit(s) coming up in January

with one comment

It’s going to be a busy month.

January 19th and 20th is our Global 2011 Summit in San Francisco. I have three sessions Pentaho for Hadoop, Extending Pentaho’s Capabilities, and an Architecture Overview. So I’m creating and digging up some new sample plug-ins and extensions. I’m also going to take part in an Q&A session with the Penaho architects since Julian Hyde (Mondrian), Matt Casters (PDI/Kettle), Thomas Morgner (Pentaho Reporting) will all be there. Who should attend?

CTOs, architects, product managers, business executives and partner-facing staff from System Integrators and Resellers, as well as Software Providers with a need to embed business intelligence or data integration software into your products.

We usually have customers and prospects attending our summits as well.

We are also having an architect’s summit that same week to work on our 2011 technology road-map. That should be a lot of fun.

The week after that I’ll be in New York presenting at the NYC Hadoop User Group on Tuesday, January 25 and the NYC Predictive Analytics Meetup on Wednesday January 26th.

Written by James

January 5, 2011 at 9:20 pm

Pentaho & Hadoop Webinar Tomorrow (June 9th)

with one comment

I’ll be presenting an informal webinar tomorrow to the Pentaho community about Pentaho’s Hadoop initiative.

I’m planning to cover:

  • Use cases for Hadoop and what needs Pentaho can solve
  • The risks and costs of Hadoop implementations
  • Architecture of Hadoop/Hive with and without Pentaho
  • Pentaho’s Hadoop roadmap
  • Demo of selected integration points
  • Cover some FAQs


June 9, 2010 10:00 am, Eastern Daylight Time (New York, GMT-04:00)
June 9, 2010 2:00 pm, (Reykjavik, GMT)
June 9, 2010 4:00 pm, Europe Summer Time (Paris, GMT+02:00)
June 10, 2010 12:00 am, Australia Eastern Standard Time (Sydney, GMT+10:00)



Written by James

June 8, 2010 at 6:33 pm

Pentaho and IBM Hadoop Announcements

with 5 comments

Last week, on the same day, both Pentaho and IBM made announcements about Hadoop support. There are several interesting things about this:

  • IBM’s announcement is a validation of Hadoop’s functionality, scalability and maturity. Good news.
  • Hadoop, being Java, will run on AIX, and on IBM hardware. In fact, Hadoop hurts the big iron vendors. Hadoop also, to some extent competes with IBM’s existing database offerings. But their announcement was made by their professional services group, not by their hardware or AIX groups. For IBM this is a services play.
  • IBM announced their own distro of Hadoop. This requires a significant development, packaging, testing, and support investment for IBM. They are going ‘all in’, to use a poker term. The exact motivation behind this has yet to be revealed. They are offering their own tools and extensions to Hadoop, which is fair enough, but this is possible without providing their own full distro. Only time will show how they are maintaining their internal fork or branch of Hadoop and whether any generic code contributions make it out of Big Blue into the Hadoop projects.
  • IBM is making a play for Big Data, which, in conjunction with their cloud/grid initiatives, makes perfect sense. When it comes to cloud computing, the cost of renting hardware is gradually converging with the price of electricity. But with the rise of the cloud, an existing problem is compounded. Web-based applications generate a wealth of event-based data. This data is hard enough to analyze when you have it on-premise, and it quickly eclipses the size of the transactional data. When this data is generated in a cloud environment, the problem is worse: you don’t even have the data locally, and moving it will cost you. IBM is attempting a land-grab: cloud + Hadoop + IBM services (with or without IBM hardware, OS, and databases). They are recognizing the fact that running apps in the cloud and storing data in the cloud are easy: but analyzing that data is harder and therefore more valuable.

Pentaho’s announcement, was similar in some ways, different in others:

  • Like IBM, we recognize the needs and opportunities.
  • Technology-wise, Pentaho has a suite of tools, engines and products that are a much better suited for Hadoop integration, being pure Java and designed to be embedded
  • Pentaho has no plans to release our own distro of Hadoop. Any changes we make to Hadoop, Hive etc will be contributed to Apache
  • And lastly, but no less importantly, Pentaho announced first. 😉

When it comes to other players:

  • Microsoft is apparently making Hadoop ready for Azure, but is Hadoop currently is not recommended for production use on Windows. It will be interesting to see how these facts resolve themselves.
  • Oracle/Sun has the ability to read from the Hadoop file system and has a proprietary Map/Reduce capability, but no compelling Hadoop support yet. In direct conflict with the scale-out mentality of Hadoop, in a recent Wired interview Larry Ellison talked about Oracle’s new hardware

The machine costs more than $1 million, stands over 6 feet tall, is two feet wide and weighs a full ton. It is capable of storing vast quantities of data, allowing businesses to analyze information at lightening fast speeds or instantly process commercial transactions.

  • HP, Dell etc are probably picking up some business providing the commodity hardware for Hadoop installations, but don’t yet have a discernible vision.

Interesting times…

Written by James

May 27, 2010 at 3:33 am

Pentaho and Hadoop: Big Data + Big ETL + Big BI = Big Deal

with 2 comments

Earlier today Pentaho announced support for Hadoop – read about it here.

There are many reasons we are doing this:

  • Hadoop lacks graphical design tools – Pentaho provides plug-able design tools.
  • Hadoop is Java –  Pentaho’s technologies are Java.
  • Hadoop needs embedded ETL – Pentaho Data Integration is easy to embed.
  • Pentaho’s open source model enables us to provide technology with great price/performance.
  • Hadoop lacks visualization tools – Pentaho has those
  • Pentaho provides a full suite of ETL, Reporting, Dashboards, Slice ‘n’ Dice Analysis, and Predictive Analytics/Machine Learning

The thing is, taking all of these in combination, Pentaho is the only technology that satisfies all of these points.

You can see a few of the upcoming integration points in the demo video. The ones shown in the video are only a few of the many integration points we are going to deliver.

Most recently I’ve been working on integrating the Pentaho suite with the Hive database. This enables desktop and web-based reporting, integration with the Pentaho BI platform components, and integration with Pentaho Data Integration. Between these use cases, hundreds of different components and transformation steps can be combined in thousands of different ways with Hive data. I had to make some modifications to the Hive JDBC driver and we’ll be working with the Hive community to get these changes contributed. These changes are the minimal changes required to get some of the Pentaho technologies working with Hive. Currently the changes are in a local branch of the Hive codebase. More specifically they are a ‘SHort-term Rapid-Iteration Minimal Patch’ fork – a SHRIMP Fork.

Technically, I think the most interesting Hive-related feature so far is the ability to call an ETL process within a SQL statement (as a Hive UDF). This enables all kinds of complex processing and data manipulation within a Hive SQL statement.

There are many more Hadoop-related ETL and BI features and tools to come from Pentaho.  It’s gonna be a big summer.

Written by James

May 19, 2010 at 7:49 am