150,000 installations year-to-date for Pentaho

Our most recent figures show that 156,000 copies of Pentaho software have been installed so far this year. These are not download numbers; they count installed software that has actually been used. They include Pentaho servers and some Pentaho client tools. They are not limited to long-term installations, but neither do they cover all of Pentaho’s software distributions or installations. Since these numbers are not absolutel

An analysis by country of these numbers shows interesting results.

The Long Tail

This chart shows the number of new installations year-to-date for each country. Our data shows new Pentaho installations in 176 countries so far this year. That’s out of a total of 229 countries.

This is clearly a classic long tail. In fact, after the first 20 or 30 countries it is difficult to read values from the chart. This second chart uses a log scale. The line on this chart is almost perfectly linear, showing that the number of installations falls off roughly exponentially down the country ranking.
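One way to put a number on "almost perfectly linear" is to fit a straight line to the log of the counts against country rank and look at the R² value. The sketch below is only an illustration of that idea: it assumes the log scale is on the installations axis, the function name `log_linear_fit` is made up, and the counts are synthetic values generated to decay by a constant factor, not Pentaho data.

```python
import math


def log_linear_fit(counts):
    """Least-squares fit of log10(count) against rank (1, 2, 3, ...).

    Counts are sorted in descending order first; zero counts are dropped
    because they have no logarithm. Returns (slope, intercept, r_squared).
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = list(range(1, len(ranked) + 1))
    ys = [math.log10(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic counts (NOT real data): each rank has 95% of the previous one.
synthetic_counts = [20000 * 0.95 ** rank for rank in range(176)]
slope, intercept, r_squared = log_linear_fit(synthetic_counts)
print(f"slope={slope:.4f} per rank, R^2={r_squared:.4f}")
```

An R² close to 1 on real counts would support the straight-line reading of the chart; a noticeably lower value would suggest the tail bends.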

Over the same time period Pentaho has customers in 46 countries. This is a larger geographic spread than most of the proprietary BI companies.

Since we are dealing with country-based data, here is the analysis I did using Google Geo Map, Pentaho Data Integration, and Pentaho BI Server.

New Pentaho Installs Jan-Oct 2010

This shows the geographic spread of the installations.

It is fairly obvious from the map above that the highest numbers of installations were in the USA, China, and Brazil, followed by India and parts of Europe. But this simplistic graphic does not take into account the economics or demographics of the countries. How does the number of installations relate to the size or economic power of each country?

New Installations Per $Billion GDP

If we look at the number of new installations of Pentaho software per billion dollars of GDP, we see a different picture. The GDP data is from the CIA World Factbook.

I capped ‘Installs Per Bn GDP’ at 10 to prevent outliers from skewing the color gradient.
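The capped metric is simple arithmetic: divide installations by GDP in billions of dollars, then clamp the result so a few extreme values do not stretch the color scale. A minimal sketch, assuming GDP is already expressed in billions of US dollars; the function name and the sample figures are hypothetical and are not taken from the installation data.

```python
def installs_per_bn_gdp(installs, gdp_usd_bn, cap=10.0):
    """Installations per $1bn of GDP, clamped at `cap` for map coloring."""
    if gdp_usd_bn <= 0:
        raise ValueError("GDP must be positive")
    return min(installs / gdp_usd_bn, cap)


# Hypothetical figures, for illustration only.
print(installs_per_bn_gdp(500, 1000))  # large economy: 0.5
print(installs_per_bn_gdp(300, 20))    # small economy: raw value 15, capped to 10.0
```

The same clamp applies to the other capped metrics by changing the denominator and the cap value.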

Compared with the first map, the prominence of the USA and China is reduced, and the areas of high activity are shown to be South America followed by Europe and parts of Asia. But analysis using GDP alone does not take into account things like exchange rates and the cost of living within a country – as a result there is probably a bias towards countries like the South American ones. So I went to find metrics that should remove the bias of economic factors.

New Installations Per 100k Labor Force

If we look at the number of new Pentaho installs compared with the labor force of each country, we get a slightly different picture. The labor force data is from the CIA World Factbook.

I capped ‘Installs Per 100k Labor Force’ at 50 to prevent outliers from skewing the color gradient.

Compared with the first two maps, this one shows the South American, European, and North American countries roughly equal to each other. Australia and New Zealand are also comparable. Asia, Africa, and the Middle East are shown to be generally behind. What is odd about this graphic is that countries like India, generally considered to be significant open source consumers, are not shown among the leading countries. I’m assuming this is because a large percentage of the labor force there is agricultural and, as such, less likely to be doing much BI.

New Installations Per 100k Internet Users

So instead of labor force, let’s look at new installations of Pentaho for every 100k internet users within a country.

I capped ‘Installs Per 100k Internet Users’ at 50 to prevent outliers from skewing the color gradient.

Here we see that South America is still prominent, along with southern Europe. The rest of Europe and North America come second, along with India, other parts of Asia, and Australia. South Africa also makes a showing for the first time. China, however, does not show strongly.

This metric – Installations per 100k Internet Users – seems like a reasonable way to compare the adoption of software between countries. ‘Internet Users’, by definition, have access to a computer (needed to run FOSS) and to the internet (needed to get FOSS). This metric is not skewed by the percentage of the population that are not internet users, and is not skewed by cost-of-living or exchange rates.

Here are the top 40 countries for new Pentaho installations per 100k internet users (for countries with over 1 million internet users).
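A ranking like this can be produced by filtering on the internet-user threshold, computing installations per 100k users, and sorting. The sketch below assumes each country is a dictionary with `country`, `installs`, and `internet_users` keys; those field names, the function name, and the three sample rows are all hypothetical.

```python
def rank_by_installs_per_100k_users(rows, min_users=1_000_000, top_n=40):
    """Return (country, installs per 100k internet users), highest first.

    Countries with `min_users` internet users or fewer are excluded.
    """
    eligible = [r for r in rows if r["internet_users"] > min_users]
    scored = [
        (r["country"], r["installs"] * 100_000 / r["internet_users"])
        for r in eligible
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical rows, for illustration only.
sample_rows = [
    {"country": "A", "installs": 1200, "internet_users": 40_000_000},
    {"country": "B", "installs": 900, "internet_users": 6_000_000},
    {"country": "C", "installs": 50, "internet_users": 500_000},  # below threshold
]
ranking = rank_by_installs_per_100k_users(sample_rows)
print(ranking)
```

Country C is dropped by the threshold even though its raw ratio is high, which is the point of the 1 million cut-off: very small denominators produce noisy rankings.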

There is still a bias. Countries with a lower percentage of internet users in the total population will be rated higher than those with a very high percentage. This is because in the first case, the individuals with internet access will tend to be those in business, i.e. those with a higher than normal need for BI tools. Whereas in the second case the internet users include relatively more families and individuals – those with a lower need for BI tools. This bias would not affect the installation figures of software such as Firefox, but would affect the ratings in Pentaho’s case.

If we group the countries into regions we see some other interesting things. This scatter chart shows the number of internet users on the X axis and new installations of Pentaho software on the Y axis.

Interestingly, the USA, South America, and Asia come out with around the same total number of installations (approximately 30,000), but the chart shows a large difference (100m up to 550m) in the number of internet users within those regions. Europe, as a region, has the highest number of new installations, with a 50% margin over the second-place region.

So which metric do you think is most valuable? And for what purpose?

Also interesting to note is that the 2010 installation numbers represent, for each country, 40-50% of the all-time (2007-2010) installation figures. This means that the number of new installations so far in 2010 is close to the number of installations in the previous 3 years combined.
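The arithmetic behind that comparison: if 2010 accounts for a share s of the all-time total, then the earlier years account for 1 - s, so 2010 is s / (1 - s) times the earlier years combined. A small sketch of that conversion (the function name is made up for illustration):

```python
def ratio_to_prior_years(share_of_all_time):
    """Convert this year's share of the all-time total into a ratio
    of this year's installations to all earlier installations."""
    if not 0 <= share_of_all_time < 1:
        raise ValueError("share must be in the range [0, 1)")
    return share_of_all_time / (1 - share_of_all_time)


for share in (0.40, 0.45, 0.50):
    print(f"{share:.0%} of all-time -> {ratio_to_prior_years(share):.2f}x prior years")
```

At a 50% share, 2010 exactly matches the earlier years combined; at 40% it is about two-thirds of them.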

About This Analysis

And yes, I used Pentaho software to do this analysis – I used Pentaho’s Agile BI process.

  • Iteration 1: I first loaded the ‘new installations’ data into a table to do the histograms and the first map. After seeing the map, it occurred to me that just looking at the installation figures was not very interesting, and that comparing installations to GDP might be better.
  • Iteration 2: So I went to find GDP data and added it to the table using Pentaho Data Integration. After seeing the ‘Pentaho Installs Per $Bn GDP’ map it occurred to me that other metrics might show different, and better, results – so I went to find other data sets, not knowing what I might find.
  • Iteration 3: At the CIA World Factbook I found Labor Force and Internet Users. I added these to the table and looked at the maps. At this point I decided that comparing installation counts with the total number of internet users in a country was a good metric.

It took three iterations of finding data, merge/calculate/load, and visualization before I settled on an analysis I thought was optimal. The important point here is that until I saw the data visualized, the next question did not occur to me, so a one-pass ‘requirements’ -> ‘design’ -> ‘implement’ -> ‘visualize’ process would not have worked.
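For readers who want to see the shape of a merge/calculate step outside of Pentaho Data Integration, here is a rough Python sketch. It is not the transformation I built; the function name, the dictionary layout, the field names, and the sample values are all assumptions made for illustration.

```python
def merge_and_calculate(installs_by_country, reference_by_country):
    """Join installation counts with per-country reference data and
    derive the three ratio metrics discussed in this post."""
    merged = {}
    for country, installs in installs_by_country.items():
        ref = reference_by_country.get(country)
        if ref is None:
            continue  # no reference data: skip the country rather than guess
        merged[country] = {
            "installs": installs,
            "per_bn_gdp": installs / ref["gdp_usd_bn"],
            "per_100k_labor": installs * 100_000 / ref["labor_force"],
            "per_100k_users": installs * 100_000 / ref["internet_users"],
        }
    return merged


# Hypothetical inputs, for illustration only.
installs = {"AA": 600, "BB": 80, "CC": 10}
reference = {
    "AA": {"gdp_usd_bn": 300, "labor_force": 20_000_000, "internet_users": 12_000_000},
    "BB": {"gdp_usd_bn": 40, "labor_force": 4_000_000, "internet_users": 1_600_000},
}
table = merge_and_calculate(installs, reference)
print(table)
```

Each iteration described above amounts to adding another reference column and another derived ratio to a table like this, then looking at the map again.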

