James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Archive for the ‘Business Intelligence’ Category

Next Big Thing: The Data Explosion

leave a comment »

These are the main points from The Data Explosion section of my Next Big Thing In Big Data presentation.

  • The data explosion started in the 1960s
  • Hard drive sizes double every 2 years
  • Data expands to fill the space available
  • Most data types have a natural maximum granularity
  • We are reaching the natural limit for color, images, sound and dates
  • Business data also has a natural limit
  • Organizations are jumping from transactional level data to the natural maximum in one implementation
  • As an organization implements a big data system, the volume of its stored data “pops”
  • Few companies are popping so far
  • The explosion of hype exceeds the data explosion
  • The server, memory, storage, and processor vendors do not show a data explosion happening

Summary

  • The underlying trend is an explosion itself
  • The explosion on the explosion is only minor so far
  • Popcorn Effect – it will continue to grow

Please comment or ask questions

Written by James

June 23, 2015 at 5:50 pm

Pile-On: Dan Woods “Lessons From The First Wave Of Hadoop Adoption”

leave a comment »

Dan Woods put out a nice piece yesterday on his Forbes blog titled “Lessons From The First Wave Of Hadoop Adoption”.

I agree with him that the insights and advantages of Big Data solutions need to be described in terms other than technology. I’m going to add on to his insights.

1. It’s about more than big data. It’s a new platform.

Yes, it is a new platform. That means it’s different from the old ones. The fact that you can do some things more cheaply than you could before is not the main idea. A bigger story is that some things that were not economically possible before now are. But the main idea is that this is a new platform, with new capabilities, that needs to fit into your existing data architecture.

2. Don’t get rid of your data warehouse

I completely agree. Big Data technology is a new tool with new characteristics. Using it to replace a Data Warehouse technology that is finely tuned for that use case is not a great idea. Don’t listen to the “Hadoop will replace every database within x years” crowd. No database has ever managed to replace every other database, and none ever will, because the variety of use cases is too large.

3. Think about your data supply chain

Since a Big Data system needs to fit in with everything you currently have and operate, integration is a significant priority. Understand that with Big Data you can build a Big Silo, but a Big Silo is as bad as a small silo (just a lot bigger). You should not be required to pump all your data from every system into Hadoop to get value from it. Design your data architecture carefully; the implications and fallout of getting it right or wrong are significant.

4. It’s complicated

Yes it is. It’s also not cheap to do it well. Sure, you can download a lot of open source software and prototype or prove your ideas without a lot of upfront outlay. But putting it into production is a production. Expect that.

Written by James

January 27, 2015 at 5:01 pm

Union of the State – A Data Lake Use Case

with 6 comments

Many business applications are essentially workflow applications or state machines. This includes CRM systems, ERP systems, asset tracking, case tracking, call center, and some financial systems. The real-world entities (employees, customers, devices, accounts, orders etc.) represented in these systems are stored as a collection of attributes that define their current state. Examples of these attributes include someone’s current address or number of dependents, an account’s current balance, who is in possession of laptop X, which documents for a loan approval have been provided, and the date of Fluffy’s last Feline Distemper vaccination.
State machines are very good at answering questions about the state of things. They are, after all, machines that handle state. But what about reporting on trends and changes over the short and long term? How do we do that? The answer is to track changes to the attributes in change logs. These change logs are database tables or text files that list the changes made over time. That way you can (although the data transformation is ugly) rewind the change log of a specific field across all objects in the system and then aggregate those changes to get a view over time. This is not easy to do, and it assumes that you have a change log. Typically, change logs only exist for the main fields in an application. There might only be change logs on 10-20% of the fields. So if you suddenly have an impulse to see how a lesser attribute has changed over time, you are out of luck. It is impossible because that information is lost.
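
To make the rewind-and-aggregate idea concrete, here is a minimal sketch in Python; the change-log layout, field names, and figures are all hypothetical:

    from datetime import datetime

    # Hypothetical change-log rows: (timestamp, object id, field, new value).
    # In a real system these would come from a database table or log file.
    change_log = [
        (datetime(2014, 1, 5),  "acct-1", "balance", 100.0),
        (datetime(2014, 2, 9),  "acct-1", "balance", 250.0),
        (datetime(2014, 2, 20), "acct-2", "balance", 80.0),
        (datetime(2014, 3, 1),  "acct-1", "balance", 175.0),
    ]

    def field_values_at(log, field, as_of):
        """Rewind the log: the latest value of `field` per object at `as_of`."""
        latest = {}
        for ts, obj, fld, value in sorted(log):
            if fld == field and ts <= as_of:
                latest[obj] = value
        return latest

    # Aggregate the rewound values to get a view over time.
    for month in (1, 2, 3):
        snapshot = field_values_at(change_log, "balance", datetime(2014, month, 28))
        print(month, sum(snapshot.values()))
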
This situation is similar to the way that old-school business intelligence and analytic applications were built. End users listed out the questions they wanted to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk-loaded into a data mart. This method works fine until you have a new question to ask. The Data Lake approach solves this problem. You store all of the data in a Data Lake, populate data marts and your data warehouse to satisfy traditional needs, and enable ad-hoc query and reporting on the raw data in the Data Lake for new questions.
A Data Lake can also be used to solve the problems of history and trending for workflow applications and state machines. What if these applications wrote their initial state into the Data Lake and then also wrote every attribute change in there as well? While we are at it, let’s log all the application events coming from the user interface tier too. From the application’s perspective this is a low-latency, fire-and-forget scenario.
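
As a rough illustration, assuming the lake ingests append-only JSON-lines files (a real deployment would put a collector such as Kafka or Flume in front of HDFS), the application-side logging could look like this hypothetical sketch:

    import json
    import time

    # Stand-in for a Data Lake sink; in production this write would be
    # buffered and asynchronous to keep it low-latency and fire-and-forget.
    lake_log = open("state_changes.jsonl", "a")

    def log_change(entity_id, field, old_value, new_value):
        """Append every attribute change as one JSON record."""
        lake_log.write(json.dumps({
            "ts": time.time(),
            "entity": entity_id,
            "field": field,
            "old": old_value,
            "new": new_value,
        }) + "\n")

    def log_ui_event(user, action, record_id):
        """Log user-interface events too, so we later know who looked at what."""
        lake_log.write(json.dumps({
            "ts": time.time(),
            "user": user,
            "event": action,
            "record": record_id,
        }) + "\n")

    log_change("acct-1", "address", "10 Main St", "22 Oak Ave")
    log_ui_event("bob", "open_record", "case-77")
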
Now we have the initial state of the application’s data and the changes to all of the attributes, not just the main/traditional fields. We can apply this approach to more than one application, each with its own Data Lake of state logs, storing every incremental change and event. So now we have the state of every field of (potentially) every business application in an enterprise across time. We have the “Union of the State”.
With this data we have the ability to rewind the Union of the State to any point in time. What are the potential use cases for the Union of the State?
Enterprise Time Machine
Suppose something happened a few weeks ago. Decisions were made. Things changed. But exactly what, when, and why? With an Enterprise Time Machine you can rewind the complete state of every major application to any point in time and then step forward event by event, click by click, change by change, at the millisecond level if things happened that quickly. For an e-commerce vendor this means being able to know for any specified millisecond in the past how many shopping carts were open, what was in them, which transactions were pending, which items were being boxed, or in transit, what was being returned, who was working, how many customer support calls were queued and how many were in progress. In different domains such as financial services or healthcare, the applications and attributes are different but the ability is the same.
In order to reconstruct the state at any point in time we need to load the initial snapshot into a repository and then update the attributes of each object as we process the logs, event by event, until we get to the point in time that we are interested in. A NoSQL store such as MongoDB, HBase, or Cassandra should work well as the repository. This process could be optimized by adding regular snapshots of the whole state into the Data Lake so that we don’t have to process from the very beginning every time. For a detailed analysis you could rebuild the state to a particular point in time and then process forwards in increments of any size. This way the situation of a device failure that led to a catastrophic cascade of events can be re-created and examined millisecond by millisecond.
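
Here is a minimal sketch of the snapshot-plus-replay idea, with a plain dictionary standing in for the NoSQL repository; the record layout and names are assumptions:

    def rebuild_state(snapshot, snapshot_ts, log, as_of):
        """Load the nearest snapshot, then replay logged changes up to `as_of`.

        `snapshot` maps entity -> {field: value}; `log` is a time-ordered list
        of (ts, entity, field, new_value) records. A plain dict stands in for
        the NoSQL repository (MongoDB, HBase, Cassandra) a real system would use.
        """
        state = {entity: dict(fields) for entity, fields in snapshot.items()}
        for ts, entity, field, value in log:
            if ts <= snapshot_ts:
                continue      # already reflected in the snapshot
            if ts > as_of:
                break         # we have reached the point in time of interest
            state.setdefault(entity, {})[field] = value
        return state

    # For a detailed analysis, rebuild once and then replay further slices of
    # the log in increments of any size, e.g. millisecond by millisecond.
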
Trending
Since we can re-create the state at any point in time, we can do trending and historical analysis of any and every attribute over any time period, at any time granularity we want.
Compliance
When user interface events are logged as well as the attribute changes, you have the ability to know not only who changed what information, but also who looked at it. Who was aware of the situation? Why did Bob open a particular record every few hours and cancel out without making changes? This requires the Enterprise Time Machine described above.
Predictive
One of the main tasks in a predictive exercise is to work out which attributes are predictive of your target variable and which ones are not. This can be impossible to do when you only have 10% of your attributes logged. Maybe the minor attributes are the predictive ones. Now you have all of them. This requires the trending facility described above.
Doug Moran, a co-founder of Pentaho and product manager for its Big Data products, sees many predictive applications for this kind of data. This includes the ability to derive a model from replays of previous events and use it to prescribe ways to influence the current situation to increase the likelihood of a desired outcome. For example, this could include replaying all previous shopping cart events for a user currently on an e-commerce site to derive a predictive model that prescribes a way to influence their current purchase in a positive way.
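
As a hypothetical sketch of that last idea: replay each past shopper’s cart events into a few summary features, fit a simple model, and score the live session. The features, data, and choice of model here are invented for illustration:

    from sklearn.linear_model import LogisticRegression

    # Hypothetical features derived by replaying each past shopper's cart
    # events: [items added, items removed, minutes of session time].
    X = [[5, 0, 12],
         [2, 2, 30],
         [7, 1, 8],
         [1, 4, 45]]
    y = [1, 0, 1, 0]   # 1 = the cart was checked out, 0 = it was abandoned

    model = LogisticRegression().fit(X, y)

    # Score a live session to decide whether to intervene (for example,
    # offer a discount) while the shopper is still on the site.
    print(model.predict_proba([[3, 1, 20]])[0][1])
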
“Dixon’s Union of the State idea gives the Data Lake idea a positive mission besides storing more data for less money,”
said Dan Woods, an IT Consultant to buyers and vendors and CEO of Evolved Media, who has written about the Data Lake for several years.
“Providing the equivalent of a rewind, pause, forward remote control on the state of your business makes it affordable to answer many questions that are currently too expensive to tackle. Remember, you don’t have to implement this vision for all data for it to provide a new platform to answer difficult questions with minimal effort.”
Architecture
How could this be done?
  • Let the application store its current state in a relational or NoSQL repository. Don’t affect the operation of the operational system.
  • Log all events and state changes that occur within the application. This is the tricky part unless it is an in-house application. It would be best if these events and state changes were logged in real time, but this is not always possible. Maybe Salesforce or SugarCRM will offer this level of logging as a feature. Dump this data into a Data Lake using a suitable storage and processing technology such as Hadoop.
  • Provide the ability to rewind the state of any and all attributes by parallel processing of the logs (see the sketch after this list).
  • Provide the facilities listed above using technologies appropriate to each use case (using the rewind capability).
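
As a rough sketch of the rewind-by-parallel-processing step: if the logs are partitioned by entity, each partition can be replayed independently. Python’s multiprocessing stands in here for what would really be a Hadoop/MapReduce job, and the record layout is assumed:

    from collections import defaultdict
    from multiprocessing import Pool

    def replay_partition(records):
        """Replay one entity's records in time order; return its final fields."""
        entity = records[0][1]
        fields = {}
        for ts, _, field, value in sorted(records):
            fields[field] = value
        return entity, fields

    def rewind_parallel(log, as_of):
        """Rewind all entities to `as_of` by replaying partitions in parallel."""
        partitions = defaultdict(list)
        for record in log:                 # record = (ts, entity, field, value)
            if record[0] <= as_of:
                partitions[record[1]].append(record)
        with Pool() as pool:               # call from under `if __name__ == "__main__"`
            return dict(pool.map(replay_partition, partitions.values()))
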

The plumbing and architecture for this is not simple, and Dan Woods points out that there are databases, like Datomic, that provide capabilities for storing and querying state over time. But a solution based on a Data Lake has the same price, scalability, and architectural attributes as other big data systems.

Written by James

January 22, 2015 at 4:43 am

12 Days of Visualizations – Sun Burst

leave a comment »

Today we are launching our 12 Days of Visualizations program: http://events.pentaho.com/12days-of-Big-Data-Visualizations.html

We are going to release a few new visualizations every week over the holiday period. You can drop these visualizations into a Pentaho BA server and they will appear on the charting menu in Analyzer.

The first one that we are releasing is a Sun Burst chart. This chart is based on the Protovis sun burst chart – http://mbostock.github.com/protovis/ex/sunburst.html

The Sun Burst chart we created can be used in a couple of ways. Firstly, it can be used as a multi-level pie chart. This sun burst shows how the sales in three territories break down into sales of product lines within those territories, and then how product line sales compare by year:

[Screenshot: sun burst chart of sales by territory, product line, and year]

This effect is achieved by using a color gradient for the outer ring that is based on the chart palette color of the inner rings, and by sorting the segments in each ring into descending order. When you compare the sun burst above with the pie chart below, you can see how much more information the sun burst provides.

[Screenshot: the same data as a single-level pie chart]

You can choose to use a common color gradient on the outer ring so that it is easier to compare the items on that ring. In this example a blue gradient has been used for the outer ring. Regardless of which territory a city is in, the shade of blue it is colored in can be used to compare it with other cities.

[Screenshot: sun burst chart with a common blue gradient on the outer ring]

In this chart a red/yellow/green gradient has been used. Here the levels of the chart are year, quarter, and month, so the data has not been sorted. The data for this chart is overtime costs, so the gradient has been reversed to show larger overtime costs in red and smaller ones in green.

[Screenshot: sun burst chart of overtime costs with a reversed red/yellow/green gradient]
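
For the curious, here is a minimal sketch (not the actual Pentaho/Protovis implementation) of the per-parent shading idea used in the first chart: sort the children into descending order, then derive each child’s color from its parent’s palette color by rank. The data and colors are hypothetical:

    import colorsys

    def shades_for_children(parent_rgb, n):
        """Derive n shades of a parent segment's palette color, darkest first,
        so outer-ring segments visually group under their inner-ring parent."""
        h, l, s = colorsys.rgb_to_hls(*parent_rgb)
        return [colorsys.hls_to_rgb(h, 0.35 + 0.45 * i / max(n - 1, 1), s)
                for i in range(n)]

    # Children sorted into descending order by value, then shaded by rank.
    children = sorted([("Planes", 120), ("Ships", 90), ("Trains", 40)],
                      key=lambda kv: -kv[1])
    shades = shades_for_children((0.2, 0.4, 0.8), len(children))
    for (name, value), rgb in zip(children, shades):
        print(name, value, rgb)
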

You can find out more about this chart here: http://wiki.pentaho.com/display/COM/Sunburst

Written by James

December 14, 2012 at 5:39 pm

5 Out Of 6 Developers Are Using Open Source

leave a comment »

ZDNet reports on a Forrester survey that finds 5 out of 6 developers are using or deploying open source.

http://www.zdnet.com/five-out-of-six-developers-now-using-or-deploying-open-source-7000008499/?s_cid=rSINGLE

In the survey they found that 7% of developers are using open source business intelligence tools such as Pentaho.

The United States Department of Labor states that, in 2010, there were 913,100 software developers in the USA alone.

http://www.bls.gov/ooh/Computer-and-Information-Technology/Software-developers.htm

7% of 913,100 means about 64,000 developers using open source business intelligence software. Nice.

Written by James

December 10, 2012 at 2:25 pm

I Never Owned Any Software To Begin With

leave a comment »

My thoughts on the whole Emily White/stealing music topic:

http://www.npr.org/blogs/allsongs/2012/06/16/154863819/i-never-owned-any-music-to-begin-with

http://thetrichordist.wordpress.com/2012/06/18/letter-to-emily-white-at-npr-all-songs-considered/

When she says she only bought 15 albums, I think she is talking about physical CDs. I think she did buy some of her music online. But she clearly states that she ripped music from the radio station and swapped mix CDs with her friends, and she makes it sound like she thinks this is not stealing.

Don’t Blame iTunes

Many people who complain about artists’ income blame Apple and iTunes. Yes, iTunes propagated the old economic splits and percentages into the digital world. But Apple did not create those splits; they were agreed upon in contracts between the labels/producers and the artists. What iTunes did was to provide an alternative digital distribution medium to Napster. Apple saved artists from the prospect of getting no revenue at all. People who attack and boycott iTunes thinking that they are helping artists are deluded.

It’s Not Just Music

This whole debate also extends to movies, books, news commentary, and software – anything that can be digitally copied. In each of these arenas the players and the economic distribution are different, but the consequence of not paying is the same. If we all behaved this way, ultimately, there would be no books or movies either. So how does this relate to proprietary software, open source software, and free software?

Proprietary Software

Just like companies that publish books, music, movies etc., proprietary software companies were the gatekeepers. They decided what software was created and made available. When the hardware and software become available at the consumer level, independent producers spring up. This happened with freeware software for PCs. The internet enables the distribution of the software, and provides methods of collecting payment. The costs of creating books, music, and movies have dropped dramatically because of the hardware and software now available. But if no-one pays for the content created, the proprietary software companies will go out of business.

Non-Proprietary

Open source and free software are other ways for creating and distributing software, the difference being that these rely on software (source and binaries) being easy to copy. Don’t steal Microsoft’s BI software and use it without permission. Use our open source BI software – we want you to.

Free Software

Free software requires that the software, and all software that is built upon it, be ‘free’. In this case ‘free’ means you can freely modify it, distribute it, and build upon it, and you give others those same rights. You can still charge for the software, but it makes no sense to (given the rights you give to your ‘customer’).

The ideals of the Free Software Foundation (FSF) are based on the notion that when you think of something or invent something, it belongs to the world; you don’t own it. This is a wonderful idea; however, most of the world, including many industries, jobs, and professions, is based on the opposite principle – if you create it, you own it. To my mind I have fewer rights under the FSF view of the world: I don’t have the right to my own ideas.

Because of the freedoms that the Free Software Foundation believe in, they are against Digital Rights Management (DRM) software. DRM tries to protect the rights of artists, producers, and distributors of artistic content. In order to protect these rights, software is needed that is proprietary. If the DRM source code were open, it would make it easy for hackers to decode the content and remove the copy protection. So the Free Software Foundation is taking up the fight against DRM, calling it ‘Digital Restrictions Management’ (http://www.fsf.org/campaigns/drm.html). They call it this because, they say, DRM takes away your right to steal other people’s inventions. If you support DRM-free software, you are choosing to fight against musicians, authors, and actors.

Open Source

The Open Source movement takes a pragmatic approach to this topic. When you have an idea, it is yours. You can choose to do whatever you want with your invention. If it is a software invention, and you choose to put it into open source, that’s great. If you choose not to, that’s fine too, because it is yours. Open Source allows hybrid models – where a producer can decide to put some of their software into open source but not all of it (the open core or freemium model). This model enables a software producer to provide something of value to people who would not have paid for anything anyway (this includes geographies and economies where the producer would not sell anyway). These people are willing participants and contributors in other ways. The producer also gets to sell whatever software products it wants.

Doomsday?

For some creative areas, if no-one pays for any content anymore, the creators will disappear eventually, and there will be no more content. But what happens if no-one pays for software anymore?

Proprietary software dies eventually, unless its makers switch to services models.

The majority of people contributing to open source/free software today are IT developers. There are two main types here: developers creating/extending/fixing software in the course of getting their own projects finished, and sponsored contributors. IT is where the majority of software developers are today, so IT/enterprise/business software is safe.

The software most at risk would be software created by smaller software companies. Particularly software that has large up-front development costs. Games. The first, and maybe only, software segment to die would be the big-budget, realistic, immersive, loud video games. Who cares most about these games? The same demographic that is stealing all the music.

I say let Generation OMG copy and steal everything they want. All the really cool and fun careers will evaporate. Lots of the stuff they love (movies, music, games) will disappear. After they have spent a decade texting each other about how sucky everything is, they will grow up and have to re-create these industries. Hopefully with better economic structures than the current ones.

leave a comment »

I’m at the MongoNYC conference in New York today, where Pentaho is a sponsor. 10gen have done a great job with this event, which has 1,000 attendees.

We just announced a strategic partnership between 10gen and Pentaho. From a technical perspective the integration between MongoDB and Pentaho means:

  • No Big Silos. Data silos are bad. Big ones are no better. Our MongoDB ETL connectors for reading and writing data mean you can integrate your MongoDB data store with the rest of your data architecture (relational databases, hosted applications, custom applications, etc).
  • Live reporting. We can provide desktop and web-based reports directly on MongoDB data.
  • Staging. We can provide trending and historical analysis by staging snapshots of MongoDB aggregations in a column store (see the sketch after this list).
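
As a rough sketch of that staging idea (using the plain pymongo driver rather than Pentaho’s actual connectors, with hypothetical collection and field names): run an aggregation inside MongoDB and append the timestamped result to a store kept for trending.

    import csv
    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # hypothetical server
    orders = client["shop"]["orders"]                   # hypothetical collection

    # Aggregate current totals per product line inside MongoDB itself.
    pipeline = [{"$group": {"_id": "$product_line",
                            "total": {"$sum": "$amount"}}}]

    # Stage the timestamped snapshot; a column store (here a CSV file stands
    # in for one) accumulates the snapshots for trending and history.
    snapshot_ts = datetime.now(timezone.utc).isoformat()
    with open("orders_by_line.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for row in orders.aggregate(pipeline):
            writer.writerow([snapshot_ts, row["_id"], row["total"]])
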

I’m looking forward to working with 10gen to integrate some of their new aggregation capabilities into Pentaho.

Written by James

May 23, 2012 at 2:22 pm