Archive for May 2015
Today at Pentaho we announced our first official support for Spark . The integration we are launching enables Spark jobs to be orchestrated using Pentaho Data Integration so that Spark can be coordinated with the rest of your data architecture. The original prototypes for number of different Spark integrations were done in Pentaho Labs, where we view Spark as an evolving technology. The fact that Spark is an evolving technology is important.
Let’s look at a different evolving technology that might be more familiar – Hadoop. Yahoo created Hadoop for a single purpose indexing the Internet. In order to accomplish this goal, it needed a distributed file system (HDFS) and a parallel execution engine to create the index (MapReduce). The concept of MapReduce was taken from Google’s white paper titled MapReduce: Simplified Data Processing on Large Clusters. In that paper, Dean and Ghemawat describe examples of problems that can be easily expressed as MapReduce computations. If you look at the examples, they are all very similar in nature. Examples include finding lines of text that contain a word, counting the numbers of URLs in documents, listing the names of documents that contain URLs to other documents and sorting words found in documents.
When Yahoo released Hadoop into open source, it became very popular because the idea of an open source platform that used a scale-out model with a built-in processing engine was very appealing. People implemented all kinds of innovative solutions using MapReduce, including tasks such as facial recognition in streaming video. They did this using MapReduce despite the fact that it was not designed for this task, and that forcing tasks into the MapReduce format was clunky and inefficient. The fact that people attempted these implementations is a testament to the attractiveness of the Hadoop platform as a whole in spite of the limited flexibility of MapReduce.
What has happened to Hadoop since those days? A number of important things, like security have been added, showing that Hadoop has evolved to meet the needs of large enterprises. A SQL engine, Hive, was added so that Hadoop could act as a database platform. A No-SQL engine, HBase, was added. In Hadoop 2, Yarn was added. Yarn allows other processing engines to execute on the same footing as MapReduce. Yarn is a reaction to the sentiment that “Hadoop overall is great, but MapReduce was not designed for my use case.” Each of these new additions (security, Hive, HBase, Yarn, etc.) is at a different level of maturity and has gone through its own evolution.
As we can see, Hadoop has come a long way since it was created for a specific purpose. Spark is evolving in a similar way. Spark was created as a scalable in-memory solution for a data scientist. Note, a single data scientist. One. Since then Spark has acquired the ability to answer SQL queries, added some support for multi-user/concurrency, and the ability to run computations against streaming data using micro-batches. So Spark is evolving in a similar way to Hadoop’s history over the last 10 years. The worlds of Hadoop and Spark also overlap in other ways. Spark itself has no storage layer. It makes sense to be able to run Spark inside of Yarn so that HDFS can be used to store the data, and Spark can be used as the processing engine on the data nodes using Yarn. This is an option that has been available since late 2012.
In Pentaho Labs, we continually evaluate both new technologies and evolving technologies to assess their suitability for enterprise-grade data transformation and analytics. We have created prototypes demonstrating our analysis capabilities using Spark SQL as a data source and running Pentaho Data Integration transformations inside the Spark engine in streaming and non-streaming modes. Our announcement today is the productization of one Spark use case, and we will continue to evaluate and release more support as Spark continues to grow and evolve.
Pentaho Data Integration for Apache Spark will be GA in June 2015. You can learn more about Spark innovation in Pentaho Labs here: www.pentaho.com/labs.
I will also lead a webinar on Apache Spark on Tuesday June 2, 2015 at 10am/pt (1pm/ET). You can register at http://events.pentaho.com/pentaho-labs-apache-spark-registration.html.
Open Source: In praise of the profiteering enterprise, the greedy freeloader, and the selfish developer
Recently, Matt Asay talked about a number of different issues causing conflict in the free and open source world in a piece titled “The new struggles facing open source” and comes to the conclusion that currently the biggest problem is the role of enterprises controlling projects (he says “controlling the community” which is impossible, if you’ve every tried it). He makes a lot of sense but I don’t entirely agree with his points about the role of businesses in open source being detrimental. Hadoop, Spark, Storm, Kafka, Hive, HBase etc all came from enterprises that still employ the majority of the core contributors in most cases. Why did these companies create these technologies? Not for philanthropy. Not for the greater good. For better profit via better infrastructure. Having created those technologies they decided to open source them. For the greater good? No. For lower maintenance, and better profits, with side benefits of better mindshare and easier recruiting. Did these companies open source their domain-specific intellectual property that is the basis of their business? No, and they never will. They only open sourced internally developed infrastructure that is tangential to their business. Do these companies believe that all ideas inherently belong to the people of the world? No. They put into open source what was in their best interest to do so. Self-interest all round. Score 1 point for greed.
In another piece titled “Enterprises still miss the real point of open source” Matt argues that enterprises, while they are using a lot of open source, still don’t get it. He finishes with:
Again, merely using open source isn’t enough. Contributions are required.
But let’s look at “The rise and rise of open source” by Simon Phipps. This is a review of Black Duck’s most recent “The Future Of Open Source” survey. The net result is that across all the important metrics usage of open source for running businesses and creating products is now over 50% for the first time. Some of the merits are still rising rapidly. 78% of respondents report they are running their business with open source software. Indicating that an approach based on, and using, open source is now the mainstream, and that purely proprietary approaches are now the minority. As a result InfoWorld is stopping their open source special interest channel, because it is now the mainstream. Yay for open source.
But who are these companies that make up these statistics, that represent the majority of businesses. Are they all contributing to open source? The survey indicates that while 78% of businesses are running on open source, only 64% of those say they are contributing to open source. What do we call the greedy who use open source but do not contribute to it? They are the Freeloaders. Matt Asay says they need to contribute. I say they already have. If the freeloaders weren’t using open source, only 49.92% (78% * 64%) of companies would be running their business on open source. In other words the only reason we can claim today that open source is the mainstream is by the actions of the (apparently) non-contributing the freeloaders. But isn’t tipping the balance of the overall market from proprietary to open source a contribution in itself? Of course it is. The act of merely using open source software displaces a proprietary alternative, and is a contribution in its own right. No matter how little you contribute, even the greedy who contribute nothing, still make a contribution. Score another point for greed.
So now lets look at the people to do contribute. The vast majority of these are paid contributors employed by enterprises and IT/software developers trying to get their job done and to rollout a product or feature. These activities include creating features, fixing bugs, translating, testing etc (the list is long). Enterprises fund these activities for several reasons including:
- Getting a product to market
- Lowering development costs
- Lowering license fees
- Improving time to market
- Employee retention
- Increasing mindshare and thought leadership
Developers fix bugs and contribute them to a project because they don’t want to re-apply the bug in every future release. This is self-serving behavior. Do I care who controls or directs the project? Nope. The core contributors of the project accept the bug to increase quality, which helps adoption, which grows the project. This is self-serving behavior. One of the greatest and most powerful things about open source is that everyone can act out of self-interest, and everyone gains from everyone else acting selfishly. This makes the model very strong. Score another point for greed.
Final score: Philanthropy 0, Greed 3
If open source is ultimately driven by greed and self-interest, how is it any better than proprietary software development? Because it is an inherently better way to develop software, and in so many ways that the fight isn’t even close. Is it philosophically better? Yes, I believe the fundamental principles of open source are better than proprietary development. But is it morally better? No. The underlying power of open source over proprietary development is that greed is naturally converted into useful contribution, whereas with proprietary development greed translates into channel conflict, price fixing, monopolies, class action suits, vendor lock-in, and inefficient, low-quality, bloated software.
Open source rules the day. But philanthropy and the believe that all ideas belong to the world did not get it there.