James Dixon’s Blog

James Dixon’s thoughts on commercial open source and open source business intelligence

Pentaho Labs Apache Spark Integration

with one comment

Today at Pentaho we announced our first official support for Spark . The integration we are launching enables Spark jobs to be orchestrated using Pentaho Data Integration so that Spark can be coordinated with the rest of your data architecture. The original prototypes for number of different Spark integrations were done in Pentaho Labs, where we view Spark as an evolving technology. The fact that Spark is an evolving technology is important.

Let’s look at a different evolving technology that might be more familiar – Hadoop. Yahoo created Hadoop for a single purpose ­ indexing the Internet. In order to accomplish this goal, it needed a distributed file system (HDFS) and a parallel execution engine to create the index (MapReduce). The concept of MapReduce was taken from Google’s white paper titled MapReduce: Simplified Data Processing on Large Clusters. In that paper, Dean and Ghemawat describe examples of problems that can be easily expressed as MapReduce computations. If you look at the examples, they are all very similar in nature. Examples include finding lines of text that contain a word, counting the numbers of URLs in documents, listing the names of documents that contain URLs to other documents and sorting words found in documents.

When Yahoo released Hadoop into open source, it became very popular because the idea of an open source platform that used a scale-out model with a built-in processing engine was very appealing. People implemented all kinds of innovative solutions using MapReduce, including tasks such as facial recognition in streaming video. They did this using MapReduce despite the fact that it was not designed for this task, and that forcing tasks into the MapReduce format was clunky and inefficient. The fact that people attempted these implementations is a testament to the attractiveness of the Hadoop platform as a whole in spite of the limited flexibility of MapReduce.

What has happened to Hadoop since those days? A number of important things, like security have been added, showing that Hadoop has evolved to meet the needs of large enterprises. A SQL engine, Hive, was added so that Hadoop could act as a database platform. A No-SQL engine, HBase, was added. In Hadoop 2, Yarn was added. Yarn allows other processing engines to execute on the same footing as MapReduce. Yarn is a reaction to the sentiment that “Hadoop overall is great, but MapReduce was not designed for my use case.” Each of these new additions (security, Hive, HBase, Yarn, etc.) is at a different level of maturity and has gone through its own evolution.

As we can see, Hadoop has come a long way since it was created for a specific purpose. Spark is evolving in a similar way. Spark was created as a scalable in-memory solution for a data scientist. Note, a single data scientist. One. Since then Spark has acquired the ability to answer SQL queries, added some support for multi-user/concurrency, and the ability to run computations against streaming data using micro-batches. So Spark is evolving in a similar way to Hadoop’s history over the last 10 years. The worlds of Hadoop and Spark also overlap in other ways. Spark itself has no storage layer. It makes sense to be able to run Spark inside of Yarn so that HDFS can be used to store the data, and Spark can be used as the processing engine on the data nodes using Yarn. This is an option that has been available since late 2012.

In Pentaho Labs, we continually evaluate both new technologies and evolving technologies to assess their suitability for enterprise-grade data transformation and analytics. We have created prototypes demonstrating our analysis capabilities using Spark SQL as a data source and running Pentaho Data Integration transformations inside the Spark engine in streaming and non-streaming modes. Our announcement today is the productization of one Spark use case, and we will continue to evaluate and release more support as Spark continues to grow and evolve.

Pentaho Data Integration for Apache Spark will be GA in June 2015. You can learn more about Spark innovation in Pentaho Labs here: www.pentaho.com/labs.

I will also lead a webinar on Apache Spark on Tuesday June 2, 2015 at 10am/pt (1pm/ET). You can register at http://events.pentaho.com/pentaho-labs-apache-spark-registration.html.

Written by James

May 12, 2015 at 1:41 pm

Posted in Uncategorized

One Response

Subscribe to comments with RSS.

  1. Reblogged this on Pentaho Business Analytics Blog and commented:
    Pentaho Labs has been busy in the lab and we are excited to reveal their latest innovation: Pentaho integration for Apache Spark.

    James Dixon, Pentaho CTO and Lab Scientist shares his thought on Apache Spark on his blog: Pentaho Labs Apache Spark Integration. Overall he thinks Spark has promise, but it needs some work. Read more about:
    • What he/Pentaho thinks Sparks solves
    • Some promising use cases
    • What Pentaho supports today

    rebeccapentaho

    May 14, 2015 at 4:43 pm


Leave a comment