Archive for the ‘Datawarehousing’ Category
We announced a strategic partnership with DataStax today: http://www.pentaho.com/press-room/releases/datastax-and-pentaho-jointly-deliver-complete-analytics-solution-for-apache-cassandra/
DataStax provides products and services for the popular Apache NoSQL database Cassandra. We are releasing our first round of Cassandra integration in our next major release, and you can download it today (see below).
Our Cassandra integration includes open source data integration steps to read from and write to Cassandra. So you can integrate Cassandra into your data architecture using Pentaho Data Integration/Kettle and avoid creating a Big Silo – all with a nice drag-and-drop graphical UI. Since our tools are integrated, you can create desktop and web-based reports directly on top of Cassandra. You can also use our tools to extract and aggregate data into a datamart for interactive exploration and analysis. We are demoing these capabilities at the Strata conference in Santa Clara this week.
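The PDI steps do all of this graphically, but the underlying read → transform → load flow can be sketched in a few lines of Python. Everything below – the row layout, field names, and values – is invented for illustration; this is not the actual Cassandra or PDI API, just the shape of the extract-and-aggregate step described above.

```python
from collections import defaultdict

# Hypothetical rows as they might come back from a Cassandra column family
# (all names and values are made up for illustration).
cassandra_rows = [
    {"key": "evt1", "page": "/home",  "region": "us", "visits": 3},
    {"key": "evt2", "page": "/home",  "region": "eu", "visits": 1},
    {"key": "evt3", "page": "/about", "region": "us", "visits": 2},
]

def aggregate_to_datamart(rows):
    """Roll raw event rows up into a (page -> total visits) datamart table."""
    mart = defaultdict(int)
    for row in rows:
        mart[row["page"]] += row["visits"]
    return dict(mart)

print(aggregate_to_datamart(cassandra_rows))  # {'/home': 4, '/about': 2}
```

In the real product this same flow is built by dragging a Cassandra input step, a transform, and an output step onto the PDI canvas rather than writing code.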
- Product downloads, how-to videos and documents are available at http://www.pentaho.com/cassandra and http://www.datastax.com/pentaho
- Attend the webinar on March 15th to learn more about using Cassandra with Pentaho Kettle: http://www.pentaho.com/datastax-webinar
- Download, access how-to documents and videos at http://community.pentaho.com/BigData
This week at Pentaho we announced a major Big Data release, including:
- Open sourcing of our big data code
- Moving Pentaho Data Integration to the Apache license
- Support for HBase, Cassandra, MongoDB, and Hadapt
- And numerous functionality and performance improvements
What does this mean for the Big Data market, for Pentaho, and for everyone else?
We believe you should use the best tool for each job. For example, you should use Hadoop or a NoSQL database where those technologies suit your purposes, and a high-performance columnar database for the use cases it is suited to. Your organization probably has applications that use traditional databases, and likely has a hosted application or two as well. Like it or not, if you have a single employee with a spreadsheet on their laptop, you have a data architecture that includes flat files. So every data architecture is a hybrid environment to some extent. To meet the requirements of your business, your IT group probably has to move, merge, and transform data between these data stores. You may have an application or two that has no external inputs or outputs, and no integration points with other applications. There is a word for these applications – silos. Silos are bad. Big data is no different. A big data store that is not integrated with your data architecture is a Big Silo. Big Silos are just as bad as regular silos, only bigger.
So when you add a big data technology to your organization, you don’t want it to be a silo. The big data capabilities of Pentaho Data Integration enable you to integrate your big data store into the rest of your data architecture. If you are using any of the big data technologies we support, you can move data into and out of these data stores using a graphical environment. Our data integration capabilities also extend to traditional databases, columnar databases, flat files, web services, hosted applications, and more. So you can easily integrate your big data application into the rest of your data architecture. This means your big data store is not a silo.
For Pentaho, the big data arena is a strategic one. These are new technologies and architectures so all the players in this space are starting from the same place. It is a great space for us because people using these technologies need tools and capabilities that are easy for us to deliver. Hadoop is especially cool because all of our tools and technologies are pure Java and are embeddable, so we can execute our engines within the data nodes and scale linearly as your data grows.
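To make the "execute our engines within the data nodes" idea concrete, here is a toy, single-process sketch of the map/shuffle/reduce flow that Hadoop distributes across a cluster – the canonical word count. None of this is Pentaho or Hadoop API code; it only illustrates the programming model that embeddable engines plug into.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    # In Hadoop, many mappers run this in parallel, one per data split.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data store", "big silo"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result)  # {'big': 2, 'data': 1, 'store': 1, 'silo': 1}
```

Because the map and reduce functions are self-contained, a pure-Java engine can be shipped to each data node and run this logic next to the data, which is why the architecture scales linearly as data grows.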
For everyone else, our tools continue to provide great bang for the buck for ETL, reporting, OLAP, predictive analytics, etc. Now we also lower the cost, time, and skill sets required to investigate big data solutions. For any one application you can divide the data architecture into two main segments: client data and server data. Client data includes things like flat files, mobile app data, cookie data, etc. Server data includes transactional/traditional databases and big data stores. I don’t see the server side as all or nothing. It could be all RDBMS, all big data store, 50/50, or any mix of the two. It’s like milk and coffee. You can have a glass of milk, a cup of coffee, or variations in between with different amounts of milk or coffee. So you can consider an application that only uses a traditional database today to be an application that currently utilizes 0% of its potential big data component. So every data architecture exists on this continuum, and we have great tools to help you if you want to step into the big data world.
If you want to find out more:
- Visit http://community.pentaho.com/BigData which has downloads, how-tos, and other resources
- Connect with the community on irc.freenode.net in ##pentaho
- Join the Pentaho Big Data technical developer mailing list to be notified about future big data product updates and related events.
- Attend the techcast on Thursday February 9th to learn more about Pentaho Kettle for Big Data, watch a live demo and hear how you can get involved. Register now at http://www.pentaho.com/resources/events/20120209-pentaho-kettle-webinar/
- Hands-on training is FREE for attendees at the 2012 Strata Conference in Santa Clara, California. Sign up for our how-to training session (http://strataconf.com/strata2012) on February 28th during the ‘Tuesday Tutorials.’ Register with Pentaho’s 20 percent discount code: str12sd20 <https://en.oreilly.com/strata2012/public/register> .
Earlier this week, at Hadoop World in New York, Pentaho announced availability of our first Hadoop release.
As part of the initial research into the Hadoop arena I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:
- 80-90% of companies are dealing with structured or semi-structured data (not unstructured).
- The source of the data is typically a single application or system.
- The data is typically sub-transactional or non-transactional.
- There are some known questions to ask of the data.
- There are many unknown questions that will arise in the future.
- There are multiple user communities that have questions of the data.
- The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.
In the past the standard way to handle reporting and analysis of this data was to identify the most interesting attributes, and to aggregate these into a data mart. There are several problems with this approach:
- Only a subset of the attributes is examined, so only pre-determined questions can be answered.
- The data is aggregated, so visibility into the lowest levels is lost.
Based on the requirements above and the problems with the traditional approach, we have created a concept called the Data Lake to describe an optimal solution.
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
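The trade-off between the bottled-water datamart and the data lake can be shown with a tiny sketch (the events and field names below are invented for illustration): a pre-aggregated mart can only answer the question it was built for, while the raw records kept in the lake can still answer questions nobody anticipated.

```python
# Raw events as they stream into the "lake" (illustrative data only).
raw_events = [
    {"user": "a", "action": "click", "ms": 120},
    {"user": "b", "action": "click", "ms": 340},
    {"user": "a", "action": "view",  "ms": 80},
]

# A datamart built up front: only "total actions per user" survives
# aggregation; the action types and latencies are thrown away.
datamart = {}
for e in raw_events:
    datamart[e["user"]] = datamart.get(e["user"], 0) + 1

# A question that arises later: average latency of clicks.
# Unanswerable from the mart, trivial from the raw records.
clicks = [e["ms"] for e in raw_events if e["action"] == "click"]
avg_click_ms = sum(clicks) / len(clicks)

print(datamart)      # {'a': 2, 'b': 1}
print(avg_click_ms)  # 230.0
```

This is exactly the "unknown future questions" requirement above: keeping the lowest-level data in the lake means new user communities can come and take their own samples.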
For more information on this concept you can watch a presentation on it here: Pentaho’s Big Data Architecture
Last week, on the same day, both Pentaho and IBM made announcements about Hadoop support. There are several interesting things about this:
- IBM’s announcement is a validation of Hadoop’s functionality, scalability and maturity. Good news.
- Hadoop, being Java, will run on AIX and on IBM hardware. In fact, Hadoop hurts the big-iron vendors. Hadoop also, to some extent, competes with IBM’s existing database offerings. But their announcement was made by their professional services group, not by their hardware or AIX groups. For IBM this is a services play.
- IBM announced their own distro of Hadoop. This requires a significant development, packaging, testing, and support investment for IBM. They are going ‘all in’, to use a poker term. The exact motivation behind this has yet to be revealed. They are offering their own tools and extensions to Hadoop, which is fair enough, but this is possible without providing a full distro of their own. Only time will tell how they maintain their internal fork or branch of Hadoop, and whether any generic code contributions make it out of Big Blue into the Hadoop projects.
- IBM is making a play for Big Data, which, in conjunction with their cloud/grid initiatives, makes perfect sense. When it comes to cloud computing, the cost of renting hardware is gradually converging with the price of electricity. But with the rise of the cloud, an existing problem is compounded. Web-based applications generate a wealth of event-based data. This data is hard enough to analyze when you have it on-premise, and it quickly eclipses the size of the transactional data. When this data is generated in a cloud environment, the problem is worse: you don’t even have the data locally, and moving it will cost you. IBM is attempting a land grab: cloud + Hadoop + IBM services (with or without IBM hardware, OS, and databases). They recognize that running apps in the cloud and storing data in the cloud are easy; analyzing that data is harder and therefore more valuable.
Pentaho’s announcement was similar in some ways, different in others:
- Like IBM, we recognize the needs and opportunities.
- Technology-wise, Pentaho has a suite of tools, engines, and products that are much better suited for Hadoop integration, being pure Java and designed to be embedded.
- Pentaho has no plans to release our own distro of Hadoop. Any changes we make to Hadoop, Hive, etc. will be contributed to Apache.
- And lastly, but no less importantly, Pentaho announced first.
When it comes to other players:
- Microsoft is apparently making Hadoop ready for Azure, but Hadoop is currently not recommended for production use on Windows. It will be interesting to see how these facts resolve themselves.
- Oracle/Sun has the ability to read from the Hadoop file system and has a proprietary Map/Reduce capability, but no compelling Hadoop support yet. In direct conflict with the scale-out mentality of Hadoop, in a recent Wired interview Larry Ellison talked about Oracle’s new hardware:
The machine costs more than $1 million, stands over 6 feet tall, is two feet wide and weighs a full ton. It is capable of storing vast quantities of data, allowing businesses to analyze information at lightning-fast speeds or instantly process commercial transactions.
- HP, Dell etc are probably picking up some business providing the commodity hardware for Hadoop installations, but don’t yet have a discernible vision.
Looks like the Oracle acquisition of Sun is helping MySQL – according to Zack Urlocker (MySQL Marketing VP) via Twitter:
Set a new record in lead gen last week. More than 15x what we were doing 3 months ago. Quantity and quality are both improving.
Maybe the threats of forks and rebellions are premature, particularly if the leads increase sales, and sales increase engineering resources. Whether the recent increase in transparency and openness continues at MySQL might be the bigger question.
In response to Mark Madsen’s post:
Doesn’t DW-in-the-cloud suffer from the same fundamental problem as DW-as-a-Service in that you have to pump all of your proprietary, strategic, highly sensitive data outside of the firewall onto someone else’s hardware?