In 2011, the first edition of “Pentaho Data Integration Cookbook” was published. At the time, the book was of real interest to PDI (Pentaho Data Integration) developers, as it provided relevant answers for many of the common tasks that have to be carried out in data warehousing processes.
Two years later, the data market has evolved greatly. Big Data is one of the major trends, and PDI now includes numerous new features to connect to and use Hadoop and NoSQL databases.
The idea behind the second edition is to include some of the brand new tasks required to tame Big Data using PDI and to update the content of the previous edition. Alex Meadows, from Red Hat, has joined the previous authors (María Carina Roldan and Adrián Sergio Pulvirenti) for this second edition. María is the author of four books about Pentaho Data Integration.
What is Pentaho Data Integration?
I’m sure that many of you already know it. For those who don’t, PDI is an open source Swiss Army knife of tools to extract, move, transform and load data.
What is this book?
To put it simply, it includes practical, handy recipes for many of the everyday situations a PDI developer faces. All recipes follow the same schema:
- State the problem
- Create a transformation or job to solve the problem
- Explain in detail and provide potential pitfalls
What is new?
One question a potential reader might ask is: if I already have the first edition, is this new edition worth reading? If you are a Pentaho Data Integration developer, the short answer is yes, mainly because the book includes new chapters and sections on Big Data and Business Analytics, technologies that are becoming core corporate capabilities in the information age.
So, in my humble opinion, the most interesting chapters are:
- Chapter 3: where the reader will have the chance to learn how to load data into, and get data from, Hadoop, HBase and MongoDB.
- Chapter 12: where the reader will learn how to read data from a SAS data file, create statistics from a data stream and build a random data sample for Weka.
What is missing or could be improved?
More screenshots, for one; some readers will probably think the same. To be honest, while I’m happy with chapters 3 and 12, it would be interesting to have more content on these topics. So, let’s put it this way: I am counting down the days until the next edition.
In summary, this is an interesting book for PDI and data warehousing practitioners that gives some information about how to use PDI for Big Data and Analytics. If you are interested you can find it here.