Book Review – “Instant Pentaho Data Integration Kitchen”

A new year, a new book about Pentaho, and more often than not a book about Pentaho Data Integration (PDI). This may sound like a cliché, but it has held true for at least the last few years. Some of you may think this is a bad thing, but it’s not. If there is one piece of software in the whole Pentaho stack that is worth the time to learn and master, it is PDI, a.k.a. Kettle to old friends. Don’t doubt it, not for a second. If you want to be a data scientist, it is a good skill to add to your toolbox.

So, when Packt Publishing offered me the chance to review Sergio Ramazzina’s new book, “Instant Pentaho Data Integration Kitchen”, it was a big yes. A no-brainer.

For those who still don’t know what Pentaho Data Integration is, the simplest answer is: it’s an open source ETL tool created 10 years ago by Matt Casters.

About the book

Like many other books about PDI, this one starts by explaining what PDI is and adds a brief summary of its history. As many of you already know, PDI is quite a powerful tool, but mastering all its features takes time and commitment before you are able to design enterprise-level ETL processes.

This book can help with that goal. Starting with how to create a simple transformation and a simple job (the two types of ETL processes in PDI), the book provides valuable information, tips and insights on the command line, the repository, the execution log, and the scheduling of jobs and transformations. To put it plainly, it helps you master some of the most important features you need when PDI is the main ETL tool in a project. The chapter “Scheduling PDI jobs and transformations” is particularly interesting and useful.
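To give a flavour of what running a job from the command line and scheduling it can look like, here is a minimal sketch in Python. It only builds the command and a crontab entry as strings; it does not run PDI. The job path, the parameter name, the log path and both helper functions are made up for illustration and do not come from the book. The Kitchen options used (-file, -level, -param) are the ones I know from the Kitchen command line, but check them against the documentation of your PDI version.

```python
import shlex


def build_kitchen_command(job_file, log_level="Basic", params=None, kitchen="kitchen.sh"):
    """Build the argument list for running a PDI job file with Kitchen.

    Hypothetical helper: it only assembles strings, nothing is executed.
    """
    args = [kitchen, f"-file={job_file}", f"-level={log_level}"]
    # Named parameters are passed as -param:NAME=VALUE, sorted for a stable order.
    for name, value in sorted((params or {}).items()):
        args.append(f"-param:{name}={value}")
    return args


def crontab_line(schedule, args, log_path):
    """Turn a command into a crontab entry that appends all output to a log file."""
    command = " ".join(shlex.quote(arg) for arg in args)
    return f"{schedule} {command} >> {shlex.quote(log_path)} 2>&1"


# Assumed example values: a nightly job at 02:00 with one named parameter.
cmd = build_kitchen_command(
    "/opt/etl/load_sales.kjb",
    params={"LOAD_DATE": "2013-07-01"},
)
line = crontab_line("0 2 * * *", cmd, "/var/log/etl/load_sales.log")
print(line)
```

Keeping the command as a list of arguments and quoting only when the crontab line is built avoids surprises with paths that contain spaces.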

With its straightforward narrative, this short book is easy to read and, in my humble opinion, could be an interesting addition to your PDI library if you are looking for a quick guide.

However, it should be said that if you are looking for a book that describes the data warehousing process and how to use PDI for it, this is not your book.

Evolution of ETL processes

A few years ago, when we talked about ETL, we meant only the following:

  • Data extraction processes.
  • Data transformation processes.
  • Data loading processes.
  • Metadata management.
  • Administration and operational services.

Today we need to talk about data integration as the evolution of ETL processes, and under this umbrella we have:

  • Data access services.
  • Data profiling.
  • Data quality.
  • Operational data processing.
  • Transformation services: CDC, SCD, validation, aggregation.
  • Real-time access.
  • ETL.
  • EII.
  • EAI.
  • Data transport.
  • Metadata management.
  • Delivery services.

In later posts we will discuss some of the aspects that make up the current view of data integration. And now, in your organizations: do you just cook your data, or do you enrich it?