Dataflow, todo es posible si tienes imaginación

Como adaptar el framework a tu problema

Por Alberto Fernández Valiente

Sobre mí

Ingeniero Técnico en Informática de Sistemas por la Universidad de Sevilla
Más de 15 años de carrera profesional
PSF Contributing member
DSF Individual member

¿Qué es Dataflow?

Google Cloud

“Dataflow is a managed service for executing a wide variety of data processing patterns.

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.”

Supported runtimes

Apache Beam releases	Supported Python versions
2.39.0 – 2.42.0	3.7, 3.8, 3.9
2.37.0 – 2.38.0	3.6, 3.7, 3.8, 3.9

Hablemos de Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines.

Pipeline A pipeline is a user-constructed graph of transformations that defines the desired data processing operations.
PCollection A PCollection is a data set or data stream. The data that a pipeline processes is part of a PCollection.
PTransform A PTransform (or transform) represents a data processing operation, or a step, in your pipeline. A transform is applied to zero or more PCollection objects, and produces zero or more PCollection objects.

Ejemplo

PCollection

The PCollection abstraction represents a potentially distributed, multi-element data set. You can think of a PCollection as “pipeline” data; Beam transforms use PCollection objects as inputs and outputs. As such, if you want to work with data in your pipeline, it must be in the form of a PCollection.

After you’ve created your Pipeline, you’ll need to begin by creating at least one PCollection in some form. The PCollection you create serves as the input for the first operation in your pipeline.

PTransform

A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.

Flex templates

Creación de pipelines a demanda

La plantilla se define con un fichero JSON dentro de un Google Cloud Storage.
Se puede lanzar un job de Dataflow desde línea de comando o mediante código.
Todos los jobs tienen un nombre, que puede estar duplicado.

Launcher y Workers

Dataflow lanza inicialmente una máquina n1-standard-1 que se encarga de crear el pipeline de Apache Beam que luego será ejecutado.

Una vez se ha creado un pipeline correcto se lanzará la/s máquina/s del tipo que hayamos definido para que se ejecute.

Internamente se instancia un Kubernetes dentro de las máquinas para gestionar la ejecución de las tareas dentro de los sdk-harness, uno por cada núcleo de la máquina.

Los “work attempts” encapsulan las unidades de trabajo a nivel PTransform y se paraleliza asignando uno a cada sdk-harness. Se puede limitar el número de ejecuciones paralelas por cada worker.

Show me the code

Conclusiones

Pros

Permite ejecutar tareas de duración ilimitada a petición dentro de GCloud
Solo se necesita generar una imagen Docker
Puedes movilizar muchas máquinas potentes con autoescalado

Cons

Máximo 100 jobs en paralelo
Tarda 7-10 minutos en arrancar
Hay que pensar como Apache Beam
Complejo de depurar y optimizar
Peor rendimiento