By Alberto Fernández Valiente
“Dataflow is a managed service for executing a wide variety of data processing patterns.
The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.”
| Apache Beam releases | Supported Python versions |
|---|---|
| 2.39.0 – 2.42.0 | 3.7, 3.8, 3.9 |
| 2.37.0 – 2.38.0 | 3.6, 3.7, 3.8, 3.9 |
The PCollection abstraction represents a potentially distributed, multi-element data set. You can think of a PCollection as “pipeline” data; Beam transforms use PCollection objects as inputs and outputs. As such, if you want to work with data in your pipeline, it must be in the form of a PCollection.
After you’ve created your Pipeline, you’ll need to begin by creating at least one PCollection in some form. The PCollection you create serves as the input for the first operation in your pipeline.
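As a minimal sketch (the element values and step labels are illustrative), an initial PCollection can be created from in-memory data with `beam.Create`:

```python
import apache_beam as beam

# With no options, the pipeline runs locally on the DirectRunner.
with beam.Pipeline() as pipeline:
    # beam.Create turns an in-memory iterable into a PCollection,
    # which serves as the input for the first step of the pipeline.
    lines = pipeline | 'CreateInput' >> beam.Create(['a', 'b', 'c'])
```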
A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
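For example (again a minimal sketch with illustrative data), `beam.Map` is a PTransform: it consumes an input PCollection and produces a new output PCollection with the given function applied to each element:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    numbers = pipeline | 'CreateInput' >> beam.Create([1, 2, 3])
    # 'Square' takes the 'numbers' PCollection as input and emits
    # a new PCollection containing the squared elements.
    squares = numbers | 'Square' >> beam.Map(lambda x: x * x)
    squares | 'Print' >> beam.Map(print)
```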
Dataflow initially launches an n1-standard-1 machine that is responsible for building the Apache Beam pipeline that will later be executed.
Once a valid pipeline has been built, the machine(s) of the type we have defined are launched to execute it.
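A hedged sketch of how the worker machine type can be configured through pipeline options when targeting Dataflow; the project, region, bucket and sizes below are placeholders, not recommendations:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder GCP project
    region='europe-west1',               # placeholder region
    temp_location='gs://my-bucket/tmp',  # placeholder staging bucket
    machine_type='n1-standard-4',        # worker type launched once the graph is built
    max_num_workers=4,                   # illustrative autoscaling cap
)

with beam.Pipeline(options=options) as pipeline:
    _ = (pipeline
         | 'CreateInput' >> beam.Create([1, 2, 3])
         | 'Double' >> beam.Map(lambda x: x * 2))
```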
Internally, Kubernetes is instantiated on the machines to manage the execution of tasks inside the sdk-harness containers, one per machine core.
The “work attempts” encapsulate the units of work at the PTransform level, and work is parallelized by assigning one to each sdk-harness. The number of parallel executions per worker can be limited.
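As a sketch under the assumption that these Python SDK options are available in the Beam release in use, per-worker parallelism can be capped like this; the values are illustrative only:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    # Limit how many work items each SDK harness processes concurrently.
    '--number_of_worker_harness_threads=1',
    # Run a single SDK harness container per VM instead of one per core.
    '--experiments=no_use_multiple_sdk_containers',
])
```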