By example, if we want to improve performance of a Pentaho PDI ETL process in a MongoDB Input step by using parallelism, we could specify the "Number of Copies to Start" (right click on MongoDB step) to 10:
Only with this configuration we will have 10 steps consuming the same dataset, we need something to partition the dataset, so we could use the "mod" function over a field matching with the step instance copy at execution time, we´ll do this in the "Query" tab of the step:
So, at execution time, this javascript criteria will allow to retrieve subsets of the original dataset based on a mod partition over the parallelism specified and the step instance copy (parameter Internal.Step.CopyNr) .
Whe need a field with numbers with high cardinality to make this technique effective.
Cool post, buddy.That was exactly that I needed.
ResponderEliminarThank you so much