Skip navigation links

Package com.hazelcast.jet

The Pipeline API is Jet's high-level API to build and execute distributed computation jobs.

See: Description

Package com.hazelcast.jet Description

The Pipeline API is Jet's high-level API to build and execute distributed computation jobs. It models the computation using an analogy with a system of interconnected water pipes. The data flows from the pipeline's sources to its sinks. Pipes can bifurcate and merge, but there can't be any closed loops (cycles).

The basic element is a pipeline stage which can be attached to one or more other stages, both in the upstream and the downstream direction. A stage accepts the data coming from its upstream stages, transforms it, and directs the resulting data to its downstream stages.

Kinds of transformation performed by pipeline stages

Basic

Basic transformations have a single upstream stage and statelessly transform individual items in it. Examples are map, filter, and flatMap.

Grouping and aggregation

The groupBy transformation groups items by key and performs an aggregate operation on each group. It outputs the results of the aggregate operation, one for each observed distinct key.

The coGroup transformation groups items by key in several streams at once and performs an aggregate operation on all groups that share the same key, separately for each key. It outputs the results of the aggregate operation, one for each observed distinct key.

Hash-join

Hash-join is a special kind of joining transform, specifically tailored to the use case of data enrichment. It is an asymmetrical join that distinguishes the primary upstream stage from the enriching stages. The source for an enriching stage is most typically a key-value store (such as a Hazelcast IMap). Its data stream must be finite and each item must have a distinct join key. The primary stage, on the other hand, may be infinite and contain duplicate keys.

For each of the enriching stages there is a separate pair of functions to extract the joining key on both sides. For example, a Trade can be joined with both a Broker on trade.getBrokerId() == broker.getId() and a Product on trade.getProductId() == product.getId(), and all this can happen in a single hash-join transform.

Implementationally, the hash-join transform is optimized for throughput so that each computing member has a local copy of all the enriching data, stored in hashtables (hence the name). The enriching streams are consumed in full before ingesting any data from the primary stream.

Skip navigation links

Copyright © 2017 Hazelcast, Inc.. All Rights Reserved.