com.hazelcast.jet.pipeline (Hazelcast Root 6.0.0-SNAPSHOT API)

package com.hazelcast.jet.pipeline

The Pipeline API is Jet's high-level API to build and execute distributed computation jobs. It models the computation using an analogy with a system of interconnected water pipes. The data flows from the pipeline's sources to its sinks. Pipes can bifurcate and merge, but there can't be any closed loops (cycles).

The basic element is a pipeline stage which can be attached to one or more other stages, both in the upstream and the downstream direction. A stage accepts the data coming from its upstream stages, transforms it, and directs the resulting data to its downstream stages.

Kinds of transformation performed by pipeline stages

Basic

Basic transformations have a single upstream stage and statelessly transform individual items in it. Examples are map, filter, and flatMap.
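
To make "stateless, item-by-item" concrete, here is a minimal sketch that models the three basic transformations with the JDK's java.util.stream API rather than with Jet itself. The class and method names are hypothetical and chosen for illustration only; a Jet pipeline would express the same steps as map, filter and flatMap on a stage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain-Java model of stateless transforms; not the Jet API.
class BasicTransformsSketch {
    // flatMap: one line becomes zero or more words.
    // filter:  empty words are dropped.
    // map:     each remaining word is upper-cased.
    // Every step looks at a single item and keeps no state between items.
    static List<String> upperCaseWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> !word.isEmpty())
                .map(word -> word.toUpperCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(upperCaseWords(List.of("hello jet", "", "pipeline api")));
    }
}
```

Because no step depends on earlier items, the items could be processed in any order or on any cluster member without changing the result for an individual item.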

Grouping and aggregation

The aggregate*() transformations perform an aggregate operation on a set of items. You can call stage.groupingKey() to group the items by a key and then Jet will aggregate each group separately. For stream stages you must specify stage.window(), which transforms the infinite stream into a series of finite windows. If you specify more than one input stage for the aggregation (using stage.aggregate2(), stage.aggregate3() or stage.aggregateBuilder()), the data from all streams will be combined into the aggregation result. The AggregateOperation you supply must define a separate accumulate primitive for each contributing stream. Refer to its Javadoc for further details.
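
The interplay of grouping key and window can be sketched in plain Java. The model below is an illustration, not the Jet API: it assumes fixed-size, non-overlapping windows, a hypothetical Event type, and a simple count as the aggregate operation, and it ignores how Jet actually tracks event time in a running stream.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a windowed group-and-aggregate; not the Jet API.
class WindowedCountSketch {
    // Hypothetical event type: a timestamp in milliseconds and a grouping key.
    static final class Event {
        final long timestamp;
        final String key;

        Event(long timestamp, String key) {
            this.timestamp = timestamp;
            this.key = key;
        }
    }

    // Assigns each event to the window [start, start + windowSize) that
    // contains its timestamp, then counts events per (window start, key).
    static Map<Long, Map<String, Long>> countPerWindowAndKey(List<Event> events, long windowSize) {
        Map<Long, Map<String, Long>> result = new TreeMap<>();
        for (Event e : events) {
            long windowStart = Math.floorDiv(e.timestamp, windowSize) * windowSize;
            result.computeIfAbsent(windowStart, k -> new TreeMap<>())
                  .merge(e.key, 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(1, "AA"), new Event(4, "BB"),
                new Event(7, "AA"), new Event(12, "AA"));
        System.out.println(countPerWindowAndKey(events, 10));
    }
}
```

Each window is a finite set of items, which is what makes an aggregate over an otherwise infinite stream well defined.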

Hash-Join

The hash-join is a joining transform designed for the use case of data enrichment with static data. It is an asymmetric join that joins the enriching stage(s) to the primary stage. The enriching stages must be batch stages — they must represent finite datasets. The primary stage may be either a batch or a stream stage.

You must provide a separate pair of functions for each of the enriching stages: one to extract the key from the primary item and one to extract it from the enriching item. For example, you can join a Trade with a Broker on trade.getBrokerId() == broker.getId() and a Product on trade.getProductId() == product.getId(), and all this can happen in a single hash-join transform.

The hash-join transform is optimized for throughput — each cluster member materializes a local copy of all the enriching data, stored in hashtables (hence the name). It consumes the enriching streams in full before ingesting any data from the primary stream.
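
The two phases described above (materialize the enriching data first, then stream the primary items past it) can be modeled in plain Java. This is a simplified sketch, not Jet's implementation: the Trade, Broker and Product classes are hypothetical, each enriching key is assumed to be unique, and everything runs in a single JVM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of the build and probe phases; not the Jet API.
class HashJoinPhasesSketch {
    static final class Trade {
        final int id, brokerId, productId;
        Trade(int id, int brokerId, int productId) {
            this.id = id; this.brokerId = brokerId; this.productId = productId;
        }
    }

    static final class Broker {
        final int id; final String name;
        Broker(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Product {
        final int id; final String name;
        Product(int id, String name) { this.id = id; this.name = name; }
    }

    // Build phase: consume an enriching dataset in full into a hashtable,
    // keyed by the function that extracts the key from the enriching item.
    static <K, E> Map<K, E> buildTable(List<E> enriching, Function<E, K> keyFn) {
        Map<K, E> table = new HashMap<>();
        for (E item : enriching) {
            table.put(keyFn.apply(item), item);
        }
        return table;
    }

    static List<String> enrich(List<Trade> trades, List<Broker> brokers, List<Product> products) {
        Map<Integer, Broker> brokersById = buildTable(brokers, b -> b.id);
        Map<Integer, Product> productsById = buildTable(products, p -> p.id);

        // Probe phase: only now are primary items processed. Each lookup uses
        // the key extracted from the primary item (brokerId, productId).
        List<String> out = new ArrayList<>();
        for (Trade t : trades) {
            Broker b = brokersById.get(t.brokerId);
            Product p = productsById.get(t.productId);
            out.add(t.id + ":" + (b == null ? "null" : b.name)
                    + ":" + (p == null ? "null" : p.name));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(enrich(
                List.of(new Trade(1, 10, 100), new Trade(2, 11, 999)),
                List.of(new Broker(10, "Alice"), new Broker(11, "Bob")),
                List.of(new Product(100, "Widget"))));
    }
}
```

The lookups are cheap because the tables are local, at the cost of holding every enriching item in memory.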

The output of hashJoin is just like an SQL left outer join: for each primary item there are N output items, one for each matching item in the enriching set. If an enriching set doesn't have a matching item, the output will have a null instead of the enriching item.

If you need an SQL inner join, you can use the specialised innerHashJoin function. For each primary item with at least one match it produces N output items, one for each matching item in the enriching set. If an enriching set doesn't have a matching item, there will be no records with the given primary item. In this case the output function's arguments are always non-null.
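
The difference between the two flavours comes down to what happens to an unmatched primary item. The following plain-Java sketch models only that difference; it is not the Jet API, the names are hypothetical, primary items are represented by their join key alone, and each key is assumed to match at most one enriching item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of left-outer versus inner join output; not the Jet API.
class OuterVsInnerJoinSketch {
    // With inner == false an unmatched primary item is still emitted, with
    // null in place of the enriching item. With inner == true it is dropped.
    static List<String> join(List<String> primaryKeys, Map<String, String> enriching, boolean inner) {
        List<String> out = new ArrayList<>();
        for (String key : primaryKeys) {
            String match = enriching.get(key);
            if (match == null && inner) {
                continue;
            }
            out.add(key + " -> " + match);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> tickers = Map.of("AA", "Ticker AA");
        System.out.println(join(List.of("AA", "CC"), tickers, false));
        System.out.println(join(List.of("AA", "CC"), tickers, true));
    }
}
```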

The join also allows duplicate keys on both enriching and primary inputs: the output is a cartesian product of all the matching entries.

Example:

 +------------------------+-----------------+---------------------------+
 |     Primary input      | Enriching input |          Output           |
 +------------------------+-----------------+---------------------------+
 | Trade{ticker=AA,amt=1} | Ticker{id=AA}   | Tuple2{                   |
 | Trade{ticker=BB,amt=2} | Ticker{id=BB}   |   Trade{ticker=AA,amt=1}, |
 | Trade{ticker=AA,amt=3} |                 |   Ticker{id=AA}           |
 |                        |                 | }                         |
 |                        |                 | Tuple2{                   |
 |                        |                 |   Trade{ticker=BB,amt=2}, |
 |                        |                 |   Ticker{id=BB}           |
 |                        |                 | }                         |
 |                        |                 | Tuple2{                   |
 |                        |                 |   Trade{ticker=AA,amt=3}, |
 |                        |                 |   Ticker{id=AA}           |
 |                        |                 | }                         |
 +------------------------+-----------------+---------------------------+
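
The table can be reproduced with a small plain-Java model that also shows what happens when the enriching side contains duplicate keys. This is a sketch under stated assumptions, not the Jet API: Trade is a hypothetical class, tickers are represented by their id only, and the output tuples are rendered as strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a hash-join with duplicate keys; not the Jet API.
class DuplicateKeyJoinSketch {
    static final class Trade {
        final String ticker;
        final int amt;

        Trade(String ticker, int amt) {
            this.ticker = ticker;
            this.amt = amt;
        }

        @Override
        public String toString() {
            return "Trade{ticker=" + ticker + ",amt=" + amt + "}";
        }
    }

    static List<String> hashJoin(List<Trade> trades, List<String> tickerIds) {
        // Keep every enriching item per key, so duplicate keys are preserved.
        Map<String, List<String>> table = new HashMap<>();
        for (String id : tickerIds) {
            table.computeIfAbsent(id, k -> new ArrayList<>()).add("Ticker{id=" + id + "}");
        }
        List<String> out = new ArrayList<>();
        for (Trade t : trades) {
            List<String> matches = table.getOrDefault(t.ticker, List.of());
            if (matches.isEmpty()) {
                // Left-outer behaviour: keep the trade, pair it with null.
                out.add("Tuple2{" + t + ", null}");
            }
            // One output item per (primary item, matching enriching item) pair.
            for (String ticker : matches) {
                out.add("Tuple2{" + t + ", " + ticker + "}");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Trade> trades = List.of(new Trade("AA", 1), new Trade("BB", 2), new Trade("AA", 3));
        hashJoin(trades, List.of("AA", "BB")).forEach(System.out::println);
    }
}
```

With the inputs from the table the model emits three tuples, in the order of the primary input. Passing a second ticker with id AA would produce two tuples for each AA trade, five tuples in total.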

Since: Jet 3.0

Related Packages

Package

Description

com.hazelcast.jet

Hazelcast Jet is a distributed computation engine running on top of Hazelcast IMDG technology.

com.hazelcast.jet.pipeline.file

This package offers the FileSourceBuilder which allows you to construct various kinds of Pipeline sources that read from local or distributed files.

com.hazelcast.jet.pipeline.test

This package contains various mock sources to help with pipeline testing and development.

Class

Description

AggregateBuilder<R0>

Offers a step-by-step API to build a pipeline stage that co-aggregates the data from several input stages.

AggregateBuilder1<T0>

Offers a step-by-step API to build a pipeline stage that co-aggregates the data from several input stages.

BatchSource<T>

A finite source of data for a Jet pipeline.

BatchStage<T>

A stage in a distributed computation pipeline that will observe a finite amount of data (a batch).

BatchStageWithKey<T,K>

An intermediate step while constructing a group-and-aggregate batch pipeline stage.

DataConnectionRef

Represents a reference to the data connection, used with Sources.jdbc(DataConnectionRef, ToResultSetFunction, FunctionEx).

FileSinkBuilder<T>

See Sinks.filesBuilder(java.lang.String).

FileSourceBuilder

Builder for a file source which reads lines from files in a directory (but not its subdirectories) and emits the output objects created by mapOutputFn.

GeneralHashJoinBuilder<T0>

Offers a step-by-step fluent API to build a hash-join pipeline stage.

GeneralStage<T>

The common aspect of batch and stream pipeline stages, defining those operations that apply to both.

GeneralStageWithKey<T,K>

An intermediate step when constructing a group-and-aggregate pipeline stage.

GroupAggregateBuilder<K,R0>

Offers a step-by-step API to build a pipeline stage that co-groups and aggregates the data from several input stages.

GroupAggregateBuilder1<T0,K>

Offers a step-by-step API to build a pipeline stage that co-groups and aggregates the data from several input stages.

HashJoinBuilder<T0>

Offers a step-by-step fluent API to build a hash-join pipeline stage.

JdbcPropertyKeys

This class defines property keys that can be passed to JDBC connector.

JdbcSinkBuilder<T>

See Sinks.jdbcBuilder().

JmsSinkBuilder<T>

See Sinks.jmsQueueBuilder(com.hazelcast.function.SupplierEx<jakarta.jms.ConnectionFactory>) or Sinks.jmsTopicBuilder(com.hazelcast.function.SupplierEx<jakarta.jms.ConnectionFactory>).

JmsSourceBuilder

See Sources.jmsQueueBuilder(com.hazelcast.function.SupplierEx<? extends jakarta.jms.ConnectionFactory>) or Sources.jmsTopicBuilder(com.hazelcast.function.SupplierEx<? extends jakarta.jms.ConnectionFactory>).

JoinClause<K,T0,T1,T1_OUT>

Specifies how to join an enriching stream to the primary stream in a hash-join operation.

JournalInitialPosition

When passed to an IMap/ICache Event Journal source, specifies which event to start from.

MapSinkBuilder<T,K,V>

Builder for a map that is used as sink.

MapSinkEntryProcessorBuilder<E,K,V,R>

Parameters for using a map as a sink with an EntryProcessor.

Pipeline

Models a distributed computation job using an analogy with a system of interconnected water pipes.

RemoteMapSourceBuilder<K,V,T>

Builder providing a fluent API to build a remote map source.

ServiceFactories

Utility class with methods that create several useful service factories.

ServiceFactory<C,S>

A holder of functions needed to create and destroy a service object used in pipeline transforms such as stage.mapUsingService().

SessionWindowDefinition

Represents the definition of a session window.

Sink<T>

A data sink in a Jet pipeline.

SinkBuilder<C,T>

See SinkBuilder.sinkBuilder(String, FunctionEx).

Sinks

Contains factory methods for various types of pipeline sinks.

SinkStage

A pipeline stage that doesn't allow any downstream stages to be attached to it.

SlidingWindowDefinition

Represents the definition of a sliding window.

SourceBuilder<C>

Top-level class for Jet custom source builders.

SourceBuilder.SourceBuffer<T>

The buffer object that the fillBufferFn gets on each call.

SourceBuilder.TimestampedSourceBuffer<T>

The buffer object that the fillBufferFn gets on each call.

Sources

Contains factory methods for various types of pipeline sources.

Stage

The basic element of a Jet pipeline, represents a computation step.

StageWithKeyAndWindow<T,K>

Represents an intermediate step in the construction of a pipeline stage that performs a windowed group-and-aggregate operation.

StageWithWindow<T>

Represents an intermediate step in the construction of a pipeline stage that performs a windowed aggregate operation.

StreamHashJoinBuilder<T0>

Offers a step-by-step fluent API to build a hash-join pipeline stage.

StreamSource<T>

An infinite source of data for a Jet pipeline.

StreamSourceStage<T>

A source stage in a distributed computation pipeline that will observe an unbounded amount of data (i.e., an event stream).

StreamStage<T>

A stage in a distributed computation pipeline that will observe an unbounded amount of data (i.e., an event stream).

StreamStageWithKey<T,K>

An intermediate step while constructing a pipeline transform that involves a grouping key, such as windowed group-and-aggregate.

WindowAggregateBuilder<R0>

Offers a step-by-step fluent API to build a pipeline stage that performs a windowed co-aggregation of the data from several input stages.

WindowAggregateBuilder1<T0>

Offers a step-by-step fluent API to build a pipeline stage that performs a windowed co-aggregation of the data from several input stages.

WindowDefinition

The definition of the window for a windowed aggregation operation.

WindowGroupAggregateBuilder<K,R0>

Offers a step-by-step API to build a pipeline stage that performs a windowed co-grouping and aggregation of the data from several input stages.

WindowGroupAggregateBuilder1<T0,K>

Offers a step-by-step API to build a pipeline stage that performs a windowed co-grouping and aggregation of the data from several input stages.
