Package com.hazelcast.jet.pipeline
The basic element is a pipeline stage, which can be attached to one or more other stages in both the upstream and the downstream direction. A stage accepts the data coming from its upstream stages, transforms it, and directs the resulting data to its downstream stages.
Kinds of transformation performed by pipeline stages
Basic
Basic transformations have a single upstream stage and statelessly transform individual items in it. Examples are map, filter, and flatMap.
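As a rough illustration of what "stateless, item-by-item" means, the sketch below uses java.util.stream as an analogy. It is not the Jet API; the class and method names are hypothetical and chosen only for this example. Each function looks at one item at a time and keeps no state between items.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Analogy only: java.util.stream stands in for a single upstream stage.
class BasicTransformsSketch {

    // map: exactly one output item for each input item
    static List<String> toUpperCase(List<String> lines) {
        return lines.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // filter: each input item is either passed through unchanged or dropped
    static List<String> nonEmpty(List<String> lines) {
        return lines.stream().filter(line -> !line.isEmpty()).collect(Collectors.toList());
    }

    // flatMap: each input item expands into zero or more output items
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }
}
```

Because none of these functions depends on previously seen items, each one can be applied to the items of a distributed stream in any partitioning.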
Grouping and aggregation
The aggregate*() transformations perform an aggregate operation on a set of items. You can call stage.groupingKey() to group the items by a key, and Jet will then aggregate each group separately. For stream stages you must specify a stage.window(), which transforms the infinite stream into a series of finite windows. If you specify more than one input stage for the aggregation (using stage.aggregate2(), stage.aggregate3() or stage.aggregateBuilder()), the data from all streams will be combined into the aggregation result. The AggregateOperation you supply must define a separate accumulate primitive for each contributing stream. Refer to its Javadoc for further details.
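The idea of one accumulator per group with a separate accumulate primitive per input can be sketched in plain Java. This is a conceptual model under stated assumptions, not Jet's AggregateOperation API: the class CoAggregateSketch, the Counts accumulator, and the two inputs (user IDs of page views and of purchases) are all hypothetical, and the grouping key is simply the user ID itself.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model of a two-input group-and-aggregate; not the Jet API.
class CoAggregateSketch {

    /** Mutable accumulator holding the running state of one group. */
    static final class Counts {
        long views;
        long purchases;
    }

    // One accumulate primitive per contributing input.
    static void accumulateView(Counts acc, String userId) { acc.views++; }
    static void accumulatePurchase(Counts acc, String userId) { acc.purchases++; }

    /** Groups both inputs by user ID and returns [views, purchases] per user. */
    static Map<String, List<Long>> coAggregate(List<String> viewUserIds,
                                               List<String> purchaseUserIds) {
        Map<String, Counts> groups = new TreeMap<>();
        for (String id : viewUserIds) {
            accumulateView(groups.computeIfAbsent(id, k -> new Counts()), id);
        }
        for (String id : purchaseUserIds) {
            accumulatePurchase(groups.computeIfAbsent(id, k -> new Counts()), id);
        }
        // Export step: turn each accumulator into an immutable result.
        Map<String, List<Long>> result = new TreeMap<>();
        groups.forEach((id, acc) -> result.put(id, List.of(acc.views, acc.purchases)));
        return result;
    }
}
```

Both inputs feed the same accumulator for a given key, which is why the combined result can relate data from all contributing inputs.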
Hash-Join
The hash-join is a joining transform designed for the use case of data enrichment with static data. It is an asymmetric join that joins the enriching stage(s) to the primary stage. The enriching stages must be batch stages — they must represent finite datasets. The primary stage may be either a batch or a stream stage.
You must provide a separate pair of functions for each of the enriching stages: one to extract the key from the primary item and one to extract it from the enriching item. For example, you can join a Trade with a Broker on trade.getBrokerId() == broker.getId() and a Product on trade.getProductId() == product.getId(), and all this can happen in a single hash-join transform.
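The Trade, Broker and Product example can be modelled in plain Java to show how each enriching dataset gets its own pair of key-extracting functions. This is a sketch of the semantics only, not the Jet API: the domain classes are hypothetical and reduced to their key fields, enriching keys are assumed to be unique, and a missing match is represented by null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Conceptual model of one primary input joined with two enriching inputs.
class TwoWayHashJoinSketch {

    // Hypothetical domain classes, reduced to the fields used as join keys.
    static final class Trade {
        final int brokerId;
        final String productId;
        Trade(int brokerId, String productId) { this.brokerId = brokerId; this.productId = productId; }
    }
    static final class Broker {
        final int id;
        Broker(int id) { this.id = id; }
    }
    static final class Product {
        final String id;
        Product(String id) { this.id = id; }
    }

    /** Indexes one enriching dataset by its join key (assumes unique keys). */
    static <K, E> Map<K, E> buildTable(List<E> enriching, Function<E, K> enrichingKeyFn) {
        Map<K, E> table = new HashMap<>();
        for (E item : enriching) {
            table.put(enrichingKeyFn.apply(item), item);
        }
        return table;
    }

    /** Returns one [trade, broker, product] triple per trade; a missing match is null. */
    static List<List<Object>> joinTrades(List<Trade> trades, List<Broker> brokers, List<Product> products) {
        // One pair of key functions per enriching dataset.
        Function<Trade, Integer> tradeBrokerKey = t -> t.brokerId;
        Function<Broker, Integer> brokerKey = b -> b.id;
        Function<Trade, String> tradeProductKey = t -> t.productId;
        Function<Product, String> productKey = p -> p.id;

        // Both lookup tables are fully built before any trade is processed.
        Map<Integer, Broker> brokersById = buildTable(brokers, brokerKey);
        Map<String, Product> productsById = buildTable(products, productKey);

        List<List<Object>> out = new ArrayList<>();
        for (Trade trade : trades) {
            Broker broker = brokersById.get(tradeBrokerKey.apply(trade));
            Product product = productsById.get(tradeProductKey.apply(trade));
            out.add(Arrays.asList(trade, broker, product)); // Arrays.asList permits nulls
        }
        return out;
    }
}
```

The two joins are independent lookups against separate tables, which is why they can be performed in a single pass over the primary input.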
The hash-join transform is optimized for throughput — each cluster member materializes a local copy of all the enriching data, stored in hashtables (hence the name). It consumes the enriching streams in full before ingesting any data from the primary stream.
The output of hashJoin is just like an SQL left outer join: for each primary item there are N output items, one for each matching item in the enriching set. If an enriching set doesn't have a matching item, the output will have a null instead of the enriching item.
If you need an SQL inner join, you can use the specialised innerHashJoin function, in which for each primary item with at least one match there are N output items, one for each matching item in the enriching set. If an enriching set doesn't have a matching item, there will be no records with the given primary item. In this case the output function's arguments are always non-null.
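The difference between the left-outer and the inner variant can be shown with a small plain-Java model. This is a sketch of the semantics described above, not the implementation of hashJoin or innerHashJoin: the class name, the join method and its inner flag are hypothetical, and output pairs are represented with java.util.AbstractMap.SimpleEntry because it permits a null value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Conceptual model of left-outer vs. inner hash-join output.
class OuterVsInnerJoinSketch {

    static <T, E, K> List<Map.Entry<T, E>> join(
            List<T> primary, Function<T, K> primaryKeyFn,
            List<E> enriching, Function<E, K> enrichingKeyFn,
            boolean inner) {
        // Build phase: a key may map to several enriching items.
        Map<K, List<E>> table = new HashMap<>();
        for (E item : enriching) {
            table.computeIfAbsent(enrichingKeyFn.apply(item), k -> new ArrayList<>()).add(item);
        }
        // Probe phase: emit one pair per (primary item, matching enriching item).
        List<Map.Entry<T, E>> out = new ArrayList<>();
        for (T item : primary) {
            List<E> matches = table.getOrDefault(primaryKeyFn.apply(item), Collections.emptyList());
            if (matches.isEmpty()) {
                if (!inner) {
                    out.add(new SimpleEntry<>(item, null)); // outer join keeps the item, paired with null
                }
                continue; // inner join drops the unmatched item
            }
            for (E match : matches) {
                out.add(new SimpleEntry<>(item, match));
            }
        }
        return out;
    }
}
```

With the flag set to false, every primary item appears in the output at least once. With it set to true, unmatched primary items disappear and no output pair contains a null.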
The join also allows duplicate keys on both the enriching and the primary inputs: the output is a Cartesian product of all the matching entries.
Example:
+------------------------+-----------------+---------------------------+
| Primary input          | Enriching input | Output                    |
+------------------------+-----------------+---------------------------+
| Trade{ticker=AA,amt=1} | Ticker{id=AA}   | Tuple2{                   |
| Trade{ticker=BB,amt=2} | Ticker{id=BB}   |   Trade{ticker=AA,amt=1}, |
| Trade{ticker=AA,amt=3} |                 |   Ticker{id=AA}           |
|                        |                 | }                         |
|                        |                 | Tuple2{                   |
|                        |                 |   Trade{ticker=BB,amt=2}, |
|                        |                 |   Ticker{id=BB}           |
|                        |                 | }                         |
|                        |                 | Tuple2{                   |
|                        |                 |   Trade{ticker=AA,amt=3}, |
|                        |                 |   Ticker{id=AA}           |
|                        |                 | }                         |
+------------------------+-----------------+---------------------------+
Since: Jet 3.0
Class | Description
AggregateBuilder<R0> | Offers a step-by-step API to build a pipeline stage that co-aggregates the data from several input stages.
 | Offers a step-by-step API to build a pipeline stage that co-aggregates the data from several input stages.
BatchSource<T> | A finite source of data for a Jet pipeline.
BatchStage<T> | A stage in a distributed computation pipeline that will observe a finite amount of data (a batch).
BatchStageWithKey<T,K> | An intermediate step while constructing a group-and-aggregate batch pipeline stage.
 | Represents a reference to the data connection, used with Sources.jdbc(DataConnectionRef, ToResultSetFunction, FunctionEx).
 | Builder for a file source which reads lines from files in a directory (but not its subdirectories) and emits output object created by mapOutputFn
 | Offers a step-by-step fluent API to build a hash-join pipeline stage.
GeneralStage<T> |
GeneralStageWithKey<T,K> | An intermediate step when constructing a group-and-aggregate pipeline stage.
GroupAggregateBuilder<K,R0> | Offers a step-by-step API to build a pipeline stage that co-groups and aggregates the data from several input stages.
GroupAggregateBuilder1<T0,K> | Offers a step-by-step API to build a pipeline stage that co-groups and aggregates the data from several input stages.
HashJoinBuilder<T0> | Offers a step-by-step fluent API to build a hash-join pipeline stage.
 | This class defines property keys that can be passed to JDBC connector.
 | See Sinks.jdbcBuilder().
JoinClause<K,T0,T1,T1_OUT> | Specifies how to join an enriching stream to the primary stream in a hash-join operation.
 | When passed to an IMap/ICache Event Journal source, specifies which event to start from.
MapSinkBuilder<T,K,V> | Builder for a map that is used as sink.
MapSinkEntryProcessorBuilder<E,K,V,R> | Parameters for using a map as a sink with an EntryProcessor: TODO review if this can be merged with MapSinkEntryProcessorBuilder, add full javadoc if not
 | Models a distributed computation job using an analogy with a system of interconnected water pipes.
RemoteMapSourceBuilder<K,V,T> | Builder providing a fluent API to build a remote map source.
 | Utility class with methods that create several useful service factories.
ServiceFactory<C,S> | A holder of functions needed to create and destroy a service object used in pipeline transforms such as stage.mapUsingService().
 | Represents the definition of a session window.
Sink<T> | A data sink in a Jet pipeline.
SinkBuilder<C,T> |
 | Contains factory methods for various types of pipeline sinks.
 | A pipeline stage that doesn't allow any downstream stages to be attached to it.
 | Represents the definition of a sliding window.
 | Top-level class for Jet custom source builders.
 | The buffer object that the fillBufferFn gets on each call.
 | The buffer object that the fillBufferFn gets on each call.
 | Contains factory methods for various types of pipeline sources.
 | The basic element of a Jet pipeline, represents a computation step.
 | Represents an intermediate step in the construction of a pipeline stage that performs a windowed group-and-aggregate operation.
 | Represents an intermediate step in the construction of a pipeline stage that performs a windowed aggregate operation.
 | Offers a step-by-step fluent API to build a hash-join pipeline stage.
StreamSource<T> | An infinite source of data for a Jet pipeline.
 | A source stage in a distributed computation pipeline that will observe an unbounded amount of data (i.e., an event stream).
StreamStage<T> | A stage in a distributed computation pipeline that will observe an unbounded amount of data (i.e., an event stream).
StreamStageWithKey<T,K> | An intermediate step while constructing a pipeline transform that involves a grouping key, such as windowed group-and-aggregate.
 | Offers a step-by-step fluent API to build a pipeline stage that performs a windowed co-aggregation of the data from several input stages.
 | Offers a step-by-step fluent API to build a pipeline stage that performs a windowed co-aggregation of the data from several input stages.
 | The definition of the window for a windowed aggregation operation.
 | Offers a step-by-step API to build a pipeline stage that performs a windowed co-grouping and aggregation of the data from several input stages.
 | Offers a step-by-step API to build a pipeline stage that performs a windowed co-grouping and aggregation of the data from several input stages.