As with the functional interfaces, Jet also includes distributed versions of the classes found in java.util.stream.Collectors. These can be reached via DistributedCollectors.
However, keep in mind that collectors and terminal operations such as toList() and toArray() create a local data structure and load all the results into it. This works fine with the regular JDK streams, where everything is local, but usually fails badly in the context of a distributed computing job.
For example, the following innocent-looking code can easily cause out-of-memory errors, because the whole distributed map must be held in memory in a single place:
// get a distributed map with 5GB per node on a 10-node cluster
IStreamMap<String, String> map = jet.getMap("large_map");
// now try to build a HashMap of 50GB
Map<String, String> result = map.stream()
        .map(e -> e.getKey() + e.getValue())
        .collect(toMap(v -> v, v -> v));
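The same accumulation happens, at small scale, with plain JDK streams: toMap() always materializes every entry into a single local HashMap, so memory cost grows with the full input size. A minimal JDK-only sketch (no Jet types involved; the class name and entry count are illustrative):

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LocalAccumulation {

    // Builds one local HashMap holding every stream element; the
    // memory footprint is proportional to the full input size.
    static Map<String, String> collectAll(int entries) {
        return IntStream.range(0, entries)
                .mapToObj(i -> "key-" + i)
                .collect(Collectors.toMap(Function.identity(), Function.identity()));
    }

    public static void main(String[] args) {
        Map<String, String> result = collectAll(100_000);
        // All 100,000 entries now live in a single in-memory map.
        System.out.println(result.size());
    }
}
```

With a distributed source holding far more data than one member's heap, this accumulation step is exactly what fails.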
This is why Jet distinguishes between the standard collectors and the Jet-specific Reducers. A Reducer puts the data into a distributed data structure and knows how to leverage its native partitioning scheme to optimize the access pattern across the cluster.
These are some of the Reducer implementations provided in Jet:

- toIMap(): writes the data to a new Hazelcast IMap.
- groupingByToIMap(): performs a grouping operation and then writes the results to a Hazelcast IMap. This uses a more efficient implementation than the standard groupingBy() collector and can make use of partitioning.
- toIList(): writes the data to a new Hazelcast IList.
A distributed data structure is cluster-managed, so you can't just create one and forget about it: it will live on until you explicitly destroy it. That makes it inappropriate for use as part of a data item inside a larger collection, and a further consequence is that a Reducer is inappropriate as a downstream collector; that is where the JDK-standard collectors make sense.
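In the downstream position a collector's result is a plain local object consumed immediately by the enclosing grouping, which is why the JDK-standard collectors fit there. A JDK-only sketch of that role (the class name and word data are illustrative):

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DownstreamCollectors {

    // Groups words by first letter and counts each group. counting() is
    // used as a downstream collector: its result is an ordinary local
    // Long, not a cluster-managed structure that would outlive the job.
    static Map<String, Long> countByFirstLetter(Stream<String> words) {
        return words.collect(Collectors.groupingBy(
                w -> w.substring(0, 1),
                Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                countByFirstLetter(Stream.of("apple", "avocado", "banana"));
        System.out.println(counts); // {a=2, b=1}
    }
}
```

Swapping a Reducer into the counting() position would try to create a new distributed structure per group, each needing explicit destruction, which is why Jet keeps Reducers at the top level only.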