Class KinesisSources

java.lang.Object
com.hazelcast.jet.kinesis.KinesisSources

public final class KinesisSources extends Object
Contains factory methods for creating Amazon Kinesis Data Streams (KDS) sources.
Since:
Jet 4.4
  • Field Details

    • MILLIS_BEHIND_LATEST_METRIC

      public static final String MILLIS_BEHIND_LATEST_METRIC
      Name of the metric exposed by the source, used to monitor if reading a specific shard is delayed or not. The value represents the number of milliseconds the last read record is behind the tip of the stream.
      See Also:
  • Method Details

    • kinesis

      @Nonnull public static KinesisSources.Builder<Map.Entry<String,byte[]>> kinesis(@Nonnull String stream)
      Initiates the building of a streaming source that consumes a Kinesis data stream and emits Map.Entry<String, byte[]> items.

      Each emitted item represents a Kinesis record, the basic unit of data stored in KDS. A record is composed of a sequence number, partition key, and data blob. This source does not expose the sequence numbers, uses them only internally, for replay purposes.

      The partition key is used to group related records and route them inside Kinesis. Partition keys are Unicode strings with a maximum length of 256 characters. They are specified by the data producer while adding data to KDS.

      The data blob of the record is provided in the form of a byte array, in essence, serialized data (Kinesis doesn't handle serialization internally). The maximum size of the data blob is 1 MB.

      The source is distributed. Each instance is consuming data from zero, one or more Kinesis shards, the base throughput units of KDS. One shard provides a capacity of 2MB/sec data output. Items with the same partition key always come from the same shard; one shard contains multiple partition keys. Items coming from the same shard are ordered, and the source preserves this order (except on resharding). The local parallelism of the source is not defined, so it will depend on the number of cores available.

      If a processing guarantee is specified for the job, Jet will periodically save the current shard offsets internally and then replay from the saved offsets when the job is restarted. If no processing guarantee is enabled, the source will start reading from the oldest available data, determined by the KDS retention period (defaults to 24 hours, can be as long as 365 days). This replay capability makes the source suitable for pipelines with both at-least-once and exactly-once processing guarantees.

      The source can provide native timestamps, in the sense that they are read from KDS, but be aware that they are Kinesis ingestion times, not event times in the strictest sense.

      As stated before, the source preserves the ordering inside shards. However, Kinesis supports resharding, which lets you adjust the number of shards in your stream to adapt to changes in data flow rate through the stream. When resharding happens, the source cannot preserve the order among the last items of a shard being destroyed and the first items of new shards being created. Watermarks might also experience unexpected behaviour during resharding. Best to do resharding when the data flow is stopped, if possible.

      NOTE. In Kinesis terms, this source is a "shared throughput consumer." This means that all the limitations on data read, imposed by Kinesis (at most 5 read transaction per second, at most 2MiB of data per second) apply not on a per-source basis but for all sources at once. If you start only one pipeline with a source for a Kinesis stream, then the source implementation will assure that no limit is tripped. But if you run multiple pipelines with sources for the same Kinesis stream, these sources cannot coordinate the limits among themselves. It will still work, but you will see occasional warning messages in the logs about rates and limitations being tripped.

      The source provides a metric called "millisBehindLatest", which specifies, for each shard, the source is actively reading data from if there is any delay in reading the data.

      Parameters:
      stream - name of the Kinesis stream being consumed by the source
      Returns:
      fluent builder that can be used to set properties and also to construct the source once configuration is done