public final class HadoopSources extends Object
|Modifier and Type||Field and Description|
public static final String COPY_ON_READ

|Modifier and Type||Method and Description|
Returns a source that reads records from Apache Hadoop HDFS and emits the results of transforming each record (a key-value pair) with the supplied projection function.

public static final String COPY_ON_READ

With the new HDFS API, some of the RecordReaders return the same key/value instances for each record, for example LineRecordReader. If this property is set to true, the source makes a copy of each object after applying the projectionFn. For readers which create a new instance for each record, the source can be configured not to copy the objects, for better performance.

Also, if you are using a projection function which doesn't refer to any mutable state from the key or value, it makes sense to set this to false to avoid unnecessary copying.

The source copies the objects by serializing and de-serializing them. The objects should be either Writable or serializable in a way which Jet can serialize/de-serialize.

Here is how you can configure the source. The default, and always safe, value is true:

    Configuration conf = new Configuration();
    conf.set(HadoopSources.COPY_ON_READ, "false");
    BatchSource<Entry<K, V>> source = HadoopSources.inputFormat(conf);
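To see why copying matters when a reader reuses instances, here is a minimal, self-contained sketch of the idea. It is an assumption-level mimic, not Jet's actual implementation: it copies an object by serializing and de-serializing it, the same technique the source uses, and shows how a reference into a reused mutable instance loses the earlier record while the copy preserves it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CopyOnReadSketch {

    // Copy an object by serializing and de-serializing it, the same
    // technique the source uses when COPY_ON_READ is true.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copy(T obj) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a reader that reuses one mutable instance for
        // every record (as LineRecordReader does with its value object).
        StringBuilder reused = new StringBuilder("record-1");

        // Without copying, a reference taken now observes later mutations:
        StringBuilder aliased = reused;
        // With copying, the snapshot is independent:
        StringBuilder copied = copy(reused);

        // The reader overwrites the instance with the next record.
        reused.setLength(0);
        reused.append("record-2");

        System.out.println(aliased); // record-2: the first record was lost
        System.out.println(copied);  // record-1: preserved by the copy
    }
}
```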
@Nonnull
public static <K,V,E> BatchSource<E> inputFormat(
        @Nonnull org.apache.hadoop.conf.Configuration configuration,
        @Nonnull com.hazelcast.function.BiFunctionEx<K,V,E> projectionFn)
This source splits and balances the input data among Jet processors, doing its best to achieve data locality. To this end the Jet cluster topology should be aligned with Hadoop's — on each Hadoop member there should be a Jet member.
The processor will use either the new or the old MapReduce API, depending on the key under which the InputFormat configuration is stored. If it's stored under the new API's key, the new API will be used; otherwise, the old API will be used. If you get the configuration from JobContextImpl.getConfiguration(), the new API will be used. Please see COPY_ON_READ if you are using the new API.
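As a rough illustration, the detection boils down to checking which configuration key holds the InputFormat class name. The sketch below is a simplified stand-in for that logic, not Jet's code; it assumes Hadoop's standard keys, mapreduce.job.inputformat.class for the new API (the value of MRJobConfig.INPUT_FORMAT_CLASS_ATTR) and mapred.input.format.class for the old one:

```java
import java.util.HashMap;
import java.util.Map;

public class ApiDetectionSketch {

    // Hadoop's standard configuration keys (assumed here for illustration).
    static final String NEW_API_KEY = "mapreduce.job.inputformat.class";
    static final String OLD_API_KEY = "mapred.input.format.class";

    // Simplified stand-in for the processor's decision: the new MapReduce
    // API is chosen when the InputFormat is stored under the new-API key.
    static boolean usesNewApi(Map<String, String> conf) {
        return conf.containsKey(NEW_API_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> newStyle = new HashMap<>();
        newStyle.put(NEW_API_KEY,
                "org.apache.hadoop.mapreduce.lib.input.TextInputFormat");

        Map<String, String> oldStyle = new HashMap<>();
        oldStyle.put(OLD_API_KEY,
                "org.apache.hadoop.mapred.TextInputFormat");

        System.out.println(usesNewApi(newStyle)); // true
        System.out.println(usesNewApi(oldStyle)); // false
    }
}
```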
The default local parallelism for this processor is 2 (or less if fewer CPUs are available).
This source does not save any state to snapshot. If the job is restarted, all entries will be emitted again.
Type Parameters:
K - key type of the records
V - value type of the records
E - the type of the emitted value

Parameters:
configuration - JobConf for reading files with the appropriate input format and path
projectionFn - function to create output objects from the key and value. If the projection returns null for an item, that item will be filtered out.
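To make the null-filtering contract concrete, here is a standalone sketch of a projection function applied to (key, value) records, where a null result drops the record. The `project` helper is hypothetical stand-in code written for this example, not part of the HadoopSources API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class ProjectionFnSketch {

    // Stand-in for applying projectionFn to each (key, value) record:
    // a null result filters the record out of the emitted stream.
    static <K, V, E> List<E> project(List<K> keys, List<V> values,
                                     BiFunction<K, V, E> projectionFn) {
        List<E> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            E projected = projectionFn.apply(keys.get(i), values.get(i));
            if (projected != null) { // null => item is filtered out
                out.add(projected);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Keys mimic byte offsets; values mimic lines of a text file.
        List<Long> keys = List.of(0L, 10L, 20L);
        List<String> lines = List.of("alpha", "", "beta");

        // Drop empty lines by returning null from the projection.
        List<String> result = project(keys, lines,
                (offset, line) -> line.isEmpty() ? null : offset + ":" + line);

        System.out.println(result); // [0:alpha, 20:beta]
    }
}
```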
Copyright © 2020 Hazelcast, Inc. All rights reserved.