public final class HadoopSources extends Object
Modifier and Type | Field and Description |
---|---|
static String |
COPY_ON_READ
With the new HDFS API, some of the
RecordReader s return the same
key/value instances for each record, for example LineRecordReader . |
static String |
IGNORE_FILE_NOT_FOUND |
static String |
SHARED_LOCAL_FS
When reading files from local file system using Hadoop, each processor
reads files from its own local file system.
|
Modifier and Type | Method and Description |
---|---|
static <K,V> BatchSource<Map.Entry<K,V>> |
inputFormat(org.apache.hadoop.conf.Configuration jobConf)
Convenience for
inputFormat(Configuration, BiFunctionEx)
with Map.Entry as its output type. |
static <K,V,E> BatchSource<E> |
inputFormat(org.apache.hadoop.conf.Configuration configuration,
BiFunctionEx<K,V,E> projectionFn)
Returns a source that reads records from Apache Hadoop HDFS and emits
the results of transforming each record (a key-value pair) with the
supplied projection function.
|
static <K,V,E> BatchSource<E> |
inputFormat(ConsumerEx<org.apache.hadoop.conf.Configuration> configureFn,
BiFunctionEx<K,V,E> projectionFn)
Returns a source that reads records from Apache Hadoop HDFS and emits
the results of transforming each record (a key-value pair) with the
supplied projection function.
|
public static final String COPY_ON_READ
RecordReader
s return the same
key/value instances for each record, for example LineRecordReader
.
If this property is set to true
, the source makes a copy of each
object after applying the projectionFn
. For readers which create
a new instance for each record, the source can be configured to not copy
the objects for performance.
Also if you are using a projection function which doesn't refer to any
mutable state from the key or value, then it makes sense to set this
property to false
to avoid unnecessary copying.
The source copies the objects by serializing and de-serializing them. The
objects should be either Writable
or serializable in a way which
Jet can serialize/de-serialize.
Here is how you can configure the source. Default and always safe value is
true
:
Configuration conf = new Configuration();
conf.setBoolean(HadoopSources.COPY_ON_READ, false);
BatchSource<Entry<K, V>> source = HadoopSources.inputFormat(conf);
public static final String SHARED_LOCAL_FS
true
.
Here is how you can configure the source. Default value is false
:
Configuration conf = new Configuration();
conf.setBoolean(HadoopSources.SHARED_LOCAL_FS, true);
BatchSource<Entry<K, V>> source = HadoopSources.inputFormat(conf);
public static final String IGNORE_FILE_NOT_FOUND
@Nonnull public static <K,V,E> BatchSource<E> inputFormat(@Nonnull org.apache.hadoop.conf.Configuration configuration, @Nonnull BiFunctionEx<K,V,E> projectionFn)
This source splits and balances the input data among Jet processors, doing its best to achieve data locality. To this end the Jet cluster topology should be aligned with Hadoop's — on each Hadoop member there should be a Jet member.
The processor will use either the new or the old MapReduce API based on
the key which stores the InputFormat
configuration. If it's
stored under , the new API
will be used. Otherwise, the old API will be used. If you get the
configuration from JobContextImpl.getConfiguration()
, the new API will be
used. Please see COPY_ON_READ
if you are using the new API.
The default local parallelism for this processor is 2 (or less if less CPUs are available).
This source does not save any state to snapshot. If the job is restarted, all entries will be emitted again.
K
- key type of the recordsV
- value type of the recordsE
- the type of the emitted valueconfiguration
- JobConf for reading files with the appropriate
input format and pathprojectionFn
- function to create output objects from key and value.
If the projection returns a null
for an item, that item
will be filtered out@Nonnull public static <K,V,E> BatchSource<E> inputFormat(@Nonnull ConsumerEx<org.apache.hadoop.conf.Configuration> configureFn, @Nonnull BiFunctionEx<K,V,E> projectionFn)
This source splits and balances the input data among Jet processors, doing its best to achieve data locality. To this end the Jet cluster topology should be aligned with Hadoop's — on each Hadoop member there should be a Jet member.
The configureFn
is used to configure the MR Job. The function is
run on the coordinator node of the Jet Job, avoiding contacting the server
from the machine where the job is submitted.
The new MapReduce API will be used.
The default local parallelism for this processor is 2 (or less if less CPUs are available).
This source does not save any state to snapshot. If the job is restarted, all entries will be emitted again.
K
- key type of the recordsV
- value type of the recordsE
- the type of the emitted valueconfigureFn
- function to configure the MR jobprojectionFn
- function to create output objects from key and value.
If the projection returns a null
for an item, that item
will be filtered out@Nonnull public static <K,V> BatchSource<Map.Entry<K,V>> inputFormat(@Nonnull org.apache.hadoop.conf.Configuration jobConf)
inputFormat(Configuration, BiFunctionEx)
with Map.Entry
as its output type.Copyright © 2022 Hazelcast, Inc.. All rights reserved.