Class HadoopSources
- Since:
- Jet 3.0
-
Field Summary
- static final String COPY_ON_READ: With the new HDFS API, some of the RecordReaders return the same key/value instances for each record, for example LineRecordReader.
- static final String IGNORE_FILE_NOT_FOUND
- static final String SHARED_LOCAL_FS: When reading files from the local file system using Hadoop, each processor reads files from its own local file system.
-
Method Summary
- static <K,V,E> BatchSource<E> inputFormat(ConsumerEx<org.apache.hadoop.conf.Configuration> configureFn, BiFunctionEx<K,V,E> projectionFn): Returns a source that reads records from Apache Hadoop HDFS and emits the results of transforming each record (a key-value pair) with the supplied projection function.
- static <K,V> BatchSource<Map.Entry<K,V>> inputFormat(org.apache.hadoop.conf.Configuration jobConf): Convenience for inputFormat(Configuration, BiFunctionEx) with Map.Entry as its output type.
- static <K,V,E> BatchSource<E> inputFormat(org.apache.hadoop.conf.Configuration configuration, BiFunctionEx<K,V,E> projectionFn): Returns a source that reads records from Apache Hadoop HDFS and emits the results of transforming each record (a key-value pair) with the supplied projection function.
-
Field Details
-
COPY_ON_READ
With the new HDFS API, some of the RecordReaders return the same key/value instances for each record, for example LineRecordReader. If this property is set to true, the source makes a copy of each object after applying the projectionFn. For readers which create a new instance for each record, the source can be configured not to copy the objects, for performance.

Also, if you are using a projection function which doesn't refer to any mutable state from the key or value, it makes sense to set this property to false to avoid unnecessary copying.

The source copies the objects by serializing and de-serializing them. The objects should be either Writable or serializable in a way which Jet can serialize/de-serialize.

Here is how you can configure the source. The default, and always safe, value is true:

    Configuration conf = new Configuration();
    conf.setBoolean(HadoopSources.COPY_ON_READ, false);
    BatchSource<Entry<K, V>> source = HadoopSources.inputFormat(conf);
- See Also:
-
SHARED_LOCAL_FS
When reading files from the local file system using Hadoop, each processor reads files from its own local file system. If the local file system is shared between members, e.g. an NFS-mounted file system, you should set this property to true.

Here is how you can configure the source. The default value is false:

    Configuration conf = new Configuration();
    conf.setBoolean(HadoopSources.SHARED_LOCAL_FS, true);
    BatchSource<Entry<K, V>> source = HadoopSources.inputFormat(conf);
- Since:
- Jet 4.4
- See Also:
-
IGNORE_FILE_NOT_FOUND
- Since:
- Jet 4.4
- See Also:
-
-
Method Details
-
inputFormat
@Nonnull
public static <K,V,E> BatchSource<E> inputFormat(@Nonnull org.apache.hadoop.conf.Configuration configuration, @Nonnull BiFunctionEx<K,V,E> projectionFn)
Returns a source that reads records from Apache Hadoop HDFS and emits the results of transforming each record (a key-value pair) with the supplied projection function.

This source splits and balances the input data among Jet processors, doing its best to achieve data locality. To this end the Jet cluster topology should be aligned with Hadoop's: on each Hadoop member there should be a Jet member.

The processor will use either the new or the old MapReduce API based on the key which stores the InputFormat configuration. If it's stored under "mapreduce.job.inputformat.class", the new API will be used; otherwise, the old API will be used. If you get the configuration from JobContextImpl.getConfiguration(), the new API will be used. Please see COPY_ON_READ if you are using the new API.

The default local parallelism for this processor is 2 (or less if fewer CPUs are available).
This source does not save any state to snapshot. If the job is restarted, all entries will be emitted again.
- Type Parameters:
  K - key type of the records
  V - value type of the records
  E - the type of the emitted value
- Parameters:
  configuration - JobConf for reading files with the appropriate input format and path
  projectionFn - function to create output objects from key and value. If the projection returns null for an item, that item will be filtered out
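As a sketch of how this overload might be wired into a pipeline (the input path, class names, and pipeline wiring below are illustrative assumptions, not part of this API doc):

```java
// Illustrative sketch: read lines from HDFS and emit only their text.
// The path "hdfs://namenode/logs" is an assumption.
import com.hazelcast.jet.hadoop.HadoopSources;
import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ReadLines {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Setting the input format through Job stores it under
        // "mapreduce.job.inputformat.class", so the new MapReduce API is used.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://namenode/logs"));
        Configuration configuration = job.getConfiguration();

        // projectionFn turns each (byte offset, line) pair into the line's text;
        // returning null here would filter the record out.
        BatchSource<String> source = HadoopSources.inputFormat(
                configuration, (LongWritable k, Text v) -> v.toString());

        Pipeline p = Pipeline.create();
        p.readFrom(source).writeTo(Sinks.logger());
    }
}
```

Since TextInputFormat reuses its Text instance per record (see COPY_ON_READ), projecting to an immutable String as above also allows disabling copying safely.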
-
inputFormat
@Nonnull
public static <K,V,E> BatchSource<E> inputFormat(@Nonnull ConsumerEx<org.apache.hadoop.conf.Configuration> configureFn, @Nonnull BiFunctionEx<K,V,E> projectionFn)
Returns a source that reads records from Apache Hadoop HDFS and emits the results of transforming each record (a key-value pair) with the supplied projection function.

This source splits and balances the input data among Jet processors, doing its best to achieve data locality. To this end the Jet cluster topology should be aligned with Hadoop's: on each Hadoop member there should be a Jet member.

The configureFn is used to configure the MR job. The function is run on the coordinator node of the Jet job, which avoids contacting the server from the machine where the job is submitted.

The new MapReduce API will be used.

The default local parallelism for this processor is 2 (or less if fewer CPUs are available).

This source does not save any state to snapshot. If the job is restarted, all entries will be emitted again.
- Type Parameters:
  K - key type of the records
  V - value type of the records
  E - the type of the emitted value
- Parameters:
  configureFn - function to configure the MR job
  projectionFn - function to create output objects from key and value. If the projection returns null for an item, that item will be filtered out
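A sketch of the same read expressed with configureFn; the input directory and the property values set inside the lambda are illustrative assumptions:

```java
// Illustrative sketch: configure the MR job via configureFn, which runs on
// the Jet coordinator rather than the submitting machine.
import com.hazelcast.jet.hadoop.HadoopSources;
import com.hazelcast.jet.pipeline.BatchSource;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ReadLinesWithConfigureFn {
    public static void main(String[] args) {
        BatchSource<String> source = HadoopSources.inputFormat(
                conf -> {
                    // The new MapReduce API is always used with this overload.
                    conf.set("mapreduce.job.inputformat.class",
                            TextInputFormat.class.getName());
                    // Standard Hadoop property naming the input directory;
                    // the path itself is an assumption.
                    conf.set("mapreduce.input.fileinputformat.inputdir",
                            "hdfs://namenode/logs");
                },
                (LongWritable k, Text v) -> v.toString());
        // source can now be read in a pipeline, e.g. Pipeline.create().readFrom(source)
    }
}
```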
-
inputFormat
@Nonnull
public static <K,V> BatchSource<Map.Entry<K,V>> inputFormat(@Nonnull org.apache.hadoop.conf.Configuration jobConf)
Convenience for inputFormat(Configuration, BiFunctionEx) with Map.Entry as its output type.
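A minimal sketch of the convenience overload; the key/value types shown follow TextInputFormat and are an assumption about how jobConf has been configured:

```java
// Illustrative sketch: the convenience overload emits Map.Entry<K, V> pairs
// directly, with no projection function.
import java.util.Map;
import com.hazelcast.jet.hadoop.HadoopSources;
import com.hazelcast.jet.pipeline.BatchSource;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class EntrySource {
    public static void main(String[] args) {
        Configuration jobConf = new Configuration();
        // ... input format and path configured as in the other overloads ...
        BatchSource<Map.Entry<LongWritable, Text>> source =
                HadoopSources.inputFormat(jobConf);
    }
}
```

Because no projectionFn is applied, COPY_ON_READ is particularly relevant here: record readers that reuse key/value instances will hand the same objects to every emitted entry unless copying stays enabled.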
-