ReadHdfsP
is used to read items from one or more HDFS files. The input
is split according to the given InputFormat
and read in parallel
across all processor instances.
ReadHdfsP
by default emits items of type Map.Entry<K,V>
, where K
and V
are the parameters for the given InputFormat
. It is possible
to transform the records using an optional mapper parameter. For
example, in the above example the output record type of TextInputFormat
is Map.Entry<LongWritable, Text>
. LongWritable
represents the line
number and Text
the contents of the line. If you do not care about
line numbers, and want your output as a plain String
, you can do as
follows:
Vertex source = dag.newVertex("source", readHdfs(jobConf, (k, v) -> v.toString()));
With this change, ReadHdfsP
will emit items of type String
instead.
Cluster Co-location
The Jet cluster should be run on the same nodes as the HDFS nodes for best read performance. If this is the case, each processor instance will try to read as much local data as possible. A heuristic algorithm is used to assign replicated blocks across the cluster to ensure a well-balanced work distribution between processor instances for maximum performance.