Class FileSourceBuilder<T>

  • Type Parameters:
    T - the type of items a source using this file format will emit

    public class FileSourceBuilder<T>
    extends java.lang.Object
    A unified builder object for various kinds of file sources.

    To create an instance, use FileSources.files(String).

    Since: Jet 4.4
    • Method Detail

      • glob

        public FileSourceBuilder<T> glob(@Nonnull
                                         java.lang.String glob)
        Sets a glob pattern to filter the files in the specified directory. The default value is '*', matching all files in the directory.
        glob - glob pattern
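
        As a rough illustration of how such a pattern selects files, the sketch below uses the standard java.nio glob matcher. It assumes the pattern passed to glob(String) follows the same syntax; GlobSketch and matches are hypothetical names used only for this example and are not part of Jet.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Illustrative only: GlobSketch and matches(...) are hypothetical, not part of Jet.
final class GlobSketch {

    // Returns true if fileName matches the glob, using java.nio glob syntax
    // (assumed here to be the syntax the builder expects).
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // The default pattern '*' accepts every file name.
        System.out.println(matches("*", "orders.csv"));
        // An extension filter accepts only names ending in .csv.
        System.out.println(matches("*.csv", "orders.csv"));
        System.out.println(matches("*.csv", "orders.json"));
        // '?' stands for exactly one character.
        System.out.println(matches("orders-??.csv", "orders-01.csv"));
    }
}
```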
      • format

        public <T_NEW> FileSourceBuilder<T_NEW> format(@Nonnull
                                                       FileFormat<T_NEW> fileFormat)
        Sets the file format for the source. See FileFormat for available formats and factory methods.

        It's not possible to implement a custom format.

      • useHadoopForLocalFiles

        public FileSourceBuilder<T> useHadoopForLocalFiles(boolean useHadoop)
        Specifies that Jet should use Apache Hadoop for files from the local filesystem. Otherwise, local files are read by Jet directly. One advantage of Hadoop is that it can provide better parallelization when the number of files is smaller than the total parallelism of the pipeline source.

        Default value is false.

        useHadoop - true if Hadoop should be used for reading files from the local filesystem
      • sharedFileSystem

        public FileSourceBuilder<T> sharedFileSystem(boolean sharedFileSystem)
        If sharedFileSystem is true, Jet will assume all members see the same files. They will split the work so that each member will read a part of the files. If sharedFileSystem is false, each member will read all files in the directory, assuming that other members see different files.

        This option applies only to the local filesystem, that is, when Hadoop is not used and the directory doesn't contain a prefix for a remote file system. Distributed filesystems are always assumed to be shared.

        If you start all the members on a single machine (such as for development), set this property to true. If you have multiple machines with multiple members each and the directory is not a shared storage, it's not possible to configure the file reader correctly - use only one member per machine.

        Default value is false.
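
        The difference between the two settings can be pictured with the sketch below. It is not Jet's actual assignment logic: the round-robin split, the class SharedSplitSketch, and the method filesForMember are all assumptions made for illustration. It shows why members that see the same directory read every file once per member unless sharedFileSystem is true.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical model of the two modes, not Jet's implementation.
final class SharedSplitSketch {

    // Returns the files one member would read.
    // shared == true:  members divide the listing (round-robin is an assumption).
    // shared == false: every member reads everything it sees.
    static List<String> filesForMember(List<String> files, int memberIndex,
                                       int memberCount, boolean shared) {
        if (!shared) {
            return new ArrayList<>(files);
        }
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            if (i % memberCount == memberIndex) {
                mine.add(files.get(i));
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.csv", "b.csv", "c.csv", "d.csv");
        // Two members on one machine, shared = true: each file is read once overall.
        System.out.println(filesForMember(files, 0, 2, true));
        System.out.println(filesForMember(files, 1, 2, true));
        // Same setup, shared = false: both members read all four files (duplicates).
        System.out.println(filesForMember(files, 0, 2, false));
        System.out.println(filesForMember(files, 1, 2, false));
    }
}
```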

      • ignoreFileNotFound

        public FileSourceBuilder<T> ignoreFileNotFound(boolean ignoreFileNotFound)
        Set to true to tolerate the case where no files in the directory specified by path match the glob.

        When no file matches the glob specified by glob(String) (or the default glob), Jet throws an exception by default. This can be a problem when the directory is legitimately empty. To override this behaviour, set this option to true.

        If set to true and there are no matching files in the directory, the source produces 0 items.

        Default value is false.

        ignoreFileNotFound - true if the absence of matching files in the specified directory should be accepted
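
        The behaviour described above can be modelled with the minimal sketch below. The class IgnoreNotFoundSketch, the method resolve, and the choice of IllegalStateException are assumptions for illustration; the exception type Jet actually throws is not stated here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: a hypothetical model of the flag, not Jet's implementation.
final class IgnoreNotFoundSketch {

    // Takes the file names that already matched the glob and applies the flag:
    // an empty match fails unless ignoreFileNotFound is true.
    static List<String> resolve(List<String> matchedFiles, boolean ignoreFileNotFound) {
        if (matchedFiles.isEmpty() && !ignoreFileNotFound) {
            // The exception type is an assumption made for this sketch.
            throw new IllegalStateException("No files matched the glob");
        }
        return matchedFiles;
    }

    public static void main(String[] args) {
        // Files present: the flag makes no difference.
        System.out.println(resolve(Arrays.asList("a.csv", "b.csv"), false));
        // No files, flag set: the source would simply have nothing to emit.
        System.out.println(resolve(Collections.<String>emptyList(), true).size());
        // No files, flag unset (the default): the call fails.
        try {
            resolve(Collections.<String>emptyList(), false);
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```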
      • option

        public FileSourceBuilder<T> option(java.lang.String key,
                                           java.lang.String value)
        Specifies an arbitrary option for the underlying source. If you are looking for a missing option, check the FileFormat class you're using; it offers parsing-related options.
      • buildMetaSupplier

        public ProcessorMetaSupplier buildMetaSupplier()
        Builds a ProcessorMetaSupplier based on the current state of the builder. Use for integration with the Core API.

        This method is part of the Core API and has weaker backward-compatibility guarantees (it may change in a minor version).

      • hasHadoopPrefix

        public static boolean hasHadoopPrefix(java.lang.String path)
        Checks if the given path starts with one of the defined Hadoop prefixes:

          "s3a://",   // Amazon S3
          "hdfs://",  // HDFS
          "wasbs://", // Azure Cloud Storage
          "adl://",   // Azure Data Lake Gen 1
          "abfs://",  // Azure Data Lake Gen 2
          "gs://"     // Google Cloud Storage

        See HADOOP_PREFIXES.
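
        A check of this kind can be sketched in a few lines. The version below is a stand-alone approximation built from the prefixes listed above; the class HadoopPrefixSketch is hypothetical, and details such as case-sensitive matching are assumptions rather than documented behaviour.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a stand-alone approximation, not the library's source.
final class HadoopPrefixSketch {

    // The prefixes listed in the description above.
    static final List<String> PREFIXES = Arrays.asList(
            "s3a://",   // Amazon S3
            "hdfs://",  // HDFS
            "wasbs://", // Azure Cloud Storage
            "adl://",   // Azure Data Lake Gen 1
            "abfs://",  // Azure Data Lake Gen 2
            "gs://"     // Google Cloud Storage
    );

    // Returns true if path starts with any listed prefix
    // (case-sensitive comparison is an assumption).
    static boolean hasHadoopPrefix(String path) {
        for (String prefix : PREFIXES) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasHadoopPrefix("s3a://my-bucket/input")); // remote path
        System.out.println(hasHadoopPrefix("/var/data/input"));       // local path
    }
}
```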