Interface FileFormat<T>

Type Parameters:
T - the type of items a source using this file format will emit
All Superinterfaces:
Serializable
All Known Implementing Classes:
AvroFileFormat, CsvFileFormat, JsonFileFormat, LinesTextFileFormat, ParquetFileFormat, RawBytesFileFormat, TextFileFormat

public interface FileFormat<T> extends Serializable
Describes the data format of a file to be used as a Jet data source. This is a data object that holds the configuration; actual implementation code is looked up elsewhere, by using this object as a key.
Since:
Jet 4.4
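
The split described above, where a format object only carries configuration and the reading code is found by using that object as a key, can be illustrated with a small sketch. It uses only the JDK and is not Jet's implementation: the Descriptor interface, the Lines and WholeText classes, the READERS registry and the "lines" and "txt" keys are all hypothetical names chosen for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

final class FormatLookupSketch {

    // Hypothetical stand-in for a format descriptor: configuration only, no reading logic.
    interface Descriptor {
        String format();
    }

    static final class Lines implements Descriptor {
        final Charset charset;
        Lines(Charset charset) { this.charset = charset; }
        @Override public String format() { return "lines"; }
    }

    static final class WholeText implements Descriptor {
        final Charset charset;
        WholeText(Charset charset) { this.charset = charset; }
        @Override public String format() { return "txt"; }
    }

    // Hypothetical registry: the reading code lives here, keyed by the descriptor's format name.
    private static final Map<String, BiFunction<Descriptor, byte[], List<String>>> READERS = new HashMap<>();

    static {
        READERS.put("lines", (d, bytes) ->
                new String(bytes, ((Lines) d).charset).lines().collect(Collectors.toList()));
        READERS.put("txt", (d, bytes) ->
                Collections.singletonList(new String(bytes, ((WholeText) d).charset)));
    }

    // Looks up the reader by format name and applies it with the descriptor's configuration.
    static List<String> read(Descriptor descriptor, byte[] fileContent) {
        BiFunction<Descriptor, byte[], List<String>> reader = READERS.get(descriptor.format());
        if (reader == null) {
            throw new IllegalArgumentException("No reader registered for format: " + descriptor.format());
        }
        return reader.apply(descriptor, fileContent);
    }

    public static void main(String[] args) {
        byte[] content = "a\nb\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(read(new Lines(StandardCharsets.UTF_8), content));            // [a, b]
        System.out.println(read(new WholeText(StandardCharsets.UTF_8), content).size()); // 1
    }
}
```

Keeping the descriptor free of reading logic means it stays a small serializable value, while the code that does the reading can differ between environments.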
  • Method Summary

    Modifier and Type
    Method
    Description
    static <T> AvroFileFormat<T>
    avro()
    Returns a file format for Avro files.
    static <T> AvroFileFormat<T>
    avro(Class<T> clazz)
    Returns a file format for Avro files that specifies to use reflection to deserialize the data into instances of the provided Java class.
    static RawBytesFileFormat
    bytes()
    Returns a file format for binary files.
    static <T> CsvFileFormat<T>
    csv(Class<T> clazz)
    Returns a file format for CSV files which specifies to deserialize each line into an instance of the given class.
    static CsvFileFormat<String[]>
    csv(List<String> fieldNames)
    Returns a file format for CSV files which specifies to deserialize each line into String[].
    String
    format()
    Returns the name of the file format.
    static <T> JsonFileFormat<T>
    json()
    Returns a file format for JSON Lines files.
    static <T> JsonFileFormat<T>
    json(Class<T> clazz)
    Returns a file format for JSON Lines files, where each line of text is one JSON object.
    static LinesTextFileFormat
    lines()
    Returns a file format for text files where each line is a String data item.
    static LinesTextFileFormat
    lines(Charset charset)
    Returns a file format for text files where each line is a String data item.
    static <T> ParquetFileFormat<T>
    parquet()
    Returns a file format for Parquet files.
    static TextFileFormat
    text()
    Returns a file format for text files where the whole file is a single string item.
    static TextFileFormat
    text(Charset charset)
    Returns a file format for text files where the whole file is a single string item.
  • Method Details

    • format

      @Nonnull String format()
      Returns the name of the file format. The convention is to use the well-known filename suffix or, if there is none, a short-form name of the format.
    • avro

      @Nonnull static <T> AvroFileFormat<T> avro()
      Returns a file format for Avro files.
    • avro

      @Nonnull static <T> AvroFileFormat<T> avro(@Nullable Class<T> clazz)
      Returns a file format for Avro files that uses reflection to deserialize the data into instances of the provided Java class. Jet will use the ReflectDatumReader to read Avro data. The parameter may be null, which disables reflection-based deserialization, but for that case you may prefer the no-argument avro() call.
    • csv

      @Nonnull static CsvFileFormat<String[]> csv(@Nullable List<String> fieldNames)
      Returns a file format for CSV files which specifies to deserialize each line into String[]. It assumes the CSV has a header line and uses it to identify the columns.

      fieldNames specify which column should be at which index in the resulting string array. It is useful if the files have different field order or don't have the same set of columns.

      For example, if the argument is [surname, name], then the format will always return items of type String[2] where at index 0 is the surname column and at index 1 is the name column, regardless of the actual columns found in a particular file. If some file doesn't have some field, the value at its index will always be null.

      If the given list is null, the length and order of the string array will match the columns found in each file, so they can differ from file to file. If the list is empty, a zero-length array will be returned.
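
The fieldNames rules above can be sketched with plain JDK code. This is a minimal illustration and not the parser Jet uses: the CsvFieldOrderSketch class and its read method are hypothetical, and the naive split on commas ignores quoting and escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvFieldOrderSketch {

    // Illustration only: the first line is the header, the rest are data rows.
    static List<String[]> read(List<String> fileLines, List<String> fieldNames) {
        List<String[]> rows = new ArrayList<>();
        if (fileLines.isEmpty()) {
            return rows;
        }
        List<String> header = Arrays.asList(fileLines.get(0).split(",", -1));
        for (String line : fileLines.subList(1, fileLines.size())) {
            String[] cells = line.split(",", -1);
            if (fieldNames == null) {
                // No field names given: keep the file's own column order and count.
                rows.add(cells);
                continue;
            }
            // Field names given: the array layout is fixed by the list, not by the file.
            String[] row = new String[fieldNames.size()];
            for (int i = 0; i < row.length; i++) {
                int column = header.indexOf(fieldNames.get(i));
                row[i] = (column >= 0 && column < cells.length) ? cells[column] : null;
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> fieldNames = Arrays.asList("surname", "name");
        List<String> fileA = Arrays.asList("name,surname,age", "Jane,Doe,41");
        List<String> fileB = Arrays.asList("surname", "Smith");
        System.out.println(Arrays.toString(read(fileA, fieldNames).get(0))); // [Doe, Jane]
        System.out.println(Arrays.toString(read(fileB, fieldNames).get(0))); // [Smith, null]
    }
}
```

Both files yield arrays of length 2 with the surname at index 0, even though their headers differ.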

    • csv

      @Nonnull static <T> CsvFileFormat<T> csv(@Nonnull Class<T> clazz)
      Returns a file format for CSV files which specifies to deserialize each line into an instance of the given class. It assumes the CSV has a header line and specifies to use it as the column names that map to the object's fields.
    • json

      @Nonnull static <T> JsonFileFormat<T> json()
      Returns a file format for JSON Lines files.
    • json

      @Nonnull static <T> JsonFileFormat<T> json(@Nullable Class<T> clazz)
      Returns a file format for JSON Lines files, where each line of text is one JSON object. It specifies to deserialize the JSON data into instances of the provided class. It uses Jackson jr, which supports the basic data types such as strings, numbers, lists and maps, objects with JavaBeans-style getters/setters, as well as public fields. If the parameter is null, data is deserialized into Map<String, Object>, but for that case you may prefer the no-argument json() call.
    • lines

      @Nonnull static LinesTextFileFormat lines()
      Returns a file format for text files where each line is a String data item. It uses the UTF-8 character encoding.
    • lines

      @Nonnull static LinesTextFileFormat lines(@Nonnull Charset charset)
      Returns a file format for text files where each line is a String data item. This variant allows you to choose the character encoding. Note that the Hadoop-based file connector only accepts UTF-8.
      Parameters:
      charset - character encoding of the file
    • parquet

      @Nonnull static <T> ParquetFileFormat<T> parquet()
      Returns a file format for Parquet files.

      NOTE: this format is supported only through the Hadoop connector.

    • bytes

      @Nonnull static RawBytesFileFormat bytes()
      Returns a file format for binary files.
    • text

      @Nonnull static TextFileFormat text()
      Returns a file format for text files where the whole file is a single string item. It uses the UTF-8 character encoding.
    • text

      @Nonnull static TextFileFormat text(@Nonnull Charset charset)
      Returns a file format for text files where the whole file is a single string item. This variant allows you to choose the character encoding.

      NOTE: the Hadoop connector only supports UTF-8, so any other character encoding is supported for local files only.

      Parameters:
      charset - character encoding of the file
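
Why the charset parameter matters can be shown with a short JDK-only sketch. It does not call Jet: the CharsetChoiceSketch class and its wholeFileItem method are hypothetical, and the byte array stands in for the content of a file written in ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetChoiceSketch {

    // Illustration only: decodes the entire content into one string item.
    static String wholeFileItem(byte[] fileContent, Charset charset) {
        return new String(fileContent, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9\n";
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        String matching = wholeFileItem(latin1Bytes, StandardCharsets.ISO_8859_1);
        String mismatched = wholeFileItem(latin1Bytes, StandardCharsets.UTF_8);

        System.out.println(matching.equals(original));   // true
        System.out.println(mismatched.equals(original)); // false
        System.out.println(mismatched.contains("\uFFFD")); // true: the invalid byte became U+FFFD
    }
}
```

Decoding with the wrong charset does not fail. It silently substitutes replacement characters, so the charset passed to the format should match how the files were written.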