Class S3Sources

java.lang.Object
com.hazelcast.jet.s3.S3Sources

public final class S3Sources extends Object
Contains factory methods for creating AWS S3 sources.
  • Method Details

    • s3

      @Nonnull public static BatchSource<String> s3(@Nonnull List<String> bucketNames, @Nullable String prefix, @Nonnull SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier)
      Convenience for s3(List, String, Charset, SupplierEx, BiFunctionEx). Reads the objects line by line, emits each line downstream without any transformation, and decodes using StandardCharsets.UTF_8.
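      The behavior this convenience overload installs — UTF-8 decoding and an identity mapFn — can be sketched with plain java.io. This is an illustrative stand-in, not Jet code; the class and method names below are invented for the sketch:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class ConvenienceSketch {
    // Mirrors what the convenience overload does with each S3 object:
    // decode the bytes as UTF-8 and emit each line unchanged, which is
    // what the identity mapFn (filename, line) -> line amounts to.
    static List<String> readLinesUtf8(InputStream in) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines()
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        InputStream object = new ByteArrayInputStream("first\nsecond".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLinesUtf8(object)); // prints [first, second]
    }
}
```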
    • s3

      @Nonnull public static <T> BatchSource<T> s3(@Nonnull List<String> bucketNames, @Nullable String prefix, @Nonnull Charset charset, @Nonnull SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier, @Nonnull BiFunctionEx<String,String,? extends T> mapFn)
      Creates an AWS S3 BatchSource which lists all the objects in the given buckets that match the given prefix, reads them line by line, transforms each line to the desired output object using the given mapFn and emits the results downstream.

      The source does not save any state to snapshot. If the job is restarted, it will re-emit all entries.

      The default local parallelism for this processor is 2.

      Here is an example which reads the objects from two buckets, applying the given prefix.

      
       Pipeline p = Pipeline.create();
       BatchStage<String> srcStage = p.readFrom(S3Sources.s3(
            Arrays.asList("bucket1", "bucket2"),
            "prefix",
            StandardCharsets.UTF_8,
            () -> S3Client.create(),
            (filename, line) -> line
       ));
       
      Type Parameters:
      T - the type of the items the source emits
      Parameters:
      bucketNames - list of bucket-names
      prefix - the prefix to filter the objects. Optional, passing null will list all objects.
      clientSupplier - function which returns the S3 client to use; one client is created per processor instance
      mapFn - the function which creates the output object from each line. It receives the object name and the line as parameters
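      A mapFn need not return the raw line. For illustration, a hypothetical mapFn could tag each line with the name of the object it came from; the class name and the choice of Map.Entry here are assumptions for the sketch, not prescribed by the API:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public class MapFnSketch {
    // Hypothetical mapFn: instead of emitting the raw line, pair it with the
    // name of the S3 object it was read from. Passed to the s3(...) overload
    // above, this would make the source's type parameter T be
    // Map.Entry<String, String>.
    static Map.Entry<String, String> tagWithObjectName(String objectName, String line) {
        return new SimpleImmutableEntry<>(objectName, line);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e = tagWithObjectName("bucket1/logs/app.log", "GET /index");
        System.out.println(e.getKey() + " -> " + e.getValue()); // prints bucket1/logs/app.log -> GET /index
    }
}
```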
    • s3

      @Nonnull public static <I, T> BatchSource<T> s3(@Nonnull List<String> bucketNames, @Nullable String prefix, @Nonnull SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier, @Nonnull FunctionEx<? super InputStream,? extends Stream<I>> readFileFn, @Nonnull BiFunctionEx<String,? super I,? extends T> mapFn)
      Creates an AWS S3 BatchSource which lists all the objects in the given buckets that match the given prefix, reads them using the provided readFileFn, transforms each read item to the desired output object using the given mapFn and emits the results downstream.

      The source does not save any state to snapshot. If the job is restarted, it will re-emit all entries.

      The default local parallelism for this processor is 2.

      Here is an example which reads the objects from two buckets, applying the given prefix.

      
       Pipeline p = Pipeline.create();
       BatchStage<String> srcStage = p.readFrom(S3Sources.s3(
            Arrays.asList("bucket1", "bucket2"),
            "prefix",
            () -> S3Client.create(),
        (inputStream) -> new BufferedReader(new InputStreamReader(inputStream)).lines(),
            (filename, line) -> line
       ));
       
      Type Parameters:
      I - the type of the items produced by readFileFn
      T - the type of the items the source emits
      Parameters:
      bucketNames - list of bucket-names
      prefix - the prefix to filter the objects. Optional, passing null will list all objects.
      clientSupplier - function which returns the S3 client to use; one client is created per processor instance
      readFileFn - the function which transforms the object's InputStream into a lazily evaluated Stream of items
      mapFn - the function which creates the output object from each read item. It receives the object name and the item as parameters
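      The readFileFn has the shape FunctionEx&lt;InputStream, Stream&lt;I&gt;&gt;, so it should return a Stream that pulls from the InputStream on demand. A minimal line-streaming sketch (the class and method names are illustrative):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Stream;

public class ReadFileFnSketch {
    // Hypothetical readFileFn: the returned Stream reads lines from the
    // BufferedReader lazily, and closing the Stream closes the reader.
    static Stream<String> lines(InputStream in) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines().onClose(() -> {
            try {
                reader.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) {
        InputStream object = new ByteArrayInputStream("x\ny\nz".getBytes(StandardCharsets.UTF_8));
        try (Stream<String> s = lines(object)) {
            System.out.println(s.count()); // prints 3
        }
    }
}
```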
    • s3

      @Nonnull public static <I, T> BatchSource<T> s3(@Nonnull List<String> bucketNames, @Nullable String prefix, @Nonnull SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier, @Nonnull TriFunction<? super InputStream,String,String,? extends Stream<I>> readFileFn, @Nonnull BiFunctionEx<String,? super I,? extends T> mapFn)
      Creates an AWS S3 BatchSource which lists all the objects in the given buckets that match the given prefix, reads them using the provided readFileFn, transforms each read item to the desired output object using the given mapFn and emits the results downstream. In this overload readFileFn additionally receives the object key and the bucket name.

      The source does not save any state to snapshot. If the job is restarted, it will re-emit all entries.

      The default local parallelism for this processor is 2.

      Here is an example which reads the objects from two buckets, applying the given prefix.

      
       Pipeline p = Pipeline.create();
       BatchStage<String> srcStage = p.readFrom(S3Sources.s3(
            Arrays.asList("bucket1", "bucket2"),
            "prefix",
            () -> S3Client.create(),
        (inputStream, key, bucketName) -> new BufferedReader(new InputStreamReader(inputStream)).lines(),
            (filename, line) -> line
       ));
       
      Type Parameters:
      I - the type of the items produced by readFileFn
      T - the type of the items the source emits
      Parameters:
      bucketNames - list of bucket-names
      prefix - the prefix to filter the objects. Optional, passing null will list all objects.
      clientSupplier - function which returns the S3 client to use; one client is created per processor instance
      readFileFn - the function which transforms the object's InputStream into a lazily evaluated Stream of items; it also receives the object key and the bucket name
      mapFn - the function which creates the output object from each read item. It receives the object name and the item as parameters
      Since:
      Jet 4.3
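      The extra key and bucket-name parameters of this TriFunction overload let readFileFn attach provenance to each item. A hypothetical sketch (names invented for illustration):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Stream;

public class TriReadFileFnSketch {
    // Hypothetical readFileFn with the TriFunction shape: besides the
    // InputStream it receives the object key and bucket name, used here to
    // prefix every line with its origin.
    static Stream<String> linesWithOrigin(InputStream in, String key, String bucketName) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines().map(line -> bucketName + "/" + key + ": " + line);
    }

    public static void main(String[] args) {
        InputStream object = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        linesWithOrigin(object, "data.txt", "bucket1")
                .forEach(System.out::println); // prints bucket1/data.txt: hello
    }
}
```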