Class S3Sources


  • public final class S3Sources
    extends java.lang.Object
    Contains factory methods for creating AWS S3 sources.
    • Method Summary

      All methods are static.

      static BatchSource<java.lang.String>
      s3(java.util.List<java.lang.String> bucketNames, java.lang.String prefix, SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier)
      Convenience for s3(List, String, Charset, SupplierEx, BiFunctionEx); reads objects line by line using UTF-8 and emits the lines unchanged.

      static <I,T> BatchSource<T>
      s3(java.util.List<java.lang.String> bucketNames, java.lang.String prefix, SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier, FunctionEx<? super java.io.InputStream,? extends java.util.stream.Stream<I>> readFileFn, BiFunctionEx<java.lang.String,? super I,? extends T> mapFn)
      Creates an AWS S3 BatchSource which lists all the objects in the given buckets under the given prefix, reads them using the provided readFileFn, transforms each read item to the desired output object using the given mapFn and emits the results downstream.

      static <I,T> BatchSource<T>
      s3(java.util.List<java.lang.String> bucketNames, java.lang.String prefix, SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier, TriFunction<? super java.io.InputStream,java.lang.String,java.lang.String,? extends java.util.stream.Stream<I>> readFileFn, BiFunctionEx<java.lang.String,? super I,? extends T> mapFn)
      Creates an AWS S3 BatchSource which lists all the objects in the given buckets under the given prefix, reads them using the provided readFileFn, transforms each read item to the desired output object using the given mapFn and emits the results downstream.

      static <T> BatchSource<T>
      s3(java.util.List<java.lang.String> bucketNames, java.lang.String prefix, java.nio.charset.Charset charset, SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier, BiFunctionEx<java.lang.String,java.lang.String,? extends T> mapFn)
      Creates an AWS S3 BatchSource which lists all the objects in the given buckets under the given prefix, reads them line by line, transforms each line to the desired output object using the given mapFn and emits the results downstream.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • s3

        @Nonnull
        public static BatchSource<java.lang.String> s3(@Nonnull
                                                       java.util.List<java.lang.String> bucketNames,
                                                       @Nullable
                                                       java.lang.String prefix,
                                                       @Nonnull
                                                       SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier)
        Convenience for s3(List, String, Charset, SupplierEx, BiFunctionEx). Reads objects line by line using StandardCharsets.UTF_8 and emits each line to downstream without any transformation.
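        For reference, a minimal pipeline wiring of this convenience overload, in the style of the examples below. This is a sketch: the bucket names and prefix are placeholders, and it assumes the hazelcast-jet-s3 module and the AWS SDK v2 client are on the classpath.

        ```java
        // Sketch: read all lines (UTF-8) of objects under "prefix" in two buckets.
        Pipeline p = Pipeline.create();
        BatchStage<String> lines = p.readFrom(S3Sources.s3(
             Arrays.asList("bucket1", "bucket2"),
             "prefix",
             () -> S3Client.create()   // one client per processor instance
        ));
        ```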
      • s3

        @Nonnull
        public static <T> BatchSource<T> s3(@Nonnull
                                            java.util.List<java.lang.String> bucketNames,
                                            @Nullable
                                            java.lang.String prefix,
                                            @Nonnull
                                            java.nio.charset.Charset charset,
                                            @Nonnull
                                            SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier,
                                            @Nonnull
                                            BiFunctionEx<java.lang.String,java.lang.String,? extends T> mapFn)
        Creates an AWS S3 BatchSource which lists all the objects in the given buckets under the given prefix, reads them line by line, transforms each line to the desired output object using the given mapFn and emits the results downstream.

        The source does not save any state to snapshot. If the job is restarted, it will re-emit all entries.

        The default local parallelism for this processor is 2.

        Here is an example which reads the objects from two buckets, filtered by the given prefix.

        
         Pipeline p = Pipeline.create();
         BatchStage<String> srcStage = p.readFrom(S3Sources.s3(
              Arrays.asList("bucket1", "bucket2"),
              "prefix",
              StandardCharsets.UTF_8,
              () -> S3Client.create(),
              (filename, line) -> line
         ));
         
        Type Parameters:
        T - the type of the items the source emits
        Parameters:
        bucketNames - list of bucket-names
        prefix - the prefix to filter the objects. Optional, passing null will list all objects.
        clientSupplier - function which returns the S3 client to use; one client per processor instance is used
        mapFn - the function which creates the output object from each line; it receives the object name and the line as parameters
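        To illustrate the mapFn contract in isolation, here is a self-contained sketch using plain java.util.function.BiFunction as a stand-in for Jet's serializable BiFunctionEx. Pairing the object name with the line is an illustrative choice, not something the API prescribes.

        ```java
        import java.util.AbstractMap.SimpleEntry;
        import java.util.Map;
        import java.util.function.BiFunction;

        public class MapFnSketch {
            // Stand-in for Jet's BiFunctionEx<String, String, T>: receives the S3
            // object name and one line, and builds the output item (here, a pair).
            static final BiFunction<String, String, Map.Entry<String, String>> MAP_FN =
                    (objectName, line) -> new SimpleEntry<>(objectName, line);

            public static void main(String[] args) {
                Map.Entry<String, String> item = MAP_FN.apply("data.txt", "hello");
                System.out.println(item.getKey() + " -> " + item.getValue()); // data.txt -> hello
            }
        }
        ```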
      • s3

        @Nonnull
        public static <I,T> BatchSource<T> s3(@Nonnull
                                              java.util.List<java.lang.String> bucketNames,
                                              @Nullable
                                              java.lang.String prefix,
                                              @Nonnull
                                              SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier,
                                              @Nonnull
                                              FunctionEx<? super java.io.InputStream,? extends java.util.stream.Stream<I>> readFileFn,
                                              @Nonnull
                                              BiFunctionEx<java.lang.String,? super I,? extends T> mapFn)
        Creates an AWS S3 BatchSource which lists all the objects in the given buckets under the given prefix, reads them using the provided readFileFn, transforms each read item to the desired output object using the given mapFn and emits the results downstream.

        The source does not save any state to snapshot. If the job is restarted, it will re-emit all entries.

        The default local parallelism for this processor is 2.

        Here is an example which reads the objects from two buckets, filtered by the given prefix.

        
         Pipeline p = Pipeline.create();
         BatchStage<String> srcStage = p.readFrom(S3Sources.s3(
              Arrays.asList("bucket1", "bucket2"),
              "prefix",
              () -> S3Client.create(),
              (inputStream) -> new BufferedReader(new InputStreamReader(inputStream)).lines(),
              (filename, line) -> line
         ));
         
        Type Parameters:
        I - the type of the items readFileFn produces from each object
        T - the type of the items the source emits
        Parameters:
        bucketNames - list of bucket-names
        prefix - the prefix to filter the objects. Optional, passing null will list all objects.
        clientSupplier - function which returns the S3 client to use; one client per processor instance is used
        readFileFn - the function which creates a Stream that lazily reads the contents of the object's InputStream
        mapFn - the function which creates the output object from each item; it receives the object name and the item as parameters
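        A readFileFn must turn the object's InputStream into a lazily evaluated Stream. Here is a self-contained sketch using plain java.util.function.Function as a stand-in for Jet's serializable FunctionEx:

        ```java
        import java.io.BufferedReader;
        import java.io.ByteArrayInputStream;
        import java.io.InputStream;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import java.util.List;
        import java.util.function.Function;
        import java.util.stream.Collectors;
        import java.util.stream.Stream;

        public class ReadFileFnSketch {
            // Stand-in for Jet's FunctionEx<InputStream, Stream<String>>: wraps the
            // object's InputStream and produces its lines as a lazy Stream.
            static final Function<InputStream, Stream<String>> READ_FILE_FN =
                    in -> new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)).lines();

            public static void main(String[] args) {
                InputStream in = new ByteArrayInputStream("a\nb\nc".getBytes(StandardCharsets.UTF_8));
                List<String> lines = READ_FILE_FN.apply(in).collect(Collectors.toList());
                System.out.println(lines); // [a, b, c]
            }
        }
        ```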
      • s3

        @Nonnull
        public static <I,T> BatchSource<T> s3(@Nonnull
                                              java.util.List<java.lang.String> bucketNames,
                                              @Nullable
                                              java.lang.String prefix,
                                              @Nonnull
                                              SupplierEx<? extends software.amazon.awssdk.services.s3.S3Client> clientSupplier,
                                              @Nonnull
                                              TriFunction<? super java.io.InputStream,java.lang.String,java.lang.String,? extends java.util.stream.Stream<I>> readFileFn,
                                              @Nonnull
                                              BiFunctionEx<java.lang.String,? super I,? extends T> mapFn)
        Creates an AWS S3 BatchSource which lists all the objects in the given buckets under the given prefix, reads them using the provided readFileFn, transforms each read item to the desired output object using the given mapFn and emits the results downstream.

        The source does not save any state to snapshot. If the job is restarted, it will re-emit all entries.

        The default local parallelism for this processor is 2.

        Here is an example which reads the objects from two buckets, filtered by the given prefix.

        
         Pipeline p = Pipeline.create();
         BatchStage<String> srcStage = p.readFrom(S3Sources.s3(
              Arrays.asList("bucket1", "bucket2"),
              "prefix",
              () -> S3Client.create(),
              (inputStream, key, bucketName) -> new BufferedReader(new InputStreamReader(inputStream)).lines(),
              (filename, line) -> line
         ));
         
        Type Parameters:
        I - the type of the items readFileFn produces from each object
        T - the type of the items the source emits
        Parameters:
        bucketNames - list of bucket-names
        prefix - the prefix to filter the objects. Optional, passing null will list all objects.
        clientSupplier - function which returns the S3 client to use; one client per processor instance is used
        readFileFn - the function which creates a Stream that lazily reads the contents of the object's InputStream; it also receives the object key and the bucket name
        mapFn - the function which creates the output object from each item; it receives the object name and the item as parameters
        Since:
        Jet 4.3
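        For this overload, readFileFn additionally receives the object key and the bucket name, so parsing can vary per object. Here is a self-contained sketch with a local TriFunction stand-in for Jet's interface; prefixing each line with its origin is an illustrative choice, not prescribed by the API.

        ```java
        import java.io.BufferedReader;
        import java.io.ByteArrayInputStream;
        import java.io.InputStream;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;
        import java.util.stream.Stream;

        public class TriReadFileFnSketch {
            // Local stand-in for Jet's TriFunction functional interface.
            interface TriFunction<A, B, C, R> { R apply(A a, B b, C c); }

            // This readFileFn variant also gets the object key and bucket name, so
            // the produced items can carry their origin.
            static final TriFunction<InputStream, String, String, Stream<String>> READ_FILE_FN =
                    (in, key, bucketName) -> new BufferedReader(
                            new InputStreamReader(in, StandardCharsets.UTF_8))
                            .lines()
                            .map(line -> bucketName + "/" + key + ": " + line);

            public static void main(String[] args) {
                InputStream in = new ByteArrayInputStream("x".getBytes(StandardCharsets.UTF_8));
                System.out.println(READ_FILE_FN.apply(in, "data.txt", "bucket1").findFirst().orElse(""));
                // bucket1/data.txt: x
            }
        }
        ```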