public class KafkaSource extends Object implements TableSource<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>, ReadableSource<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>>
The values retrieved from Kafka are returned as raw bytes wrapped in a BytesWritable. If callers need parsing logic that is specific to a topic, they are encouraged to create a separate KafkaSource per topic and use a dedicated DoFn to parse each payload.
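As a sketch of the kind of connection properties such a source expects, the snippet below builds a plain `java.util.Properties` object. The broker address and group id are hypothetical placeholders, not values from this page; the property names are standard Kafka consumer settings.

```java
import java.util.Properties;

public class KafkaSourceProps {
    // Builds a minimal set of Kafka connection properties. The property
    // names are standard Kafka consumer settings; the values are placeholders.
    static Properties connectionProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092");
        props.setProperty("group.id", "crunch-kafka-example");
        // Any key.deserializer / value.deserializer entries would not be
        // honored: KafkaSource always wraps payloads with its own
        // BytesDeserializer (see the constructor notes on
        // ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG).
        return props;
    }

    public static void main(String[] args) {
        System.out.println(connectionProps().getProperty("bootstrap.servers"));
    }
}
```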
Modifier and Type | Class and Description
---|---
static class | KafkaSource.BytesDeserializer: Basic Deserializer which simply wraps the payload as a BytesWritable.
Modifier and Type | Field and Description
---|---
static long | CONSUMER_POLL_TIMEOUT_DEFAULT: Default timeout value for CONSUMER_POLL_TIMEOUT_KEY of 1 second.
static String | CONSUMER_POLL_TIMEOUT_KEY: Constant to indicate how long the reader waits before timing out when retrieving data from Kafka.
Constructor and Description
---
KafkaSource(Properties kafkaConnectionProperties, Map<org.apache.kafka.common.TopicPartition,Pair<Long,Long>> offsets): Constructs a Kafka source that will read data from the Kafka cluster identified by the kafkaConnectionProperties and from the specific topics and partitions identified in the offsets.
Modifier and Type | Method and Description
---|---
ReadableData<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> | asReadable()
void | configureSource(org.apache.hadoop.mapreduce.Job job, int inputId): Configure the given job to use this source as an input.
Converter<?,?,?,?> | getConverter(): Returns the Converter used for mapping the inputs from this instance into PCollection or PTable values.
long | getLastModifiedAt(org.apache.hadoop.conf.Configuration configuration): Returns the time (in milliseconds) that this Source was most recently modified (e.g., because an input file was edited or new files were added to a directory).
long | getSize(org.apache.hadoop.conf.Configuration configuration): Returns the number of bytes in this Source.
PTableType<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable> | getTableType()
PType<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> | getType(): Returns the PType for this source.
Source<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> | inputConf(String key, String value): Adds the given key-value pair to the Configuration instance that is used to read this Source<T>.
Iterable<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> | read(org.apache.hadoop.conf.Configuration conf): Returns an Iterable that contains the contents of this source.
String | toString()
public static final String CONSUMER_POLL_TIMEOUT_KEY

Constant to indicate how long the reader waits before timing out when retrieving data from Kafka.

public static final long CONSUMER_POLL_TIMEOUT_DEFAULT

Default timeout value for CONSUMER_POLL_TIMEOUT_KEY of 1 second.

public KafkaSource(Properties kafkaConnectionProperties, Map<org.apache.kafka.common.TopicPartition,Pair<Long,Long>> offsets)
Constructs a Kafka source that will read data from the Kafka cluster identified by the kafkaConnectionProperties and from the specific topics and partitions identified in the offsets.

Parameters:
kafkaConnectionProperties - The connection properties for reading from Kafka. These properties will be honored, with the exception of ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG and ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG.
offsets - A map of TopicPartition to a pair of start and end offsets, respectively. The start and end offsets are evaluated as [start, end), where the ending offset is excluded. Each TopicPartition must have a non-null pair describing its offsets. The start offset should be less than the end offset; if the values are equal, or start is greater than end, that partition will be skipped.

public Source<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> inputConf(String key, String value)
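The [start, end) rule described for the offsets parameter can be sketched in plain Java, with no Kafka dependencies. `isReadable` and `recordCount` are illustrative helpers, not part of the KafkaSource API.

```java
public class OffsetRangeRule {
    // A partition is read only when start < end; the end offset itself is
    // excluded. Equal offsets, or start > end, mean the partition is skipped.
    static boolean isReadable(long start, long end) {
        return start < end;
    }

    // Number of records consumed from a [start, end) range; 0 if skipped.
    static long recordCount(long start, long end) {
        return isReadable(start, end) ? end - start : 0L;
    }

    public static void main(String[] args) {
        System.out.println(isReadable(0L, 100L));  // true: offsets 0..99 are read
        System.out.println(isReadable(50L, 50L));  // false: equal offsets, skipped
        System.out.println(isReadable(75L, 10L));  // false: start > end, skipped
    }
}
```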
Description copied from interface: Source

Adds the given key-value pair to the Configuration instance that is used to read this Source<T>. Allows for multiple inputs to re-use the same config keys with different values when necessary.

public PType<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> getType()
Description copied from interface: Source

Returns the PType for this source.

public Converter<?,?,?,?> getConverter()
Description copied from interface: Source

Returns the Converter used for mapping the inputs from this instance into PCollection or PTable values.

Specified by: getConverter in interface Source<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>>
public PTableType<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable> getTableType()
Specified by: getTableType in interface TableSource<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>
public long getSize(org.apache.hadoop.conf.Configuration configuration)
Description copied from interface: Source

Returns the number of bytes in this Source.

public long getLastModifiedAt(org.apache.hadoop.conf.Configuration configuration)
Description copied from interface: Source

Returns the time (in milliseconds) that this Source was most recently modified (e.g., because an input file was edited or new files were added to a directory).

Specified by: getLastModifiedAt in interface Source<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>>
public Iterable<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> read(org.apache.hadoop.conf.Configuration conf) throws IOException
Description copied from interface: ReadableSource

Returns an Iterable that contains the contents of this source.

Specified by: read in interface ReadableSource<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>>

Parameters:
conf - The current Configuration instance

Returns:
the contents of this Source as an Iterable instance

Throws:
IOException
public void configureSource(org.apache.hadoop.mapreduce.Job job, int inputId) throws IOException
Description copied from interface: Source

Configure the given job to use this source as an input.

Specified by: configureSource in interface Source<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>>

Parameters:
job - The job to configure
inputId - For a multi-input job, an identifier for this input to the job

Throws:
IOException
public ReadableData<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>> asReadable()
Specified by: asReadable in interface ReadableSource<Pair<org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable>>

Returns:
a ReadableData instance containing the data referenced by this ReadableSource.

Copyright © 2017 The Apache Software Foundation. All rights reserved.