DistributedPipeline (Apache Crunch 0.11.0 API)

This project has retired. For details please refer to its Attic page.

Overview

Package

Class

Use

Tree

Deprecated

Index

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.crunch.impl.dist
Class DistributedPipeline

java.lang.Object
  org.apache.crunch.impl.dist.DistributedPipeline

All Implemented Interfaces:: Pipeline

Direct Known Subclasses:: MRPipeline, SparkPipeline

public abstract class DistributedPipeline
extends Object
implements Pipeline
extends Object
implements Pipeline

Constructor Summary

DistributedPipeline(String name, org.apache.hadoop.conf.Configuration conf, PCollectionFactory factory)
          Instantiate with a custom name and configuration.

Method Summary

void cleanup(boolean force)
          Cleans up any artifacts created as a result of running the pipeline.

<T> SourceTarget<T> createIntermediateOutput(PType<T> ptype)


org.apache.hadoop.fs.Path createTempPath()


PipelineResult done()
          Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.

<S> PCollection<S> emptyPCollection(PType<S> ptype)
          Creates an empty PCollection of the given PType.

<K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
          Creates an empty PTable of the given PTable Type.

void enableDebug()
          Turn on debug logging for jobs that are run from this pipeline.

org.apache.hadoop.conf.Configuration getConfiguration()
          Returns the Configuration instance associated with this pipeline.

PCollectionFactory getFactory()


<T> ReadableSource<T> getMaterializeSourceTarget(PCollection<T> pcollection)
          Retrieve a ReadableSourceTarget that provides access to the contents of a PCollection.

String getName()
          Returns the name of this pipeline.

int getNextAnonymousStageId()


<S> PCollection<S> read(Source<S> source)
          Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.

<K,V> PTable<K,V> read(TableSource<K,V> source)
          A version of the read method for TableSource instances that map to PTables.

PCollection<String> readTextFile(String pathName)
          A convenience method for reading a text file.

<Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
          Executes the given PipelineCallable on the client after the Targets that the PipelineCallable depends on (if any) have been created by other pipeline processing steps.

void setConfiguration(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration to use with this pipeline.

void write(PCollection<?> pcollection, Target target)
          Write the given collection to the given target on the next pipeline run.

void write(PCollection<?> pcollection, Target target, Target.WriteMode writeMode)
          Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.

<T> void writeTextFile(PCollection<T> pcollection, String pathName)
          A convenience method for writing a text file.

Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.crunch.Pipeline

cache, materialize, run, runAsync

Constructor Summary
`DistributedPipeline(String name, org.apache.hadoop.conf.Configuration conf, PCollectionFactory factory)` Instantiate with a custom name and configuration.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Methods inherited from interface org.apache.crunch.Pipeline
`cache, materialize, run, runAsync`

Constructor Detail

DistributedPipeline

public DistributedPipeline(String name,
                           org.apache.hadoop.conf.Configuration conf,
                           PCollectionFactory factory)

Instantiate with a custom name and configuration.

Parameters:: name - Display name of the pipeline; conf - Configuration to be used within all MapReduce jobs run in the pipeline

Method Detail

getFactory

public PCollectionFactory getFactory()

getConfiguration

public org.apache.hadoop.conf.Configuration getConfiguration()

Description copied from interface: Pipeline

Returns the Configuration instance associated with this pipeline.

Specified by:: getConfiguration in interface Pipeline

setConfiguration

public void setConfiguration(org.apache.hadoop.conf.Configuration conf)

Description copied from interface: Pipeline

Set the Configuration to use with this pipeline.

Specified by:: setConfiguration in interface Pipeline

done

public PipelineResult done()

Description copied from interface: Pipeline

Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.

Specified by:: done in interface Pipeline

sequentialDo

public <Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)

Description copied from interface: Pipeline

Executes the given PipelineCallable on the client after the Targets that the PipelineCallable depends on (if any) have been created by other pipeline processing steps.

Specified by:: sequentialDo in interface Pipeline

Type Parameters:: Output - The return type of the PipelineCallable
Parameters:: pipelineCallable - The sequential logic to execute
Returns:: The result of executing the PipelineCallable

read

public <S> PCollection<S> read(Source<S> source)

Description copied from interface: Pipeline

Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.

Specified by:: read in interface Pipeline

Parameters:: source - The source of data
Returns:: A PCollection that references the given source

read

public <K,V> PTable<K,V> read(TableSource<K,V> source)

Description copied from interface: Pipeline

A version of the read method for TableSource instances that map to PTables.

Specified by:: read in interface Pipeline

Parameters:: source - The source of the data
Returns:: A PTable that references the given source

readTextFile

public PCollection<String> readTextFile(String pathName)

Description copied from interface: Pipeline

A convenience method for reading a text file.

Specified by:: readTextFile in interface Pipeline

write

public void write(PCollection<?> pcollection,
                  Target target)

Description copied from interface: Pipeline

Write the given collection to the given target on the next pipeline run. The system will check to see if the target's location already exists using the WriteMode.DEFAULT rule for the given Target.

Specified by:: write in interface Pipeline

Parameters:: pcollection - The collection; target - The output target

write

public void write(PCollection<?> pcollection,
                  Target target,
                  Target.WriteMode writeMode)

Description copied from interface: Pipeline

Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.

Specified by:: write in interface Pipeline

Parameters:: pcollection - The collection; target - The target to write to; writeMode - The strategy to use for handling existing outputs

emptyPCollection

public <S> PCollection<S> emptyPCollection(PType<S> ptype)

Description copied from interface: Pipeline

Creates an empty PCollection of the given PType.

Specified by:: emptyPCollection in interface Pipeline

Parameters:: ptype - The PType of the empty PCollection
Returns:: A valid PCollection with no contents

emptyPTable

public <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)

Description copied from interface: Pipeline

Creates an empty PTable of the given PTable Type.

Specified by:: emptyPTable in interface Pipeline

Parameters:: ptype - The PTableType of the empty PTable
Returns:: A valid PTable with no contents

getMaterializeSourceTarget

public <T> ReadableSource<T> getMaterializeSourceTarget(PCollection<T> pcollection)

Retrieve a ReadableSourceTarget that provides access to the contents of a PCollection. This is primarily intended as a helper method to Pipeline.materialize(PCollection). The underlying data of the ReadableSourceTarget may not be actually present until the pipeline is run.

Parameters:: pcollection - The collection for which the ReadableSourceTarget is to be retrieved
Returns:: The ReadableSourceTarget
Throws:: IllegalArgumentException - If no ReadableSourceTarget can be retrieved for the given PCollection

createIntermediateOutput

public <T> SourceTarget<T> createIntermediateOutput(PType<T> ptype)

createTempPath

public org.apache.hadoop.fs.Path createTempPath()

writeTextFile

public <T> void writeTextFile(PCollection<T> pcollection,
                              String pathName)

Description copied from interface: Pipeline

A convenience method for writing a text file.

Specified by:: writeTextFile in interface Pipeline

cleanup

public void cleanup(boolean force)

Description copied from interface: Pipeline

Cleans up any artifacts created as a result of running the pipeline.

Specified by:: cleanup in interface Pipeline

Parameters:: force - forces the cleanup even if all targets of the pipeline have not been completed.

getNextAnonymousStageId

public int getNextAnonymousStageId()

enableDebug

public void enableDebug()

Description copied from interface: Pipeline

Turn on debug logging for jobs that are run from this pipeline.

Specified by:: enableDebug in interface Pipeline

getName

public String getName()

Description copied from interface: Pipeline

Returns the name of this pipeline.

Specified by:: getName in interface Pipeline

Returns:: Name of the pipeline

Overview

Package

Class

Use

Tree

Deprecated

Index

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.crunch.impl.dist Class DistributedPipeline

DistributedPipeline

getFactory

getConfiguration

setConfiguration

done

sequentialDo

read

read

readTextFile

write

write

emptyPCollection

emptyPTable

getMaterializeSourceTarget

createIntermediateOutput

createTempPath

writeTextFile

cleanup

getNextAnonymousStageId

enableDebug

getName

org.apache.crunch.impl.dist
Class DistributedPipeline