The design for this API has not been finalized. While much of the API has been implemented, it is subject to major changes until the 0.6 release.


Datasources

Pydra does not provide a distributed datasource. Instead Pydra provides a datasource API. Datasources are a generic interface to storage mechanisms that can be read from or written to. They encapsulate the connection and translation of the data. Tasks may use them as both input and output.

Datasources are required for working with ParallelTask and MapReduceTask because the workflow expects workunits to be generated through this common interface. Tasks can use ListDataSource to generate a datasource from a list of data.

Benefits of datasources

Limiting Serialization/Deserialization

Any data that is sent through the cluster as an argument or return value must be serialized before it is sent, and deserialized when it is received. In some cases this happens several times. This is a very costly operation.

Data used by or resulting from a task should almost always avoid serialization, especially when dealing with large volumes of data. This is the primary reason for using a datasource instead of passing data as arguments or return values.
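As a rough illustration of the cost difference, the sketch below compares the serialized size of a dataset passed as an argument with the size of a small key that merely identifies it. The record layout and the key tuple are invented for this example and are not part of the Pydra API.

```python
import pickle

# Hypothetical dataset that a task might otherwise receive as an argument.
records = [{"id": i, "payload": "x" * 100} for i in range(10_000)]

# Passing the data itself: the whole list is serialized for every hop.
as_argument = pickle.dumps(records)

# Passing a key instead: only a short description crosses the cluster.
# The (name, start, stop) layout is an assumption made for this sketch.
as_key = pickle.dumps(("records", 0, 10_000))

print(len(as_argument), "bytes as an argument")
print(len(as_key), "bytes as a key")
```

The gap grows with the size of the data, while the key stays roughly constant.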

Managing Datasources


Scheduled as ticket #128.

Pydra can optionally manage a collection of datasources associated with a task. Managed datasources have their connections maintained between workunits to reduce connection overhead.

Backends

A backend is a wrapper around a specific type of storage system. By itself a backend does nothing other than connect to and disconnect from a distributed data storage medium. Backends are separate from datasources so that a user can choose to interact with the storage system in a custom way, while still allowing Pydra to manage connections.

Functions

  • Connect
  • Disconnect
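A minimal sketch of what a backend with these two functions might look like, using the standard library's sqlite3 module as a stand-in storage system. The class name, attribute names, and method names are assumptions made for illustration; the actual Pydra backend interface may differ.

```python
import sqlite3

class SQLiteBackend:
    """Hypothetical backend: it only opens and closes a connection."""

    def __init__(self, path):
        self.path = path
        self.connection = None

    def connect(self):
        # Reuse an open connection so repeated calls stay cheap.
        if self.connection is None:
            self.connection = sqlite3.connect(self.path)
        return self.connection

    def disconnect(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None

backend = SQLiteBackend(":memory:")
conn = backend.connect()

# The caller is free to use the raw connection however it likes.
conn.execute("CREATE TABLE t (v INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
rows = conn.execute("SELECT v FROM t").fetchall()

backend.disconnect()
```

Keeping the backend this thin lets the caller work with the raw connection directly, while whatever owns the backend decides when to open and close it.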

Available Backends

Targeted Storage Systems

  • CouchDB
  • Django ORM
  • Network Share - extension of filesystem that mounts a network share
  • Memcached
  • Memory - read from arguments passed into task, output to dictionary returned by task.

Slicer

A slicer is a class that takes raw data from a backend and translates it into workunits. Slicers are used to transform your unique data structures into a list of work that Pydra can understand.

Slicers are both iterable and picklable. Workunits are generated by iterating over the slicer, but only a key identifying the workunit is sent with the work request. When the worker receives the work request, it must retrieve the workunit data from its local instance of the datasource.
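The key-only exchange described above can be sketched as follows. Everything here is hypothetical: the STORAGE dictionary stands in for a storage system reachable from every node, and KeyedSlicer with its load() method is an invented name, not the Pydra slicer interface. For simplicity the worker builds its own slicer from the same description instead of receiving a pickled one.

```python
import pickle

# Stand-in for storage that every node can reach (hypothetical).
STORAGE = {"words": ["alpha", "beta", "gamma"]}

class KeyedSlicer:
    """Hypothetical slicer: iterating yields keys, load() resolves a key."""

    def __init__(self, name):
        # Only this small description identifies the data source.
        self.name = name

    def __iter__(self):
        # Yield keys (list indexes here), never the workunit data itself.
        return iter(range(len(STORAGE[self.name])))

    def load(self, key):
        # Resolve a key against the locally reachable storage.
        return STORAGE[self.name][key]

master_slicer = KeyedSlicer("words")
worker_slicer = KeyedSlicer("words")  # the worker's local instance

# Only the serialized key travels with each work request.
work_requests = [pickle.dumps(key) for key in master_slicer]

# The worker looks the data up itself.
workunits = [worker_slicer.load(pickle.loads(r)) for r in work_requests]
print(workunits)
```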

Types of Slicers

Backend Specific Slicers

These slicers only work with certain backends, because many backends behave radically differently. For instance, a SQL backend expects a SQL query, while a FileSystemBackend expects a directory path.

Subslicers

These slicers use another Slicer, usually a backend specific slicer, as an input. They are used to further slice data.

slicer_a = SQLSlicer('select blob from my_table')
slicer_b = LineSlicer(slicer_a)

Keys returned from a subslicer are a composite of the keys of the two slicers, because the key must identify the position at both levels.
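One way to picture composite keys is the sketch below, in which an in-memory list of text blobs replaces the SQL query. BlobSlicer, LineSubslicer, and the load() method are hypothetical names chosen for this example and do not describe the real Pydra classes.

```python
class BlobSlicer:
    """Hypothetical stand-in for a backend specific slicer; keys are row indexes."""

    def __init__(self, blobs):
        self.blobs = blobs

    def __iter__(self):
        return iter(range(len(self.blobs)))

    def load(self, key):
        return self.blobs[key]

class LineSubslicer:
    """Hypothetical subslicer that splits each parent workunit into lines."""

    def __init__(self, parent):
        self.parent = parent

    def __iter__(self):
        for parent_key in self.parent:
            lines = self.parent.load(parent_key).splitlines()
            for line_number in range(len(lines)):
                # Composite key: position in the parent, then position in the blob.
                yield (parent_key, line_number)

    def load(self, key):
        parent_key, line_number = key
        return self.parent.load(parent_key).splitlines()[line_number]

slicer_a = BlobSlicer(["a\nb", "c\nd\ne"])
slicer_b = LineSubslicer(slicer_a)

keys = list(slicer_b)
print(keys)
print(slicer_b.load((1, 2)))
```

Each key is a pair, so a worker holding only the key can first locate the blob and then the line within it.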

Slicer Performance

Slicers should be designed to quickly retrieve a value given a key. For example, the FileSlicer keys include an offset in the file where the workunit is located. This allows it to seek directly to the workunit data, rather than iterating through lines in a file.
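The offset technique can be sketched as follows. This is not the FileSlicer implementation; OffsetLineSlicer and its load() method are hypothetical, and the sketch assumes one workunit per line with byte offsets as keys.

```python
import os
import tempfile

class OffsetLineSlicer:
    """Hypothetical slicer whose keys are byte offsets of lines in a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Binary mode keeps offsets exact regardless of platform newlines.
        with open(self.path, "rb") as handle:
            offset = 0
            for line in handle:
                yield offset
                offset += len(line)

    def load(self, offset):
        # Seek straight to the workunit instead of scanning from the start.
        with open(self.path, "rb") as handle:
            handle.seek(offset)
            return handle.readline().rstrip(b"\n").decode("utf-8")

# Build a small sample file for the demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
    path = tmp.name

slicer = OffsetLineSlicer(path)
keys = list(slicer)
third = slicer.load(keys[2])
os.remove(path)

print(keys, third)
```

Generating the keys still takes one pass over the file, but each later lookup is a single seek, so the cost of retrieving a workunit does not depend on its position in the file.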