The design for this API has not been finalized. While much of the API has been implemented, it is subject to major changes until the 0.6 release.
Datasources
Pydra does not provide a distributed datasource. Instead, Pydra provides a datasource API. Datasources are a generic interface to storage mechanisms that can be read from or written to. They encapsulate both the connection to the storage and the translation of its data. Tasks may use them as both input and output.
Datasources are required for working with ParallelTask and MapReduceTask because the workflow expects workunits to be generated through this common interface. Tasks can use the ListDataSource to generate a datasource from a list of data.
Benefits of datasources
Limiting Serialization/Deserialization
Any data that is sent through the cluster as an argument or return value must be serialized before it is sent, and deserialized when it is received. In some cases this happens several times. This is a very costly operation.
Data used by or resulting from a task should almost always avoid serialization. This is especially true when dealing with large volumes of data. This is the primary reason for using a datasource instead of passing data as arguments or return values.
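The difference in cost can be illustrated with a rough sketch. It is not Pydra code: the row layout and the (table, start, stop) key are assumptions made up for the comparison, and only the standard pickle module is used.

```python
import pickle

# Hypothetical payload: 10,000 rows that a task needs to process.
rows = [{"id": i, "value": f"row-{i}"} for i in range(10_000)]

# Passing the rows as an argument means serializing the whole payload.
payload_bytes = len(pickle.dumps(rows))

# With a datasource, only a small key describing the workunit is sent.
key = ("my_table", 0, 10_000)  # hypothetical (table, start, stop) key
key_bytes = len(pickle.dumps(key))

print(payload_bytes, key_bytes)
```

The key stays a few dozen bytes no matter how many rows it describes, while the serialized payload grows with the data.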
Managing Datasources
Scheduled as ticket #128.
Pydra can optionally manage a collection of datasources associated with a task. Managed datasources have their connections maintained between workunits to reduce connection overhead.
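The effect of keeping a connection open between workunits can be sketched as follows. Everything here is hypothetical: CountingSource, run_unmanaged and run_managed are illustration names, not part of Pydra, and the feature itself is still only scheduled.

```python
class CountingSource:
    """Hypothetical datasource that counts how often it connects."""

    def __init__(self):
        self.connects = 0
        self.connected = False

    def connect(self):
        if not self.connected:
            self.connects += 1
            self.connected = True

    def disconnect(self):
        self.connected = False

    def read(self, key):
        self.connect()
        return key * 2  # stand-in for fetching real data


def run_unmanaged(source, keys):
    # Each workunit opens and closes its own connection.
    results = []
    for key in keys:
        results.append(source.read(key))
        source.disconnect()
    return results


def run_managed(source, keys):
    # The connection stays open between workunits and is closed once.
    try:
        return [source.read(key) for key in keys]
    finally:
        source.disconnect()


unmanaged = CountingSource()
managed = CountingSource()
run_unmanaged(unmanaged, [1, 2, 3])
run_managed(managed, [1, 2, 3])
print(unmanaged.connects, managed.connects)  # prints "3 1"
```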
Backends
A backend is a wrapper around a specific type of storage system. By itself a backend does nothing other than connect to and disconnect from a distributed data storage medium. Backends are separate from datasources so that a user can choose to interact with the storage system in a custom way, while still allowing Pydra to manage connections.
Functions
- Connect
- Disconnect
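A backend limited to these two functions might look like the sketch below. SQLiteBackend and its attributes are assumptions for illustration, not the actual Pydra classes. The sketch uses the standard sqlite3 module so that it runs without a database server.

```python
import sqlite3


class SQLiteBackend:
    """Hypothetical backend: it only opens and closes a connection."""

    def __init__(self, path):
        self.path = path
        self.connection = None

    def connect(self):
        # Return the open connection if there is one, else create it.
        if self.connection is None:
            self.connection = sqlite3.connect(self.path)
        return self.connection

    def disconnect(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None


backend = SQLiteBackend(":memory:")
conn = backend.connect()

# The caller is free to use the raw connection in a custom way.
conn.execute("CREATE TABLE my_table (payload TEXT)")
conn.execute("INSERT INTO my_table VALUES ('a')")
row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

same_connection = backend.connect() is conn
backend.disconnect()
```

Because the backend knows nothing about queries or workunits, the same connection handling could sit under a slicer or under hand-written code.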
Available Backends
- SQL
- FileSystem
Targeted Storage Systems
- CouchDB
- Django ORM
- Network Share - extension of filesystem that mounts a network share
- Memcached
- Memory - reads from the arguments passed into the task and writes output to the dictionary returned by the task.
Slicer
A slicer is a class that takes raw data from a backend and translates it into workunits. Slicers are used to transform your unique data structures into a list of work that Pydra can understand.
Slicers are both iterable and picklable. Workunits are generated by iterating over the slicer, but only a key identifying the workunit is sent with the work request. When the Worker receives the work request, it must retrieve the workunit data from its local instance of the datasource.
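The key-only flow described above can be sketched with a small in-memory slicer. ListSlicer and its load method are hypothetical names chosen for this example; the real slicer interface may differ.

```python
import pickle


class ListSlicer:
    """Hypothetical slicer over an in-memory list; keys are indexes."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Iterating yields keys, not the workunit data itself.
        return iter(range(len(self.data)))

    def load(self, key):
        # Called on the worker to fetch the workunit for a key.
        return self.data[key]


master_slicer = ListSlicer(["alpha", "beta", "gamma"])

# The slicer is pickled once so the worker holds a local instance.
worker_slicer = pickle.loads(pickle.dumps(master_slicer))

# Only keys travel with work requests; the worker looks up data locally.
results = [worker_slicer.load(key).upper() for key in master_slicer]
print(results)  # ['ALPHA', 'BETA', 'GAMMA']
```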
Types of Slicers
Backend Specific Slicers
These slicers only work with certain backends because backends behave radically differently. For instance, a SQL backend expects a SQL query, while a FileSystemBackend expects a directory path.
Subslicers
These slicers use another Slicer, usually a backend specific slicer, as an input. They are used to further slice data.
slicer_a = SQLSlicer('select blob from my_table')
slicer_b = LineSlicer(slicer_a)
Keys returned from a subslicer are a composite of the keys of the two slicers because the key must identify the position at both levels.
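A composite key can be sketched as a tuple holding the parent key followed by the position inside the parent workunit. BlobSlicer and LineSubslicer below are hypothetical stand-ins for a backend specific slicer and a line-based subslicer, not the real SQLSlicer and LineSlicer.

```python
class BlobSlicer:
    """Hypothetical stand-in for a backend specific slicer."""

    def __init__(self, blobs):
        self.blobs = blobs

    def __iter__(self):
        return iter(range(len(self.blobs)))

    def load(self, key):
        return self.blobs[key]


class LineSubslicer:
    """Hypothetical subslicer that splits each parent workunit into lines."""

    def __init__(self, parent):
        self.parent = parent

    def __iter__(self):
        for parent_key in self.parent:
            lines = self.parent.load(parent_key).splitlines()
            for line_number in range(len(lines)):
                # Composite key: position in the parent, then in the blob.
                yield (parent_key, line_number)

    def load(self, key):
        parent_key, line_number = key
        return self.parent.load(parent_key).splitlines()[line_number]


slicer_a = BlobSlicer(["a\nb", "c"])
slicer_b = LineSubslicer(slicer_a)
keys = list(slicer_b)
print(keys)  # [(0, 0), (0, 1), (1, 0)]
print([slicer_b.load(key) for key in keys])  # ['a', 'b', 'c']
```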
Slicer Performance
Slicers should be designed to quickly retrieve a value given a key. For example, FileSlicer keys include the offset in the file where the workunit is located. This allows the slicer to seek directly to the workunit data rather than iterating through the lines of the file.
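The offset technique can be sketched as follows. LineOffsetSlicer is a hypothetical class written for this example and is not the actual FileSlicer. It assumes one workunit per line and uses the byte offset of each line as its key.

```python
import os
import tempfile


class LineOffsetSlicer:
    """Hypothetical file slicer whose keys are byte offsets of lines."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # One pass over the file records where each line starts.
        with open(self.path, "rb") as handle:
            offset = handle.tell()
            line = handle.readline()
            while line:
                yield offset
                offset = handle.tell()
                line = handle.readline()

    def load(self, offset):
        # Seek straight to the workunit instead of scanning from the top.
        with open(self.path, "rb") as handle:
            handle.seek(offset)
            return handle.readline().rstrip(b"\n").decode("utf-8")


with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\nthird\n")
    path = tmp.name

slicer = LineOffsetSlicer(path)
offsets = list(slicer)
last_line = slicer.load(offsets[-1])
os.remove(path)
print(offsets, last_line)  # [0, 6, 13] third
```

Retrieving a workunit costs one seek and one read regardless of how far into the file it sits.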
