RegistryObject
A proposal for splitting the duties of `GangaObject` into two classes:
- one which is a standard in-memory object with a `_data`-backed store (`GangaObject`)
- one which is stored in a registry and stores all its data in the registry (`RegistryObject`)
In all of Ganga, only a very small number of object types are ever stored in registries (ignoring the box registry for now): `Job`, `Task`, and `ShareRef`, and of those only `Job` is ever lazily loaded. Nonetheless, every `GangaObject` needs to worry about both lazy loading (`_index_cache` etc.) and registry membership. We propose simplifying this by creating a subclass of `GangaObject` which would know how to deal with these things, so that `GangaObject` (particularly its descriptor getters and setters) could be simplified to standard `_dict` access.
`GangaObject`s would never be lazy-loaded and would always have all their data stored in `_data`, as is the case for most objects at the moment. A `RegistryObject` would be a very thin wrapper object which would only store an `id` and a `_registry` attribute. Any call to, for example, `j.status` would be redirected to something like `self._registry.get_attribute(self.id, 'status')`. It would then be the responsibility of the registry to decide how to implement `get_attribute()`. For example, `JobRegistry` would try to get the information from the cache in preference to loading the object fully, but `PrepRegistry` would always get the session lock and load the XML from disk.
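The redirection described above can be sketched roughly as follows. This is a minimal illustration, not Ganga code: the `Registry` class here is a hypothetical dictionary-backed stand-in, `set_attribute()` is an assumed counterpart to `get_attribute()`, and `__getattr__`/`__setattr__` are used for brevity where the proposal itself intends schema descriptors.

```python
class Registry:
    """Hypothetical minimal registry: owns the data for every object it holds."""

    def __init__(self):
        self._rows = {}  # object id -> attribute dictionary

    def add(self, obj_id, data):
        self._rows[obj_id] = dict(data)

    def get_attribute(self, obj_id, name):
        return self._rows[obj_id][name]

    def set_attribute(self, obj_id, name, value):  # assumed counterpart
        self._rows[obj_id][name] = value


class RegistryObject:
    """Thin wrapper that stores only an id and a reference to its registry."""

    def __init__(self, registry, obj_id):
        # Bypass our own __setattr__ for the two attributes we really store.
        object.__setattr__(self, "id", obj_id)
        object.__setattr__(self, "_registry", registry)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. not for id or _registry.
        try:
            return self._registry.get_attribute(self.id, name)
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self._registry.set_attribute(self.id, name, value)


jobs = Registry()
jobs.add(0, {"status": "running", "name": ""})
j = RegistryObject(jobs, 0)
print(j.status)                         # read goes through the registry
j.status = "completed"                  # so does the write
print(jobs.get_attribute(0, "status"))
```

The wrapper holds no job data of its own, so two wrappers with the same `id` and registry always see the same state.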
The registry will now be responsible for storing all the data about its items directly. So instead of having `Registry._objects` be a list of `GangaObject`s, it will be a table of data with three columns: object id, object cache and full object data. On initial creation of the registry it will request from the repository the cache for each of the objects it knows about and enter them into the table. When data is requested that cannot be retrieved from the cache column, the registry will request the full object from the repository. All the repository needs to return is the data dictionary for the object (the equivalent of the current `_data` attribute), which the registry can then place in the correct cell in the table.
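As a rough sketch of that table, assuming a hypothetical in-memory repository and illustrative method names (`get_all_caches`, `get_data`), the registry could fill the cache column up front and the full-data column on demand. Letting the cache answer any key it happens to hold is one possible policy here, not a requirement of the proposal.

```python
class InMemoryRepository:
    """Hypothetical stand-in for the on-disk repository."""

    def __init__(self, stored):
        self._stored = stored  # object id -> (cache, full data dictionary)
        self.full_loads = 0    # counts expensive loads, for illustration only

    def get_all_caches(self):
        return {obj_id: cache for obj_id, (cache, _) in self._stored.items()}

    def get_data(self, obj_id):
        self.full_loads += 1
        return dict(self._stored[obj_id][1])


class Registry:
    def __init__(self, repository):
        self._repository = repository
        # The table: object id -> [cache, full data (None until loaded)].
        self._table = {
            obj_id: [cache, None]
            for obj_id, cache in repository.get_all_caches().items()
        }

    def get_attribute(self, obj_id, name):
        row = self._table[obj_id]
        cache, data = row
        if data is None:
            if name in cache:
                return cache[name]  # answered without a full load
            data = row[1] = self._repository.get_data(obj_id)
        return data[name]


repo = InMemoryRepository({
    0: ({"status": "running"},
        {"status": "running", "name": "", "comment": "test"}),
})
jobs = Registry(repo)
print(jobs.get_attribute(0, "status"), repo.full_loads)   # running 0
print(jobs.get_attribute(0, "comment"), repo.full_loads)  # test 1
```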
We then overload `RegistryObject`'s `Descriptor` (via some simple metaclass magic) so that `__get__` and `__set__` call `Registry.get_attribute()` instead of accessing `_data`. `get_attribute()` will query the data table and return the appropriate information. The registry can at this point decide whether to return info from the cache or fully load the object from disk and give that data instead.
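One way the "metaclass magic" might look is sketched below. All class names other than `Descriptor` are hypothetical, the `Descriptor` shown is a heavily simplified stand-in for Ganga's real one, and `set_attribute()` is an assumed counterpart to `get_attribute()`.

```python
class Descriptor:
    """Simplified schema descriptor: plain access to the instance's _data."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._data[self.name]

    def __set__(self, obj, value):
        obj._data[self.name] = value


class RegistryDescriptor(Descriptor):
    """Same schema attribute, but reads and writes go through the registry."""

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._registry.get_attribute(obj.id, self.name)

    def __set__(self, obj, value):
        obj._registry.set_attribute(obj.id, self.name, value)


class RegistryObjectMeta(type):
    """Swap every plain Descriptor in the class body for a RegistryDescriptor."""

    def __new__(mcs, name, bases, namespace):
        for key, value in list(namespace.items()):
            if type(value) is Descriptor:
                namespace[key] = RegistryDescriptor(value.name)
        return super().__new__(mcs, name, bases, namespace)


class DictRegistry:
    """Hypothetical registry backed by a plain dictionary."""

    def __init__(self):
        self._rows = {}

    def get_attribute(self, obj_id, name):
        return self._rows[obj_id][name]

    def set_attribute(self, obj_id, name, value):
        self._rows.setdefault(obj_id, {})[name] = value


class Job(metaclass=RegistryObjectMeta):
    status = Descriptor("status")  # declared as usual, rewritten by the metaclass

    def __init__(self, registry, obj_id):
        self.id = obj_id
        self._registry = registry


jobs = DictRegistry()
j = Job(jobs, 0)
j.status = "new"
print(j.status)
```

The schema declaration on `Job` is unchanged from what a plain `GangaObject` would write; only the storage behind it differs.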
The repository's interface can be conceptually simplified down to three methods:
- return the cache for an object
- return the full data for an object
- given an object and its cache data, write it to disk
with potentially an additional method to return all caches for all objects for efficiency on first load.
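Expressed as an abstract base class, the interface above might look something like this. The method names and signatures are assumptions made for illustration, and `DictRepository` is a toy implementation that keeps everything in memory rather than on disk.

```python
from abc import ABC, abstractmethod


class Repository(ABC):
    """Hypothetical interface; names are illustrative, not Ganga's."""

    @abstractmethod
    def get_cache(self, obj_id):
        """Return the cache for one object."""

    @abstractmethod
    def get_data(self, obj_id):
        """Return the full data dictionary for one object."""

    @abstractmethod
    def write(self, obj_id, data, cache):
        """Persist an object's full data together with its cache."""

    def get_all_caches(self, obj_ids):
        """Optional bulk read for first load; subclasses may do this faster."""
        return {obj_id: self.get_cache(obj_id) for obj_id in obj_ids}


class DictRepository(Repository):
    """Toy implementation used only to exercise the interface."""

    def __init__(self):
        self._disk = {}  # obj_id -> (cache, data)

    def get_cache(self, obj_id):
        return self._disk[obj_id][0]

    def get_data(self, obj_id):
        return self._disk[obj_id][1]

    def write(self, obj_id, data, cache):
        self._disk[obj_id] = (cache, data)


repo = DictRepository()
repo.write(0, {"status": "new", "name": "first"}, {"status": "new"})
print(repo.get_all_caches([0]))
```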
This isn't dissimilar to how it works today, but it should clarify the boundaries between the classes and allow some simplification of the code.
There are two main reasons to perform lazy-loading:
- display a summary of pertinent information in a table (e.g. typing `jobs`)
- provide a small subset of information through the standard object API (e.g. looping over jobs and doing `j.status`, or fully loading a job without fully loading all its subjobs)
Currently the information for these is stored in the index cache and is sometimes confusingly mixed together and differentiated by mangling the names. The index cache is sometimes treated as a fall-through surrogate for the `_data` dictionary, even in those cases where the types do not match. This proposal suggests that the cache is always a fundamental Python object whose interpretation is solely down to the registry in question and is not implicitly treated as a simple subset of the `_data` dictionary.
The `JobRegistry`, for example, only needs to store information for the display function and could therefore have a cache something like:
{
'status': 'running',
'name': '',
'subjobs': 10,
'application': 'Executable',
'backend': 'Dirac',
'backend.actualCE': 'LCG.RAL-LCG2.uk',
'comment': '',
}
There is no implicit mapping between the keys in this dictionary and schema attributes on the associated `RegistryObject`; it is simply information used for displaying `jobs`. Given a call of `jobs.get_attribute(j.id, 'backend')`, it would not have to use the value of `'backend'`. However, since this is a generic cache of data, `JobRegistry` would be within its rights to use `'status'` to give the return value of `jobs.get_attribute(j.id, 'status')`. It is up to the registry how it uses this data.
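To illustrate the distinction, here is a small sketch in which a hypothetical `JobRegistry` serves `'status'` from the cache but ignores the cached `'backend'` string when asked for the attribute, since the real attribute is a full object. The constructor arguments, `summary_row()`, and the shape of the full data are all assumptions made for the example.

```python
class JobRegistry:
    # Cache keys this registry chooses to serve directly as attribute values.
    _SERVED_FROM_CACHE = frozenset({"status"})

    def __init__(self, caches, load_full):
        self._caches = caches        # object id -> metadata cache
        self._load_full = load_full  # hypothetical callable: id -> full data
        self._loaded = {}            # object id -> full data, once loaded

    def get_attribute(self, obj_id, name):
        if obj_id not in self._loaded:
            if name in self._SERVED_FROM_CACHE:
                return self._caches[obj_id][name]
            self._loaded[obj_id] = self._load_full(obj_id)
        return self._loaded[obj_id][name]

    def summary_row(self, obj_id):
        # The display code reads the cache on its own terms.
        cache = self._caches[obj_id]
        return (obj_id, cache["status"], cache["application"], cache["backend"])


caches = {0: {"status": "running", "application": "Executable", "backend": "Dirac"}}
full = {0: {"status": "running",
            "backend": {"type": "Dirac", "actualCE": "LCG.RAL-LCG2.uk"}}}
jobs = JobRegistry(caches, load_full=lambda obj_id: full[obj_id])

print(jobs.summary_row(0))               # uses the cached 'backend' string
print(jobs.get_attribute(0, "status"))   # served from the cache
print(jobs.get_attribute(0, "backend"))  # triggers a full load instead
```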
Other registries may use this cache in other ways as they see fit, so we propose referring to this cache generically as an "object metadata cache". Its purpose is to (depending on the registry in question) get "some" information about an object without a potentially expensive full load of it.
- The box registry: Since we've reduced the types that can be stored in registries, the box will need to be rewritten somewhat. It needs some thought but should not be allowed to prevent progress in other, more important areas.
- Subjobs: To first order this will continue to work as it does now, but it opens up possibilities in the future for harmonising the job registry and `SubJobXMLList`.