Attention
This is no longer maintained and has been superseded by datajoint-company/datajoint-docs. Please file new issues there (or help contribute!). We are currently migrating and generating new content until December 2022, after which we’ll be decommissioning https://docs.datajoint.org and https://tutorials.datajoint.org in favor of https://datajoint.com/docs/.
External Store¶
DataJoint organizes most of its data in a relational database.
Relational databases excel at representing relationships between entities and storing structured data.
However, relational databases are not particularly well-suited for storing large continuous chunks of data such as images, signals, and movies.
An attribute of type `longblob` can contain an object up to 4 GiB in size (after compression), but storing many such large objects may hamper the performance of queries on the entire table.
A good rule of thumb is that objects over 10 MiB in size should not be put in the relational database.
In addition, storing data in cloud-hosted relational databases (e.g. AWS RDS) may be more expensive than in cloud-hosted simple storage systems (e.g. AWS S3).
DataJoint introduces a new datatype, `external`, to store large data objects within its relational framework.
An attribute of type `external` is defined with the same syntax and, from the user’s perspective, works the same way as a `longblob` attribute.
However, its data are stored in an external storage system rather than in the relational database.
Various systems can play the role of external storage, including a shared file system accessible to all team members who have access to these objects, or a cloud storage solution such as AWS S3.
For example, the following table stores motion-aligned two-photon movies.
```
# Motion aligned movies
-> twophoton.Scan
---
aligned_movie : external  # motion-aligned movie
```
All insert and fetch operations work identically for `external` attributes as they do for blob attributes, with the same serialization protocol.
Like blobs, `external` attributes cannot be used in restriction conditions.
Multiple external storage configurations may be used simultaneously. In this case, the attribute declaration includes the specific external storage name:
```
# Motion aligned movies
-> twophoton.Scan
---
aligned_movie : external-raw  # motion-aligned movie
```
Principles of operation¶
External storage is organized to emulate individual attribute values in the relational database. DataJoint organizes external storage to preserve the same data integrity principles as in relational storage.
The external storage locations are specified in the DataJoint connection configuration, with one specification for each store.
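As an illustration, the per-store settings might be organized as in the following sketch. The key names (`protocol`, `location`, `account`, `token`) mirror the settings described in this section, but the exact structure is an assumption to check against your DataJoint version, and the paths and credentials are hypothetical:

```python
# Hypothetical sketch of per-store settings mirroring the configuration
# fields described in this section; not the literal dj.config API.
stores_config = {
    # default store, used by attributes declared as `external`
    "external": {
        "protocol": "file",            # protocol prefix, e.g. file://
        "location": "/data/external",  # root path shared by all schemas
    },
    # named store, used by attributes declared as `external-raw`
    "external-raw": {
        "protocol": "s3",                  # cloud storage, e.g. AWS S3
        "location": "s3://my-bucket/raw",  # hypothetical bucket path
        "account": "my-account",           # credentials for the store
        "token": "my-secret-token",        # (assumed key names)
    },
}
```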
Note
External storage is not yet implemented in MATLAB. The feature will be added in an upcoming release: https://github.com/datajoint/datajoint-matlab/issues/143
Each schema corresponds to a dedicated folder at the storage location with the same name as the database schema.
Stored objects are identified by the SHA-256 hashes (in web-safe base-64 ASCII) of their serialized contents. This scheme allows an object used multiple times in the same schema to be stored only once.
In the external storage, the objects are saved as files with the hash as the filename.
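This content-addressed naming can be sketched as follows. The function name `external_filename` is hypothetical; the sketch only illustrates the scheme (SHA-256 digest rendered in web-safe base-64), not DataJoint’s exact implementation:

```python
import base64
import hashlib


def external_filename(serialized: bytes) -> str:
    """Name a stored object by the web-safe base-64 SHA-256 of its contents."""
    digest = hashlib.sha256(serialized).digest()
    # urlsafe_b64encode uses `-` and `_` instead of `+` and `/`
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# identical contents always map to the same filename, so an object
# inserted multiple times in a schema is stored only once
name = external_filename(b"example movie bytes")
```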
Each database schema has an auxiliary table named `~external` for representing externally stored objects. It is automatically created the first time external storage is used. The primary key of `~external` is the external storage name and the hash. Other attributes are the `count` of references by tables in the schema, the `size` of the object in bytes, and the timestamp of the last event (creation, update, or deletion). Below are sample entries in `~external`.

~external¶

| STORAGE | HASH | count | size | timestamp |
|---------|------|-------|------|-----------|
| raw | 1GEqtEU6JYEOLS4sZHeHDxWQ3JJfLlHVZio1ga25vd2 | 3 | 1039536788 | 2017-06-07 23:14:01 |
|     | wqsKbNB1LKSX7aLEV+ACKWGr-XcB6+h6x91Wrfh9uf7 | 0 | 168849430 | 2017-06-07 22:47:58 |
Attributes of type `external` are declared as renamed foreign keys referencing the `~external` table (but are not shown as such to the user).

The insert operation first saves all the external objects in the external storage, then inserts the corresponding entities in `~external` for new data or increments the `count` for duplicates. Only then are the specified entities inserted.

The delete operation first deletes the specified entities, then decrements the `count` of the item in `~external`. Only then is the entire transaction committed, but the object is not actually deleted at this time.

The fetch operation uses the hash values to find the data. To prevent excessive network overhead, a special external store named `cache` can be configured. If the `cache` is enabled, the fetch operation need not access `~external` directly. Instead, fetch will retrieve the cached object without downloading it from the ‘real’ external store.

Cleanup is performed regularly when the database is in light use or offline. Shallow cleanup removes all objects from external storage with `count=0` in `~external`. Deep cleanup removes all objects from external storage with no entry in the `~external` table.

DataJoint never removes objects from the local cache folder. The cache folder may simply be emptied entirely on a periodic basis, or based on file access date. If dedicated cache folders are maintained for each schema, then a special procedure will be provided to remove all objects that are no longer listed in `~external`.

Data removal from external storage is separated from the delete operations to ensure that data are not lost in race conditions between inserts and deletes of the same objects, especially in cases of transactional processing or in processes that are likely to be terminated. The cleanup steps are performed in a separate process when the risk of race conditions is minimal. The process performing the cleanups must be isolated to prevent interruptions that would result in loss of data integrity.
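The insert/delete/cleanup protocol described above can be sketched with a toy in-memory model. All names here (`ExternalStore`, `insert_object`, and so on) are hypothetical and serve only to illustrate the order of operations and the reference counting; the real implementation works against the database and a file or object store:

```python
import base64
import hashlib


class ExternalStore:
    """Toy model of `~external` bookkeeping: hash -> reference count."""

    def __init__(self):
        self.objects = {}  # hash -> payload (the "files" in external storage)
        self.counts = {}   # hash -> reference count (the `count` column)

    @staticmethod
    def _hash(data: bytes) -> str:
        digest = hashlib.sha256(data).digest()
        return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

    def insert_object(self, data: bytes) -> str:
        # save the object first, then create or increment the tracking entry
        h = self._hash(data)
        self.objects[h] = data
        self.counts[h] = self.counts.get(h, 0) + 1
        return h

    def delete_object(self, h: str) -> None:
        # decrement the count; the stored file is NOT removed at this time
        self.counts[h] -= 1

    def shallow_cleanup(self) -> None:
        # remove objects whose tracking entries show count == 0
        for h, n in list(self.counts.items()):
            if n == 0:
                self.objects.pop(h, None)
                del self.counts[h]
```

Running the cleanup as a separate pass, rather than inside `delete_object`, mirrors the separation of deletes from physical removal described above, which avoids races between concurrent inserts and deletes of the same object.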
Configuration¶
The following steps must be performed to enable external storage:
Assign external location settings for each store, as shown in the Step 1 example above. Use `dj.set` for configuration.

- `location` specifies the root path to the external data for all schemas, as well as the protocol in the prefix, such as `file://` or `s3://`.
- `account` and `token` specify the credentials for accessing the external location.
Optionally, for each schema specify the cache folder for local fetch cache.
Note
The cache folder is not yet implemented in MATLAB. The feature will be added in an upcoming release: https://github.com/datajoint/datajoint-matlab/issues/143
Cleanup¶
Deletion of records containing externally stored blobs is a ‘soft delete’, which only removes the database-side records. To remove the actual blob data, a separate cleanup process is run, as described here.

Remove tracking entries for unused external blob items.
This will remove the tracking entry from the external storage table for any external blob not referred to by any record.
Remove actual blob files from the desired external storage location.
Important
This action should only be performed while no modifications are being made to the tables that use this external store.
This will remove the unused files kept in the external store named ‘external-name’.