A hash-based data management system. MATLAB and Python implementations are available.
- You write and share code. But code operates on data, and you want a way to work on the same code independently of where the code is, or where the data is.
- Data shouldn't change. But it does, because people make mistakes. Hard drives fail. You don't want your entire analysis pipeline to operate on data that isn't what you think it is. GIGO.
- Hash-based data integrity checks.
- Your code is agnostic to the location of the data and cares only about what the data is. Think of this as URIs vs. URLs.
This is inspired by how magnet links work.
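The idea of addressing data by content rather than by location can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the helper name `file_hash` and the chunk size are assumptions, and MD5 is used only because the comparison table in this README mentions md5 and `hashlib`.

```python
import hashlib


def file_hash(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's contents.

    Hypothetical helper: the file is read in chunks so that large
    data files do not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two copies of the same file in different folders give the same digest, so code that refers to the digest keeps working when the file moves. If the contents change, the digest changes too, which is what makes the integrity check possible.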
| | MATLAB | Python |
|---|---|---|
| Status | feature-complete | WIP |
| Location | hash table stored in hash_table.mat | hash table stored in hash_table.pkl |
| Tech | uses MATLAB arrays | uses Python dictionaries |
| Hashing | uses system md5 or dataHash.m | uses hashlib |
| Performance (182 MB file) | ~19 ms | ~0.37 ms |
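The table says the Python version keeps its hash table as a dictionary in hash_table.pkl. A rough sketch of what such persistence could look like follows, assuming the standard `pickle` module (suggested by the `.pkl` extension, not confirmed by this README). The function names `save_table` and `load_table` are hypothetical.

```python
import os
import pickle


def save_table(table, filename="hash_table.pkl"):
    """Write a {hash: path} dictionary to disk (hypothetical helper)."""
    with open(filename, "wb") as f:
        pickle.dump(table, f)


def load_table(filename="hash_table.pkl"):
    """Load the dictionary, or return an empty one if no file exists yet."""
    if not os.path.exists(filename):
        return {}
    with open(filename, "rb") as f:
        return pickle.load(f)
```

A dictionary keyed by hash gives constant-time lookup of a path from its hash on average.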
The recommended way to install this is to use my package manager:
urlwrite('http://srinivas.gs/install.m','install.m');
install sg-s/data-manager
install sg-s/srinivas.gs_mtools
Alternatively, you can clone this using git:
git clone https://github.com/sg-s/data-manager
git clone https://github.com/sg-s/srinivas.gs_mtools
Don't forget to fix your MATLAB path so that it points to the correct folders.
In MATLAB, first generate a dataManager object:
dm = dataManager;
Scan the current folder for all data files and determine their hashes:
dm.rehash;
Scan a specific folder and add all the data there to the hash table:
dm.rehash('/path/to/folder/with/data/')
View all hashes and paths stored in the hash table, sorted by when they were last accessed:
dm.view
View all hashes and paths stored in the hash table, sorted by path name:
dm.view('','name')
View only hashes corresponding to paths that contain a specific string, and sorted by when they were last accessed:
dm.view('bicameral-mind','la')
Retrieve the path corresponding to a particular hash:
path_name = dm.getPath(hash);
Clean up entries in the hash table that no longer resolve to files:
dm.cleanup
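The idea behind the cleanup step, sketched in Python for illustration only (this is not the dataManager source): keep just those entries whose stored path still points at an existing file. The function name and the `{hash: path}` dictionary layout are assumptions.

```python
import os


def cleanup(table):
    """Return a copy of a {hash: path} table without stale entries.

    An entry is treated as stale when its path no longer resolves
    to a regular file on disk.
    """
    return {h: p for h, p in table.items() if os.path.isfile(p)}
```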
View all methods of dataManager:
methods(dataManager)
View interactive help:
dataManager
dataManager also uses a file called dmignore.m, which lists file name patterns that it should ignore. You can add to dmignore.m to suit your needs.
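How pattern-based ignoring might work can be sketched in Python with the standard `fnmatch` module. The patterns below are made up for illustration and are not the contents of dmignore.m; the name `is_ignored` is also hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical ignore list: source files, hidden files, and notes.
IGNORE_PATTERNS = ["*.m", "*.py", ".*", "*.md"]


def is_ignored(filename, patterns=IGNORE_PATTERNS):
    """Return True if the file name matches any shell-style ignore pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)
```

With this list, `is_ignored("analysis.m")` is true while `is_ignored("recording.mat")` is false, because a pattern has to match the whole file name.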
In Python, first import the module:
from DataManager import DataManager
dm = DataManager()
Scan the current folder for all data files and determine their hashes:
dm.rehash()
Scan a specific folder and add all the data there to the hash table:
dm.rehash('/path/to/folder/with/data/')
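What a folder scan of this kind involves can be sketched as follows. This is an assumption-laden illustration, not the `DataManager.rehash` implementation: the standalone function `rehash`, its `table` argument, and the choice of MD5 are all hypothetical.

```python
import hashlib
import os


def rehash(folder, table=None):
    """Walk `folder` recursively and record {md5 hex digest: path}.

    Sketch only: each file is read fully into memory, which is fine
    for small files but not for very large ones.
    """
    table = {} if table is None else table
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                table[hashlib.md5(f.read()).hexdigest()] = path
    return table
```

Because the key is the content hash, two files with identical contents collapse into a single entry that points at whichever copy was scanned last.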
Retrieve the path corresponding to a particular hash:
path_name = dm[hash]
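Index-style lookup such as `dm[hash]` can be supported in Python through `__getitem__`. The class below is a hypothetical stand-in, not the real `DataManager`; it also re-checks the file on lookup to show how an integrity check could be tied to retrieval, which is an assumption about the design rather than documented behaviour.

```python
import hashlib


class HashLookup:
    """Hypothetical wrapper around a {md5 hex digest: path} dictionary."""

    def __init__(self, table):
        self._table = dict(table)

    def __getitem__(self, file_hash):
        path = self._table[file_hash]  # raises KeyError for unknown hashes
        with open(path, "rb") as f:
            actual = hashlib.md5(f.read()).hexdigest()
        if actual != file_hash:
            # The file at this path no longer has the recorded contents.
            raise ValueError(f"contents of {path} do not match {file_hash}")
        return path
```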
View all hashes and paths stored in the hash table, sorted by path name:
dm.view()
data-manager is free software, released under the GPL v3.