Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: make pandas (and may be scipy, ... anything else?) dependency optional #1282

Draft
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

yarikoptic
Copy link
Contributor

Motivation

Disclaimer: I love pandas, matplotlib, scipy!

But AFAIK none of them must be needed for regular IO on .nwb files, unless I do have some pandas DataFrame I am working with and thus would have pandas imported already.

ATM 9 importing of pynwb causes import of over 800 modules from about 100 packages into sys.modules:

$> time python -c 'import sys; s1=set(sys.modules); import pynwb; s2=set(sys.modules).difference(s1); um=set(m.split(".")[0] for m in s2 if not m.startswith("_")); print(len(s2), len(um), sorted(um))' 
860 104 ['argparse', 'array', 'ast', 'asyncio', 'atexit', 'base64', 'binascii', 'bisect', 'bz2', 'calendar', 'collections', 'concurrent', 'contextvars', 'copy', 'copyreg', 'csv', 'ctypes', 'cycler', 'cython_runtime', 'datetime', 'dateutil', 'decimal', 'difflib', 'dis', 'distutils', 'email', 'enum', 'errno', 'fnmatch', 'fractions', 'gc', 'gettext', 'glob', 'grp', 'gzip', 'h5py', 'hashlib', 'hdmf', 'hmac', 'http', 'inspect', 'json', 'kiwisolver', 'linecache', 'locale', 'logging', 'lzma', 'math', 'matplotlib', 'mmap', 'multiprocessing', 'ntpath', 'numbers', 'numpy', 'opcode', 'pandas', 'pathlib', 'pickle', 'pkg_resources', 'pkgutil', 'platform', 'plistlib', 'pprint', 'pwd', 'pyexpat', 'pynwb', 'pyparsing', 'pytz', 'quopri', 'random', 're', 'ruamel', 'scipy', 'secrets', 'select', 'selectors', 'shlex', 'shutil', 'signal', 'six', 'socket', 'sre_compile', 'sre_constants', 'sre_parse', 'ssl', 'string', 'struct', 'subprocess', 'sysconfig', 'tempfile', 'textwrap', 'token', 'tokenize', 'traceback', 'typing', 'unicodedata', 'unittest', 'urllib', 'uu', 'uuid', 'weakref', 'xml', 'zipfile', 'zlib']
python -c   1.32s user 0.10s system 99% cpu 1.416 total

so -- 1.4 sec just to import top level pynwb module, possibly just to read or save a .nwb file. That is seems what "contributes" to our attempts to parallelize some functionality in dandi-cli: see dandi/dandi-cli#191

one of the main "abusers" is pandas. it is the one causing matplotlib imports etc. With this draft PR + following quick and dirty patch to hdmf (would need to be send there but I ran out of juice and wanted to first "palpate interest"):

$> diff -Naur ../hdmf-upstream/src/hdmf/common/table.py /home/yoh/proj/dandi/dandi-cli/venvs/dev3/lib/python3.8/site-packages/hdmf/common/table.py
--- ../hdmf-upstream/src/hdmf/common/table.py	2020-07-29 12:22:33.922776064 -0400
+++ /home/yoh/proj/dandi/dandi-cli/venvs/dev3/lib/python3.8/site-packages/hdmf/common/table.py	2020-08-13 19:24:19.174817618 -0400
@@ -5,7 +5,6 @@
 
 from h5py import Dataset
 import numpy as np
-import pandas as pd
 from collections import OrderedDict
 from warnings import warn
 
@@ -753,6 +752,7 @@
                 raise KeyError("Key type not supported by DynamicTable %s" % str(type(arg)))
 
             if df:
+                import pandas as pd
                 # reformat objects to fit into a pandas DataFrame
                 id_index = ret.pop('id')
                 if np.isscalar(id_index):
@@ -824,7 +824,7 @@
 
     @classmethod
     @docval(
-        {'name': 'df', 'type': pd.DataFrame, 'doc': 'source DataFrame'},
+        {'name': 'df', 'type': "TODO pd.DataFrame", 'doc': 'source DataFrame'},
         {'name': 'name', 'type': str, 'doc': 'the name of this table'},
         {
             'name': 'index_column',

import of pynwb drops almost that half a second which takes to import pandas:

$> time python -c 'import sys; s1=set(sys.modules); import pynwb; s2=set(sys.modules).difference(s1); um=set(m.split(".")[0] for m in s2 if not m.startswith("_")); print(len(s2), len(um), sorted(um))'
573 89 ['argparse', 'array', 'ast', 'asyncio', 'atexit', 'base64', 'binascii', 'bisect', 'bz2', 'calendar', 'collections', 'concurrent', 'contextvars', 'copy', 'copyreg', 'ctypes', 'cython_runtime', 'datetime', 'dateutil', 'decimal', 'difflib', 'dis', 'email', 'enum', 'errno', 'fnmatch', 'gc', 'gettext', 'glob', 'grp', 'h5py', 'hashlib', 'hdmf', 'hmac', 'inspect', 'json', 'linecache', 'locale', 'logging', 'lzma', 'math', 'multiprocessing', 'ntpath', 'numbers', 'numpy', 'opcode', 'pathlib', 'pickle', 'pkg_resources', 'pkgutil', 'platform', 'plistlib', 'pprint', 'pwd', 'pyexpat', 'pynwb', 'quopri', 'random', 're', 'ruamel', 'scipy', 'secrets', 'select', 'selectors', 'shutil', 'signal', 'six', 'socket', 'sre_compile', 'sre_constants', 'sre_parse', 'ssl', 'string', 'struct', 'subprocess', 'sysconfig', 'tempfile', 'textwrap', 'token', 'tokenize', 'traceback', 'typing', 'unittest', 'urllib', 'uuid', 'weakref', 'xml', 'zipfile', 'zlib']
python -c   1.17s user 0.06s system 99% cpu 1.225 total

Next candidate could be scipy which seems to be needed only at the hdmf level and only for sparse matrices...

here is the q&d patch
$> diff -Naur ../hdmf-upstream/src/hdmf/common/sparse.py /home/yoh/proj/dandi/dandi-cli/venvs/dev3/lib/python3.8/site-packages/hdmf/common/sparse.py 
--- ../hdmf-upstream/src/hdmf/common/sparse.py	2020-01-07 12:54:32.995303465 -0500
+++ /home/yoh/proj/dandi/dandi-cli/venvs/dev3/lib/python3.8/site-packages/hdmf/common/sparse.py	2020-08-13 19:31:46.831436217 -0400
@@ -1,4 +1,3 @@
-import scipy.sparse as sps
 import numpy as np
 import h5py
 
@@ -11,7 +10,7 @@
 @register_class('CSRMatrix')
 class CSRMatrix(Container):
 
-    @docval({'name': 'data', 'type': (sps.csr_matrix, np.ndarray, h5py.Dataset),
+    @docval({'name': 'data', 'type': ("sps.csr_matrix", np.ndarray, h5py.Dataset),
              'doc': 'the data to use for this CSRMatrix or CSR data array.'
                     'If passing CSR data array, *indices*, *indptr*, and *shape* must also be provided'},
             {'name': 'indices', 'type': (np.ndarray, h5py.Dataset), 'doc': 'CSR index array', 'default': None},
@@ -19,6 +18,7 @@
             {'name': 'shape', 'type': (list, tuple, np.ndarray), 'doc': 'the shape of the matrix', 'default': None},
             {'name': 'name', 'type': str, 'doc': 'the name to use for this when storing', 'default': 'csr_matrix'})
     def __init__(self, **kwargs):
+        import scipy.sparse as sps
         call_docval_func(super().__init__, kwargs)
         data = getargs('data', kwargs)
         if isinstance(data, (np.ndarray, h5py.Dataset)):

which seems to drop additional 100ms from pynwb import time...

The main problem to address is the @docval types description. I have not looked into either it is used only for docs (thus could be just replaced with a string), or needs actual type for type checking may be???

How to test the behavior?

See all the snippets above

Checklist

  • Have you checked our Contributing document?
  • Have you ensured the PR description clearly describes problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using #XXX notation where XXX is the issue number?

@rly
Copy link
Contributor

rly commented Aug 14, 2020

Thanks for the PR @yarikoptic .

It is possible to adapt docval to take "pd.DataFrame" as an argument for type checking and then import pandas only at that point. I tried implementing this locally without any errors.

The question then is whether that is the right solution for the problem and whether this solution is appropriate for the majority of users / use cases. While delaying import of pandas and scipy would speed up the initial import, it creates the inconvenience that anyone who wants to use the commonly used DynamicTable.__getitem__, DynamicTable.get, or DynamicTable.to_dataframe needs to have pandas installed in their environment first. This adds another step for installation of PyNWB that complicates it (pip install pynwb pandas>0.23) or creates a barrier / sticking point for researchers who are not so tech and Python savvy ("ugh why won't this just work out of the box?"), which I would say is most of the audience of PyNWB. I don't think this inconvenience is worth the efficiency, but @oruebel @ajtritt, please feel free to chime in.

The root problem seems to be that you want to read basic metadata information from an NWB file and need a fast, lightweight tool to do so that uses pandas or scipy only on demand. We could solve that by creating forks of pynwb and hdmf (pynwb-lite and hdmf-lite?) with not only delayed imports but also a pickled TypeMap of some sort (see #1050 ) and other efficiencies for this particular use case. This is more work for us, the maintainers, however... but if we only need to update it after every minor release, then it might not be so bad...

@@ -1,5 +1,4 @@
h5py==2.10.0
hdmf==2.1.0
numpy==1.18.5
pandas==0.25.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas is still a requirement though, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that having pandas as a requirement for install should not be an issue. If we "just" want to disable certain functionality to speed-up import, then an alternative way would be to use environment variables such that, e.g., if PYNWB_NOPANDAS is set then you would not import Pandas, i.e., the user would have to explicitly set the variable before importing pynwb. Having a special "pynwb-lite" may be the "cleaner" variant but I'm afraid that this will ultimately add a lot of work and I don't think we have sufficient resources right now to support such an effort.

@yarikoptic
Copy link
Contributor Author

yarikoptic commented Aug 14, 2020

Thank you @ajtritt and @oruebel for your rapid feedback, very much appreciated!

The question then is whether that is the right solution for the problem and whether this solution is appropriate for the majority of users / use cases.

I totally agree and appreciate the difficulty of this issue. Unfortunately I do not know an ideal solution ATM, and some compromise should be made. A similar battle we are fighting in DataLad, to which end we also use delayed imports, extensions (so some functionality is not available until people install domain/purpose specific extensions such as datalad-container, datalad-neuroimaging, later possibly datalad-nwb etc), and extra_requires (see below).

Overall I see it beneficial to separate two aspects: runtime and installation. Although runtime changes would depend on how installed, I think runtime should be made to cater available installation schemes.

code / runtime

docstrings

It is possible to adapt docval to take "pd.DataFrame" as an argument for type checking and then import pandas only at that point. I tried implementing this locally without any errors.

One of the major aspects to appreciate is that no pandas data structures would reach code for checking for isinstance if no pandas already imported. That is what I have tried to code up in e.g. this point: https://github.com/NeurodataWithoutBorders/pynwb/pull/1282/files#diff-edc3e1a40efc7c55019dfdfbc9d9cc9cL686

So even for type checking no need to import pandas if pandas is not among loaded modules already.

It is only to mint docstrings etc where I thought it would be desired to be able to list pandas.DataFrame and for that no import is needed either.

delayed imports

they do add up some coding burden and easy to "slip away". To that end it is valuable to add a unittest along the lines of my snippet which checks on effect of the import of pynwb... I thought we had such one in datalad as well but failed to find it now (I guess we just refactored all heavy dependencies away anyways).

delayed imports sometimes (although I see those less and less frequently) even help to just avoid some tricky "import bug" of some heavy dependency, so with delayed imports it emerges only whenever related code path is hit.

re @oruebel env vars idea which is an alternative to delayed imports as far as I see it: I think it would be adding substantially more of user and downstream developers burden.

installation: user vs use as a library dependency

anyone who wants to use the commonly used DynamicTable.getitem, DynamicTable.get, or DynamicTable.to_dataframe needs to have pandas installed in their environment first

Yes. To that end as a compromise @needs_pandas decorator could quickly check if pandas is importable and if not -- blow up with some user friendly For use of this functionality please install pandas >= XXX, or just "pip install pynwb[full]" if you used pip to install pynwb.
We do something similar e.g. in datalad whenever user tries to use some command which we know comes from an extension they do not have installed, e.g.:

$> datalad containers-run
datalad: Unknown command 'containers-run'.  See 'datalad --help'.

Hint: Command containers-run is provided by (not installed) extension datalad-container.
  • I wish there was a way to make pip/setup.* to declare some pynwb[core], dependency on which then downstreams like dandi-cli could declare, which would install only that subset of dependencies avoiding all listed in install_requires thus making it possible to make a "leaner than default" installation of pynwb -- but AFAIK there is no way for that. may be @jwodder knows if that is possible
  • may be there is an easy way to "mint" different flavors of packages to be uploaded to pypi, so there could be pynwb-core (or pynwb-lite as @oruebel suggested) which would be coming from the same codebase etc, just with leaner dependencies... it is different from a "code fork" since it is only for dependencies declaration. Also I do not consider it a "good" solution but it might be a possibility better than a full blown "code fork".

@jwodder
Copy link

jwodder commented Aug 14, 2020

I wish there was a way to make pip/setup.* to declare some pynwb[core], dependency on which then downstreams like dandi-cli could declare, which would install only that subset of dependencies avoiding all listed in install_requires thus making it possible to make a "leaner than default" installation of pynwb -- but AFAIK there is no way for that. may be @jwodder knows if that is possible

Installing an extra always also installs everything listed in install_requires. You would have to do this the other way around, with install_requires made as minimal as possible and extras added for also installing pandas etc. (which would mean that installing just pynwb wouldn't give you pandas, which might not be the UX you're going for).

@yarikoptic
Copy link
Contributor Author

Installing an extra always also installs everything listed in install_requires. You would have to do this the other way around ...

yeap, that is what I suggested in preceding bullet point. But such a feature request ("make specific extra_requires to not demand installation of install_requires") seems like something to suggest to pypi etc people -- we keep running into such use cases more and more often. AFAIK it would provide the best path forward for python libraries which are expected to be (naive) user-facing and used as dependencies.

@jwodder
Copy link

jwodder commented Aug 14, 2020

@yarikoptic There's a discussion of a similar feature request going on now in the Python Packaging forum.

@yarikoptic
Copy link
Contributor Author

yarikoptic commented Aug 14, 2020

Great find @jwodder - exactly what is needed. I will comment after/if digest it in full - didn't find mentioning of setup.cfg which might get odd if it to be None or an empty string

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants