Consider adding a Categorical datatype to bcolz #66
That's sort of what the factorization does, btw (though there are no special functions for categorical access yet).
Yes, a categorical is the result of a factorization. IIRC, in R the categorical type is actually called a factor.
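For illustration, a factorization in this sense splits a column into two pieces: a small lookup table of the unique values (the "categories") and an integer code per element. A minimal sketch in plain NumPy:

```python
import numpy as np

values = np.array(['red', 'green', 'red', 'blue', 'green'])

# np.unique with return_inverse yields both halves of a factorization:
# the sorted unique values and an integer code for each element.
categories, codes = np.unique(values, return_inverse=True)

print(categories)         # ['blue' 'green' 'red']
print(codes)              # [2 1 2 0 1]
print(categories[codes])  # round-trips to the original values
```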
I would be quite interested in this as well.

Rationale: pandas & bcolz string columns are currently extremely slow, as each entry in a string column is converted to a Python string object. It also sounds like a tractable solution to #174. It does not appear that a fixed-length pandas dtype is likely to be available any time soon, since the work required seems to be non-negligible: pandas-dev/pandas#5261

@mrocklin is working around this deficit in his dask.dataframe project by using the pandas 'category' dtype. The downside to this solution is that when reading a bcolz string column, the column has to be factorized on every read. It would be nice if the factorization could be stored with the data.

The visualfabriq/bquery project has the "machinery" required for a categorical dtype in place. As mentioned, the API for categorical access is still missing. How should the access functions behave? My gut says:
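Purely as a hypothetical illustration (none of these names or classes exist in bcolz or bquery today), such access functions might look like:

```python
import numpy as np

class CategoricalColumn:
    """Hypothetical sketch: integer codes plus a categories lookup table."""

    def __init__(self, values):
        # Factorize once at creation; only the small `categories` array
        # holds string objects, `codes` is a plain integer array.
        self.categories, self.codes = np.unique(values, return_inverse=True)

    def __getitem__(self, key):
        # Default access decodes back to the original values,
        # so existing callers would keep working unchanged...
        return self.categories[self.codes[key]]

    @property
    def cat(self):
        # ...while .cat exposes the raw codes for fast grouping/joins.
        return self.codes

col = CategoricalColumn(np.array(['a', 'b', 'a']))
print(col[0:2])  # ['a' 'b'] -- decoded view
print(col.cat)   # [0 1 0]   -- raw integer codes
```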
Does this sound like a sensible idea? Is anybody else interested in this issue as well?
There is a trade-off here. I think that categoricals are fantastic. However, I also appreciate that bcolz has a simple data model that exactly mirrors NumPy. I like the simplicity.
I think the categorical type can be implemented as an additional type in an external library; given the Cython API it would be possible.

Some notes on the implementation: the performance of a categorical type depends very much on the "load factor", i.e. the ratio of unique values to the total number of elements. This influences how to store it in-memory and out-of-core.

Depending on the number of categories, it may be wise to deactivate the shuffle filter. For example, if you "only" have < 256 categories, it may or may not make any difference. Anything beyond that depends on both the entropy and the order of the data. For example, if you have 257 categories but the element that maps to integer 256 appears only once in your >> 257-element dataset, using the shuffle filter should give some nice compression ratios. If, on the other hand, your corresponding values are sampled uniformly from the interval 0–65535 (uint16), then the shuffle filter is unlikely to give you much advantage.
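As a rough sketch of how one could measure that trade-off (assuming a bcolz version where `cparams` accepts `shuffle` as 0/1, as recent releases do), compare the compression ratio with the filter on and off:

```python
import bcolz
import numpy as np

# One million elements drawn from 257 categories, stored as int16 codes.
codes = np.random.randint(0, 257, size=10**6).astype(np.int16)

with_shuffle = bcolz.carray(codes, cparams=bcolz.cparams(clevel=5, shuffle=1))
no_shuffle = bcolz.carray(codes, cparams=bcolz.cparams(clevel=5, shuffle=0))

# nbytes/cbytes gives the effective compression ratio for each setting.
for name, ca in [('shuffle', with_shuffle), ('no shuffle', no_shuffle)]:
    print(name, ca.nbytes / ca.cbytes)
```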
I would like to have a categorical type too; of course it's up to @FrancescAlted and @esc, but I think in light of previous discussions we can always add it to bquery (to keep bcolz focused + it's exactly the kind of thing that we're trying to add with bquery). For now it would mean:
Some reading that might be interesting for @ARF1 and @mrocklin to understand why we made bquery in the first place. The original discussion document from which we started bquery: http://www.slideshare.net/cvaartjes/bcolz-groupby-discussion-document
This issue would probably benefit from being considered together with the possibility of introducing a pandas out_flavor.

@mrocklin I appreciate the rationale for staying with the numpy data model, but the possible performance improvement with a pandas out_flavor seems worth it.

@esc Going the bquery route with the "categoricals / pandas out_flavor"-combination would be an option, but I think it would require reimplementation of large sections of ctable. Integrating into bcolz looks like much less effort, since the "pandas DataFrame" and "numpy structured array" data access models are virtually identical. Only the small sections that instantiate the output structure would need adapting (at first glance).

@CarstVaartjes We seem to have the same use case: financial data analysis. Is your analysis thus also predominantly along rather than across columns? How do you deal with the inefficiencies resulting from the row-major ordering of the structured array? Cheap computing power? Or are you filling your pandas DataFrames directly from carray and avoiding ctable altogether for data access?
Hi all, thanks for the detailed discussions. I am not able to dig into this a lot now, but I am planning to put more time into this in the near future. Just to be short: I like categoricals too, but they introduce complexity, and compression already alleviates situations where your cardinality is low, so that needs more discussion.

I like the idea of using pandas as an additional flavor to output queries from bcolz, but we should explore whether a dictionary of numpy arrays could do the job too. If pandas can ingest dictionaries of numpy arrays cheaply, then that could be the way to go.

I know that you are discussing other things too and I appreciate that, but just to be clear, I would like bcolz not to become the monster that, for example, PyTables has become. Falling on the side of complexity sometimes might seem appealing, but we should fight hard against that temptation and try to keep things simple. As I said, expect me to become more active here in the near future. Thanks.
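For reference, pandas does accept a dict of NumPy arrays directly; a quick sketch of the "dict of arrays as output flavor" idea:

```python
import numpy as np
import pandas as pd

# A query result expressed as a plain dict of column arrays...
result = {
    'price': np.array([1.5, 2.0, 3.25]),
    'qty': np.array([10, 3, 7], dtype=np.int64),
}

# ...ingests directly into pandas; columns of the same dtype are typically
# consolidated into blocks, which is where the (modest) copy cost lives.
df = pd.DataFrame(result)
print(df.dtypes)
```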
@FrancescAlted Thanks for taking the time to respond. I appreciate the desire to keep things simple and maintainable. Categorical dtype issue:
In short: to me, categoricals as relating to bcolz are only really an issue with strings. Compression will take care of the other dtypes in my use cases. Maybe limiting categoricals to strings would change the complexity vs. utility analysis. I currently do not have a clear enough idea of how categoricals would be implemented to assess this.

Row-major ordering issue: currently, projection to a subset of the columns is identical with numpy and ctable:
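For example (illustrative snippet), with a structured array and a ctable holding the same data:

```python
import numpy as np
import bcolz

sa = np.array([(1, 0.5), (2, 1.0), (3, 1.5)],
              dtype=[('a', np.int64), ('b', np.float64)])
ct = bcolz.ctable(sa)

# The same list-of-names spelling projects columns on both containers:
sa[['a', 'b']]
ct[['a', 'b']]
```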
A pandas out_flavor would preserve that API:
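Sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': [i * 0.5 for i in range(5)]})

# The same projection spelling carries over unchanged:
df[['a', 'b']]
```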
If the API break that a plain dict of arrays implies (a dict cannot be projected with a list of column names) is not an issue, a dict of arrays would be great for pandas: the built-in functions can ingest them fairly cheaply. "Fairly" cheaply because:
Hi @ARF1, about your questions: we also use columns predominantly (we do more FMCG & retail stuff than financial, but for example we have sets with 2 billion records of retail sales). However, we do not really run into major inefficiencies in bquery itself, as that really does per-column operations (see the slideshare presentation I mentioned, and line 738 at commit ac52154).
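A toy illustration of why per-column operations over factorized codes stay cheap (not bquery's actual implementation, which is in Cython): a group sum reduces to a single bincount over the integer codes:

```python
import numpy as np

# Factorize a string key column once, then aggregate per group.
keys = np.array(['shop_a', 'shop_b', 'shop_a', 'shop_c', 'shop_b'])
sales = np.array([10.0, 4.0, 6.0, 2.5, 8.0])

groups, codes = np.unique(keys, return_inverse=True)

# One pass over the integer codes; no Python string objects involved.
sums = np.bincount(codes, weights=sales)
print(dict(zip(groups, sums)))  # {'shop_a': 16.0, 'shop_b': 12.0, 'shop_c': 2.5}
```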
In PR #187 I propose a new abstraction layer for the generation of the "results array". This would allow everybody to provide their own out_flavor.

Overhead is fairly low: 42.3µs vs. 38.5µs for returning a single-row result from my test data. I would love to know what you think.
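Conceptually (a sketch only, not the actual PR #187 interface; all names below are invented), such an abstraction layer could be as small as a registry of result builders keyed by flavor name:

```python
import numpy as np

# Hypothetical sketch: map flavor names to functions that turn a
# dict of column arrays into the desired result container.
RESULT_BUILDERS = {
    'numpy': lambda cols: np.rec.fromarrays(list(cols.values()),
                                            names=list(cols.keys())),
}

def register_flavor(name, builder):
    RESULT_BUILDERS[name] = builder

def build_result(cols, out_flavor='numpy'):
    return RESULT_BUILDERS[out_flavor](cols)

# A third party could then plug in pandas without touching the core:
# import pandas as pd
# register_flavor('pandas', pd.DataFrame)
```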
To be inspired by: https://pandas-docs.github.io/pandas-docs-travis/categorical.html