Error when calculating variance of series of ndarrays #25542

Closed
MakGre opened this issue Mar 5, 2019 · 3 comments
Labels
Enhancement; Error Reporting (incorrect or improved errors from pandas); Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.); Reduction Operations (sum, mean, min, max, etc.)

Comments

MakGre commented Mar 5, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
# np.__version__ is 1.14.5
# pd.__version__ is 0.24.1

# create list of dicts with ndarrays
ll = []
for ii in range(2):
    dd = {0: np.ones(2)}
    ll += [dd]

# create data frame
df = pd.DataFrame(ll)

# print(df) looks as expected
#             0
# 0  [1.0, 1.0]
# 1  [1.0, 1.0]

m = df[0].mean() # works as expected
# equivalent to df[0].values.mean(axis=0)
print(m) # array([1., 1.])

v = df[0].values.var(axis=0) # yields expected result array([0., 0.])

v = df[0].var() # raises TypeError: setting an array element with a sequence.

Problem description

A pandas Series of dtype object can contain numpy.ndarrays. This is useful for storing high-dimensional data in DataFrames.
Calculating the mean of such a series works as expected. Calculating the variance, however, raises an error. The calculation is easily performed by inserting .values between the series and the var call, so this is not a fundamental problem.
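
One plausible reason the mean works is that summing an object array falls back to Python's `+` on the elements, which for ndarrays is elementwise addition. The sketch below shows this with plain NumPy only; it is an assumption about the mechanism, not a trace of the pandas internals.

```python
import numpy as np

# Build a 1-D object array whose two elements are themselves ndarrays,
# mirroring df[0].values from the example above.
values = np.empty(2, dtype=object)
values[0] = np.ones(2)
values[1] = np.ones(2)

# Summing an object array applies + to the elements,
# so the "sum" is itself an ndarray rather than a scalar.
total = values.sum()

# Dividing by the element count then gives an elementwise mean.
elementwise_mean = total / len(values)
print(elementwise_mean)  # [1. 1.]
```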

This is the error Traceback of df[0].var()

Traceback (most recent call last):

  File "<ipython-input-23-be47a51ab53b>", line 1, in <module>
    df[0].var()

  File "C:\Users\Maksim\WinPython\python-3.6.3.amd64\lib\site-packages\pandas\core\generic.py", line 10976, in stat_func
    skipna=skipna, ddof=ddof)

  File "C:\Users\Maksim\WinPython\python-3.6.3.amd64\lib\site-packages\pandas\core\series.py", line 3626, in _reduce
    return op(delegate, skipna=skipna, **kwds)

  File "C:\Users\Maksim\WinPython\python-3.6.3.amd64\lib\site-packages\pandas\core\nanops.py", line 76, in _f
    return f(*args, **kwargs)

  File "C:\Users\Maksim\WinPython\python-3.6.3.amd64\lib\site-packages\pandas\core\nanops.py", line 138, in f
    raise TypeError(e)

TypeError: setting an array element with a sequence.

Expected Output

I expect
df[0].var()
to yield the same as
df[0].values.var(axis=0)
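
A minimal workaround sketch, assuming every element has the same shape: stack the elements into a regular 2-D float array and reduce along axis 0. Note that `Series.var` defaults to `ddof=1` while `ndarray.var` defaults to `ddof=0`, so the two expressions above would differ in general; they agree in the original example only because all values are equal. The sample data here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Object-dtype Series holding one ndarray per row (illustrative data).
s = pd.Series([np.ones(2), np.array([3.0, 5.0])], dtype=object)

# Stack the per-row arrays into a regular float array of shape (n_rows, n_features).
stacked = np.stack(s.tolist())

# NumPy's default: population variance (ddof=0).
population_var = stacked.var(axis=0)

# pandas' default for Series.var: sample variance (ddof=1).
sample_var = stacked.var(axis=0, ddof=1)

print(population_var)  # [1. 4.]
print(sample_var)      # [2. 8.]
```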

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.24.1
pytest: 3.2.3
pip: 18.1
setuptools: 39.2.0
Cython: 0.27.2
numpy: 1.14.5
scipy: 1.1.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.1
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.2
lxml.etree: 4.1.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd (Member) commented Mar 6, 2019

Operations like these are supposed to reduce to a scalar, so I think it's purely happenstance that the mean works and not something we generally make guarantees about.

That said, the error message isn't very helpful. Investigation into what's going on and PRs to make it more useful would certainly be welcome.
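
As a rough illustration of what a more helpful message could look like, the sketch below checks for nested values before reducing. The helper `ensure_scalar_elements` and its message wording are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def ensure_scalar_elements(series: pd.Series, op_name: str) -> None:
    """Hypothetical pre-check (not a pandas API): reject nested values up front."""
    for label, value in series.items():
        if isinstance(value, (np.ndarray, list, tuple, set, dict)):
            raise TypeError(
                f"cannot perform {op_name!r} on a Series whose elements are "
                f"collections; found {type(value).__name__} at index {label!r}"
            )


nested = pd.Series([np.ones(2), np.ones(2)], dtype=object)
try:
    ensure_scalar_elements(nested, "var")
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

A check like this costs one pass over the data, so in practice it would probably only run after the fast path has already failed.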

MakGre (Author) commented Mar 11, 2019

Thank you for your reply.

So the use of mean and var is not really supported for object-dtype columns, I guess?

For me it is not much trouble to perform the operations on the values instead. I just wanted to let you guys know.

If this is by design, then the issue can be closed as far as I am concerned.

mroeschke added the Error Reporting and Numeric Operations labels Nov 2, 2019
jbrockmendel added the Nested Data label Sep 22, 2020
mroeschke added the Enhancement and Reduction Operations labels and removed the Numeric Operations label Jun 27, 2021
mroeschke (Member) commented

Thanks for the issue, but it appears this hasn't gotten traction in a while, so closing.
