bug: Integer column not subscriptable #6222

ml31415 · 2023-04-25T17:19:40Z

What happened?

Given the following code:

import pandas as pd
import ibis

conn = ibis.duckdb.connect()
df = pd.DataFrame({"a": [1,2,3,4], "b":[4,5,6,7]})
tab = conn.sql("select * from df")

>>> tab
┏━━━━━━━┳━━━━━━━┓
┃ a     ┃ b     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │
├───────┼───────┤
│     1 │     4 │
│     2 │     5 │
│     3 │     6 │
│     4 │     7 │
└───────┴───────┘

With this little table, selecting columns and simple slicing work as expected. The restriction that the slice step must be 1 is not nice, but understandable given the limitations of SQL LIMIT.

>>> tab[1:3]
┏━━━━━━━┳━━━━━━━┓
┃ a     ┃ b     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │
├───────┼───────┤
│     2 │     5 │
│     3 │     6 │
└───────┴───────┘

>>> tab["a", "b"][1:3]
┏━━━━━━━┳━━━━━━━┓
┃ a     ┃ b     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │
├───────┼───────┤
│     2 │     5 │
│     3 │     6 │
└───────┴───────┘

>>> tab[1:3:-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [180], in <cell line: 1>()
----> 1 tab[1:3:-1]

File ~/.local/lib/python3.10/site-packages/ibis/expr/types/relations.py:474, in Table.__getitem__(self, what)
    472 step = what.step
    473 if step is not None and step != 1:
--> 474     raise ValueError('Slice step can only be 1')
    475 start = what.start or 0
    476 stop = what.stop

ValueError: Slice step can only be 1
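The step restriction follows from how a slice has to be translated to SQL. Below is a rough sketch of that mapping, mirroring the `step`/`start`/`stop` handling visible in the traceback above. The helper name `slice_to_limit_offset` is made up for illustration, and this is not ibis's actual code:

```python
from typing import Optional, Tuple


def slice_to_limit_offset(s: slice) -> Tuple[Optional[int], int]:
    """Translate a Python slice into a (limit, offset) pair.

    Hypothetical helper for illustration only; not part of ibis.
    """
    if s.step is not None and s.step != 1:
        # LIMIT/OFFSET can only describe one contiguous run of rows,
        # so a stride or a reversed slice has no direct equivalent.
        raise ValueError("Slice step can only be 1")
    start = s.start or 0
    stop = s.stop
    if start < 0 or (stop is not None and stop < 0):
        # Negative bounds would require knowing the row count up front.
        raise ValueError("Negative bounds are not handled in this sketch")
    limit = None if stop is None else max(stop - start, 0)
    return limit, start


print(slice_to_limit_offset(slice(1, 3)))     # (2, 1), i.e. LIMIT 2 OFFSET 1
print(slice_to_limit_offset(slice(None, 2)))  # (2, 0), i.e. LIMIT 2
```

Under this reading, `tab[1:3]` asks for two rows after skipping one, which is why the result above contains the rows with `a` equal to 2 and 3.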

What doesn't work is slicing after picking only a single column.

>>> tab["a"][1:3]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [161], in <cell line: 1>()
----> 1 tab["a"][1:3]

TypeError: 'IntegerColumn' object is not subscriptable

While selecting the other way round is fine:

>>> tab[1:3]["a"]
┏━━━━━━━┓
┃ a     ┃
┡━━━━━━━┩
│ int64 │
├───────┤
│     2 │
│     3 │
└───────┘

I've seen #1985, which mentions the same error message, but it is about subscripting with a boolean array, which is a different story.

Being forced into a certain order of selecting rows or columns first is, imho, at least confusing. Why should subscripting suddenly be impossible if you select only one column instead of two? This is not understandable from a user's point of view. It feels like a SQL select suddenly failing just because you select fewer columns.

It's the same with the polars backend, except that it only accepts a single upper or lower limit, not both, and returns None otherwise, which might be a bug of its own.

connpl = ibis.polars.connect()
tabpl = connpl.read_pandas(df, "tab")

>>> tabpl[1:3]

>>> tabpl[:2]
┏━━━━━━━┳━━━━━━━┓
┃ a     ┃ b     ┃
┡━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │
├───────┼───────┤
│     1 │     4 │
│     2 │     5 │
└───────┴───────┘

>>> tabpl["a"][:2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [173], in <cell line: 1>()
----> 1 tabpl["a"][:2]

TypeError: 'IntegerColumn' object is not subscriptable

Memtable fails as well, so I guess it's rather a generic issue and not backend related:

>>> tabmem = ibis.memtable(df, name="tab")
>>> tabmem["a"][1:3]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [223], in <cell line: 1>()
----> 1 tabmem["a"][1:3]

TypeError: 'IntegerColumn' object is not subscriptable

What version of ibis are you using?

5.1.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

cpcloud · 2023-04-26T13:03:49Z

Maintainer

Hi @ml31415 👋🏻!

I hear you. These little API differences can be annoying.

It may help to understand why things are the way they are.

First, as you point out, there seems to be an arbitrary limitation on doing things with one column versus more than one column.

This limitation is intentional and is a result of a particular design choice in modeling the "tabular data" problem. We model the dimension of the various parts of a computation with different objects: we have Table for 2 dimensions, Column for one, and Scalar for zero.
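To make the three-object model concrete, here is a toy sketch in plain Python. The classes below only borrow the names `Table`, `Column`, and `Scalar`; they are eager, list-backed stand-ins written for this illustration and say nothing about how ibis implements its lazy expressions:

```python
class Scalar:
    """Zero-dimensional result, e.g. the sum of a column."""

    def __init__(self, value):
        self.value = value


class Column:
    """One-dimensional object; deliberately defines no __getitem__."""

    def __init__(self, values):
        self._values = list(values)

    def sum(self):
        return Scalar(sum(self._values))


class Table:
    """Two-dimensional object: can select a column or slice rows."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __getitem__(self, what):
        if isinstance(what, slice):
            return Table(self._rows[what])
        if isinstance(what, str):
            return Column(row[what] for row in self._rows)
        raise TypeError(f"unsupported key: {what!r}")


tab = Table([{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 7}])

# Slicing the table first and then picking the column works.
rows_first = tab[1:3]["a"].sum().value
print(rows_first)  # 5

# Picking the column first fails, because Column is a different type.
try:
    tab["a"][1:3]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The error comes from Python itself: a class without `__getitem__` cannot be subscripted, regardless of what its data looks like.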

Under this model some operations have ambiguous meaning when composed with other operations.

One thing that may help clarify is to ask what it would mean to compose another operation with t["a"][1:3].

Some operations make sense: any computation that is directly built off of t["a"][1:3] would work; things like sum, mean, log, etc.

However, as soon as you need to "combine" such an expression with something else, the operation becomes ambiguous. The same is true of a hypothetical t["a"].distinct() operation (we used to have this, but removed it for the same reason).

With all that in mind, let's talk about some of your specific points.

> Being forced into a certain order of selecting rows or columns first is, imho, at least confusing. Why should subscripting suddenly be impossible if you select only one column instead of two?

This is because tables and columns are different objects, which itself is the way the problem is modeled in ibis.

> This is not understandable from a user's point of view.

Hopefully the above helps clarify! Let us know if there's something else that would help.

> It feels like a SQL select suddenly failing just because you select fewer columns.

Yeah, that'd be really weird!

If we modeled everything as a table (e.g., a column is a one-column table, a scalar is a one-column, one-row table), then we might be able to make limit and distinct work the way you're describing, but it would have other consequences for the API that would require a redesign of many things.

> It's the same with the polars backend, except that it only accepts a single upper or lower limit, not both, and returns None otherwise, which might be a bug of its own.

This is less a bug and more of a missing feature for the Polars backend. Ibis is a community project, and we depend on the good will of folks like yourselves to help us understand what features to prioritize and what bugs to fix. We don't always implement support for something when we add a new backend (for various reasons) so some backends are missing functionality that others have. If you're interested in more complete slicing support for polars, I'd be curious to know if you plan to use polars with ibis or if you just happened to try it for comparison!

Either way, all of your feedback on ibis is extremely valuable.

ml31415 · 2023-04-27T08:05:48Z

Author

Hi Phillip! Thanks for your detailed reply and for your work on ibis. I think it's a great and highly needed project, so please take my criticism with all due respect! If it's not a bug, I guess the issue here is less the technical background and more a general question of API design: something that deserves a solution rather than explanation and understanding from the user.

Imagine one of the numpy devs saying: "Yeah, you can slice 2D arrays totally fine, but not 1D arrays, because they're a special case, and we didn't want to take care of that". Or, back to the SQL example, Postgres devs telling you: "Yeah, you can use limit on whatever you select, but not if you select only a single column, because we had technical issues with the implementation". It all sounds like an awfully artificial limitation, and also a violation of the API design principle of reducing surprises as much as possible.

> This limitation is intentional and is a result of a particular design choice in modeling the "tabular data" problem. We model the dimension of the various parts of a computation with different objects: we have Table for 2 dimensions, Column for one, and Scalar for zero.

Well, yes, tables split up into columns, but there is nothing forbidding the column and the table from behaving in the very same way when you try to slice them. Usually you might set up a common protocol for the slicing behaviour and make both classes implement that. If there is an issue with a certain operation, of course it would not be part of the common protocol. Ibis is obviously not the first software dealing with tabular data, and all of them face the same problem of having result sets with one or more columns. All of them, including the backends Ibis supports, had to solve how to deal with that at some point. I didn't do a deep dive into the code, but if there actually was a design choice that forbids a good implementation, I guess that design choice should be the center of a lengthy discussion.

About polars: I had just recently evaluated it for another use case, so it came in handy as another backend example to try.

I'd really love to see that fixed. Otherwise, Ibis has made a really good impression on me so far while experimenting with it!

cpcloud · 2023-05-01T16:51:56Z

Maintainer

@ml31415 I appreciate the response!

Let's see if we can come to a mutual understanding here. I feel like there's still a tiny bit we're not aligned on.

I'll try to address your specific points. Let me know if I failed to clarify anything!

> Imagine one of the numpy devs saying: "Yeah, you can slice 2D arrays totally fine, but not 1D arrays, because they're a special case, and we didn't want to take care of that".

The thing I think I failed to communicate effectively in my previous comment is that in ibis's case, a single column is not a special case of a table; it's an entirely separate object with different semantics. In the numpy case, as you note, one-dimensional arrays are a special case of N-dimensional arrays. It's also true that everything in numpy is an ndarray. An important consequence is that one-dimensional arrays don't need to be handled specially: they can be handled with generic ND-array code.

I think it's incorrect to say that our approach is analogous to this hypothetical scenario because the premise (columns are a special case of tables) isn't true.

> Or, back to the SQL example, Postgres devs telling you: "Yeah, you can use limit on whatever you select, but not if you select only a single column, because we had technical issues with the implementation".

The same approach to reasoning about numpy can be taken in this case. There's a single object that a user can interact with in postgres, around which everything is modeled: tables. A SELECT statement always returns a table, there are no single column objects or scalar objects that you can return from a query.

Similarly, the analogy doesn't really work because the premise that columns are a special case of tables isn't true in the case of postgres. There are no columns, just tables.
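The "everything is a table" point can be checked against any SQL engine. The sketch below uses SQLite from the Python standard library rather than Postgres, purely because it needs no server; the table name `t` and its contents are taken from the example at the top of the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 4), (2, 5), (3, 6), (4, 7)])

two_columns = conn.execute("SELECT a, b FROM t").fetchall()
one_column = conn.execute("SELECT a FROM t").fetchall()
aggregate = conn.execute("SELECT sum(a) FROM t").fetchall()

# Every result has the same shape: a list of row tuples.
# A "column" is a table with one column, a "scalar" a table with one row and one column.
print(sorted(one_column))  # [(1,), (2,), (3,), (4,)]
print(aggregate)           # [(10,)]
conn.close()
```

Because the result is always a relation, LIMIT never has to care how many columns were selected.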

> but there is nothing forbidding the column and the table from behaving in the very same way when you try to slice them.

Unfortunately there is.

Consider the following computations:

t = ibis.table({"a": "int"}, name="t")

expr = t.a + t.a[:5]

What is the meaning of expr? Should this do an automatic join somehow? Should it error?

> Usually you might set up a common protocol for the slicing behaviour and make both classes implement that.

Indeed, that might work if ibis had complete control over execution. By design, ibis is at the mercy of the executing backend. Some backends might be able to make sense of t.a + t.a[:5]; some might not. Many of our SQL backends would balk at this operation because ultimately the limit applied to get t.a[:5] turns the underlying table into a relation distinct from t, which means t and limited_t now need to be related somehow. Pandas does this automatically using indexes, but we don't have that in ibis.
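A small sketch of what "need to be related somehow" means at the SQL level, again using SQLite from the standard library as a stand-in backend. The eight-row table and the alias `lim` are assumptions made for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 9)])  # 8 rows

# "t.a[:5]" has to become its own relation: a subquery with LIMIT.
# SQL has no notion of "the same row position" in two relations, so without
# a join condition every row of t is paired with every limited row.
pairs = conn.execute(
    "SELECT t.a, lim.a FROM t CROSS JOIN (SELECT a FROM t LIMIT 5) AS lim"
).fetchall()
print(len(pairs))  # 40
conn.close()
```

Getting five "aligned" sums instead of 40 pairs would require a join key that says which row of `t` belongs to which row of the limited relation, and a plain column slice does not provide one.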

> I'd really love to see that fixed.

It's not necessarily off the table, but it needs some unambiguous composition semantics and well-supported behavior across most of our backends.

ml31415 · 2023-05-02T14:57:06Z

Author

Hi Philip, thanks for coming back to this! Would you mind clarifying your last example? Unfortunately, I fail to understand what exactly you are trying to point out with it.

t.a is an integer column of length n. The slice is of length m <= 5. So in the general case the columns are of different lengths, which would make this an attempted element-wise addition with mismatched lengths, a clear error condition. It also can't be a scalar addition. So yes, if you ask me, that should throw an error right away. There is no point in guessing what the user might have meant.

Is this "addition" supposed to be something like an append or extend operation? In that case, again, where is the difference between a column, a table with one column, and a table with n columns? I'd expect the same behaviour. t.extend(t[:5]) should be a valid statement, as should t.a.extend(t.a[:5]).

> A SELECT statement always returns a table, there are no single column objects or scalar objects that you can return from a query.

Fair enough to opt for more implicit behaviour, avoiding explicit convert functions like .one() for a single row or, as in our example, to_column(). That leaves the second option like with pandas / numpy (tables and columns have a compatible interface, whenever possible).

So far you have mentioned that it would require a redesign in some places. Sure, it's work to change something, maybe even a lot of work. Nevertheless, the first question is whether it's a desired change at all. How much work it would be is a subsequent consideration, and somebody else might even show up and fix it for you.

> it needs some unambiguous composition semantics and well-supported behavior across most of our backends.

Is there even any backend that has issues with LIMIT on a single column?

cpcloud · 2023-05-03T18:51:58Z

Maintainer

> t.a is an integer column of length n. The slice is of length m <= 5. So in the general case the columns are of different lengths, which would make this an attempted element-wise addition with mismatched lengths, a clear error condition. It also can't be a scalar addition. So yes, if you ask me, that should throw an error right away. There is no point in guessing what the user might have meant.

I don't think you're considering that t.a[:5] + t.b[:5] is also not well-defined without a unique ORDER BY on t, nor are further compositions of operations that are not functions of the limited column.
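A toy illustration of why equal lengths are not enough. It assumes, for the sake of the example, that two independent scans of the same unordered table may return rows in different physical orders; the `scan` helper and the chosen orders are invented here, and the data is the four-row table from the top of the thread:

```python
rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}, {"a": 4, "b": 7}]
true_sums = {r["a"] + r["b"] for r in rows}  # {5, 7, 9, 11}


def scan(table, order):
    """Return the rows in one particular physical order (hypothetical helper)."""
    return [table[i] for i in order]


# Limit each column separately: nothing forces both scans to agree on order.
a_first_two = [r["a"] for r in scan(rows, [0, 1, 2, 3])][:2]  # [1, 2]
b_first_two = [r["b"] for r in scan(rows, [3, 2, 1, 0])][:2]  # [7, 6]
misaligned = [x + y for x, y in zip(a_first_two, b_first_two)]
print(misaligned)  # [8, 8], and 8 is not a + b for any actual row

# Limit the table first: each a stays paired with its own b.
limited = scan(rows, [3, 2, 1, 0])[:2]
aligned = [r["a"] + r["b"] for r in limited]
print(aligned)  # [11, 9]
```

Which two rows the table-first version returns still depends on the scan order, but every value it produces belongs to a real row. Pinning down which rows are returned is what a unique ORDER BY is for.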

> That leaves the second option like with pandas / numpy (tables and columns have a compatible interface, whenever possible).

Pandas and numpy are very different here; I don't think it's correct to equate them.

Again, NumPy does not special case 1D arrays in its implementation. Just because 1D arrays are logically a special case does not mean they are implemented that way. The same code that operates on ND arrays operates on 1D arrays. Maybe it will help to consider that there are no special np.one_d_array objects: there's only np.ndarray.

Pandas on the other hand has a data model similar to ibis: a table thing (DataFrame) and a column thing (Series).

Yes, they share a bunch of implementation details, and you're right that they are similar.

And yet they are still distinct objects. You cannot swap either of them out for the other arbitrarily.

> Is there even any backend that has issues with LIMIT on a single column?

Definitely not. The issue is not whether the simple case of LIMIT on a single column can be executed on any of ibis's backends, it's that none of the following are currently true:

1. We know what people want to do with slicing beyond looking at the first N values of a column
2. Given 1, we know that what people want to do is well-defined
3. Given 2, it seems like it doesn't require a fundamental rewrite of the library.
4. A person interested in doing the work

95% of this hypothetical effort is numbers 1 and 2.

@ml31415 It seems like you've got ideas about how this should work. What do you think about pushing this effort through? We're happy to help guide in whatever way you may need.

ml31415 · 2023-05-04T12:06:10Z

Author

Alright, thanks for the patient answers. Let's leave it as is. I'll have a closer look at what an implementation might look like if we actually decide to go with ibis for future projects. I agree, it's not as straightforward a change as it initially looked. The workaround of slicing the table first will have to do for now.

cpcloud · 2023-05-04T12:08:50Z

Maintainer

@ml31415 Thanks for bringing the issue up and engaging. I think it's conversations like these that help drive the ecosystem forward.

Looking forward to hearing about/seeing what you do with ibis!

As always please continue to create issues and start discussions for anything that comes up.

cpcloud · 2023-05-04T12:45:14Z

Maintainer

@ml31415 In the meantime, do you mind if I move this to a GitHub discussion?

NickCrews · 2023-05-08T17:58:22Z

Maintainer

@cpcloud I think this deserves a FAQ page in the documentation somewhere. The important and unexpected thing is that SQL/ibis make no guarantees about row order, whereas a stable row order is a key assumption that numpy/pandas users are used to. So two columns can't be deterministically "lined up" unless they are actually part of the same Table. I think that might be the root cause of the misunderstanding above. This misunderstanding was the reason I filed a bunch of bugs several months ago that weren't actually bugs, where I was asking Ibis to do something outside of the model of SQL.

cpcloud · 2023-05-08T18:00:28Z

Maintainer

@NickCrews Definitely. I'll make an issue for it.

cpcloud · 2023-09-11T19:16:55Z

Maintainer

Closing this discussion out, thanks for the pleasant interactions all!
