Replies: 11 comments
-
Hi @ml31415 👋🏻! I hear you. These little API differences can be annoying. It may help to understand why things are the way they are. First, as you point out, there seems to be an arbitrary limitation on doing things with one column versus more than one column. This limitation is intentional and is a result of a particular design choice in modeling the "tabular data" problem. We model the dimension of the various parts of a computation with different objects: we have Under this model some operations have ambiguous meaning when composed with other operations. One thing that may help clarify is to ask what it would mean to compose another operation with Some operations make sense: any computation that is directly built off of However as soon as you need to "combine" such an expression with something else the operation becomes ambiguous. So it is with a hypothetical With all that in mind, let's talk about some of your specific points.
This is because tables and columns are different objects, which itself is the way the problem is modeled in ibis.
Hopefully the above helps clarify! Let us know if there's something else that would help.
Yeah, that'd be really weird! If we modeled everything as a table, e.g., a column is a one-column table, a scalar is one-column one-row table then we might be able to make
This is less a bug and more of a missing feature for the Polars backend. Ibis is a community project, and we depend on the goodwill of folks like yourselves to help us understand what features to prioritize and what bugs to fix. We don't always implement support for something when we add a new backend (for various reasons), so some backends are missing functionality that others have. If you're interested in more complete slicing support for polars, I'd be curious to know if you plan to use polars with ibis or if you just happened to try it for comparison! Either way, all of your feedback on ibis is extremely valuable.
-
Hi Phillip! Thanks for your detailed reply and for your work on ibis. I think it's a great and highly needed project, so please take my criticism with all due respect! If it's not a bug, I guess the issue here is less the technical background and more a general question of API design, and something that deserves a solution rather than explanation and understanding from the user. Imagine one of the numpy devs saying: "Yeah, you can slice 2D arrays totally fine, but not 1D arrays, because they're a special case, and we didn't want to take care of that". Or, back to the SQL example, Postgres devs telling you: "Yeah, you can use limit on whatever you select, but not if you select only a single column, because we had technical issues with the implementation". It all sounds like an awfully artificial limitation, and also a violation of the API design principle of reducing surprises as much as possible.
Well, yes, tables split up into columns, but there is nothing forbidding the column and the table from behaving in the very same way when you try to slice them. Usually you might set up a common protocol for the slicing behaviour and make both classes implement that. If there is an issue with a certain operation, of course this would not be part of the common protocol. I mean, Ibis is obviously not the first software dealing with tabular data, and all of them face the same problem of having result sets with one or more columns. All of them, including the backends Ibis supports, had to solve how to deal with that at some point. I didn't do any deep dive into the code, but if there actually was a design choice that forbids a good implementation, I guess that design choice should be the center of a lengthy discussion. About polars: I had just recently evaluated it for another use case, so it came in handy as another backend example to try. I'd really love to see that fixed. Otherwise Ibis has made a really good impression on me so far while experimenting with it!
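To make the "common protocol" idea above concrete, here is a minimal sketch of what such a shared slicing interface could look like. Everything in it (`Sliceable`, `ToyTable`, `ToyColumn`, `first_rows`) is a hypothetical name invented for illustration; none of it is ibis code, and it only covers the interface side of the proposal, not how a sliced column would compose with other expressions.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple


class Sliceable(Protocol):
    """Hypothetical shared protocol: anything that accepts expr[start:stop]."""

    def __getitem__(self, key: slice) -> "Sliceable": ...


@dataclass
class ToyTable:
    """Toy stand-in for a table expression (not an ibis class)."""

    columns: Tuple[str, ...]
    rows: Optional[slice] = None  # requested row window, if any

    def __getitem__(self, key: slice) -> "ToyTable":
        return ToyTable(self.columns, key)


@dataclass
class ToyColumn:
    """Toy stand-in for a column expression (not an ibis class)."""

    name: str
    rows: Optional[slice] = None  # requested row window, if any

    def __getitem__(self, key: slice) -> "ToyColumn":
        return ToyColumn(self.name, key)


def first_rows(expr: Sliceable, n: int) -> Sliceable:
    """Generic code written once against the protocol, not a concrete class."""
    return expr[0:n]


table_head = first_rows(ToyTable(("a", "b")), 2)
column_head = first_rows(ToyColumn("a"), 2)
```

Both toy classes satisfy the protocol structurally, so `first_rows` works on either without knowing which one it received.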
-
@ml31415 I appreciate the response! Let's see if we can come to a mutual understanding here. I feel like there's still a tiny bit we're not aligned on. I'll try to address your specific points. Let me know if I failed to clarify anything!
The thing I think I failed to communicate effectively in my previous comment is that in ibis's case, a single column is not a special case of a table; it's an entirely separate object with different semantics. In the numpy case, as you note, one-dimensional arrays are a special case of N-dimensional arrays. It's also true that everything in numpy is an ndarray. An important consequence is that one-dimensional arrays don't need to be handled specially: they can be handled with generic ND-array code. I think it's incorrect to say that our approach is analogous to this hypothetical scenario because the premise (columns are a special case of tables) isn't true.
The same approach to reasoning about numpy can be taken in this case. There's a single object that a user can interact with in postgres, around which everything is modeled: tables. A Similarly, the analogy doesn't really work because the premise that columns are a special case of tables isn't true in the case of postgres. There are no columns, just tables.
Unfortunately there is. Consider the following computations:
What is the meaning of
Indeed that might work if ibis had complete control over execution. By design, ibis is at the mercy of the executing backend. Some backends might be able to make sense of
It's not necessarily off the table, but it needs some unambiguous composition semantics and well-supported behavior across most of our backends.
-
Hi Phillip, thanks for coming back to this! Would you mind clarifying your last example? Unfortunately I fail to understand what exactly you are trying to point out with it.
Is this "addition" supposed to be something like an append or extend operation? In that case, again, where is the difference, if it's a column, a table with one, or with n columns? I'd expect the same behaviour.
Fair enough to opt for more implicit behaviour, avoiding explicit convert functions like So far you mentioned that it would require a redesign in some places. Sure, it's work to change something. Maybe even a lot of work. Nevertheless, the first question is whether it's a desired change at all. How much work it would be is a subsequent consideration, and somebody else might even show up and fix it for you.
Is there even any backend that has issues with
-
I don't think you're considering that
Pandas and numpy are very different here, I don't think it's correct to equate these. Again, NumPy does not special case 1D arrays in its implementation. Just because 1D arrays are logically a special case does not mean they are implemented that way. The same code that operates on ND arrays operates on 1D arrays. Maybe it will help to consider that there are no special Pandas on the other hand has a data model similar to ibis: a table thing ( Yes, they share a bunch of implementation details, and you're right that they are similar. And yet they are still distinct objects. You cannot swap either of them out for the other arbitrarily.
Definitely not. The issue is not whether the simple case of
95% of this hypothetical effort is numbers 1 and 2. @ml31415 It seems like you've got ideas about how this should work. What do you think about pushing this effort through? We're happy to help guide in whatever way you may need.
-
Alright, thanks for the patient answers. Let's leave it as is. I'll have a closer look at how an implementation might look if we actually decide to go with ibis for future projects. I agree, it's not as straightforward a change as it initially looked. The workaround of slicing the table first will have to do for now.
-
@ml31415 Thanks for bringing the issue up and engaging. I think it's conversations like these that help drive the ecosystem forward. Looking forward to hearing about/seeing what you do with ibis! As always please continue to create issues and start discussions for anything that comes up.
-
@ml31415 In the meantime, do you mind if I move this to a GitHub discussion?
-
@cpcloud I think this deserves a FAQ page in the documentation somewhere. The important and unexpected thing is that SQL/ibis make no guarantees about row order, which is a key assumption that numpy/pandas users are used to. So two columns can't be deterministically "lined up" unless they are actually part of the same Table. I think that might be the root cause of the misunderstanding above. This misunderstanding was the reason I filed a bunch of bugs several months ago that weren't actually bugs, where I was asking Ibis to do something outside of the model of SQL.
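The row-order point can be illustrated with plain SQL. The sketch below uses Python's built-in `sqlite3` module rather than ibis, and the table `t` with columns `a` and `b` is made up for the example. It contrasts slicing the table first (rows stay paired, and `ORDER BY` makes the window deterministic) with slicing two single-column queries separately (the SQL model promises nothing about how their rows correspond).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(3, "c"), (1, "a"), (2, "b")])

# Slice the table first, then pick columns: each (a, b) pair comes from a
# single row, and ORDER BY pins down which rows fall inside the window.
paired = con.execute(
    "SELECT a, b FROM t ORDER BY a LIMIT 2 OFFSET 1"
).fetchall()

# Two separately sliced single-column queries. Without ORDER BY the engine
# is free to return rows in any order, so nothing in the SQL model ties the
# i-th value of `only_a` to the i-th value of `only_b`.
only_a = con.execute("SELECT a FROM t LIMIT 2").fetchall()
only_b = con.execute("SELECT b FROM t LIMIT 2").fetchall()

print(paired)
```

A particular engine may happen to return `only_a` and `only_b` in matching order on a given run, but that is an implementation detail rather than something a query can rely on.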
-
@NickCrews Definitely. I'll make an issue for it.
-
Closing this discussion out, thanks for the pleasant interactions all!
-
What happened?
Given the following code:
With this little table, selecting columns and simple slicing works as expected. The restriction that the step is limited to one is not nice, but understandable given the limitations of SQL LIMIT.
Now, what doesn't work is to only pick a single column of it.
While selecting the other way round is fine:
I've seen #1985 talking about the same error message, though it talks about subscripting with a boolean array, which is a different story.
Being forced into a certain order of selecting rows or columns first is, imho, at least confusing. Why should subscripting suddenly be impossible if you select only one instead of two columns? This is not understandable from a user point of view. It feels like a SQL select suddenly failing just because you select fewer columns.
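On the step restriction mentioned earlier in this post: a contiguous slice maps naturally onto SQL `LIMIT`/`OFFSET`, while a step other than one does not. The helper below is a hypothetical sketch of that mapping, written for illustration only; it is not how ibis implements slicing, and the name `slice_to_limit_offset` is made up.

```python
from typing import Optional, Tuple


def slice_to_limit_offset(key: slice) -> Tuple[Optional[int], int]:
    """Translate a Python slice into a (limit, offset) pair for a SQL query.

    A limit of None means "no LIMIT clause" (open-ended slice).
    """
    if key.step not in (None, 1):
        # LIMIT/OFFSET can only describe one contiguous run of rows.
        raise ValueError("step must be 1: LIMIT/OFFSET cannot skip rows")

    start = 0 if key.start is None else key.start
    stop = key.stop

    if start < 0 or (stop is not None and stop < 0):
        # Negative bounds count from the end, which needs the row count.
        raise ValueError("negative bounds are not supported in this sketch")

    limit = None if stop is None else max(stop - start, 0)
    return limit, start


limit, offset = slice_to_limit_offset(slice(1, 3))
print(f"LIMIT {limit} OFFSET {offset}")  # LIMIT 2 OFFSET 1
```

An open-ended slice such as `[2:]` yields a limit of `None`, which a caller would render as an `OFFSET` without a `LIMIT` (or whatever the target dialect requires).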
It's the same with the polars backend, except that it only accepts a single upper or lower limit, but not both, and returns None otherwise, which might be a bug of its own.
Memtable fails as well, so I guess it's a generic issue rather than a backend-related one:
What version of ibis are you using?
5.1.0
What backend(s) are you using, if any?
duckDB
Relevant log output
No response
Code of Conduct