Which column identifier should we use for DDL operation #1081

silentninja · 2022-02-19T06:13:38Z

silentninja
Feb 19, 2022
Collaborator

We currently use several identifiers to access a column:

columns array index - This used to be the preferred way to access a column, but we ran into problems because it is mutable: any column name change or dropping of an intermediate column would change the index position of the column.
attnum - column_array_index was deprecated for the reasons stated above and replaced by an attnum-based column reference at the service layer in this PR. The attnum is not used directly to reference a column; rather, the column name that the attnum refers to is fetched at runtime and then used for the various operations.
column_name - Most database operations are performed using a column name. So whichever identifier we go with, column_name will be the final identifier used for database operations.

Our current codebase uses the Django Column model, which is backed by attnums, on the service layer and a mix of column_array_index and column_name on the db layer. The attnum-based Django Column model fetches the column name using the attnum at runtime, and from that either constructs a column object or fetches the column_array_index to be used by the db layer. Since we need to replace the deprecated column_array_index on the db layer, the question is what to replace it with.
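To make the difference between the three identifiers concrete, here is a minimal in-memory sketch. `ToyTable` and its methods are hypothetical and only imitate the relevant behaviour (attnums are assigned once and never reused); this is not our code or a real database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table's column catalog."""
    # attnum -> current column name; an attnum is never reused after a drop
    columns: Dict[int, str] = field(default_factory=dict)
    next_attnum: int = 1

    def add_column(self, name: str) -> int:
        attnum = self.next_attnum
        self.columns[attnum] = name
        self.next_attnum += 1
        return attnum

    def rename_column(self, attnum: int, new_name: str) -> None:
        self.columns[attnum] = new_name

    def drop_column(self, attnum: int) -> None:
        del self.columns[attnum]

    def names_in_order(self) -> List[str]:
        # What an index-based accessor would see: live columns only
        return [self.columns[a] for a in sorted(self.columns)]

    def name_for(self, attnum: int) -> str:
        return self.columns[attnum]


table = ToyTable()
id_attnum = table.add_column("id")
title_attnum = table.add_column("title")
year_attnum = table.add_column("year")

index_before = table.names_in_order().index("year")

# Drop an intermediate column, then rename the column we care about
table.drop_column(title_attnum)
table.rename_column(year_attnum, "release_year")

index_after = table.names_in_order().index("release_year")
current_name = table.name_for(year_attnum)
```

In this toy model the array index of the column moves and its name changes, while `year_attnum` still resolves to the same column.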

Current codebase

flowchart LR
A[Frontend api] -->|Django column id| B(Django View) -->|Django Column model with attnum field| C(Django methods)
C <-->|Column array index fetched using attnum field on the Django Column model| D[db module functions] <-->|Column name| E[Database operations]
C <-->|Column name fetched using attnum field on the Django Column model| D[db module functions] <-->|Column name| E[Database operations]

The solution proposed by the issue

flowchart LR
A[Frontend api] -->|Django column id| B(Django View) -->|Django Column model with attnum field| C(Django methods)
C <-->|attnum field from the Django Column model| D[db module functions] <-->|Column name| E[Database operations]

The new solution proposed

flowchart LR
A[Frontend api] -->|Django column id| B(Django View) -->|Django Column model with attnum field| C(Django methods)
C <-->|Column name fetched using attnum| D[db module functions] <-->|Column name| E[Database operations]

Pros of the solution proposed by the issue

Better eventual consistency - Compared to passing column names around, using attnums as function arguments and fetching column names just before a db operation guarantees that the targeted column is correct, without having to worry about whether any intermediate statement changed the column name.

Cons:

Increased refactoring time - Most of the codebase already uses column names to reference columns.
Duplicated logic - Each db function call will have to call an attnum to column_name converter, since most operations are done using column names. This also increases the number of queries.
Wrapping every column name access with an attnum lookup and reflecting the table is the only way to guarantee the reference, and that seems infeasible: any subsequent SQLAlchemy function call that uses column names fetched at the start of the function would still fail after a column name alter function has run. For example:

def fun(attnums):
    column_names = column_attnum_to_names(attnums)
    some_sqlalchemy_function(column_names)
    alter_column_name(attnums)
    alter_column_type(attnums)  # Works, since attnums are unaffected by the rename
    some_sqlalchemy_function(column_names)  # Won't work since the column name has changed

Since most of the calls are synchronous, do we need this complexity?

Pros of using column name:

We are using the same reference used by SQLAlchemy and most db operations, so we can avoid the need for any conversion layer.
For impure functions that change a column name, we can fetch attnums using the column_name; since these calls take place synchronously, there shouldn't be a problem. Such impure functions are very few.

Cons:

Column names are prone to changes in the middle of an execution. So we would have to make sure we use attnums in such places.

My proposal is that:

The service layer should fetch column names using attnums and pass those column_names to any db function that will be called.
The db layer should use column names.
Impure function calls should be handled by the caller, just like any other impure function. The caller will have to fetch the attnum of the column whose name will be changed, call the impure function, and then reflect the column name again if it is used by any subsequent calls. These types of functions are very rare (non-existent from what I have seen) in our codebase.
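A minimal sketch of the caller-side convention in the last point, using a toy dictionary in place of the database. Every name here (`CATALOG`, `get_attnum`, `get_name`, `rename_column`, `describe_column`, `rename_then_describe`) is hypothetical and only illustrates the order of the steps.

```python
# Toy stand-in for the database catalog: attnum -> current column name.
CATALOG = {1: "id", 2: "title"}


def get_attnum(column_name):
    """Look up the attnum for a column name (hypothetical helper)."""
    for attnum, name in CATALOG.items():
        if name == column_name:
            return attnum
    raise KeyError(column_name)


def get_name(attnum):
    """Look up the current name for an attnum (hypothetical helper)."""
    return CATALOG[attnum]


def rename_column(old_name, new_name):
    """Impure db-layer function: addressed by name, and changes that name."""
    CATALOG[get_attnum(old_name)] = new_name


def describe_column(column_name):
    """Pure db-layer function: addressed by name, only reads."""
    get_attnum(column_name)  # raises KeyError if the name is stale
    return f"column {column_name!r}"


def rename_then_describe(column_name, new_name):
    """Caller that handles the impure call as described in the proposal."""
    attnum = get_attnum(column_name)      # 1. pin the column by attnum first
    rename_column(column_name, new_name)  # 2. call the impure function
    current_name = get_name(attnum)       # 3. re-resolve the name afterwards
    return describe_column(current_name)  # 4. later calls use the fresh name


result = rename_then_describe("title", "headline")
```

Passing the old name to `describe_column` after the rename would raise `KeyError`, which is the failure the convention is meant to avoid.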

mathemancer · 2022-02-21T06:49:19Z

mathemancer
Feb 21, 2022
Maintainer

Thanks for making this discussion. The diagrams you've made above match with my understanding, so we can use those as a reference (as far as I'm concerned).

For me, the example you give is a sign that the SQLAlchemy function in question should be wrapped. Further, it's obvious from looking at the function that it will fail, and the solution (get the names from the attnums again) is also obvious. I realize it's just an example, but the point is that using immutable identifiers makes such problems and their solutions easy to reason about. In particular, if you were calling your example function, but modified to use names to identify columns, you'd have to either know from the function name or docs that it would modify column names in order to know that you need to re-request the column names for subsequent calls. This is why my solution to your example would be to wrap the SQLAlchemy function so that it can also take attnum as an argument. I recognize that this would mean the function wrapper would need to get the names from the attnum again, but I think that's less of a problem than the mutability of identifiers, especially in a project that involves as many DDL manipulations as this one. It's not just calling the columns by their names on column-manipulating functions that would be tricky. It also affects Foreign Key manipulation, other constraint manipulation, view creation and definition, etc.

Taking views as an example of where stable IDs would really be helpful: If you change the name of a column underlying a view, the view will still see the column (because it's referenced by attnum under the hood). However, the view column name doesn't change. In fact, the "human readable" view definition will change appropriately. Any function which deals with the underlying columns of views then needs to keep the mapping between the underlying columns and the view columns sorted out. While you still need to know the mapping in the case of using attnums, the mapping would be stable until the view is dropped (or the underlying column is dropped with a cascade). This can of course be arbitrarily horrible, since views can source columns from other views.

Maybe we should write a decorator that would morph a function that takes a column name or names as input into one that takes attnum(s).

0 replies

silentninja · 2022-02-22T05:27:03Z

silentninja
Feb 22, 2022
Collaborator Author

There is also another problem: the tables have to be refreshed again in order to get the latest column name, so the issue is not limited to column names alone.

This is why my solution to your example would be to wrap the SQLAlchemy function so that it can also take attnum as an argument

By wrapping the SQLAlchemy function, I am assuming you mean wrapping the column_name arguments. Or do you mean wrapping the SQLAlchemy functions themselves, like convert_names(sqlalchemy_function, attnums, kwargs)? The latter does not seem like a good idea, since we won't have any way of knowing which argument contains the column names.

I recognize that this would mean the function wrapper would need to get the names from the attnum again, but I think that's less of a problem than the mutability of identifiers, especially in a project that involves as many DDL manipulations as this one

Just so we are on the same page, it would be nice if you could list the functions that affect column names. I am not sure it is really enough of a problem to justify this complexity when it can be solved in a simpler way.

I am looking at this as:

1. Can column rename be treated as a pure function by completely removing access to the affected property?
2. Or should we acknowledge that it is an impure function and have conventions around its usage? This would be just like handling exceptions.

If we go with (1), we would face the cons I mentioned above.

We could instead go with (2): the function that changes the column name would take attnum parameters, and any function calling it would also take attnum parameters to signal that it is an unsafe function too.
For example

def safe_function(column_names):
    alter_column_type(column_names)
    some_sqlalchemy_function(column_names)

def parent_fun(attnums):
    fun(attnums)
    column_names = column_attnum_to_names(attnums)
    alter_column_type(column_names)

def fun(attnums):
    column_names = column_attnum_to_names(attnums)
    some_sqlalchemy_function(column_names)
    # No refresh needed: the function above takes names rather than attnums,
    # so by convention it does not rename columns
    alter_column_name(attnums)
    refresh_tables()
    column_names = column_attnum_to_names(attnums)
    alter_column_type(column_names)
    some_sqlalchemy_function(column_names)

1 reply

mathemancer Feb 22, 2022
Maintainer

By wrapping, I mean something like the following. Given some SQLAlchemy function that takes column names as an argument (the first definition below), we'd write a wrapper that takes attnums instead (the second):

def the_sqlalchemy_ddl_function(arg1, arg2, column_names, arg3):
    ...
    return the_answer

def my_ddl_function(arg1, arg2, attnums, arg3):
    column_names = _get_col_names_from_attnums(attnums)
    return the_sqlalchemy_ddl_function(arg1, arg2, column_names, arg3)

If we then use my_ddl_function in place of the_sqlalchemy_ddl_function in our codebase (passing it the attnums where we would have passed column_names for the wrapped function), we've contained the problem to spots where SQLAlchemy is actually used. We could also wrap our own functions as a temporary measure.

If we want to get fancy, we'd need to generalize to create some kind of decorator. As you noted, we'll need to get information about which argument needs the column names derived from the attnum. For that, I suppose the decorator should take a parameter with an argument name, and use a reasonable default argname. In case the column names argument of the wrapped function is positional, we'd need to use the inspect module to find it. Writing that decorator would be a bit technical up front, but would give us a quick-and-easy way to convert current and future functions to take attnum instead of column names. For use with 3rd-party functions, we'd just apply it at the top of the file using the old-school sqlalchemy_func = mydecorator(col_names_argname)(sqlalchemy_func) syntax to each SQLAlchemy function (or we could use a different name for the wrapped function to avoid confusion).
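A rough sketch of what such a decorator could look like, assuming the resolver is passed in explicitly. The names `takes_attnums`, `get_names`, `fake_get_names` and `FAKE_CATALOG` are hypothetical; only `inspect.signature`, `Signature.bind` and `functools.wraps` are real standard-library APIs.

```python
import functools
import inspect


def takes_attnums(get_names, names_argname="column_names", attnums_argname="attnums"):
    """Morph a function that takes column names into one that takes attnums.

    `get_names` is a hypothetical callable mapping a list of attnums to the
    current column names (in our case it would query the database).
    """
    def decorator(func):
        sig = inspect.signature(func)
        if names_argname not in sig.parameters:
            raise TypeError(f"{func.__name__} has no parameter {names_argname!r}")
        # Same parameters in the same order, with the names parameter renamed,
        # so the argument is found whether it is passed positionally or by keyword.
        new_sig = sig.replace(parameters=[
            p.replace(name=attnums_argname) if p.name == names_argname else p
            for p in sig.parameters.values()
        ])

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = new_sig.bind(*args, **kwargs)
            if attnums_argname in bound.arguments:
                # Resolve names as late as possible, right before the call
                bound.arguments[attnums_argname] = get_names(
                    bound.arguments[attnums_argname]
                )
            call_kwargs = bound.kwargs
            if attnums_argname in call_kwargs:  # keyword-only parameter case
                call_kwargs[names_argname] = call_kwargs.pop(attnums_argname)
            return func(*bound.args, **call_kwargs)

        wrapper.__signature__ = new_sig
        return wrapper

    return decorator


# Toy usage with a fake resolver and a stand-in for a 3rd-party function
FAKE_CATALOG = {1: "id", 2: "title", 3: "year"}


def fake_get_names(attnums):
    return [FAKE_CATALOG[a] for a in attnums]


def the_sqlalchemy_ddl_function(arg1, arg2, column_names, arg3=None):
    return (arg1, arg2, column_names, arg3)


my_ddl_function = takes_attnums(fake_get_names)(the_sqlalchemy_ddl_function)

positional = my_ddl_function("a", "b", [1, 3], "c")
keyword = my_ddl_function("a", "b", attnums=[2])
```

The wrapped function keeps its argument order, so call sites only swap the column names argument for attnums.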

I reiterate: I think stable, immutable identifiers for columns will reduce complexity, not add it. We're doing lots of DDL operations in this project, and we plan to do many more. Chaining these together in a sane way requires a way to address different database objects, including columns, easily.

Final note: Which of your arguments for addressing columns by name does not apply to addressing tables by name? Do you think we should also avoid using OIDs to look up tables?

silentninja · 2022-02-22T13:13:57Z

silentninja
Feb 22, 2022
Collaborator Author

If we then use my_ddl_function in place of the_sqlalchemy_ddl_function in our codebase (passing it the attnums where we would have passed column_names for the wrapped function)

Would this mean we have to create a decorator function for each sqlalchemy function used in our codebase? I would go with a generalised decorator as I find it to be a much better option if we plan to use attnums.

I think stable, immutable identifiers for columns will reduce complexity, not add it. We're doing lots of DDL operations in this project, and we plan to do many more.

But how many of those change the column name?

Which of your arguments for addressing columns by name does not apply to addressing tables by name? Do you think we should also avoid using OIDs to look up tables?

I am not against using an immutable identifier. Since the column names are already derived from attnums in the first place, I just find using the attnum as an accessor every time to be redundant, as it comes with additional costs such as having to decorate functions and run additional queries. There are very few functions that could alter column names in between function calls, so I think it is much more efficient to take note of those functions and use conventions around them.

I find the existing alter_column function to be a good example.

0 replies

dmos62 · 2022-02-22T14:01:32Z

dmos62
Feb 22, 2022
Collaborator

@mathemancer @silentninja I'll try to summarize the discussion. Please point out what I get wrong.

SQL not supporting immutable identifiers is awkward and we can't really get around that without shifting that awkwardness somewhere.

The options for where to shift it, as per your discussion, are:

1. using conventions and being careful, which means using (mutable) names or a mix of names and attnums (immutable): this makes the code more complicated, but faster; this is @silentninja's suggestion (as I understand it);
2. wrapping each SQLAlchemy function so that it takes an immutable identifier (attnum) and converts it to the mutable identifier (name) under-the-hood; the mutable identifier (name) is never used outside these wrapper functions; this way we'd make synchronization simpler, but increase the number of queries to the database: so simpler, but slower; two ways to do this:
    i. manually write a library of functions that will wrap SA functions (plain 1:1 mapping);
    ii. devise an automatic wrapper/decorator: I don't like this idea: let's keep it simple and write it out by hand: this solution does not have to scale.

Both 1 and 2i sound good to me; 2ii not so much.

2i is slower, query-wise, but I have a feeling that we'd find good ways to optimize a hot-path with this setup.

1 has a developer keeping track of more things, which might be a burden for reviewing and onboarding new contributors. It's faster, query-wise, but it's not clear if that's worth-it.

I'd say let's do the slower, simpler 2i.

As for 2ii, as I said above, let's not overengineer this.

4 replies

silentninja Feb 22, 2022
Collaborator Author

Performance is not really the issue I am focusing on. Going with (2) adds unnecessary overhead, which can be solved with a proper real-time cache if needed, but it is still not something to brush off easily, as it can add up over time. The real issue I have is that (2) increases developer's time because:

We need to create a decorator for each sql alchemy function you would use
The access method for columns changes, which I find quite unconventional. Instead of using columns[column_name].type, we would end up with

refresh_table() # Could be merged into a single call
get_column_from_attnum(attnum).type

If we had many functions that change a column name, I would definitely agree that it would make the code complicated. But there are just 2 functions that alter a column name (db.columns.operations.alter.batch_update_columns and db.columns.operations.alter.alter_column), and the parent functions that call them add 2 more, for a total of 2 + 2 = 4 functions. Just to keep track of these functions, we would end up having to make a lot of changes, starting from column accessors.

My question is how often we encounter a function that changes a column name. While I agree with the problem (2) is solving, I just don't find it to be a huge problem, and I find (2) to be over-engineering.

kgodey Feb 22, 2022
Maintainer

We need to create a decorator for each sql alchemy function you would use

This is not applicable to solution 2i.

(2) increases developer's time

Only for the initial refactor, after that, the pattern is established.

I don't think there is much point in discussing this further. I think we all understand each other's perspective; we just disagree on which concerns are most important. We need a decision to move forward, which I've made in this comment.

silentninja Feb 22, 2022
Collaborator Author

We need to create a decorator for each sql alchemy function you would use

This is not applicable to solution 2i.

We would need to manually write a library of functions that will wrap SA functions (plain 1:1 mapping). So you cannot use any SQLAlchemy function directly; rather, it has to be wrapped in a mapping function.

kgodey Feb 22, 2022
Maintainer

We would need to manually write a library of functions that will wrap SA functions (plain 1:1 mapping). So you cannot use any SQLAlchemy function directly; rather, it has to be wrapped in a mapping function.

I assume you're referring to the code that @mathemancer proposed above:

def my_ddl_function(arg1, arg2, attnums, arg3):
    column_names = _get_col_names_from_attnums(attnums)
    return the_sqlalchemy_ddl_function(arg1, arg2, column_names, arg3)

If so, we already do this for all DDL operations that I'm aware of; we just need to add the column_names = _get_col_names_from_attnums(attnums) line. Could you find a couple of examples where we're directly using SQLAlchemy functions without wrapping them that we would need to change?

kgodey · 2022-02-22T14:11:13Z

kgodey
Feb 22, 2022
Maintainer

I think @dmos62 has done a good job of summarizing the discussion so far and I agree with his conclusions:

1 has a developer keeping track of more things, which might be a burden for reviewing and onboarding new contributors. It's faster, query-wise, but it's not clear if that's worth-it.

As for 2ii, as I said above, let's not overengineer this.

Let's go with option 2i. If we end up finding that we need to optimize the number of queries we're making for our operations, we can tackle that problem separately at a later date.

0 replies

kgodey · 2022-02-25T22:53:17Z

kgodey
Feb 25, 2022
Maintainer

This discussion was resolved in our weekly call, please see notes here.

0 replies
