Enums, enumerations and database types #1184

dmos62 · 2022-03-16T14:48:48Z

dmos62
Mar 16, 2022
Collaborator

We had a sync call with @mathemancer about this. I summed up our sync in the following DM:

We want a way to pass around database types and other enumerations that's superior to passing around primitive types like strings and classes. A use case we talked about was database types and the inconsistent casing of their identifiers and how it wouldn't be a problem if we had passed around Enums. Sometimes we need the identity of a database type, other times its string identifier, and other times we need its Python class, or maybe yet some other attribute of that type. We agreed that it might be interesting to put methods on Enum subclass instances for stuff like that. You pointed out that SA's ischema_names (engine.dialect.ischema_names) could be a concern when finding a solution for this stuff (we didn't have time to go into why, but it has to do with reflection).

Topics we might want to expand on:

- passing around Enum instances instead of attributes unique to some enumeration;
  - an example of an attribute unique to an enumeration could be the string identifier of a database type, but it could also be the SA class of that type;
- how to deal with engine.dialect.ischema_names and reflection in this context;
- maybe some concrete examples of code or concrete refactor tactics.
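As a rough illustration of the first topic, here is a minimal sketch of an Enum that carries its own string identifier and a related class. The names `DbType`, `from_id`, `get_python_class` and the mapping `_PYTHON_CLASSES` are hypothetical and not taken from the Mathesar codebase; the Python builtins stand in for SA classes.

```python
from enum import Enum


class DbType(Enum):
    """Hypothetical enumeration of database types; values are string ids."""

    NUMERIC = "numeric"
    TEXT = "text"
    BOOLEAN = "boolean"

    @property
    def id(self) -> str:
        # The string identifier lives in exactly one place: the Enum value.
        return self.value

    @classmethod
    def from_id(cls, raw_id: str) -> "DbType":
        # Normalise casing once, at the boundary, instead of at every call site.
        return cls(raw_id.strip().lower())

    def get_python_class(self) -> type:
        # Illustrative lookup; a real version would map to SA classes instead.
        return _PYTHON_CLASSES[self]


# Assumed mapping, for illustration only.
_PYTHON_CLASSES = {
    DbType.NUMERIC: float,
    DbType.TEXT: str,
    DbType.BOOLEAN: bool,
}
```

With something like this, callers pass `DbType.NUMERIC` around and ask it for whichever attribute they need, so inconsistent casing only has to be handled in `from_id`.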

mathemancer · 2022-03-16T15:09:19Z

mathemancer
Mar 16, 2022
Maintainer

Currently, we just extend the engine.dialect.ischema_names dictionary with key-value pairs for new or modified types. We might consider replacing it whole-cloth, but I'm not confident doing that without better understanding how that dictionary is used in the reflection code. The only documentation is the following change log:

https://docs.sqlalchemy.org/en/14/changelog/changelog_08.html?highlight=ischema_names

After a cursory glance at the actual code, it seems that the ischema_names dict is used in the get_columns method of the reflection code. It appears to determine the python type to use by comparing the DB text representation of the column type against dictionary keys in the ischema_names dict. It's going to be difficult to replace that system.
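To make that lookup concrete, here is a simplified sketch that is only loosely modelled on the behaviour described above and is not SQLAlchemy's actual code. The dict, the placeholder classes, the function `resolve_column_type` and the key `my_schema.email` are all assumptions for illustration.

```python
import re


# Placeholder classes standing in for SA type classes.
class Numeric:
    pass


class Text:
    pass


class NullType:
    pass


# Stand-in for engine.dialect.ischema_names: DB type name -> Python-side class.
ischema_names = {"numeric": Numeric, "text": Text}


def resolve_column_type(db_type_text, type_map):
    # Drop any arguments, e.g. "NUMERIC(10, 2)" -> "NUMERIC", then normalise.
    base_name = re.sub(r"\(.*\)", "", db_type_text).strip().lower()
    # Names missing from the dict fall back to a catch-all class.
    return type_map.get(base_name, NullType)


# Extending the dict with a new key is enough to make a type resolvable.
ischema_names["my_schema.email"] = Text
```

The sketch shows why the dict keys matter: the text the database reports for a column's type has to match a key, otherwise the column's type cannot be told apart from any other unknown type.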

1 reply

kgodey Mar 16, 2022
Maintainer

Here's some documentation on the purpose of ischema_names.

kgodey · 2022-03-16T18:53:20Z

kgodey
Mar 16, 2022
Maintainer

I don't have much to contribute to this discussion beyond the link to the ischema_names documentation I put in my other reply. General opinions:

- A single Enum source of truth for types seems like a good idea.
- I don't think replacing ischema_names is a good idea given that it seems integral to reflection.

In order to have concrete suggestions for refactors, I'd need to basically do the work for #1100. I'd be happy to comment on any concrete suggestions offered by someone else.

One very general idea I have is that we could have a single file to hold information about each type in a standard format in db/types/. Currently we only do this for custom types, but we could extend it to all supported types. That file could serve as the source of truth for all type related information and we can use it wherever we need a specific attribute related to a type.

0 replies

dmos62 · 2022-03-18T12:53:43Z

dmos62
Mar 18, 2022
Collaborator Author

The #1100 issue this discussion stemmed from (which I didn't mention in the top post) is not a quick refactor, since the refactorer (me) needs to get a handle on quite a few things. Our type logic is extensive and has a fair amount of technical debt to work through.

I've been doing a slow-trickle of a refactor slash exploration of related type logic.

A central goal I have is to replace most, if not all, of the type strings or type classes being passed around with Enums. A thing I did was have PostgresType and MathesarCustomType mix in a new class DatabaseType (this class itself was actually introduced in a recent alias-related PR) and give it the property ischema_key, so that every db type Enum instance could be converted into a canonical ischema_names-compatible string. Can be seen here: https://github.com/centerofci/mathesar/blob/9264c0d1508af517e5ed78126bdeb99114b645e0/db/types/base.py#L42

Excerpt:

class DatabaseType:

    [...]

    @property
    def ischema_key(self):
        """
        Looks up this type's canonical type (if it is an alias) and returns its string id that may
        correspond to keys on the SA ischema_names dict.

        Note that PostgresType values are already such keys. However, MathesarCustomType values
        require adding a qualifier prefix.
        """
        canonical_id = self.canonical_id
        if isinstance(self, MathesarCustomType):
            ischema_key = get_qualified_name(canonical_id)
        else:
            ischema_key = canonical_id
        return ischema_key


class PostgresType(DatabaseType, Enum):

    [...]


class MathesarCustomType(DatabaseType, Enum):

    [...]

A problem right now is that if I start changing strings with Enums I'll have a hell of a time tracking down all the places that were expecting a string, but got an Enum. A possible approach could be to add type annotations to all the relevant code. Should be laborious, but not hard. I'd remove the annotations before merge.
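A minimal sketch of the annotation idea, assuming a static checker such as mypy is run over the code. The two-member `PostgresType` here is a trimmed-down stand-in, and `get_column_label` and `coerce_to_enum` are hypothetical helpers, not functions from the codebase.

```python
from enum import Enum
from typing import Union


class PostgresType(Enum):
    # Trimmed-down stand-in with illustrative values.
    NUMERIC = "numeric"
    TEXT = "text"


def get_column_label(db_type: PostgresType) -> str:
    # With this annotation, a static checker can flag callers that still
    # pass a plain string instead of an Enum member.
    return db_type.value.upper()


def coerce_to_enum(db_type: Union[PostgresType, str]) -> PostgresType:
    # Optional transitional shim: accept both forms while call sites migrate.
    if isinstance(db_type, PostgresType):
        return db_type
    return PostgresType(db_type.lower())
```

A call such as `get_column_label("numeric")` would then be reported by the checker before it fails at runtime, which is the tracking-down work the annotations are meant to take over.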

2 replies

mathemancer Mar 18, 2022
Maintainer

I definitely think this change would improve lots of parts of the code. As I mentioned in a DM, we have multiple different string representations of each type due to vestigial remains of previous decisions that have since changed. Sometimes the different representations are actually the same string, but have to be acquired through different means, but the point stands. It would be great to start pulling those into methods of some Enum class.

kgodey Mar 18, 2022
Maintainer

This change looks good to me too.

A problem right now is that if I start changing strings with Enums I'll have a hell of a time tracking down all the places that were expecting a string, but got an Enum.

Could you elaborate with an example from the codebase? I've done a whole bunch of these types of refactors (e.g. see #582) and might be able to help figure out a good strategy.

dmos62 · 2022-03-21T18:02:26Z

dmos62
Mar 21, 2022
Collaborator Author

We had another sync with @mathemancer.

We reflected on the fact that it's interesting that we're exposing database type alias information to the frontend. @mathemancer pointed out (correctly, I think) that ideally we wouldn't need to do that.

We realised that the alias information was requested by the frontend team, because we were telling the frontend about database types that are possibly aliases of one another. For example, both DECIMAL and NUMERIC (they're aliases) are on the db types/ endpoint, and that is because both are on the ischema_names SA dict, even though we'll never reflect a column to have a type DECIMAL.

We discussed some solutions. Our goal would be to only use and expose types that "reflect as themselves", meaning if you create a column of type X and reflect it, it should reflect as type X. That is currently not the case for a few types. Noteworthy examples:

- aliases, like DECIMAL: SA takes it as input, but once you reflect a column created with it, it's always NUMERIC;
  - solution would be to either alter ischema_names directly [0] or wrap it in an abstraction that takes away the aliases;
- some SA quirks, like NAME and "CHAR": columns created with these types always reflect as String (which is SA's catch-all string-like type);
  - solution would be to alter the ischema_names dict, by mapping NAME and "CHAR" to our custom classes that are meant to provide what's necessary for these to be normal, distinguishable types;
    - until we do that (this should be low priority), we can stop supporting NAME and "CHAR" columns;
      - not sure how to stop supporting a type.

Above "solutions" would result in a database type set that's much easier to handle, both for the backend and frontend, because we need to be able to tell what type a column will be if we apply some type to it.

There's another approach that might be worth thinking about. We could explicitly model the fact that ischema_names is messy, or, specifically, that using type X when altering or creating a column results in a column of type Y, where X and Y are not necessarily the same types.
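One possible shape for that explicit model is sketched below. `PgType`, `REFLECTS_AS`, `reflected_type` and `reflects_as_itself` are hypothetical names, the member values are illustrative, and mapping NAME and "CHAR" to VARCHAR is an assumption standing in for "reflects as SA's String".

```python
from enum import Enum


class PgType(Enum):
    # Illustrative subset; values are not guaranteed to match ischema_names keys.
    NUMERIC = "numeric"
    DECIMAL = "decimal"
    TEXT = "text"
    VARCHAR = "character varying"
    NAME = "name"
    CHAR = '"char"'


# Type applied to a column -> type seen when that column is reflected.
# Types absent from this table are assumed to reflect as themselves.
REFLECTS_AS = {
    PgType.DECIMAL: PgType.NUMERIC,
    PgType.NAME: PgType.VARCHAR,  # assumption: stands in for SA's String
    PgType.CHAR: PgType.VARCHAR,  # assumption: stands in for SA's String
}


def reflected_type(applied: PgType) -> PgType:
    return REFLECTS_AS.get(applied, applied)


def reflects_as_itself(db_type: PgType) -> bool:
    return reflected_type(db_type) is db_type


# The set of types that would be safe to use and expose under the stated goal.
stable_types = [t for t in PgType if reflects_as_itself(t)]
```

Under this model the messy types are not hidden: the table records that applying type X yields type Y, and the "reflects as itself" set is derived from it rather than maintained by hand.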

[0] @mathemancer showed me that ischema_names is meant to support being edited "live".

6 replies

dmos62 Mar 22, 2022
Collaborator Author

@kgodey that's the part I'm fuzzy about. We definitely want to exclude these types from type inference logic for example, which should be fine. I'm currently wrapping my head around this.

mathemancer Mar 22, 2022
Maintainer

Neither NAME nor "CHAR" is currently involved in type inference. We infer to a subset of the types (for example, we never infer a column to be of DOUBLE PRECISION type; it's always NUMERIC).

mathemancer Mar 22, 2022
Maintainer

As far as support goes, now that I think about it, we can't currently reflect the type of columns of type "CHAR" or NAME, since those reflect in SQLAlchemy as String, which is interpreted as VARCHAR, since that's what the String python type compiles to. Argh. To fix this, we need to create UserDefinedTypes for each of those, and inject them into the ischema_names dict the same way we do for all other custom types.

This means that in the current state, users are still able to read from and write to the columns, but they'll show as VARCHAR for the database type in the UI (though the Text mathesar type is still accurate for both of those).

kgodey Mar 22, 2022
Maintainer

To fix this, we need to create UserDefinedTypes for each of those, and inject them into the ischema_names dict the same way we do for all other custom types.

Can one of you create an issue for this? We can figure out prioritization later.

mathemancer Mar 22, 2022
Maintainer

Here's an issue for it: #1214

I marked it as "help wanted" and "good first issue", since there are a few examples of how to set up these UserDefinedTypes in our codebase, and these should be reasonably simple.

dmos62 · 2022-03-21T19:58:53Z

dmos62
Mar 21, 2022
Collaborator Author

An interesting aspect of our type logic is that custom Mathesar types use qualified identifiers in some places (e.g. MATHESAR_TYPE.EMAIL) and unqualified identifiers in other places (e.g. EMAIL). @mathemancer do you think it makes sense to use unqualified identifiers at all?

2 replies

mathemancer Mar 22, 2022
Maintainer

Not anymore. The unqualified string versions of those floating around are due to a previous idea about how to present types in the API that we've since discarded. I think the only string version of (database) types that we should use should be the canonical version, which should be equal to the key in the ischema_names dict. This is subject to some unforeseen circumstance that dictates using the unqualified version for something.

dmos62 Mar 22, 2022
Collaborator Author

@mathemancer Ok! I'll then try and expose only qualified custom type ids.

dmos62 · 2022-03-21T21:46:29Z

dmos62
Mar 21, 2022
Collaborator Author

I've made changes to the proposed DatabaseType mixin. Together with the recent insights, this is helping get rid of a lot of code.

class DatabaseType:
    @property
    def id(self):
        # Here we're defining Enum's value attribute to be the database type id.
        # However, the meaning of this id is not clear, since often you'll use the ischema_key
        # below. I've not yet fully conceptualized this duality.
        return self.value

    # TODO it would be great to merge id(self) and ischema_key(self). for that we'd need to factor
    # out the difference in how PostgresType and MathesarCustomType ids are handled. specifically,
    # MathesarCustomType ids require adding a prefix.
    @property
    def ischema_key(self):
        """
        Returns the key corresponding to this type on the SA ischema_names dict.

        Note that PostgresType values are already such keys. However, MathesarCustomType values
        require adding a qualifier prefix.
        """
        id = self.id
        if isinstance(self, MathesarCustomType):
            return get_qualified_name(id)
        else:
            return id

    def get_sa_class(self, engine):
        """
        Returns the SA class corresponding to this type or None if this type is not supported by
        provided engine, or if it's ignored (see is_ignored).
        """
        if not self.is_ignored:
            ischema_names = engine.dialect.ischema_names
            return ischema_names.get(self.ischema_key)

    def is_available(self, engine):
        """
        Returns true if this type is available on provided engine.
        """
        return self.get_sa_class(engine) is not None

    @property
    def is_ignored(self):
        """
        We ignore some types. Current rule is that if type X is applied to a column, but upon
        reflection that column is of some other type, we ignore type X. This mostly means
        ignoring aliases. It also ignores NAME and CHAR, because both are reflected as the SA
        String type.
        """
        ignored_types = (
            PostgresType.TIME,
            PostgresType.TIMESTAMP,
            PostgresType.DECIMAL,
            PostgresType.NAME,
            PostgresType.CHAR,
        )
        return self in ignored_types

I'm pretty excited to see where this takes our type code.

4 replies

kgodey Mar 22, 2022
Maintainer

This looks great!

Why does is_ignored need to be a property on every DatabaseType instance? It seems like it should be a class property or just a constant in the same file that's referenced where needed.

dmos62 Mar 22, 2022
Collaborator Author

@kgodey hm, I'm not sure I completely follow, but the general idea is that I want an Enum mixing in DatabaseType to know whether it's ignored. I chose between doing this explicitly (like this) and doing it implicitly by removing those Enum instances from the Enum (and thus from our code) directly. I figured that this is more prudent.

I'm currently doing a wholesale refactor of a big chunk of our type code to use this. It's going very well so far. I'll try to post an update soon.

Edit: if you meant why the constant ignored_types set is defined in the instance method, I agree that it can be defined outside the class. These small details are in the air right now.
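For reference, moving the constant out of the method could look roughly like the sketch below. It uses a trimmed-down `PostgresType` with illustrative members, and the name `IGNORED_TYPES` is an assumption rather than something from the codebase.

```python
from enum import Enum


class DatabaseType:
    @property
    def is_ignored(self) -> bool:
        # The name is resolved at call time, so the constant can be defined
        # further down the module, after the Enum members exist.
        return self in IGNORED_TYPES


class PostgresType(DatabaseType, Enum):
    # Trimmed-down, illustrative member list.
    NUMERIC = "numeric"
    DECIMAL = "decimal"
    NAME = "name"


# Module-level constant, built once instead of on every property access.
IGNORED_TYPES = frozenset({PostgresType.DECIMAL, PostgresType.NAME})
```

Each Enum member still knows whether it is ignored, while the list itself sits in one place where other code can reference it directly.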

kgodey Mar 22, 2022
Maintainer

if you meant why the constant ignored_types set is defined in the instance method, I agree that it can be defined outside the class. These small details are in the air right now.

I did mean that and sounds good.

mathemancer Mar 22, 2022
Maintainer

I'm also pretty happy with the simplifications and streamlining for how we handle types.
