Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constraint violation when creating admin group on initial setup (with OAuth2) #801

Open
1 of 5 tasks
mouchar opened this issue Jun 29, 2022 · 23 comments
Open
1 of 5 tasks

Comments

@mouchar
Copy link

mouchar commented Jun 29, 2022

Describe the bug
Initial setup fails on constraint violation (represented as HTTP error 500 on UI) while trying to create org_admin group. Since the initial setup is not completed, the group is not created, the first user is not added to it and when user logs in, it has absolutely no permissions and no possibility to fix that.

Which area does this issue belong to?

  • FeatureHub Admin Web app
  • SDK
  • SDK examples
  • Documentation
  • Other (management-repository)

To Reproduce
Steps to reproduce the behavior:

  1. deploy FeatureHub with the following settings (only relevant settings are shown for brevity):
  • oauth2.providers.google.* (Google oauth2 credentials)
  • oauth2.providers: oauth2-google (other oauth2 providers are possible also affected)
  1. Navigate to MR UI
  2. Fill in some Organization and Portfolio
  3. Form is submitted and page /mr-api/initialize returns 500 Internal Server Error
  • auth.disable-login: "true" (only oauth2 login is allowed, even for the first user)

Expected behavior
Featurehub can be successfully initialized.

Screenshots

  • With auth.disable-login: true:
io.ebean.DataIntegrityException: Error[ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint   Detail: Failing row contains (71bdb0ee-244a-41ee-b5c9-023987542c16, null, null, null, t, d45fe326-a6be-42f1-8364-a3626182c5f0, org_admin, 2022-06-29 14:41:54.396+00, 2022-06-29 14:41:54.396+00, 1).]
	io.ebean.config.dbplatform.SqlCodeTranslator.translate(SqlCodeTranslator.java:74) ~[ebean-api-12.15.0.jar:?]
	io.ebean.config.dbplatform.DatabasePlatform.translate(DatabasePlatform.java:245) ~[ebean-api-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:77) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.insert(DmlBeanPersister.java:46) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeInsert(PersistRequestBean.java:1221) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNow(PersistRequestBean.java:743) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNoBatch(PersistRequestBean.java:791) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeOrQueue(PersistRequestBean.java:782) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:470) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:419) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.save(DefaultPersister.java:403) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1635) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1627) ~[ebean-core-12.15.0.jar:?]
	io.featurehub.db.services.GroupSqlApi.saveGroup(GroupSqlApi.java:504) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.db.services.GroupSqlApi.createOrgAdminGroup(GroupSqlApi.java:199) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.resources.oauth2.OAuth2MRAdapter.createUser(OAuth2MRAdapter.java:127) ~[mr-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.resources.oauth2.OAuth2MRAdapter.successfulCompletion(OAuth2MRAdapter.java:82) ~[mr-1.1-SNAPSHOT.jar:?]
	io.featurehub.web.security.oauth.OauthResource.token(OauthResource.kt:62) ~[security-oauth-1.1-SNAPSHOT.jar:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
	org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [jersey-container-grizzly2-http-3.0.3.jar:?]
	io.featurehub.jersey.DelegatingHandler.service(DelegatingHandler.kt:76) [common-web-1.1-SNAPSHOT.jar:?]
	org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-3.0.1.jar:3.0.1]
	java.lang.Thread.run(Unknown Source) [?:?]
	Caused by: org.postgresql.util.PSQLException: ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint
  Detail: Failing row contains (71bdb0ee-244a-41ee-b5c9-023987542c16, null, null, null, t, d45fe326-a6be-42f1-8364-a3626182c5f0, org_admin, 2022-06-29 14:41:54.396+00, 2022-06-29 14:41:54.396+00, 1).
	org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134) ~[postgresql-42.3.3.jar:42.3.3]
	io.ebean.datasource.pool.ExtendedPreparedStatement.executeUpdate(ExtendedPreparedStatement.java:130) ~[ebean-datasource-7.3.jar:?]
	io.ebeaninternal.server.type.DataBind.executeUpdate(DataBind.java:100) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.InsertHandler.execute(InsertHandler.java:106) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlHandler.executeNoBatch(DmlHandler.java:87) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:69) ~[ebean-core-12.15.0.jar:?]
  • With auth.disable-login: false:
io.ebean.DataIntegrityException: Error[ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint   Detail: Failing row contains (9dfc43a6-329b-4f23-916e-02983867b03d, null, null, null, t, 75bd4fe5-457b-4998-ba6e-053c71ae7e49, org_admin, 2022-06-29 15:53:43.035+00, 2022-06-29 15:53:43.035+00, 1).]
	io.ebean.config.dbplatform.SqlCodeTranslator.translate(SqlCodeTranslator.java:74) ~[ebean-api-12.15.0.jar:?]
	io.ebean.config.dbplatform.DatabasePlatform.translate(DatabasePlatform.java:245) ~[ebean-api-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:77) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.insert(DmlBeanPersister.java:46) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeInsert(PersistRequestBean.java:1221) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNow(PersistRequestBean.java:743) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNoBatch(PersistRequestBean.java:791) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeOrQueue(PersistRequestBean.java:782) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:470) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:419) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.save(DefaultPersister.java:403) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1635) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1627) ~[ebean-core-12.15.0.jar:?]
	io.featurehub.db.services.GroupSqlApi.saveGroup(GroupSqlApi.java:504) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.db.services.GroupSqlApi.createOrgAdminGroup(GroupSqlApi.java:199) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.resources.SetupResource.setupSiteAdmin(SetupResource.java:176) ~[mr-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.api.SetupServiceDelegator.setupSiteAdmin(SetupServiceDelegator.java:28) ~[mr-api-java-server-1.1-SNAPSHOT.jar:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
	org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [jersey-container-grizzly2-http-3.0.3.jar:?]
	io.featurehub.jersey.DelegatingHandler.service(DelegatingHandler.kt:76) [common-web-1.1-SNAPSHOT.jar:?]
	org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-3.0.1.jar:3.0.1]
	java.lang.Thread.run(Unknown Source) [?:?]
	Caused by: org.postgresql.util.PSQLException: ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint
  Detail: Failing row contains (9dfc43a6-329b-4f23-916e-02983867b03d, null, null, null, t, 75bd4fe5-457b-4998-ba6e-053c71ae7e49, org_admin, 2022-06-29 15:53:43.035+00, 2022-06-29 15:53:43.035+00, 1).
	org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134) ~[postgresql-42.3.3.jar:42.3.3]
	io.ebean.datasource.pool.ExtendedPreparedStatement.executeUpdate(ExtendedPreparedStatement.java:130) ~[ebean-datasource-7.3.jar:?]
	io.ebeaninternal.server.type.DataBind.executeUpdate(DataBind.java:100) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.InsertHandler.execute(InsertHandler.java:106) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlHandler.executeNoBatch(DmlHandler.java:87) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:69) ~[ebean-core-12.15.0.jar:?]

Versions

  • FeatureHub version [1.5.7]
  • SDK version [N/A]
  • OS: [k8s 1.21 on Amazon Linux, if that matters]
  • Browser [any]

You can get the version of the FeatureHub container by running docker ps command.
Please include the OS and what version of the OS and Docker you're running.

Additional context
The value of auth.disable-login doesn't play much role when Oauth2 is configured. When set to false, the setup form asks for username and password, but the user is created without a password anyway and therefore can not be used for local login (because of NPE in io.featurehub.db.password.PasswordSalter.validatePassword(PasswordSalter.java:68)).

Add any other context about the problem here.

@rvowles
Copy link
Contributor

rvowles commented Jun 29, 2022

HI there, just to get a handle on what you are doing I'll need to duplicate your setup, we do use OAuth2 with Google ourselves so setting up shouldn't be an issue.

The downside of failing to setup the first time is that you must empty the database schema and start again, you cannot fail and then recover. I think there is a bit in the docs about this, but the way it works is that if auth-disable is true, it will create the organisation/portfolio/etc and then redirect to your OAuth2 provider and that process must complete successfully, otherwise you are left with an unrecoverable system. We haven't really come up with a good solution to this as no-one seems to have raised it before, but I'm not sure if thats the situation you found yourself in first up... I will use the k8s config we have in featurehub-installs and set it up for oauth2 with google and disable login and see how we get on.

We are adding SAML2 with this release so its doubly important that if we have some issue we fix it as 1.5.8 is almost out the door :-)

@mouchar
Copy link
Author

mouchar commented Jun 30, 2022

I'll try to prepare a minimal reproducer. Instead of embedded postgreql 11 subchart, we are using bitnami/postgresql-ha (with pg 14.2) installed separately into the same namespace. It works fine except for the initial setup. Let's see whether it works with embedded chart or not.

@mouchar
Copy link
Author

mouchar commented Jun 30, 2022

OK, so the problem is a race condition happening when I use bitnami/postgresql-ha as a backend, instead of
bitnami/postgresql chart (without replication and load balancing support).

The postgresql-ha works so that it contains so-called pgpool that sends DML and DDL operations to primary db and distributes DQL operations (SELECTs) evenly among all nodes (both primary and replicas). There's asynchronous replication from the primary node to all replicas.

It means that when the first user is inserted to fh_person, chances are that the subsequent SELECT from fh_person will target some replica before the previous change was propagated and the newly created user is not found. Therefore, the fk_person_who_created is set to NULL, and constraint violation occurs.

As it is actually unfortunate inefficiency of Ebean ORM, that it needs to query database for the entity it just created one millisecond ago, I'm closing this issue.

Until solved in the Ebeam ORM, we can't reliably install Featurehub on an asynchronously replicated database :(

@mouchar mouchar closed this as not planned Won't fix, can't repro, duplicate, stale Jun 30, 2022
@rbygrave
Copy link

As it is actually unfortunate inefficiency of Ebean ORM, that it needs to query database for the entity it just created one millisecond ago, I'm closing this issue.

Hi, I'm the creator of Ebean ORM. What are you talking about? Ebean doesn't need to query the database after an insert? Are you referring to GetGeneratedKeys ?

Can you be more specific as to what this scenario is?

@rvowles
Copy link
Contributor

rvowles commented Jun 30, 2022

hey @rbygrave - you got some watch on the word ebean? 😂

I've got a setup for the HA setup and its mentioned in the docs that you can use HA, but you can't tell ebean to use the read replica because read-only transactions will go to the read replica, and that delay can cause us issues in the UI. As long as you don't configure the HA in ebean itself, MR will operate correctly - but thats on how we use ebean, its not an ebean issue.

I want to keep this open because I am encountering an issue where the front end isn't honouring the redirect it is getting from the backend when running locally and I'm not sure why. I need to figure it out before we release 1.5.8.

@rvowles rvowles reopened this Jun 30, 2022
@rbygrave
Copy link

rbygrave commented Jun 30, 2022

I'm watching feature-hub of course :)

https://github.com/featurehub-io/featurehub/blob/main/backend/mr/src/main/java/io/featurehub/mr/resources/SetupResource.java#L169-L176

My first glance says the Person is created at line 170 ... but the Person isn't returned from that call? To me that looks suspicious as to why person is created there but that method isn't returning the created person [edit: or returning the created persons id value]. (not that I've looked at the code, it's just a first glance).

@rvowles
Copy link
Contributor

rvowles commented Jun 30, 2022

Ok, so I finally figured out that the pvc on the basic postgres install is set to retain policy - so my setup with local user id was all setup and once i deleted that, the oauth2 stuff worked as expected with the non-HA postgres database.

The OAuth2 stuff never creates a person via setup if it is being used, only the organisation and portfolio exist, the OAuth2 process once it completes recognizes the first person should be the admin and gives them the permissions. So something else is happening that would be great to investigate!

@mouchar do you have a config you could share for kube for the HA variant? You can see I was using this one https://github.com/featurehub-io/featurehub-install/blob/master/docker-compose-options/mr-edge-rest-postgres/docker-compose.yaml for Docker Compose but i'd rather not do that if Postgres has a nice k8s one.

I'd like to fold it into 1.5.8 for you can provide something for me?

@rbygrave
Copy link

fk_person_who_created (is null on insert)

Always nice to reproduce but it does sound like a race condition with replication lag (so we need replication lag to reproduce). It sounds like there is an insert followed by a select and I'm assuming that is used to populate fk_person_who_created ... and I hence I believe the fix for this is to not execute the query after the insert but instead using a reference bean. That is, if I was pointed to the code that does the insert and code that does the select I could probably suggest the fix and it would possibly be fairly obvious once we see it (which would be to use a reference bean rather than execute the select from person after insert).

Apologies if this isn't helpful, I am just theorizing based only on looking at the stack trace and understanding there is replication lag. If I was pointed to the insert and select code in question though I could confirm.

@mouchar
Copy link
Author

mouchar commented Jun 30, 2022

Hi guys, first of all, thank you for all the effort you're investing in debugging this evasive bug.

I prepared a simple reproducer in this gist.
I tried it many times and this error doesn't happen all the time. If I decreased the number of k8s worker nodes, I managed to initialize FeatureHub successfully in ~80% of cases.

I also logged SQL statements, here is the log of the problematic part. I just removed a lot of noise coming from repmgr and k8s probes. The log starts on the first user check:

postgresql-ha-postgresql-2 21:50:48.490 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.490 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.491 GMT [378] LOG:  execute <unnamed>: select t0.id from fh_person t0 limit 1
postgresql-ha-postgresql-2 21:50:48.491 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.491 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.492 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.492 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.493 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.password_alg, t0.when_last_authenticated, t0.name, t0.email, t0.password, t0.password_requires_reset, t0.token, t0.token_expiry, t0.when_archived, t0.when_updated, t0.when_created, t0.version, t0.who_changed_id, t0.fk_person_who_created from fh_person a0 where t0.email = $1
postgresql-ha-postgresql-2 21:50:48.493 GMT [378] DETAIL:  parameters: $1 = 'robert.moucha@gooddata.com'
postgresql-ha-postgresql-0 21:50:48.493 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.493 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.495 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.495 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.496 GMT [610] LOG:  execute <unnamed>: insert into fh_person (id, password_alg, when_last_authenticated, name, email, password, password_requires_reset, token, token_expiry, when_archived, when_updated, when_created, version, who_changed_id, fk_person_who_created) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15)
postgresql-ha-postgresql-0 21:50:48.496 GMT [610] DETAIL:  parameters: $1 = 'f25e8692-5085-45b5-b5f7-2b3a23245ae0', $2 = 'PBKDF2WithHmacSHA512', $3 = NULL, $4 = 'Robert Moucha', $5 = 'robert.moucha@gooddata.com', $6 = NULL, $7 = 'f', $8 = '656a154f-4b26-4edd-8a8d-45056307bb56', $9 = '2022-07-07 21:50:48.494747+00', $10 = NULL, $11 = '2022-06-30 21:50:48.495+00', $12 = '2022-06-30 21:50:48.495+00', $13 = '1', $14 = NULL, $15 = NULL
postgresql-ha-postgresql-0 21:50:48.497 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.497 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.505 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.505 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.505 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.password_alg, t0.when_last_authenticated, t0.name, t0.email, t0.password, t0.password_requires_reset, t0.token, t0.token_expiry, t0.when_archived, t0.when_updated, t0.when_created, t0.version, t0.who_changed_id, t0.fk_person_who_created from fh_person t0 where t0.email = $1
postgresql-ha-postgresql-2 21:50:48.505 GMT [378] DETAIL:  parameters: $1 = 'robert.moucha@gooddata.com'
postgresql-ha-postgresql-0 21:50:48.506 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.506 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.507 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.507 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.508 GMT [378] LOG:  execute S_5: select t0.id, t0.when_archived, t0.name, t0.when_updated, t0.when_created, t0.version, t0.fk_named_cache, t0.group_id from fh_organization t0
postgresql-ha-postgresql-0 21:50:48.508 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.508 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.509 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.509 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.509 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.when_archived, t0.is_admin_group, t0.group_name, t0.when_updated, t0.when_created, t0.version, t0.fk_person_who_created, t0.fk_portfolio_id, t0.fk_organization_id from fh_group t0 where t0.is_admin_group = $1 and t0.fk_portfolio_id is null and t0.fk_organization_id = $2
postgresql-ha-postgresql-2 21:50:48.509 GMT [378] DETAIL:  parameters: $1 = 't', $2 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.510 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.510 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.511 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.511 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.511 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.when_archived, t0.name, t0.when_updated, t0.when_created, t0.version, t0.fk_named_cache, t0.group_id from fh_organization t0 where t0.id = $1
postgresql-ha-postgresql-2 21:50:48.511 GMT [378] DETAIL:  parameters: $1 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.512 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.512 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.513 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.513 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.514 GMT [378] LOG:  execute <unnamed>: select t0.id from fh_group t0 where t0.when_archived is null and t0.fk_portfolio_id is null and t0.fk_organization_id = $1 limit 1
postgresql-ha-postgresql-2 21:50:48.514 GMT [378] DETAIL:  parameters: $1 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.514 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.514 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.516 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] LOG:  execute <unnamed>: insert into fh_group (id, when_archived, is_admin_group, group_name, when_updated, when_created, version, fk_person_who_created, fk_portfolio_id, fk_organization_id) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] DETAIL:  parameters: $1 = '92c31a68-a9cd-4ae5-a8cd-79d440bd0c0d', $2 = NULL, $3 = 't', $4 = 'org_admin', $5 = '2022-06-30 21:50:48.515+00', $6 = '2022-06-30 21:50:48.515+00', $7 = '1', $8 = NULL, $9 = NULL, $10 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] ERROR:  null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] DETAIL:  Failing row contains (92c31a68-a9cd-4ae5-a8cd-79d440bd0c0d, null, null, null, t, 44297691-749c-4661-b7af-8e8469c8a958, org_admin, 2022-06-30 21:50:48.515+00, 2022-06-30 21:50:48.515+00, 1).
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] STATEMENT:  insert into fh_group (id, when_archived, is_admin_group, group_name, when_updated, when_created, version, fk_person_who_created, fk_portfolio_id, fk_organization_id) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)
postgresql-ha-postgresql-2 21:50:48.520 GMT [378] LOG:  execute S_6: ROLLBACK
postgresql-ha-postgresql-0 21:50:48.520 GMT [610] LOG:  execute S_6: ROLLBACK

@mouchar
Copy link
Author

mouchar commented Jun 30, 2022

The time delta between insert into fh_person (done on primary postgresql-ha-postgresql-0) and select from fh_person (done on replica postgresql-ha-postgresql-2) is 9ms in this case. The rest of log is straightforward - if the select from fh_person returns zero rows (due to replication lag as I suppose), then the insert into fh_group has parameter $8 = NULL (fk_person_who_created).

While it is impossible to reliably avoid replication lag, it should be possible to avoid the select from fh_person query, especially if the Person is known because it was created literally a few lines above in the same method.

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

it should be possible to avoid the select from fh_person query, especially if the Person is known because it was created literally a few lines above in the same method.

Yes, exactly. We don't want the select .. from fh_person t0 where t0.email = $1 ... and instead we should be using a reference bean when creating the group (insert into ... fh_group) regardless of replication lag.

So I think the next step is to find those 2 places in the code where the person is inserted, and where the group is inserted. We need the id value of the person inserted to be passed to the place where the group is inserted ... and remove the query for person by email and replace that with a reference bean using the passed person id value.

@rvowles
Copy link
Contributor

rvowles commented Jul 1, 2022

However, thats not one request, there is a service layer issuing those request to create the user and then reissues the generic request to add groups to the created user using service layer primitives :-) If I wrapped it all in one transaction would that make a difference @rbygrave ? In an HA situation that would probably make sense for PUT, POST and DELETE requests.

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

So personApi.create(email, username,null); here ...
https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L114
... that returns PersonApi.PersonToken which has the person.id value in it, so we need to capture that returned PersonToken ... and use that for the Group group = groupApi.createOrgAdminGroup(organization.getId(), "org_admin", person);

... BUT ...

Just above that we have ...
Person person = authenticationApi.register(username, email, null, null);
https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L122

And inside that is a select followed by an update if the row is found:

public Person register(@Nullable String name, @NotNull String email, @Nullable String password, Opts opts) {
if (email == null) return null;
return new QDbPerson()
.email
.eq(email.toLowerCase())
.findOneOrEmpty()
.map(
person -> {
if (person.getToken() == null) {
return null;
}
// its ok password is null
String saltedPassword = passwordSalter.saltAnyPassword(password, DbPerson.DEFAULT_PASSWORD_ALGORITHM);
if (saltedPassword == null && password != null) {
return null;
}
person.setName(name);
person.setPassword(saltedPassword);
person.setPasswordAlgorithm(DbPerson.DEFAULT_PASSWORD_ALGORITHM);
person.setToken(null);
person.setTokenExpiry(null);
person.setWhenLastAuthenticated(Instant.now());
updateUser(person);
return convertUtils.toPerson(person, opts == null ? Opts.opts(FillOpts.Groups, FillOpts.Acls) : opts );
})
.orElse(null);

... now if the select and update was surrounded by a transaction then that select would go against the master datasource, but it does not so that select is going to go against the "read-only" datasource (that often points to a read replica database which can have replication lag).

That is, to me it looks like there is a second issue here. This second issue is that the logic to reset the token in AuthenticationSqlApi.java#L83-L113 isn't actually working (because the select runs against a read-replica with potential replication lag). IF, we want this logic then we should wrap a transaction around that select + update. IF we did that, it would probably find the newly inserted person and work.

So it looks like there are 2 possible fixes to this issue.

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

I literally just posted and then saw your question :) ... Yes, putting a transaction around the whole lot would mean all the select queries in that transaction are executed against the master and not the read-only datasource / read replica. So yes.

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

However, thats not one request,

Yeah I don't quite follow that it's more than 1 request. Is it not that there is 1 request that is both inserting the person and then ultimately creating the group? This code block below is doing both:

https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L113-L127

?

@rvowles
Copy link
Contributor

rvowles commented Jul 1, 2022

The OAuth2Adapter is making a request to the database layer which is in one transaction, and then does the groups which is another transaction - where you see personApi, groupApi, portfolioApi - those are all not in the same transaction. HA isn't something we have spent any time on, so if we wrap the appropriate steps (in this case, groups and person) in the same transaction I feel this would solve this specific issue?

I'm still not fully convinced, I ran into this issue with read replicas when issuing a POST and a GET request - the delay was sufficiently long that the GET would fail, and as you can't ensure the replica a person is talking to, you are always going to get these issues. If there was some way of ideally assigning a person based on some method to a particular connection so their requests always went to the same place, that would make it more reliable.

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

so if we wrap the appropriate steps (in this case, groups and person) in the same transaction I feel this would solve this specific issue?

Yes agreed. That should fix it.

... with the caveat that, it is doing insert person, find person, update person, insert group so that isn't "optimal" per say vs just insert person, insert group but this does sound like a rare use case so I suspect that optimizing this to be just insert person, insert group isn't a priority.

when issuing a POST and a GET request

Yeah but I don't see the 2 http requests for this case? I see one request calling the single method https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L107-L143 ... and inside this method its doing what results in insert person, find person, update person, insert group ? ... and because there is not a transaction wrapping those the find person is using the read-only datasource / read replica (with the lag) ?

An option might be to turn on Ebean SQL and TXN logging and along with threadId it might be more obvious if this is multiple http requests ?

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

FWIW AWS Aurora Postgres lag is expected to be less than 200ms and I've confirmed that was actually the case. AWS Aurora MySQL replication lag when I measured it in reality could be seconds !!! and this made it Aurora MySQL read replicas much much less useful/practical.

@rvowles
Copy link
Contributor

rvowles commented Jul 1, 2022

2 HTTP requests happen when - for example someone updates a set of features which is one state on one page, and we issue a subsequent GET to update the summary page again, the different between the commit for the updates happening and the GET to get the summary data can happen in < 50ms, particularly when running locally.

@rbygrave
Copy link

rbygrave commented Jul 1, 2022

Yeah but that isn't this specific issue right? In this case there is only the 1 http request right?

@rvowles
Copy link
Contributor

rvowles commented Jul 1, 2022

No, its not this specific issue :-)

@mouchar
Copy link
Author

mouchar commented Jul 1, 2022

And there's one more practical side effect of having all this wrapped in TXN. When things go south for any reason during the initial user/group setup, everything is nicely rolled back and you don't end up with inconsistent DB with an existing user and without groups.

Then, people can simply retry the action and hope for a better outcome :)

@rvowles
Copy link
Contributor

rvowles commented Jul 1, 2022

Yeah, so I dropped a PR that fixed this specific issue and ran it up and immediately hit other issues. One significant one is that we need to make sure feature updates are committed and in the database before we publish them. This leads to obvious problems as that async publish happens immediately after save and as its in a different transaction, it can go to a different database. That means I have to make sure the record is there by polling for it before i try and publish it, which is really pretty nasty. Suggestions are welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants