-
-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Constraint violation when creating admin group on initial setup (with OAuth2) #801
Comments
HI there, just to get a handle on what you are doing I'll need to duplicate your setup, we do use OAuth2 with Google ourselves so setting up shouldn't be an issue. The downside of failing to setup the first time is that you must empty the database schema and start again, you cannot fail and then recover. I think there is a bit in the docs about this, but the way it works is that if auth-disable is true, it will create the organisation/portfolio/etc and then redirect to your OAuth2 provider and that process must complete successfully, otherwise you are left with an unrecoverable system. We haven't really come up with a good solution to this as no-one seems to have raised it before, but I'm not sure if thats the situation you found yourself in first up... I will use the k8s config we have in featurehub-installs and set it up for oauth2 with google and disable login and see how we get on. We are adding SAML2 with this release so its doubly important that if we have some issue we fix it as 1.5.8 is almost out the door :-) |
I'll try to prepare a minimal reproducer. Instead of embedded postgreql 11 subchart, we are using bitnami/postgresql-ha (with pg 14.2) installed separately into the same namespace. It works fine except for the initial setup. Let's see whether it works with embedded chart or not. |
OK, so the problem is a race condition happening when I use The postgresql-ha works so that it contains so-called pgpool that sends DML and DDL operations to primary db and distributes DQL operations (SELECTs) evenly among all nodes (both primary and replicas). There's asynchronous replication from the primary node to all replicas. It means that when the first user is inserted to As it is actually unfortunate inefficiency of Ebean ORM, that it needs to query database for the entity it just created one millisecond ago, I'm closing this issue. Until solved in the Ebeam ORM, we can't reliably install Featurehub on an asynchronously replicated database :( |
Hi, I'm the creator of Ebean ORM. What are you talking about? Ebean doesn't need to query the database after an insert? Are you referring to GetGeneratedKeys ? Can you be more specific as to what this scenario is? |
hey @rbygrave - you got some watch on the word ebean? 😂 I've got a setup for the HA setup and its mentioned in the docs that you can use HA, but you can't tell ebean to use the read replica because read-only transactions will go to the read replica, and that delay can cause us issues in the UI. As long as you don't configure the HA in ebean itself, MR will operate correctly - but thats on how we use ebean, its not an ebean issue. I want to keep this open because I am encountering an issue where the front end isn't honouring the redirect it is getting from the backend when running locally and I'm not sure why. I need to figure it out before we release 1.5.8. |
I'm watching feature-hub of course :) My first glance says the Person is created at line 170 ... but the Person isn't returned from that call? To me that looks suspicious as to why person is created there but that method isn't returning the created person [edit: or returning the created persons id value]. (not that I've looked at the code, it's just a first glance). |
Ok, so I finally figured out that the pvc on the basic postgres install is set to retain policy - so my setup with local user id was all setup and once i deleted that, the oauth2 stuff worked as expected with the non-HA postgres database. The OAuth2 stuff never creates a person via setup if it is being used, only the organisation and portfolio exist, the OAuth2 process once it completes recognizes the first person should be the admin and gives them the permissions. So something else is happening that would be great to investigate! @mouchar do you have a config you could share for kube for the HA variant? You can see I was using this one https://github.com/featurehub-io/featurehub-install/blob/master/docker-compose-options/mr-edge-rest-postgres/docker-compose.yaml for Docker Compose but i'd rather not do that if Postgres has a nice k8s one. I'd like to fold it into 1.5.8 for you can provide something for me? |
Always nice to reproduce but it does sound like a race condition with replication lag (so we need replication lag to reproduce). It sounds like there is an insert followed by a select and I'm assuming that is used to populate Apologies if this isn't helpful, I am just theorizing based only on looking at the stack trace and understanding there is replication lag. If I was pointed to the insert and select code in question though I could confirm. |
Hi guys, first of all, thank you for all the effort you're investing in debugging this evasive bug. I prepared a simple reproducer in this gist. I also logged SQL statements, here is the log of the problematic part. I just removed a lot of noise coming from repmgr and k8s probes. The log starts on the first user check:
|
The time delta between While it is impossible to reliably avoid replication lag, it should be possible to avoid the |
Yes, exactly. We don't want the So I think the next step is to find those 2 places in the code where the person is inserted, and where the group is inserted. We need the id value of the person inserted to be passed to the place where the group is inserted ... and remove the query for person by email and replace that with a reference bean using the passed person id value. |
However, thats not one request, there is a service layer issuing those request to create the user and then reissues the generic request to add groups to the created user using service layer primitives :-) If I wrapped it all in one transaction would that make a difference @rbygrave ? In an HA situation that would probably make sense for PUT, POST and DELETE requests. |
So ... BUT ... Just above that we have ... And inside that is a select followed by an update if the row is found: featurehub/backend/mr-db-sql/src/main/java/io/featurehub/db/services/AuthenticationSqlApi.java Lines 83 to 113 in a6093bf
... now if the select and update was surrounded by a transaction then that select would go against the master datasource, but it does not so that select is going to go against the "read-only" datasource (that often points to a read replica database which can have replication lag). That is, to me it looks like there is a second issue here. This second issue is that the logic to reset the token in So it looks like there are 2 possible fixes to this issue. |
I literally just posted and then saw your question :) ... Yes, putting a transaction around the whole lot would mean all the select queries in that transaction are executed against the master and not the read-only datasource / read replica. So yes. |
Yeah I don't quite follow that it's more than 1 request. Is it not that there is 1 request that is both inserting the person and then ultimately creating the group? This code block below is doing both: ? |
The OAuth2Adapter is making a request to the database layer which is in one transaction, and then does the groups which is another transaction - where you see personApi, groupApi, portfolioApi - those are all not in the same transaction. HA isn't something we have spent any time on, so if we wrap the appropriate steps (in this case, groups and person) in the same transaction I feel this would solve this specific issue? I'm still not fully convinced, I ran into this issue with read replicas when issuing a POST and a GET request - the delay was sufficiently long that the GET would fail, and as you can't ensure the replica a person is talking to, you are always going to get these issues. If there was some way of ideally assigning a person based on some method to a particular connection so their requests always went to the same place, that would make it more reliable. |
Yes agreed. That should fix it. ... with the caveat that, it is doing
Yeah but I don't see the 2 http requests for this case? I see one request calling the single method https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L107-L143 ... and inside this method its doing what results in An option might be to turn on Ebean SQL and TXN logging and along with threadId it might be more obvious if this is multiple http requests ? |
FWIW AWS Aurora Postgres lag is expected to be less than 200ms and I've confirmed that was actually the case. AWS Aurora MySQL replication lag when I measured it in reality could be seconds !!! and this made it Aurora MySQL read replicas much much less useful/practical. |
2 HTTP requests happen when - for example someone updates a set of features which is one state on one page, and we issue a subsequent GET to update the summary page again, the different between the commit for the updates happening and the GET to get the summary data can happen in < 50ms, particularly when running locally. |
Yeah but that isn't this specific issue right? In this case there is only the 1 http request right? |
No, its not this specific issue :-) |
And there's one more practical side effect of having all this wrapped in TXN. When things go south for any reason during the initial user/group setup, everything is nicely rolled back and you don't end up with inconsistent DB with an existing user and without groups. Then, people can simply retry the action and hope for a better outcome :) |
Yeah, so I dropped a PR that fixed this specific issue and ran it up and immediately hit other issues. One significant one is that we need to make sure feature updates are committed and in the database before we publish them. This leads to obvious problems as that async publish happens immediately after save and as its in a different transaction, it can go to a different database. That means I have to make sure the record is there by polling for it before i try and publish it, which is really pretty nasty. Suggestions are welcome! |
Describe the bug
Initial setup fails on constraint violation (represented as HTTP error 500 on UI) while trying to create
org_admin
group. Since the initial setup is not completed, the group is not created, the first user is not added to it and when user logs in, it has absolutely no permissions and no possibility to fix that.Which area does this issue belong to?
To Reproduce
Steps to reproduce the behavior:
oauth2.providers.google.*
(Google oauth2 credentials)oauth2.providers: oauth2-google
(other oauth2 providers are possible also affected)/mr-api/initialize returns
500 Internal Server Error
auth.disable-login: "true"
(only oauth2 login is allowed, even for the first user)Expected behavior
Featurehub can be successfully initialized.
Screenshots
auth.disable-login: true
:auth.disable-login: false
:Versions
You can get the version of the FeatureHub container by running
docker ps
command.Please include the OS and what version of the OS and Docker you're running.
Additional context
The value of
auth.disable-login
doesn't play much role when Oauth2 is configured. When set tofalse
, the setup form asks for username and password, but the user is created without a password anyway and therefore can not be used for local login (because of NPE inio.featurehub.db.password.PasswordSalter.validatePassword(PasswordSalter.java:68)
).Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: