Session Teardown not having effect on all K8s resources #346
Comments
From other projects (e.g., Longhorn on RKE rancher), we have experienced troubles with deleting pods due to |
So usually what we are doing is to set the
When a timeout is detected only the The Kubernetes Garbage Collector should then clean up all "children/orphans" of the deleted resource. And when the deployment gets cleaned, the pods should go as well. (https://kubernetes.io/docs/concepts/architecture/garbage-collection/#orphaned-dependents) |
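To make the expected behaviour concrete, here is a minimal Python sketch of how cascading deletion via ownerReferences is supposed to play out. It is a toy model, not the Kubernetes garbage collector itself: the `resources` dict and the `cascade_delete` helper are hypothetical, and the ReplicaSet and Pod uids are made up for illustration (the Session and Deployment uids are the ones quoted later in this thread).

```python
from collections import defaultdict

def cascade_delete(resources, deleted_uid):
    """Return the uids that disappear once `deleted_uid` is deleted and every
    dependent whose owners are all gone is collected as well (toy model)."""
    # Index: owner uid -> uids of the objects that reference it as owner.
    dependents = defaultdict(list)
    for uid, res in resources.items():
        for ref in res.get("ownerReferences", []):
            dependents[ref["uid"]].append(uid)

    removed = set()
    queue = [deleted_uid]
    while queue:
        uid = queue.pop()
        if uid in removed:
            continue
        removed.add(uid)
        for child in dependents[uid]:
            owners = [r["uid"] for r in resources[child]["ownerReferences"]]
            # A dependent is only collected once none of its owners is left.
            if all(owner in removed for owner in owners):
                queue.append(child)
    return removed

SESSION = "de3bc534-b24d-4bed-91b3-e37b3e539db8"
DEPLOYMENT = "6a5834e5-39b8-426c-ac95-ebe1f1cda5fa"
REPLICASET = "hypothetical-replicaset-uid"  # assumption, not from the thread
POD = "hypothetical-pod-uid"                # assumption, not from the thread

resources = {
    SESSION: {"kind": "Session"},
    DEPLOYMENT: {"kind": "Deployment", "ownerReferences": [{"uid": SESSION}]},
    REPLICASET: {"kind": "ReplicaSet", "ownerReferences": [{"uid": DEPLOYMENT}]},
    POD: {"kind": "Pod", "ownerReferences": [{"uid": REPLICASET}]},
}

removed = cascade_delete(resources, SESSION)
print(sorted(resources[uid]["kind"] for uid in removed))
```

If the chain is intact, deleting the Session should take the Deployment, ReplicaSet and Pod with it; if any of them survives on a real cluster, the garbage collection step itself is the place to look.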
Thanks for the fast and detailed response, that's exactly what I was hoping for @jfaltermeier ! 😍 In there, the following information appeared to be useful:
As we can see, the ownerReference is in fact set to a Using
|
Does this mean you cannot find the ownerReference of a running, "should-be-timeouted" session? That would be really odd indeed. In that case I believe it would make sense to take a look at all the Sessions that are on the cluster ( If you can find the ownerReference in a running, "should-be-timeouted" session, then you can just run |
Hi @sgraband, no I didn't mean to entail that. In a "should-be-timeouted" pod from 4d ago, the
k get deployment session-yannik-schmidt-artemis-java-17-e37b3e539db8 -v 100:
"ownerReferences": [
{
"apiVersion": "theia.cloud/v1beta8",
"kind": "Session",
"name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
"uid": "de3bc534-b24d-4bed-91b3-e37b3e539db8"
}
],
Using Maybe it has something to do that the ReplicaSet has the following
k get replicaset session-yannik-schmidt-artemis-java-17-e37b3e539db8-6644456b9f -v 100:
"ownerReferences": [
{
"apiVersion": "apps/v1",
"kind": "Deployment",
"name": "session-yannik-schmidt-artemis-java-17-e37b3e539db8",
"uid": "6a5834e5-39b8-426c-ac95-ebe1f1cda5fa",
"controller": true,
"blockOwnerDeletion": true
}
],
When creating a new session via the landing page, the pods, replicaset, deployment, and route in the ingress are properly set up, and the session is created as well.
k describe session ws-artemis-java-17-yannik-schmidt-tum-de-session:
This time I noticed though, that the
Then, the |
I cannot really spot something wrong with the resources you have sent. We are aware of a bug with the monitor, where, when the The order of ownerReferences should be:
This looks correct on your end. Are all of the 4 resources in the same namespace ( |
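For checking this across many resources at once, a small Python sketch like the following can help. It assumes the objects were exported in the usual `-o json` shape (`metadata.uid`, `metadata.namespace`, `metadata.ownerReferences`); the helper name `find_owner_problems` and the `theiacloud` namespace in the sample data are hypothetical. It flags dependents whose owner uid is not among the exported objects, or whose owner lives in a different namespace.

```python
def find_owner_problems(items):
    """Return (dependent name, problem, owner name) tuples for every
    ownerReference that cannot be resolved within `items`."""
    by_uid = {item["metadata"]["uid"]: item for item in items}
    problems = []
    for item in items:
        meta = item["metadata"]
        for ref in meta.get("ownerReferences", []):
            owner = by_uid.get(ref["uid"])
            if owner is None:
                # Owner is gone (or was not exported): the dependent should
                # have been garbage collected.
                problems.append((meta["name"], "owner missing", ref["name"]))
            elif owner["metadata"].get("namespace") != meta.get("namespace"):
                problems.append((meta["name"], "namespace mismatch", ref["name"]))
    return problems

# Sample data: the Session from the thread has already been deleted, so only
# the Deployment is left. The namespace value is an assumption.
items = [
    {
        "kind": "Deployment",
        "metadata": {
            "name": "session-yannik-schmidt-artemis-java-17-e37b3e539db8",
            "namespace": "theiacloud",
            "uid": "6a5834e5-39b8-426c-ac95-ebe1f1cda5fa",
            "ownerReferences": [
                {
                    "kind": "Session",
                    "name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
                    "uid": "de3bc534-b24d-4bed-91b3-e37b3e539db8",
                }
            ],
        },
    }
]

problems = find_owner_problems(items)
for name, problem, owner in problems:
    print(f"{name}: {problem} ({owner})")
```

An "owner missing" entry for a Deployment that is still running would match the symptom described in this issue.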
Monitor configuration
Currently, I'm setting up the pod monitor in the AppDefinition:
theia-cloud-helm-values.yml:
OwnerReference
✔️ The Even though all |
Alright, thank you for checking and for the information. I will test this out locally and see if I can reproduce this. Could you answer two more questions for me:
|
Thank you so much @sgraband! If it helps we can also schedule a quick call to debug that together? I configured the I'm assuming that the |
Let me quickly elaborate on the timeout/monitoring features. As you noticed we have 3 different values atm that are connected with shutting down the pod after a specified amount of time:
Please note that the latter two only work with your application if you install the monitor extension, either the vscode-extension version or the Theia extension. This is because the application needs to provide information about the activity. Basically, the extensions start up a REST service that the operator can then communicate with. This is also the reason why your request is failing (your application does not start the REST service, so the URL is not available). If you do not have one of the two extensions installed you should not add This is just a little bit of background. I am not sure if this is really the problem you are currently experiencing, but you should definitely change this. Maybe you can update this and report back if the issue still persists? |
Oh that's very interesting @sgraband! I did not expect the I'd like to use the VSC extension as it also offers the warning messages etc. Is it correct that it has not yet been released to OpenVSX? Do you plan on releasing it there (it would probably benefit almost all Theia Cloud installations, right?), or would you mind if we (from TUM) do so for our use cases? As we are building our blueprints in the |
The feature set of the Theia extension and the VSC extension should be the same. What do you mean by the warning messages? In theory the Theia extension offers more flexibility and is already published and consumable via npm. Maybe @jfaltermeier could chime in here regarding the plan to publish the VSC extension? |
This is what I referred to as a "warning" :) I'm fine using the extension too, when the features do not differ. The information on git about the Theia extension was just a bit scarce, so I didn't know where to look for a published version. I'll simply add the extension to the Apart from that, I just found out that you are also building and releasing a version of the VSC extension in the |
I added the metrics extension to all my blueprints via the
"dependencies": {
  "@eclipse-theiacloud/monitor-theia": "next",
  "@theia/core": "^1.53.2",
  "@theia/editor": "^1.53.2",
  "@theia/editor-preview": "^1.53.2",
  "@theia/electron": "^1.53.2",
  "@theia/filesystem": "^1.53.2"
}
The
monitor:
  port: 8081
  activityTracker:
    timeoutAfter: 20
    notifyAfter: 15
In the helm values of
monitor:
  enable: true
  port: 8081
  activityTracker:
    enable: true
    interval: 1
According to the operator's log, the
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED: GET http://198.19.97.73:8081/monitor/activity/lastActivity. Error: java.net.SocketTimeoutException: Connect timed out
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28764843 minutes ago)
TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-2b490e91-7a3b-462b-91dc-f35e51826654] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will not be stopped automatically [NoTimeout].
INFO org.eclipse.theia.cloud.operator.messaging.MonitorMessagingServiceImpl - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Could not send message to extension:java.net.SocketTimeoutException: Connect timed out
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 20 minutes was reached!
TRACE org.eclipse.theia.cloud.operator.util.SpecWatch - [session-watch-2214cdb6-44f5-410d-9d1e-5813ccf01796] Session ac5d2932-cc8c-44b7-8ed8-5dff5930e103 : received an event: DELETED
This opens two questions on my end:
Apart from that, the session teardown still does not seem to work. During the short period between starting up the session and the operator killing it (described above), the deployment is correctly configured to have the session as
"ownerReferences": [
{
"apiVersion": "theia.cloud/v1beta8",
"kind": "Session",
"name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
"uid": "ac5d2932-cc8c-44b7-8ed8-5dff5930e103"
}
],
Still, when the session is successfully removed due to the monitor malfunction, the deployment continues to exist. I think this is a separate problem from the monitor one and relates more to the initial problem of this thread. |
The default ports are:
So you would need to use 3000. Then the endpoint should work once the application is started. The bug with 1970 could still happen, but only when the first request is sent before the application is ready. Yes sure, I will try to reproduce the issue with the ownerReferences locally and get back to you. |
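The 1970 behaviour from the log can be reproduced with a few lines of arithmetic. The Python sketch below is not the operator's implementation; it only assumes that a failed request leaves the last activity at the Unix epoch, which is what the log suggests. The function names are hypothetical, and the value of `now` is chosen so that it reproduces the minute count printed in the log.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def minutes_since(last_activity, now):
    """Whole minutes between the last reported activity and `now`."""
    return int((now - last_activity).total_seconds() // 60)

def should_delete_naive(last_activity, now, timeout_after):
    # A failed request that leaves last_activity at the epoch looks like
    # decades of inactivity, so the timeout is exceeded immediately.
    return minutes_since(last_activity, now) >= timeout_after

def should_delete_guarded(last_activity, now, timeout_after):
    # Treat "never reported" as unknown instead of as inactivity.
    if last_activity is None or last_activity == EPOCH:
        return False
    return minutes_since(last_activity, now) >= timeout_after

# Assumed timestamp; it yields exactly the value from the operator log.
now = datetime(2024, 9, 9, 14, 3, tzinfo=timezone.utc)
print(minutes_since(EPOCH, now))             # 28764843 minutes ago
print(should_delete_naive(EPOCH, now, 20))   # session gets deleted
print(should_delete_guarded(EPOCH, now, 20)) # session is kept
```

With `timeoutAfter: 20`, the naive comparison deletes the session on the first failed ping, which matches the "Deleting session as timeout of 20 minutes was reached!" line above.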
Could you quickly elaborate on which ports I need to use where please?
In AppDefinition:
Am I missing some configuration properties? |
In the newest theia-cloud-helm version there is no more
I would assume that you need to set both of them to 3000. See also the |
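As a quick sanity check for the port question, the following Python sketch builds the URL the operator pings (the path is taken from the operator log in this thread) and warns when the configured monitor port differs from the port the application listens on. The helper names are hypothetical, and the assumption that both ports must match only applies to the Theia extension variant as discussed here.

```python
def activity_url(pod_ip, monitor_port):
    """URL as it appears in the operator log for the activity ping."""
    return f"http://{pod_ip}:{monitor_port}/monitor/activity/lastActivity"

def check_monitor_port(monitor_port, app_port):
    """Return a warning string if the ports differ, else None.
    Assumes the monitor is served by the application itself."""
    if monitor_port != app_port:
        return (
            f"monitor port {monitor_port} differs from application port "
            f"{app_port}; the activity endpoint will likely be unreachable"
        )
    return None

print(activity_url("198.19.105.223", 3000))
print(check_monitor_port(8081, 3000))
print(check_monitor_port(3000, 3000))
```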
Thanks for the clarification! I'm a bit confused about the two sets of documentation regarding Theia: the READMEs and the "actual" docs (https://main--theia-cloud.netlify.app/documentation/setuptheiacloud/). I used
DEBUG org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - Pinging sessions: [name=ws-artemis-java-17-yannik-schmidt-tum-de-session version=255981443 value=SessionSpec [name=ws-artemis-java-17-yannik-schmidt-tum-de-session, appDefinition=artemis-java-17, user=yannik.schmidt@tum.de, workspace=null]]
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Pinging session at 198.19.105.223
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] GET http://198.19.105.223:3000/monitor/activity/lastActivity
TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-205d39a7-d18b-4234-8ef5-723100bfa528] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will not be stopped automatically [NoTimeout].
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED (Returned 404: GET http://198.19.105.223:3000/monitor/activity/lastActivity
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28766050 minutes ago)
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 20 minutes was reached!
Can you think of any reason for this behavior? Can I support you in any way to solve this problem? One more thing I noticed while building the blueprint is that the Docker Build appears to fail as soon as I add
120.2 native node modules are already rebuilt for browser
176.0
176.0 <--- Last few GCs --->
176.0
176.0 [10616:0x2c3d28b0] 54243 ms: Mark-sweep 4046.3 (4138.9) -> 4034.6 (4143.2) MB, 772.4 / 0.0 ms (average mu = 0.537, current mu = 0.010) allocation failure; scavenge might not succeed
176.0 [10616:0x2c3d28b0] 55542 ms: Mark-sweep 4050.4 (4143.2) -> 4038.9 (4146.9) MB, 1292.7 / 0.0 ms (average mu = 0.288, current mu = 0.005) allocation failure; scavenge might not succeed
176.0
176.0
176.0 <--- JS stacktrace --->
176.0
176.0 FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
...
176.1 Aborted
176.1
176.1 Error: webpack exited with an unexpected code: 134.
on both my 32GB Mac and our university's runners. All other dependencies ("@eclipse-theiacloud/monitor-theia": "next", "@theia/editor": "^1.53.2", "@theia/editor-preview": "^1.53.2", "@theia/electron": "^1.53.2", "@theia/filesystem": "^1.53.2") do not break the build. |
Hey @sgraband, small update from our end. Apart from the bugs and errors identified in our last discussion, I found that our cluster is pretty ill-configured too. Cascading deletes do not seem to work for now across all manifests, including non-Theia ones. I'll get back to you with updates and look forward to fixes for the identified problems with the monitor 👍 |
Describe the bug
I noticed that pods of timed-out sessions still exist on our cluster and are recreated on deletion by the corresponding replicaset and deployment. As the entry in the ingress is properly removed, they are of no use to the user anymore and block resources.
Expected behavior
During session startup, I expect the deployment, replicaset, and pods to be created as well as a proper entry in the ingress. This is all working perfectly already.
During session teardown (due to timeout), I expect the system to remove all those artifacts again. Currently, only the ingress configuration is removed.
Cluster provider
RKE2 1.27.9.
Version
Newest helm chart
Additional information
Using
k get pods
you can see the (correctly, not timed-out) session ending on aa0d-7757cf8mn85 and the incorrectly still running session ending on ebbb-77877chgb9t.
Describing the ingress, you can see that the old session has been properly removed.