-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cgroup rewrite: uses systemctl for expressing desired configuration instead drop-in files #3269
Conversation
@@ -650,8 +650,6 @@ def get_controllers(self, expected_relative_path=None): | |||
controller = MemoryControllerV1(self._cgroup_name, controller_path) | |||
|
|||
if controller is not None: | |||
msg = "{0} controller for cgroup: {1}".format(supported_controller_name, controller) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging here is a side-effect of calling get_controllers which being called everywhere. As a result, we are simply logging at random places where it's not needed. So, removing it here
|
||
def initialize(self): | ||
try: | ||
if self._initialized: | ||
return | ||
# check whether cgroup monitoring is supported on the current distro | ||
self._cgroups_supported = CGroupUtil.cgroups_supported() | ||
self._cgroups_supported = self._check_cgroups_supported() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all cgorups_supported checks to one place
# running in system.slice. | ||
|
||
self.__setup_azure_slice() | ||
self._setup_azure_slice() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not followed consistent naming in cgroupconfigurator. Renamed all of those to single underscore
@@ -46,9 +46,17 @@ def run_command(command: Any, shell=False) -> str: | |||
NOTE: The command's stdout and stderr are read as text streams. | |||
""" | |||
process = Popen(command, stdout=PIPE, stderr=PIPE, shell=shell, text=True) | |||
timer = threading.Timer(15 * 60, process.kill) # Kill process after timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeout for ssh runs
_rlock = threading.RLock() | ||
|
||
@staticmethod | ||
def set_track_throttled_time(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
track throttled time by default as part of CPU metrics, no explicit enable
@@ -46,9 +46,17 @@ def run_command(command: Any, shell=False) -> str: | |||
NOTE: The command's stdout and stderr are read as text streams. | |||
""" | |||
process = Popen(command, stdout=PIPE, stderr=PIPE, shell=shell, text=True) | |||
timer = threading.Timer(15 * 60, process.kill) # Kill process after timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is applied to all commands, not only ssh
also, for ssh, we should terminate the remote process, not the local process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I would think we even don't want the commands to stuck.
is process.kill not terminate the child process which includes remote process? In my experiment processes in the vm also terminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, ok, as long as the remote process is killed too, we are ok
@@ -0,0 +1,28 @@ | |||
#!/usr/bin/env bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this script being invoked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially thought of using it tests; seems at the end it was not used. If it's not needed in the future. I'll remove
|
||
def is_unit_loaded(unit_name): | ||
""" | ||
Determine if a unit is loaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean for a unit to be loaded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit being loaded means unit is parsed and available to the systemd process. We check this for extension services when we set quotas. If systemd is unaware of extension services and not loaded in the system yet, we get error while setting quotas. Hence, I added loaded check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If systemd is unaware of extension services and not loaded in the system yet, we get error while setting quotas. Hence, I added loaded check.
can you add this as a comment where you call this function?
@@ -1337,11 +1337,10 @@ def set_extension_resource_limits(self): | |||
# setup the resource limits for extension operations and it's services. | |||
man = self.load_manifest() | |||
resource_limits = man.get_resource_limits() | |||
if not CGroupConfigurator.get_instance().is_extension_resource_limits_setup_completed(extension_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check getting removed? Will we set up resource limits on each operation for the ext?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar check was added in cgroupconfigurator setup_slice method. We don't touch the configuration again if limits already in place
log_cgroup_info("Ensuring the agent's CPUQuota is {0}".format(quota_percentage)) | ||
if CGroupConfigurator._Impl.__try_set_cpu_quota(quota_percentage): | ||
CGroupsTelemetry.set_track_throttled_time(True) | ||
log_cgroup_info("Setting {0}'s CPUQuota is {1}".format(unit_name, quota_percentage)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log_cgroup_info("Setting {0}'s CPUQuota is {1}".format(unit_name, quota_percentage)) | |
log_cgroup_info("Setting {0}'s CPUQuota to {1}".format(unit_name, quota_percentage)) |
if current == 0 and not (self._is_process_descendant_of_the_agent(process) or self._is_zombie_process(process)): | ||
current_unexpected[process] = self._format_process(process) | ||
if report_immediately: | ||
report = [current_unexpected[process] for process in current_unexpected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_unexpected.values()?
tests_e2e/tests/lib/shell.py
Outdated
|
||
stdout, stderr = process.communicate() | ||
try: | ||
timer.start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start() should probably be outside the try statement (if it throws, I think we do not want to call cancel())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, approved with one minor comment on shell.py (e2e tests)
Description
Current agents use drop-in file mechanism to set cgroup desired configuration. Creating drop-in files required daemon-reload run. This cmd can affect other services in the system like have seen for wlamart incident
The main changes in the PR:
systemctl set-property --runtime
for resource control settings.https://www.man7.org/linux/man-pages/man1/systemctl.1.htmlAdded comments in the code and in the PR as well for other changes.
Issue #
PR information
develop
branch.Quality of Code and Contribution Guidelines