Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cgroup rewrite: uses systemctl for expressing desired configuration instead drop-in files #3269

Merged
merged 11 commits into from
Dec 17, 2024

Conversation

nagworld9
Copy link
Contributor

@nagworld9 nagworld9 commented Dec 3, 2024

Description

Current agents use drop-in file mechanism to set cgroup desired configuration. Creating drop-in files required daemon-reload run. This cmd can affect other services in the system like have seen for wlamart incident

The main changes in the PR:

  • Removes the drop-in files, deamon-reload and uses systemctl set-property --runtime for resource control settings.https://www.man7.org/linux/man-pages/man1/systemctl.1.html
  • New behavior: Raise the unexpected process if we see same process second time
  • CPU throttle metric calculated by default instead explicit enabling
  • In e2e test, added timeout for cmds/script runs over ssh

Added comments in the code and in the PR as well for other changes.

Issue #


PR information

  • Ensure development PR is based on the develop branch.
  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@@ -650,8 +650,6 @@ def get_controllers(self, expected_relative_path=None):
controller = MemoryControllerV1(self._cgroup_name, controller_path)

if controller is not None:
msg = "{0} controller for cgroup: {1}".format(supported_controller_name, controller)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging here is a side-effect of calling get_controllers which being called everywhere. As a result, we are simply logging at random places where it's not needed. So, removing it here


def initialize(self):
try:
if self._initialized:
return
# check whether cgroup monitoring is supported on the current distro
self._cgroups_supported = CGroupUtil.cgroups_supported()
self._cgroups_supported = self._check_cgroups_supported()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved all cgorups_supported checks to one place

# running in system.slice.

self.__setup_azure_slice()
self._setup_azure_slice()
Copy link
Contributor Author

@nagworld9 nagworld9 Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not followed consistent naming in cgroupconfigurator. Renamed all of those to single underscore

@@ -46,9 +46,17 @@ def run_command(command: Any, shell=False) -> str:
NOTE: The command's stdout and stderr are read as text streams.
"""
process = Popen(command, stdout=PIPE, stderr=PIPE, shell=shell, text=True)
timer = threading.Timer(15 * 60, process.kill) # Kill process after timeout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout for ssh runs

_rlock = threading.RLock()

@staticmethod
def set_track_throttled_time(value):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

track throttled time by default as part of CPU metrics, no explicit enable

@nagworld9 nagworld9 changed the title Cgroup rewrite: uses systemctl for resource control settings instead drop-in files Cgroup rewrite: uses systemctl for expressing desired configuration instead drop-in files Dec 3, 2024
azurelinuxagent/common/osutil/systemd.py Outdated Show resolved Hide resolved
azurelinuxagent/common/osutil/systemd.py Outdated Show resolved Hide resolved
azurelinuxagent/common/osutil/systemd.py Outdated Show resolved Hide resolved
azurelinuxagent/ga/cgroupcontroller.py Outdated Show resolved Hide resolved
azurelinuxagent/ga/cgroupcontroller.py Outdated Show resolved Hide resolved
azurelinuxagent/ga/cgroupconfigurator.py Show resolved Hide resolved
azurelinuxagent/ga/cgroupconfigurator.py Outdated Show resolved Hide resolved
@@ -46,9 +46,17 @@ def run_command(command: Any, shell=False) -> str:
NOTE: The command's stdout and stderr are read as text streams.
"""
process = Popen(command, stdout=PIPE, stderr=PIPE, shell=shell, text=True)
timer = threading.Timer(15 * 60, process.kill) # Kill process after timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is applied to all commands, not only ssh

also, for ssh, we should terminate the remote process, not the local process

Copy link
Contributor Author

@nagworld9 nagworld9 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I would think we even don't want the commands to stuck.

is process.kill not terminate the child process which includes remote process? In my experiment processes in the vm also terminated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, ok, as long as the remote process is killed too, we are ok

@@ -0,0 +1,28 @@
#!/usr/bin/env bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this script being invoked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially thought of using it tests; seems at the end it was not used. If it's not needed in the future. I'll remove

tests_e2e/tests/scripts/get-agent-name Outdated Show resolved Hide resolved

def is_unit_loaded(unit_name):
"""
Determine if a unit is loaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean for a unit to be loaded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit being loaded means unit is parsed and available to the systemd process. We check this for extension services when we set quotas. If systemd is unaware of extension services and not loaded in the system yet, we get error while setting quotas. Hence, I added loaded check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If systemd is unaware of extension services and not loaded in the system yet, we get error while setting quotas. Hence, I added loaded check.

can you add this as a comment where you call this function?

@@ -1337,11 +1337,10 @@ def set_extension_resource_limits(self):
# setup the resource limits for extension operations and it's services.
man = self.load_manifest()
resource_limits = man.get_resource_limits()
if not CGroupConfigurator.get_instance().is_extension_resource_limits_setup_completed(extension_name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check getting removed? Will we set up resource limits on each operation for the ext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar check was added in cgroupconfigurator setup_slice method. We don't touch the configuration again if limits already in place

azurelinuxagent/ga/cgroupconfigurator.py Show resolved Hide resolved
azurelinuxagent/ga/cgroupconfigurator.py Show resolved Hide resolved
azurelinuxagent/ga/cgroupconfigurator.py Outdated Show resolved Hide resolved
azurelinuxagent/ga/cgroupconfigurator.py Outdated Show resolved Hide resolved
azurelinuxagent/ga/cgroupconfigurator.py Show resolved Hide resolved
log_cgroup_info("Ensuring the agent's CPUQuota is {0}".format(quota_percentage))
if CGroupConfigurator._Impl.__try_set_cpu_quota(quota_percentage):
CGroupsTelemetry.set_track_throttled_time(True)
log_cgroup_info("Setting {0}'s CPUQuota is {1}".format(unit_name, quota_percentage))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log_cgroup_info("Setting {0}'s CPUQuota is {1}".format(unit_name, quota_percentage))
log_cgroup_info("Setting {0}'s CPUQuota to {1}".format(unit_name, quota_percentage))

azurelinuxagent/ga/cgroupconfigurator.py Show resolved Hide resolved
if current == 0 and not (self._is_process_descendant_of_the_agent(process) or self._is_zombie_process(process)):
current_unexpected[process] = self._format_process(process)
if report_immediately:
report = [current_unexpected[process] for process in current_unexpected]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_unexpected.values()?


stdout, stderr = process.communicate()
try:
timer.start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start() should probably be outside the try statement (if it throws, I think we do not want to call cancel())

narrieta
narrieta previously approved these changes Dec 16, 2024
Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, approved with one minor comment on shell.py (e2e tests)

narrieta
narrieta previously approved these changes Dec 17, 2024
@nagworld9 nagworld9 merged commit d7fae2d into Azure:develop Dec 17, 2024
9 of 11 checks passed
@nagworld9 nagworld9 deleted the cgroup-v2 branch December 17, 2024 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants