Block extensions disallowed by policy #3259

mgunnala · 2024-11-11T20:43:38Z

Description

Issue #

PR #2 for the policy engine allowlist feature:

- invoke policy engine from exthandlers.py
- block all extensions and report status if any errors are thrown during engine initialization
- block any extensions that are disallowed by policy
- add unit and e2e tests

PR information

The title of the PR is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
If applicable, the PR references the bug/issue that it fixes in the description.
New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

I have read the contribution guidelines.

codecov · 2024-11-13T00:57:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.77%. Comparing base (3aebcdd) to head (a37508f).
Report is 328 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3259      +/-   ##
===========================================
+ Coverage    71.97%   72.77%   +0.79%     
===========================================
  Files          103      114      +11     
  Lines        15692    17081    +1389     
  Branches      2486     2277     -209     
===========================================
+ Hits         11295    12431    +1136     
- Misses        3881     4107     +226     
- Partials       516      543      +27

maddieford

Did an initial review not including tests

maddieford · 2024-11-14T22:12:25Z

azurelinuxagent/ga/policy/policy_engine.py

+    """
+    # TODO: when CRP adds terminal error code for policy-related extension failures, set that as the default code.
+    def __init__(self, msg, inner=None, code=-1):
+        msg = "Extension is disallowed by agent policy and will not be processed: {0}".format(msg)


In the case where the agent failed to parse the policy, I'm not sure we should say 'Extension is disallowed by policy'. In this case, the extension is disallowed because there's some issue reading or parsing the policy.

I also am hesitant about 'agent policy' since policy is provided by customer

I could change this to "Extension will not be processed: "

Parsing errors (InvalidPolicyError) would look like "Extension will not be processed: customer-provided policy file (path) is invalid, please correct the following error..."

Extension disallowed errors (ExtensionPolicyError) would look like "Extension will not be processed: failed to enable extension CustomScript because extension is not specified in policy allowlist. To enable, add extension to the allowed list in policy file (path)."

I've updated the error message as discussed above.

maddieford · 2024-11-14T22:32:44Z

azurelinuxagent/ga/exthandlers.py

+            policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
+            if policy_error is not None:
+                err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
+                self.__handle_and_report_ext_handler_errors(handler_i, err,


Does this create .status files for single config extensions?

I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.

maddieford · 2024-11-14T22:36:07Z

azurelinuxagent/ga/exthandlers.py

+                                                                                             ext_handler.name,
+                                                                                             conf.get_policy_file_path())
+                err = ExtensionPolicyError(msg, code=policy_err_code)
+                self.__handle_and_report_ext_handler_errors(handler_i, err,


same question here about .status file for single config extensions

I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.

maddieford

Changes look good. I'm going to spend tomorrow going through each e2e scenario and unit test, sorry for the slow review :/

maddieford · 2024-11-20T00:15:39Z

azurelinuxagent/ga/exthandlers.py

+                ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
+                # TODO: CRP does not currently have a terminal error code for uninstall. Once CRP adds
+                # an error code for uninstall or for policy, use this code instead of PluginDisableProcessingFailed
+                # Note that currently, CRP waits for 90 minutes to time out for a failed uninstall operation, instead of


Could we add some more detail to this comment?
Something like:

Note that currently, CRP will poll until the agent does not report a status for an extension that should be uninstalled. In the case of a policy error, the agent will report a failed status on behalf of the extension, which will cause CRP to poll for the full timeout period, instead of failing fast.

maddieford · 2024-11-20T00:21:21Z

azurelinuxagent/ga/exthandlers.py

@@ -692,6 +734,26 @@ def __handle_and_report_ext_handler_errors(ext_handler_i, error, report_op, mess
            add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True,
                      message=message)

+    @staticmethod
+    def __handle_and_report_policy_error(ext_handler_i, error, report_op, message, report=True, extension=None):
+        # TODO: Consider merging this function with __handle_and_report_ext_handler_errors() above.


Could you please leave some comment explaining why we broke this into a separate function? For policy related failures, we want to fail extensions fast. CRP will continue to poll for single-config ext status until timeout, so agent should write a status for single-config extensions. The other function does not create that status and we didn't want to touch the other function without investigating the impact of that change further

mgunnala · 2024-11-20T17:54:19Z

azurelinuxagent/ga/exthandlers.py

+
+        # Create status file for extensions with settings (single and multi config).
+        if extension is not None:
+            ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,


create_status_file_if_not_exist() will not overwrite existing status file (for the current sequence number). Is this behavior acceptable?

we should overwrite the existing file with the policy error

We now overwrite the existing file with the policy error. I've added an "overwrite" parameter and changed the function name to create_status_file().
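For readers following the thread, here is a minimal sketch of the "overwrite" behavior described above. It is not the agent's implementation: the function signature, the `<seq_no>.status` file naming, and the JSON layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def create_status_file(status_dir, seq_no, status, code, message, overwrite=False):
    """Write <seq_no>.status; keep an existing file unless overwrite=True.

    Hypothetical stand-in for the agent's method; the real signature may differ.
    Returns True if the file was written, False if an existing file was kept.
    """
    path = os.path.join(status_dir, "{0}.status".format(seq_no))
    if os.path.exists(path) and not overwrite:
        return False
    with open(path, "w") as status_file:
        json.dump({"status": status, "code": code, "message": message}, status_file)
    return True


status_dir = tempfile.mkdtemp()
# The extension already wrote a status for sequence number 0.
create_status_file(status_dir, 0, "success", 0, "Enable succeeded")
# Without overwrite the earlier status survives.
kept = create_status_file(status_dir, 0, "error", -1, "Extension will not be processed")
# With overwrite the policy error replaces it.
replaced = create_status_file(status_dir, 0, "error", -1, "Extension will not be processed", overwrite=True)
with open(os.path.join(status_dir, "0.status")) as status_file:
    final_status = json.load(status_file)
```

With the old "if not exist" behavior, a stale success status for the current sequence number would hide the policy error; the explicit flag makes the policy path opt in to replacing it.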

maddieford · 2024-11-21T02:26:07Z

azurelinuxagent/ga/exthandlers.py

+                ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
+                # Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not
+                # report status for that extension, or until timeout is reached. In the case of a policy error, the
+                # agent reports failed status on behalf of the extension, which will cause CRP to  for the full timeout,


Suggested change

# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,

# agent reports failed status on behalf of the extension, which will cause CRP to poll for the full timeout,

nit

Fixed, thanks!

maddieford · 2024-11-21T02:55:54Z

tests/ga/test_extension.py

@@ -3507,5 +3510,144 @@ def test_report_msg_if_handler_manifest_contains_invalid_values(self):
                self.assertIn("'supportsMultipleExtensions' has a non-boolean value", kw_messages[2]['message'])


+class TestExtensionPolicy(TestExtensionBase):


Could we add a test case for an extension that is allowed by policy?

maddieford · 2024-11-21T17:51:30Z

tests_e2e/test_suites/ext_policy.yml

+name: "ExtensionPolicy"
+tests:
+  - "ext_policy/ext_policy.py"
+images: "random(endorsed)"


I think we should run this on more distros so we can get better coverage before releasing the changes

will running on all endorsed distros add too much overhead to the daily runs?

maddieford · 2024-11-21T18:38:34Z

tests_e2e/tests/ext_policy/ext_policy.py

+            fail(f"The agent should have reported an error trying to {operation} {extension_case.extension.__str__()} "
+                 f"because the extension is disallowed by policy.")
+        except Exception as error:
+            assert_that("Extension will not be processed" in str(error)) \


Could we also check for [ExtensionPolicyError] in the message to confirm the failure was due to policy

maddieford · 2024-11-21T18:58:32Z

tests/ga/test_multi_config_extension.py

@@ -630,6 +630,70 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
                }
                self._assert_extension_status(sc_handler, expected_extensions)

+    def test_it_should_handle_and_report_disallowed_extensions_properly(self):


Could you also please add a case for multi config ext allowed by policy

maddieford · 2024-11-21T18:58:56Z

tests_e2e/test_suites/ext_policy_with_dependencies.yml

+name: "ExtPolicyWithDependencies"
+tests:
+  - "ext_policy/ext_policy_with_dependencies.py"
+images: "random(endorsed)"


same comment here, we should get more coverage than 1 run per day, maybe consider running on all endorsed, or 5-10 endorsed images per day

Updated to all endorsed, but I can change to 5-10 if this adds too much overhead.

narrieta

Posting comments for Agent code.

Will post comments for test code separately.

narrieta · 2024-11-26T18:32:53Z

azurelinuxagent/ga/policy/policy_engine.py

+    """
+    Error raised during agent extension policy enforcement.
+    """
+    def __init__(self, msg, code, inner=None):


The 'code' and 'inner' parameters are not in the same order as in the base class, which can lead to subtle coding errors.

I wrote it this way because I wanted "code" to be a required parameter in ExtensionPolicyError, but not "inner". But I can set a default value for "code", to keep them in the same order as in the base class.

I ended up removing this class, based on the other comments

narrieta · 2024-11-26T19:18:06Z

azurelinuxagent/ga/exthandlers.py

+            # Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
+            # behalf of the extension.
+            policy_err_map = {
+                ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),


can we add a comment describing the elements in the tuple?

Added and moved this to the class level.

narrieta · 2024-11-26T19:19:43Z

azurelinuxagent/ga/exthandlers.py

        for extension, ext_handler in all_extensions:

            handler_i = ExtHandlerInstance(ext_handler, self.protocol, extension=extension)

+            # Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
+            # behalf of the extension.
+            policy_err_map = {


seems like this is a constant... define it at the class level?

narrieta · 2024-11-26T19:36:22Z

azurelinuxagent/ga/exthandlers.py

+            # Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
+            # behalf of the extension.
+            policy_err_map = {
+                ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),


'enable' and 'disable' are internal CRP/Agent operations; users are not aware of them. They should not be propagated to error messages displayed to the user

Updated this to "run" and "uninstall"
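The three comments on this map (document the tuple, define it once at class level, use user-facing operation names) can be sketched together as follows. The enum-like classes, their values, and the `ExtensionProcessor` class are hypothetical stand-ins, and the error code numbers are placeholders rather than the agent's real codes; only the `_POLICY_ERROR_MAP` name and the "run"/"uninstall" wording come from the thread.

```python
class ExtensionRequestedState(object):
    # Hypothetical string values, for illustration only.
    Enabled = "enabled"
    Uninstall = "uninstall"


class ExtensionErrorCodes(object):
    # Placeholder numbers, not the agent's real error codes.
    PluginEnableProcessingFailed = 1
    PluginDisableProcessingFailed = 2


class ExtensionProcessor(object):
    # Defined once at class level instead of being rebuilt on every loop iteration.
    # Each value is a tuple: (user-facing operation name, error code to report).
    _POLICY_ERROR_MAP = {
        ExtensionRequestedState.Enabled: ("run", ExtensionErrorCodes.PluginEnableProcessingFailed),
        ExtensionRequestedState.Uninstall: ("uninstall", ExtensionErrorCodes.PluginDisableProcessingFailed),
    }

    def policy_error_details(self, requested_state):
        # Indexing raises KeyError for an unmapped state, which is easier to
        # diagnose than unpacking the None returned by dict.get().
        return self._POLICY_ERROR_MAP[requested_state]


operation, error_code = ExtensionProcessor().policy_error_details(ExtensionRequestedState.Uninstall)
```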

narrieta · 2024-11-26T19:41:42Z

azurelinuxagent/ga/exthandlers.py

+            }
+            policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
+            if policy_error is not None:
+                err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)


what is the intention of creating an exception object here? seems like it is only used to pass the error code, but it is never raised

I initially implemented the ExtensionPolicyError class to have a centralized error message for extensions blocked by policy, and also to pass the code. But you make a good point - since we never actually raise the exception, I've removed the ExtensionPolicyError class and now pass the code/message directly into the reporting function.

narrieta · 2024-11-26T19:57:00Z

azurelinuxagent/ga/exthandlers.py

+                err = ExtensionPolicyError(msg, code=policy_err_code)
+                self.__handle_and_report_policy_error(handler_i, err, report_op=handler_i.operation, message=ustr(err),
+                                                      extension=extension, report=True)
+


seems like we are missing a continue statement here

I think continue statement would break the dependency logic.

It's ok to use continue in the other condition because we know all extensions will fail (dependencies don't matter)

Yes, in the case where a specific extension is disallowed by policy, we should log an error for dependencies as well (using the existing code). Adding a continue statement would skip this logic.

In the case of a policy failure, where all extensions should be blocked regardless of dependencies, we can skip this logic.

OK. To make this clearer, can you do 'if not extension_allowed' after 'if depends_on_err_msg is not None'?

Yes, I've updated
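The control flow agreed on in this thread can be modeled with a small sketch. This is a simplified model, not the agent's loop: it assumes the extensions arrive sorted by dependency level and that each one depends on everything before it, and the function and status strings are made up for illustration.

```python
def process_extensions(extensions, allowed, policy_engine_error=None):
    """Return {extension name: status string} for a dependency-ordered list of names."""
    results = {}
    depends_on_err_msg = None
    for name in extensions:
        if policy_engine_error is not None:
            # The policy could not be loaded, so every extension is blocked.
            # Dependencies do not matter here and 'continue' is safe.
            results[name] = "error: " + policy_engine_error
            continue

        if depends_on_err_msg is not None:
            # An earlier extension failed; report that on behalf of this one.
            results[name] = "error: " + depends_on_err_msg
            continue

        if name not in allowed:
            # No 'continue' here: fall through so the bookkeeping below runs
            # and dependents of this extension are reported as failed too.
            results[name] = "error: not allowed by policy"
            extension_success = False
        else:
            results[name] = "success"
            extension_success = True

        if not extension_success:
            depends_on_err_msg = "skipped because {0} failed".format(name)
    return results


single_block = process_extensions(["AzureMonitor", "CustomScript"], allowed={"CustomScript"})
all_blocked = process_extensions(["AzureMonitor", "CustomScript"], allowed={"CustomScript"},
                                 policy_engine_error="policy file is invalid")
```

In `single_block`, CustomScript is allowed but still reported as failed because the extension it depends on was blocked, which is the case a `continue` in the policy branch would have skipped.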

narrieta · 2024-11-26T20:21:05Z

azurelinuxagent/ga/exthandlers.py

+
+        # Create status file for extensions with settings (single and multi config).
+        if extension is not None:
+            ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,


we should overwrite the existing file with the policy error

narrieta · 2024-11-26T20:23:14Z

azurelinuxagent/ga/exthandlers.py

+            ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
+                                                          operation=report_op, message=message)
+
+        if report:


when would report be False?

Currently it isn't ever false, I initially wrote it this way because I was copying the exact structure of __handle_and_report_ext_handler_errors(). But I've removed it since that parameter isn't being used for now.

narrieta · 2024-11-26T23:05:56Z

azurelinuxagent/ga/exthandlers.py

@@ -990,7 +1061,10 @@ def report_ext_handler_status(self, vm_status, ext_handler, goal_state_changed):
        # extension even if HandlerState == NotInstalled (Sample scenario: ExtensionsGoalStateError, DecideVersionError, etc)
        # We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
        # waits for extension runtime status before failing the extension operation.
-        if handler_state != ExtHandlerState.NotInstalled or ext_handler.supports_multi_config or not conf.get_extensions_enabled():
+        # In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If


Let's change this

# We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
# waits for extension runtime status before failing the extension operation.
# In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
# extension status is not present, collect_ext_status() will set a default transitioning status, and CRP will
# wait for timeout.

to

# We also need to report extension status for an uninstalled handler if extensions are disabled, or if the extension
# failed due to policy, because CRP waits for extension runtime status before failing the extension operation.

The intention of the change is to enter this condition when the extension fails due to policy, but this change means that we enter the condition whenever policy is enabled.

Is there any negative effect to calling ext_handler_i.get_extension_handler_statuses... whenever policy is enabled? Why is this behind the if condition in the first place?

Discussed offline - removed this condition, because it would cause us to enter the if condition even for non-policy-related uninstall failures.

maddieford

Left comments mainly for e2e tests. I'll review unit tests once the comments in exthandlers.py are resolved

maddieford · 2024-11-26T22:53:42Z

tests_e2e/tests/ext_policy/ext_policy.py

+
+        # Only allowlisted extensions should be processed.
+        # We only allowlist CustomScript: CustomScript should be enabled, RunCommand and AzureMonitor should fail.
+        # (Note that CustomScript blocked by policy is tested in a later test case.)


Adding comments to the review so I can follow the scenarios easier. Consider adding these as comments in the code, but ultimately up to you:
This policy tests the following scenarios:
- single config ext (CSE) enable operation succeeds when allowed by policy
- no-config ext (AzureMonitor) enable operation fails fast when disallowed by policy
- single multi-config instance (RunCommandHandler) enable operation fails fast when disallowed by policy

Added, thanks!

maddieford · 2024-11-26T23:04:04Z

tests_e2e/tests/ext_policy/ext_policy.py

+            self._operation_should_succeed("delete", azure_monitor)
+
+        # Should not uninstall disallowed extensions.
+        # CustomScript is removed from the allowlist: delete operation should fail.


This policy tests the following scenarios:
- a delete operation on a previously enabled single-config ext (CSE) which is now disallowed by policy fails fast
- multiple multi-config instances (RunCommandHandler and RunCommandHandler2) enable operations fail fast when disallowed by policy
- single-config ext (CSE) enable operation fails fast when disallowed by policy

Added, thanks!

maddieford · 2024-11-27T00:02:58Z

tests_e2e/tests/ext_policy/ext_policy_with_dependencies.py

+                    log.info("CRP returned an error for deletion operation, may be a false error. Checking agent log to determine if operation succeeded. Exception: {0}".format(crp_err))
+                    try:
+                        for ssh_client in ssh_clients.values():
+                            msg = ("Remove the extension slice: {0}".format(str(ext_to_delete)))


This message is related to cgroup. Right now it's logged even when cgroup isn't enabled (which might and probably should change in the future).

Instead, we should check that the handler was successfully uninstalled. i.e. the last ext status reported by the agent shouldn't include the handler:

2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: [('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]

You might also consider doing this by checking the instance view

Updated this - we now check for the last status reported by the agent and confirm that there is no handler status reported. (Instance view doesn't work because CRP reports a stale status)
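A rough sketch of the check described above: find the last "Extension status:" line in the agent log and confirm the deleted handler is absent. The helper name and the parsing approach are assumptions. The second sample line is the one quoted in this thread; the first is invented to give the example an earlier state.

```python
import ast
import re

_STATUS_PATTERN = re.compile(r"Extension status: (\[.*\])\s*$")


def last_reported_handlers(log_lines):
    """Return handler names from the last 'Extension status:' line, or None if there is none."""
    for line in reversed(log_lines):
        match = _STATUS_PATTERN.search(line)
        if match is not None:
            # The payload is a Python-style list of (handler name, status) tuples.
            return [name for name, _ in ast.literal_eval(match.group(1))]
    return None


log_lines = [
    # Invented earlier entry, for illustration only.
    "2024-11-26T23:50:00.000000Z INFO ExtHandler ExtHandler Extension status: "
    "[('Microsoft.Azure.Extensions.CustomScript', 'Ready'), "
    "('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]",
    # Entry quoted in the review comment.
    "2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: "
    "[('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]",
]
handlers = last_reported_handlers(log_lines)
uninstalled = "Microsoft.Azure.Extensions.CustomScript" not in handlers
```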

maddieford · 2024-11-27T00:09:45Z

tests_e2e/tests/ext_policy/ext_policy_with_dependencies.py

+    _test_cases = [
+        _should_fail_single_config_depends_on_disallowed_no_config,
+        _should_fail_single_config_depends_on_disallowed_single_config,
+        # TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the


Let's specify 'RunCommandHandler' since RunCommand is a different extension (confusing, I know :)

Also is it that RunCommandHandler is unable to be "installed properly" or uninstalled?

Good catch - it's that RunCommandHandler is unable to be uninstalled. I've updated this.

maddieford · 2024-11-27T00:11:47Z

tests_e2e/tests/ext_policy/ext_policy_with_dependencies.py

+        _should_fail_single_config_depends_on_disallowed_single_config,
+        # TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the
+        # issue and enable these 3 tests.
+        # _should_fail_single_config_depends_on_disallowed_multi_config,


what about _should_fail_multi_config_depends_on_disallowed_multi_config?
RunCommandHandler1 depends on RunCommandHandler2 for example

Discussed offline - allow/disallow is at the handler level, so we wouldn't have a multi-config extension with one instance allowed and another disallowed.

maddieford · 2024-11-27T00:19:39Z

tests_e2e/tests/ext_policy/policy_dependencies_cases.py

+    return policy, template, expected_errors, deletion_order
+
+
+def _should_no_dependencies():


Looks like this may be leftover code, I don't see it referenced

Removed, thanks!

maddieford · 2024-11-27T00:19:48Z

tests_e2e/tests/scripts/agent_ext_workflow-check_data_in_agent_log.py

 from pathlib import Path
 from tests_e2e.tests.lib.agent_log import AgentLog


 def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", dest='data', required=True)
+    parser.add_argument("--after-timestamp", dest='after_timestamp', required=False)


nice :) thanks for adding this!
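One way an optional `--after-timestamp` argument could be used is sketched below. The two `add_argument` calls mirror the diff; the `data_found` helper and the assumption that each log line starts with a timestamp in the agent's `2024-11-26T23:54:04.306568Z` style are illustrative, and the real script (which reads the log through `AgentLog`) may work differently.

```python
import argparse
from datetime import datetime

_TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", dest="data", required=True)
    parser.add_argument("--after-timestamp", dest="after_timestamp", required=False)
    return parser.parse_args(argv)


def data_found(log_lines, data, after_timestamp=None):
    """True if 'data' appears in a line whose leading timestamp is later than after_timestamp."""
    threshold = None if after_timestamp is None else datetime.strptime(after_timestamp, _TIMESTAMP_FORMAT)
    for line in log_lines:
        if data not in line:
            continue
        if threshold is None:
            return True
        try:
            line_time = datetime.strptime(line.split(" ", 1)[0], _TIMESTAMP_FORMAT)
        except ValueError:
            continue  # the line does not start with a timestamp
        if line_time > threshold:
            return True
    return False


lines = ["2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: []"]
args = parse_args(["--data", "Extension status", "--after-timestamp", "2024-11-26T23:54:00.000000Z"])
found = data_found(lines, args.data, args.after_timestamp)
```

Filtering by timestamp lets a test ignore matching messages written by earlier test cases that share the same VM.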

narrieta

Comments for test code

narrieta · 2024-11-27T19:59:55Z

tests/ga/test_extension.py

+
+    def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=1,
+                             expected_status_msg=None):
+


could you add some comments explaining the setup done by this function? (e.g. why incarnation 2?) thanks

I've added comments explaining the setup here.

Thanks to @maddieford for helping me figure out why updating the incarnation is required :)

got it. I'd suggest instead adding a comment "Generate a new mock goal state to uninstall the extension" just before

protocol.mock_wire_data.set_incarnation(2)
protocol.mock_wire_data.set_extensions_config_state(ExtensionRequestedState.Uninstall)
protocol.client.update_goal_state()

narrieta · 2024-11-27T20:18:19Z

tests/ga/test_multi_config_extension.py

@@ -630,6 +630,98 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
                }
                self._assert_extension_status(sc_handler, expected_extensions)

+    def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):


what does "properly" mean? (what is the expected behavior?)

changed this to "test_it_should_report_successful_status_for_extensions_allowed_by_policy" and "test_it_should_report_failed_status_for_extensions_disallowed_by_policy"

narrieta · 2024-11-27T20:36:12Z

tests/ga/test_multi_config_extension.py

+    def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
+        """If multiconfig extension is disallowed by policy, all instances should be blocked."""
+        policy_path = os.path.join(self.tmp_dir, "waagent_policy.json")
+        patch('azurelinuxagent.common.conf.get_policy_file_path', return_value=str(policy_path)).start()


this should be done using the 'with' statement

Updated, thanks
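To illustrate the point about the 'with' statement: a patch started with `.start()` stays active until something calls `.stop()`, so it can leak into later tests, while the context-manager form is undone automatically. The sketch below uses a hypothetical `conf` class as a stand-in for `azurelinuxagent.common.conf`.

```python
import os
import tempfile
from unittest.mock import patch


class conf(object):
    """Hypothetical stand-in for azurelinuxagent.common.conf."""

    @staticmethod
    def get_policy_file_path():
        return "/etc/waagent_policy.json"


tmp_dir = tempfile.mkdtemp()
policy_path = os.path.join(tmp_dir, "waagent_policy.json")

with patch.object(conf, "get_policy_file_path", return_value=policy_path):
    # Inside the block the patched function returns the temporary path.
    path_inside = conf.get_policy_file_path()

# The patch is undone on exit from the 'with' block, even if the body raises.
path_outside = conf.get_policy_file_path()
```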

narrieta · 2024-11-27T20:40:25Z

tests_e2e/test_suites/ext_policy_with_dependencies.yml

+  - "ext_policy/ext_policy_with_dependencies.py"
+images: "endorsed"
+executes_on_scale_set: true
+# This test should run on its own VMSS, because other tests may leave behind extensions


we should handle this in the test and allow it to share the vm

Added code at the start of each test to remove leftover extensions

narrieta · 2024-11-27T22:43:28Z

tests_e2e/tests/ext_policy/ext_policy.py

+
+        # RunCommandHandler is a multi-config extension, so we set up two instances (configurations) here and test both.
+        run_command = ExtPolicy.TestCase(
+            VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.RunCommandHandler,


should use VirtualMachineRunCommand instead of VirtualMachineExtensionClient

narrieta · 2024-11-27T23:04:44Z

tests_e2e/tests/ext_policy/ext_policy.py

+        self._create_policy_file(policy)
+        self._operation_should_succeed("enable", custom_script)
+        self._operation_should_fail("enable", run_command)
+        if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):


how much coverage are we getting for this case?

I've changed ext_policy.yml to run on all endorsed distros, so this case will be run on all distros that support AzureMonitorLinuxAgent.

why do we need the check on distro if it's going to fail due to policy anyways?

I figured I would avoid testing this extension at all on an unsupported distro, but I can remove the distro check specifically for this case, if you think it would be useful.

yes, no need for that check

narrieta · 2024-11-27T23:07:10Z

tests_e2e/tests/ext_policy/ext_policy.py

+        if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
+            self._operation_should_fail("enable", azure_monitor)
+
+        # When allowlist is turned off, all extensions should be processed.


How about CustomScript?

The final test case tests this - line 256

narrieta · 2024-11-27T23:20:44Z

tests_e2e/tests/ext_policy/ext_policy.py

+            self._operation_should_succeed("enable", azure_monitor)
+            self._operation_should_succeed("delete", azure_monitor)
+
+        # Should not uninstall disallowed extensions.


seems like

# Only allowlisted extensions should be processed.

and

# Should not uninstall disallowed extensions.

or are we trying to test something different?

I've updated the comments to make it more clear what is being tested with each policy

narrieta · 2024-11-27T23:22:30Z

tests_e2e/tests/ext_policy/ext_policy.py

+                }
+            }
+        self._create_policy_file(policy)
+        # # Known CRP issue - delete/uninstall operation times out instead of reporting an error.


This is not a CRP issue, but rather a design issue. Uninstall is best effort and never fails. You should consider checking the agent log for the error and then the instance view to confirm the extension was not uninstalled

I've enabled the failed deletion case and updated the test to validate the agent log and instance view.

mgunnala · 2024-12-10T18:45:01Z

azurelinuxagent/ga/exthandlers.py

+            if not extension_allowed:
+                msg = (
+                    "Extension will not be processed: failed to {0} extension '{1}' because it is not specified "
+                    "in the allowlist. To {0}, add the extension to the allowed list in the policy file ('{2}')."


Here, we use both the terms "allowlist" and "allowed list". Does this make sense?
Maybe something like "add <ext_name> to the list of allowed extensions in the policy file"?

"list of allowed extensions", though verbose, looks good to me. maybe also use something similar for "allowList"? there is no "allowList" element in the policy file

@narrieta how about something like this:
"Extension will not be processed: failed to run extension 'CustomScript' because it is not specified as an allowed extension. To run, add the extension to the list of allowed extensions in the policy file ('/etc/waagent_policy.json')."

Updated the message as stated above
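For reference, the agreed wording can be produced by a small formatting helper like the one below. The function name and parameters are hypothetical; the template text and the example values are taken from the message proposed in this thread.

```python
def policy_error_message(operation, extension_name, policy_file_path):
    """Build the user-facing message for an extension that is not allowed by policy."""
    return (
        "Extension will not be processed: failed to {0} extension '{1}' because it is not specified "
        "as an allowed extension. To {0}, add the extension to the list of allowed extensions in the "
        "policy file ('{2}')."
    ).format(operation, extension_name, policy_file_path)


message = policy_error_message("run", "CustomScript", "/etc/waagent_policy.json")
```

Keeping the fixed prefix "Extension will not be processed" unchanged matters because the e2e tests in this PR match on it.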

narrieta

Some comments for the Agent code; will post a separate review for the test code.

narrieta · 2024-12-12T20:37:08Z

azurelinuxagent/ga/exthandlers.py

+            # If an error was thrown during policy engine initialization, skip further processing of the extension.
+            # CRP is still waiting for status, so we report error status here.
+            # of the extension.
+            policy_op, policy_err_code = _POLICY_ERROR_MAP.get(ext_handler.state)


suggest removing 'policy' from policy_err_code. the error code is not related to policy

'policy_op' is also kind of misleading, since it is not a policy operation... maybe just 'operation' and 'error_code'?

Updated to "operation" and "error_code"

narrieta · 2024-12-12T20:42:30Z

azurelinuxagent/ga/exthandlers.py

+        # to write a status file for single-config extensions.
+
+        # Set handler status for all extensions (with and without settings)
+        ext_handler_i.set_handler_status(message=message, code=error_code)


add comment pointing out that we are intentionally reporting the error at the handler and status level

Do you mean something like "We report the same error at both the handler status and extension status level." ?

sorry, what I was trying to point out is that reporting the error at both the handler and status level is not needed (or should not be needed). e.g. install errors are reported at the handler level, while single-config errors are reported at the status level.

narrieta · 2024-12-12T20:44:27Z

azurelinuxagent/ga/exthandlers.py


        if report:
            name = ext_handler_i.get_extension_full_name(extension)
            handler_version = ext_handler_i.ext_handler.version
            add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True,
                      message=message)

+    @staticmethod
+    def __report_policy_error(ext_handler_i, error_code, report_op, message, extension=None):


let's remove the default value for the 'extension' parameter

narrieta · 2024-12-12T21:48:20Z

azurelinuxagent/ga/exthandlers.py

+            if not extension_allowed:
+                msg = (
+                    "Extension will not be processed: failed to {0} extension '{1}' because it is not specified "
+                    "in the allowlist. To {0}, add the extension to the allowed list in the policy file ('{2}')."


"list of allowed extensions", though verbose, looks good to me. maybe also use something similar for "allowList"? there is no "allowList" element in the policy file

narrieta · 2024-12-12T21:49:24Z

azurelinuxagent/ga/exthandlers.py

            # Process extensions and get if it was successfully executed or not
-            extension_success = self.handle_ext_handler(handler_i, extension, goal_state_id)
+            # If extension was blocked by policy, treat the extension as failed and do not process the handler.
+            if not extension_allowed:


merge this 'if not extension_allowed:' with the one just above it?

Good point, made this change, thanks!

narrieta

minor comments on the unit tests, will post end-to-end on separate review

thanks for these tests!

narrieta · 2024-12-13T16:36:53Z

tests/ga/test_extension.py

+
+    def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=1,
+                             expected_status_msg=None):
+


got it. I'd suggest instead adding a comment "Generate a new mock goal state to uninstall the extension" just before

protocol.mock_wire_data.set_incarnation(2)
protocol.mock_wire_data.set_extensions_config_state(ExtensionRequestedState.Uninstall)
protocol.client.update_goal_state()

narrieta · 2024-12-13T16:55:56Z

tests/ga/test_extension.py

+                policy_file.write(policy)
+            policy_file.flush()
+
+    def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=0,


i'd suggest removing the default values and have the tests be very explicit about what their expectations are

Removed the default and updated the tests accordingly

azurelinuxagent/ga/exthandlers.py

maddieford · 2024-12-13T22:19:11Z

tests_e2e/tests/ext_policy/ext_policy.py

+            # 2024-10-24T17:34:20.808235Z ERROR ExtHandler ExtHandler Event: name=Microsoft.Azure.Monitor.AzureMonitorLinuxAgent, op=None, message=[ExtensionPolicyError] Extension will not be processed: failed to enable extension 'Microsoft.Azure.Monitor.AzureMonitorLinuxAgent' because extension is not specified in allowlist. To enable, add extension to the allowed list in the policy file ('/etc/waagent_policy.json')., duration=0
+            # We intentionally block extensions with policy and expect this failure message
+            {
+                'message': r"Extension will not be processed"


Could you update the comment on line 273 with the new error message? No need to add the prefix directly, but want to make sure we only ignore policy related failures
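
A small sketch of how the ignore rule above behaves, assuming the rule's `message` regex is applied with `re.search` to the message field of each log record (how the test framework actually applies it is an assumption here). The unrelated error message is made up for illustration.

```python
import re

# Ignore rule from the diff above.
IGNORE_RULE = {'message': r"Extension will not be processed"}


def is_ignored(record_message, rule=IGNORE_RULE):
    # Assumption: a record is ignored when the rule's regex matches anywhere in the message.
    return re.search(rule['message'], record_message) is not None


policy_failure = (
    "[ExtensionPolicyError] Extension will not be processed: failed to enable extension "
    "'Microsoft.Azure.Monitor.AzureMonitorLinuxAgent' because extension is not specified in allowlist."
)
# Hypothetical, made-up message standing in for a non-policy failure.
unrelated_failure = "Failed to download extension package"

print(is_ignored(policy_failure))     # True
print(is_ignored(unrelated_failure))  # False
```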

maddieford · 2024-12-13T22:21:50Z

tests_e2e/tests/ext_policy/ext_policy.py

+            try:
+                timeout = (16 * 60) # We want to wait for CRP timeout, which is 15 minutes.
+                extension_case.extension.delete(timeout)
+                fail(f"The agent should have reported a timeout error when attempting to delete {extension_case.extension} "


nit: CRP reports timeout because agent reports a failure due to policy

Updated the message here

maddieford · 2024-12-13T22:26:53Z

tests_e2e/tests/ext_policy/ext_policy.py

+                extension_case.extension.delete(timeout)
+                fail(f"The agent should have reported a timeout error when attempting to delete {extension_case.extension} "
+                     f"because the extension is disallowed by policy.")
+            except TimeoutError:


I think this is catching the Timeout from the 'tests_e2e.tests.lib.azure_sdk_client.AzureSdkClient._execute_async_operation' function. You're giving it a 16 minute timeout, so it will just give up and stop waiting on the operation after that.

The delete operation will fail on its own (after ~15 min) because CRP will time out.

Pointing this out since you had this comment: "# We want to wait for CRP timeout, which is 15 minutes."

I set extensionsTimeBudget (the CRP timeout) to exactly 15 minutes, and I set the timeout here to 16 minutes, just longer than that, so that we wait for the full CRP timeout period. Are you saying we shouldn't have a timeout for _execute_async_operation() at all?
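
The distinction in this thread can be sketched with a simulation, assuming the client-side wait and the service-side time budget are independent. Everything below is a stand-in: `simulated_delete` plays the role of CRP, the durations are scaled down to fractions of a second, and none of the names come from the test framework.

```python
import concurrent.futures
import time


class OperationFailed(Exception):
    """Raised by the simulated service when the operation itself fails."""


def simulated_delete(service_timeout):
    # Stand-in for the service: it gives up after its own time budget
    # and fails the operation.
    time.sleep(service_timeout)
    raise OperationFailed("service timed out waiting for the extension")


def wait_for_operation(service_timeout, client_timeout):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(simulated_delete, service_timeout)
        try:
            future.result(timeout=client_timeout)
        except concurrent.futures.TimeoutError:
            # The client stopped waiting; the operation may still be running.
            return "client gave up waiting"
        except OperationFailed:
            # The operation finished (with a failure) before the client gave up.
            return "operation failed on its own"
    return "operation succeeded"


# Client waits longer than the service budget: the operation's own failure surfaces.
print(wait_for_operation(service_timeout=0.05, client_timeout=0.5))
# Client waits less than the service budget: only the client-side timeout is seen.
print(wait_for_operation(service_timeout=0.3, client_timeout=0.05))
```

In the first case the exception to catch is the operation's failure rather than the client-side timeout, which is the point raised above.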

narrieta

comments for end-to-end tests

narrieta · 2024-12-16T17:00:54Z

tests_e2e/tests/ext_policy/ext_policy.py

+        # CustomScript is a single-config extension.
+        custom_script = ExtPolicy.TestCase(
+            VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.CustomScript,
+                                          resource_name="CustomScriptPolicy"),


All extension call should use the same resource name (in this case "CustomScript"), otherwise a subsequent call with a different name will fail

All extension calls do use the same resource name

narrieta · 2024-12-16T17:05:29Z

tests_e2e/tests/ext_policy/ext_policy.py

+        # Another e2e test may have left behind an extension we want to test here. Cleanup any leftovers so that they
+        # do not affect the test results.
+        log.info("Cleaning up existing extensions on the test VM [%s]", self._context.vm.name)
+        self._context.vm.delete_all_extensions()


instead of removing all extensions, should we remove only the extensions used by this test, then have the test adapt to that?

Is there an issue with removing all extensions?

Deleting individual extensions was causing issues, I think because other e2e tests may leave behind extensions with different resource names, and deletion is done by resource name. I thought it was safer to just delete everything. If you think we should only delete the extensions used by this test, I could implement that in the next PR.

other e2e tests should be using the same resource names; if that is not the case, we should fix them

having more extensions in the mix for sure makes the test a little more difficult to write, but provides scenarios a tiny bit closer to what we will see in prod

in these test runs, some extensions will be added by other tests, and some by policy

if handling existing extensions is way too complex, i'm OK deleting all of them, but my first impression is that it should not be the case

OK to do on next PR

narrieta · 2024-12-16T17:24:29Z

tests_e2e/tests/ext_policy/ext_policy.py

+        try:
+            if operation == "enable":
+                # VirtualMachineRunCommandClient (and VirtualMachineRunCommand) does not take force_update_tag as a parameter.
+                if type(extension_case.extension) == VirtualMachineRunCommandClient:


probably isinstance instead of == in case we ever subclass it

Good point, updated
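
A short illustration of the difference, using hypothetical stand-in classes rather than the real client types:

```python
class RunCommandClientStandIn(object):
    """Hypothetical stand-in for the run-command client class."""


class SpecializedRunCommandClient(RunCommandClientStandIn):
    """Hypothetical subclass, added only to show the difference."""


client = SpecializedRunCommandClient()

# Exact type comparison misses subclasses.
print(type(client) == RunCommandClientStandIn)      # False
# isinstance accepts the class and anything derived from it.
print(isinstance(client, RunCommandClientStandIn))  # True
```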

narrieta · 2024-12-16T17:24:40Z

tests_e2e/tests/ext_policy/ext_policy.py

+            if operation == "enable":
+                # VirtualMachineRunCommandClient (and VirtualMachineRunCommand) does not take force_update_tag as a parameter.
+                if type(extension_case.extension) == VirtualMachineRunCommandClient:
+                    extension_case.extension.enable(settings=extension_case.settings, timeout=15*60)


why the timeout?

I've updated to just use the default timeout

tests_e2e/tests/ext_policy/ext_policy.py

narrieta · 2024-12-16T17:35:53Z

tests_e2e/tests/ext_policy/ext_policy.py

+            # instance view and that the expected error is in the agent log to confirm that deletion failed.
+            delete_start_time = self._ssh_client.run_command("date '+%Y-%m-%d %T'").rstrip()
+            try:
+                timeout = (16 * 60) # We want to wait for CRP timeout, which is 15 minutes.


should check the agent's log asynchronously and fail immediately rather than waiting for the full 15 min

(can be on next PR)

Marked as TODO

I initially implemented it this way, but ran into issues - subsequent operations would fail, because CRP was still waiting on the delete operation. I wasn't sure how to force CRP to quit, I will take a look at it in the next PR.
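
One possible shape for the asynchronous check, as a sketch only: poll the log for the expected error and return as soon as it appears. The helper name, its parameters, and the `read_log` callable are all hypothetical. It does not address the problem described above, where CRP keeps waiting on the delete operation.

```python
import re
import time


def wait_for_log_message(read_log, pattern, timeout, interval=1.0):
    """Poll read_log() until pattern is found or timeout (seconds) elapses.

    read_log is any callable returning the current log text; in a real test
    it might read the agent log over SSH (an assumption, not shown here).
    """
    deadline = time.monotonic() + timeout
    while True:
        if re.search(pattern, read_log()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated log that only contains the error from the third read onwards.
reads = {"count": 0}


def fake_read_log():
    reads["count"] += 1
    return "Extension will not be processed" if reads["count"] >= 3 else ""


found = wait_for_log_message(fake_read_log, r"Extension will not be processed", timeout=1.0, interval=0.01)
print(found)
```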

narrieta · 2024-12-16T17:41:58Z

tests_e2e/tests/ext_policy/ext_policy.py

+        self._create_policy_file(policy)
+        self._operation_should_succeed("enable", custom_script)
+        self._operation_should_fail("enable", run_command)
+        if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):


why do we need the check on distro if it's going to fail due to policy anyways?

tests_e2e/tests/scripts/agent_ext_policy-verify_uninstall_success.py

narrieta · 2024-12-16T18:33:02Z

tests_e2e/tests/ext_policy/ext_policy_with_dependencies.py

+        _should_fail_single_config_depends_on_disallowed_single_config, \
+        _should_succeed_single_config_depends_on_no_config, \
+        _should_succeed_single_config_depends_on_single_config
+        # _should_fail_single_config_depends_on_disallowed_multi_config,


minor: looks like a leftover - let's remove it or add a todo comment

Added a TODO. These tests are currently disabled, which is why the imports are commented out.

narrieta · 2024-12-17T16:43:03Z

azurelinuxagent/ga/exthandlers.py

-                                                          code=-1,
-                                                          operation=handler_i.operation,
-                                                          message=msg)
+                handler_i.create_status_file(extension,


minor: the previous code was very explicit (in the method's name) about not overwriting existing files. I think that is a good choice. We should probably be explicit in the new method as well, remove the default value and set overwrite=False here

Updated to remove the default and explicitly set the value for overwrite in all calls
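
A minimal sketch of the idea, assuming a free function rather than the handler method: `overwrite` has no default, so every call site has to state its intent. The name `write_status_file` and its signature are hypothetical.

```python
import os
import tempfile


def write_status_file(path, content, overwrite):
    """Write content to path; returns True if the file was written.

    'overwrite' is required so callers must say explicitly whether an
    existing status file may be replaced.
    """
    if os.path.exists(path) and not overwrite:
        return False
    with open(path, "w") as status_file:
        status_file.write(content)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    status_path = os.path.join(tmp_dir, "0.status")
    first = write_status_file(status_path, "original", overwrite=False)
    second = write_status_file(status_path, "replacement", overwrite=False)
    with open(status_path) as status_file:
        final_content = status_file.read()

print(first, second, final_content)
```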

narrieta · 2024-12-17T16:45:48Z

azurelinuxagent/ga/policy/policy_engine.py

@@ -50,7 +44,6 @@ def __init__(self, msg, inner=None):
        msg = "Customer-provided policy file ('{0}') is invalid, please correct the following error: {1}".format(conf.get_policy_file_path(), msg)
        super(InvalidPolicyError, self).__init__(msg, inner)

-


Can we add an INFO message just after the check for enabled stating that we are using Policy? This makes clearer the fact that we are now processing policies.

__read_policy() is called right after the check for enabled, and it logs the following statement:

Policy enforcement is enabled. Enforcing policy using policy file found at '<path>'. File contents: <policy>

Is that sufficient, or do you think we need an additional log message?

It's sufficient, but the message should probably be in the caller instead of read_policy. Who knows, as code evolves we may add other code before read_policy, or call read_policy multiple times.

Alternatively, the caller can log "Policy enforcement is enabled." and read_policy can log "Enforcing policy using policy file found at '<path>'. File contents: <policy>"
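
The split suggested here could look roughly like this. It is a sketch using the standard `logging` module and hypothetical function names; the agent's own logger and the real `__read_policy()` signature are not shown in this thread.

```python
import logging

logger = logging.getLogger("policy_sketch")


def read_policy(path, contents):
    # Logs only what it knows about: the file it is reading.
    logger.info("Enforcing policy using policy file found at '%s'. File contents: %s", path, contents)
    return contents


def initialize_policy_engine(enabled, path, contents):
    if not enabled:
        return None
    # The caller owns the "enabled" message, so it is logged exactly once
    # even if read_policy() is later called several times or moved.
    logger.info("Policy enforcement is enabled.")
    return read_policy(path, contents)
```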

mgunnala · 2024-12-17T17:28:17Z

tests_e2e/tests/ext_policy/ext_policy.py

+
+    def _operation_should_fail(self, operation, extension_case):
+        log.info("")
+        log.info(f"Attempting to {operation} {extension_case.extension}, should fail fast.")


this should say "should reach timeout" for delete. split the log messages for enable and delete.
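
A sketch of the split messages, assuming only "enable" and "delete" need distinct wording. The helper name is hypothetical.

```python
def expected_failure_log(operation, extension):
    # Delete is expected to wait out the full timeout; other operations fail fast.
    if operation == "delete":
        return "Attempting to delete {0}, should reach timeout.".format(extension)
    return "Attempting to {0} {1}, should fail fast.".format(operation, extension)


print(expected_failure_log("enable", "CustomScript"))
print(expected_failure_log("delete", "CustomScript"))
```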

mgunnala · 2024-12-17T17:28:59Z

tests_e2e/tests/ext_policy/ext_policy.py

+        log.info("Enabling policy via conf file on the test VM [%s]", self._context.vm.name)
+        self._ssh_client.run_command("update-waagent-conf Debug.EnableExtensionPolicy=y", use_sudo=True)
+
+        # This policy tests the following scenarios:


Add log messages for these comments in test logs

mgunnala added 3 commits November 8, 2024 16:22

Block disallowed extension processing

c2cc2c6

Enable policy e2e tests

151081d

Pylint

edec2af

mgunnala requested review from narrieta, ZhidongPeng, nagworld9 and maddieford as code owners November 11, 2024 20:43

mgunnala commented Nov 12, 2024

azurelinuxagent/ga/policy/policy_engine.py

Fix e2e test failures

a37508f

mgunnala force-pushed the allowlist_2 branch from b440696 to a37508f Compare November 12, 2024 18:52

maddieford reviewed Nov 14, 2024

mgunnala and others added 2 commits November 18, 2024 12:19

Address review comments

b0da554

Merge branch 'develop' into allowlist_2

a4f5cab

maddieford reviewed Nov 20, 2024

Address review comments

699b9ba

mgunnala commented Nov 20, 2024

maddieford reviewed Nov 21, 2024

Address test review comments

86de0c5

mgunnala force-pushed the allowlist_2 branch from e909568 to 86de0c5 Compare November 21, 2024 21:41

mgunnala added 5 commits November 22, 2024 13:08

Remove status file for single-config

c3e9b89

Add back status file for single-config

65d7034

Run e2e tests on all endorsed

95f247a

Fix UT failures

3b18519

Pylint

63da127

narrieta reviewed Nov 26, 2024

Merge branch 'develop' into allowlist_2

471cd59

maddieford reviewed Nov 27, 2024

narrieta reviewed Nov 27, 2024

Address review comments for agent code

8ea989b

mgunnala added 2 commits December 6, 2024 14:41

Address test comments

ba3869c

Address test comments

dfcc158

mgunnala force-pushed the allowlist_2 branch from c9f1c2b to dfcc158 Compare December 9, 2024 20:59

Merge branch 'develop' into allowlist_2

fe07ffa

mgunnala force-pushed the allowlist_2 branch from 822ced7 to 248f662 Compare December 10, 2024 17:30

Address test comments

a31bdcf

mgunnala force-pushed the allowlist_2 branch from 248f662 to a31bdcf Compare December 10, 2024 18:01

mgunnala commented Dec 10, 2024

azurelinuxagent/ga/exthandlers.py

mgunnala commented Dec 10, 2024

azurelinuxagent/ga/exthandlers.py

mgunnala commented Dec 10, 2024

azurelinuxagent/ga/exthandlers.py

mgunnala commented Dec 10, 2024

azurelinuxagent/ga/policy/policy_engine.py

Cleanup existing extensions on test VMs

5198cf8

narrieta reviewed Dec 12, 2024

narrieta reviewed Dec 13, 2024

maddieford reviewed Dec 16, 2024

narrieta reviewed Dec 16, 2024

mgunnala and others added 3 commits December 16, 2024 17:20

Address comments and disable dependencies e2e tests

4a0a4ef

Merge branch 'develop' into allowlist_2

daa8017

Add fixes for e2e tests

bacc425

mgunnala force-pushed the allowlist_2 branch from df28b47 to bacc425 Compare December 17, 2024 06:39

mgunnala added 2 commits December 17, 2024 09:10

Add back delete failure test case

3319916

Address comments round 3

8c31798

narrieta reviewed Dec 17, 2024

mgunnala commented Dec 17, 2024

mgunnala and others added 3 commits December 17, 2024 13:45

Address comments

32ef5c1

Merge branch 'develop' into allowlist_2

f0895b7

Pylint

0c9f1c7

