Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block extensions disallowed by policy #3259

Open
wants to merge 30 commits into
base: develop
Choose a base branch
from

Conversation

mgunnala
Copy link

Description

Issue #


PR #2 for the policy engine allowlist feature:

  • invoke policy engine from exthandlers.py
  • block all extensions and report status if any errors are thrown during engine initialization
  • block any extensions that are disallowed by policy
  • add unit and e2e tests

PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

Copy link

codecov bot commented Nov 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.77%. Comparing base (3aebcdd) to head (a37508f).
Report is 328 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3259      +/-   ##
===========================================
+ Coverage    71.97%   72.77%   +0.79%     
===========================================
  Files          103      114      +11     
  Lines        15692    17081    +1389     
  Branches      2486     2277     -209     
===========================================
+ Hits         11295    12431    +1136     
- Misses        3881     4107     +226     
- Partials       516      543      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial review not including tests

azurelinuxagent/ga/policy/policy_engine.py Outdated Show resolved Hide resolved
"""
# TODO: when CRP adds terminal error code for policy-related extension failures, set that as the default code.
def __init__(self, msg, inner=None, code=-1):
msg = "Extension is disallowed by agent policy and will not be processed: {0}".format(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case where agent failed to parse policy, I'm not sure we should say 'Extension is disallowed by policy'. In this case, extension is disallowed because there's some issue reading or parsing the policy.

I also am hesitant about 'agent policy' since policy is provided by customer

Copy link
Author

@mgunnala mgunnala Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change this to "Extension will not be processed: "

Parsing errors (InvalidPolicyError) would look like "Extension will not be processed: customer-provided policy file (path) is invalid, please correct the following error..."

Extension disallowed errors (ExtensionPolicyError) would look like "Extension will not be processed: failed to enable extension CustomScript because extension is not specified in policy allowlist. To enable, add extension to the allowed list in policy file (path)."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the error message as discussed above.

policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
if policy_error is not None:
err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
self.__handle_and_report_ext_handler_errors(handler_i, err,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this create .status files for single config extensions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.

ext_handler.name,
conf.get_policy_file_path())
err = ExtensionPolicyError(msg, code=policy_err_code)
self.__handle_and_report_ext_handler_errors(handler_i, err,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here about .status file for single config extensions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.

Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. I'm going to spend tomorrow going through each e2e scenario and unit test, sorry for the slow review :/

ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
# TODO: CRP does not currently have a terminal error code for uninstall. Once CRP adds
# an error code for uninstall or for policy, use this code instead of PluginDisableProcessingFailed
# Note that currently, CRP waits for 90 minutes to time out for a failed uninstall operation, instead of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some more detail to this comment?
Something like:

Note that currently, CRP will poll until the agent does not report a status for an extension that should be uninstalled. In the case of a policy error, the agent will report a failed status on behalf of the extension, which will cause CRP to poll for the full timeout period, instead of failing fast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -692,6 +734,26 @@ def __handle_and_report_ext_handler_errors(ext_handler_i, error, report_op, mess
add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True,
message=message)

@staticmethod
def __handle_and_report_policy_error(ext_handler_i, error, report_op, message, report=True, extension=None):
# TODO: Consider merging this function with __handle_and_report_ext_handler_errors() above.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please leave some comment explaining why we broke this into a separate function? For policy related failures, we want to fail extensions fast. CRP will continue to poll for single-config ext status until timeout, so agent should write a status for single-config extensions. The other function does not create that status and we didn't want to touch the other function without investigating the impact of that change further

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added


# Create status file for extensions with settings (single and multi config).
if extension is not None:
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_status_file_if_not_exist() will not overwrite existing status file (for the current sequence number). Is this behavior acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should overwrite the existing file with the policy error

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now overwrite the existing file with policy error. I've added an "overwrite" parameter and changed the function name to create_status_file( ).

ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
# Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not
# report status for that extension, or until timeout is reached. In the case of a policy error, the
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,
# agent reports failed status on behalf of the extension, which will cause CRP to poll for the full timeout,

nit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

@@ -3507,5 +3510,144 @@ def test_report_msg_if_handler_manifest_contains_invalid_values(self):
self.assertIn("'supportsMultipleExtensions' has a non-boolean value", kw_messages[2]['message'])


class TestExtensionPolicy(TestExtensionBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a test case for extension is allowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

name: "ExtensionPolicy"
tests:
- "ext_policy/ext_policy.py"
images: "random(endorsed)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should run this on more distros so we can get better coverage before releasing the changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will running on all endorsed distros add too much overhead to the daily runs?

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
fail(f"The agent should have reported an error trying to {operation} {extension_case.extension.__str__()} "
f"because the extension is disallowed by policy.")
except Exception as error:
assert_that("Extension will not be processed" in str(error)) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also check for [ExtensionPolicyError] in the message to confirm the failure was due to policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

azurelinuxagent/ga/exthandlers.py Outdated Show resolved Hide resolved
@@ -630,6 +630,70 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
}
self._assert_extension_status(sc_handler, expected_extensions)

def test_it_should_handle_and_report_disallowed_extensions_properly(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add a case for multi config ext allowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

name: "ExtPolicyWithDependencies"
tests:
- "ext_policy/ext_policy_with_dependencies.py"
images: "random(endorsed)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here, we should get more coverage than 1 run per day, maybe consider running on all endorsed, or 5-10 endorsed images per day

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to all endorsed, but I can change to 5-10 if this adds too much overhead.

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting comments for Agent code.

Will post comments for test code separately.

"""
Error raised during agent extension policy enforcement.
"""
def __init__(self, msg, code, inner=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'code' and 'inner' parameters are not in the same order as in the base class, which can lead to subtle coding errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote it this way because I wanted "code" to be a required parameter in ExtensionPolicyEngine, but not "inner". But I can set a default value for "code", to keep them in the same order as in the base class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing this class, based on the other comments

# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment describing the elements in the tuple?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added and moved this to the class level.

for extension, ext_handler in all_extensions:

handler_i = ExtHandlerInstance(ext_handler, self.protocol, extension=extension)

# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this is a constant... define it at the class level?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'enable' and 'disable' are internal CRP/Agent operations; users are not aware of them. They should not be propagated to error messages displayed to the user

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to "run" and "uninstall"

}
policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
if policy_error is not None:
err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the intention of creating an exception object here? seems like it is only used to pass the error code, but it is never raised

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially implemented the ExtensionPolicyError class to have a centralized error message for extensions blocked by policy, and also to pass the code. But you make a good point - since we never actually raise the exception, I've removed the ExtensionPolicyError class and now pass the code/message directly into the reporting function.

err = ExtensionPolicyError(msg, code=policy_err_code)
self.__handle_and_report_policy_error(handler_i, err, report_op=handler_i.operation, message=ustr(err),
extension=extension, report=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we are missing a continue statement here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think continue statement would break the dependency logic.

It's ok to use continue in the other condition because we know all extensions will fail (dependencies don't matter)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the case where a specific extension is disallowed by policy, we should log an error for dependencies as well (using the existing code). Adding a continue statement would skip this logic.

In the case of a policy failure, where all extensions should be blocked regardless of dependencies, we can skip this logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. To make this clearer, can you do 'if not extension_allowed' after 'if depends_on_err_msg is not None'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've updated

azurelinuxagent/ga/exthandlers.py Outdated Show resolved Hide resolved

# Create status file for extensions with settings (single and multi config).
if extension is not None:
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should overwrite the existing file with the policy error

ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
operation=report_op, message=message)

if report:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would report be False?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it isn't ever false, I initially wrote it this way because I was copying the exact structure of __handle_and_report_ext_handler_errors(). But I've removed it since that parameter isn't being used for now.

@@ -990,7 +1061,10 @@ def report_ext_handler_status(self, vm_status, ext_handler, goal_state_changed):
# extension even if HandlerState == NotInstalled (Sample scenario: ExtensionsGoalStateError, DecideVersionError, etc)
# We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
# waits for extension runtime status before failing the extension operation.
if handler_state != ExtHandlerState.NotInstalled or ext_handler.supports_multi_config or not conf.get_extensions_enabled():
# In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this

        # We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
        # waits for extension runtime status before failing the extension operation.
        # In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
        # extension status is not present, collect_ext_status() will set a default transitioning status, and CRP will
        # wait for timeout.

to

        # We also need to report extension status for an uninstalled handler if extensions are disabled, or if the extension
        # failed due to policy, because CRP waits for extension runtime status before failing the extension operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of the change is to enter this condition when the extension fails due to policy, but this change means that we enter the condition whenever policy is enabled.

Is there any negative effect to calling ext_handler_i.get_extension_handler_statuses... whenever policy is enabled? Why is this behind the if condition in the first place?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline - removed this condition, because it would cause us to enter the if condition even for non-policy-related uninstall failures.

Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments mainly for e2e tests. I'll review unit tests once the comments in exthandlers.py are resolved


# Only allowlisted extensions should be processed.
# We only allowlist CustomScript: CustomScript should be enabled, RunCommand and AzureMonitor should fail.
# (Note that CustomScript blocked by policy is tested in a later test case.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding comments to the review so I can follow the scenarios easier. Consider adding these as comments in the code, but ultimately up to you:
This policy tests the following scenarios:
- single config ext (CSE) enable operation succeeds when allowed by policy
- no-config ext (AzureMonitor) enable operation fails fast when disallowed by policy
- single multi-config instance (RunCommandHandler) enable operation fails fast when disallowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
self._operation_should_succeed("delete", azure_monitor)

# Should not uninstall disallowed extensions.
# CustomScript is removed from the allowlist: delete operation should fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This policy tests the following scenarios:
- a delete operation on a previously enabled single-config ext (CSE) which is now disallowed by policy fails fast
- multiple multi-config instances (RunCommandHandler and RunCommandHandler2) enable operations fail fast when disallowed by policy
- single-config ext (CSE) enable operation fails fast when disallowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
log.info("CRP returned an error for deletion operation, may be a false error. Checking agent log to determine if operation succeeded. Exception: {0}".format(crp_err))
try:
for ssh_client in ssh_clients.values():
msg = ("Remove the extension slice: {0}".format(str(ext_to_delete)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message is related to cgroup. Right now it's logged even when cgroup isn't enabled (which might and probably should change in the future).

Instead, we should check that the handler was successfully uninstalled. i.e. the last ext status reported by the agent shouldn't include the handler:

2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: [('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]

You might also consider doing this by checking the instance view

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this - we now check for the last status reported by the agent and confirm that there is no handler status reported. (Instance view doesn't work because CRP reports a stale status)

_test_cases = [
_should_fail_single_config_depends_on_disallowed_no_config,
_should_fail_single_config_depends_on_disallowed_single_config,
# TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's specify 'RunCommandHandler' since RunCommand is a different extension (confusing, I know :)

Also is it that RunCommandHandler is unable to be "installed properly" or uninstalled?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - it's that RunCommandHandler is unable to be uninstalled. I've updated this.

_should_fail_single_config_depends_on_disallowed_single_config,
# TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the
# issue and enable these 3 tests.
# _should_fail_single_config_depends_on_disallowed_multi_config,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about _should_fail_multi_config_depends_on_disallowed_multi_config?
RunCommandHandler1 depends on RunCommandHandler2 for example

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline - allow/disallow is at the handler level, so we wouldn't have a multi-config extension with one instance allowed and another disallowed.

return policy, template, expected_errors, deletion_order


def _should_no_dependencies():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this may be leftover code, I don't see it referenced

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks!

from pathlib import Path
from tests_e2e.tests.lib.agent_log import AgentLog


def main():
parser = argparse.ArgumentParser()
parser.add_argument("--data", dest='data', required=True)
parser.add_argument("--after-timestamp", dest='after_timestamp', required=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice :) thanks for adding this!

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments for test code


def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=1,
expected_status_msg=None):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add some comments explaining the setup done by this function? (e.g. why incarnation 2?) thanks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added comments explaining the setup here.

Thanks to @maddieford for helping me figure out why updating the incarnation is required :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. I'd suggest instead adding a comment "Generate a new mock goal state to uninstall the extension" just before

            protocol.mock_wire_data.set_incarnation(2)
            protocol.mock_wire_data.set_extensions_config_state(ExtensionRequestedState.Uninstall)
            protocol.client.update_goal_state()

@@ -630,6 +630,98 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
}
self._assert_extension_status(sc_handler, expected_extensions)

def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "properly" mean? (what is the expected behavior?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to "test_it_should_report_successful_status_for_extensions_allowed_by_policy" and "test_it_should_report_failed_status_for_extensions_disallowed_by_policy"

def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
"""If multiconfig extension is disallowed by policy, all instances should be blocked."""
policy_path = os.path.join(self.tmp_dir, "waagent_policy.json")
patch('azurelinuxagent.common.conf.get_policy_file_path', return_value=str(policy_path)).start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done using the 'with' statement

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks

- "ext_policy/ext_policy_with_dependencies.py"
images: "endorsed"
executes_on_scale_set: true
# This test should run on its own VMSS, because other tests may leave behind extensions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should handle this in the test and allow it to share the vm

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added code at the start of each test to remove leftover extensions

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved

# RunCommandHandler is a multi-config extension, so we set up two instances (configurations) here and test both.
run_command = ExtPolicy.TestCase(
VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.RunCommandHandler,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use VirtualMachineRunCommand instead of VirtualMachineExtensionClient

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

self._create_policy_file(policy)
self._operation_should_succeed("enable", custom_script)
self._operation_should_fail("enable", run_command)
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much coverage are we getting for this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed ext_policy.yml to run on all endorsed distros, so this case will be run on all distros that support AzureMonitorLinuxAgent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the check on distro if it's going to fail due to policy anyways?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured I would avoid testing this extension at all on an unsupported distro, but I can remove the distro check specifically for this case, if you think it would be useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, no need for that check

if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
self._operation_should_fail("enable", azure_monitor)

# When allowlist is turned off, all extensions should be processed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about CustomScript?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final test case tests this - line 256

self._operation_should_succeed("enable", azure_monitor)
self._operation_should_succeed("delete", azure_monitor)

# Should not uninstall disallowed extensions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like

# Only allowlisted extensions should be processed.

and

 # Should not uninstall disallowed extensions.

or are we trying to test something different?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comments to make it more clear what is being tested with each policy

}
}
self._create_policy_file(policy)
# # Known CRP issue - delete/uninstall operation times out instead of reporting an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a CRP issue, but rather a design issue. Uninstall is best effort and never fails. You should consider checking the agent log for the error and then the instance view to confirm the extension was not uninstalled

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've enabled the failed deletion case and updated the test to validate the agent log and instance view.

if not extension_allowed:
msg = (
"Extension will not be processed: failed to {0} extension '{1}' because it is not specified "
"in the allowlist. To {0}, add the extension to the allowed list in the policy file ('{2}')."
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we use both the terms "allowlist" and "allowed list". Does this make sense?
Maybe something like "add <ext_name> to the list of allowed extensions in the policy file"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"list of allowed extensions", though verbose, looks good to me. maybe also use something similar for "allowList"? there is no "allowList" element in the policy file

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@narrieta how about something like this:
"Extension will not be processed: failed to run extension 'CustomScript' because it is not specified as an allowed extension. To run, add the extension to the list of allowed extensions in the policy file ('/etc/waagent_policy.json')."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the message as stated above

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments for the Agent code; will post a separate review for the test code.

azurelinuxagent/ga/policy/policy_engine.py Outdated Show resolved Hide resolved
# If an error was thrown during policy engine initialization, skip further processing of the extension.
# CRP is still waiting for status, so we report error status here.
# of the extension.
policy_op, policy_err_code = _POLICY_ERROR_MAP.get(ext_handler.state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest removing 'policy' from policy_err_code. the error code is not related to policy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'policy_op' is also kind of misleading, since it is not a policy operation... maybe just 'operation' and 'error_code'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to "operation" and "error_code"

# to write a status file for single-config extensions.

# Set handler status for all extensions (with and without settings)
ext_handler_i.set_handler_status(message=message, code=error_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment pointing out that we are intentionally reporting the error at the handler and status level

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like "We report the same error at both the handler status and extension status level." ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, what i was trying to point is that reporting the error both at the handler and status level is not needed (or should not be needed). e.g. install errors are reported at the handler level, while single-config errors are reported at the status level.


if report:
name = ext_handler_i.get_extension_full_name(extension)
handler_version = ext_handler_i.ext_handler.version
add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True,
message=message)

@staticmethod
def __report_policy_error(ext_handler_i, error_code, report_op, message, extension=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove the default value for the 'extension' parameter

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

azurelinuxagent/ga/exthandlers.py Show resolved Hide resolved
if not extension_allowed:
msg = (
"Extension will not be processed: failed to {0} extension '{1}' because it is not specified "
"in the allowlist. To {0}, add the extension to the allowed list in the policy file ('{2}')."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"list of allowed extensions", though verbose, looks good to me. maybe also use something similar for "allowList"? there is no "allowList" element in the policy file

# Process extensions and get if it was successfully executed or not
extension_success = self.handle_ext_handler(handler_i, extension, goal_state_id)
# If extension was blocked by policy, treat the extension as failed and do not process the handler.
if not extension_allowed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge this 'if not extension_allowed:' with the one just above it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, made this change, thanks!

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments on the unit tests, will post end-to-end on separate review

thanks for these tests!


def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=1,
expected_status_msg=None):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. I'd suggest instead adding a comment "Generate a new mock goal state to uninstall the extension" just before

            protocol.mock_wire_data.set_incarnation(2)
            protocol.mock_wire_data.set_extensions_config_state(ExtensionRequestedState.Uninstall)
            protocol.client.update_goal_state()

policy_file.write(policy)
policy_file.flush()

def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd suggest removing the default values and have the tests be very explicit about what their expectations are

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the default and updated the tests accordingly

azurelinuxagent/ga/exthandlers.py Outdated Show resolved Hide resolved
# 2024-10-24T17:34:20.808235Z ERROR ExtHandler ExtHandler Event: name=Microsoft.Azure.Monitor.AzureMonitorLinuxAgent, op=None, message=[ExtensionPolicyError] Extension will not be processed: failed to enable extension 'Microsoft.Azure.Monitor.AzureMonitorLinuxAgent' because extension is not specified in allowlist. To enable, add extension to the allowed list in the policy file ('/etc/waagent_policy.json')., duration=0
# We intentionally block extensions with policy and expect this failure message
{
'message': r"Extension will not be processed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the comment on line 273 with the new error message? No need to add the prefix directly, but want to make sure we only ignore policy related failures

try:
timeout = (16 * 60) # We want to wait for CRP timeout, which is 15 minutes.
extension_case.extension.delete(timeout)
fail(f"The agent should have reported a timeout error when attempting to delete {extension_case.extension} "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: CRP reports timeout because agent reports a failure due to policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the message here

extension_case.extension.delete(timeout)
fail(f"The agent should have reported a timeout error when attempting to delete {extension_case.extension} "
f"because the extension is disallowed by policy.")
except TimeoutError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is catching the Timeout from the 'tests_e2e.tests.lib.azure_sdk_client.AzureSdkClient._execute_async_operation' function. You're giving it a 16 minute timeout, so it will just give up and stop waiting on the operation after that.

The delete operation will fail on its own (should be ~15 min) because CRP will timeout.

Pointing this out since you had this comment: "# We want to wait for CRP timeout, which is 15 minutes."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set the extensionsTimeBudget (CRP timeout) to 15 minutes exactly, I set the timeout here to 16 minutes to be just longer than that so we can wait for the full CRP timeout period. Are you saying we shouldn't have a timeout for _execute_async_operation() at all?

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments for end-to-end tests

# CustomScript is a single-config extension.
custom_script = ExtPolicy.TestCase(
VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.CustomScript,
resource_name="CustomScriptPolicy"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All extension call should use the same resource name (in this case "CustomScript"), otherwise a subsequent call with a different name will fail

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All extension calls do use the same resource name

# Another e2e test may have left behind an extension we want to test here. Cleanup any leftovers so that they
# do not affect the test results.
log.info("Cleaning up existing extensions on the test VM [%s]", self._context.vm.name)
self._context.vm.delete_all_extensions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of removing all extensions, should we remove only the extensions used by this test, then have the test adapt to that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue with removing all extensions?

Deleting individual extensions was causing issues, I think because other e2e tests may leave behind extensions different resource names and deletion is done by resource name. I thought it was safer to just delete everything. If you think we should only delete the extensions used by this test, I could implement that in the next PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other e2e tests should be using the same resource names; if that is not the case, we should fix them

having more extensions in the mix for sure makes the test a little more difficult to write, but provides scenarios a tiny bit closer to what we will see in prod

in these test runs, some extensions will be added by other tests, and some by policy

if handling existing extensions is way too complex, i'm OK deleting all of them, but my first impression is that it should not be the case

OK to do on next PR

try:
if operation == "enable":
# VirtualMachineRunCommandClient (and VirtualMachineRunCommand) does not take force_update_tag as a parameter.
if type(extension_case.extension) == VirtualMachineRunCommandClient:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably isinstance instead of == in case we ever subclass it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, updated

if operation == "enable":
# VirtualMachineRunCommandClient (and VirtualMachineRunCommand) does not take force_update_tag as a parameter.
if type(extension_case.extension) == VirtualMachineRunCommandClient:
extension_case.extension.enable(settings=extension_case.settings, timeout=15*60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the timeout?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to just use the default timeout

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
# instance view and that the expected error is in the agent log to confirm that deletion failed.
delete_start_time = self._ssh_client.run_command("date '+%Y-%m-%d %T'").rstrip()
try:
timeout = (16 * 60) # We want to wait for CRP timeout, which is 15 minutes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should check the agent's log asynchronously and fail immediately rather than waiting for the full 15 min

(can be on next PR)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked as TODO

I initially implemented it this way, but ran into issues - subsequent operations would fail, because CRP was still waiting on the delete operation. I wasn't sure how to force CRP to quit, I will take a look at it in the next PR.

self._create_policy_file(policy)
self._operation_should_succeed("enable", custom_script)
self._operation_should_fail("enable", run_command)
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the check on distro if it's going to fail due to policy anyways?

_should_fail_single_config_depends_on_disallowed_single_config, \
_should_succeed_single_config_depends_on_no_config, \
_should_succeed_single_config_depends_on_single_config
# _should_fail_single_config_depends_on_disallowed_multi_config,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: looks like a leftover - let's remove it or add a todo comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO. These tests are currently disabled, which is why the imports are commented out.

code=-1,
operation=handler_i.operation,
message=msg)
handler_i.create_status_file(extension,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: the previous code was very explicit (in the method's name) about not overwriting existing files. I think that is a good choice. We should probably be explicit in the new method as well, remove the default value and set overwrite=False here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to remove the default and explicitly set the value for overwrite in all calls

@@ -50,7 +44,6 @@ def __init__(self, msg, inner=None):
msg = "Customer-provided policy file ('{0}') is invalid, please correct the following error: {1}".format(conf.get_policy_file_path(), msg)
super(InvalidPolicyError, self).__init__(msg, inner)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an INFO message just after the check for enabled stating that we are using Policy? This makes clearer the fact that we are now processing policies.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__read_policy() is called right after the check for enabled, and it logs the following statement:

Policy enforcement is enabled. Enforcing policy using policy file found at '<path>'. File contents: <policy>

Is that sufficient, or do you think we need an additional log message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's sufficient, but the message should probably be in the caller instead of read_policy. Who knows, as code evolves we may add other code before read_policy, or call read_policy multiple times.

Alternatively, the caller can log "Policy enforcement is enabled." and read_policy "Enforcing policy using policy file found at ''. File contents: "

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


def _operation_should_fail(self, operation, extension_case):
log.info("")
log.info(f"Attempting to {operation} {extension_case.extension}, should fail fast.")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should say "should reach timeout" for delete. split the log messages for enable and delete.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

log.info("Enabling policy via conf file on the test VM [%s]", self._context.vm.name)
self._ssh_client.run_command("update-waagent-conf Debug.EnableExtensionPolicy=y", use_sudo=True)

# This policy tests the following scenarios:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add log messages for these comments in test logs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants