-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block extensions disallowed by policy #3259
base: develop
Are you sure you want to change the base?
Conversation
b440696
to
a37508f
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #3259 +/- ##
===========================================
+ Coverage 71.97% 72.77% +0.79%
===========================================
Files 103 114 +11
Lines 15692 17081 +1389
Branches 2486 2277 -209
===========================================
+ Hits 11295 12431 +1136
- Misses 3881 4107 +226
- Partials 516 543 +27 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an initial review not including tests
""" | ||
# TODO: when CRP adds terminal error code for policy-related extension failures, set that as the default code. | ||
def __init__(self, msg, inner=None, code=-1): | ||
msg = "Extension is disallowed by agent policy and will not be processed: {0}".format(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case where agent failed to parse policy, I'm not sure we should say 'Extension is disallowed by policy'. In this case, extension is disallowed because there's some issue reading or parsing the policy.
I also am hesitant about 'agent policy' since policy is provided by customer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could change this to "Extension will not be processed: "
Parsing errors (InvalidPolicyError) would look like "Extension will not be processed: customer-provided policy file (path) is invalid, please correct the following error..."
Extension disallowed errors (ExtensionPolicyError) would look like "Extension will not be processed: failed to enable extension CustomScript because extension is not specified in policy allowlist. To enable, add extension to the allowed list in policy file (path)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the error message as discussed above.
azurelinuxagent/ga/exthandlers.py
Outdated
policy_op, policy_err_code = policy_err_map.get(ext_handler.state) | ||
if policy_error is not None: | ||
err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code) | ||
self.__handle_and_report_ext_handler_errors(handler_i, err, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this create .status files for single config extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.
azurelinuxagent/ga/exthandlers.py
Outdated
ext_handler.name, | ||
conf.get_policy_file_path()) | ||
err = ExtensionPolicyError(msg, code=policy_err_code) | ||
self.__handle_and_report_ext_handler_errors(handler_i, err, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here about .status file for single config extensions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good. I'm going to spend tomorrow going through each e2e scenario and unit test, sorry for the slow review :/
azurelinuxagent/ga/exthandlers.py
Outdated
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed), | ||
# TODO: CRP does not currently have a terminal error code for uninstall. Once CRP adds | ||
# an error code for uninstall or for policy, use this code instead of PluginDisableProcessingFailed | ||
# Note that currently, CRP waits for 90 minutes to time out for a failed uninstall operation, instead of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some more detail to this comment?
Something like:
Note that currently, CRP will poll until the agent does not report a status for an extension that should be uninstalled. In the case of a policy error, the agent will report a failed status on behalf of the extension, which will cause CRP to poll for the full timeout period, instead of failing fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -692,6 +734,26 @@ def __handle_and_report_ext_handler_errors(ext_handler_i, error, report_op, mess | |||
add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True, | |||
message=message) | |||
|
|||
@staticmethod | |||
def __handle_and_report_policy_error(ext_handler_i, error, report_op, message, report=True, extension=None): | |||
# TODO: Consider merging this function with __handle_and_report_ext_handler_errors() above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please leave some comment explaining why we broke this into a separate function? For policy related failures, we want to fail extensions fast. CRP will continue to poll for single-config ext status until timeout, so agent should write a status for single-config extensions. The other function does not create that status and we didn't want to touch the other function without investigating the impact of that change further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
azurelinuxagent/ga/exthandlers.py
Outdated
|
||
# Create status file for extensions with settings (single and multi config). | ||
if extension is not None: | ||
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create_status_file_if_not_exist() will not overwrite existing status file (for the current sequence number). Is this behavior acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should overwrite the existing file with the policy error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now overwrite the existing file with policy error. I've added an "overwrite" parameter and changed the function name to create_status_file( ).
azurelinuxagent/ga/exthandlers.py
Outdated
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed), | ||
# Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not | ||
# report status for that extension, or until timeout is reached. In the case of a policy error, the | ||
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout, | |
# agent reports failed status on behalf of the extension, which will cause CRP to poll for the full timeout, |
nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks!
@@ -3507,5 +3510,144 @@ def test_report_msg_if_handler_manifest_contains_invalid_values(self): | |||
self.assertIn("'supportsMultipleExtensions' has a non-boolean value", kw_messages[2]['message']) | |||
|
|||
|
|||
class TestExtensionPolicy(TestExtensionBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we add a test case for extension is allowed by policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
tests_e2e/test_suites/ext_policy.yml
Outdated
name: "ExtensionPolicy" | ||
tests: | ||
- "ext_policy/ext_policy.py" | ||
images: "random(endorsed)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should run this on more distros so we can get better coverage before releasing the changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will running on all endorsed distros add too much overhead to the daily runs?
fail(f"The agent should have reported an error trying to {operation} {extension_case.extension.__str__()} " | ||
f"because the extension is disallowed by policy.") | ||
except Exception as error: | ||
assert_that("Extension will not be processed" in str(error)) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also check for [ExtensionPolicyError] in the message to confirm the failure was due to policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
@@ -630,6 +630,70 @@ def test_it_should_handle_and_report_enable_errors_properly(self): | |||
} | |||
self._assert_extension_status(sc_handler, expected_extensions) | |||
|
|||
def test_it_should_handle_and_report_disallowed_extensions_properly(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a case for multi config ext allowed by policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
name: "ExtPolicyWithDependencies" | ||
tests: | ||
- "ext_policy/ext_policy_with_dependencies.py" | ||
images: "random(endorsed)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here, we should get more coverage than 1 run per day, maybe consider running on all endorsed, or 5-10 endorsed images per day
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to all endorsed, but I can change to 5-10 if this adds too much overhead.
e909568
to
86de0c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Posting comments for Agent code.
Will post comments for test code separately.
""" | ||
Error raised during agent extension policy enforcement. | ||
""" | ||
def __init__(self, msg, code, inner=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'code' and 'inner' parameters are not in the same order as in the base class, which can lead to subtle coding errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote it this way because I wanted "code" to be a required parameter in ExtensionPolicyEngine, but not "inner". But I can set a default value for "code", to keep them in the same order as in the base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing this class, based on the other comments
azurelinuxagent/ga/exthandlers.py
Outdated
# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on | ||
# behalf of the extension. | ||
policy_err_map = { | ||
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment describing the elements in the tuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added and moved this to the class level.
azurelinuxagent/ga/exthandlers.py
Outdated
for extension, ext_handler in all_extensions: | ||
|
||
handler_i = ExtHandlerInstance(ext_handler, self.protocol, extension=extension) | ||
|
||
# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on | ||
# behalf of the extension. | ||
policy_err_map = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like this is a constant... define it at the class level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
azurelinuxagent/ga/exthandlers.py
Outdated
# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on | ||
# behalf of the extension. | ||
policy_err_map = { | ||
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'enable' and 'disable' are internal CRP/Agent operations; users are not aware of them. They should not be propagated to error messages displayed to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this to "run" and "uninstall"
azurelinuxagent/ga/exthandlers.py
Outdated
} | ||
policy_op, policy_err_code = policy_err_map.get(ext_handler.state) | ||
if policy_error is not None: | ||
err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the intention of creating an exception object here? seems like it is only used to pass the error code, but it is never raised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially implemented the ExtensionPolicyError class to have a centralized error message for extensions blocked by policy, and also to pass the code. But you make a good point - since we never actually raise the exception, I've removed the ExtensionPolicyError class and now pass the code/message directly into the reporting function.
azurelinuxagent/ga/exthandlers.py
Outdated
err = ExtensionPolicyError(msg, code=policy_err_code) | ||
self.__handle_and_report_policy_error(handler_i, err, report_op=handler_i.operation, message=ustr(err), | ||
extension=extension, report=True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we are missing a continue statement here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think continue statement would break the dependency logic.
It's ok to use continue in the other condition because we know all extensions will fail (dependencies don't matter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the case where a specific extension is disallowed by policy, we should log an error for dependencies as well (using the existing code). Adding a continue statement would skip this logic.
In the case of a policy failure, where all extensions should be blocked regardless of dependencies, we can skip this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. To make this clearer, can you do 'if not extension_allowed' after 'if depends_on_err_msg is not None'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've updated
azurelinuxagent/ga/exthandlers.py
Outdated
|
||
# Create status file for extensions with settings (single and multi config). | ||
if extension is not None: | ||
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should overwrite the existing file with the policy error
azurelinuxagent/ga/exthandlers.py
Outdated
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code, | ||
operation=report_op, message=message) | ||
|
||
if report: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would report be False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it isn't ever false, I initially wrote it this way because I was copying the exact structure of __handle_and_report_ext_handler_errors(). But I've removed it since that parameter isn't being used for now.
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -990,7 +1061,10 @@ def report_ext_handler_status(self, vm_status, ext_handler, goal_state_changed): | |||
# extension even if HandlerState == NotInstalled (Sample scenario: ExtensionsGoalStateError, DecideVersionError, etc) | |||
# We also need to report extension status for an uninstalled handler if extensions are disabled because CRP | |||
# waits for extension runtime status before failing the extension operation. | |||
if handler_state != ExtHandlerState.NotInstalled or ext_handler.supports_multi_config or not conf.get_extensions_enabled(): | |||
# In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change this
# We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
# waits for extension runtime status before failing the extension operation.
# In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
# extension status is not present, collect_ext_status() will set a default transitioning status, and CRP will
# wait for timeout.
to
# We also need to report extension status for an uninstalled handler if extensions are disabled, or if the extension
# failed due to policy, because CRP waits for extension runtime status before failing the extension operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of the change is to enter this condition when the extension fails due to policy, but this change means that we enter the condition whenever policy is enabled.
Is there any negative effect to calling ext_handler_i.get_extension_handler_statuses...
whenever policy is enabled? Why is this behind the if condition in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline - removed this condition, because it would cause us to enter the if condition even for non-policy-related uninstall failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments mainly for e2e tests. I'll review unit tests once the comments in exthandlers.py are resolved
|
||
# Only allowlisted extensions should be processed. | ||
# We only allowlist CustomScript: CustomScript should be enabled, RunCommand and AzureMonitor should fail. | ||
# (Note that CustomScript blocked by policy is tested in a later test case.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding comments to the review so I can follow the scenarios easier. Consider adding these as comments in the code, but ultimately up to you:
This policy tests the following scenarios:
- single config ext (CSE) enable operation succeeds when allowed by policy
- no-config ext (AzureMonitor) enable operation fails fast when disallowed by policy
- single multi-config instance (RunCommandHandler) enable operation fails fast when disallowed by policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks!
self._operation_should_succeed("delete", azure_monitor) | ||
|
||
# Should not uninstall disallowed extensions. | ||
# CustomScript is removed from the allowlist: delete operation should fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This policy tests the following scenarios:
- a delete operation on a previously enabled single-config ext (CSE) which is now disallowed by policy fails fast
- multiple multi-config instances (RunCommandHandler and RunCommandHandler2) enable operations fail fast when disallowed by policy
- single-config ext (CSE) enable operation fails fast when disallowed by policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks!
log.info("CRP returned an error for deletion operation, may be a false error. Checking agent log to determine if operation succeeded. Exception: {0}".format(crp_err)) | ||
try: | ||
for ssh_client in ssh_clients.values(): | ||
msg = ("Remove the extension slice: {0}".format(str(ext_to_delete))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message is related to cgroup. Right now it's logged even when cgroup isn't enabled (which might and probably should change in the future).
Instead, we should check that the handler was successfully uninstalled. i.e. the last ext status reported by the agent shouldn't include the handler:
2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: [('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]
You might also consider doing this by checking the instance view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this - we now check for the last status reported by the agent and confirm that there is no handler status reported. (Instance view doesn't work because CRP reports a stale status)
_test_cases = [ | ||
_should_fail_single_config_depends_on_disallowed_no_config, | ||
_should_fail_single_config_depends_on_disallowed_single_config, | ||
# TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's specify 'RunCommandHandler' since RunCommand is a different extension (confusing, I know :)
Also is it that RunCommandHandler is unable to be "installed properly" or uninstalled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - it's that RunCommandHandler is unable to be uninstalled. I've updated this.
_should_fail_single_config_depends_on_disallowed_single_config, | ||
# TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the | ||
# issue and enable these 3 tests. | ||
# _should_fail_single_config_depends_on_disallowed_multi_config, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about _should_fail_multi_config_depends_on_disallowed_multi_config?
RunCommandHandler1 depends on RunCommandHandler2 for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline - allow/disallow is at the handler level, so we wouldn't have a multi-config extension with one instance allowed and another disallowed.
return policy, template, expected_errors, deletion_order | ||
|
||
|
||
def _should_no_dependencies(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this may be leftover code, I don't see it referenced
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, thanks!
from pathlib import Path | ||
from tests_e2e.tests.lib.agent_log import AgentLog | ||
|
||
|
||
def main(): | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument("--data", dest='data', required=True) | ||
parser.add_argument("--after-timestamp", dest='after_timestamp', required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice :) thanks for adding this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments for test code
|
||
def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=1, | ||
expected_status_msg=None): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add some comments explaining the setup done by this function? (e.g. why incarnation 2?) thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added comments explaining the setup here.
Thanks to @maddieford for helping me figure out why updating the incarnation is required :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. I'd suggest instead adding a comment "Generate a new mock goal state to uninstall the extension" just before
protocol.mock_wire_data.set_incarnation(2)
protocol.mock_wire_data.set_extensions_config_state(ExtensionRequestedState.Uninstall)
protocol.client.update_goal_state()
@@ -630,6 +630,98 @@ def test_it_should_handle_and_report_enable_errors_properly(self): | |||
} | |||
self._assert_extension_status(sc_handler, expected_extensions) | |||
|
|||
def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does "properly" mean? (what is the expected behavior?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to "test_it_should_report_successful_status_for_extensions_allowed_by_policy" and "test_it_should_report_failed_status_for_extensions_disallowed_by_policy"
def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self): | ||
"""If multiconfig extension is disallowed by policy, all instances should be blocked.""" | ||
policy_path = os.path.join(self.tmp_dir, "waagent_policy.json") | ||
patch('azurelinuxagent.common.conf.get_policy_file_path', return_value=str(policy_path)).start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be done using the 'with' statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks
- "ext_policy/ext_policy_with_dependencies.py" | ||
images: "endorsed" | ||
executes_on_scale_set: true | ||
# This test should run on its own VMSS, because other tests may leave behind extensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should handle this in the test and allow it to share the vm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added code at the start of each test to remove leftover extensions
|
||
# RunCommandHandler is a multi-config extension, so we set up two instances (configurations) here and test both. | ||
run_command = ExtPolicy.TestCase( | ||
VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.RunCommandHandler, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should use VirtualMachineRunCommand instead of VirtualMachineExtensionClient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
self._create_policy_file(policy) | ||
self._operation_should_succeed("enable", custom_script) | ||
self._operation_should_fail("enable", run_command) | ||
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how much coverage are we getting for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed ext_policy.yml to run on all endorsed distros, so this case will be run on all distros that support AzureMonitorLinuxAgent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the check on distro if it's going to fail due to policy anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured I would avoid testing this extension at all on an unsupported distro, but I can remove the distro check specifically for this case, if you think it would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, no need for that check
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())): | ||
self._operation_should_fail("enable", azure_monitor) | ||
|
||
# When allowlist is turned off, all extensions should be processed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about CustomScript?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final test case tests this - line 256
self._operation_should_succeed("enable", azure_monitor) | ||
self._operation_should_succeed("delete", azure_monitor) | ||
|
||
# Should not uninstall disallowed extensions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like
# Only allowlisted extensions should be processed.
and
# Should not uninstall disallowed extensions.
or are we trying to test something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the comments to make it more clear what is being tested with each policy
} | ||
} | ||
self._create_policy_file(policy) | ||
# # Known CRP issue - delete/uninstall operation times out instead of reporting an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a CRP issue, but rather a design issue. Uninstall is best effort and never fails. You should consider checking the agent log for the error and then the instance view to confirm the extension was not uninstalled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've enabled the failed deletion case and updated the test to validate the agent log and instance view.
822ced7
to
248f662
Compare
248f662
to
a31bdcf
Compare
azurelinuxagent/ga/exthandlers.py
Outdated
if not extension_allowed: | ||
msg = ( | ||
"Extension will not be processed: failed to {0} extension '{1}' because it is not specified " | ||
"in the allowlist. To {0}, add the extension to the allowed list in the policy file ('{2}')." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we use both the terms "allowlist" and "allowed list". Does this make sense?
Maybe something like "add <ext_name> to the list of allowed extensions in the policy file"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"list of allowed extensions", though verbose, looks good to me. maybe also use something similar for "allowList"? there is no "allowList" element in the policy file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@narrieta how about something like this:
"Extension will not be processed: failed to run extension 'CustomScript' because it is not specified as an allowed extension. To run, add the extension to the list of allowed extensions in the policy file ('/etc/waagent_policy.json')."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the message as stated above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments for the Agent code; will post a separate review for the test code.
azurelinuxagent/ga/exthandlers.py
Outdated
# If an error was thrown during policy engine initialization, skip further processing of the extension. | ||
# CRP is still waiting for status, so we report error status here. | ||
# of the extension. | ||
policy_op, policy_err_code = _POLICY_ERROR_MAP.get(ext_handler.state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest removing 'policy' from policy_err_code. the error code is not related to policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'policy_op' is also kind of misleading, since it is not a policy operation... maybe just 'operation' and 'error_code'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to "operation" and "error_code"
# to write a status file for single-config extensions. | ||
|
||
# Set handler status for all extensions (with and without settings) | ||
ext_handler_i.set_handler_status(message=message, code=error_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment pointing out that we are intentionally reporting the error at the handler and status level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean something like "We report the same error at both the handler status and extension status level." ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, what i was trying to point is that reporting the error both at the handler and status level is not needed (or should not be needed). e.g. install errors are reported at the handler level, while single-config errors are reported at the status level.
azurelinuxagent/ga/exthandlers.py
Outdated
|
||
if report: | ||
name = ext_handler_i.get_extension_full_name(extension) | ||
handler_version = ext_handler_i.ext_handler.version | ||
add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True, | ||
message=message) | ||
|
||
@staticmethod | ||
def __report_policy_error(ext_handler_i, error_code, report_op, message, extension=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove the default value for the 'extension' parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
azurelinuxagent/ga/exthandlers.py
Outdated
if not extension_allowed: | ||
msg = ( | ||
"Extension will not be processed: failed to {0} extension '{1}' because it is not specified " | ||
"in the allowlist. To {0}, add the extension to the allowed list in the policy file ('{2}')." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"list of allowed extensions", though verbose, looks good to me. maybe also use something similar for "allowList"? there is no "allowList" element in the policy file
azurelinuxagent/ga/exthandlers.py
Outdated
# Process extensions and get if it was successfully executed or not | ||
extension_success = self.handle_ext_handler(handler_i, extension, goal_state_id) | ||
# If extension was blocked by policy, treat the extension as failed and do not process the handler. | ||
if not extension_allowed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge this 'if not extension_allowed:' with the one just above it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, made this change, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments on the unit tests, will post end-to-end on separate review
thanks for these tests!
|
||
def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=1, | ||
expected_status_msg=None): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. I'd suggest instead adding a comment "Generate a new mock goal state to uninstall the extension" just before
protocol.mock_wire_data.set_incarnation(2)
protocol.mock_wire_data.set_extensions_config_state(ExtensionRequestedState.Uninstall)
protocol.client.update_goal_state()
tests/ga/test_extension.py
Outdated
policy_file.write(policy) | ||
policy_file.flush() | ||
|
||
def _test_policy_case(self, policy, op, expected_status_code, expected_handler_status, expected_ext_count=0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd suggest removing the default values and have the tests be very explicit about what their expectations are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the default and updated the tests accordingly
# 2024-10-24T17:34:20.808235Z ERROR ExtHandler ExtHandler Event: name=Microsoft.Azure.Monitor.AzureMonitorLinuxAgent, op=None, message=[ExtensionPolicyError] Extension will not be processed: failed to enable extension 'Microsoft.Azure.Monitor.AzureMonitorLinuxAgent' because extension is not specified in allowlist. To enable, add extension to the allowed list in the policy file ('/etc/waagent_policy.json')., duration=0 | ||
# We intentionally block extensions with policy and expect this failure message | ||
{ | ||
'message': r"Extension will not be processed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update the comment on line 273 with the new error message? No need to add the prefix directly, but want to make sure we only ignore policy related failures
try: | ||
timeout = (16 * 60) # We want to wait for CRP timeout, which is 15 minutes. | ||
extension_case.extension.delete(timeout) | ||
fail(f"The agent should have reported a timeout error when attempting to delete {extension_case.extension} " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: CRP reports timeout because agent reports a failure due to policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the message here
extension_case.extension.delete(timeout) | ||
fail(f"The agent should have reported a timeout error when attempting to delete {extension_case.extension} " | ||
f"because the extension is disallowed by policy.") | ||
except TimeoutError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is catching the Timeout from the 'tests_e2e.tests.lib.azure_sdk_client.AzureSdkClient._execute_async_operation' function. You're giving it a 16 minute timeout, so it will just give up and stop waiting on the operation after that.
The delete operation will fail on its own (should be ~15 min) because CRP will timeout.
Pointing this out since you had this comment: "# We want to wait for CRP timeout, which is 15 minutes."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set the extensionsTimeBudget (CRP timeout) to 15 minutes exactly, I set the timeout here to 16 minutes to be just longer than that so we can wait for the full CRP timeout period. Are you saying we shouldn't have a timeout for _execute_async_operation() at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments for end-to-end tests
# CustomScript is a single-config extension. | ||
custom_script = ExtPolicy.TestCase( | ||
VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.CustomScript, | ||
resource_name="CustomScriptPolicy"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All extension call should use the same resource name (in this case "CustomScript"), otherwise a subsequent call with a different name will fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All extension calls do use the same resource name
# Another e2e test may have left behind an extension we want to test here. Cleanup any leftovers so that they | ||
# do not affect the test results. | ||
log.info("Cleaning up existing extensions on the test VM [%s]", self._context.vm.name) | ||
self._context.vm.delete_all_extensions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of removing all extensions, should we remove only the extensions used by this test, then have the test adapt to that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue with removing all extensions?
Deleting individual extensions was causing issues, I think because other e2e tests may leave behind extensions different resource names and deletion is done by resource name. I thought it was safer to just delete everything. If you think we should only delete the extensions used by this test, I could implement that in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other e2e tests should be using the same resource names; if that is not the case, we should fix them
having more extensions in the mix for sure makes the test a little more difficult to write, but provides scenarios a tiny bit closer to what we will see in prod
in these test runs, some extensions will be added by other tests, and some by policy
if handling existing extensions is way too complex, i'm OK deleting all of them, but my first impression is that it should not be the case
OK to do on next PR
try: | ||
if operation == "enable": | ||
# VirtualMachineRunCommandClient (and VirtualMachineRunCommand) does not take force_update_tag as a parameter. | ||
if type(extension_case.extension) == VirtualMachineRunCommandClient: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably isinstance instead of == in case we ever subclass it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, updated
if operation == "enable": | ||
# VirtualMachineRunCommandClient (and VirtualMachineRunCommand) does not take force_update_tag as a parameter. | ||
if type(extension_case.extension) == VirtualMachineRunCommandClient: | ||
extension_case.extension.enable(settings=extension_case.settings, timeout=15*60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated to just use the default timeout
# instance view and that the expected error is in the agent log to confirm that deletion failed. | ||
delete_start_time = self._ssh_client.run_command("date '+%Y-%m-%d %T'").rstrip() | ||
try: | ||
timeout = (16 * 60) # We want to wait for CRP timeout, which is 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should check the agent's log asynchronously and fail immediately rather than waiting for the full 15 min
(can be on next PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marked as TODO
I initially implemented it this way, but ran into issues - subsequent operations would fail, because CRP was still waiting on the delete operation. I wasn't sure how to force CRP to quit, I will take a look at it in the next PR.
self._create_policy_file(policy) | ||
self._operation_should_succeed("enable", custom_script) | ||
self._operation_should_fail("enable", run_command) | ||
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the check on distro if it's going to fail due to policy anyways?
_should_fail_single_config_depends_on_disallowed_single_config, \ | ||
_should_succeed_single_config_depends_on_no_config, \ | ||
_should_succeed_single_config_depends_on_single_config | ||
# _should_fail_single_config_depends_on_disallowed_multi_config, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: looks like a leftover - let's remove it or add a todo comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO. These tests are currently disabled, which is why the imports are commented out.
df28b47
to
bacc425
Compare
code=-1, | ||
operation=handler_i.operation, | ||
message=msg) | ||
handler_i.create_status_file(extension, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: the previous code was very explicit (in the method's name) about not overwriting existing files. I think that is a good choice. We should probably be explicit in the new method as well, remove the default value and set overwrite=False here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to remove the default and explicitly set the value for overwrite in all calls
@@ -50,7 +44,6 @@ def __init__(self, msg, inner=None): | |||
msg = "Customer-provided policy file ('{0}') is invalid, please correct the following error: {1}".format(conf.get_policy_file_path(), msg) | |||
super(InvalidPolicyError, self).__init__(msg, inner) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add an INFO message just after the check for enabled stating that we are using Policy? This makes clearer the fact that we are now processing policies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__read_policy() is called right after the check for enabled, and it logs the following statement:
Policy enforcement is enabled. Enforcing policy using policy file found at '<path>'. File contents: <policy>
Is that sufficient, or do you think we need an additional log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sufficient, but the message should probably be in the caller instead of read_policy. Who knows, as code evolves we may add other code before read_policy, or call read_policy multiple times.
Alternatively, the caller can log "Policy enforcement is enabled." and read_policy "Enforcing policy using policy file found at ''. File contents: "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
|
||
def _operation_should_fail(self, operation, extension_case): | ||
log.info("") | ||
log.info(f"Attempting to {operation} {extension_case.extension}, should fail fast.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should say "should reach timeout" for delete. split the log messages for enable and delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
log.info("Enabling policy via conf file on the test VM [%s]", self._context.vm.name) | ||
self._ssh_client.run_command("update-waagent-conf Debug.EnableExtensionPolicy=y", use_sudo=True) | ||
|
||
# This policy tests the following scenarios: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add log messages for these comments in test logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Description
Issue #
PR #2 for the policy engine allowlist feature:
PR information
Quality of Code and Contribution Guidelines