Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support CGroup v2 on Supervised with manual addon restarts #5419

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mdegat01
Copy link
Contributor

@mdegat01 mdegat01 commented Nov 14, 2024

Proposed change

CGroup v2 does not support the ability to add device access on the fly to a container like we could with CGroup v1. It does not appear we are going to get that support and CGroup v1 support is completely dropped at this point so we will use a workaround on Supervised systems.

CGroup v2 will no longer be marked as unsupported on Supervised systems. If a device is hotplugged and we need to add access to a running addon we will instead create a repair asking the user to restart the addon. This will not be automatic, users will have to restart the addon manually in order for it to access the device.

Note that this only applies to Supervised systems with CGroup v2. HAOS uses a patched version of CGroup v2 with the ability to add device access on the fly so users on HAOS systems will never see this repair or have to restart addons for device access.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced enhanced error handling for device access issues in add-ons.
    • Added a mechanism to suggest restarting add-ons when device access is missing.
  • Bug Fixes

    • Improved management of device access issues during add-on state transitions.
    • Updated logic to handle cgroup versions more effectively.
  • Tests

    • Expanded testing framework for Docker add-ons, including new scenarios for device management and error handling.
    • Added unit tests for the add-on restart functionality, covering various operational states and error conditions.

@mdegat01 mdegat01 added the new-feature A new feature label Nov 14, 2024
@mdegat01 mdegat01 requested a review from agners November 14, 2024 19:44
@mdegat01 mdegat01 added the need-client-library Added to pull requests that need client library changes but none is linked label Nov 14, 2024
Copy link

coderabbitai bot commented Nov 14, 2024

📝 Walkthrough

Walkthrough

The changes involve enhancements to the error handling and state management within the Addon and DockerAddon classes, the introduction of new constants and enumerations in the resolution system, and the addition of a new helper class for addon restart functionality. The modifications also include updates to the testing framework to validate these changes, ensuring robust handling of device access issues and cgroup management.

Changes

File Change Summary
supervisor/addons/addon.py Added _device_access_missing_issue property and updated state management in the Addon class.
supervisor/docker/addon.py Modified DockerAddon class to manage device access issues and updated import statements.
supervisor/resolution/const.py Added constants CGROUP_V1_VERSION, CGROUP_V2_VERSION, and new issue/suggestion types.
supervisor/resolution/evaluations/cgroup.py Updated expected cgroup versions logic; removed constants from imports.
supervisor/resolution/fixups/addon_execute_restart.py Introduced FixupAddonExecuteRestart class for managing addon restarts with various methods.
tests/docker/test_addon.py Enhanced testing for Docker add-ons with new test cases for device management.
tests/resolution/evaluation/test_evaluate_cgroup.py Modified assertions in cgroup version evaluation tests.
tests/resolution/fixup/test_addon_execute_restart.py Added unit tests for FixupAddonExecuteRestart functionality.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Addon
    participant DockerAddon
    participant ResolutionSystem

    User->>Addon: Start Addon
    Addon->>ResolutionSystem: Check device access
    ResolutionSystem-->>Addon: Return access issue if missing
    Addon->>Addon: Manage state and dismiss issues
    Addon->>User: Addon started or error logged

    User->>DockerAddon: Stop Docker Addon
    DockerAddon->>ResolutionSystem: Check for issues
    ResolutionSystem-->>DockerAddon: Return device access issue if present
    DockerAddon->>DockerAddon: Dismiss issue
    DockerAddon->>User: Addon stopped or error logged

    User->>FixupAddonExecuteRestart: Restart Addon
    FixupAddonExecuteRestart->>Addon: Attempt to stop
    Addon-->>FixupAddonExecuteRestart: Return status
    FixupAddonExecuteRestart->>Addon: Attempt to start
    Addon-->>FixupAddonExecuteRestart: Return status
    FixupAddonExecuteRestart->>User: Return restart status
Loading

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@mdegat01 mdegat01 removed the need-client-library Added to pull requests that need client library changes but none is linked label Nov 14, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (15)
supervisor/resolution/fixups/addon_execute_restart.py (5)

1-1: Enhance the module docstring with more context

The current docstring is quite brief. Consider expanding it to include the CGroup v2 context and explain when this helper is used.

-"""Helpers to fix addon by restarting it."""
+"""Helpers to fix addon by restarting it.
+
+This module provides functionality to restart addons when device access needs to be
+updated under CGroup v2. Unlike CGroup v1, device access cannot be added dynamically
+to running containers in CGroup v2, necessitating a container restart.
+"""

13-15: Improve setup function docstring

The current docstring doesn't explain the function's purpose clearly.

-    """Check setup function."""
+    """Initialize and return a FixupAddonExecuteRestart instance.
+
+    Args:
+        coresys: Core system instance for dependency injection
+
+    Returns:
+        FixupBase: Configured fixup instance for addon restart operations
+    """

21-22: Fix incorrect docstring

The docstring describes initialization but this method processes the fixup.

-        """Initialize the fixup class."""
+        """Process the addon restart fixup.
+
+        Args:
+            reference: The addon slug to restart
+
+        Raises:
+            ResolutionFixupError: If the addon cannot be stopped
+        """

42-60: Add return type hints to properties

The property return types should be explicitly annotated for better code maintainability.

     @property
-    def suggestion(self) -> SuggestionType:
+    def suggestion(self) -> SuggestionType | None:
         """Return a SuggestionType enum."""
         return SuggestionType.EXECUTE_RESTART

     @property
-    def context(self) -> ContextType:
+    def context(self) -> ContextType | None:
         """Return a ContextType enum."""
         return ContextType.ADDON

     @property
-    def issues(self) -> list[IssueType]:
+    def issues(self) -> list[IssueType] | None:
         """Return a IssueType enum list."""
         return [IssueType.DEVICE_ACCESS_MISSING]

     @property
     def auto(self) -> bool:
-        """Return if a fixup can be apply as auto fix."""
+        """Return if a fixup can be applied as auto fix."""
         return False

18-20: Consider adding CGroup version validation

The class implements CGroup v2 specific behavior but doesn't validate if CGroup v2 is actually in use.

Consider adding a check in process_fixup to verify the CGroup version and handle cases where CGroup v1 is still in use. This would make the code more robust against system configuration changes.

tests/resolution/evaluation/test_evaluate_cgroup.py (1)

Line range hint 1-85: Consider adding specific test cases for device access scenarios.

While the current test coverage is good for CGroup version evaluation, consider adding test cases that specifically verify:

  1. The behavior when device access is requested in CGroup v2
  2. The system's response indicating the need for manual addon restart

Example test case structure:

async def test_cgroup_v2_device_access(coresys: CoreSys):
    """Test device access behavior with CGroup v2."""
    cgroup_version = EvaluateCGroupVersion(coresys)
    coresys.docker.info.cgroup = CGROUP_V2_VERSION
    # Add assertions for device access scenarios
supervisor/resolution/const.py (1)

83-83: LGTM! Consider adding docstring documentation

The new issue type aligns well with the PR objectives for handling device access limitations in CGroup v2. Consider adding docstring documentation to the IssueType enum to describe when this issue is raised and how it relates to CGroup v2 limitations.

tests/resolution/fixup/test_addon_execute_restart.py (4)

19-26: Maintain consistency in addon reference values.

The DEVICE_ACCESS_MISSING_ISSUE uses TEST_ADDON_SLUG while EXECUTE_RESTART_SUGGESTION hardcodes "local_ssh". Consider using TEST_ADDON_SLUG consistently for both constants.

 EXECUTE_RESTART_SUGGESTION = Suggestion(
-    SuggestionType.EXECUTE_RESTART, ContextType.ADDON, reference="local_ssh"
+    SuggestionType.EXECUTE_RESTART, ContextType.ADDON, reference=TEST_ADDON_SLUG
 )

36-38: Remove unused mock_stop function.

This function is defined but never used in the test. The stopping behavior is handled by the DockerInterface.stop mock instead.


43-48: Improve readability of context manager setup.

Consider breaking down the mock setup for better readability:

-    with (
-        patch.object(DockerInterface, "stop") as stop,
-        patch.object(DockerAddon, "run") as run,
-        patch.object(Addon, "_wait_for_startup"),
-        patch.object(Addon, "write_options"),
-    ):
+    with patch.object(DockerInterface, "stop") as stop, \
+         patch.object(DockerAddon, "run") as run, \
+         patch.object(Addon, "_wait_for_startup"), \
+         patch.object(Addon, "write_options"):

63-63: Fix inconsistent variable naming.

The variable name addon_execute_start should be addon_execute_restart to match the class name FixupAddonExecuteRestart. This occurs in multiple test functions.

-    addon_execute_start = FixupAddonExecuteRestart(coresys)
+    addon_execute_restart = FixupAddonExecuteRestart(coresys)

Also applies to: 87-87, 108-108

supervisor/docker/addon.py (1)

835-844: Add documentation for CGroup v2 handling.

Consider adding a docstring or comment explaining the CGroup v2 specific handling and why a restart is required when device access is needed.

Add documentation like this:

+        # When running on CGroup v2 without OS support, device access cannot be
+        # modified dynamically. A container restart is required to apply changes.
         if (
             self.sys_docker.info.cgroup == CGROUP_V2_VERSION
             and not self.sys_os.available
         ):
supervisor/addons/addon.py (1)

164-168: Consider enhancing the docstring.

The implementation is correct, but the docstring could be more descriptive about when this issue is raised and how it relates to CGroup v2 limitations.

 @property
 def device_access_missing_issue(self) -> Issue:
-    """Get issue used if device access is missing and can't be automatically added."""
+    """Get issue used when device access is missing due to CGroup v2 limitations.
+
+    This issue is raised when a device is hotplugged and cannot be automatically
+    added to a running container due to CGroup v2 restrictions, requiring a manual
+    addon restart to gain access to the device.
+    """
     return self._device_access_missing_issue
tests/docker/test_addon.py (2)

438-438: Use patch to mock methods instead of direct assignment

Directly assigning coresys.hardware.disk.get_disk_free_space = lambda x: 5000 modifies the method globally, which could affect other tests and lead to unintended side effects.

Consider using unittest.mock.patch to temporarily mock the method within the scope of the test to ensure isolation:

with patch.object(coresys.hardware.disk, 'get_disk_free_space', return_value=5000):
    # rest of the test code

454-455: Avoid using arbitrary sleep in asynchronous tests

Using await asyncio.sleep(0.01) to wait for event processing can make the test flaky, as it relies on timing that might vary across different environments.

Consider using synchronization primitives like asyncio.Event to wait for the event to be processed:

event_processed = asyncio.Event()

def on_event(*args, **kwargs):
    event_processed.set()

coresys.bus.async_listen_once(BusEvent.HARDWARE_NEW_DEVICE, on_event)
coresys.bus.fire_event(BusEvent.HARDWARE_NEW_DEVICE, TEST_HW_DEVICE)
await event_processed.wait()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ca3e6da and b35728e.

📒 Files selected for processing (8)
  • supervisor/addons/addon.py (3 hunks)
  • supervisor/docker/addon.py (4 hunks)
  • supervisor/resolution/const.py (3 hunks)
  • supervisor/resolution/evaluations/cgroup.py (2 hunks)
  • supervisor/resolution/fixups/addon_execute_restart.py (1 hunks)
  • tests/docker/test_addon.py (3 hunks)
  • tests/resolution/evaluation/test_evaluate_cgroup.py (1 hunks)
  • tests/resolution/fixup/test_addon_execute_restart.py (1 hunks)
🔇 Additional comments (16)
supervisor/resolution/evaluations/cgroup.py (2)

20-20: LGTM! Verify documentation updates

The change to support both CGroup versions aligns with the PR objectives to enable CGroup v2 support in Supervised systems. This is a significant change that enables the manual addon restart workflow for device access.

Let's check if documentation needs updates to reflect this change:

#!/bin/bash
# Description: Look for documentation that might need updates regarding CGroup support
# Search for documentation files
echo "Searching for documentation files..."
fd -e md -e rst . docs/

# Search for CGroup-related content in documentation
echo "Searching for CGroup-related content..."
rg -i "cgroup|device.*access" docs/

5-5: Verify the relocated constants

The constants have been moved to a centralized location, which is good for maintainability.

Let's verify the constants exist in their new location:

✅ Verification successful

Constants are correctly defined and imported

The constants CGROUP_V1_VERSION and CGROUP_V2_VERSION are properly defined in supervisor/resolution/const.py and are being used correctly in the import statement. The constants are also being used consistently across the codebase, including in tests and Docker-related functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify CGROUP constants are properly defined in const.py
rg "CGROUP_V[12]_VERSION.*=.*" supervisor/resolution/const.py

Length of output: 374


Script:

#!/bin/bash
# Let's first check if the const.py file exists and examine its content
fd "const.py" supervisor/resolution/

# If found, let's examine its content
fd "const.py" supervisor/resolution/ -x cat {}

# Let's also search for these constants in the entire codebase
# to ensure we're not missing any alternate locations
rg "CGROUP_V[12]_VERSION" -A 1

Length of output: 5520

supervisor/resolution/fixups/addon_execute_restart.py (1)

34-41: Reconsider error handling strategy for addon start

The current implementation silently logs start failures. While the comment explains why, this approach might hide important issues from the user.

Let's verify if this matches the intended behavior:

Consider one of these alternatives:

  1. Raise a different error type for start failures
  2. Return a status indicating partial success
  3. Add a warning state to the addon
tests/resolution/evaluation/test_evaluate_cgroup.py (1)

29-29: LGTM! Assertion change aligns with PR objectives.

The modified assertion correctly reflects that CGroup v2 should no longer be marked as unsupported in Supervised systems, which is consistent with the PR's goal to support CGroup v2 with manual addon restarts.

supervisor/resolution/const.py (2)

18-19: LGTM! Well-defined constants for CGroup versions

The string values are appropriate for configuration and logging purposes.


119-119: LGTM! Verify suggestion type usage

The new suggestion type aligns with the PR's requirement for manual addon restarts in CGroup v2 environments.

Let's verify how this suggestion type is used in the codebase:

✅ Verification successful

EXECUTE_RESTART suggestion type is properly implemented

The suggestion type is well integrated into the codebase:

  • Implemented in FixupAddonExecuteRestart class with proper restart logic (stop and start)
  • Used in docker addon context to handle device access issues
  • Comprehensive test coverage including error cases
  • Non-automatic fixup (requires manual intervention)
  • Correctly associated with IssueType.DEVICE_ACCESS_MISSING and ContextType.ADDON

The implementation aligns with the PR's requirement for manual addon restarts in CGroup v2 environments.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how the new suggestion type is implemented
# Expected: Should find references in resolution manager and addon-related files

# Search for direct references to the new suggestion type
rg "EXECUTE_RESTART" --type python

# Look for resolution manager implementations handling this suggestion
ast-grep --pattern 'def $_(self, suggestion: SuggestionType):
  $$$
  if suggestion == SuggestionType.EXECUTE_RESTART:
    $$$
'

Length of output: 196


Script:

#!/bin/bash
# Let's try with correct ripgrep file type and more specific ast-grep pattern

# Search for direct references to the new suggestion type
rg "EXECUTE_RESTART" -t py

# Look for resolution manager implementations
ast-grep --pattern 'class $_ {
  $$$
  def process_suggestion(self, suggestion: $_):
    $$$
    EXECUTE_RESTART
    $$$
}'

# Check imports of SuggestionType where EXECUTE_RESTART might be used
rg "from supervisor.resolution.const import.*SuggestionType" -t py

Length of output: 5797


Script:

#!/bin/bash
# Let's check the implementation of the restart fixup

# Check the addon restart fixup implementation
cat supervisor/resolution/fixups/addon_execute_restart.py

# Look for the resolution manager's suggestion processing
rg -A 10 "process_suggestion.*EXECUTE_RESTART" -t py

# Check the test implementation for restart suggestion
cat tests/resolution/fixup/test_addon_execute_restart.py

Length of output: 6283

tests/resolution/fixup/test_addon_execute_restart.py (2)

81-104: Well-structured test for start failure scenario.

The test properly verifies that:

  1. The stop operation is still called
  2. Issues are cleaned up even on failure
  3. Appropriate error message is logged

106-120: Good coverage of missing addon scenario.

The test effectively verifies the edge case of a missing addon, ensuring:

  1. No operations are attempted
  2. Issues and suggestions are cleaned up
  3. Clear error message is logged
supervisor/docker/addon.py (2)

13-13: LGTM: Import changes are appropriate.

The new imports for evolve and CGROUP_V2_VERSION are correctly placed and necessary for the CGroup v2 support implementation.

Also applies to: 44-44


791-797: LGTM: Clean handling of device access issues during container removal.

The code appropriately cleans up device access issues when the container is removed, preventing stale issues from persisting.

supervisor/addons/addon.py (2)

151-153: LGTM! Well-structured issue tracking implementation.

The new _device_access_missing_issue property follows the established pattern and correctly initializes the issue with appropriate type and context.


193-199: LGTM! Proper cleanup of device access issues.

The implementation correctly dismisses device access issues when the addon stops, which is essential for the manual restart workflow with CGroup v2. This ensures that the issue is cleared when the addon is intentionally stopped or restarted.

tests/docker/test_addon.py (4)

392-413: Constants for test devices are appropriate and well-defined

The definitions of TEST_DEV_PATH, TEST_SYSFS_PATH, and TEST_HW_DEVICE provide clear and necessary fixtures for the tests, ensuring accurate simulation of device-related scenarios.


468-469: Use patch to mock methods instead of direct assignment

Similar to a previous comment, directly assigning coresys.hardware.disk.get_disk_free_space = lambda x: 5000 can affect other tests.


483-484: Avoid using arbitrary sleep in asynchronous tests

As previously noted, using await asyncio.sleep(0.01) may cause unreliable test results.


456-456: Verify hardcoded device major and minor numbers

The assertion add_devices.assert_called_once_with(123, "c 0:0 rwm") uses hardcoded major and minor numbers "0:0". This may not match the actual device numbers and could lead to test failures if the device numbers differ.

Run the following script to confirm the major and minor numbers of TEST_HW_DEVICE:

Ensure that the major and minor numbers output by the script match "0:0". If they differ, update the assertion to use the correct numbers:

-add_devices.assert_called_once_with(123, "c 0:0 rwm")
+add_devices.assert_called_once_with(123, f"c {major}:{minor} rwm")

supervisor/docker/addon.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant