IAM trust error with cross-account shared VPC #82

Open
TripleMalahat opened this issue Aug 26, 2024 · 8 comments

@TripleMalahat

I am trying to deploy a private ROSA STS cluster that leverages an existing shared VPC in another AWS account. We have previously had this configuration working successfully with the legacy rosa-sts module. Example resource configuration with the new rosa-classic module is below.

module "rosa-classic" {
  source  = "terraform-redhat/rosa-classic/rhcs"
  version = "1.6.2"

  # General Cluster Config
  cluster_name             = local.cluster_name
  compute_machine_type     = var.machine_type
  destroy_timeout          = 90
  etcd_encryption          = false
  wait_for_create_complete = false

  # Cluster Version Config
  openshift_version            = var.rosa_openshift_version
  upgrade_acknowledgements_for = true

  # Admin Account Config
  admin_credentials_username = "admin-name"
  admin_credentials_password = module.sm-secret.secret_value

  # Networking Config
  machine_cidr = data.aws_vpc.shared_vpc.cidr_block
  multi_az     = true

  # Autoscaling
  autoscaling_enabled                         = true
  autoscaler_ignore_daemonsets_utilization    = true
  autoscaler_max_node_provision_time          = "10m"
  autoscaler_max_nodes_total                  = 42
  autoscaler_max_pod_grace_period             = 5
  autoscaler_scale_down_delay_after_add       = "10m"
  autoscaler_scale_down_delay_after_delete    = "30s"
  autoscaler_scale_down_delay_after_failure   = "3m"
  autoscaler_scale_down_enabled               = true
  autoscaler_scale_down_unneeded_time         = "10m"
  autoscaler_scale_down_utilization_threshold = 0.5

  cluster_autoscaler_enabled = true
  max_replicas               = var.max_worker_node_replicas
  min_replicas               = 3

  autoscaler_cores = {
    min = 0
    max = 240
  }

  autoscaler_memory = {
    min = 0
    max = 1056
  }

  # DNS Config
  aws_private_link             = true
  aws_subnet_ids               = var.aws_subnet_ids
  base_dns_domain              = var.base_dns_domain
  private                      = true
  private_hosted_zone_id       = var.private_hosted_zone_id
  private_hosted_zone_role_arn = var.private_hosted_zone_role_arn

  # OIDC Config
  create_oidc           = true
  managed_oidc          = true
  create_operator_roles = true
  create_account_roles  = true
}

When attempting to deploy with this configuration, we receive the following error:

│ Error: Can't build cluster
│ 
│   with module.rosa-cluster.module.rosa-classic.module.rosa_cluster_classic.rhcs_cluster_rosa_classic.rosa_classic_cluster,
│   on .terraform/modules/rosa-cluster.rosa-classic/modules/rosa-cluster-classic/main.tf line 39, in resource "rhcs_cluster_rosa_classic" "rosa_classic_cluster":
│   39: resource "rhcs_cluster_rosa_classic" "rosa_classic_cluster" {
│ 
│ Can't create cluster with name 'my-cluster': status is 400, identifier
│ is '400', code is 'CLUSTERS-MGMT-400' and operation identifier is
│ '71ab3fab-8d99-41e6-86db-0318c20b72c1': Failed to assume role with ARN
│ 'arn:aws:iam::111111111111:role/ROSA-SharedVPCRole-my-cluster-Z0590036143NZLABMGPJR':
│ operation error STS: AssumeRole, https response error StatusCode: 403,
│ RequestID: db511d8b-7461-410f-b636-527f4303bc5c, api error AccessDenied:
│ User:
│ arn:aws:sts::222222222222:assumed-role/my-cluster-Installer-Role/OCM is
│ not authorized to perform: sts:AssumeRole on resource:
│ arn:aws:iam::111111111111:role/ROSA-SharedVPCRole-my-cluster-Z0590036143NZLABMGPJR

Our troubleshooting has suggested that the trust policy on the ROSA-SharedVPCRole-my-cluster role is configured properly. However, the identity policies on the my-cluster-Installer-Role do not include the requisite permissions that allow it to assume the ROSA-SharedVPCRole-my-cluster role.

@gdbranco
Contributor

The installer role's permission policy should include "sts:AssumeRole" and "sts:AssumeRoleWithWebIdentity", although it does not specify which roles may be assumed. Could you confirm that those are in the set of permissions for my-cluster-Installer-Role? If they are, could you verify that the trust policy for ROSA-SharedVPCRole-my-cluster allows the action "sts:AssumeRole" from the principal "my-cluster-Installer-Role"?

@TripleMalahat
Author

Hi Guilherme, thanks for getting back to me. I went through my findings again and double checked the parts you mentioned.

  1. The error I provided above is apparently not the only one we're getting. That one only appears on a greenfield apply attempt. On subsequent apply attempts with partially deployed resources, it goes away and is replaced by the error in point 2. For completeness, though, here are the trust policy from the target role and the permissions (abridged) from the installer role for the first error. You are correct that the two-way trust appears to be configured properly in this instance.
# ROSA-SharedVPCRole-rosa-pbit-my-cluster-Z01724557PTJ8IV92KS3

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::222222222222:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

# rosa-pbit-my-cluster-account-Installer-Role (from account 222222222222)

{
    "Statement": [
        {
            "Action": [
                ...
                "sts:AssumeRole",
                "sts:AssumeRoleWithWebIdentity",
                "sts:GetCallerIdentity",
                ...
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/red-hat-managed": "true"
                }
            },
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}
  2. What I didn't realize earlier is that on subsequent apply attempts, we get a similar error but with a different role trying to assume the SharedVPC role:
│ Error: Can't build cluster
│ 
│   with module.rosa-cluster.module.rosa-classic.module.rosa_cluster_classic.rhcs_cluster_rosa_classic.rosa_classic_cluster,
│   on .terraform/modules/rosa-cluster.rosa-classic/modules/rosa-cluster-classic/main.tf line 39, in resource "rhcs_cluster_rosa_classic" "rosa_classic_cluster":
│   39: resource "rhcs_cluster_rosa_classic" "rosa_classic_cluster" {
│ 
│ Can't create cluster with name 'rosa-pbit-matt2': status is 400, identifier
│ is '400', code is 'CLUSTERS-MGMT-400' and operation identifier is
│ '41f8acb8-f7c9-4c40-b939-d77b7ba998f5': Failed to assume role with ARN
│ 'arn:aws:iam::111111111111:role/ROSA-SharedVPCRole-rosa-pbit-my-cluster-Z01724557PTJ8IV92KS3':
│ operation error STS: AssumeRole, https response error StatusCode: 403,
│ RequestID: 664f2aef-4ae6-481c-8d12-6f3650c7de82, api error AccessDenied:
│ User:
│ arn:aws:sts::222222222222:assumed-role/rosa-pbit-my-cluster-operator-openshift-ingress-operator-cloud-creden/OCM
│ is not authorized to perform: sts:AssumeRole on resource:
│ arn:aws:iam::111111111111:role/ROSA-SharedVPCRole-rosa-pbit-my-cluster-Z01724557PTJ8IV92KS3

In this instance, it is the "rosa-pbit-my-cluster-operator-openshift-ingress-operator-cloud-creden" role that is trying to assume the SharedVPCRole. That role does not have appropriate rights to assume roles in another account:

{
    "Statement": [
        {
            "Action": [
                "elasticloadbalancing:DescribeLoadBalancers",
                "route53:ListHostedZones",
                "route53:ListTagsForResources",
                "route53:ChangeResourceRecordSets",
                "tag:GetResources"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ],
    "Version": "2012-10-17"
}
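
For illustration, the permission that this policy lacks could be sketched in Terraform roughly as follows. This is only a sketch under assumptions: the role name and the shared VPC role ARN are copied from the error output above, while the data source label, resource label, and policy name are hypothetical. Attaching an inline policy by hand is not something the maintainers have recommended in this thread, and the role name may need adjusting if the role was created under a non-default path or with a different full name.

```hcl
# Hypothetical sketch: an identity policy statement that lets the ingress
# operator role assume the shared VPC role in the VPC owner account.
data "aws_iam_policy_document" "assume_shared_vpc_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    # ARN copied from the error message; scoped to the one role rather than "*".
    resources = [
      "arn:aws:iam::111111111111:role/ROSA-SharedVPCRole-rosa-pbit-my-cluster-Z01724557PTJ8IV92KS3",
    ]
  }
}

# Attaches the statement above as an inline policy on the operator role.
resource "aws_iam_role_policy" "ingress_operator_assume_shared_vpc" {
  name   = "assume-shared-vpc-role" # hypothetical policy name
  role   = "rosa-pbit-my-cluster-operator-openshift-ingress-operator-cloud-creden"
  policy = data.aws_iam_policy_document.assume_shared_vpc_role.json
}
```

Both sides have to agree for a cross-account assume to succeed: the caller's identity policy must allow sts:AssumeRole on the target role, and the target role's trust policy must allow the caller.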

Thoughts?

@gdbranco
Contributor

gdbranco commented Aug 27, 2024

Regarding the installer and ingress operator roles not being allowed to assume the shared VPC role: it seems you are missing this step, which updates the shared VPC role's trust policy to include the correct principals. According to the trust policy you posted, it only allows the root user to assume it.
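
For reference, a trust policy that names the two roles explicitly, rather than the account root, might look roughly like this in Terraform. It is a sketch under assumptions: the account ID and role names are taken from the output posted earlier in this thread, the data source label is hypothetical, and the role ARNs assume the roles live under the default IAM path.

```hcl
# Hypothetical sketch: trust policy for the shared VPC role that lists the
# installer and ingress operator roles as the only allowed principals.
data "aws_iam_policy_document" "shared_vpc_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      identifiers = [
        "arn:aws:iam::222222222222:role/rosa-pbit-my-cluster-account-Installer-Role",
        "arn:aws:iam::222222222222:role/rosa-pbit-my-cluster-operator-openshift-ingress-operator-cloud-creden",
      ]
    }
  }
}
```

The rendered document is available as `data.aws_iam_policy_document.shared_vpc_trust.json` and can be passed to the `assume_role_policy` argument of the `aws_iam_role` that defines the shared VPC role. Note that IAM rejects a trust policy whose role principals do not exist yet, so the named roles must be created first.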

Regarding the permission policy of the ingress operator: it should have picked up this set of permissions when the shared VPC role is supplied. I'll take a look into it.

Edit: I have identified that the main module is missing the forwarding of the shared VPC role ARN input, which the operator policies need in order to select the correct permission set mentioned above. I'll bring that to the team and provide an estimate for a fix soon. In the meantime, as a workaround, please check the shared VPC example; you'll notice that it relies on the standalone modules instead of the main module directly.

@TripleMalahat
Author

The trust policy we have there (arn:aws:iam::222222222222:root) should actually be more permissive than what's described in the doc you linked. It allows ANY principal from the specified account to assume the role.

I'm glad you were able to confirm the other issue. Please do keep me posted on when you think the fix might be available. For now, I will need to proceed with the legacy rosa-sts module since I have that working and am under the gun a touch. But I would love to be able to shift gears to the new providers asap if that becomes feasible.

@gdbranco
Contributor

gdbranco commented Aug 27, 2024

On further inspection, that is correct, and it works in my internal tests using the more permissive :root as well. Bear in mind, though, that some AWS services do not work well with that setup and require more specific restrictions. My current explanation is that there was a sync delay on the AWS side before it picked up and accepted the installer role.
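
If a propagation delay is indeed the cause, one common mitigation is to pause between creating the IAM role and creating the cluster, using the `time_sleep` resource from the hashicorp/time provider. The sketch below is an assumption-laden illustration, not part of the module: the `aws_iam_role.shared_vpc` resource, its labels, and the 30-second duration are all hypothetical, and it assumes the default `aws` provider is configured for the VPC owner account.

```hcl
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
  }
}

# Hypothetical definition of the shared VPC role; in practice this role may
# already be managed elsewhere, with its own permission policies attached.
resource "aws_iam_role" "shared_vpc" {
  name = "ROSA-SharedVPCRole-my-cluster" # hypothetical, shortened name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::222222222222:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Waits after the role is created so that IAM changes have time to propagate.
resource "time_sleep" "wait_for_iam_propagation" {
  create_duration = "30s" # arbitrary value chosen for illustration

  depends_on = [aws_iam_role.shared_vpc]
}
```

The cluster module block would then declare `depends_on = [time_sleep.wait_for_iam_propagation]` so that cluster creation starts only after the wait has elapsed.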

The second issue is indeed because the main module does not forward the intended use of private_hosted_zone_role_arn to the operator policies, so they do not select the correct set of permissions for the ingress operator role. As a result, the role lacks the permission to assume the shared VPC role. We have addressed that through #83.

We are currently going through the release of v1.6.3 of the provider, so that fix will be included in v1.6.3 of the classic modules.

@TripleMalahat
Author

Any update on where we're at with this? I saw that v1.6.3 was released but it doesn't look like the fix made it in.

@gdbranco
Contributor

gdbranco commented Sep 4, 2024

We had an issue with the 1.6.2 modules that required releasing 1.6.3 early. The PR has been merged to main and will be available in 1.6.4. Sorry for the inconvenience.

@TripleMalahat
Author

TripleMalahat commented Sep 4, 2024 via email
