How to avoid Security Group changes corruption in Terraform by applying Open Policy Agent (OPA)

Introduction

Hi everyone, I'm @tnqv, a Platform SRE in the Service Infrastructure Division, which develops the infrastructure for the entire company. Today, based on real-world use cases we have experienced, I would like to share how we apply Open Policy Agent to the Platform to protect our SLA by preventing Security Group changes from being corrupted in Terraform.

What is Platform?

Since last quarter, Money Forward's Service Infrastructure Division has been split into three groups: Enabling SRE, Platform Group, and Guardian Group.
As Platform SREs, we take a bird's-eye view of the entire company and develop the Platform. Through the Platform, we aim to improve the developer experience and development productivity of engineers throughout the company.

What is OPA (Open Policy Agent)?

From the official Open Policy Agent documentation:

The Open Policy Agent (OPA) is an open-source, general-purpose policy engine that unifies policy implementation across the stack. The project was created by Styra and it is currently incubating at the Cloud Native Computing Foundation. You can use OPA to enforce policies in microservices, Kubernetes, CI/CD pipelines, API gateways, data protection, ssh/sudo & container exec control, Terraform risk-analysis, and much more.

In short, OPA helps us check whether resource definitions (Kubernetes manifests, Terraform code) comply with the policies we define.
To learn more about OPA's basic syntax, see: https://www.openpolicyagent.org/docs/latest/policy-language/

How to integrate Open Policy Agent into the Platform?

The basics:
- Our policies are written in Rego, the declarative language provided by OPA.
- To test our configuration against the policies and run the checks in our CI/CD pipeline, we use a utility called Conftest.

Monorepo in Platform

- To manage each service's resources, we organize the service application folders in a monorepo that holds both the Terraform codebase and the Kubernetes manifests.

The OPA checks are integrated into the Platform as follows:
- Developers push new code changes to a service's codebase.
- The terraform plan workflow runs automatically.
- After terraform plan finishes, we receive the resulting resource changes.
- The OPA check workflow runs and confirms whether the resource changes violate any policy.
- Platform SREs and developers can also freely create and update the policies.
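
The last two steps of this flow can be pictured with a small Python sketch. It is not how Conftest works internally; it only mimics the idea of running every policy against the JSON form of a plan (what `terraform show -json` prints for a saved plan) and failing the job when any policy returns a message. The names `evaluate` and `example_policy`, and the sample plan, are made up for illustration.

```python
import json
from typing import Any, Callable, Dict, List

Plan = Dict[str, Any]
Policy = Callable[[Plan], List[str]]


def example_policy(plan: Plan) -> List[str]:
    """Hypothetical placeholder policy: report every changed aws_security_group."""
    messages = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if change["type"] == "aws_security_group" and actions != ["no-op"]:
            messages.append(f"security group '{change['address']}' is being changed")
    return messages


def evaluate(plan: Plan, policies: List[Policy]) -> List[str]:
    """Run every policy against the plan and collect all deny messages."""
    failures: List[str] = []
    for policy in policies:
        failures.extend(policy(plan))
    return failures


# Trimmed-down stand-in for the JSON representation of a Terraform plan.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_security_group.web", "type": "aws_security_group",
     "change": {"actions": ["update"]}},
    {"address": "aws_instance.app", "type": "aws_instance",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

failures = evaluate(sample_plan, [example_policy])
exit_code = 1 if failures else 0  # a CI job would fail on a non-zero exit code
for message in failures:
    print("FAIL:", message)
```

Keeping each policy as an independent function that returns messages mirrors how Rego `deny` rules each contribute messages to one result set.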

Applying Open Policy Agent to the Platform to protect the SLA by avoiding Security Group changes corruption in Terraform

Because we manage infrastructure as code with Terraform, resource changes are provisioned predictably in most cases, but in some ad-hoc cases they are not.
For example, we ran into the same behavior that has already been reported as an issue: when updating a security group with Terraform, the result is inconsistent with how Terraform is expected to work. Although terraform plan succeeds, terraform apply fails, and that failure causes critical service downtime.
Based on this behavior, there are two general cases in which terraform apply fails:
- If the number of IPs in cidr blocks exceeds the security group quota, all security group rules are removed after the apply fails. (This also happens when create_before_destroy is set to true.)
- If there are duplicated IPs in cidr blocks, all security group rules are removed after the apply fails.
To prevent these failures, the idea is to add policies that check both aws_security_group_rule resources and the inline ingress/egress blocks of aws_security_group resources.

Case 1: Policy for when the number of IPs in cidr blocks exceeds the security group quota

- First, we need to know the security group quota from the plan results. The AWS default quota is 60, but the value will differ if we have asked AWS to increase it, so we declare the quota in Terraform code to learn which value our security groups are actually using.
```
 data "aws_servicequotas_service_quota" "security_group_service_quota" {
   quota_code   = "L-XXXXXXXX" # security group boundary quota code
   service_code = "vpc"
 }
```
Once the quota value is available in the plan, we declare exceed_limit_number for the policy checks based on the terraform plan results.
```rego
default exceed_limit_number = 60 # default quota

exceed_limit_number = upgraded_quota_number {
    some i
    input.prior_state.values.root_module.resources[i].type == "aws_servicequotas_service_quota"
    input.prior_state.values.root_module.resources[i].name == "security_group_service_quota"
    upgraded_quota_number := input.prior_state.values.root_module.resources[i].values.value
}
```
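
To make the lookup easier to follow, here is the same idea as a Python sketch: search the plan's prior state for the quota data source and fall back to the default when it is missing. The function name mirrors the Rego rule, while the sample plan and the value 120 are invented for illustration.

```python
from typing import Any, Dict

DEFAULT_QUOTA = 60  # AWS default mentioned above


def exceed_limit_number(plan: Dict[str, Any]) -> int:
    """Return the quota recorded in the plan's prior state, or the default."""
    resources = (
        plan.get("prior_state", {})
        .get("values", {})
        .get("root_module", {})
        .get("resources", [])
    )
    for resource in resources:
        if (
            resource.get("type") == "aws_servicequotas_service_quota"
            and resource.get("name") == "security_group_service_quota"
        ):
            # Converted to int here for readability; the Rego rule uses the value as-is.
            return int(resource["values"]["value"])
    return DEFAULT_QUOTA


# Invented sample: a plan whose prior state holds an increased quota of 120.
plan_with_quota = {
    "prior_state": {"values": {"root_module": {"resources": [
        {
            "type": "aws_servicequotas_service_quota",
            "name": "security_group_service_quota",
            "values": {"value": 120},
        }
    ]}}}
}

print(exceed_limit_number(plan_with_quota))  # 120
print(exceed_limit_number({}))               # 60
```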
Now, to check whether an inline aws_security_group has more sources than the limit, we define the policy below.
```rego
deny[msg] {
    changeset := input.resource_changes[_]

    # Only check created/updated resources.
    # is_create_or_update is a helper rule that is not shown in this post.
    is_create_or_update(changeset.change.actions)

    changeset.type == "aws_security_group"

    cidr_blocks := changeset.change.after.ingress[_].cidr_blocks
    security_group_source_ids := changeset.change.after.ingress[_].security_groups
    prefix_ids := changeset.change.after.ingress[_].prefix_list_ids
    value := count(array.concat(array.concat(cidr_blocks, security_group_source_ids), prefix_ids))
    value > exceed_limit_number

    msg := sprintf("Security group '%v' inline ingress rules numbers has more than %v, current value %v", [changeset.address, exceed_limit_number, value])
}
```
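
As a rough cross-check of the counting logic, the Python sketch below walks the same fields of the plan JSON. Two things are assumptions rather than facts from the policy: the body of `is_create_or_update` (the helper is used above but not shown) and the choice to sum the sources across all inline ingress blocks of a security group. The sample plan and the limit of 3 are invented to keep the example short.

```python
from typing import Any, Dict, List


def is_create_or_update(actions: List[str]) -> bool:
    """Assumed behaviour of the helper: true when the change creates or updates."""
    return "create" in actions or "update" in actions


def inline_ingress_violations(plan: Dict[str, Any], limit: int) -> List[str]:
    """Report security groups whose inline ingress sources exceed the limit."""
    messages = []
    for change in plan.get("resource_changes", []):
        if change["type"] != "aws_security_group":
            continue
        if not is_create_or_update(change["change"]["actions"]):
            continue
        after = change["change"]["after"] or {}
        total = 0
        for block in after.get("ingress") or []:
            total += len(block.get("cidr_blocks") or [])
            total += len(block.get("security_groups") or [])
            total += len(block.get("prefix_list_ids") or [])
        if total > limit:
            messages.append(
                f"Security group '{change['address']}' has {total} inline ingress sources, limit is {limit}"
            )
    return messages


# Invented sample: one security group with two inline ingress blocks (4 sources in total).
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_security_group.web",
            "type": "aws_security_group",
            "change": {
                "actions": ["update"],
                "after": {"ingress": [
                    {"cidr_blocks": ["10.0.0.0/24", "10.0.1.0/24"], "security_groups": [], "prefix_list_ids": []},
                    {"cidr_blocks": ["10.0.2.0/24"], "security_groups": ["sg-0123"], "prefix_list_ids": []},
                ]},
            },
        }
    ]
}

print(inline_ingress_violations(sample_plan, limit=3))
```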
Note that aws_security_group_rule resources and inline aws_security_group rules are different (the Terraform documentation points out how they differ), so they need a separate policy like the one below.
```rego
deny[msg] {
    some i
    count(ingress_cidr_blocks_by_security_group_rule_ids[i]) > exceed_limit_number
    msg := sprintf("security_groups '%v' containing (ingress) security_group_rules with inbound rules exceed: required value less than %v, actual value: %v", [i, exceed_limit_number, count(ingress_cidr_blocks_by_security_group_rule_ids[i])])
}

# Maps each security group name to all ingress sources attached to it by aws_security_group_rule resources.
ingress_cidr_blocks_by_security_group_rule_ids := {name: all_rules |
    some i

    input.resource_changes[i].type == "aws_security_group_rule"
    input.resource_changes[i].change.after.type == "ingress"
    security_group_id := input.resource_changes[i].change.after.security_group_id

    # Look up the security group name from its id
    security_group := [sg_name |
        some f
        input.planned_values.root_module.resources[f].values.id == security_group_id
        input.planned_values.root_module.resources[f].type == "aws_security_group"
        sg_name := input.planned_values.root_module.resources[f].name
    ]
    name := security_group[0]

    all_cidr_blocks := [cidr_blocks |
        some j
        input.resource_changes[j].change.after.security_group_id == security_group_id
        input.resource_changes[j].change.after.type == "ingress"
        cidr_blocks := input.resource_changes[j].change.after.cidr_blocks[_]
    ]

    all_security_group_source_ids := [security_groups |
        some k
        input.resource_changes[k].change.after.security_group_id == security_group_id
        input.resource_changes[k].change.after.type == "ingress"
        security_groups := input.resource_changes[k].change.after.source_security_group_id
        security_groups != null
    ]

    all_prefix_ids := [prefix_ids |
        some l
        input.resource_changes[l].change.after.security_group_id == security_group_id
        input.resource_changes[l].change.after.type == "ingress"
        prefix_ids := input.resource_changes[l].change.after.prefix_list_ids[_]
    ]

    all_rules := array.concat(array.concat(all_cidr_blocks, all_security_group_source_ids), all_prefix_ids)
}
```
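
The grouping step is the heart of this policy, so here is a Python sketch of it under one simplification: sources are keyed by `security_group_id` directly instead of by the security group name looked up from `planned_values`. The function names, the sample plan, and the limit of 2 are invented for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def ingress_sources_by_security_group(plan: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group every ingress source of aws_security_group_rule resources by security group id."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for change in plan.get("resource_changes", []):
        after = change.get("change", {}).get("after") or {}
        if change.get("type") != "aws_security_group_rule" or after.get("type") != "ingress":
            continue
        sg_id = after.get("security_group_id")
        if sg_id is None:
            continue  # the id is missing from "after", so the rule cannot be grouped
        grouped[sg_id].extend(after.get("cidr_blocks") or [])
        grouped[sg_id].extend(after.get("prefix_list_ids") or [])
        if after.get("source_security_group_id"):
            grouped[sg_id].append(after["source_security_group_id"])
    return dict(grouped)


def over_limit(plan: Dict[str, Any], limit: int) -> Dict[str, int]:
    """Return the source count for each security group that exceeds the limit."""
    grouped = ingress_sources_by_security_group(plan)
    return {sg_id: len(sources) for sg_id, sources in grouped.items() if len(sources) > limit}


def rule(sg_id: str, rule_type: str, **extra: Any) -> Dict[str, Any]:
    """Build a minimal aws_security_group_rule entry for the sample plan."""
    after = {"security_group_id": sg_id, "type": rule_type, **extra}
    return {"type": "aws_security_group_rule", "change": {"after": after}}


# Invented sample: three rules on sg-1 (one is egress and ignored) and one rule on sg-9.
sample_plan = {"resource_changes": [
    rule("sg-1", "ingress", cidr_blocks=["10.0.0.0/24", "10.0.1.0/24"]),
    rule("sg-1", "ingress", source_security_group_id="sg-2"),
    rule("sg-1", "egress", cidr_blocks=["0.0.0.0/0"]),
    rule("sg-9", "ingress", cidr_blocks=["192.168.0.0/16"]),
]}

print(over_limit(sample_plan, limit=2))  # {'sg-1': 3}
```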
After that, the check raises an error whenever a developer pushes code changes that violate the security group policy.

Case 2: Policy for duplicated IPs in cidr blocks

To check for duplicated IPs in ingress cidr blocks, we can take advantage of Rego's set and array types: a set drops duplicated values, so we compare its size with the size of the array built from the same elements. For example, the policy below checks whether a security group rule's ingress cidr blocks contain duplicated IPs. (The egress check is the same apart from the rule type.)

```rego
deny[msg] {
    some i
    input.resource_changes[i].type == "aws_security_group_rule"
    input.resource_changes[i].change.after.type == "ingress"
    address := input.resource_changes[i].address

    # A set keeps each CIDR only once
    unique_cidrs := {cidr |
        some j
        input.resource_changes[j].type == "aws_security_group_rule"
        input.resource_changes[j].change.after.type == "ingress"
        input.resource_changes[j].address == address
        cidr := input.resource_changes[j].change.after.cidr_blocks[_]
    }

    # An array keeps every occurrence
    all_cidrs := [cidr |
        some k
        input.resource_changes[k].type == "aws_security_group_rule"
        input.resource_changes[k].change.after.type == "ingress"
        input.resource_changes[k].address == address
        cidr := input.resource_changes[k].change.after.cidr_blocks[_]
    ]

    # Different sizes mean at least one CIDR appears more than once
    count(unique_cidrs) != count(all_cidrs)

    msg := sprintf("security_group_rules '%v' ingress containing overlaps rules", [address])
}
```
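
The set-versus-array comparison is easy to try outside Rego. The Python sketch below applies the same trick to an invented list of CIDR strings and, as a small extension, also names the duplicated entries. `duplicated_cidrs` is a hypothetical helper, not part of the policy.

```python
from collections import Counter
from typing import List


def duplicated_cidrs(cidr_blocks: List[str]) -> List[str]:
    """Return the CIDR strings that appear more than once, sorted."""
    counts = Counter(cidr_blocks)
    return sorted(cidr for cidr, n in counts.items() if n > 1)


# Invented sample list with one repeated entry.
cidrs = ["10.0.0.0/24", "10.0.1.0/24", "10.0.0.0/24"]

# Same comparison as the policy: a set drops repeats, a list keeps them.
has_duplicates = len(set(cidrs)) != len(cidrs)

print(has_duplicates)           # True
print(duplicated_cidrs(cidrs))  # ['10.0.0.0/24']
```

Like the policy, this only catches entries that are written identically; two different CIDR strings that cover overlapping address ranges are not detected.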

Conclusion

This solution lets us enforce policies that prevent these unexpected updates for now. Hopefully, the Terraform team will resolve the problem natively. (Although the issue above has already been raised, the fix is still under consideration.)
The use case described above is not the only way to apply OPA to our system. We can also extend it to solve other problems, such as growing our policy set from others' best practices, building authorization for RESTful API endpoints, and so on.

Join us!

Money Forward's Service Infrastructure Division is looking for engineers.
We can start with a casual interview, so let's talk first! The links below will take you to our SRE recruitment page and our SRE/infrastructure engineering blog: MoneyForward/SRE blogs

Money Forward Developers Blog

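The official developer blog of Money Forward, Inc. We share posts on technology, development practices, event talks, and more. For questions about our services, please contact the support desk of each service.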