Introduction
- Hi everyone, I'm @tnqv, a Platform SRE in the Service Infrastructure Division, which develops the infrastructure for the entire company. Today, based on real-world use cases we have experienced, I would like to share how we apply Open Policy Agent to the Platform to protect our SLA by preventing corrupted Security Group changes in Terraform.
What is Platform?
- Since last quarter, Moneyforward's Service Infrastructure Division has been split into three groups: Enabling SRE, Platform Group, and Guardian Group.
- As Platform SREs, we take a bird's-eye view of the entire company and develop the Platform. Through the Platform, we aim to improve the developer experience and development productivity of engineers throughout the company.
What is OPA (Open Policy Agent)?
- From the official Open Policy Agent documentation:
The Open Policy Agent (OPA) is an open-source, general-purpose policy engine that unifies policy implementation across the stack. The project was created by Styra and it is currently incubating at the Cloud Native Computing Foundation. You can use OPA to enforce policies in microservices, Kubernetes, CI/CD pipelines, API gateways, data protection, ssh/sudo & container exec control, Terraform risk-analysis, and much more.
- In short, OPA helps us check whether resource definitions (Kubernetes manifests, Terraform code) meet our defined policies.
- To learn more about OPA's basic syntax, see https://www.openpolicyagent.org/docs/latest/policy-language/
How do we integrate Open Policy Agent into the Platform?
- The basics:
- Our policies will be written in Rego, an OPA-provided declarative language.
- To test the policies we have written against our configuration and run them in our CI/CD pipeline, we use a utility tool called Conftest.
- To control each service's resources, we currently structure each service's application folder as a monorepo containing its Terraform code base and Kubernetes manifests.
To integrate OPA checks into the Platform:
- When developers push new code changes to a service's code base, the `terraform plan` workflow runs automatically.
- After `terraform plan` finishes, we receive the resulting resource changes.
- The OPA check workflow then runs and confirms whether those resource changes violate any policy.
- Platform SREs and developers can also freely create and update the policies.
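As a rough illustration of the last two steps, the sketch below shows the general shape of a policy that Conftest can evaluate against the JSON form of a plan (for example, the output of `terraform show -json`). The rule itself, which flags security groups that the plan would delete, is a hypothetical example chosen only to show the structure. It uses the same pre-1.0 rule syntax as the rest of this post; newer OPA releases expect the `if` and `contains` keywords.

```rego
# Conftest evaluates the `main` package by default.
package main

# Hypothetical example rule: every message produced by `deny` is reported as a failure.
deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "aws_security_group"

    # Terraform records the planned actions per resource, e.g. ["create"] or ["delete", "create"],
    # so this matches both plain deletions and replacements.
    rc.change.actions[_] == "delete"

    msg := sprintf("security group '%v' would be deleted or replaced by this plan", [rc.address])
}
```

Each message in the `deny` set is printed by Conftest, and a non-empty set fails the check, which is what lets the CI workflow block the change.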
Applying Open Policy Agent to the Platform to protect the SLA by preventing Security Group corruption in Terraform
- Because we use Terraform for infrastructure as code, resource changes are provisioned predictably in most cases, but in some ad-hoc cases they are not.
- For example, we ran into the same problem as this issue: when updating a security group with Terraform, the behavior is inconsistent with Terraform's plan-then-apply idea. Although `terraform plan` succeeds, the apply fails, which can cause critical service downtime.
- Based on the behavior described above, there are two general cases in which `terraform apply` fails:
  - If the number of IPs in cidr blocks exceeds the security group quota, all security group rules are removed after the failed apply. (It also fails when we set `create_before_destroy` to `true`.)
  - If there are duplicated IPs in cidr blocks, all security group rules are removed after the failed apply.
- To prevent the above failures, the idea is to add policies that check both the `aws_security_group_rule` resource and the inline `ingress`/`egress` of the `aws_security_group` resource.
Case 1: Policy for when the number of IPs in cidr blocks exceeds the security group quota
- First, we need to know the security group quota from the plan results. The AWS default quota is 60, but if we have asked AWS to increase it, the value will be different. So we declare the quota in Terraform code to know which value our security groups are actually using.
```hcl
data "aws_servicequotas_service_quota" "security_group_service_quota" {
  quota_code   = "L-XXXXXXXX" # security group boundary quota code
  service_code = "vpc"
}
```
- With the quota value available in the plan, we declare `exceed_limit_number` for the policy checks based on the `terraform plan` results.

```rego
default exceed_limit_number = 60 # default quota

# Use the quota read by the data source when it is present in the plan.
exceed_limit_number = upgraded_quota_number {
    some i
    input.prior_state.values.root_module.resources[i].type == "aws_servicequotas_service_quota"
    input.prior_state.values.root_module.resources[i].name == "security_group_service_quota"
    upgraded_quota_number := input.prior_state.values.root_module.resources[i].values.value
}
```
Now, to check whether an inline `aws_security_group` has more IPs than the limit, we define the policy below.

```rego
deny[msg] {
    changeset := input.resource_changes[_]
    # is_create_or_update(changeset.change.actions) # Only checks for created/updated resources
    changeset.type == "aws_security_group"

    # Collect the entries of every inline ingress block, so that the total
    # for the security group is compared with the limit.
    cidr_blocks := [cidr | cidr := changeset.change.after.ingress[_].cidr_blocks[_]]
    security_group_source_ids := [sg | sg := changeset.change.after.ingress[_].security_groups[_]]
    prefix_ids := [prefix | prefix := changeset.change.after.ingress[_].prefix_list_ids[_]]

    value := count(array.concat(array.concat(cidr_blocks, security_group_source_ids), prefix_ids))
    value > exceed_limit_number

    msg := sprintf("Security group '%v' has more than %v inline ingress rules, current value: %v", [changeset.address, exceed_limit_number, value])
}
```
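The `is_create_or_update` call that is commented out in the policy above is not defined anywhere in this post. A minimal sketch of such a helper, assuming it only needs to inspect the `actions` list that Terraform records for each resource change, could look like this (the body is an assumption, written in the same pre-1.0 rule syntax as the rest of the post):

```rego
package main

# Hypothetical helper: succeeds when the planned actions include "create" or "update".
# Terraform reports actions such as ["no-op"], ["create"], ["update"], ["delete"],
# or ["delete", "create"] for a replacement.
is_create_or_update(actions) {
    actions[_] == "create"
}

# A second definition with the same name acts as a logical OR.
is_create_or_update(actions) {
    actions[_] == "update"
}
```

With this in place, resources that are only deleted or left unchanged would be skipped by the check.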
Note that `aws_security_group_rule` resources and inline `aws_security_group` rules are different (the Terraform documentation points out how they differ here), so for `aws_security_group_rule` we need to write the policy as below.

```rego
deny[msg] {
    some i
    count(ingress_cidr_blocks_by_security_group_rule_ids[i]) > exceed_limit_number
    msg := sprintf("security_groups '%v' containing (ingress) security_group_rules with inbound rules exceed: required value at most %v, actual value: %v", [i, exceed_limit_number, count(ingress_cidr_blocks_by_security_group_rule_ids[i])])
}

# Maps each security group name to all ingress entries (CIDR blocks, source
# security groups, prefix lists) declared by its aws_security_group_rule resources.
ingress_cidr_blocks_by_security_group_rule_ids := {name: all_rules |
    some i
    input.resource_changes[i].type == "aws_security_group_rule"
    input.resource_changes[i].change.after.type == "ingress"
    security_group_id := input.resource_changes[i].change.after.security_group_id

    # Resolve the security group ID to its Terraform resource name.
    security_group_names := [sg_name |
        some f
        input.planned_values.root_module.resources[f].values.id == security_group_id
        input.planned_values.root_module.resources[f].type == "aws_security_group"
        sg_name := input.planned_values.root_module.resources[f].name
    ]
    name := security_group_names[0]

    all_cidr_blocks := [cidr_block |
        some j
        input.resource_changes[j].change.after.security_group_id == security_group_id
        input.resource_changes[j].change.after.type == "ingress"
        cidr_block := input.resource_changes[j].change.after.cidr_blocks[_]
    ]

    all_security_group_source_ids := [source_id |
        some k
        input.resource_changes[k].change.after.security_group_id == security_group_id
        input.resource_changes[k].change.after.type == "ingress"
        source_id := input.resource_changes[k].change.after.source_security_group_id
        source_id != null
    ]

    all_prefix_ids := [prefix_id |
        some l
        input.resource_changes[l].change.after.security_group_id == security_group_id
        input.resource_changes[l].change.after.type == "ingress"
        prefix_id := input.resource_changes[l].change.after.prefix_list_ids[_]
    ]

    all_rules := array.concat(array.concat(all_cidr_blocks, all_security_group_source_ids), all_prefix_ids)
}
```
- After that, the check raises an error if the code changes a developer pushes violate the security group policy.
Case 2: Policy for cidr blocks that contain duplicated IPs
To check for duplicated IPs in ingress cidr blocks, we can take advantage of Rego's `set` and `array` types: a set drops duplicated elements, so we compare its size with the size of the array. For example, the policy below checks whether a security group rule's ingress contains duplicated IPs. (Egress is the same, but with a different type.)

```rego
deny[msg] {
    some i
    input.resource_changes[i].type == "aws_security_group_rule"
    input.resource_changes[i].change.after.type == "ingress"
    address := input.resource_changes[i].address

    # A set keeps each CIDR block only once.
    unique_cidrs := {cidr |
        some j
        input.resource_changes[j].type == "aws_security_group_rule"
        input.resource_changes[j].change.after.type == "ingress"
        input.resource_changes[j].address == address
        cidr := input.resource_changes[j].change.after.cidr_blocks[_]
    }

    # An array keeps every CIDR block, including repeats.
    all_cidrs := [cidr |
        some k
        input.resource_changes[k].type == "aws_security_group_rule"
        input.resource_changes[k].change.after.type == "ingress"
        input.resource_changes[k].address == address
        cidr := input.resource_changes[k].change.after.cidr_blocks[_]
    ]

    # Different sizes mean at least one CIDR block is duplicated.
    count(unique_cidrs) != count(all_cidrs)
    msg := sprintf("security_group_rules '%v' ingress contains duplicated cidr blocks", [address])
}
```
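The set-versus-array comparison can also be isolated into a small function, which makes it easy to unit test. The sketch below is an illustration only: `has_duplicates` and the two test rules are hypothetical names that do not appear in our policies, and the syntax is the same pre-1.0 form used in the rest of this post.

```rego
package main

# Hypothetical helper: true when `items` contains at least one repeated value.
# The set comprehension keeps each value once, so a smaller count means duplicates.
has_duplicates(items) {
    count({item | item := items[_]}) != count(items)
}

# Unit tests: rules prefixed with `test_` are picked up by `opa test`.
test_duplicated_cidr_is_detected {
    has_duplicates(["10.0.0.0/24", "10.0.0.0/24"])
}

test_distinct_cidrs_pass {
    not has_duplicates(["10.0.0.0/24", "10.0.1.0/24"])
}
```

Note that this technique only detects exact repeats. Overlapping ranges such as `10.0.0.0/16` and `10.0.1.0/24` are different strings, so they would not be flagged.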
Conclusion
- For now, this solution gives us policies that prevent these unexpected updates. Hopefully, the Terraform team will resolve the problem natively. (Although the above issue has already been raised, the fix is still under consideration.)
- The use case described above is not the only way to apply OPA to our system. We can also use it to solve other problems, such as growing our policies from others' best practices, building authorization for RESTful API endpoints, and so on.
Join us!
- The Service Infrastructure Division at MoneyForward is looking for engineers.
We can start with a casual interview, so let's talk first! The links below will take you to our SRE recruitment page and SRE/Infrastructure related engineer blog! MoneyForward/SRE blogs