In the previous blog, we covered tagging 101, how tagging helps with cloud cost reporting, how to promote a tagging strategy that supports cost attribution in your business and engineering context, and how to cultivate a culture of cloud resource ownership at scale.
This blog continues the series and discusses tagging strategies that work at scale and how to tag resources with Infrastructure-as-Code (IaC). We also suggest key-value pairs (tags) that could fit your environments, along with a tag hierarchy. Use this as a reference for your tagging enhancements.
Cloud Tagging Hierarchy
Tagging cloud resources is the first step toward cost visibility and attribution. Tags are needed at several levels to ensure precise attribution.
From a business perspective, you could tag based on your organization structure, usually cost centers and organizational hierarchies.
From an infrastructure perspective, a cloud resource tagging hierarchy looks like the following:
- Account/project level tags
- Cloud Infrastructure resource level tags
- Microservice tags (K8s)
Let's explore each in detail.
ACCOUNT-LEVEL TAGS
These tags help with cost attribution at the account level (AWS) or project level (GCP). They are the first steps toward budgeting and forecasting, as well as building chargeback/show-back models.
Note: Having tags defined as required in the list below will help your organization’s tagging strategy thrive. We’ve seen large enterprises use 1:1 mapping for accounts and teams. This works well when there’s a landing zone structure in place; otherwise, it adds a lot of overhead. 1:1 mapping also helps build precise cost attribution and chargeback models. (Network costs and other shared costs are quite tricky to handle, and this mapping helps with that.)
- service_owner tag: If the account is owned by a service/team and not shared across teams (Required)
- point_of_contact tag: Email of the owner of the account or requester of the account (Required)
- tf_managed / cf_managed tag: Indicates whether the cloud resource is managed via Terraform (tf), CloudFormation (cf), or another Infrastructure-as-Code (IaC) tool (Required; if managed by Terraform, set it to true; otherwise, false)
- cost_center tag: This tag identifies the cost center that the resources belong to (Required)
- executive_sponsor tag: Could represent costs and expenditures from an executive perspective. This tag can promote budget alignment for cloud spending (Required)
- cob (cost_of_business): Indicates whether it’s direct production costs vs. R&D (non-prod)
- Values: opex or cogs (Operational Expense or Cost of Goods Sold)
- account_name: Name the account with naming conventions that the organization uses (Required)
Note: These tags should be enough when some accounts/projects are owned by engineering teams and not shared with others. This level of granularity won’t be sufficient for cost attribution for shared accounts/microservices. In the below sections, we’ll cover how to overcome that problem and allocate costs for shared services.
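As a rough illustration of how the required account-level tags above could be enforced, here is a minimal Python sketch. The function name, the plain-dict input, and the wording of the messages are assumptions made for this example; they are not part of any cloud provider API. The sketch also assumes that either tf_managed or cf_managed satisfies the IaC requirement.

```python
# Required account-level tags, taken from the list above.
REQUIRED_ACCOUNT_TAGS = (
    "account_name",
    "cost_center",
    "executive_sponsor",
    "point_of_contact",
    "service_owner",
)

# Assumption: either IaC marker tag satisfies the "Required" rule.
IAC_TAGS = ("tf_managed", "cf_managed")

# cob is not marked as required, but it only takes these two values.
ALLOWED_COB_VALUES = ("opex", "cogs")


def find_account_tag_problems(tags):
    """Return a list of human-readable problems for one account's tags.

    `tags` is a plain dict of tag key -> string value (hypothetical input shape).
    """
    problems = []
    for key in REQUIRED_ACCOUNT_TAGS:
        if not tags.get(key, "").strip():
            problems.append(f"missing required tag: {key}")
    if not any(tags.get(key, "").strip() for key in IAC_TAGS):
        problems.append("missing required tag: tf_managed or cf_managed")
    cob = tags.get("cob")
    if cob is not None and cob not in ALLOWED_COB_VALUES:
        problems.append(f"invalid cob value: {cob}")
    return problems
```

A check like this could run in a CI pipeline or a scheduled audit job, with the tag dictionaries pulled from whatever inventory source you already have.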
CLOUD INFRASTRUCTURE RESOURCE-LEVEL TAGS
A cloud account is often shared by several teams. In these multi-team scenarios, tagging has to be performed at the resource level to build cost attribution.
Note: From a cloud resource deployment perspective, the recommendation is to use environment (dev, qa, staging, prod, etc.) to identify resources from specific environments.
- service_name tag: Name of the service the resource belongs to (Required)
- Example: front-end
- service_owner tag: Name of the Eng. team that owns the service (Required)
- A manager or an IC responsible for the service
- Pro tip: A team alias works best in this scenario
- If it’s a shared service, we need to add multiple owners separated by “,”
- shared_service tag: Takes boolean values (yes or no as a value). If the value is yes, we’ll have to add all the service owners under the service_owner tag separated by “,”
- While this does not solve the problem of cost attribution to teams, central cloud teams will know who consumes the resource
- Shared services cost attribution will be discussed further in the next section
- cost_center tag: Could be used to identify resources under a business unit
- cob (cost_of_business): Indicates whether it’s direct production costs vs. R&D (non-prod)
- Values: opex or cogs (Operational Expense or Cost of Goods Sold)
- managed_by tag: Team alias/IC email of the team that manages the service
- point_of_contact tag: Username (everything before @Organization.net in your official email, not an alias) of the primary POC for that service (Conditional)
- requestor tag: The name of the team that requests the service; might be required in cases of creating an account (Conditional)
- env tag: dev, test, qa, prod, etc. If infra. falls under one of these categories, adding this tag is super important (Conditional)
- name tag: Any special names with which the team can identify a resource meaningfully. This is for service owners to decide how they can name their services to identify quickly (Conditional)
- Example: us-west-2a-front-end-01
- tf_managed / cf_managed tag: Indicates whether the resource is managed via Terraform or CloudFormation (Required; if managed by Terraform, set it to true; otherwise, false. In an ideal world, all of our infra. should be tf only.)
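The shared_service and service_owner rules above lend themselves to a simple consistency check. The sketch below is illustrative only: the helper names are made up for this example, and it assumes shared_service carries "yes" or "no" and that service_owner is a comma-separated string, as described in the list.

```python
def split_owners(service_owner):
    """Split a comma-separated service_owner tag value into a clean list."""
    return [owner.strip() for owner in service_owner.split(",") if owner.strip()]


def shared_service_is_consistent(tags):
    """Check that the owner list matches the shared_service flag.

    A shared resource (shared_service = yes) should list at least two owners;
    any other resource needs at least one owner, since service_owner is required.
    """
    owners = split_owners(tags.get("service_owner", ""))
    if tags.get("shared_service", "no").lower() == "yes":
        return len(owners) >= 2
    return len(owners) >= 1
```

For example, a resource tagged shared_service = yes with service_owner = "team-a" would be flagged, because a shared service with a single owner is probably mis-tagged.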
Tip: Resource Cleanup-related Tags (Save dollars with these tags + automation)
Pro tip #1: This tag helps to clean up resources that are no longer needed after a particular date.
- remove_after_date tag (Required for resources created outside of the regular IaC process and for other temporary environments): If additional infrastructure is created in response to an incident or for testing purposes, this tag helps remove those cloud resources once the specified date has passed.
- Example: remove_after_date = “12/21/2021”
Pro tip #2: This tag helps to shut down non-prod resources during the hours when nobody needs them.
- shut-down tag (boolean): This tag is to be used for non-prod workloads where resources can be turned off during non-business hours and weekends.
- For instance, if this is set to true, then a lambda function or some automation script can turn off a resource with this tag at 5 pm and then bring that back at 8 am. (You can set the schedule that works best for you.)
In my previous life, we stopped QA clusters using K8s labels combined with cronjobs, and EC2 instances using AWS EC2 tags combined with a Lambda function.
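To make the two cleanup tags more concrete, here is a small Python sketch of the decision logic an automation script might apply. It does not call any cloud API. The function names are hypothetical, the MM/DD/YYYY date format is inferred from the example value above, and the 8 am to 5 pm weekday window is the example schedule from the text, not a recommendation.

```python
from datetime import datetime


def is_past_removal_date(tags, today):
    """True when remove_after_date (assumed MM/DD/YYYY) is strictly before today."""
    raw = tags.get("remove_after_date")
    if not raw:
        return False  # untagged resources are never flagged for removal
    remove_after = datetime.strptime(raw, "%m/%d/%Y").date()
    return today > remove_after


def should_be_running(tags, now, start_hour=8, stop_hour=17):
    """Decide whether a resource should be up at the given local time.

    Resources without shut-down = true always stay up. Tagged resources run
    only on weekdays between start_hour (inclusive) and stop_hour (exclusive).
    """
    if tags.get("shut-down", "false").lower() != "true":
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < stop_hour
```

A scheduled job could evaluate these two functions for every tagged resource and then stop, start, or delete resources using whatever tooling you already run.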
Security-related Tags
This is a curated list of tags that could come in handy for security teams, for example during incident response. While these tags do not directly contribute to cost allocation, they can help your security teams implement guardrails and automate processes.
- criticality tag: This tag may be useful for security teams to let them know of the criticality of the environments and resources. This could help set up some automation for incident response based on the criticality of the vulnerability.
- low
- medium
- high
- business unit-critical
- mission-critical
- dr (disaster recovery) tag: This tag may be useful for cloud infra teams to identify failover environments during dr. This could potentially help to identify the costs of running dr.
- mission-critical
- critical
- essential
- security:incident_response
- pii (Required): This tag helps identify if env contains PII. This can help identify costs for securing envs and enable IAM guarding policies.
- False
- True
MICROSERVICE TAGS (K8S)
- cluster_name tag: Organization-* name for the cluster (Required)
- Example: prodn1 or Organization-prodn1
- tf_managed tag: Used to indicate it’s managed via Terraform (Required; if managed by Terraform, set it to true; otherwise, false. As a best practice, launch all cloud infrastructure via Infrastructure-as-Code.)
- service_name label: Name of the service the resource belongs to (Required)
- Example: back-end
- point_of_contact label: Name of the primary POC for that service. A manager or an IC responsible for the service. (Required)
- service_owner label: Name of the engineering team that owns the service. (Required)
- shared_service tag: Takes boolean values – yes or no as a value. If the value is yes, we’ll have to add all the service owners under the service_owner tag separated by “,”. (Conditional)
- env label: dev, test, qa, prod, etc. If infra. falls under one of these categories, you must add this tag. (Conditional)
- name label: Any special names with which teams can identify their resources meaningfully. (Optional; this is for service owners to decide how they can name their services to identify easily.)
- Example: region-az-service-#
- remove_after_date: If any additional infrastructure is created in response to an incident or for testing purposes, this tag helps remove that piece after the specified date.
- Example: remove_after_date = “12/21/2021”
module "account_tags" {
  source        = "github.com/Organization-dev/terraform-utils/modules/tags/account-tags?ref=v0.1"
  account_name  = "devtest"
  service_owner = "Ops"
  requestor     = "ops@Organization.net"
}
We do this by populating default_tags with the module which we defined in main.tf above.
provider "aws" {
  region  = "us-west-2"
  profile = "Organization-devtest"
  default_tags {
    tags = module.account_tags.tags
  }
}
How to Tag Your K8s Resources
Here’s how to tag/label your K8s resources via the labels section of the manifest file.
apiVersion: v1
kind: Pod
metadata:
  name: front-end-app
  labels:
    env: dev
    app: nginx
    service_owner: team-xyz
    service_name: nginx
Shared Services Cost Attribution
Shared services range from taxes, support fees, credits, and databases to microservices. This could be a shared S3 bucket, an RDS database consumed by multiple teams, a K8s microservice used to process data, an EMR cluster that processes data, etc. It’s a pain to build chargeback models with shared services.
Cost attribution for taxes, support fees, and credits
- These charges from the CSP are not split to accounts/projects; they’re usually charged at the payer account level with no granular attribution.
- A fair way to chargeback (allocate these charges back to engineering teams) is to split these charges based on a team’s cloud spending proportionally.
- Example: If Team A spends 10% of the total bill, 10% of the taxes, support fees, and credits are allocated to Team A.
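The proportional split described above can be sketched in a few lines of Python. The function name and the dict-based input are assumptions for illustration; the spend figures in the usage note are made up.

```python
def allocate_proportionally(shared_charge, spend_by_team):
    """Split a payer-level charge across teams in proportion to their spend.

    `shared_charge` can be negative, which is how a credit would be passed back.
    """
    total_spend = sum(spend_by_team.values())
    if total_spend <= 0:
        raise ValueError("total team spend must be positive")
    return {
        team: shared_charge * spend / total_spend
        for team, spend in spend_by_team.items()
    }
```

With a made-up bill where Team A spends 10,000 and Team B spends 90,000, a shared charge of 5,000 would be split as 500 for Team A and 4,500 for Team B, matching the 10% example above.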
- Time taken for the query to run: This metric can help build a chargeback model for associating costs to customers. This will help identify which customers are profitable.
- Amount of data transferred/ processed: This metric helps chargeback costs on a system heavy on data processing.
- A way to chargeback here is to proportionally distribute costs to users based on the amount of data processed.
- Use a combination of these metrics to proportionally distribute resource costs such as EC2, S3, RDS, etc.
- You could use a formula that looks like this:
- shared_service_cost = cloud_resource_cost * (percent_distribution_of_metric1 * weight_of_metric1 + percent_distribution_of_metric2 * weight_of_metric2 +…)
- cloud_resource_cost refers to the actual price of cloud resources such as EC2, S3, RDS, etc.
- weight_of_metric needs consensus from within engineering teams and service owners
- Use the metrics that make sense for your workloads. The proportional weight of metrics is a prerequisite to building this model.
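Here is one way the weighted formula above could look in Python. This is a sketch under stated assumptions: the function signature and the metric names in the usage note are hypothetical, each share is expressed as a fraction between 0 and 1, and the weights are assumed to sum to 1 so that the allocations across all consumers add up to the full resource cost.

```python
def shared_service_cost(cloud_resource_cost, metric_shares, metric_weights):
    """Cost allocated to one consumer of a shared resource.

    metric_shares:  the consumer's fraction (0 to 1) of each metric.
    metric_weights: the agreed weight of each metric (assumed to sum to 1).
    """
    if set(metric_shares) != set(metric_weights):
        raise ValueError("shares and weights must cover the same metrics")
    if abs(sum(metric_weights.values()) - 1.0) > 1e-9:
        raise ValueError("metric weights must sum to 1")
    weighted_share = sum(
        metric_shares[metric] * metric_weights[metric] for metric in metric_weights
    )
    return cloud_resource_cost * weighted_share
```

For instance, with a resource cost of 1,000, equal weights for query time and data processed, and a consumer responsible for 25% of query time and 50% of data processed, the allocation comes to 1,000 * (0.25 * 0.5 + 0.5 * 0.5) = 375.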