Sep 8, 2024

#02. Here is everything you need to know about a Tagging Strategy

Before you start tagging your cloud resources you need a tagging strategy. Here are some guidelines and best practices for you.

As mentioned in the post "Is tagging the way to go for cost allocation", organizations should primarily rely on account hierarchy structure for cost allocation and complement it with tagging to enhance granularity. Using tags for resource classification offers several advantages over solely relying on native AWS hierarchical constructs (organization, account, resource group). Specifically:

  • Follow Business Organizational Structure: Tags can mirror the actual naming of the organization's internal structure.
  • Building Matrix Dimensions: Multiple tags can be applied to a single resource, whereas a resource can only belong to one hierarchical construct. This makes tags ideal for filtering and achieving deeper cost granularity.
  • Consistency: The same tags can be applied to resources belonging to different hierarchical constructs.
  • Compliance and Security: Use tags to mark resources that need special attention. For example, tag buckets that must comply with a specific retention policy or with certain ISO 27001 requirements.
  • Multi-Cloud Applicability: It's possible to implement the same tagging dictionary across providers without adopting different naming conventions per provider. Tags can be implemented across major cloud providers with minimal or no changes.

Implementing a well-defined tagging strategy provides control and visibility over infrastructure, leading to improved cost management, streamlined operations, and enhanced governance.

However, tags can become a major source of complexity if their implementation doesn't follow a clear strategy. Common issues include inconsistency, lack of comprehensiveness, the need for exceptions, and frequent tag updates. The guidelines below are meant to help you set up your tagging strategy.

TL;DR

  • Don't start tagging resources before planning a clear tagging strategy and implementing it according to best practices.
  • Best practices include tag consistency, comprehensiveness, and adherence to hyperscaler limitations. Leverage automation when possible.
  • To determine which tags are needed, start by defining tag categories and purposes.
  • Tagging should have a clear governance and accountability structure.
  • A tagging strategy consists of four phases: define, audit, automate, and enforce.

General Tagging Best Practices

The following best practices are “agnostic” and should be taken into account when creating tags, regardless of the purpose of the tag.

  1. A culture of tagging among cloud users is fundamental. Users are responsible for tagging their resources. The governance team (e.g., owners of different tagging categories, such as the FinOps team for the FinOps tag category) primarily acts as a support function, providing corrective actions when necessary.
  2. Tags must consider relevant dimensions. A dimension's relevance is typically defined by the tagging strategy's goal, the organisation's structure, the cloud infrastructure's organisation, and any applicable limitations or constraints.
    Example: To track Cost Center, use a tag key "cost-center" with a value like "abc-xyz". For applications using shared infrastructure, consider tagging common services as "shared" and work with IT Finance to determine a shared cost allocation method. If this isn't feasible, reassess the goal, governance, or infrastructure.
  3. Tags must be consistent. Consistency in tagging means defining a standardized format for tag keys and values, and enforcing it through policies. Inconsistent tagging strategy execution leads to increased, often unmanageable complexity, especially in large environments.
    Example: To track application costs, use a tag key "application". Avoid variations like "app", "app-name", or misspellings.
  4. Tagging must be comprehensive. It should be possible to map 100% of resources (within existing limitations, as some services may not be taggable). Comprehensiveness is measured in two ways:
    1. Against a specific dimension.
      Example: All resources must have an "application" tag.
    2. Against a specific tagging initiative.
      Example: All resources must have the FinOps-related tags.
    In other words, merely having a tag is not sufficient.
  5. Tagging should be grouped into categories. Each tag category must follow the general tagging guidelines and describe relevant tags. Additionally, each category must have an owner (a department or group).
    Example: Tag categories can include "FinOps", "Security", and "Automation". Each category needs an owner accountable for the tagging strategy, its implementation, and any related operational needs.
  6. Tags must adhere to hyperscaler limitations and rules. If multi-cloud is part of the cloud strategy, the tagging approach must account for the rules and limitations of different hyperscalers (see tagging limitations for AWS, Azure and GCP).
  7. Programmatically enforce tags to ensure consistent implementation. Provisioning automation and Infrastructure as Code (IaC) enhance transparency and reduce compliance violations, especially those caused by human error.
  8. Avoid duplicating available information. Don't create tags that replicate data already accessible through the provider's API (e.g., "region").
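
To make the consistency practice concrete, here is a minimal Python sketch of a tag key check. The approved keys come from the examples above; the list of known variants and the lowercase kebab-case format are assumptions made for illustration, not rules from any cloud provider.

```python
import re

# Approved keys taken from the examples in this article.
APPROVED_KEYS = {"application", "cost-center"}

# Hypothetical mapping of common variants to the approved key.
KNOWN_VARIANTS = {
    "app": "application",
    "app-name": "application",
    "costcenter": "cost-center",
}

# Assumed naming convention: lowercase kebab-case.
KEY_FORMAT = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def check_tag_key(key: str) -> str:
    """Return 'ok', or a short message explaining why the key is inconsistent."""
    if key in APPROVED_KEYS:
        return "ok"
    if key.lower() in KNOWN_VARIANTS:
        return f"use '{KNOWN_VARIANTS[key.lower()]}' instead of '{key}'"
    if not KEY_FORMAT.match(key):
        return f"'{key}' does not follow the lowercase kebab-case format"
    return f"'{key}' is not in the tagging dictionary"


for candidate in ["application", "app", "Cost_Center", "team"]:
    print(candidate, "->", check_tag_key(candidate))
```

A check like this could run in a CI pipeline or a pre-provisioning hook, so variations are caught before they reach the cloud account.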

Which Tags/Labels to Use?

A simple and effective way to start organizing tags is to define tagging categories (e.g., Business, Technical, FinOps, Automation, Security, and Compliance). For each category, list its tags and state the purpose of each one. Finally, add any other relevant attributes to these tags.

Tag attributes may include the owner, the scope the tag applies to, whether it's inheritable, proposed values, format, and a description.
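
One way to capture these attributes is a small structured record per tag. The sketch below is only an illustration: the field names, the "data-classification" tag, and the sample values are hypothetical and should be replaced by whatever your organization agrees on.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TagDefinition:
    """One entry of the tagging dictionary (field names are illustrative)."""
    key: str
    category: str
    owner: str
    scope: Tuple[str, ...]          # hierarchy layers the tag applies to
    inheritable: bool
    description: str
    value_format: str = ""          # optional regular expression for values
    proposed_values: Tuple[str, ...] = ()


TAG_DICTIONARY = [
    TagDefinition(
        key="cost-center",
        category="FinOps",
        owner="FinOps team",
        scope=("account", "resource"),
        inheritable=True,
        description="Cost center charged for the resource",
        value_format=r"[a-z]{3}-[a-z]{3}",
    ),
    TagDefinition(
        key="application",
        category="FinOps",
        owner="FinOps team",
        scope=("resource",),
        inheritable=False,
        description="Application the resource belongs to",
    ),
    TagDefinition(  # hypothetical tag, shown only to illustrate a second category
        key="data-classification",
        category="Security",
        owner="Security team",
        scope=("resource",),
        inheritable=False,
        description="Sensitivity of the data stored on the resource",
        proposed_values=("public", "internal", "confidential"),
    ),
]


def tags_by_category(dictionary: List[TagDefinition]) -> Dict[str, List[str]]:
    """Group tag keys by category, preserving dictionary order."""
    grouped: Dict[str, List[str]] = {}
    for tag in dictionary:
        grouped.setdefault(tag.category, []).append(tag.key)
    return grouped


print(tags_by_category(TAG_DICTIONARY))
```

Keeping the dictionary in a machine-readable form like this makes it easier to reuse the same definitions for documentation, audits, and policy generation.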

Where to Apply the Tags?

Part of the tag definition is its scope and whether it's inheritable. Tags can be applied at different hierarchy layers (e.g., OUs, accounts, subscriptions, resource groups, folders, and projects) as well as to the cloud resources themselves. However, applying a tag to a parent does not always mean that child resources inherit it automatically. In some cases, tag inheritance has to be configured through a policy.

For example, in AWS there are two ways to implement this: AWS tag policies or Service Control Policies (SCPs), both of which are attached at the organization, OU, or account level and apply to the accounts beneath them. In Azure, tag inheritance can be implemented using built-in Azure Policy definitions. These policies control the inheritance of tags between parent and child resources (e.g., "Inherit a tag from the resource group", "Inherit a tag from the subscription", or "Append a tag and its value from the resource group"). Finally, in GCP, tags (as opposed to labels, which are not inherited) are inherited by default throughout the resource hierarchy (organization → folders → projects → resources). You can override an inherited tag on a child resource by applying a tag with the same key but a different value.
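
The override behaviour described above (a child value with the same key wins over the inherited one) can be sketched as a simple merge from the top of the hierarchy down to the resource. This models the semantics only and does not call any cloud API; the function name and sample values are hypothetical.

```python
from typing import Dict


def effective_tags(*levels: Dict[str, str]) -> Dict[str, str]:
    """Merge tag sets ordered from the top of the hierarchy down to the resource.

    Later (lower) levels override earlier (higher) ones when keys collide.
    """
    merged: Dict[str, str] = {}
    for level in levels:
        merged.update(level)
    return merged


organization = {"cost-center": "abc-xyz"}
folder = {"environment": "prod"}
project = {"cost-center": "def-uvw"}   # overrides the organization-level value
resource = {"application": "billing"}

print(effective_tags(organization, folder, project, resource))
```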

References: Tagging Policies in AWS; Implement AWS resource tagging strategy using AWS Tag Policies and Service Control Policies (SCPs); and Assign policy definitions for tag compliance in Azure.

Tagging Governance

Consider these aspects of Governance for tagging:

Ownership

The ownership and definition of tags should be addressed from the outset. It's common to assign tagging accountability to the tag category owner. However, since there are often multiple tag categories, it's typical to assign ownership of the overall tagging strategy to the team responsible for governance—usually the "cloud center of excellence" or CCoE. Ultimately, the CCoE leader is accountable for the entire tagging strategy.

Tagging Stakeholders

The CCoE needs to work with several teams to create an effective tagging strategy. These key teams include:

  • App Development: They suggest tags for automation and DevOps.
  • Cloud Infrastructure: They help set up the necessary automation processes.
  • Finance: They provide input on cost breakdowns, define cost centers, and decide how to allocate costs for shared resources.
  • Operations: They fix any tagging policy issues and put automated controls in place.
  • Security: They suggest tags related to security baselines and how to manage overall security.

Remediation

There are different approaches used to fix tagging issues. These approaches heavily depend on your organization's level of maturity:

  • Reactive approach: Uses tools or custom scripts to find resources that are not tagged correctly. These tools can be provided by cloud providers or third-party software.
  • Proactive approach: Uses automation and Infrastructure as Code (IaC) to make sure tags are applied correctly when resources are created. This method aims to ensure all resources are tagged properly from the start.
  • Hybrid approach: Combines both reactive and proactive methods. It uses automation to regularly check tags, find resources that are not tagged correctly, and fix the issues.

Implementing the Tagging Strategy

Phase 1: Define

The first part of implementing the tagging strategy is to define the tagging dictionary and create awareness around it. Success metrics should also be defined in this phase to keep track of tag coverage.

NOTE: Do not enforce tags that haven’t been previously agreed upon.

This phase can be considered complete when:

  • The tagging dictionary contains the proposed list of tags, along with their description, example value and associated success metrics.
  • The tagging dictionary is published along with the principles that govern the conventions used in the definition of the tags
  • Feedback from the involved stakeholders is collected and used to refine the dictionary (at least one round)
  • Success metrics are defined

You may also find it useful to keep a log of the stakeholders interviewed as part of the rationale for the tagging strategy.

Create Awareness

Raising stakeholder awareness of the tag dictionary is crucial for the tagging strategy's success. The dictionary must:

  • Circulate Internally: Be shared within the organization to gather feedback and refine the agreed-upon list of tags.
  • Utilize Internal Documentation Portals: Be published on the organization's intranet or wiki, including the principles that guided the dictionary's development.

Success Metrics (Non-Technical)

To gauge the success of a tagging strategy, it's crucial to monitor whether tags provide real value to the organization. Tracking success metrics is essential for evaluating the strategy's effectiveness. These metrics should be regularly shared with stakeholders, clearly demonstrating the benefits the organization derives from implementing tags.

The following is an example demonstrating how to measure the success of a tagging strategy:

Phase 2: Auditing

In this phase, the FinOps team in charge of the tags within the CCoE must verify the compliance of deployed cloud resources against the dictionary defined in Phase 1. Note that this audit should run periodically, to monitor progress and to detect and remediate violations.

Phase 2 can be considered complete when there is a process in place that periodically:

  • checks that required tags are present
  • makes sure each tag has a value
  • verifies that tag values comply with approved naming conventions (if values can be freely defined, they must not be empty)
  • communicates the overall compliance status to the stakeholders
  • communicates non-compliance to resource owners (soft remediation)

If violations are found, resource owners must be notified. The FinOps team in the CCoE is accountable for notifying resource owners, who are in turn responsible for making sure violations are remediated.
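
The three checks listed above (tag present, value not empty, value compliant with the naming convention) can be sketched in a few lines of Python. The required keys and value patterns here are assumptions for illustration; in practice they would come from the tagging dictionary, and the resource data from your provider's inventory or audit tooling.

```python
import re
from typing import Dict, List

# Hypothetical required tags and value conventions.
REQUIRED_TAGS = {
    "application": re.compile(r"[a-z0-9-]+"),
    "cost-center": re.compile(r"[a-z]{3}-[a-z]{3}"),
}


def audit_resource(resource_id: str, tags: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one finding per required tag that is missing, empty, or non-compliant."""
    findings = []
    for key, pattern in REQUIRED_TAGS.items():
        if key not in tags:
            issue = "missing tag"
        elif not tags[key].strip():
            issue = "empty value"
        elif not pattern.fullmatch(tags[key]):
            issue = "non-compliant value"
        else:
            continue  # this tag is compliant
        findings.append({"resource_id": resource_id, "tag": key, "issue": issue})
    return findings


print(audit_resource("vm-2", {"application": "", "cost-center": "ABC"}))
```

Running such a function on a schedule over the full resource inventory gives both the per-owner findings and the overall compliance status to report to stakeholders.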

The Audit Process

When non-compliance is detected, a notification is sent to the resource owner. This notification must include:

  • The resource ID and a brief description (e.g., "Virtual Machine")
  • Details of the non-compliance (e.g., empty value, non-compliant value, missing tag)
  • Recommended action
  • A link to the published tagging dictionary and guidelines

This approach represents a "soft" remediation strategy. It's generally not advisable to implement "hard" measures, such as forcibly decommissioning non-compliant resources.
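
As a rough illustration of such a notification, the sketch below assembles the four required elements into a plain-text message. The recommended-action wording and the dictionary URL are placeholders, not part of the original guidance.

```python
# Hypothetical recommended actions per type of non-compliance.
RECOMMENDED_ACTIONS = {
    "missing tag": "Add the tag with an approved value.",
    "empty value": "Set a value for the tag.",
    "non-compliant value": "Replace the value with one that follows the naming convention.",
}


def build_notification(resource_id: str, resource_type: str, tag: str,
                       issue: str, dictionary_url: str) -> str:
    """Build a soft-remediation message for the resource owner."""
    action = RECOMMENDED_ACTIONS.get(issue, "Review the tag against the dictionary.")
    return "\n".join([
        f"Resource: {resource_id} ({resource_type})",
        f"Non-compliance: {issue} for tag '{tag}'",
        f"Recommended action: {action}",
        f"Tagging dictionary and guidelines: {dictionary_url}",
    ])


message = build_notification(
    resource_id="vm-2",
    resource_type="Virtual Machine",
    tag="cost-center",
    issue="empty value",
    dictionary_url="https://wiki.example.com/tagging-dictionary",  # placeholder URL
)
print(message)
```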

Multiple tools can support this process. For example, AWS Config, Azure Policy, and GCP Resource Manager (to name a few) allow you to define which tag keys must be applied, whether a value should be specified, and if it must comply with a certain format. Once the rules are set, these tools continuously audit existing resources and report violations.

Success Metrics (Non-Technical)

Here again, it is critical to have success metrics and to show them to stakeholders. Monitoring the tagging index (see below) will help demonstrate progress. Discussing the identified challenges will create momentum and provide a basis for continued support from resource owners towards the tagging initiative.
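
The exact definition of the tagging index is not reproduced in this text. As an assumption, one simple way to compute a coverage-style metric is the percentage of resources that carry every required tag with a non-empty value, as in this sketch (function name, keys, and sample inventory are hypothetical).

```python
from typing import Dict, Iterable


def tagging_index(resources: Dict[str, Dict[str, str]], required_keys: Iterable[str]) -> float:
    """Percentage of resources that have a non-empty value for every required key."""
    required = list(required_keys)
    if not resources:
        return 0.0
    compliant = sum(
        1
        for tags in resources.values()
        if all(tags.get(key, "").strip() for key in required)
    )
    return round(100.0 * compliant / len(resources), 1)


inventory = {
    "vm-1": {"application": "billing", "cost-center": "abc-xyz"},
    "vm-2": {"application": "billing"},                      # cost-center missing
    "bucket-1": {"application": "reports", "cost-center": "abc-xyz"},
    "db-1": {"application": "billing", "cost-center": "def-uvw"},
}

print(tagging_index(inventory, ["application", "cost-center"]))  # 75.0
```

Tracking this number per audit run, and per tag category, is one way to show progress to stakeholders over time.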

Phase 3: Automation

Automation should be implemented to apply tags automatically wherever possible.

  • Leveraging cloud-native hierarchies (e.g., accounts and resource groups): Create inheritance policies to apply tags with a parent-child relationship.
    • For example, block storage volumes could inherit all the tags of the compute instance they're attached to.
  • Leveraging Infrastructure as Code (IaC): Standardize and automate the provisioning process.
    • Note: Audit IaC to ensure resources have compliant tags among their parameters.
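
The block storage example above can be sketched as a small planning function that works out which tags each attached volume is missing compared with its instance. This is a provider-neutral illustration with hypothetical names; it leaves tags already set on a volume untouched, which is an assumption rather than a rule from the text, and applying the plan would be done with your provider's SDK or policy tooling.

```python
from typing import Dict


def plan_volume_tags(instance_tags: Dict[str, str],
                     volumes: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return, per volume ID, the instance tags that the volume does not have yet."""
    plan: Dict[str, Dict[str, str]] = {}
    for volume_id, volume_tags in volumes.items():
        missing = {k: v for k, v in instance_tags.items() if k not in volume_tags}
        if missing:
            plan[volume_id] = missing
    return plan


instance = {"application": "billing", "cost-center": "abc-xyz"}
attached_volumes = {
    "vol-1": {},                                                   # inherits everything
    "vol-2": {"application": "billing", "cost-center": "abc-xyz"},  # already tagged
    "vol-3": {"application": "reporting"},                         # keeps its own value
}

print(plan_volume_tags(instance, attached_volumes))
```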

This phase is complete when:

  • Tagging policies are created (e.g., inheritance policies)
  • IaC templates include tags (e.g., as a requirement for the DevOps team, aligned with the tagging strategy)

Phase 4: Enforcement

Tagging guidelines are programmatically enforced as the final step of this strategy implementation. The enforcement is designed through policies and can be both "preventive" and "retrospective." The preventive approach denies all noncompliant operations, while the retrospective approach detects and remediates policy violations after provisioning.

Both approaches can be recommended. However, each approach has its advantages and disadvantages:

  • Preventive approach: The main advantage is that non-compliant resources are never created. However, its disadvantages must be considered in complex architectures or cases of large technical debt. Alignment with DevOps is crucial to avoid disrupting existing workflows (e.g., deployment halting because Infrastructure as Code tried to deploy a non-compliant resource).
  • Retrospective approach: This is easier to implement and less risky. However, its disadvantage lies in the extra administrative work it requires (e.g., follow-ups and remediation). The main risk is that the root cause isn't always identified, resulting in a continuous need for discovery and remediation.
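
The difference between the two approaches can be illustrated with a toy in-memory inventory: the preventive path refuses to create a non-compliant resource, while the retrospective path scans what already exists. All names below are hypothetical; real enforcement would rely on the provider's policy engine rather than application code.

```python
from typing import Dict, List

REQUIRED_KEYS = ("application", "cost-center")  # assumed required tags


class TagPolicyViolation(Exception):
    """Raised when a provisioning request lacks required tags."""


def missing_keys(tags: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not tags.get(key, "").strip()]


def provision(resource_id: str, tags: Dict[str, str],
              inventory: Dict[str, Dict[str, str]]) -> None:
    """Preventive: deny the request before the resource is created."""
    missing = missing_keys(tags)
    if missing:
        raise TagPolicyViolation(f"{resource_id}: missing required tags {missing}")
    inventory[resource_id] = dict(tags)


def find_violations(inventory: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Retrospective: detect non-compliant resources after provisioning."""
    return {
        resource_id: missing_keys(tags)
        for resource_id, tags in inventory.items()
        if missing_keys(tags)
    }


inventory = {"legacy-vm": {"application": "billing"}}  # created before enforcement
provision("vm-1", {"application": "billing", "cost-center": "abc-xyz"}, inventory)
try:
    provision("vm-2", {"application": "billing"}, inventory)
except TagPolicyViolation as error:
    print("denied:", error)
print("to remediate:", find_violations(inventory))
```

The example also shows why the two approaches complement each other: the preventive check stops new violations, but resources that predate it still need retrospective detection.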

This phase can be considered complete when it's possible to:

  • Monitor failed provisioning requests and remediations due to tagging policy violations
  • Intervene in case of disrupted deployments due to preventive tagging actions, e.g.:
    • Creating and standardizing (if necessary) a process to handle exceptions.

Summary

Tagging is essential for refining cost allocation granularity, ensuring standardization, and building matrix dimensions. Before creating tags in your organization, follow the tagging best practices guidelines and ensure the implementation of the tagging strategy discussed in this article.

Thanks for reading! Share if you found it helpful. Have questions or suggestions for future topics? We'd love to hear from you!

Special thanks to my friend and FinOps guru Gabriele Russo, who composed many of the elements in this article.