Application Metrics: Instrumentation Guide

📅May 20, 2026

🏷️Observability

⏱️16 min

Prometheus can collect metrics from your infrastructure, but the real value comes from instrumenting your application code. This guide shows you how to add custom metrics to Flask applications using real production code from Goalixa Core-API.

Why Application Metrics Matter

Infrastructure metrics tell you what is happening (CPU, memory, disk). Application metrics tell you why it’s happening (slow database queries, failed operations, business logic performance).

Infrastructure metrics:

CPU at 80%
Memory usage climbing
Disk I/O high

Application metrics:

1,000 task creation operations/sec
Database queries taking > 1s
5% of API requests failing

The second set gives you actionable insights.

Metric Categories

Before writing code, understand what to measure:

1. RED Metrics (Requests, Errors, Duration)

The foundation of service monitoring:

Metric	Type	Example
Rate	Counter	`http_requests_total` - requests per second
Errors	Counter	`http_requests_total{status_code="500"}`
Duration	Histogram	`http_request_duration_seconds` - latency

2. USE Metrics (Utilization, Saturation, Errors)

For resources:

Metric	Type	Example
Utilization	Gauge	`db_connections_active / db_connection_pool_size`
Saturation	Gauge	`queue_depth`, `thread_pool_size`
Errors	Counter	`db_query_errors_total`

3. Business Metrics

Domain-specific operations:

Task operations (create, complete, delete)
User authentication (login, validation)
Timer operations (start, stop)
Feature usage (goals created, habits tracked)

💡 Start with RED Metrics

Every service should expose RED metrics. They answer the three most important questions:

How much traffic? (Rate)
How many errors? (Errors)
How slow? (Duration)

Add business metrics only after RED metrics are in place.

Production Example: Goalixa Core-API

Let’s walk through the actual implementation from Core-API - a Flask service handling tasks, projects, goals, and time tracking.

Architecture

app/
├── observability.py       # Metric definitions + Flask middleware
├── metrics.py            # Helper functions for recording metrics
├── service/              # Business logic using metrics
│   ├── task_service.py
│   ├── goal_service.py
│   └── project_service.py
└── repository/           # Database layer using metrics
    ├── task_repository.py
    └── ...

Step 1: Define Metrics

Create app/observability.py to define all metrics:

from prometheus_client import Counter, Histogram, Gauge, Summary, Info, generate_latest
 
# ============= HTTP Request Metrics ============
REQUESTS_TOTAL = Counter(
    "goalixa_http_requests_total",
    "Total number of HTTP requests.",
    ["method", "route", "status_code"],
)
 
REQUEST_DURATION_SECONDS = Histogram(
    "goalixa_http_request_duration_seconds",
    "HTTP request latency in seconds.",
    ["method", "route"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
 
REQUEST_SIZE_BYTES = Summary(
    "goalixa_http_request_size_bytes",
    "HTTP request size in bytes.",
    ["method", "route"]
)
 
RESPONSE_SIZE_BYTES = Summary(
    "goalixa_http_response_size_bytes",
    "HTTP response size in bytes.",
    ["method", "route", "status_code"]
)
 
REQUEST_EXCEPTIONS_TOTAL = Counter(
    "goalixa_http_request_exceptions_total",
    "Total number of request exceptions.",
    ["method", "route", "exception_type"],
)
 
ACTIVE_REQUESTS = Gauge(
    "goalixa_http_active_requests",
    "Number of active HTTP requests.",
)
 
 
# ============= Database Metrics =============
DB_QUERY_DURATION_SECONDS = Histogram(
    "goalixa_db_query_duration_seconds",
    "Database query duration in seconds.",
    ["operation", "table"],
    buckets=(0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
 
DB_QUERY_TOTAL = Counter(
    "goalixa_db_queries_total",
    "Total number of database queries.",
    ["operation", "table", "status"],
)
 
DB_CONNECTION_POOL_SIZE = Gauge(
    "goalixa_db_connection_pool_size",
    "Database connection pool size.",
)
 
DB_CONNECTIONS_ACTIVE = Gauge(
    "goalixa_db_connections_active",
    "Number of active database connections.",
)
 
 
# ============= Business Logic Metrics =============
TASK_OPERATIONS_TOTAL = Counter(
    "goalixa_task_operations_total",
    "Total number of task operations.",
    ["operation", "status"],  # operation: create, update, delete, complete
)
 
GOAL_OPERATIONS_TOTAL = Counter(
    "goalixa_goal_operations_total",
    "Total number of goal operations.",
    ["operation", "status"],
)
 
TIMER_OPERATIONS_TOTAL = Counter(
    "goalixa_timer_operations_total",
    "Total number of timer operations.",
    ["operation", "status"],  # operation: start, stop, complete
)
 
PROJECT_OPERATIONS_TOTAL = Counter(
    "goalixa_project_operations_total",
    "Total number of project operations.",
    ["operation", "status"],
)
 
 
# ============= Cache Metrics =============
CACHE_OPERATIONS_TOTAL = Counter(
    "goalixa_cache_operations_total",
    "Total number of cache operations.",
    ["operation", "status"],  # operation: get, set, delete; status: hit, miss, success, error
)
 
CACHE_DURATION_SECONDS = Histogram(
    "goalixa_cache_operation_duration_seconds",
    "Cache operation duration in seconds.",
    ["operation"],
    buckets=(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1),
)
 
 
# ============= Application Info =============
APP_INFO = Info(
    "goalixa_app_info",
    "Goalixa application information"
)

Metric Naming Best Practices

✅ Naming Conventions

Format: {namespace}_{metric_name}_{unit}_{suffix}

Examples:

goalixa_http_requests_total ← Counter (always ends in _total)
goalixa_http_request_duration_seconds ← Histogram (includes unit)
goalixa_db_connections_active ← Gauge (current state)

Rules:

Use snake_case
Include namespace prefix (goalixa_)
Add units for measurements (_seconds, _bytes)
Suffix counters with _total
Keep labels lowercase

Step 2: Register Flask Middleware

Add middleware to automatically track HTTP requests:

import os
import time
import uuid
from flask import Response, g, request
from prometheus_client import CONTENT_TYPE_LATEST
 
def register_observability(app):
    # Initialize application info
    APP_INFO.info({
        'version': os.getenv('APP_VERSION', '1.0.0'),
        'environment': os.getenv('ENVIRONMENT', 'production'),
        'service': 'goalixa-app'
    })
 
    @app.route("/metrics", methods=["GET"])
    def prometheus_metrics():
        """Expose metrics endpoint for Prometheus scraping"""
        return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
 
    @app.before_request
    def start_request_tracking():
        """Track request start time and increment active requests"""
        ACTIVE_REQUESTS.inc()
        g.request_started_at = time.perf_counter()
 
        # Generate unique request ID for tracing
        incoming_request_id = (request.headers.get("X-Request-ID") or "").strip()
        g.request_id = incoming_request_id or uuid.uuid4().hex
 
        # Track request size
        if request.content_length:
            REQUEST_SIZE_BYTES.labels(
                method=request.method,
                route=_route_label()  # same route label the other HTTP metrics use
            ).observe(request.content_length)
 
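    @app.after_request
    def complete_request_tracking(response):
        """Record request metrics after completion"""
        # ACTIVE_REQUESTS is decremented in teardown_request, which runs for every
        # request; decrementing here as well would count each request twice.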
 
        route = _route_label()
        method = request.method
        status_code = str(response.status_code)
 
        # Calculate duration
        elapsed_seconds = max(
            0.0,
            time.perf_counter() - getattr(g, "request_started_at", time.perf_counter()),
        )
 
        # Record metrics
        REQUESTS_TOTAL.labels(
            method=method,
            route=route,
            status_code=status_code
        ).inc()
 
        REQUEST_DURATION_SECONDS.labels(
            method=method,
            route=route
        ).observe(elapsed_seconds)
 
        # Track response size
        if response.content_length:
            RESPONSE_SIZE_BYTES.labels(
                method=method,
                route=route,
                status_code=status_code
            ).observe(response.content_length)
 
        # Add request ID to response headers for tracing
        request_id = getattr(g, "request_id", "")
        if request_id:
            response.headers.setdefault("X-Request-ID", request_id)
 
        return response
 
    @app.teardown_request
    def track_request_exception(error):
        """Decrement active requests and count unhandled exceptions"""
        ACTIVE_REQUESTS.dec()
        if error is None:
            return
 
        route = _route_label()
        REQUEST_EXCEPTIONS_TOTAL.labels(
            method=request.method,
            route=route,
            exception_type=error.__class__.__name__,
        ).inc()
 
 
def _route_label():
    """Extract route pattern from request"""
    if request.url_rule and request.url_rule.rule:
        return request.url_rule.rule
    return "unmatched"

What This Gives You

With just this middleware, you now have:

✅ Request rate per route
✅ Latency percentiles (P50, P95, P99)
✅ Error rate by status code
✅ Active concurrent requests
✅ Request/response size totals and averages (the Python client's Summary exports only _count and _sum, not quantiles)
✅ Exception tracking by type

Step 3: Helper Functions

Create app/metrics.py for convenient metric recording:

import time
import functools
from contextlib import contextmanager
 
# ============= Database Metrics Helpers =============
 
@contextmanager
def track_db_query(operation: str, table: str):
    """
    Context manager to track database query metrics.
 
    Usage:
        with track_db_query("SELECT", "tasks"):
            result = db.session.query(Task).all()
    """
    start_time = time.perf_counter()
    status = "success"
 
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        duration = time.perf_counter() - start_time
        DB_QUERY_DURATION_SECONDS.labels(
            operation=operation,
            table=table
        ).observe(duration)
        DB_QUERY_TOTAL.labels(
            operation=operation,
            table=table,
            status=status
        ).inc()
 
 
def track_db_query_decorator(operation: str, table: str):
    """
    Decorator to track database query metrics.
 
    Usage:
        @track_db_query_decorator("SELECT", "tasks")
        def get_all_tasks():
            return db.session.query(Task).all()
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with track_db_query(operation, table):
                return func(*args, **kwargs)
        return wrapper
    return decorator
 
 
# ============= Business Logic Metrics Helpers =============
 
def record_task_operation(operation: str, success: bool = True):
    """
    Record task operation.
 
    Args:
        operation: Operation type (create, update, delete, complete)
        success: Whether operation was successful
    """
    status = "success" if success else "failed"
    TASK_OPERATIONS_TOTAL.labels(
        operation=operation,
        status=status
    ).inc()
 
 
def record_goal_operation(operation: str, success: bool = True):
    """Record goal operation."""
    status = "success" if success else "failed"
    GOAL_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()
 
 
def record_timer_operation(operation: str, success: bool = True):
    """Record timer operation (start, stop, complete)."""
    status = "success" if success else "failed"
    TIMER_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()
 
 
def record_project_operation(operation: str, success: bool = True):
    """Record project operation."""
    status = "success" if success else "failed"
    PROJECT_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()
 
 
# ============= Cache Metrics Helpers =============
 
def record_cache_hit():
    """Record a cache hit."""
    CACHE_OPERATIONS_TOTAL.labels(operation="get", status="hit").inc()
 
 
def record_cache_miss():
    """Record a cache miss."""
    CACHE_OPERATIONS_TOTAL.labels(operation="get", status="miss").inc()
 
 
@contextmanager
def track_cache_operation(operation: str):
    """
    Context manager to track cache operation metrics.
 
    Usage:
        with track_cache_operation("set"):
            cache.set(key, value)
    """
    start_time = time.perf_counter()
    status = "success"
 
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        duration = time.perf_counter() - start_time
        CACHE_DURATION_SECONDS.labels(operation=operation).observe(duration)
        CACHE_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()

Step 4: Use Metrics in Business Logic

Now instrument your service layer:

Example: Task Service

# app/service/task_service.py
from app.metrics import record_task_operation, record_timer_operation, track_db_query

# ValidationError is assumed to be the application's own exception class,
# imported from wherever Core-API defines it.
 
class TaskService:
    def create_task(self, user_id, task_data):
        try:
            # Validate input; the except block below records the failure once
            if not task_data.get('name'):
                raise ValidationError("Task name is required")
 
            # The repository tracks the INSERT itself, so it is not wrapped again here
            task = self.repository.create_task(user_id, task_data)
 
            # Record successful operation
            record_task_operation("create", success=True)
 
            return task
 
        except Exception:
            # Record failure (validation and database errors alike)
            record_task_operation("create", success=False)
            raise
 
    def start_timer(self, user_id, task_id):
        try:
            with track_db_query("UPDATE", "time_entries"):
                entry = self.repository.start_timer(user_id, task_id)
 
            record_timer_operation("start", success=True)
            return entry
 
        except Exception:
            record_timer_operation("start", success=False)
            raise
 
    def complete_task(self, user_id, task_id):
        try:
            with track_db_query("UPDATE", "tasks"):
                task = self.repository.mark_complete(user_id, task_id)
 
            record_task_operation("complete", success=True)
            return task
 
        except Exception:
            record_task_operation("complete", success=False)
            raise
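
Each service method above repeats the same try/except shape. One way to factor it out is a small context manager that records exactly one outcome per call. This is a sketch under assumptions rather than Core-API code: track_operation is a hypothetical helper, and record_task_operation is stubbed with a list so the example runs without Prometheus.

```python
from contextlib import contextmanager

recorded = []  # stand-in for the Prometheus counters


def record_task_operation(operation, success=True):
    """Stub with the same signature as the helper in app/metrics.py."""
    recorded.append((operation, "success" if success else "failed"))


@contextmanager
def track_operation(record, operation):
    """Record exactly one success or failure for the wrapped block."""
    try:
        yield
    except Exception:
        record(operation, success=False)
        raise
    else:
        record(operation, success=True)


def complete_task(task_id):
    with track_operation(record_task_operation, "complete"):
        if task_id is None:
            raise ValueError("task_id is required")
        return {"id": task_id, "done": True}


complete_task(1)
try:
    complete_task(None)
except ValueError:
    pass
print(recorded)  # [('complete', 'success'), ('complete', 'failed')]
```

Because the outcome is recorded in a single place, a failure cannot be counted twice by an inner and an outer handler.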

Example: Repository Layer

# app/repository/task_repository.py
from app.metrics import track_db_query_decorator
 
class TaskRepository:
    @track_db_query_decorator("SELECT", "tasks")
    def get_by_id(self, task_id):
        return db.session.query(Task).filter_by(id=task_id).first()
 
    @track_db_query_decorator("SELECT", "tasks")
    def get_all_for_user(self, user_id):
        return db.session.query(Task).filter_by(user_id=user_id).all()
 
    def create_task(self, user_id, data):
        # Manual tracking for more control
        with track_db_query("INSERT", "tasks"):
            task = Task(user_id=user_id, **data)
            db.session.add(task)
            db.session.commit()
            return task

Step 5: Initialize in Application

Wire everything together in main.py:

from flask import Flask
from app.observability import register_observability, configure_logging
from app import routes
 
def create_app():
    app = Flask(__name__)
 
    # Configure logging (configure_logging is not shown in this guide)
    configure_logging()
 
    # Register observability (metrics + middleware)
    register_observability(app)
 
    # Register routes
    routes.register_routes(app)
 
    return app
 
if __name__ == "__main__":
    app = create_app()
    app.run(host="0.0.0.0", port=80)

Step 6: Expose Metrics Endpoint

The register_observability function already registers the /metrics endpoint. Test it:

# Start your application
python main.py
 
# Check metrics endpoint
curl http://localhost:80/metrics
 
# Output:
# HELP goalixa_http_requests_total Total number of HTTP requests.
# TYPE goalixa_http_requests_total counter
# goalixa_http_requests_total{method="GET",route="/api/tasks",status_code="200"} 142.0
# goalixa_http_requests_total{method="POST",route="/api/tasks",status_code="201"} 37.0
#
# HELP goalixa_http_request_duration_seconds HTTP request latency in seconds.
# TYPE goalixa_http_request_duration_seconds histogram
# goalixa_http_request_duration_seconds_bucket{le="0.005",method="GET",route="/api/tasks"} 98.0
# goalixa_http_request_duration_seconds_bucket{le="0.01",method="GET",route="/api/tasks"} 135.0
# ...

Label Selection Strategy

Every unique combination of label values creates its own time series, so each added label multiplies storage and query cost.

Good Labels (Low Cardinality)

# ✅ Good: Limited number of values
Counter("requests_total", "Total requests.", ["method", "route", "status_code"])
# method: GET, POST, PUT, DELETE (4 values)
# route: /api/tasks, /api/goals, etc. (~20 values)
# status_code: 200, 201, 400, 500, etc. (~10 values)
# Total series: 4 × 20 × 10 = 800

Bad Labels (High Cardinality)

# ❌ Bad: Unlimited values
Counter("requests_total", "Total requests.", ["user_id", "request_id", "ip_address"])
# user_id: Thousands of users
# request_id: Every request is unique
# ip_address: Thousands of IPs
# Total series: Millions → Prometheus OOM

⚠️ Cardinality Explosion

Never use these as labels:

User IDs or emails
Request IDs or trace IDs
IP addresses
Timestamps
UUIDs or any unique identifiers

Rule of thumb: If a label can have > 100 unique values, don’t use it.
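
The series arithmetic from the two examples above can be written down directly. This is a small sketch in which estimate_series is a hypothetical helper and the high-cardinality figures are assumed purely for illustration. The result is an upper bound, since not every combination of values necessarily occurs.

```python
from math import prod


def estimate_series(label_cardinalities):
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


good = {"method": 4, "route": 20, "status_code": 10}
# Assumed figures, for illustration only.
bad = {"user_id": 5_000, "request_id": 1_000_000, "ip_address": 3_000}

print(estimate_series(good))  # 800
print(estimate_series(bad))   # 15000000000000
```

For a histogram, multiply the result again by the number of buckets plus the _sum and _count series, which is why bucket lists are best kept short on heavily labelled metrics.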

Safe Label Values

Category	Safe Labels	Unsafe Labels
HTTP	method, route, status_code	user_id, request_id
Database	operation, table, status	query_text, user_id
Auth	validation_type, status	user_email, token
Business	operation_type, status	entity_id, user_name

Querying Application Metrics

Once instrumented, query your metrics in Prometheus or Grafana:

Request Rate

# Requests per second by route
sum by (route) (rate(goalixa_http_requests_total[5m]))

# Total requests per second
sum(rate(goalixa_http_requests_total[5m]))

Error Rate

# Percentage of 5xx errors
sum(rate(goalixa_http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(goalixa_http_requests_total[5m])) * 100
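
To make the arithmetic behind that query concrete, here is a simplified sketch that computes the same percentage from two counter snapshots. The error_percentage function and the snapshot values are hypothetical. Real rate() also handles counter resets and extrapolates to the window edges, which this sketch ignores.

```python
def error_percentage(before, after):
    """Percentage of 5xx responses between two counter snapshots.

    `before` and `after` map status codes to cumulative request counts.
    """
    total = sum(after.values()) - sum(before.values())
    errors = sum(
        after[code] - before.get(code, 0)
        for code in after
        if code.startswith("5")
    )
    # PromQL would yield NaN for 0/0; this sketch returns 0.0 instead.
    return 100.0 * errors / total if total else 0.0


# Hypothetical snapshots taken five minutes apart.
before = {"200": 9_000, "404": 120, "500": 40}
after = {"200": 9_950, "404": 130, "500": 80}
print(error_percentage(before, after))  # 4.0
```

Here 1,000 requests arrived in the window and 40 of them returned 500, giving 4%.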

Latency Percentiles

# P95 latency per route
histogram_quantile(0.95,
  sum(rate(goalixa_http_request_duration_seconds_bucket[5m]))
  by (route, le)
)

# P99 latency across all routes
histogram_quantile(0.99,
  sum(rate(goalixa_http_request_duration_seconds_bucket[5m]))
  by (le)
)
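
Percentiles from histogram_quantile are estimates: PromQL finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below is a simplified Python version of that idea, with made-up bucket counts. It leaves out edge cases such as negative bounds and native histograms, so treat it as an illustration rather than a reimplementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1.0s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # about 0.75
```

The estimate can never be more precise than the bucket it falls in, so bucket bounds should be placed close to the latencies you care about (for example, around an SLO threshold).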

Business Metrics

# Task creation rate
rate(goalixa_task_operations_total{operation="create"}[5m])

# Task operation success rate
sum(rate(goalixa_task_operations_total{status="success"}[5m]))
  / sum(rate(goalixa_task_operations_total[5m])) * 100

# P95 database query duration by table
histogram_quantile(0.95,
  sum(rate(goalixa_db_query_duration_seconds_bucket[5m]))
  by (table, le)
)

Common Patterns

Pattern 1: Operation Tracking

Track success/failure of operations:

def perform_operation():
    try:
        # Do work
        result = do_something()
 
        # Record success
        OPERATIONS_TOTAL.labels(operation="action", status="success").inc()
        return result
 
    except ValidationError:
        OPERATIONS_TOTAL.labels(operation="action", status="validation_error").inc()
        raise
 
    except DatabaseError:
        OPERATIONS_TOTAL.labels(operation="action", status="database_error").inc()
        raise
 
    except Exception:
        OPERATIONS_TOTAL.labels(operation="action", status="unknown_error").inc()
        raise

Pattern 2: Duration Tracking

Measure how long operations take:

import time
 
def timed_operation():
    start_time = time.perf_counter()
 
    try:
        result = do_work()
        return result
    finally:
        duration = time.perf_counter() - start_time
        OPERATION_DURATION.labels(operation="work").observe(duration)
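
The prometheus_client library also ships a timer for this pattern: Histogram.time() works as a decorator and as a context manager, and observes the elapsed time even when the wrapped code raises. A minimal sketch follows. The demo_ metric name and the private CollectorRegistry are assumptions made so the example does not collide with the application's own metrics.

```python
from prometheus_client import CollectorRegistry, Histogram

# A private registry keeps this sketch separate from the app's metrics.
registry = CollectorRegistry()
OPERATION_DURATION = Histogram(
    "demo_operation_duration_seconds",
    "Duration of demo operations in seconds.",
    ["operation"],
    registry=registry,
)


@OPERATION_DURATION.labels(operation="work").time()
def timed_operation():
    return sum(range(1000))  # stand-in for real work


def timed_block():
    with OPERATION_DURATION.labels(operation="block").time():
        return sum(range(1000))


timed_operation()
timed_block()
print(registry.get_sample_value(
    "demo_operation_duration_seconds_count", {"operation": "work"}
))  # 1.0
```

The manual try/finally version above remains useful when the label values are only known after the work finishes.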

Pattern 3: Gauge for Current State

Track current system state:

# Update gauge when connections change
def acquire_connection():
    conn = pool.get_connection()
    DB_CONNECTIONS_ACTIVE.inc()
    return conn
 
def release_connection(conn):
    pool.release(conn)
    DB_CONNECTIONS_ACTIVE.dec()
 
# Or update periodically
def update_pool_metrics():
    DB_CONNECTION_POOL_SIZE.set(pool.size)
    DB_CONNECTIONS_ACTIVE.set(pool.active_connections)

Testing Metrics

Verify metrics are working:

# tests/test_metrics.py
import pytest
from prometheus_client import REGISTRY
 
def test_request_metrics_recorded(client):
    # Make request
    response = client.get('/api/tasks')
 
    # Check metric exists (get_sample_value returns None for a missing series)
    count = REGISTRY.get_sample_value(
        'goalixa_http_requests_total',
        {'method': 'GET', 'route': '/api/tasks', 'status_code': '200'}
    )
 
    assert count is not None and count >= 1
 
def test_task_creation_recorded(client):
    # Create task
    client.post('/api/tasks', json={'name': 'Test'})
 
    # Check metric
    success_count = REGISTRY.get_sample_value(
        'goalixa_task_operations_total',
        {'operation': 'create', 'status': 'success'}
    )
 
    assert success_count >= 1

Performance Considerations

Metric Collection Overhead

Metric Type	Cost	When to Use
Counter	Very Low (~10ns)	Always safe
Gauge	Very Low (~10ns)	Always safe
Histogram	Low (~100ns)	Safe for request metrics
Summary	Varies by client (the Python client tracks only count and sum)	Use when totals and averages suffice; prefer Histogram for percentiles

✅ Performance Best Practices

Counter/Gauge are cheap - Use liberally
Histogram is efficient - Good for latencies
Summary cost depends on the client - The Python Summary only tracks count and sum; clients that compute quantiles pay more per observation
Limit label cardinality - Keep < 1000 series per metric
Don’t measure everything - Focus on actionable metrics

Troubleshooting

Metrics Not Appearing

# 1. Check /metrics endpoint exists
curl http://localhost:80/metrics
 
# 2. Check if ServiceMonitor is created
kubectl get servicemonitor -n your-namespace
 
# 3. Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Open: http://localhost:9090/targets
 
# 4. Search for your metric
# In Prometheus UI, query: goalixa_http_requests_total

High Cardinality Issues

# Find metrics with most series
topk(10, count by (__name__)({__name__=~".+"}))

# Check specific metric cardinality
count(goalixa_http_requests_total)

# If > 10,000, you have a cardinality problem

Solution: Remove high-cardinality labels (user_id, request_id, etc.)
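
When a label cannot simply be dropped, its values can be bounded before they reach the metric. The helpers below are a hypothetical sketch: bounded_label, status_class, and the allowed set are illustrative names, not part of Core-API or prometheus_client.

```python
# Assumed allow-list; a real service would list the exception types it expects.
ALLOWED_EXCEPTIONS = frozenset({"ValidationError", "DatabaseError", "TimeoutError"})


def bounded_label(value, allowed, fallback="other"):
    """Map any value outside a known set to one fallback label value."""
    return value if value in allowed else fallback


def status_class(status_code):
    """Collapse 200, 201, 204, ... into '2xx' so the label has few values."""
    return f"{int(status_code) // 100}xx"


print(bounded_label("ValidationError", ALLOWED_EXCEPTIONS))       # ValidationError
print(bounded_label("SomeRareLibraryError", ALLOWED_EXCEPTIONS))  # other
print(status_class(503))                                          # 5xx
```

Applied to a label such as exception_type, this caps the series count at the size of the allow-list plus one, whatever exceptions the code raises.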

Next Steps

Now that your application is instrumented:

Create Grafana Dashboards - Visualize your metrics
Set Up Alerts - Get notified of issues
Define SLOs - Set latency and error rate targets
Build runbooks - Document how to respond to alerts

The metrics implementation shown here powers Goalixa’s production monitoring - tracking millions of operations daily with minimal overhead.
