Application Metrics: Instrumentation Guide
Prometheus can collect metrics from your infrastructure, but the real value comes from instrumenting your application code. This guide shows you how to add custom metrics to Flask applications using real production code from Goalixa Core-API.
Why Application Metrics Matter
Infrastructure metrics tell you what is happening (CPU, memory, disk). Application metrics tell you why it’s happening (slow database queries, failed operations, business logic performance).
Infrastructure metrics:
- CPU at 80%
- Memory usage climbing
- Disk I/O high
Application metrics:
- 1,000 task creation operations/sec
- Database queries taking > 1s
- 5% of API requests failing
The second set gives you actionable insights.
Metric Categories
Before writing code, understand what to measure:
1. RED Metrics (Requests, Errors, Duration)
The foundation of service monitoring:
| Metric | Type | Example |
|---|---|---|
| Rate | Counter | http_requests_total - requests per second |
| Errors | Counter | http_requests_total{status_code="500"} |
| Duration | Histogram | http_request_duration_seconds - latency |
2. USE Metrics (Utilization, Saturation, Errors)
For resources:
| Metric | Type | Example |
|---|---|---|
| Utilization | Gauge | db_connections_active / db_connection_pool_size |
| Saturation | Gauge | queue_depth, thread_pool_size |
| Errors | Counter | db_query_errors_total |
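As a rough illustration of the utilization row, the ratio can be computed from the two gauge values. The function name and the 0.8 warning threshold below are assumptions made for this example, not part of Core-API.

```python
def pool_utilization(active: int, pool_size: int) -> float:
    """Return the fraction of the connection pool in use (0.0 to 1.0)."""
    if pool_size <= 0:
        return 0.0  # avoid dividing by zero when the pool is not configured
    return active / pool_size


utilization = pool_utilization(active=8, pool_size=10)
print(f"utilization={utilization:.0%}")  # utilization=80%

# Hypothetical threshold: flag the pool once 80% of connections are in use.
is_near_saturation = utilization >= 0.8
```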
3. Business Metrics
Domain-specific operations:
- Task operations (create, complete, delete)
- User authentication (login, validation)
- Timer operations (start, stop)
- Feature usage (goals created, habits tracked)
Every service should expose RED metrics. They answer the three most important questions:
- How much traffic? (Rate)
- How many errors? (Errors)
- How slow? (Duration)
Add business metrics only after RED metrics are in place.
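To make the Rate and Errors questions concrete, here is a minimal sketch of how a per-second rate and an error percentage can be derived from two readings of a counter. The function names and sample values are hypothetical, and the reset handling is a simplification of what Prometheus does when it evaluates rate().

```python
def per_second_rate(earlier: float, later: float, interval_seconds: float) -> float:
    """Average per-second increase of a counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Counters only go up; a lower value means the process restarted
    # and the counter started again from zero.
    increase = later - earlier if later >= earlier else later
    return increase / interval_seconds


def error_percentage(error_rate: float, total_rate: float) -> float:
    """Share of requests that failed, as a percentage."""
    return 0.0 if total_rate == 0 else error_rate * 100 / total_rate


# Hypothetical counter values from two scrapes taken 60 seconds apart.
total_rate = per_second_rate(earlier=12_000, later=18_000, interval_seconds=60)
error_rate = per_second_rate(earlier=300, later=600, interval_seconds=60)

print(total_rate)                                # 100.0 requests/sec
print(error_percentage(error_rate, total_rate))  # 5.0 percent
```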
Production Example: Goalixa Core-API
Let’s walk through the actual implementation from Core-API - a Flask service handling tasks, projects, goals, and time tracking.
Architecture
app/
├── observability.py      # Metric definitions + Flask middleware
├── metrics.py            # Helper functions for recording metrics
├── service/              # Business logic using metrics
│   ├── task_service.py
│   ├── goal_service.py
│   └── project_service.py
└── repository/           # Database layer using metrics
    ├── task_repository.py
    └── ...
Step 1: Define Metrics
Create app/observability.py to define all metrics:
from prometheus_client import (
    CONTENT_TYPE_LATEST,  # used by the /metrics endpoint in Step 2
    Counter,
    Gauge,
    Histogram,
    Info,
    Summary,
    generate_latest,
)
# ============= HTTP Request Metrics ============
REQUESTS_TOTAL = Counter(
"goalixa_http_requests_total",
"Total number of HTTP requests.",
["method", "route", "status_code"],
)
REQUEST_DURATION_SECONDS = Histogram(
"goalixa_http_request_duration_seconds",
"HTTP request latency in seconds.",
["method", "route"],
buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
REQUEST_SIZE_BYTES = Summary(
"goalixa_http_request_size_bytes",
"HTTP request size in bytes.",
["method", "route"]
)
RESPONSE_SIZE_BYTES = Summary(
"goalixa_http_response_size_bytes",
"HTTP response size in bytes.",
["method", "route", "status_code"]
)
REQUEST_EXCEPTIONS_TOTAL = Counter(
"goalixa_http_request_exceptions_total",
"Total number of request exceptions.",
["method", "route", "exception_type"],
)
ACTIVE_REQUESTS = Gauge(
"goalixa_http_active_requests",
"Number of active HTTP requests.",
)
# ============= Database Metrics =============
DB_QUERY_DURATION_SECONDS = Histogram(
"goalixa_db_query_duration_seconds",
"Database query duration in seconds.",
["operation", "table"],
buckets=(0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
DB_QUERY_TOTAL = Counter(
"goalixa_db_queries_total",
"Total number of database queries.",
["operation", "table", "status"],
)
DB_CONNECTION_POOL_SIZE = Gauge(
"goalixa_db_connection_pool_size",
"Database connection pool size.",
)
DB_CONNECTIONS_ACTIVE = Gauge(
"goalixa_db_connections_active",
"Number of active database connections.",
)
# ============= Business Logic Metrics =============
TASK_OPERATIONS_TOTAL = Counter(
"goalixa_task_operations_total",
"Total number of task operations.",
["operation", "status"], # operation: create, update, delete, complete
)
GOAL_OPERATIONS_TOTAL = Counter(
"goalixa_goal_operations_total",
"Total number of goal operations.",
["operation", "status"],
)
TIMER_OPERATIONS_TOTAL = Counter(
"goalixa_timer_operations_total",
"Total number of timer operations.",
["operation", "status"], # operation: start, stop, complete
)
PROJECT_OPERATIONS_TOTAL = Counter(
"goalixa_project_operations_total",
"Total number of project operations.",
["operation", "status"],
)
# ============= Cache Metrics =============
CACHE_OPERATIONS_TOTAL = Counter(
"goalixa_cache_operations_total",
"Total number of cache operations.",
["operation", "status"], # operation: get, set, delete; status: hit, miss, success, error
)
CACHE_DURATION_SECONDS = Histogram(
"goalixa_cache_operation_duration_seconds",
"Cache operation duration in seconds.",
["operation"],
buckets=(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1),
)
# ============= Application Info =============
APP_INFO = Info(
"goalixa_app_info",
"Goalixa application information"
)
Metric Naming Best Practices
Format: {namespace}_{metric_name}_{unit}_{suffix}
Examples:
- goalixa_http_requests_total ← Counter (always ends in _total)
- goalixa_http_request_duration_seconds ← Histogram (includes unit)
- goalixa_db_connections_active ← Gauge (current state)
Rules:
- Use snake_case
- Include namespace prefix (goalixa_)
- Add units for measurements (_seconds, _bytes)
- Suffix counters with _total
- Keep labels lowercase
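These rules are mechanical enough to check in a unit test or a lint step. The sketch below is one possible checker. The function name, the regular expression, and the "counter" type string are assumptions made for illustration, and it only covers the rules that can be verified from the name alone.

```python
import re

# Lowercase snake_case: letters, digits and underscores, starting with a letter.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def naming_problems(name: str, metric_type: str, namespace: str = "goalixa") -> list[str]:
    """Return a list of rule violations for a metric name (empty if none)."""
    problems = []
    if not _SNAKE_CASE.match(name):
        problems.append("not lowercase snake_case")
    if not name.startswith(namespace + "_"):
        problems.append(f"missing '{namespace}_' prefix")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end in '_total'")
    return problems


print(naming_problems("goalixa_http_requests_total", "counter"))  # []
print(naming_problems("HttpRequests", "counter"))                 # three problems
```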
Step 2: Register Flask Middleware
Add middleware to automatically track HTTP requests:
import os
import time
import uuid

from flask import Response, g, request


def register_observability(app):
    # Initialize application info
    APP_INFO.info({
        'version': os.getenv('APP_VERSION', '1.0.0'),
        'environment': os.getenv('ENVIRONMENT', 'production'),
        'service': 'goalixa-app'
    })

    @app.route("/metrics", methods=["GET"])
    def prometheus_metrics():
        """Expose metrics endpoint for Prometheus scraping"""
        return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

    @app.before_request
    def start_request_tracking():
        """Track request start time and increment active requests"""
        ACTIVE_REQUESTS.inc()
        g.request_started_at = time.perf_counter()
        # Reuse the caller's request ID for tracing, or generate a new one
        incoming_request_id = (request.headers.get("X-Request-ID") or "").strip()
        g.request_id = incoming_request_id or uuid.uuid4().hex
        # Track request size, using the same route label as the other HTTP metrics
        if request.content_length:
            REQUEST_SIZE_BYTES.labels(
                method=request.method,
                route=_route_label()
            ).observe(request.content_length)

    @app.after_request
    def complete_request_tracking(response):
        """Record request metrics after completion"""
        # ACTIVE_REQUESTS is decremented in teardown_request, which runs for
        # every request; decrementing here as well would count each request twice.
        route = _route_label()
        method = request.method
        status_code = str(response.status_code)
        # Calculate duration
        elapsed_seconds = max(
            0.0,
            time.perf_counter() - getattr(g, "request_started_at", time.perf_counter()),
        )
        # Record metrics
        REQUESTS_TOTAL.labels(
            method=method,
            route=route,
            status_code=status_code
        ).inc()
        REQUEST_DURATION_SECONDS.labels(
            method=method,
            route=route
        ).observe(elapsed_seconds)
        # Track response size
        if response.content_length:
            RESPONSE_SIZE_BYTES.labels(
                method=method,
                route=route,
                status_code=status_code
            ).observe(response.content_length)
        # Add request ID to response headers for tracing
        request_id = getattr(g, "request_id", "")
        if request_id:
            response.headers.setdefault("X-Request-ID", request_id)
        return response

    @app.teardown_request
    def track_request_exception(error):
        """Release the active-request slot and track failed requests"""
        ACTIVE_REQUESTS.dec()
        if error is None:
            return
        route = _route_label()
        REQUEST_EXCEPTIONS_TOTAL.labels(
            method=request.method,
            route=route,
            exception_type=error.__class__.__name__,
        ).inc()


def _route_label():
    """Return the route pattern (not the raw path) to keep label cardinality low"""
    if request.url_rule and request.url_rule.rule:
        return request.url_rule.rule
    return "unmatched"
What This Gives You
With just this middleware, you now have:
- ✅ Request rate per route
- ✅ Latency percentiles (P50, P95, P99)
- ✅ Error rate by status code
- ✅ Active concurrent requests
- ✅ Request/response size totals and averages (the Python client's Summary exposes only _count and _sum, not quantiles)
- ✅ Exception tracking by type
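The latency percentiles in this list are estimates: a histogram stores only cumulative counts per bucket, and the percentile is interpolated from those counts at query time. The sketch below imitates that idea with the standard library only, using the bucket boundaries from REQUEST_DURATION_SECONDS. The function names and the sample latencies are hypothetical, and the interpolation is a simplified version of what Prometheus's histogram_quantile does, not its exact implementation.

```python
import bisect
from itertools import accumulate

# Upper bounds from REQUEST_DURATION_SECONDS; a final +Inf bucket is implied.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


def cumulative_counts(observations, buckets=BUCKETS):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    per_bucket = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
    for value in observations:
        # Index of the first bucket whose upper bound is >= value.
        per_bucket[bisect.bisect_left(buckets, value)] += 1
    return list(accumulate(per_bucket))


def estimate_quantile(q, cumulative, buckets=BUCKETS):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            break
    if i == len(buckets):  # rank falls in the +Inf bucket
        return buckets[-1]
    lower = buckets[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    return lower + (buckets[i] - lower) * (rank - below) / (count - below)


# Hypothetical sample: 50 fast, 40 medium and 10 slow requests (seconds).
latencies = [0.003] * 50 + [0.02] * 40 + [0.2] * 10
counts = cumulative_counts(latencies)

print(estimate_quantile(0.50, counts))  # 0.005
print(estimate_quantile(0.95, counts))  # about 0.175; the real P95 sample is 0.2
```

The gap between the estimate and the real value shows why bucket boundaries should sit close to the latencies you care about.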
Step 3: Helper Functions
Create app/metrics.py for convenient metric recording:
import functools
import time
from contextlib import contextmanager

from app.observability import (
    CACHE_DURATION_SECONDS,
    CACHE_OPERATIONS_TOTAL,
    DB_QUERY_DURATION_SECONDS,
    DB_QUERY_TOTAL,
    GOAL_OPERATIONS_TOTAL,
    PROJECT_OPERATIONS_TOTAL,
    TASK_OPERATIONS_TOTAL,
    TIMER_OPERATIONS_TOTAL,
)


# ============= Database Metrics Helpers =============
@contextmanager
def track_db_query(operation: str, table: str):
    """
    Context manager to track database query metrics.

    Usage:
        with track_db_query("SELECT", "tasks"):
            result = db.session.query(Task).all()
    """
    start_time = time.perf_counter()
    status = "success"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        # Runs on both success and failure, so every query is timed and counted
        duration = time.perf_counter() - start_time
        DB_QUERY_DURATION_SECONDS.labels(
            operation=operation,
            table=table
        ).observe(duration)
        DB_QUERY_TOTAL.labels(
            operation=operation,
            table=table,
            status=status
        ).inc()


def track_db_query_decorator(operation: str, table: str):
    """
    Decorator to track database query metrics.

    Usage:
        @track_db_query_decorator("SELECT", "tasks")
        def get_all_tasks():
            return db.session.query(Task).all()
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with track_db_query(operation, table):
                return func(*args, **kwargs)
        return wrapper
    return decorator


# ============= Business Logic Metrics Helpers =============
def record_task_operation(operation: str, success: bool = True):
    """
    Record task operation.

    Args:
        operation: Operation type (create, update, delete, complete)
        success: Whether operation was successful
    """
    status = "success" if success else "failed"
    TASK_OPERATIONS_TOTAL.labels(
        operation=operation,
        status=status
    ).inc()


def record_goal_operation(operation: str, success: bool = True):
    """Record goal operation."""
    status = "success" if success else "failed"
    GOAL_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()


def record_timer_operation(operation: str, success: bool = True):
    """Record timer operation (start, stop, complete)."""
    status = "success" if success else "failed"
    TIMER_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()


def record_project_operation(operation: str, success: bool = True):
    """Record project operation."""
    status = "success" if success else "failed"
    PROJECT_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()


# ============= Cache Metrics Helpers =============
def record_cache_hit():
    """Record a cache hit."""
    CACHE_OPERATIONS_TOTAL.labels(operation="get", status="hit").inc()


def record_cache_miss():
    """Record a cache miss."""
    CACHE_OPERATIONS_TOTAL.labels(operation="get", status="miss").inc()


@contextmanager
def track_cache_operation(operation: str):
    """
    Context manager to track cache operation metrics.

    Usage:
        with track_cache_operation("set"):
            cache.set(key, value)
    """
    start_time = time.perf_counter()
    status = "success"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        duration = time.perf_counter() - start_time
        CACHE_DURATION_SECONDS.labels(operation=operation).observe(duration)
        CACHE_OPERATIONS_TOTAL.labels(operation=operation, status=status).inc()
Step 4: Use Metrics in Business Logic
Now instrument your service layer:
Example: Task Service
# app/service/task_service.py
from app.metrics import record_task_operation, record_timer_operation, track_db_query


class TaskService:
    def create_task(self, user_id, task_data):
        try:
            # Validate input; the except block below records the failure once
            if not task_data.get('name'):
                raise ValidationError("Task name is required")
            # The repository already wraps this INSERT in track_db_query,
            # so it is not tracked a second time here
            task = self.repository.create_task(user_id, task_data)
            # Record successful operation
            record_task_operation("create", success=True)
            return task
        except Exception:
            # Record failure (validation and database errors alike)
            record_task_operation("create", success=False)
            raise

    def start_timer(self, user_id, task_id):
        try:
            with track_db_query("UPDATE", "time_entries"):
                entry = self.repository.start_timer(user_id, task_id)
            record_timer_operation("start", success=True)
            return entry
        except Exception:
            record_timer_operation("start", success=False)
            raise

    def complete_task(self, user_id, task_id):
        try:
            with track_db_query("UPDATE", "tasks"):
                task = self.repository.mark_complete(user_id, task_id)
            record_task_operation("complete", success=True)
            return task
        except Exception:
            record_task_operation("complete", success=False)
            raise
Example: Repository Layer
# app/repository/task_repository.py
from app.metrics import track_db_query, track_db_query_decorator


class TaskRepository:
    @track_db_query_decorator("SELECT", "tasks")
    def get_by_id(self, task_id):
        return db.session.query(Task).filter_by(id=task_id).first()

    @track_db_query_decorator("SELECT", "tasks")
    def get_all_for_user(self, user_id):
        return db.session.query(Task).filter_by(user_id=user_id).all()

    def create_task(self, user_id, data):
        # Manual tracking for more control
        with track_db_query("INSERT", "tasks"):
            task = Task(user_id=user_id, **data)
            db.session.add(task)
            db.session.commit()
        return task
Step 5: Initialize in Application
Wire everything together in main.py:
from flask import Flask

from app.observability import register_observability, configure_logging
from app import routes


def create_app():
    app = Flask(__name__)
    # Configure logging
    configure_logging()
    # Register observability (metrics + middleware)
    register_observability(app)
    # Register routes
    routes.register_routes(app)
    return app


if __name__ == "__main__":
    app = create_app()
    app.run(host="0.0.0.0", port=80)
Step 6: Expose Metrics Endpoint
The middleware already creates the /metrics endpoint. Test it:
# Start your application
python main.py
# Check metrics endpoint
curl http://localhost:80/metrics
# Output:
# HELP goalixa_http_requests_total Total number of HTTP requests.
# TYPE goalixa_http_requests_total counter
# goalixa_http_requests_total{method="GET",route="/api/tasks",status_code="200"} 142.0
# goalixa_http_requests_total{method="POST",route="/api/tasks",status_code="201"} 37.0
#
# HELP goalixa_http_request_duration_seconds HTTP request latency in seconds.
# TYPE goalixa_http_request_duration_seconds histogram
# goalixa_http_request_duration_seconds_bucket{le="0.005",method="GET",route="/api/tasks"} 98.0
# goalixa_http_request_duration_seconds_bucket{le="0.01",method="GET",route="/api/tasks"} 135.0
# ...
Label Selection Strategy
Each unique combination of label values creates its own time series. More labels = more storage and slower queries.
Good Labels (Low Cardinality)
# ✅ Good: Limited number of values
Counter("requests_total", ["method", "route", "status_code"])
# method: GET, POST, PUT, DELETE (4 values)
# route: /api/tasks, /api/goals, etc. (~20 values)
# status_code: 200, 201, 400, 500, etc. (~10 values)
# Total series: 4 × 20 × 10 = 800
Bad Labels (High Cardinality)
# ❌ Bad: Unlimited values
Counter("requests_total", ["user_id", "request_id", "ip_address"])
# user_id: Thousands of users
# request_id: Every request is unique
# ip_address: Thousands of IPs
# Total series: Millions → Prometheus OOM
Never use these as labels:
- User IDs or emails
- Request IDs or trace IDs
- IP addresses
- Timestamps
- UUIDs or any unique identifiers
Rule of thumb: If a label can have > 100 unique values, don’t use it.
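The series arithmetic from the examples above is easy to script when reviewing a new metric. This is a minimal sketch: series_count is a hypothetical helper, the result is a worst-case upper bound (not every label combination occurs in practice), and the histogram figure ignores any extra series a particular client library may add.

```python
import math


def series_count(label_cardinalities: dict[str, int], samples_per_series: int = 1) -> int:
    """Worst-case number of time series for one metric."""
    return math.prod(label_cardinalities.values()) * samples_per_series


# Counter from the "Good Labels" example: 4 × 20 × 10.
counter_series = series_count({"method": 4, "route": 20, "status_code": 10})
print(counter_series)  # 800

# A histogram multiplies every label combination by its bucket series:
# 11 configured buckets + the +Inf bucket + _sum + _count = 14.
histogram_series = series_count({"method": 4, "route": 20}, samples_per_series=14)
print(histogram_series)  # 1120
```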
Safe Label Values
| Category | Safe Labels | Unsafe Labels |
|---|---|---|
| HTTP | method, route, status_code | user_id, request_id |
| Database | operation, table, status | query_text, user_id |
| Auth | validation_type, status | user_email, token |
| Business | operation_type, status | entity_id, user_name |
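When a label value comes from user input or another unbounded source, one way to keep it safe is to map anything outside a known set to a single fallback value. The helper below is a hypothetical sketch of that idea; the allowed set mirrors the task operations used in this guide, and the "other" fallback is an assumption.

```python
ALLOWED_OPERATIONS = frozenset({"create", "update", "delete", "complete"})


def bounded_label(value: str, allowed: frozenset[str], fallback: str = "other") -> str:
    """Return a normalized label value, or the fallback if it is not allowed."""
    normalized = value.strip().lower()
    return normalized if normalized in allowed else fallback


print(bounded_label(" Create ", ALLOWED_OPERATIONS))  # create
print(bounded_label("user-42", ALLOWED_OPERATIONS))   # other
```

With this in place the label can never take more than len(allowed) + 1 distinct values, no matter what callers pass in.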
Querying Application Metrics
Once instrumented, query your metrics in Prometheus or Grafana:
Request Rate
# Requests per second by route
sum by (route) (rate(goalixa_http_requests_total[5m]))
# Total requests per second
sum(rate(goalixa_http_requests_total[5m]))
Error Rate
# Percentage of 5xx errors
sum(rate(goalixa_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(goalixa_http_requests_total[5m])) * 100
Latency Percentiles
# P95 latency per route
histogram_quantile(0.95,
sum(rate(goalixa_http_request_duration_seconds_bucket[5m]))
by (route, le)
)
# P99 latency across all routes
histogram_quantile(0.99,
sum(rate(goalixa_http_request_duration_seconds_bucket[5m]))
by (le)
)
Business Metrics
# Task creation rate
rate(goalixa_task_operations_total{operation="create"}[5m])
# Task operation success rate
sum(rate(goalixa_task_operations_total{status="success"}[5m]))
/ sum(rate(goalixa_task_operations_total[5m])) * 100
# Database query duration by table
histogram_quantile(0.95,
sum(rate(goalixa_db_query_duration_seconds_bucket[5m]))
by (table, le)
)
Common Patterns
Pattern 1: Operation Tracking
Track success/failure of operations:
def perform_operation():
    try:
        # Do work
        result = do_something()
        # Record success
        OPERATIONS_TOTAL.labels(operation="action", status="success").inc()
        return result
    except ValidationError:
        OPERATIONS_TOTAL.labels(operation="action", status="validation_error").inc()
        raise
    except DatabaseError:
        OPERATIONS_TOTAL.labels(operation="action", status="database_error").inc()
        raise
    except Exception:
        OPERATIONS_TOTAL.labels(operation="action", status="unknown_error").inc()
        raise
Pattern 2: Duration Tracking
Measure how long operations take:
import time


def timed_operation():
    start_time = time.perf_counter()
    try:
        result = do_work()
        return result
    finally:
        # Runs whether do_work() returns or raises, so failures are timed too
        duration = time.perf_counter() - start_time
        OPERATION_DURATION.labels(operation="work").observe(duration)
Pattern 3: Gauge for Current State
Track current system state:
# Update gauge when connections change
def acquire_connection():
    conn = pool.get_connection()
    DB_CONNECTIONS_ACTIVE.inc()
    return conn


def release_connection(conn):
    pool.release(conn)
    DB_CONNECTIONS_ACTIVE.dec()


# Or update periodically
def update_pool_metrics():
    DB_CONNECTION_POOL_SIZE.set(pool.size)
    DB_CONNECTIONS_ACTIVE.set(pool.active_connections)
Testing Metrics
Verify metrics are working:
# tests/test_metrics.py
import pytest
from prometheus_client import REGISTRY


def test_request_metrics_recorded(client):
    labels = {'method': 'GET', 'route': '/api/tasks', 'status_code': '200'}
    # get_sample_value returns None until the series has been observed once
    before = REGISTRY.get_sample_value('goalixa_http_requests_total', labels) or 0
    # Make request
    response = client.get('/api/tasks')
    assert response.status_code == 200
    # Check the counter went up by exactly one
    after = REGISTRY.get_sample_value('goalixa_http_requests_total', labels)
    assert after == before + 1


def test_task_creation_recorded(client):
    labels = {'operation': 'create', 'status': 'success'}
    before = REGISTRY.get_sample_value('goalixa_task_operations_total', labels) or 0
    # Create task
    client.post('/api/tasks', json={'name': 'Test'})
    # Check metric
    after = REGISTRY.get_sample_value('goalixa_task_operations_total', labels)
    assert after == before + 1
Performance Considerations
Metric Collection Overhead
| Metric Type | Approximate Cost per Operation | When to Use |
|---|---|---|
| Counter | Very Low (~10ns) | Always safe |
| Gauge | Very Low (~10ns) | Always safe |
| Histogram | Low (~100ns) | Safe for request metrics |
| Summary | Medium (~1µs) | Use sparingly, prefer Histogram |
- Counter/Gauge are cheap - Use liberally
- Histogram is efficient - Good for latencies
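- Summary is expensive in clients that compute quantiles on the client side (the Python client's Summary records only a count and a sum) - Avoid in hot paths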
- Limit label cardinality - Keep < 1000 series per metric
- Don’t measure everything - Focus on actionable metrics
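The cost figures above depend heavily on the client library and runtime, so it can be worth measuring them yourself. The micro-benchmark below is a sketch of one way to do that with the standard library. LockedCounter is a hypothetical stand-in for a metric object, not the prometheus_client implementation; pass a real metric's inc or observe method to nanoseconds_per_call to measure it instead.

```python
import threading
import timeit


class LockedCounter:
    """Stand-in for a metrics counter: a float guarded by a lock."""

    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value += amount


def nanoseconds_per_call(func, calls: int = 100_000) -> float:
    """Average wall-clock cost of calling func(), in nanoseconds."""
    seconds = timeit.timeit(func, number=calls)
    return seconds / calls * 1e9


counter = LockedCounter()
print(f"{nanoseconds_per_call(counter.inc):.0f} ns per inc()")
```

The printed number varies by machine and Python version, which is the point: compare it with your request latency budget rather than with a fixed table.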
Troubleshooting
Metrics Not Appearing
# 1. Check /metrics endpoint exists
curl http://localhost:80/metrics
# 2. Check if ServiceMonitor is created
kubectl get servicemonitor -n your-namespace
# 3. Check Prometheus targets
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Open: http://localhost:9090/targets
# 4. Search for your metric
# In Prometheus UI, query: goalixa_http_requests_total
High Cardinality Issues
# Find metrics with most series
topk(10, count by (__name__)({__name__=~".+"}))
# Check specific metric cardinality
count(goalixa_http_requests_total)
# If > 10,000, you have a cardinality problem
Solution: Remove high-cardinality labels (user_id, request_id, etc.)
Next Steps
Now that your application is instrumented:
- Create Grafana Dashboards - Visualize your metrics
- Set Up Alerts - Get notified of issues
- Define SLOs - Set latency and error rate targets
- Build runbooks - Document how to respond to alerts
The metrics implementation shown here powers Goalixa’s production monitoring - tracking millions of operations daily with minimal overhead.