Monitoring & Metrics

Track performance and usage with the metrics API

Track your Redis instance's performance and usage. Understand what metrics mean, how to fetch them, and how to act on what they tell you.

Available Metrics

SwiftCache tracks the last 24 hours of detailed metrics for every instance. All metrics are hourly aggregates.

Metrics Overview

Metric     Unit    What It Measures
---------  ------  -----------------------------
commands   Count   Total Redis commands executed
bytesIn    Bytes   Data sent TO your instance
bytesOut   Bytes   Data sent FROM your instance
hits       Count   Cache hits (key existed)
misses     Count   Cache misses (key not found)

Understanding Each Metric

Commands: Every GET, SET, DEL, etc. is one command. Higher = more load.

4,000 commands per hour = ~1 command per second average
100,000 commands per hour = ~28 commands per second
1,000,000 commands per hour = ~278 commands per second (heavy load)
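
The conversion is simply the hourly count divided by 3,600 seconds. A one-line helper (a convenience sketch, not part of any SDK) makes the arithmetic explicit:

const commandsPerSecond = (hourlyCount: number): number => hourlyCount / 3600;

commandsPerSecond(100_000); // ~27.8 commands/second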

Bytes In: Total size of data written to the cache. Helps you understand write volume.

10 MB per hour = Small cache, light usage
100 MB per hour = Moderate usage
1 GB per hour = Heavy write workload

Bytes Out: Total size of data read from the cache. Indicates read patterns.

bytesOut > bytesIn = Many reads, good cache utilization
bytesOut < bytesIn = Light reads, data sitting in cache
bytesOut >> bytesIn = Hot data with repetitive reads

Hits vs Misses: The most important metrics. Together they determine your hit ratio.

1000 hits + 0 misses = 100% hit ratio (perfect cache)
800 hits + 200 misses = 80% hit ratio (good)
500 hits + 500 misses = 50% hit ratio (cache barely helping)
100 hits + 900 misses = 10% hit ratio (cache not helping; the data may not be cacheable)

Fetching Metrics from the API

Basic Metrics Request

curl -H "Authorization: Bearer sk_live_your_api_key" \
  https://api.swiftcache.io/api/v1/instances/inst_abc123/metrics

Full Response Example

{
  "metrics": [
    {
      "time": "2024-03-15T10:00:00Z",
      "commands": 12500,
      "bytesIn": 1048576,
      "bytesOut": 2097152,
      "hits": 9800,
      "misses": 2700
    },
    {
      "time": "2024-03-15T11:00:00Z",
      "commands": 14200,
      "bytesIn": 1234567,
      "bytesOut": 2468013,
      "hits": 11200,
      "misses": 3000
    },
    // ... more hourly data for last 24 hours
  ],
  "current": {
    "commands": 14200,
    "bytesIn": 1234567,
    "bytesOut": 2468013,
    "hits": 11200,
    "misses": 3000
  }
}
  • metrics array: Last 24 hours of hourly data (up to 24 objects)
  • current object: Most recent hour (not an average, the latest data point)

TypeScript Metrics Fetcher

interface Metric {
  time: string;
  commands: number;
  bytesIn: number;
  bytesOut: number;
  hits: number;
  misses: number;
}

interface MetricsResponse {
  metrics: Metric[];
  current: Metric;
}

async function getMetrics(instanceId: string, apiKey: string): Promise<MetricsResponse> {
  const response = await fetch(
    `https://api.swiftcache.io/api/v1/instances/${instanceId}/metrics`,
    {
      headers: { 'Authorization': `Bearer ${apiKey}` }
    }
  );

  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }

  return await response.json();
}

// Usage
const metrics = await getMetrics('inst_abc123', 'sk_live_xxx');
console.log('Last hour:', metrics.current);
console.log('Last 24 hours:', metrics.metrics.length, 'data points');

Python Metrics Fetcher

import requests
from typing import List, Dict

def get_metrics(instance_id: str, api_key: str) -> Dict:
    """Fetch last 24 hours of metrics for an instance"""
    response = requests.get(
        f'https://api.swiftcache.io/api/v1/instances/{instance_id}/metrics',
        headers={'Authorization': f'Bearer {api_key}'}
    )
    response.raise_for_status()
    return response.json()

# Usage
data = get_metrics('inst_abc123', 'sk_live_xxx')
print(f"Data points: {len(data['metrics'])}")
print(f"Current commands: {data['current']['commands']}")

Analyzing Your Metrics

Calculate Hit Ratio

The hit ratio is the most important metric. It tells you how often your cache helps.

function calculateHitRatio(metrics: Metric[]): number {
  const totalHits = metrics.reduce((sum, m) => sum + m.hits, 0);
  const totalMisses = metrics.reduce((sum, m) => sum + m.misses, 0);
  const total = totalHits + totalMisses;

  if (total === 0) return 0;
  return totalHits / total;
}

const hitRatio = calculateHitRatio(metrics.metrics);
console.log(`Hit ratio: ${(hitRatio * 100).toFixed(1)}%`);

// Interpretation:
// > 90% = Excellent, your cache is doing its job
// 80-90% = Good, mostly serving from cache
// 70-80% = Acceptable, but room for improvement
// 60-70% = Poor, more misses than desired
// < 60% = Cache barely helping, needs investigation
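
To use those interpretation bands in code rather than comments, a small mapping helper (a convenience sketch, not part of the API) works:

function interpretHitRatio(ratio: number): string {
  if (ratio > 0.9) return 'Excellent';
  if (ratio >= 0.8) return 'Good';
  if (ratio >= 0.7) return 'Acceptable';
  if (ratio >= 0.6) return 'Poor';
  return 'Needs investigation';
}

console.log(`Hit ratio: ${(hitRatio * 100).toFixed(1)}% (${interpretHitRatio(hitRatio)})`);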

Python Hit Ratio

def calculate_hit_ratio(metrics: List[Dict]) -> float:
    total_hits = sum(m['hits'] for m in metrics)
    total_misses = sum(m['misses'] for m in metrics)
    total = total_hits + total_misses

    if total == 0:
        return 0
    return total_hits / total

ratio = calculate_hit_ratio(data['metrics'])
print(f"Hit ratio: {ratio * 100:.1f}%")

Average Command Rate

Track how many commands per second you're executing:

function getAverageCommandsPerSecond(metrics: Metric[]): number {
  const totalCommands = metrics.reduce((sum, m) => sum + m.commands, 0);
  const seconds = metrics.length * 3600; // Hours to seconds
  return totalCommands / seconds;
}

const avgCps = getAverageCommandsPerSecond(metrics.metrics);
console.log(`Average: ${avgCps.toFixed(1)} commands/second`);

// Peak command rate
const peakMetric = metrics.metrics.reduce((max, m) =>
  m.commands > max.commands ? m : max
);
const peakCps = peakMetric.commands / 3600;
console.log(`Peak: ${peakCps.toFixed(1)} commands/second at ${peakMetric.time}`);

Data Volume Analysis

Understand your data flow patterns:

function analyzeDataFlow(metrics: Metric[]) {
  const totalIn = metrics.reduce((sum, m) => sum + m.bytesIn, 0);
  const totalOut = metrics.reduce((sum, m) => sum + m.bytesOut, 0);

  const inMb = totalIn / 1024 / 1024;
  const outMb = totalOut / 1024 / 1024;

  console.log(`Last 24 hours:`);
  console.log(`  Data in: ${inMb.toFixed(1)} MB`);
  console.log(`  Data out: ${outMb.toFixed(1)} MB`);
  console.log(`  Ratio: ${(outMb / inMb).toFixed(2)}x`);

  // Interpretation:
  if (outMb > inMb * 3) {
    console.log("  Insight: Hot data, lots of reads");
  } else if (outMb < inMb * 0.5) {
    console.log("  Insight: Write-heavy, not much re-reading");
  }
}

Setting Up Monitoring Alerts

TypeScript Alert System

interface AlertRule {
  metric: 'hitRatio' | 'commandRate' | 'memory';
  threshold: number;
  operator: '<' | '>' | '==';
  action: (value: number) => void;
}

async function checkMetricsAndAlert(
  instanceId: string,
  apiKey: string,
  rules: AlertRule[]
) {
  const data = await getMetrics(instanceId, apiKey);
  const metrics = data.metrics;

  // Check each rule
  for (const rule of rules) {
    let value: number;

    if (rule.metric === 'hitRatio') {
      value = calculateHitRatio(metrics);
    } else if (rule.metric === 'commandRate') {
      value = getAverageCommandsPerSecond(metrics);
    } else {
      continue;
    }

    // Evaluate rule
    const triggered =
      rule.operator === '<' ? value < rule.threshold :
      rule.operator === '>' ? value > rule.threshold :
      value === rule.threshold;

    if (triggered) {
      rule.action(value);
    }
  }
}

// Define alert rules
const alerts: AlertRule[] = [
  {
    metric: 'hitRatio',
    threshold: 0.8,
    operator: '<',
    action: (ratio) => {
      console.warn(`Alert: Hit ratio ${(ratio * 100).toFixed(1)}% below 80%`);
      // Send Slack/email/PagerDuty alert
    }
  },
  {
    metric: 'commandRate',
    threshold: 10000,
    operator: '>',
    action: (rate) => {
      console.warn(`Alert: Command rate ${rate.toFixed(0)}/sec above 10k`);
      // Trigger auto-scaling or alert
    }
  }
];

// Check every 5 minutes
setInterval(() => {
  checkMetricsAndAlert('inst_abc123', 'sk_live_xxx', alerts);
}, 5 * 60 * 1000);
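
The action callbacks above only log to the console. One concrete option is posting to a Slack incoming webhook; a sketch, with a placeholder webhook URL you would create in Slack:

async function sendSlackAlert(message: string): Promise<void> {
  // Placeholder URL: create a real one at api.slack.com/messaging/webhooks
  const webhookUrl = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL';

  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message })
  });
}

// Wire it into a rule:
// action: (ratio) => sendSlackAlert(`Hit ratio dropped to ${(ratio * 100).toFixed(1)}%`)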

Python Alert System

from typing import Callable, List
from dataclasses import dataclass
from enum import Enum
import schedule
import time

class Operator(Enum):
    LESS_THAN = '<'
    GREATER_THAN = '>'
    EQUAL = '=='

@dataclass
class AlertRule:
    metric: str
    threshold: float
    operator: Operator
    action: Callable

def check_alerts(instance_id: str, api_key: str, rules: List[AlertRule]):
    """Check metrics against alert rules"""
    data = get_metrics(instance_id, api_key)
    metrics = data['metrics']

    # Calculate metrics
    hit_ratio = calculate_hit_ratio(metrics)
    avg_cps = sum(m['commands'] for m in metrics) / (len(metrics) * 3600)

    # Check rules
    for rule in rules:
        if rule.metric == 'hitRatio':
            value = hit_ratio
        elif rule.metric == 'commandRate':
            value = avg_cps
        else:
            continue

        # Evaluate
        triggered = (
            (rule.operator == Operator.LESS_THAN and value < rule.threshold) or
            (rule.operator == Operator.GREATER_THAN and value > rule.threshold)
        )

        if triggered:
            rule.action(value)

# Define rules
alerts = [
    AlertRule(
        metric='hitRatio',
        threshold=0.75,
        operator=Operator.LESS_THAN,
        action=lambda v: print(f"Alert: Hit ratio {v*100:.1f}% is low!")
    ),
    AlertRule(
        metric='commandRate',
        threshold=5000,
        operator=Operator.GREATER_THAN,
        action=lambda v: print(f"Alert: High load {v:.0f} commands/sec!")
    )
]

# Schedule checks
schedule.every(5).minutes.do(
    check_alerts,
    'inst_abc123',
    'sk_live_xxx',
    alerts
)

while True:
    schedule.run_pending()
    time.sleep(1)

Common Metrics Scenarios

Scenario 1: Low Hit Ratio (Cache Not Helping)

Metrics show:

  • Hit ratio: 45%
  • Hits: 4,500
  • Misses: 5,500

What it means: Almost as many misses as hits. Your cache isn't helping much.

Causes:

  • Wrong TTL (expiring too fast)
  • Wrong data cached (cacheable data not in cache)
  • Cache too small (evicting frequently)
  • Data is not cacheable (constantly changing, never requested twice)

Solutions:

  1. Check your TTL settings:

    // Are you setting long enough TTLs?
    redis.set('key', 'value', 'EX', 3600); // 1 hour
    redis.set('key', 'value', 'EX', 86400); // 1 day, a better fit for stable data
    
  2. Verify correct data is cached:

    # What keys are in your cache?
    redis-cli KEYS "*" | head -100   # KEYS blocks the server; debugging only
    redis-cli SCAN 0 COUNT 100       # SCAN is safe for production
    
  3. Check memory - if evicting, your instance is too small:

    # A climbing miss rate alongside steady writes often indicates eviction
    curl -H "Authorization: Bearer sk_live_xxx" \
      https://api.swiftcache.io/api/v1/instances/inst_abc123/metrics
    
  4. Consider if data is cacheable - some data patterns can't be cached effectively. If the data is cacheable, the read-through sketch below shows the standard pattern.
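
A minimal cache-aside (read-through) sketch, assuming ioredis and a hypothetical fetchFromDatabase helper you'd replace with your own data access:

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Hypothetical loader for the underlying data store
async function fetchFromDatabase(key: string): Promise<string> {
  return `value-for-${key}`; // ... query your database here
}

async function getWithCache(key: string): Promise<string> {
  // 1. Try the cache first
  const cached = await redis.get(key);
  if (cached !== null) return cached; // counts as a hit

  // 2. On a miss, load from the source of truth
  const value = await fetchFromDatabase(key);

  // 3. Write back with a TTL that matches the data's stability
  await redis.set(key, value, 'EX', 3600);
  return value;
}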

Scenario 2: High Command Rate Spike

Metrics show:

  • Peak commands: 50,000 per hour (~14 commands/second)
  • Normally: 10,000 per hour (~3 commands/second)

What it means: 5x increase in load. Something changed.

Causes:

  • New feature deployed
  • Traffic spike (viral post, sale, etc.)
  • Thundering herd (many clients retrying after an error; see the backoff sketch after these solutions)
  • Inefficient code (N+1 queries)

Solutions:

  1. Check if this is expected:

    # Did you deploy something?
    git log -1 --oneline
    
    # Check application logs during the spike time
    grep "2024-03-15T14:" app.log | wc -l
    
  2. If unexpected, investigate root cause:

    // Sample live commands with MONITOR (ioredis shown; MONITOR adds load,
    // so run it briefly and only while debugging)
    const monitor = await redisClient.monitor();
    monitor.on('monitor', (time, args) => {
      if (args[0].toLowerCase() === 'get') {
        console.log('GET:', args[1]);
      }
    });
    
  3. Optimize if possible:

    // Batch operations
    const results = await redis.mget(['key1', 'key2', 'key3']);
    
    // Use pipelines
    const pipeline = redis.pipeline();
    for (const key of keys) {
      pipeline.get(key);
    }
    await pipeline.exec();
    
  4. Scale up instance if load is sustained:

    # Increase memory/capacity
    curl -X PATCH \
      -H "Authorization: Bearer sk_live_xxx" \
      -H "Content-Type: application/json" \
      -d '{"maxMemory": 1073741824}' \
      https://api.swiftcache.io/api/v1/instances/inst_abc123
    

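If the spike traces back to a thundering herd, adding jittered exponential backoff to client retries spreads the reconnect load. A minimal sketch:

async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Exponential backoff with jitter: the base doubles each attempt,
      // and the random spread keeps clients from retrying in lockstep
      const base = 100 * 2 ** attempt;
      const delay = base / 2 + Math.random() * base;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const value = await withBackoff(() => redis.get('hot:key'));
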
Scenario 3: Asymmetric Data Flow (bytesOut >> bytesIn)

Metrics show:

  • Bytes in: 50 MB per hour
  • Bytes out: 500 MB per hour
  • Ratio: 10x more data flowing out

What it means: You're reading far more than you're writing. Data is written once and read many times.

Interpretation: Actually good! Your cache is being used effectively for hot data.

No action needed unless:

  • You expected more balanced traffic
  • Memory usage is growing (data not expiring)

Scenario 4: Zero Hits (Completely Dead Cache)

Metrics show:

  • Hits: 0
  • Misses: Everything

Causes:

  • Cache is not being used (code path bypassed)
  • All keys expired (TTL too short)
  • Instance just created (no data yet)
  • Wrong hostname/key (connecting elsewhere)

Debug:

# Check if there's any data in cache
redis-cli KEYS "*" | wc -l  # Should be > 0

# Check if keys are expiring too fast
redis-cli TTL mykey  # Positive = seconds left; -1 = no expiry; -2 = key doesn't exist

# Check if using correct instance
echo $REDIS_HOSTNAME  # Verify it's correct
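
You can also verify connectivity from application code. A quick sanity check, assuming ioredis and your instance URL in REDIS_URL:

import Redis from 'ioredis';

async function sanityCheck(): Promise<void> {
  const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

  console.log('PING:', await redis.ping());             // expect "PONG"
  await redis.set('sanity:test', 'ok', 'EX', 60);       // short-lived test key
  console.log('GET:', await redis.get('sanity:test'));  // expect "ok" (a hit)

  redis.disconnect();
}

sanityCheck();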

Building a Metrics Dashboard

Simple HTML Dashboard

<!DOCTYPE html>
<html>
<head>
  <title>SwiftCache Metrics</title>
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
  <style>
    body { font-family: Arial; margin: 20px; }
    .metric { display: inline-block; margin: 20px; padding: 10px; border: 1px solid #ddd; }
    canvas { max-width: 600px; }
  </style>
</head>
<body>
  <h1>Instance Metrics</h1>

  <div class="metric">
    <h3>Hit Ratio</h3>
    <p id="hitRatio">Loading...</p>
  </div>

  <div class="metric">
    <h3>Commands/sec</h3>
    <p id="commandRate">Loading...</p>
  </div>

  <canvas id="commandChart"></canvas>

  <script>
    async function loadMetrics() {
      // Fetch through your own backend so the API key never reaches the browser
      const response = await fetch(
        '/api/metrics?instanceId=inst_abc123'
      );
      const data = await response.json();

      // Calculate hit ratio
      const hits = data.metrics.reduce((s, m) => s + m.hits, 0);
      const misses = data.metrics.reduce((s, m) => s + m.misses, 0);
      const ratio = (hits / (hits + misses) * 100).toFixed(1);

      document.getElementById('hitRatio').textContent = `${ratio}%`;

      // Calculate command rate
      const commands = data.metrics.reduce((s, m) => s + m.commands, 0);
      const rate = (commands / (data.metrics.length * 3600)).toFixed(1);

      document.getElementById('commandRate').textContent = `${rate}/sec`;

      // Draw chart
      drawCommandChart(data.metrics);
    }

    let commandChart;

    function drawCommandChart(metrics) {
      const ctx = document.getElementById('commandChart');

      // Chart.js won't reuse a canvas, so destroy the old chart on refresh
      if (commandChart) commandChart.destroy();

      commandChart = new Chart(ctx, {
        type: 'line',
        data: {
          labels: metrics.map(m => new Date(m.time).toLocaleTimeString()),
          datasets: [{
            label: 'Commands',
            data: metrics.map(m => m.commands),
            borderColor: 'blue'
          }]
        },
        options: {
          responsive: true,
          plugins: {
            title: { display: true, text: 'Commands per Hour' }
          }
        }
      });
    }

    loadMetrics();
    setInterval(loadMetrics, 60000); // Refresh every minute
  </script>
</body>
</html>
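
The dashboard calls /api/metrics on its own backend instead of hitting SwiftCache directly, which keeps the API key server-side. A minimal proxy sketch using Express (the framework and the SWIFTCACHE_API_KEY variable name are assumptions):

import express from 'express';

const app = express();

app.get('/api/metrics', async (req, res) => {
  const instanceId = req.query.instanceId as string;

  // The API key lives in an environment variable, never in browser code
  const response = await fetch(
    `https://api.swiftcache.io/api/v1/instances/${instanceId}/metrics`,
    { headers: { Authorization: `Bearer ${process.env.SWIFTCACHE_API_KEY}` } }
  );

  res.status(response.status).json(await response.json());
});

app.listen(3000);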

Common Monitoring Mistakes

Mistake 1: Ignoring Low Hit Ratio

// Wrong: Assuming cache is working
redis.get('key');  // 45% hit ratio, mostly misses!

// Right: Investigate and fix
if (hitRatio < 0.8) {
  // Increase TTL or cache different data
}

Mistake 2: Only Checking Occasionally

// Wrong: Manual checks
// "Oh, I'll check metrics when I remember"

// Right: Automated monitoring
setInterval(() => {
  checkMetrics().then(alert);
}, 5 * 60 * 1000);

Mistake 3: Ignoring Spikes

// Wrong: Not investigating sudden changes
// Metrics show 10x command increase, "probably fine"

// Right: Alert on anomalies
const avgRate = metrics.reduce((s, m) => s + m.commands, 0) / metrics.length;
const lastRate = metrics[metrics.length - 1].commands;

if (lastRate > avgRate * 3) {
  alert('Unusual spike detected');
}

Best Practices

  1. Monitor hit ratio - It's your primary health indicator
  2. Alert on anomalies - Not just thresholds
  3. Track over time - Weekly/monthly trends matter
  4. Correlate with deploys - Command spikes after deployment are normal
  5. Plan capacity - Use growth trends to predict when to scale
  6. Archive metrics - The API keeps only 24 hours, so persist history yourself (see the sketch below)
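
Since the API retains only 24 hours, a small job that fetches metrics each hour and appends the latest data point to a file (JSON Lines here; any store works) builds long-term history. It reuses the getMetrics helper from earlier:

import { appendFileSync } from 'fs';

async function archiveMetrics(instanceId: string, apiKey: string): Promise<void> {
  const data = await getMetrics(instanceId, apiKey);

  // One JSON object per line; dedupe on the "time" field downstream
  appendFileSync('metrics-archive.jsonl', JSON.stringify(data.current) + '\n');
}

// Run hourly (via cron, or setInterval in a long-lived process)
setInterval(() => archiveMetrics('inst_abc123', 'sk_live_xxx'), 60 * 60 * 1000);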

Summary

Metrics tell the story of your cache:

  • Hit ratio shows if cache is helping
  • Command rate shows workload intensity
  • Data flow shows access patterns
  • Changes indicate something shifted

Regular monitoring catches problems before they become outages. Set up alerts, build dashboards, and act on what the metrics tell you.