Monitoring & Metrics
Track performance and usage with the metrics API
Track your Redis instance's performance and usage. Understand what metrics mean, how to fetch them, and how to act on what they tell you.
Available Metrics
SwiftCache tracks the last 24 hours of detailed metrics for every instance. All metrics are hourly aggregates.
Metrics Overview
| Metric | Unit | What It Measures |
|---|---|---|
| commands | Count | Total Redis commands executed |
| bytesIn | Bytes | Data sent TO your instance |
| bytesOut | Bytes | Data sent FROM your instance |
| hits | Count | Cache hits (key existed) |
| misses | Count | Cache misses (key not found) |
Understanding Each Metric
Commands: Every GET, SET, DEL, etc. is one command. Higher = more load.
4,000 commands per hour = ~1 command per second average
100,000 commands per hour = ~28 commands per second
1,000,000 commands per hour = ~278 commands per second (heavy load)
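The conversion above is just division by 3,600 seconds per hour. A tiny helper (a sketch for illustration, not part of any SwiftCache SDK) makes it explicit:

```python
def commands_per_second(commands_per_hour: int) -> float:
    """Convert an hourly command count to an average per-second rate."""
    return commands_per_hour / 3600

# The thresholds from the list above:
print(round(commands_per_second(4_000)))      # 1
print(round(commands_per_second(100_000)))    # 28
print(round(commands_per_second(1_000_000)))  # 278
```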
Bytes In: Total size of data written to the cache. Helps you understand write volume.
10 MB per hour = Small cache, light usage
100 MB per hour = Moderate usage
1 GB per hour = Heavy write workload
Bytes Out: Total size of data read from the cache. Indicates read patterns.
bytesOut > bytesIn = More reads than writes, good cache utilization
bytesOut < bytesIn = Few reads; data is written but rarely read back
bytesOut >> bytesIn = Hot data served repeatedly, excellent cache utilization
Hits vs Misses: The most important metrics. Together they determine your hit ratio.
1000 hits + 0 misses = 100% hit ratio (perfect cache)
800 hits + 200 misses = 80% hit ratio (good)
500 hits + 500 misses = 50% hit ratio (cache barely helping)
100 hits + 900 misses = 10% hit ratio (cache not helping, data not cacheable)
Fetching Metrics from the API
Basic Metrics Request
curl -H "Authorization: Bearer sk_live_your_api_key" \
https://api.swiftcache.io/api/v1/instances/inst_abc123/metrics
Full Response Example
{
"metrics": [
{
"time": "2024-03-15T10:00:00Z",
"commands": 12500,
"bytesIn": 1048576,
"bytesOut": 2097152,
"hits": 9800,
"misses": 2700
},
{
"time": "2024-03-15T11:00:00Z",
"commands": 14200,
"bytesIn": 1234567,
"bytesOut": 2468013,
"hits": 11200,
"misses": 3000
},
// ... more hourly data for last 24 hours
],
"current": {
"commands": 14200,
"bytesIn": 1234567,
"bytesOut": 2468013,
"hits": 11200,
"misses": 3000
}
}
- metrics array: Last 24 hours of hourly data (up to 24 objects)
- current object: Most recent hour (not an average, the latest data point)
TypeScript Metrics Fetcher
interface Metric {
time: string;
commands: number;
bytesIn: number;
bytesOut: number;
hits: number;
misses: number;
}
interface MetricsResponse {
metrics: Metric[];
current: Metric;
}
async function getMetrics(instanceId: string, apiKey: string): Promise<MetricsResponse> {
const response = await fetch(
`https://api.swiftcache.io/api/v1/instances/${instanceId}/metrics`,
{
headers: { 'Authorization': `Bearer ${apiKey}` }
}
);
if (!response.ok) {
throw new Error(`API error: ${response.status}`);
}
return await response.json();
}
// Usage
const metrics = await getMetrics('inst_abc123', 'sk_live_xxx');
console.log('Last hour:', metrics.current);
console.log('Last 24 hours:', metrics.metrics.length, 'data points');
Python Metrics Fetcher
import requests
from typing import List, Dict
from datetime import datetime
def get_metrics(instance_id: str, api_key: str) -> Dict:
"""Fetch last 24 hours of metrics for an instance"""
response = requests.get(
f'https://api.swiftcache.io/api/v1/instances/{instance_id}/metrics',
headers={'Authorization': f'Bearer {api_key}'}
)
response.raise_for_status()
return response.json()
# Usage
data = get_metrics('inst_abc123', 'sk_live_xxx')
print(f"Data points: {len(data['metrics'])}")
print(f"Current commands: {data['current']['commands']}")
Analyzing Your Metrics
Calculate Hit Ratio
The hit ratio is the most important metric. It tells you how often your cache helps.
function calculateHitRatio(metrics: Metric[]): number {
const totalHits = metrics.reduce((sum, m) => sum + m.hits, 0);
const totalMisses = metrics.reduce((sum, m) => sum + m.misses, 0);
const total = totalHits + totalMisses;
if (total === 0) return 0;
return totalHits / total;
}
const hitRatio = calculateHitRatio(metrics.metrics);
console.log(`Hit ratio: ${(hitRatio * 100).toFixed(1)}%`);
// Interpretation:
// > 90% = Excellent, your cache is doing its job
// 80-90% = Good, mostly serving from cache
// 70-80% = Acceptable, but room for improvement
// 60-70% = Poor, more misses than desired
// < 60% = Cache barely helping, needs investigation
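The bands above can be encoded as a small classifier. This is a hypothetical helper mirroring the interpretation table, not an official API:

```python
def classify_hit_ratio(ratio: float) -> str:
    """Map a 0-1 hit ratio to the interpretation bands above."""
    if ratio > 0.90:
        return "excellent"
    if ratio >= 0.80:
        return "good"
    if ratio >= 0.70:
        return "acceptable"
    if ratio >= 0.60:
        return "poor"
    return "needs investigation"

print(classify_hit_ratio(0.95))  # excellent
print(classify_hit_ratio(0.45))  # needs investigation
```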
Python Hit Ratio
def calculate_hit_ratio(metrics: List[Dict]) -> float:
total_hits = sum(m['hits'] for m in metrics)
total_misses = sum(m['misses'] for m in metrics)
total = total_hits + total_misses
if total == 0:
return 0
return total_hits / total
ratio = calculate_hit_ratio(data['metrics'])
print(f"Hit ratio: {ratio * 100:.1f}%")
Average Command Rate
Track how many commands per second you're executing:
function getAverageCommandsPerSecond(metrics: Metric[]): number {
const totalCommands = metrics.reduce((sum, m) => sum + m.commands, 0);
const seconds = metrics.length * 3600; // Hours to seconds
return totalCommands / seconds;
}
const avgCps = getAverageCommandsPerSecond(metrics.metrics);
console.log(`Average: ${avgCps.toFixed(1)} commands/second`);
// Peak command rate
const peakMetric = metrics.metrics.reduce((max, m) =>
m.commands > max.commands ? m : max
);
const peakCps = peakMetric.commands / 3600;
console.log(`Peak: ${peakCps.toFixed(1)} commands/second at ${peakMetric.time}`);
Data Volume Analysis
Understand your data flow patterns:
function analyzeDataFlow(metrics: Metric[]) {
const totalIn = metrics.reduce((sum, m) => sum + m.bytesIn, 0);
const totalOut = metrics.reduce((sum, m) => sum + m.bytesOut, 0);
const inMb = totalIn / 1024 / 1024;
const outMb = totalOut / 1024 / 1024;
console.log(`Last 24 hours:`);
console.log(` Data in: ${inMb.toFixed(1)} MB`);
console.log(` Data out: ${outMb.toFixed(1)} MB`);
  console.log(`  Ratio: ${inMb > 0 ? `${(outMb / inMb).toFixed(2)}x` : 'n/a'}`);
// Interpretation:
if (outMb > inMb * 3) {
console.log(" Insight: Hot data, lots of reads");
} else if (outMb < inMb * 0.5) {
console.log(" Insight: Write-heavy, not much re-reading");
}
}
Setting Up Monitoring Alerts
TypeScript Alert System
interface AlertRule {
metric: 'hitRatio' | 'commandRate' | 'memory';
threshold: number;
operator: '<' | '>' | '==';
action: (value: number) => void;
}
async function checkMetricsAndAlert(
instanceId: string,
apiKey: string,
rules: AlertRule[]
) {
const data = await getMetrics(instanceId, apiKey);
const metrics = data.metrics;
// Check each rule
for (const rule of rules) {
let value: number;
if (rule.metric === 'hitRatio') {
value = calculateHitRatio(metrics);
} else if (rule.metric === 'commandRate') {
value = getAverageCommandsPerSecond(metrics);
} else {
continue;
}
// Evaluate rule
const triggered =
rule.operator === '<' ? value < rule.threshold :
rule.operator === '>' ? value > rule.threshold :
value === rule.threshold;
if (triggered) {
rule.action(value);
}
}
}
// Define alert rules
const alerts: AlertRule[] = [
{
metric: 'hitRatio',
threshold: 0.8,
operator: '<',
action: (ratio) => {
console.warn(`Alert: Hit ratio ${(ratio * 100).toFixed(1)}% below 80%`);
// Send Slack/email/PagerDuty alert
}
},
{
metric: 'commandRate',
threshold: 10000,
operator: '>',
action: (rate) => {
console.warn(`Alert: Command rate ${rate.toFixed(0)}/sec above 10k`);
// Trigger auto-scaling or alert
}
}
];
// Check every 5 minutes
setInterval(() => {
checkMetricsAndAlert('inst_abc123', 'sk_live_xxx', alerts);
}, 5 * 60 * 1000);
Python Alert System
from typing import Callable, List
from dataclasses import dataclass
from enum import Enum
import schedule
import time
class Operator(Enum):
LESS_THAN = '<'
GREATER_THAN = '>'
EQUAL = '=='
@dataclass
class AlertRule:
metric: str
threshold: float
operator: Operator
action: Callable
def check_alerts(instance_id: str, api_key: str, rules: List[AlertRule]):
"""Check metrics against alert rules"""
data = get_metrics(instance_id, api_key)
metrics = data['metrics']
# Calculate metrics
hit_ratio = calculate_hit_ratio(metrics)
avg_cps = sum(m['commands'] for m in metrics) / (len(metrics) * 3600)
# Check rules
for rule in rules:
if rule.metric == 'hitRatio':
value = hit_ratio
elif rule.metric == 'commandRate':
value = avg_cps
else:
continue
# Evaluate
triggered = (
    (rule.operator == Operator.LESS_THAN and value < rule.threshold) or
    (rule.operator == Operator.GREATER_THAN and value > rule.threshold)
)
if triggered:
rule.action(value)
# Define rules
alerts = [
AlertRule(
metric='hitRatio',
threshold=0.75,
operator=Operator.LESS_THAN,
action=lambda v: print(f"Alert: Hit ratio {v*100:.1f}% is low!")
),
AlertRule(
metric='commandRate',
threshold=5000,
operator=Operator.GREATER_THAN,
action=lambda v: print(f"Alert: High load {v:.0f} commands/sec!")
)
]
# Schedule checks
schedule.every(5).minutes.do(
check_alerts,
'inst_abc123',
'sk_live_xxx',
alerts
)
while True:
schedule.run_pending()
time.sleep(1)
Common Metrics Scenarios
Scenario 1: Low Hit Ratio (Cache Not Helping)
Metrics show:
- Hit ratio: 45%
- Hits: 4,500
- Misses: 5,500
What it means: More misses than hits. Your cache isn't helping much.
Causes:
- Wrong TTL (expiring too fast)
- Wrong data cached (cacheable data not in cache)
- Cache too small (evicting frequently)
- Data is not cacheable (constantly changing, never requested twice)
Solutions:
1. Check your TTL settings:
// Are you setting long enough TTLs?
redis.set('key', 'value', 'EX', 3600); // 1 hour
redis.set('key', 'value', 'EX', 86400); // 1 day, better for stable data
2. Verify the correct data is cached:
# What keys are in your cache?
redis-cli KEYS "*" | head -100
redis-cli SCAN 0 COUNT 100
3. Check memory; if keys are being evicted, your instance is too small:
# Get metrics showing memory pressure
curl -H "Authorization: Bearer sk_live_xxx" \
  https://api.swiftcache.io/api/v1/instances/inst_abc123/metrics
4. Consider whether the data is cacheable at all; some data patterns (constantly changing, never requested twice) can't be cached effectively.
Scenario 2: High Command Rate Spike
Metrics show:
- Peak commands: 50,000 per hour (~14 commands/second)
- Normally: 10,000 per hour (~3 commands/second)
What it means: 5x increase in load. Something changed.
Causes:
- New feature deployed
- Traffic spike (viral post, sale, etc.)
- Thundering herd (many clients retrying after error)
- Inefficient code (N+1 queries)
Solutions:
1. Check if this is expected:
# Did you deploy something?
git log -1 --oneline
# Check application logs during the spike time
grep "2024-03-15T14:" app.log | wc -l
2. If unexpected, investigate the root cause:
// Add logging to Redis operations
redisClient.on('command', (cmd) => {
  if (cmd.args[0] === 'GET') {
    console.log('GET:', cmd.args[1]);
  }
});
3. Optimize if possible:
// Batch operations
const results = await redis.mget(['key1', 'key2', 'key3']);
// Use pipelines
const pipeline = redis.pipeline();
for (const key of keys) {
  pipeline.get(key);
}
await pipeline.exec();
4. Scale up the instance if the load is sustained:
# Increase memory/capacity
curl -X PATCH \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{"maxMemory": 1073741824}' \
  https://api.swiftcache.io/api/v1/instances/inst_abc123
Scenario 3: Asymmetric Data Flow (bytesOut >> bytesIn)
Metrics show:
- Bytes in: 50 MB per hour
- Bytes out: 500 MB per hour
- Ratio: 10x more data flowing out
What it means: You're reading much more than writing. Data stays in cache long.
Interpretation: Actually good! Your cache is being used effectively for hot data.
No action needed unless:
- You expected more balanced traffic
- Memory usage is growing (data not expiring)
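The asymmetry check in this scenario is a single division; a minimal sketch using the field names from the API response above:

```python
def read_write_ratio(metrics: list) -> float:
    """bytesOut / bytesIn over a window; values well above 1 mean read-heavy."""
    total_in = sum(m['bytesIn'] for m in metrics)
    total_out = sum(m['bytesOut'] for m in metrics)
    # No writes at all: treat the ratio as infinite rather than dividing by zero
    return total_out / total_in if total_in else float('inf')

# The scenario above: 50 MB in, 500 MB out per hour
sample = [{'bytesIn': 50 * 1024**2, 'bytesOut': 500 * 1024**2}]
print(read_write_ratio(sample))  # 10.0
```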
Scenario 4: Zero Hits (Completely Dead Cache)
Metrics show:
- Hits: 0
- Misses: Everything
Causes:
- Cache is not being used (code path bypassed)
- All keys expired (TTL too short)
- Instance just created (no data yet)
- Wrong hostname/key (connecting elsewhere)
Debug:
# Check if there's any data in cache
redis-cli KEYS "*" | wc -l # Should be > 0
# Check if expiring too fast
redis-cli TTL mykey # Positive = seconds remaining; -1 = no expiry; -2 = key doesn't exist (expired or never set)
# Check if using correct instance
echo $REDIS_HOSTNAME # Verify it's correct
Building a Metrics Dashboard
Simple HTML Dashboard
<!DOCTYPE html>
<html>
<head>
<title>SwiftCache Metrics</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<style>
body { font-family: Arial; margin: 20px; }
.metric { display: inline-block; margin: 20px; padding: 10px; border: 1px solid #ddd; }
canvas { max-width: 600px; }
</style>
</head>
<body>
<h1>Instance Metrics</h1>
<div class="metric">
<h3>Hit Ratio</h3>
<p id="hitRatio">Loading...</p>
</div>
<div class="metric">
<h3>Commands/sec</h3>
<p id="commandRate">Loading...</p>
</div>
<canvas id="commandChart"></canvas>
<script>
async function loadMetrics() {
const response = await fetch(
'/api/metrics?instanceId=inst_abc123'
);
const data = await response.json();
// Calculate hit ratio
const hits = data.metrics.reduce((s, m) => s + m.hits, 0);
const misses = data.metrics.reduce((s, m) => s + m.misses, 0);
const ratio = (hits / (hits + misses) * 100).toFixed(1);
document.getElementById('hitRatio').textContent = `${ratio}%`;
// Calculate command rate
const commands = data.metrics.reduce((s, m) => s + m.commands, 0);
const rate = (commands / (data.metrics.length * 3600)).toFixed(1);
document.getElementById('commandRate').textContent = `${rate}/sec`;
// Draw chart
drawCommandChart(data.metrics);
}
function drawCommandChart(metrics) {
const ctx = document.getElementById('commandChart');
new Chart(ctx, {
type: 'line',
data: {
labels: metrics.map(m => new Date(m.time).toLocaleTimeString()),
datasets: [{
label: 'Commands',
data: metrics.map(m => m.commands),
borderColor: 'blue'
}]
},
options: {
responsive: true,
plugins: {
title: { display: true, text: 'Commands per Hour' }
}
}
});
}
loadMetrics();
setInterval(loadMetrics, 60000); // Refresh every minute
</script>
</body>
</html>
Common Monitoring Mistakes
Mistake 1: Ignoring Low Hit Ratio
// Wrong: Assuming cache is working
redis.get('key'); // 45% hit ratio, mostly misses!
// Right: Investigate and fix
if (hitRatio < 0.8) {
// Increase TTL or cache different data
}
Mistake 2: Only Checking Occasionally
// Wrong: Manual checks
// "Oh, I'll check metrics when I remember"
// Right: Automated monitoring
setInterval(() => {
checkMetrics().then(alert);
}, 5 * 60 * 1000);
Mistake 3: Ignoring Spikes
// Wrong: Not investigating sudden changes
// Metrics show 10x command increase, "probably fine"
// Right: Alert on anomalies
const avgRate = metrics.reduce((s, m) => s + m.commands, 0) / metrics.length;
const lastRate = metrics[metrics.length - 1].commands;
if (lastRate > avgRate * 3) {
alert('Unusual spike detected');
}
Best Practices
- Monitor hit ratio - It's your primary health indicator
- Alert on anomalies - Not just thresholds
- Track over time - Weekly/monthly trends matter
- Correlate with deploys - Command spikes after deployment are normal
- Plan capacity - Use growth trends to predict when to scale
- Archive metrics - Keep long-term history for analysis
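For the capacity-planning point, here is a minimal sketch of trend projection over archived daily command totals. It assumes you archive daily totals yourself, since the API only retains 24 hours; the function name and linear-fit approach are illustrative, not prescriptive:

```python
def project_daily_commands(daily_totals: list, days_ahead: int) -> float:
    """Project a future daily command total via a least-squares linear fit.

    Needs at least two data points; a perfectly linear history projects exactly.
    """
    n = len(daily_totals)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_totals) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_totals)) \
            / sum((x - mean_x) ** 2 for x in xs)
    # Extrapolate from the last observed day (index n-1) forward
    return mean_y + slope * (n - 1 + days_ahead - mean_x)

# Steady growth of 10,000 commands/day:
history = [100_000, 110_000, 120_000, 130_000]
print(project_daily_commands(history, 7))  # 200000.0
```

Comparing the projection against your plan's limits tells you roughly when to scale.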
Summary
Metrics tell the story of your cache:
- Hit ratio shows if cache is helping
- Command rate shows workload intensity
- Data flow shows access patterns
- Changes indicate something shifted
Regular monitoring catches problems before they become outages. Set up alerts, build dashboards, and act on what the metrics tell you.