
Statistical Analysis

Standard Deviation Calculation

All compression tests are run 10 times for each configuration to ensure statistical reliability. We calculate:

Formula

Standard Deviation:

σ = √(Σ(xi - μ)² / n)

Where:
- xi = individual measurement
- μ = mean (average)
- n = number of measurements

Coefficient of Variation (CV):

CV = (σ / μ) × 100%

Lower CV indicates more consistent performance.
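As a sketch, the two statistics above can be computed in Go like this (the function names are illustrative, not the benchmark's actual API):

```go
package main

import (
	"fmt"
	"math"
)

// mean returns the arithmetic mean of the samples.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// stdDev returns the population standard deviation,
// matching the formula above (divide by n, not n-1).
func stdDev(xs []float64) float64 {
	mu := mean(xs)
	ss := 0.0
	for _, x := range xs {
		d := x - mu
		ss += d * d
	}
	return math.Sqrt(ss / float64(len(xs)))
}

// cv returns the coefficient of variation as a percentage.
func cv(xs []float64) float64 {
	return stdDev(xs) / mean(xs) * 100
}

func main() {
	// Ten hypothetical timing samples in microseconds.
	samples := []float64{44.1, 45.9, 43.8, 47.2, 45.0, 44.6, 46.3, 45.5, 44.9, 45.7}
	fmt.Printf("mean=%.2f stddev=%.2f cv=%.1f%%\n", mean(samples), stdDev(samples), cv(samples))
}
```

Note the divisor is n, matching the population formula given above; a sample standard deviation would divide by n-1 instead.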

Token Counting

Tokens are estimated using a simplified approximation:

tokens ≈ characters / 4

This approximates GPT-style tokenization for JSON/English text.

Why This Approximation?

  1. Simplicity: No external dependencies on tokenizer libraries
  2. Consistency: Same calculation across all tests
  3. Accuracy: Within 10-15% of actual GPT tokenization for JSON
  4. Speed: Instant calculation without API calls
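The heuristic above is a one-liner; a minimal Go sketch (the function name is illustrative):

```go
package main

import "fmt"

// estimateTokens approximates GPT-style token counts for
// JSON/English text using the characters/4 heuristic.
func estimateTokens(s string) int {
	return len(s) / 4
}

func main() {
	payload := `{"name":"example","value":42}`
	fmt.Println(estimateTokens(payload)) // 29 chars / 4 = 7
}
```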

Real Tokenization

For production use, consider an actual tokenizer such as OpenAI's tiktoken.

Test Execution

Iteration Count

We use 10 iterations because it offers the best balance between statistical accuracy and run time (see Sample Size Justification below).

Warm-up

The first iteration includes one-time startup costs (e.g., cold caches), so it is typically slower than the rest.

Subsequent iterations measure steady-state performance.

Environment

Tests should be run on a quiet system with minimal background load, since system load, CPU frequency scaling, and memory pressure all add variance.

Metrics Explained

Processing Time

Mean ± StdDev (n=10)

Example: 45.2µs ± 3.1µs (n=10)

Interpretation: the mean processing time over 10 runs was 45.2µs, with a standard deviation of 3.1µs (CV ≈ 6.9%).

Size Reduction

Percentage Reduction:

reduction% = ((original - compressed) / original) × 100

Example: A 60.5% reduction means the compressed output is 39.5% of the original size (e.g., 100 KB shrinks to 39.5 KB).

Token Reduction

Same formula as size reduction, but for estimated tokens:

token_reduction% = ((original_tokens - compressed_tokens) / original_tokens) × 100
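Both formulas share the same shape, so a single helper covers size and token reduction; a minimal sketch in Go (the function name is illustrative):

```go
package main

import "fmt"

// reductionPct applies the percentage-reduction formula above.
// It works for byte sizes and estimated token counts alike.
func reductionPct(original, compressed int) float64 {
	return float64(original-compressed) / float64(original) * 100
}

func main() {
	fmt.Printf("size reduction:  %.1f%%\n", reductionPct(1000, 395))
	fmt.Printf("token reduction: %.1f%%\n", reductionPct(250, 100))
}
```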

Example: A 60.5% token reduction means the compressed payload consumes only 39.5% of the original token budget.

Cost Impact: If the API costs $0.01 per 1,000 tokens, a 60.5% token reduction cuts the cost of a 100,000-token workload from $1.00 to about $0.395.

Benchmark Profiles

Light Compression

Medium Compression

Aggressive Compression

AI-Optimized

Statistical Confidence

Confidence Intervals

For 10 iterations, 95% confidence interval:

CI = mean ± (1.96 × σ / √n)
CI = mean ± (1.96 × σ / √10)
CI = mean ± (0.62 × σ)
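The derivation above can be checked with a small Go helper (the function name is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// ci95HalfWidth returns the half-width of the 95% confidence
// interval for the mean: 1.96 * sigma / sqrt(n).
func ci95HalfWidth(sigma float64, n int) float64 {
	return 1.96 * sigma / math.Sqrt(float64(n))
}

func main() {
	// With n=10 the multiplier on sigma is 1.96/sqrt(10) ≈ 0.62,
	// as derived above. Using the example stddev of 3.1µs:
	fmt.Printf("CI = mean ± %.2fµs\n", ci95HalfWidth(3.1, 10))
}
```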

Sample Size Justification

Why n=10?

  n    Confidence   Speed       Accuracy
  3    Low          Fast        ±30%
  5    Medium       Fast        ±20%
  10   Good         Medium      ±10%
  30   High         Slow        ±5%
  100  Very High    Very Slow   ±2%

We chose n=10 as optimal balance between accuracy and speed.

Reproducibility

To reproduce results:

# Run tests
cd testing
go run compression_benchmark.go

# Results will vary slightly due to:
# - System load
# - CPU frequency scaling
# - Memory pressure
# - Cache state

Expected Variance

Typical standard deviation ranges:

Higher variance typically indicates external interference: system load, CPU frequency scaling, memory pressure, or cache effects.

Validation

Sanity Checks

  1. Compression never increases size (except edge cases)
  2. Token count proportional to size
  3. StdDev < 20% of mean (consistent performance)
  4. Aggressive > Medium > Light (reduction order)
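Two of these checks (no size increase, StdDev < 20% of mean) can be automated with a small validator; a sketch, where the Result struct and field names are illustrative rather than the benchmark's actual types:

```go
package main

import "fmt"

// Result holds one benchmark measurement; the field names
// here are hypothetical, not the benchmark's actual schema.
type Result struct {
	OriginalSize, CompressedSize int
	MeanMicros, StdDevMicros     float64
}

// sane applies sanity checks 1 and 3 from the list above.
func sane(r Result) error {
	if r.CompressedSize > r.OriginalSize {
		return fmt.Errorf("compression increased size: %d -> %d", r.OriginalSize, r.CompressedSize)
	}
	if r.StdDevMicros >= 0.20*r.MeanMicros {
		return fmt.Errorf("stddev %.1f is >= 20%% of mean %.1f", r.StdDevMicros, r.MeanMicros)
	}
	return nil
}

func main() {
	r := Result{OriginalSize: 1000, CompressedSize: 395, MeanMicros: 45.2, StdDevMicros: 3.1}
	if err := sane(r); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("OK")
}
```

Checks 2 and 4 need cross-run context (token/size pairs and all three profile results), so they are better verified when aggregating results.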

Known Limitations

  1. Token estimation: ±10-15% accuracy vs real tokenizers
  2. Small sample size: n=10 may miss rare outliers
  3. Single-threaded: Doesn’t test parallel performance
  4. Cold start: First iteration may be slower


Future Improvements

  1. Real tokenizer integration: Use tiktoken for accurate counts
  2. Larger sample sizes: Configurable n (10, 30, 100)
  3. Percentile reporting: P50, P95, P99 latencies
  4. Outlier detection: Identify and report anomalies
  5. Parallel testing: Multi-threaded performance
  6. Memory profiling: Track allocation patterns