
Statistical Analysis

Standard Deviation Calculation

All compression tests are run 10 times for each configuration to ensure statistical reliability. We calculate:

Formula

Standard Deviation:

σ = √(Σ(xi - μ)² / n)

Where:
- xi = individual measurement
- μ = mean (average)
- n = number of measurements

Coefficient of Variation (CV):

CV = (σ / μ) × 100%

Lower CV indicates more consistent performance.
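As a sketch, the two statistics above can be computed in Go like this (the function names are illustrative, not the benchmark's actual API):

```go
package main

import (
	"fmt"
	"math"
)

// mean returns the arithmetic mean of the samples.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// stdDev returns the population standard deviation,
// matching the formula above (divide by n, not n-1).
func stdDev(xs []float64) float64 {
	mu := mean(xs)
	ss := 0.0
	for _, x := range xs {
		d := x - mu
		ss += d * d
	}
	return math.Sqrt(ss / float64(len(xs)))
}

// cv returns the coefficient of variation as a percentage.
func cv(xs []float64) float64 {
	return stdDev(xs) / mean(xs) * 100
}

func main() {
	// Ten hypothetical timing samples in microseconds.
	samples := []float64{44.1, 45.9, 43.8, 47.2, 45.0, 44.6, 46.3, 45.5, 44.9, 45.7}
	fmt.Printf("mean=%.2f stddev=%.2f cv=%.1f%%\n", mean(samples), stdDev(samples), cv(samples))
}
```

Note the divisor is n, matching the population formula given above; a sample standard deviation would divide by n-1 instead.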

Token Counting

Tokens are estimated using a simplified approximation:

tokens ≈ characters / 4

This approximates GPT-style tokenization for JSON/English text.

Why This Approximation?

  1. Simplicity: No external dependencies on tokenizer libraries
  2. Consistency: Same calculation across all tests
  3. Accuracy: Within 10-15% of actual GPT tokenization for JSON
  4. Speed: Instant calculation without API calls
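The heuristic above is a one-liner; a minimal Go sketch (the function name is illustrative):

```go
package main

import "fmt"

// estimateTokens approximates GPT-style token counts for
// JSON/English text using the characters/4 heuristic.
func estimateTokens(s string) int {
	return len(s) / 4
}

func main() {
	payload := `{"name":"example","value":42}`
	fmt.Println(estimateTokens(payload)) // 29 chars / 4 = 7
}
```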

Real Tokenization

For production use, consider an actual tokenizer such as OpenAI's tiktoken.

Test Execution

Iteration Count

We use 10 iterations because it offers the best balance between statistical accuracy and run time (see Sample Size Justification below).

Warm-up

The first iteration includes one-time startup costs (e.g., cold caches), so it is typically slower than the rest.

Subsequent iterations measure steady-state performance.

Environment

Tests should be run on a quiet system with minimal background load, since system load, CPU frequency scaling, and memory pressure all add variance.

Metrics Explained

Processing Time

Mean ± StdDev (n=10)

Example: 45.2µs ± 3.1µs (n=10)

Interpretation: the mean processing time over 10 runs was 45.2µs, with a standard deviation of 3.1µs (CV ≈ 6.9%).

Size Reduction

Percentage Reduction:

reduction% = ((original - compressed) / original) × 100

Example: A 60.5% reduction means the compressed output is 39.5% of the original size (e.g., 100 KB shrinks to 39.5 KB).

Token Reduction

Same formula as size reduction, but for estimated tokens:

token_reduction% = ((original_tokens - compressed_tokens) / original_tokens) × 100
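Both formulas share the same shape, so a single helper covers size and token reduction; a minimal sketch in Go (the function name is illustrative):

```go
package main

import "fmt"

// reductionPct applies the percentage-reduction formula above.
// It works for byte sizes and estimated token counts alike.
func reductionPct(original, compressed int) float64 {
	return float64(original-compressed) / float64(original) * 100
}

func main() {
	fmt.Printf("size reduction:  %.1f%%\n", reductionPct(1000, 395))
	fmt.Printf("token reduction: %.1f%%\n", reductionPct(250, 100))
}
```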

Example: A 60.5% token reduction means the compressed payload consumes only 39.5% of the original token budget.

Cost Impact: If the API costs $0.01 per 1,000 tokens, a 60.5% token reduction cuts the cost of a 100,000-token workload from $1.00 to about $0.395.

Benchmark Profiles

Light Compression

Medium Compression

Aggressive Compression

AI-Optimized

Statistical Confidence

Confidence Intervals

For 10 iterations, 95% confidence interval:

CI = mean ± (1.96 × σ / √n)
CI = mean ± (1.96 × σ / √10)
CI = mean ± (0.62 × σ)
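The derivation above can be checked with a small Go helper (the function name is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// ci95HalfWidth returns the half-width of the 95% confidence
// interval for the mean: 1.96 * sigma / sqrt(n).
func ci95HalfWidth(sigma float64, n int) float64 {
	return 1.96 * sigma / math.Sqrt(float64(n))
}

func main() {
	// With n=10 the multiplier on sigma is 1.96/sqrt(10) ≈ 0.62,
	// as derived above. Using the example stddev of 3.1µs:
	fmt.Printf("CI = mean ± %.2fµs\n", ci95HalfWidth(3.1, 10))
}
```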

Sample Size Justification

Why n=10?

  n    Confidence   Speed       Accuracy
  3    Low          Fast        ±30%
  5    Medium       Fast        ±20%
  10   Good         Medium      ±10%
  30   High         Slow        ±5%
  100  Very High    Very Slow   ±2%

We chose n=10 as optimal balance between accuracy and speed.

Reproducibility

To reproduce results:

# Run tests
cd testing
go run compression_benchmark.go

# Results will vary slightly due to:
# - System load
# - CPU frequency scaling
# - Memory pressure
# - Cache state

Expected Variance

Typical standard deviation ranges:

Higher variance typically indicates external interference: system load, CPU frequency scaling, memory pressure, or cache effects.

Validation

Sanity Checks

  1. Compression never increases size (except edge cases)
  2. Token count proportional to size
  3. StdDev < 20% of mean (consistent performance)
  4. Aggressive > Medium > Light (reduction order)
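Two of these checks (no size increase, StdDev < 20% of mean) can be automated with a small validator; a sketch, where the Result struct and field names are illustrative rather than the benchmark's actual types:

```go
package main

import "fmt"

// Result holds one benchmark measurement; the field names
// here are hypothetical, not the benchmark's actual schema.
type Result struct {
	OriginalSize, CompressedSize int
	MeanMicros, StdDevMicros     float64
}

// sane applies sanity checks 1 and 3 from the list above.
func sane(r Result) error {
	if r.CompressedSize > r.OriginalSize {
		return fmt.Errorf("compression increased size: %d -> %d", r.OriginalSize, r.CompressedSize)
	}
	if r.StdDevMicros >= 0.20*r.MeanMicros {
		return fmt.Errorf("stddev %.1f is >= 20%% of mean %.1f", r.StdDevMicros, r.MeanMicros)
	}
	return nil
}

func main() {
	r := Result{OriginalSize: 1000, CompressedSize: 395, MeanMicros: 45.2, StdDevMicros: 3.1}
	if err := sane(r); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("OK")
}
```

Checks 2 and 4 need cross-run context (token/size pairs and all three profile results), so they are better verified when aggregating results.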

Known Limitations

  1. Token estimation: ±10-15% accuracy vs real tokenizers
  2. Small sample size: n=10 may miss rare outliers
  3. Single-threaded: Doesn’t test parallel performance
  4. Cold start: First iteration may be slower


Future Improvements

  1. Real tokenizer integration: Use tiktoken for accurate counts
  2. Larger sample sizes: Configurable n (10, 30, 100)
  3. Percentile reporting: P50, P95, P99 latencies
  4. Outlier detection: Identify and report anomalies
  5. Parallel testing: Multi-threaded performance
  6. Memory profiling: Track allocation patterns