Batch Benchmarking

Benchmark multiple models and generate a ranked leaderboard.

Run a Benchmark

from aime_loc import LOC

loc = LOC()
results = loc.benchmark([
    "meta-llama/Llama-4-Scout",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "mistralai/Mistral-Small-24B-Instruct-2501",
    "Qwen/Qwen3.5-35B-A3B",
    "google/gemma-3-12b-it",
])

View Results

# Print leaderboard table
results.leaderboard_table()
# | Rank | Model | Size | TC % | Best Function |
# |:---:|-------|------|:---:|:---:|
# | 1 | Llama-4-Scout | 109B | 15.37 | Emotion |
# ...

# Heatmap across all models and functions
results.heatmap()

Access Individual Profiles

for profile in results.profiles:
    print(f"{profile.model_id}: TC={profile.tc_score:.2f}%")
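If you need a ranking other than the built-in leaderboard order, the two attributes shown above (`model_id`, `tc_score`) are enough to sort with plain Python. The `Profile` dataclass below is a hypothetical stand-in for the objects in `results.profiles`, and the scores are placeholder numbers, used only to keep the sketch self-contained:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the objects in results.profiles; only the
# model_id and tc_score attributes shown above are assumed. The scores
# are placeholders, not benchmark results.
@dataclass
class Profile:
    model_id: str
    tc_score: float

profiles = [
    Profile("google/gemma-3-12b-it", 9.81),
    Profile("meta-llama/Llama-4-Scout", 15.37),
    Profile("mistralai/Mistral-Small-24B-Instruct-2501", 11.02),
]

# Rank descending by TC score, mirroring the leaderboard ordering.
ranked = sorted(profiles, key=lambda p: p.tc_score, reverse=True)
for rank, p in enumerate(ranked, start=1):
    print(f"#{rank} {p.model_id}: TC={p.tc_score:.2f}%")
```

The same pattern works for any custom key, e.g. sorting by model ID instead of score.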

Public Leaderboard

Fetch the pre-computed public leaderboard without running any benchmarks locally:

lb = loc.leaderboard(top_n=20)
for entry in lb.entries:
    print(f"#{entry.rank} {entry.model_id}: TC={entry.tc_score:.2f}%")
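Leaderboard entries are easy to export for downstream analysis using only the attributes shown above (`rank`, `model_id`, `tc_score`) and the standard-library `csv` module. The `Entry` dataclass here is a hypothetical stand-in for the objects in `lb.entries`, with placeholder scores:

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical stand-in for the objects in lb.entries; only the rank,
# model_id, and tc_score attributes shown above are assumed. Scores are
# placeholders, not leaderboard results.
@dataclass
class Entry:
    rank: int
    model_id: str
    tc_score: float

entries = [
    Entry(1, "meta-llama/Llama-4-Scout", 15.37),
    Entry(2, "mistralai/Mistral-Small-24B-Instruct-2501", 11.02),
]

# Serialize the leaderboard to CSV (here to an in-memory buffer;
# swap io.StringIO for open("leaderboard.csv", "w") to write a file).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["rank", "model_id", "tc_score"])
for e in entries:
    writer.writerow([e.rank, e.model_id, f"{e.tc_score:.2f}"])
print(buf.getvalue())
```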