Short version: there isn’t a neat, free “Copilot rank tracker” yet, but you can get close if you’re willing to be a bit scrappy.
I mostly agree with @sternenwanderer, but I’d actually push you in a slightly different direction: instead of logging everything and building a huge benchmarking rig, focus on small, frictionless signals in your actual editor, because anything too manual will die after 3 days.
A few ideas that don’t repeat what was already suggested:
**Lightweight in‑editor rating hotkeys**

- Use something like a global hotkey tool (AutoHotkey on Windows, skhd/Karabiner on mac, whatever on Linux).
- Bind:
  - `Ctrl+Alt+1` = “this suggestion sucked”
  - `Ctrl+Alt+2` = “meh”
  - `Ctrl+Alt+3` = “good”
- Each hotkey writes a JSON line to a local log file:
  `{"ts": "2026-03-02T12:34:56Z", "model": "github-copilot", "score": 3, "lang": "ts"}`
- You only press it when the AI suggestion actually mattered.
- Once a week, quick script to aggregate by model & language.
This is way less work than trying to auto‑track every single insertion.
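If you want something concrete to wire the hotkeys into, here’s a minimal sketch of the append script in Python. The script name (`rate.py`), the log path, and the flag names are my own placeholders, not part of AutoHotkey or any copilot:

```python
#!/usr/bin/env python3
"""rate.py -- append one rating as a JSON line; invoked by the hotkeys above."""
import argparse
import datetime
import json
from pathlib import Path

LOG = Path.home() / "copilot-log.jsonl"  # placeholder location, use whatever you like

parser = argparse.ArgumentParser()
parser.add_argument("score", type=int, choices=[1, 2, 3])  # sucked / meh / good
parser.add_argument("--model", default="github-copilot")
parser.add_argument("--lang", default="unknown")
args = parser.parse_args()

# One JSON object per line, so the weekly aggregation is a trivial line-by-line read.
with LOG.open("a") as f:
    f.write(json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": args.model,
        "score": args.score,
        "lang": args.lang,
    }) + "\n")
```

Then, for example, bind `Ctrl+Alt+3` in your hotkey tool to run `python rate.py 3 --lang ts`.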
**Model‑per‑day rotation instead of side‑by‑side**

I kinda disagree with the constant multi‑tool switching idea. Using 3 copilots at once tends to bias results toward “the one you used on easy stuff.”

- Pick a schedule like:
- Mon/Tue: GitHub Copilot
- Wed: Cursor / Codeium
- Thu/Fri: Claude / OpenAI
- Keep everything else fixed: same editor, same plugins.
- Tag each day in your log / rating system by model.
You’ll get a cleaner signal on “how it feels to live with this thing” over time.
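One trick so you don’t have to remember to tag anything: derive the model from the weekday. A tiny sketch assuming the schedule above (the mapping itself is just an example, swap in whatever tools you’re actually rotating):

```python
import datetime

# Weekday -> model on duty; Monday is 0. Purely illustrative mapping.
ROTATION = {
    0: "github-copilot", 1: "github-copilot",  # Mon/Tue
    2: "cursor",                               # Wed
    3: "claude", 4: "claude",                  # Thu/Fri
}

def todays_model() -> str:
    return ROTATION.get(datetime.date.today().weekday(), "off")
```

The rating script can call `todays_model()` instead of taking a `--model` flag, so every log line gets tagged automatically.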
**Track pain instead of just success**

Most folks track “did it work,” but what usually makes a copilot bad isn’t that it never works, it’s that it wastes your time. Things that are actually measurable with minimal setup:

- Count how often you hit “undo” right after an AI suggestion. If you use a consistent keybinding to trigger completions, you can watch for:
- “AI key pressed → large paste → undo within 3 seconds.”
- Count how many times a file switches from “tests passing” to “tests failing” within 2 minutes of inserting AI code. A simple `git diff` + test-runner hook can approximate this.
That effectively measures “regret rate” instead of just success rate.
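The test-runner half is the easier one to automate. Here’s a minimal sketch of a wrapper you’d run instead of your usual test command (e.g. `python regret_check.py -- pytest -q`); the file names, the 120‑second window, and the whole `--` convention are my own placeholders:

```python
#!/usr/bin/env python3
"""regret_check.py -- log a "regret" event when tests flip green -> red quickly."""
import json
import subprocess
import sys
import time
from pathlib import Path

LOG = Path.home() / "copilot-log.jsonl"        # same log the rating hotkeys use
STATE = Path.home() / ".last-test-state.json"  # remembers the previous run

if "--" not in sys.argv:
    sys.exit("usage: regret_check.py -- <test command>")
test_cmd = sys.argv[sys.argv.index("--") + 1:]  # everything after `--`
passing = subprocess.run(test_cmd).returncode == 0

prev = json.loads(STATE.read_text()) if STATE.exists() else None
now = time.time()
if prev and prev["passing"] and not passing and now - prev["ts"] < 120:
    # Green -> red within ~2 minutes of the last green run: count it as regret.
    with LOG.open("a") as f:
        f.write(json.dumps({"ts": now, "event": "regret"}) + "\n")

STATE.write_text(json.dumps({"ts": now, "passing": passing}))
sys.exit(0 if passing else 1)
```

It approximates “within 2 minutes of inserting AI code” with “within 2 minutes of the last green run,” which is close enough for a trend line.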
**Language / stack‑specific scoring**

All the leaderboard stuff is nice, but it hides a big issue: a model that’s great at Python might be terrible on your React + TypeScript + weird internal API setup.

- In your logs or spreadsheet, always store `language` and maybe `project`.
- After a few weeks, you might find something like:
- Copilot: awesome in TS, mediocre in infra
- Another: better in SQL / data stuff
Then you can route your tasks: one copilot for frontend, another for data / scripting.
**Simple “time cost” metric**

Instead of trying to auto‑measure response time per request, just track:

- “How many minutes from first AI suggestion to green tests?”
Use a small CLI “timer” you start and stop manually per task, and note the model used. Very low tech, but it gives you an actual business‑relevant metric: “who gets me to done faster.”
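A minimal sketch of that timer, again just appending to the same JSONL log; the name `timer.py` and the start/stop interface are my own invention:

```python
#!/usr/bin/env python3
"""timer.py -- `timer.py start`, hack away, then `timer.py stop <model>`."""
import json
import sys
import time
from pathlib import Path

LOG = Path.home() / "copilot-log.jsonl"
STAMP = Path.home() / ".task-timer"  # holds the start timestamp between calls

cmd = sys.argv[1] if len(sys.argv) > 1 else ""
if cmd == "start":
    STAMP.write_text(str(time.time()))
elif cmd == "stop":
    minutes = (time.time() - float(STAMP.read_text())) / 60
    model = sys.argv[2] if len(sys.argv) > 2 else "unknown"
    with LOG.open("a") as f:
        f.write(json.dumps({"event": "task", "minutes": round(minutes, 1),
                            "model": model}) + "\n")
else:
    print("usage: timer.py start | stop [model]")
```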
**UX friction as a metric**

This is super subjective, but still quantifiable with a 1–5 rating logged at the end of a coding session:

- 1 = I turned this thing off out of frustration
- 3 = neutral, didn’t help much
- 5 = I would pay real money for this today
One‑line append to the log when you stop for lunch or end your day. That “session score” will often correlate better with reality than synthetic benchmarks.
If you want a “closest thing to a rank tracker” without going full data‑scientist:
- Rotate tools by day.
- Use quick hotkeys to rate only meaningful suggestions.
- Once a week, throw the log into a simple script or spreadsheet (a sketch follows this list) and graph:
- avg score per model
- per language
- maybe per project
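The weekly script really can be this dumb. A sketch against the JSONL format from the hotkey idea above; it prints averages per model and per model + language as plain text, and you can paste the numbers into a spreadsheet if you want actual charts:

```python
#!/usr/bin/env python3
"""weekly.py -- average rating per model, and per model + language."""
import json
from collections import defaultdict
from pathlib import Path

LOG = Path.home() / "copilot-log.jsonl"

totals = defaultdict(lambda: [0, 0])  # key -> [sum of scores, count]
for line in LOG.read_text().splitlines():
    row = json.loads(line)
    if "score" not in row:
        continue  # skip "regret"/"task" events, they carry no rating
    for key in (row["model"], f'{row["model"]} / {row.get("lang", "?")}'):
        totals[key][0] += row["score"]
        totals[key][1] += 1

for key in sorted(totals):
    s, n = totals[key]
    print(f"{key:35} avg {s / n:.2f}  (n={n})")
```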
You won’t get a fancy dashboard with ELO scores for copilots, but you will know which one actually pulls its weight in your stack, for free, without turning your life into a research project.