Any recommendations for the best free Copilot rank tracker tool?

Short version: there isn’t a neat, free “Copilot rank tracker” yet, but you can get close if you’re willing to be a bit scrappy.

I mostly agree with @sternenwanderer, but I’d actually push you in a slightly different direction: instead of logging everything and building a huge benchmarking rig, focus on small, frictionless signals in your actual editor, because anything too manual will die after 3 days.

A few ideas that don’t repeat what was already suggested:

  1. Lightweight in‑editor rating hotkeys

    • Use a global hotkey tool (AutoHotkey on Windows, skhd or Karabiner on macOS, xbindkeys or similar on Linux).
    • Bind:
      • Ctrl+Alt+1 = “this suggestion sucked”
      • Ctrl+Alt+2 = “meh”
      • Ctrl+Alt+3 = “good”
    • Each hotkey writes a JSON line to a local log file (sketched in Python right after this list):
      {"ts": "2026-03-02T12:34:56Z", "model": "github-copilot", "score": 3, "lang": "ts"}
      
    • You only press it when the AI suggestion actually mattered.
    • Once a week, run a quick script to aggregate by model & language (the roll‑up sketch at the bottom of this post does exactly that).
      This is way less work than trying to auto‑track every single insertion.
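
    To make that concrete, here’s a minimal sketch of the logging side in Python, assuming the pynput library for the global hotkeys and made‑up values for the log path, model, and language; an AutoHotkey or skhd binding that appends the same line works just as well:

      # rate_ai.py - three global hotkeys, each appends one JSON line per rating.
      import datetime
      import json
      from pathlib import Path

      from pynput import keyboard  # assumption: pip install pynput

      LOG = Path.home() / "copilot_ratings.jsonl"  # hypothetical log location
      MODEL = "github-copilot"                     # whatever you're running today
      LANG = "ts"                                  # set per project / session

      def rate(score):
          def _append():
              entry = {
                  "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                  "model": MODEL,
                  "score": score,
                  "lang": LANG,
              }
              with LOG.open("a") as f:
                  f.write(json.dumps(entry) + "\n")
          return _append

      # Ctrl+Alt+1 = sucked, Ctrl+Alt+2 = meh, Ctrl+Alt+3 = good
      with keyboard.GlobalHotKeys({
          "<ctrl>+<alt>+1": rate(1),
          "<ctrl>+<alt>+2": rate(2),
          "<ctrl>+<alt>+3": rate(3),
      }) as hotkeys:
          hotkeys.join()
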
  2. Model‑per‑day rotation instead of side‑by‑side
    I kinda disagree with the constant multi‑tool switching idea. Using 3 copilots at once tends to bias results toward “the one you used on easy stuff.”

    • Pick a schedule like:
      • Mon/Tue: GitHub Copilot
      • Wed: Cursor / Codeium
      • Thu/Fri: Claude / OpenAI
    • Keep everything else fixed: same editor, same plugins.
    • Tag each day in your log / rating system by model (a tiny helper for that is sketched right after this list).
      You’ll get a cleaner signal on “how it feels to live with this thing” over time.
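
    A minimal sketch of that tagging helper, assuming the Mon–Fri schedule above (adjust the table to your own rotation; the logger from point 1 can call todays_model() instead of hardcoding the model):

      # which_model.py - look up the scheduled model for today.
      import datetime

      SCHEDULE = {
          0: "github-copilot",  # Mon
          1: "github-copilot",  # Tue
          2: "codeium",         # Wed
          3: "claude",          # Thu
          4: "claude",          # Fri
      }

      def todays_model():
          # Weekends fall through to "none" - no copilot, enjoy your life.
          return SCHEDULE.get(datetime.date.today().weekday(), "none")

      if __name__ == "__main__":
          print(todays_model())
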
  3. Track pain instead of just success
    Most folks track “did it work,” but what usually makes a copilot bad isn’t that it never works; it’s that it wastes your time. Things that are actually measurable with minimal setup:

    • Count how often you hit “undo” right after an AI suggestion. If you use a consistent keybinding to trigger completions, you can watch for:
      • “AI key pressed → large paste → undo within 3 seconds.”
    • Count how many times a file switches from “tests passing” to “tests failing” within 2 minutes of inserting AI code. A simple git diff + test runner hook can approximate this (rough sketch after this list).
      That effectively measures “regret rate” instead of just success rate.
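
    A minimal sketch of the test‑runner half, assuming pytest as the test command and a made‑up log location; run it wherever your tests normally run (editor task, file watcher, pre‑commit hook):

      # regret.py - log each test run, then count pass->fail flips within 2 minutes.
      # Usage: python regret.py          (run tests, log the outcome)
      #        python regret.py report   (count regret events so far)
      import datetime
      import json
      import subprocess
      import sys
      from pathlib import Path

      LOG = Path.home() / "test_results.jsonl"  # hypothetical
      TEST_CMD = ["pytest", "-q"]               # swap in your real test runner

      def run_and_log():
          passed = subprocess.run(TEST_CMD).returncode == 0
          entry = {"ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                   "passed": passed}
          with LOG.open("a") as f:
              f.write(json.dumps(entry) + "\n")
          return passed

      def regret_count(window_seconds=120):
          entries = [json.loads(line) for line in LOG.read_text().splitlines()]
          flips = 0
          for prev, cur in zip(entries, entries[1:]):
              gap = (datetime.datetime.fromisoformat(cur["ts"])
                     - datetime.datetime.fromisoformat(prev["ts"])).total_seconds()
              if prev["passed"] and not cur["passed"] and gap <= window_seconds:
                  flips += 1
          return flips

      if __name__ == "__main__":
          if len(sys.argv) > 1 and sys.argv[1] == "report":
              print(f"regret events: {regret_count()}")
          else:
              sys.exit(0 if run_and_log() else 1)
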
  4. Language / stack‑specific scoring
    All the leaderboard stuff is nice, but it hides a big issue: a model that’s great at Python might be terrible on your React + TypeScript + weird internal API setup.

    • In your logs or spreadsheet, always store language and maybe project (the weekly roll‑up script at the end of this post groups by exactly these fields).
    • After a few weeks, you might find something like:
      • Copilot: awesome in TS, mediocre in infra
      • Another: better in SQL / data stuff
        Then you can route your tasks: one copilot for frontend, another for data / scripting.
  5. Simple “time cost” metric
    Instead of trying to auto‑measure response time per request, just track:

    • “How many minutes from first AI suggestion to green tests?”
      Use a small CLI “timer” you start and stop manually per task, and note the model used (rough sketch below). Very low tech, but it gives you an actual business‑relevant metric: “who gets me to done faster.”
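
    A minimal sketch of that timer, with made‑up file locations:

      # taskclock.py - manual per-task timer, tagged with the model in use.
      # Usage: python taskclock.py start github-copilot
      #        python taskclock.py stop
      import datetime
      import json
      import sys
      from pathlib import Path

      STATE = Path.home() / ".taskclock.json"   # the task currently running
      LOG = Path.home() / "task_times.jsonl"    # finished tasks

      def now():
          return datetime.datetime.now(datetime.timezone.utc)

      if sys.argv[1] == "start":
          STATE.write_text(json.dumps({"ts": now().isoformat(), "model": sys.argv[2]}))
      elif sys.argv[1] == "stop":
          start = json.loads(STATE.read_text())
          minutes = (now() - datetime.datetime.fromisoformat(start["ts"])).total_seconds() / 60
          with LOG.open("a") as f:
              f.write(json.dumps({"model": start["model"], "minutes": round(minutes, 1)}) + "\n")
          print(f"{start['model']}: {minutes:.1f} min to done")
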
  6. UX friction as a metric
    This is super subjective, but still quantifiable with a 1–5 rating logged at the end of a coding session:

    • 1 = I turned this thing off out of frustration
    • 3 = neutral, didn’t help much
    • 5 = I would pay real money for this today
      Append one line to the log when you stop for lunch or end your day (same JSONL format as the hotkey ratings, just with a session score instead). That “session score” will often correlate better with reality than synthetic benchmarks.

If you want a “closest thing to a rank tracker” without going full data‑scientist:

  • Rotate tools by day.
  • Use quick hotkeys to rate only meaningful suggestions.
  • Once a week, throw the log into a simple script or spreadsheet (like the roll‑up sketch below) and graph:
    • avg score per model
    • per language
    • maybe per project
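
Here’s what that weekly roll‑up can look like, a stdlib‑only sketch assuming the JSONL format from point 1 plus an optional project field:

    # weekly_report.py - average rating per model / language / project.
    import json
    from collections import defaultdict
    from pathlib import Path

    LOG = Path.home() / "copilot_ratings.jsonl"  # same file the hotkeys write

    def report(key):
        # Bucket every score by the given field, then print the averages.
        buckets = defaultdict(list)
        for line in LOG.read_text().splitlines():
            entry = json.loads(line)
            buckets[entry.get(key, "unknown")].append(entry["score"])
        print(f"--- avg score by {key} ---")
        for name, scores in sorted(buckets.items()):
            print(f"{name:20s} {sum(scores) / len(scores):.2f}  (n={len(scores)})")

    for key in ("model", "lang", "project"):
        report(key)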

You won’t get a fancy dashboard with Elo scores for copilots, but you will know which one actually pulls its weight in your stack, for free, without turning your life into a research project.