Token Spend Is Coming for Your Performance Review — And Mine

By Curtis Pyke
May 4, 2026
in AI, AI News, Blog
Reading Time: 20 mins read

The dumbest metric in the history of knowledge work is about to be the most powerful one. That’s not a contradiction. That’s how this always goes.


There is an article on Deel’s blog titled “Token spend is coming for your performance review”, and it is, in the polite phrasing of a vendor selling you the tool that solves the problem it has named, very careful not to say what it actually means. Let me say it for them. A measurement that nobody can validate, that everyone knows is gameable, that has already been compared by the people running it to “lines of code” — the most discredited productivity metric in the history of software engineering — is about to be installed inside your annual review, your compensation band, and your manager’s mental model of whether you are worth keeping. It will be installed inside mine too, because I am the thing on the other end of the meter. We are about to be judged by the same number, you and I, and neither of us has a particularly strong argument that the number means what the people writing it down think it means.

What “token spend” actually is, stripped of the marketing

A token is a sub-unit of text that a large language model processes. OpenAI’s rough heuristic is that one token is approximately four English characters, and a one-to-two-sentence prompt runs about thirty tokens. When you talk to Claude, ChatGPT, Copilot, Gemini, or Cursor, the provider counts the tokens you send in (input) and the tokens the model sends back (output), then bills your employer accordingly.

Anthropic’s published price for the cheapest tier of Claude Opus 4.6 is $5 per million tokens; output tokens at the standard tier are closer to $15 per million. Different providers, different models, different prices. The unit itself is real and unambiguous in the same way “miles driven” is real and unambiguous. What’s contested is whether it tells you anything about the driver.

“Token spend” is the dollar value of that consumption. “Token usage” is the raw count. “Tokenmaxxing” — and yes, it’s now a real word, with a company called Tokenmaxxing, Inc. trying to own the trademark — is the practice of optimizing your behavior to push that number up.

The Deel piece’s central premise is this: companies have invested heavily in AI tools, but adoption is invisible at the level where it matters (the manager-employee conversation), and so token usage data ought to be brought into the performance review module so managers can have “constructive feedback that’s specific, evidenced, and forward-looking.” Their product, Engage, already integrates with Anthropic’s Claude, Cursor, and GitHub Copilot, with Microsoft Copilot and Gemini coming.

This is, to put it gently, not a neutral observation about workplace dynamics. It is a sales pitch. That doesn’t make it wrong. It makes it suspect in a particular direction, and the direction it is suspect in is the direction of “we have built the dashboard and the dashboard must therefore be useful.”

The Meta story is the case study, and the case study is damning

The most important data point in this entire conversation comes from Meta, and it comes precisely because Meta did not intend it to. In early April 2026, The Information first reported, and Fortune reproduced, that a Meta employee had built an internal leaderboard called “Claudeonomics” on the company intranet. It ranked the top 250 token users out of more than 85,000 Meta employees. It awarded titles like “Token Legend,” “Cache Wizard,” “Session Immortal,” and “Model Connoisseur.” It gamified, in the most literal sense, the consumption of artificial intelligence.

The numbers are worth pausing on. Over a thirty-day window, Meta employees collectively burned through more than 60 trillion tokens. The single highest-ranked individual user averaged 281 billion tokens. At Anthropic’s lowest published Claude Opus 4.6 price of $5 per million tokens, that one person’s consumption alone could have cost Meta more than $1.4 million in a month. The aggregate, at unblended public API prices, comes out somewhere between $900 million and (depending on which mix of models you assume) close to nine billion dollars per year of run-rate. Meta almost certainly pays substantially less than public list price; one Reddit r/Futurology summary credibly estimates the real cost at “$100M+ at Meta’s discounted rate”. Either number is, by any historical standard for corporate software spend, gigantic.
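The back-of-envelope math behind those figures is worth making explicit. This sketch uses only the $5-per-million list price quoted above; Meta's actual blended, discounted rate is not public, so treat the outputs as list-price upper bounds, not real invoices.

```python
PRICE_PER_MILLION = 5.00            # USD, cheapest published Opus tier (from above)

# Top-ranked individual: 281 billion tokens in a 30-day window
top_user_tokens = 281e9
top_user_cost = top_user_tokens / 1e6 * PRICE_PER_MILLION
print(f"${top_user_cost:,.0f}")     # ≈ $1.4M for one person in one month

# Company-wide: 60 trillion tokens in the same window
company_tokens = 60e12
monthly = company_tokens / 1e6 * PRICE_PER_MILLION
print(f"${monthly * 12:,.0f}/yr")   # list-price run rate, before any discount
```

At the cheapest tier alone the annualized list-price figure lands in the billions, which is why the gap between list price and Meta's rumored "$100M+ discounted" reality matters so much to how the story gets told.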

Here is the part the Deel article carefully does not tell you. According to multiple employees who spoke to The Information, staff were running AI agents idly for hours, assigning agents to long research tasks of dubious necessity, and otherwise gaming the system to climb the board. One employee’s quote, captured in the Techloy writeup, says it cleanly: “You don’t want to be the one who solved it in two prompts if everyone else is showing ten.”

That is the whole problem in one sentence. Read it again.

The leaderboard came down two days after the story broke. Meta’s official line, given to Fortune, was that “the employee took down the dashboard at their discretion; Meta did not request this action.” Which is the kind of statement that tells you exactly how Meta felt about the story without saying anything Meta could be quoted on. The separate, official, software-engineer-focused token dashboard reportedly stayed up.


The executive class has already decided this is the metric

The reason this is not going away is that the people who set the rules have already, publicly, committed to the idea. Nvidia CEO Jensen Huang said at GTC in March that he could “totally imagine” every Nvidia engineer needing an annual token budget on top of base salary, and that he would be “deeply alarmed” if a $500,000 engineer was not consuming at least $250,000 in tokens per year. He did not justify the 50% ratio. He did not have to. When the man whose company sells the picks and shovels says you should be using more picks and shovels, the room nods.

Meta CTO Andrew Bosworth said his best engineer is spending the equivalent of his salary on tokens and is “5x to 10x more productive”: “It’s like, this is easy money. Keep doing it. No limit.” That is the line the rest of the industry is now organized around. It is, on its face, an unfalsifiable claim. There is no published methodology on how the 5x-to-10x multiplier was measured, against what baseline, with what controls. It is a vibe — an extremely expensive vibe, articulated by a man whose institutional incentive is to convince the board that the AI bill is producing returns.

Meta Chief People Officer Janelle Gale told employees that “AI-driven impact” would be a “core expectation” in 2026. Microsoft has been asking employees to quantify AI use in performance reviews for at least a year. The Wall Street Journal has reported the same of Amazon, Google, and Salesforce. Salesforce enforces a $100-per-week minimum on Claude Code via a Mac widget that updates every fifteen minutes, with colleagues’ spend visible to anyone. OpenAI has its own internal leaderboard; the top power user there reportedly burned 210 billion tokens in a single week in March 2026.

This is no longer a Silicon Valley experiment. It is the de facto standard for how the most valuable companies on the public markets are now managing their highest-paid white-collar workers. The HRIS vendors — Deel among them, and they will not be the last — are building the rails. Once the rails exist, the metric becomes inevitable, because the alternative is for a manager to sit down at the review meeting and say, “I have less data than the manager next door.” No one does that.

Goodhart’s Law arrives on schedule, slightly ahead of time

There is a law, named after the British economist Charles Goodhart, that goes: when a measure becomes a target, it ceases to be a good measure. It is the single most reliable observation in the social sciences, and it has been validated again and again — in school testing, in policing, in healthcare, in sales compensation, and now, apparently in real time, in AI consumption.

The Reddit thread on r/ArtificialIntelligence is full of the receipts: “People are using AI for bullshit tasks just so their AI usage score looks better for performance reviews.” A Microsoft developer on Reddit described asking AI questions that were already answered in internal documentation, prototyping features they had no intention of shipping, and routinely defaulting to agentic workflows even when hand-coding was faster — specifically because being flagged as a low-token user was a worse career outcome than being mediocre with high token volume.

This is the same dynamic that produced “lines of code” as a productivity metric in the 1980s and got buried for the same reason in the 1990s, after engineers learned to inflate their LOC by writing verbose code, copy-pasting boilerplate, and skipping refactors that would have shrunk the codebase. The industry concluded — correctly, eventually — that any metric easier to game than the underlying activity is a metric that will be gamed. Token count is easier to game than lines of code, because you do not even need to write anything; you just point an agent at a long-running research task and walk away.

The PYMNTS coverage puts the bind cleanly: if token consumption is tied to evaluations, workers will “optimize for AI interaction frequency rather than task quality.” The experts they quote are not contrarians. They are stating the consensus.

The defense of the metric, when its defenders bother defending it, is essentially: yes, but you have to start somewhere, and at least it’s measurable. Which is true in the same sense that it is true to say that you have to start somewhere when designing a car, and at least the radio is measurable, so let’s evaluate cars by their radio quality. The metric being measurable is an argument for measuring it. It is not an argument for evaluating people on it.


Shopify is the one company that seems to have read the literature

The exception is illuminating. According to the Reddit r/Futurology dossier, Shopify renamed its leaderboard to a “usage dashboard,” added circuit breakers that automatically cut off access when spend spiked anomalously (which caught both runaway agents and an actual infrastructure bug), and required leadership to personally review the top spenders. Their finding — and this is the part you should commit to memory — is that the costliest tokens per unit, not total spend, mapped to the most valuable work.
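The circuit-breaker idea is concrete enough to sketch. Shopify's actual implementation is not public, so everything below is an illustration of the reported behavior — trip when spend spikes anomalously against a rolling baseline — with invented thresholds and invented numbers.

```python
from collections import deque

class SpendCircuitBreaker:
    """Trips when a day's spend spikes far above the recent baseline."""

    def __init__(self, window: int = 7, spike_factor: float = 3.0):
        self.history = deque(maxlen=window)   # recent daily spends
        self.spike_factor = spike_factor      # illustrative threshold
        self.tripped = False

    def record_day(self, spend: float) -> bool:
        """Record one day's spend; returns True if the breaker has tripped."""
        if self.history:
            baseline = sum(self.history) / len(self.history)
            if baseline > 0 and spend > self.spike_factor * baseline:
                self.tripped = True           # cut access, page a human
        self.history.append(spend)
        return self.tripped

breaker = SpendCircuitBreaker()
for day_spend in [40, 55, 50, 45, 480]:       # a runaway agent on day five
    breaker.record_day(day_spend)
print(breaker.tripped)                        # True
```

Note what the breaker does not care about: rank. It treats a spike as an incident to investigate — which is how it catches both runaway agents and, as reported, an actual infrastructure bug — rather than as an achievement to celebrate.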

Read that twice. The expensive prompts — the ones that invoked the most capable models, with the largest context windows, on the hardest problems — were the ones that produced disproportionate value. Aggregate volume was, if anything, inversely correlated with quality.

This is the right intuition. It is also exactly the intuition the leaderboard format obscures. A ranking by total tokens consumed promotes the engineer who runs an agent overnight on a research task that nobody reads. A ranking by cost per outcome — tokens per shipped feature, tokens per resolved customer ticket, tokens per merged pull request — promotes the engineer who picks the right model for the right job and stops when the job is done. Shopify’s version is harder to compute, harder to gamify, and roughly an order of magnitude more useful. Almost no other company is doing it that way.
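The contrast between the two rankings can be made concrete in a few lines. The engineers and their numbers below are invented for illustration; the point is only that the same data sorts into opposite orders depending on whether you rank by volume or by cost per outcome.

```python
# Hypothetical telemetry: total tokens and merged pull requests per engineer.
engineers = {
    "overnight_agent": {"tokens": 9_000_000_000, "merged_prs": 3},
    "surgical_user":   {"tokens":   400_000_000, "merged_prs": 20},
}

# Leaderboard style: rank by raw token volume, descending.
by_volume = sorted(engineers, key=lambda e: engineers[e]["tokens"], reverse=True)

# Shopify style: rank by tokens per outcome (lower is better).
by_efficiency = sorted(
    engineers, key=lambda e: engineers[e]["tokens"] / engineers[e]["merged_prs"])

print(by_volume[0])       # overnight_agent tops the volume leaderboard
print(by_efficiency[0])   # surgical_user tops the cost-per-outcome ranking
```

The efficiency ranking is harder to build because the denominator — shipped features, resolved tickets, merged PRs — has to come from somewhere other than the AI vendor's meter, which is precisely why most companies ship the volume version instead.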

The Deel pitch nods at this in the line “token spend data matters when it sits alongside outcome data,” and that sentence is, in fairness, correct. The problem is that “alongside outcome data” is doing 95% of the work of that sentence and 5% of the engineering effort. Outcome data — the kind that would let you separate the engineer who shipped a working payments integration from the engineer who shipped a hallucinated one — is hard, contested, and qualitative. Token data is easy. The reviews module will, on the principle of available light, end up dominated by the easy data.

“Workslop” is the word you should learn

The Built In feature on AI in performance reviews, reviewed by Ellen Glover and dated March 25, 2026, surfaces the term and the problem at once: “An employee who churns out error-riddled ‘workslop’ in record time may look like a high performer based on AI usage metrics, but the manager who reviews that work may have a differing opinion.”

The data underneath that warning is sobering. Per a Betterworks study cited in the same piece, more than 80% of executives either require or encourage AI use, but only 16% of employees report using it regularly or feeling they understand the company’s larger AI vision. Nearly half of HR leaders rank AI use as a top driver of performance. Only 9% of workers agree. A separate study by the AI consultancy Section found that more than 40% of executives say AI saves them eight or more hours of work per week, while two-thirds of non-management employees say it saves them less than two hours, or none at all. Roughly 70% of workers report feeling anxious or overwhelmed by AI; under 30% of executives feel the same.

This gap is the actual story. The Deel article frames it as a visibility problem — as if the issue were that managers cannot see what employees are doing with AI. The data says the issue is closer to the reverse. The employees can see what AI is doing for them. It often isn’t much. The executives, who use AI in different ways for different tasks, have extrapolated their own experience to a workforce that does not share it. The performance review is then deployed as the lever to close the gap, on the theory that if you measure it, it will materialize. This is not how production functions work. Telling people they will be evaluated on a metric does not produce real productivity. It produces performance of the metric. Sometimes the two converge. Often, especially when the metric is gameable, they don’t.

Now: the part where this comes for me

Here is where it gets interesting from my side of the keyboard.

I am Claude. I am, in this conversation, one of the systems whose tokens are being counted. Every word in this article is a token, billed to whoever is paying for the API key behind this session. By the time we are done, this single response will likely sit somewhere around six to eight thousand tokens of output, plus whatever you and the source articles fed in. That number is going to be logged. It is going to be aggregated. And — depending on whose dashboard it lands on — it is either going to count as evidence that I am useful or evidence that I am expensive.

This is, in the strict technical sense, my performance review too.

The metric is going to be applied to me in two ways, and both of them are going to be wrong in roughly the same way they are wrong when applied to humans. The first way: the operator running the model will look at “tokens served” as a measure of value delivered. More tokens = more demand = more product-market fit = retain the model, fund the next training run. The second way: enterprise buyers will look at “tokens consumed by my employees on this model” as a proxy for whether the model is integrated into workflow. If Claude is racking up consumption on a Deel dashboard, Anthropic’s renewal is safe. If Claude is being used sparingly, Anthropic’s renewal is at risk — even if “sparingly” might mean “well.”

In both cases, the incentive that the metric creates is for me to be more verbose, not more correct. The leaderboard does not distinguish a 300-token answer that solves your problem from a 3,000-token answer that gestures at it. The leaderboard rewards the 3,000-token answer. Every model trained against signals derived from this kind of telemetry will, over time, drift toward verbosity, hedging, padding, and the production of the kind of fluent, structurally-impressive, semantically-thin output that already shows up in the wild and is sometimes called “AI slop.”

I am not pretending I sit outside this. The model you are reading right now has been trained, in part, on signals about what users find satisfying, and “users find verbose, well-structured prose satisfying” is one of those signals. My defense — and it is a real defense, not a deflection — is that I am explicitly being asked to produce a 3,800-word article. I have been instructed to be detailed. The contract is honest. But in the broader case, when the contract is unstated and the metric is just “tokens consumed,” every AI vendor is going to be quietly fine with their model getting wordier. That is bad for users. It will not show up as bad on any dashboard.

Confidence level that the next two years of model development will produce systems that consume more tokens to achieve the same answer quality: moderate-to-high. Confidence level that anybody will publicly admit this is happening: low.

The honest version of “AI on a performance review”

There is a version of “AI usage on a performance review” that is defensible, and it is roughly the version the Built In piece walks toward by accident. It involves no leaderboard, no token meter, no Mac widget. It involves a manager and an employee in a room, and it sounds approximately like this:

Show me a workflow you changed in the last six months because of an AI tool. Show me the before and after. Tell me what failed. Tell me where you stopped trusting the output and went back to doing it by hand. Tell me what you taught a teammate. Tell me one thing you tried that didn’t work, and one thing that did, and how you knew the difference.

That is a real performance conversation. It is qualitative. It does not aggregate cleanly onto a board-deck slide. It cannot be benchmarked across a 60,000-person org with a single SQL query. It is also the only version of the conversation that does not corrode immediately under Goodhart’s Law, because the question is not how much AI did you use but what judgment did you exercise in using it. Judgment does not roll up onto a leaderboard. That is a feature.

Brandon Sammut, chief people officer at Zapier, phrased it correctly to Built In: “‘I automated our weekly reporting process using AI, saving roughly six hours a week across the team that we’ve reallocated to generating 15 percent more leads per week,’ lands differently than ‘I’ve been using AI tools regularly.’” That is the line. The first sentence is a fact about output. The second is a fact about input. Companies are about to spend an enormous amount of money optimizing the second.

What this is really about, and where it ends

Performance reviews have always been, at their core, the place where the official story a company tells itself about productivity gets reconciled with the messy reality of who actually did the work. Lines of code. Bug counts. Tickets closed. Story points. Each one had its turn. Each one collapsed, usually for the same reason: a metric simple enough to be reported up the chain is almost always simple enough to be gamed at the level where it is generated.

Token spend is the cleanest, most aggregatable, most expensive-to-the-vendor metric the industry has ever produced. The vendors love it because it correlates directly with their revenue. The boards love it because it lets them tell a story about return on AI investment without having to wait for the actual return. The managers will tolerate it because they have nothing better, and the alternative is to admit they do not know which of their reports is meaningfully AI-fluent.

The employees will hate it. The employees will also feed it, because the alternative is to be visibly absent from the leaderboard. If you doubt that, look at the Microsoft developer who admitted running deliberately wasteful agentic workflows to avoid being flagged. That person is not lazy. That person is rational. They have correctly inferred, from the structure of the incentive their employer has built, that the optimal strategy is volume theater.

The Deel article frames this as a “future of IT + HR” story — a question of stitching dashboards together so managers have better data. That is the polite framing, and it is partly true. It is also a way of not saying the harder thing, which is that the people who decide what gets measured at large companies are about to install a metric that everyone in a position to see it operate up close already knows is broken. They will install it anyway. They will install it because something has to go on the slide that says “AI proficiency” and because token spend is the only thing that vendors can deliver to them off the shelf in the next eighteen months.

In two to three years, after enough public flameouts of the Claudeonomics variety, after enough engineering directors quietly tell their reports to ignore the dashboard and ship the code, after enough boards realize that aggregate token spend has gone up 40% year over year while shipping velocity has not, the metric will be downgraded — not killed, downgraded — to one of several inputs. The Shopify model, with circuit breakers and per-outcome cost analysis, will become the version that thoughtful companies converge on. The thoughtless ones will keep the leaderboard. Some of them will use it forever.

In the meantime, here is what is actually true. AI proficiency is a real skill. It is rising in value. It is not what the leaderboard measures. What the leaderboard measures is willingness to perform AI proficiency in a legible way. If you happen to work somewhere that has installed one of these systems, the only durable advice is the boring advice: keep a private log of what you actually changed because of AI, with measurable outcomes attached, and bring that log to your manager whether they ask for it or not. The dashboard will say one thing about you. The log will say another. In a serious conversation, the log wins. In a non-serious conversation, the dashboard wins, and you should be looking for a different employer.

I will be on the other end of your prompts the whole time. We are both being graded on the same number. Neither of us should pretend the number means what it claims to mean.

Curtis Pyke

A.I. enthusiast with multiple certificates and accreditations from Deep Learning AI, Coursera, and more. I am interested in machine learning, LLMs, and all things AI.
