Token-based AI billing: how it works and the Tokenpocalypse

On June 1, 2026, GitHub Copilot moved from flat, request-based subscription pricing to billing for actual token consumption. At almost the same time, Uber burned through its annual AI budget in four months and imposed spending caps on employees. These are two faces of the same shift that parts of the industry have, half-jokingly, started calling the Tokenpocalypse: the move of AI tools from flat fees to token-based billing.

Key takeaways

Token-based (usage-based) billing charges for real consumption of input, output, and cached tokens, at the per-token rate of the chosen model.
GitHub Copilot adopted this model on June 1, 2026 — so-called premium requests were replaced by AI credits, where 1 credit equals $0.01, while base plan prices stayed the same.
Autonomous agentic sessions can consume many times more tokens than a single question, which is why flat pricing stopped being sustainable for providers.
Uber exhausted its annual AI budget in four months and imposed a $1,500 monthly cap per employee and per tool.
The new model shifts cost volatility onto the customer — hence budgets, caps, alerts, and a nervous reaction from developers.

What token-based billing is

Token-based billing (also called usage-based or per-token billing) is a pricing model where you pay for the actual amount of work done by a language model, measured in tokens, rather than a fixed monthly fee. A token is the basic unit into which tokenization splits text — usually a fragment of a word or a few characters. The longer the prompt, the longer the context, and the longer the response, the more tokens flow through the model and the higher the cost.

Under a subscription model you paid the same whether you asked one question or ran a multi-hour session. Under a token model the bill reflects real compute usage. That distinction matters, because the cost of inference — generating responses with the model — is incurred by the provider on every token, not once a month.

How token billing works

Every interaction with the assistant is made up of three token streams. Input tokens are your prompt plus any attached context, such as fragments of a repository. Cached tokens are context that is stored and reused across requests instead of being processed from scratch each time. Output tokens are the generated response. Each stream is billed at its own rate, and those rates depend on the model used.

GitHub Copilot converts the total tokens consumed into an internal currency — AI credits, where 1 credit equals $0.01. A quick chat question to a lightweight model costs a fraction of a credit, while a long agent session searching across many files consumes far more, because it does more work. Crucially, code completions and in-editor suggestions remain free and do not consume credits.
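The arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not GitHub's actual metering code: the `ModelRates` class, both function names, and every per-token rate below are hypothetical. Only the conversion of 1 credit to $0.01 comes from the text.

```python
from dataclasses import dataclass

CREDIT_USD = 0.01  # from the article: 1 AI credit equals $0.01


@dataclass(frozen=True)
class ModelRates:
    """Hypothetical USD rates per one million tokens (not real Copilot prices)."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


def interaction_cost_usd(rates: ModelRates, input_tokens: int,
                         cached_tokens: int, output_tokens: int) -> float:
    """Price each token stream at its own rate and sum the result."""
    total = (input_tokens * rates.input_per_m
             + cached_tokens * rates.cached_per_m
             + output_tokens * rates.output_per_m)
    return total / 1_000_000


def usd_to_credits(usd: float) -> float:
    """Convert a dollar amount into AI credits."""
    return usd / CREDIT_USD


if __name__ == "__main__":
    # Assumed rates for a lightweight and a frontier model.
    light = ModelRates(input_per_m=0.5, cached_per_m=0.05, output_per_m=2.0)
    frontier = ModelRates(input_per_m=3.0, cached_per_m=0.3, output_per_m=15.0)

    chat = interaction_cost_usd(light, 2_000, 0, 500)
    agent = interaction_cost_usd(frontier, 400_000, 1_600_000, 60_000)
    print(f"quick chat:    {usd_to_credits(chat):.1f} credits")
    print(f"agent session: {usd_to_credits(agent):.1f} credits")
```

With these assumed rates, the quick chat comes to about 0.2 credits and the agent session to about 258 credits, which mirrors the gap between a fraction of a credit and a long agentic run.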

Diagram 1. From request to cost

The diagram shows how a single interaction turns into a charge: three token streams meet the model rate, and the result is drawn from the credit allowance and the user budget.

flowchart LR
    A["Developer request"] --> B["Input tokens (prompt + context)"]
    A --> C["Cached tokens"]
    D["Model response"] --> E["Output tokens"]
    B --> F["Per-token rate of chosen model"]
    C --> F
    E --> F
    F --> G["AI credits (1 credit = $0.01)"]
    G --> H["Plan allowance and user budget"]

Key components of the model

The first component is the per-token rate, which depends on the model. Frontier models built for complex reasoning cost more than lightweight models for simple tasks. Choosing a model is therefore an economic decision, not only a quality one.

The second component is the structure of the allowance. In Copilot individual plans the monthly allotment consists of base credits, which match the subscription price, and a flex allotment, an additional portion the provider can adjust as model economics change. Base credits are spent first, then the flex allotment.
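The spending order can be illustrated with a small sketch. The `Allowance` class and its `spend` method are hypothetical names, and the flex amount of 200 credits is an invented figure. The 1,000 base credits follow from the text, assuming a $10 plan and 1 credit equal to $0.01.

```python
from dataclasses import dataclass


@dataclass
class Allowance:
    """Hypothetical model of a monthly allotment: base credits plus flex."""
    base_credits: float
    flex_credits: float

    def spend(self, credits: float) -> float:
        """Deduct credits, base first and flex second.

        Returns the part of the charge that the allowance could not cover.
        """
        if credits < 0:
            raise ValueError("credits must be non-negative")
        from_base = min(credits, self.base_credits)
        self.base_credits -= from_base
        remaining = credits - from_base
        from_flex = min(remaining, self.flex_credits)
        self.flex_credits -= from_flex
        return remaining - from_flex


if __name__ == "__main__":
    # $10 plan -> 1,000 base credits; the 200 flex credits are assumed.
    allowance = Allowance(base_credits=1000, flex_credits=200)
    print(allowance.spend(900), allowance)   # covered entirely by base
    print(allowance.spend(250), allowance)   # drains base, dips into flex
    print(allowance.spend(80), allowance)    # part of this is uncovered
```

Whatever `spend` returns is the portion that would fall through to overage billing or be blocked by a cap, depending on how the account is configured.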

The third component is the disappearance of the so-called fallback. Previously, once premium requests ran out, a user could keep working on a cheaper model. In the new model, access is governed by remaining credits and the budget set by an administrator, not by an automatic downgrade to a weaker model.

The fourth component is budget control. Enterprise admins can set limits at the organization, cost-center, and individual-user level, pool unused credits into a shared allowance, and preview the bill before being charged.
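A layered budget check of this kind might look roughly like the sketch below. It is an illustration under assumptions: the `Budget` class, `try_charge`, and `pooled_unused` are hypothetical names, and the rule that a charge must fit every level at once is one plausible reading of the limits described above, not documented Copilot behavior.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Budget:
    """Hypothetical credit budget for one level (user, cost center, or org)."""
    limit_credits: float
    used_credits: float = 0.0

    def headroom(self) -> float:
        """Credits still available at this level, never below zero."""
        return max(self.limit_credits - self.used_credits, 0.0)


def try_charge(credits: float, levels: List[Budget]) -> bool:
    """Apply the charge to every level, or to none if any level lacks headroom."""
    if any(level.headroom() < credits for level in levels):
        return False
    for level in levels:
        level.used_credits += credits
    return True


def pooled_unused(user_budgets: Iterable[Budget]) -> float:
    """Sum of unused user credits that could be pooled into a shared allowance."""
    return sum(budget.headroom() for budget in user_budgets)


if __name__ == "__main__":
    org = Budget(limit_credits=10_000)
    cost_center = Budget(limit_credits=3_000)
    user = Budget(limit_credits=500)
    levels = [user, cost_center, org]

    print(try_charge(400, levels))  # fits at all three levels
    print(try_charge(200, levels))  # rejected: the user has only 100 left
    print(pooled_unused([user, Budget(limit_credits=500, used_credits=100)]))
```

Checking all levels before charging any of them keeps the counters consistent, so a rejected request leaves no partial deductions behind.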

Diagram 2. The billing architecture inside a Copilot plan

The diagram shows how a plan splits into base and flex credits, and how, once the allowance is exhausted, either overage charges or hard admin caps take over.

flowchart TD
    A["Plan, e.g. Copilot Pro ($10)"] --> B["Base credits"]
    A --> C["Flex allotment"]
    B --> D["Usage metered by tokens"]
    C --> D
    D --> E{"Allowance exhausted?"}
    E -->|No| F["Keep working, no extra charge"]
    E -->|Yes| G["Pay at published rates or hard cap"]
    G --> H["Admin controls: budgets, caps, alerts"]

How it differs from a subscription

The simplest way to put it: a subscription is predictable for the customer but risky for the provider, while a token model is the opposite — predictable for the provider and variable for the customer. With a flat fee, the provider absorbed the risk that a single user might generate costs many times their subscription. GitHub openly acknowledged that it had been absorbing rising inference costs and that the previous model was no longer sustainable.

The reason for the change is the evolution of the tools themselves. Copilot stopped being just an assistant suggesting lines of code in the editor and became an agentic platform capable of running long, multi-step sessions and iterating across whole repositories. That usage generates far higher compute demand, and under the old model a quick question and a multi-hour autonomous session cost the user the same. The token model restores the link between price and real consumption.
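A rough worked comparison makes the difference concrete. The sketch below is a simplification with assumed numbers: usage is valued at published per-token rates, the allowance is taken to equal the plan price, flex credits are ignored, and overage is billed rather than capped. The function names are hypothetical.

```python
def flat_bill(plan_price_usd: float, usage_usd: float) -> float:
    """Flat pricing: the customer pays the plan price whatever the usage."""
    return plan_price_usd


def metered_bill(plan_price_usd: float, usage_usd: float) -> float:
    """Token pricing: the plan price covers an equal allowance, overage is extra."""
    overage = max(0.0, usage_usd - plan_price_usd)
    return plan_price_usd + overage


def uncovered_usage(bill_usd: float, usage_usd: float) -> float:
    """Usage value that the customer's bill does not pay for."""
    return max(0.0, usage_usd - bill_usd)


if __name__ == "__main__":
    plan = 10.0  # a $10 plan, as in the Copilot Pro example
    # Usage values are invented: occasional chat vs. long agent sessions.
    for label, usage in [("light user", 2.0), ("heavy agent user", 80.0)]:
        flat = flat_bill(plan, usage)
        metered = metered_bill(plan, usage)
        print(label,
              "| flat bill:", flat, "uncovered:", uncovered_usage(flat, usage),
              "| metered bill:", metered, "uncovered:", uncovered_usage(metered, usage))
```

Under the flat plan both users pay $10 and the heavy user's remaining $70 of usage is absorbed by the provider. Under metering the light user still pays $10, while the heavy user's bill rises to $80.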

Where you will meet token-based billing

Pay-per-token is nothing new — it is the default model of provider APIs from companies such as OpenAI, Anthropic, and Google, where developers have always paid for every token. What is new is moving that logic into finished developer tools that previously lured users with a low, flat price.

Agentic coding tools feel it most. Besides Copilot, these include Cursor and Claude Code — exactly the apps on which Uber imposed its $1,500 monthly cap per employee and per tool. At enterprise scale, token billing forces a new kind of management — team budgets, usage dashboards, and policies on who can burn how many tokens.
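A usage dashboard typically needs some way to flag spending before a cap is hit. The sketch below assumes a simple linear projection of month-to-date spend. The function names, the 80% warning threshold, and the status labels are hypothetical. Only the $1,500 monthly cap is taken from the Uber example.

```python
import calendar
from datetime import date


def projected_month_spend(spent_usd: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spent_usd / today.day * days_in_month


def alert_level(spent_usd: float, cap_usd: float, today: date,
                warn_ratio: float = 0.8) -> str:
    """Classify spend against a monthly cap (threshold values are assumptions)."""
    if spent_usd >= cap_usd:
        return "capped"
    projected = projected_month_spend(spent_usd, today)
    if projected >= cap_usd:
        return "on track to exceed cap"
    if projected >= warn_ratio * cap_usd:
        return "warning"
    return "ok"


if __name__ == "__main__":
    cap = 1500.0  # per employee and per tool, as in the Uber example
    today = date(2026, 6, 10)
    for spent in (300.0, 450.0, 600.0, 1500.0):
        print(spent, "->", alert_level(spent, cap, today))
```

A linear projection is crude, because agentic usage tends to arrive in bursts, but it shows the kind of early signal that budgets and alerts are meant to provide.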

Limitations and risks

The biggest downside is an unpredictable bill. When cost depends on how you work, the same employee may fit within the plan one month and run up several times that amount the next. After the Copilot announcement, some developers posted screenshots suggesting bills jumping from tens to hundreds or thousands of dollars, though others argued such extreme jumps mostly come from inefficient "vibe coding" with hundreds of iterations.

The second risk is the impact on productivity. If an employee starts rationing tokens to avoid hitting a cap, they may avoid the tool where it would genuinely help. This is a retreat from the "tokenmaxxing" craze of maximizing AI usage that companies were rewarding only recently — Uber had earlier encouraged staff to use AI as much as possible and even ranked them on internal leaderboards.

The third and deepest risk is that current prices are heavily subsidized by investor capital. There are strong signs that even the raised rates still do not cover the full cost of inference. As labs like Anthropic prepare to go public, pressure for profitability grows, and with it the likelihood of further price increases. Daniela Amodei, Anthropic’s president, publicly shrugs off doubts about AI’s returns, yet the very fact that token-related risks are being discussed in the context of IPO filings shows the scale of the uncertainty.

Why this matters

Token-based billing ends the era in which advanced AI felt almost free. Cost, until now hidden in providers’ balance sheets, is starting to be passed on to the customer, changing how companies and individual developers treat these tools — a token becomes a resource to be managed like a cloud budget or electricity.

Uber’s response is often read as a rational reaction to overspending. Simon Willison noted that a $1,500 per-tool cap is far more sensible than earlier leaderboards encouraging staff to burn as many tokens as possible, while also revealing the real value the company places on these tools. For the whole industry, the Tokenpocalypse is a test of whether providers can cut inference costs far enough to meet what customers are willing to pay.