How to Choose the Right AI Model: A Practical LLM Routing Guide

The token hangover is here.

For a while, the default AI strategy was simple: put the smartest frontier model in every workflow and celebrate the usage chart. Strategy memo? Premium model. Code comment? Premium model. Lunch macros? Premium model. Five hundred thousand rows of low-stakes extraction? Premium model, because why not.

That made sense during the land-grab phase. Teams were proving that AI could work. They were training people to ask better questions. They were discovering use cases in the wild. But scale has a way of turning enthusiasm into invoices.

Gokul Rajaram recently framed the problem as “the token hangover” in his summary of Harry Stebbings’ 20VC interview with Factory CEO Matan Grinberg. The phrase works because it captures the awkward second morning of enterprise AI: usage went up, token bills went up, and business outcomes did not always rise in proportion.

The lesson is not “stop using frontier models.” That is the lazy backlash. Frontier models are extraordinary. The lesson is that frontier intelligence is too valuable to waste on work that a smaller, cheaper, open-weight, local, or task-specific model can do well enough.

The next serious question is not “Can we use AI?” It is: which model deserves this task?

That is the real AI stack problem. Not model worship. Model selection.

What Model Selection Actually Means

Before teams can route tasks intelligently, they need cleaner language. “Use an AI model” is too vague. Different model layers behave differently, cost differently, and create different operational risks.

Frontier models are the most capable general-purpose models from leading labs. They are the models you reach for when the job needs deep reasoning, complex planning, high-context synthesis, careful judgment, sophisticated coding, or agent orchestration.

Smaller proprietary models are hosted models from the same labs or platforms, optimized for speed, price, latency, and high-volume work. The official pricing pages make the spread obvious. OpenAI’s pricing, Anthropic’s pricing, and Google’s Gemini pricing all show materially different token rates across model tiers, service tiers, batch modes, and context options.

Open-weight models make model weights available under license, but that does not automatically make the model “open source.” Meta’s Llama family, for example, is distributed through official channels such as the verified Meta Llama organization on Hugging Face, with license terms users need to accept.

Open-source models should be treated more carefully. The Open Source Initiative’s Open Source AI Definition says an open-source AI system should provide freedoms to use, study, modify, and share, and the preferred form for modification includes data information, code, and parameters. In plain English: open weights are useful, but open weights alone are not always the same as open source.

Local models run on a laptop, workstation, private server, or dedicated cluster. They can be excellent for privacy-sensitive workflows, offline use, batch processing, internal document handling, and cost control, but they bring their own maintenance burden.

Hosted open models are open-weight or open models served through a cloud provider. They can reduce infrastructure pain while preserving some model flexibility, though the privacy and cost profile depends on the host.

Fine-tuned models are adapted to a narrow domain, brand voice, schema, tool use pattern, or support workflow. They can be powerful when the task is repetitive and measurable, but fine-tuning before the workflow is proven is usually theater with a training bill attached.

Embedding models do not write the final answer. They turn text, images, or other data into vectors for search, clustering, retrieval, deduplication, and recommendation. If your app has a knowledge base, product directory, or source-backed research layer, embeddings are part of the plumbing.

Multimodal models handle combinations of text, image, audio, video, documents, and sometimes screen or computer interaction. They matter when the input or output is not just text.

Agentic model stacks combine multiple models and tools. A frontier model might plan. A cheaper model might execute repetitive steps. An embedding model might retrieve context. A guardrail model might classify risk. The product is not one model. The product is the loop.

Frontier For Judgment, Smaller Models For Volume

The clean operating principle is this:

Use frontier models for judgment. Use open, local, smaller, and efficient hosted models for volume.

That does not mean every organization should mechanically force exactly 20% of usage through frontier models and 80% elsewhere. Treat the 10-20 / 80-90 split as a starting hypothesis, not a religion. Your real split depends on your work, risk tolerance, evaluation results, data sensitivity, latency needs, and willingness to operate infrastructure.

But as a default mental model, it is useful. Many organizations have a minority of work that truly requires frontier-level reasoning and a majority of work that is repetitive enough to route somewhere cheaper.

Chart showing frontier models for judgment and smaller open or local models for volume — A useful starting hypothesis: reserve frontier models for the minority of work where better judgment changes the outcome.

Use Frontier Models For

Complex planning
Multi-step reasoning
Ambiguous strategic decisions
Hard coding architecture
Debugging unfamiliar failures
Long-context synthesis
Novel tasks where the workflow is not known yet
High-value creative direction
Agent orchestration
Safety-sensitive review
Important business decisions
Any task where one bad answer is expensive

Use Smaller, Open, Local, Or Efficient Hosted Models For

Summarization
Classification
Tagging
Formatting
Extraction
Rewriting
Simple customer support drafts
Repetitive coding tasks
Data cleanup
Search query expansion
Internal document chunking
Batch processing
Private local workflows
Low-stakes automation
Any task where volume matters more than brilliance

This is not anti-frontier. It is pro-discipline. A Ferrari is a wonderful machine. It is still a strange choice for delivering 40,000 identical envelopes across town.

A Practical Model-Routing Table

The table below is intentionally practical. It is not asking which model is “best.” It is asking which class of model is good enough for the job, what the risk is, and when to escalate.

Job / task	Recommended model type	Why	Example workflow	Cost sensitivity	Risk level	When to escalate to frontier
Blog research	Small hosted plus frontier review	Source gathering is repetitive; synthesis needs judgment.	Cheap model summarizes sources, frontier model builds argument.	Medium	Medium	Conflicting sources, current events, or reputation-sensitive claims.
SEO content outline	Smaller hosted model	Structure and keyword mapping are patterned work.	Generate briefs from SERP notes, then editor reviews.	High	Low	High-stakes pillar page or complex positioning decision.
Final editorial polish	Frontier model	Taste, nuance, argument quality, and factual caution matter.	Run final draft through a frontier editor with source constraints.	Medium	Medium	Always for flagship content.
Code architecture	Frontier model	Architecture mistakes compound across the repo.	Use frontier model for design review before implementation.	Low	High	Always for unfamiliar systems or major changes.
Code refactoring	Smaller coding model plus tests	Many refactors are mechanical if tests exist.	Efficient model edits narrow functions; frontier reviews failures.	High	Medium	Cross-module behavior, security, data migration, or broken tests.
Test generation	Smaller coding model	Patterned, high-volume work with clear validation.	Generate unit tests, run suite, escalate failing logic.	High	Low to medium	When tests define critical business behavior.
Customer support	Smaller model with retrieval	Most drafts follow policy and knowledge-base context.	Retrieve account docs, draft reply, human approves edge cases.	High	Medium	Refund disputes, legal threats, safety, angry enterprise account.
Sales email personalization	Small hosted model	High-volume, low-stakes copy variation.	Personalize snippets from CRM fields and recent company notes.	High	Low	Strategic enterprise deal or sensitive relationship.
Contract summarization	Private/local extraction plus frontier review	Privacy and accuracy both matter.	Local model extracts clauses; lawyer and frontier model review.	Medium	High	Any interpretation, negotiation, or legal-risk call.
Legal review support	Frontier plus qualified human	Failure cost is high and regulated expertise matters.	AI flags issues; counsel makes the decision.	Low	High	Always. This is support, not legal advice.
Medical/health content support	Frontier plus expert review	High-stakes claims require careful sourcing.	AI drafts plain-English explanation; clinician reviews.	Low	High	Always for advice, diagnosis, treatment, or safety claims.
Data extraction	Small model or local model	Structured output is measurable and batchable.	Extract fields from invoices, resumes, PDFs, or product pages.	Very high	Low to medium	Ambiguous documents, low confidence, or compliance impact.
Image/video prompt generation	Smaller creative model	Variation matters more than deep reasoning.	Generate prompt variants, then use human taste for selection.	High	Low	Flagship campaign concept or strict brand/legal constraints.
Product strategy	Frontier model	The value is judgment, sequencing, tradeoffs, and synthesis.	Analyze users, market, constraints, and roadmap options.	Low	High	Always for strategic bets.
Financial analysis support	Frontier plus human review	Numerical and investment errors can be expensive.	Use AI for scenario summaries; finance owner validates.	Medium	High	Any forecast, investment, credit, or capital allocation decision.
Agent planning	Frontier planner	Bad plans waste every downstream tool call.	Frontier model decomposes goal, selects tools, sets checkpoints.	Medium	High	Always for long-horizon or external-action agents.
Agent execution	Cheaper executor model	Many steps are narrow once the plan is fixed.	Small model runs extraction, formatting, retries, and status updates.	High	Medium	Low confidence, tool error, permission change, or customer impact.
Embeddings / retrieval	Embedding model	Search needs representations, not essay writing.	Embed docs, retrieve passages, pass context to answer model.	High	Medium	When retrieval quality blocks the final answer.
Internal knowledge-base Q&A	Small model plus retrieval	Most answers should be grounded in company docs.	Retrieve source passages, answer with citations, escalate unknowns.	High	Medium	Policy, HR, legal, customer commitment, or no source found.
Social media repurposing	Smaller model	High-volume rewriting is not a frontier job.	Turn blog sections into posts, hooks, captions, and variants.	Very high	Low	Founder voice, crisis comms, or sensitive claims.
YouTube script writing	Frontier for outline, smaller for variants	Structure and originality matter; repurposing is routine.	Frontier builds arc; cheaper model creates intros, shorts, descriptions.	Medium	Medium	High-stakes sponsor, medical/finance topic, or public controversy.
AI product comparison pages	Small model for data, frontier for verdict	Extraction is volume; recommendation is judgment.	Collect pricing/features, then frontier model writes sourced analysis.	Medium	Medium	Ranking, “best” claims, pricing uncertainty, or affiliate risk.

The Router Decides Better Than The Ego

Individual users often think their task deserves the best model. That is human. Everyone’s task feels special from inside the task.

At company scale, that impulse burns money. A healthy AI stack should classify the work before choosing the model. The router should ask boring questions before the employee can spend glorious tokens.

Decision flow chart for an AI model router classifying task type risk reasoning privacy latency and cost — The router should classify the work before the user spends the tokens.

A simple router can work like this:

Identify the task type.
Estimate the risk level.
Estimate the reasoning depth required.
Check privacy and compliance needs.
Check latency requirements.
Check the cost ceiling.
Choose the model class.
Escalate only if confidence is low or stakes are high.

In pseudo-flow terms:

If task = low-risk + repetitive + structured -> use smaller/open/local model.
If task = ambiguous + high-value + requires planning -> use frontier model.
If task = private/sensitive + low reasoning -> use local/open/private model.
If task = high-stakes + regulated -> use frontier review plus human review.
If task = agent planning -> use frontier planner + cheaper executor model.

This is where the application layer starts to matter. The user should not have to become a pricing analyst every time they ask for help. The app should know the workflow, the risk, the budget, the data context, and the escalation path.

The Frontier Model Layer Still Matters

The right reaction to token waste is not frontier nihilism. It is frontier respect.

Use the strongest models where intelligence changes the outcome. Anthropic’s own model-selection guidance frames model choice around capabilities, speed, and cost, with more capable models reserved for complex reasoning, scientific or mathematical work, nuanced understanding, accuracy-sensitive applications, advanced coding, and high-autonomy agentic work.

That is exactly the point. Frontier models are planners. They can decompose messy goals, identify hidden constraints, and decide which subtasks can be delegated.

Frontier models are reviewers. They can catch the subtle problem in a cheap model’s draft, the unstated assumption in an internal memo, or the risk hiding in a generated support policy.

Frontier models are architects. When a codebase is unfamiliar, the failure mode is not just a bad line of code. It is a bad mental model of the system. Spend the tokens there.

Frontier models are strategic copilots. They help founders, CIOs, marketers, and product teams reason through tradeoffs. For Kingy.ai readers building AI products, that is where the leverage lives: not in asking the most expensive model to add commas, but in using it to decide what deserves to exist.

Frontier models are escalation layers. When the cheap path is uncertain, route up. When the answer affects money, customers, safety, regulation, or reputation, route up. When one mistake costs more than the whole day’s token bill, route up.

The Open, Local, And Smaller Model Layer Is The Operating Layer

Open-weight, local, and smaller models matter because most production AI work is not glamorous. It is repetitive. It is bounded. It happens in volume. It has schemas. It has logs. It has pass/fail tests.

Mistral’s current model overview is a useful sign of where the market is going: a mix of frontier-class, open, efficient, coding, speech, OCR, moderation, and embedding-oriented models. The same pattern appears across the broader ecosystem. Teams do not need one hammer. They need a rack of tools.

Open and smaller models can help with cost control, privacy, customization, fine-tuning, local deployment, lower latency in some setups, batch processing, and vendor resilience. They also reduce the risk that one provider’s price change, model deprecation, policy change, rate limit, outage, or terms update becomes your product’s emergency.

But be balanced. Running local or open models is not free magic. Someone must handle infrastructure, hosting, security, monitoring, inference optimization, evaluation, data handling, patching, model updates, and incident response. A hosted frontier API has cost. A self-hosted open model has operational gravity.

The right question is not “open or closed?” It is “what is the total cost per successful workflow, with the risk level we can accept?”

The Application Layer Wins

If multiple frontier labs are roughly competitive, the application layer gains power. The model becomes a replaceable intelligence supplier. The durable moat shifts toward workflow, data, UX, routing, evaluation, trust, distribution, and business outcome.

That is why AI products should be model-flexible. A serious app should be able to swap models based on price, speed, quality, context length, coding ability, multimodal ability, tool-use reliability, data policy, and uptime.

This is already visible in infrastructure. Vercel’s AI Gateway, for example, presents the pattern of one application endpoint sitting over multiple models and providers, with docs for provider options, fallbacks, observability, usage, billing, and bring-your-own-key patterns. You do not need that exact product to understand the strategic direction: the app wants leverage over the model market.

For builders listed in the Kingy AI tools directory, this matters. If your app is welded to one model, your margins, latency, reliability, and roadmap are partially owned by that provider. If your app separates workflow from model, you can route intelligently and renegotiate every week.

This is also why AI Launch Intelligence tracks not just flashy model drops, but the tooling around launches: agents, evals, gateways, workflow infrastructure, and distribution. The real product is often the system that turns model capability into repeatable output.

Cost sensitivity and failure risk matrix for choosing frontier smaller open local or human reviewed AI workflows — The right model depends on this job’s economics and failure cost, not on the user’s ego.

A Simple Model Selection Scorecard

Before choosing a model, score the workflow from 1 to 5 across these factors:

Reasoning depth required
Cost per task
Volume
Latency
Privacy
Context length
Accuracy requirement
Creativity requirement
Tool-use requirement
Multimodal requirement
Fine-tuning requirement
Failure cost
Regulatory sensitivity
Vendor lock-in risk
Deployment complexity

Then apply the practical rule:

If reasoning depth, ambiguity, failure cost, and business importance are high, escalate to frontier. If volume, repetition, privacy, and cost sensitivity are high, consider open, local, or smaller models. If the task is regulated, treat AI as support and keep qualified humans in the accountability loop.

Model selection scorecard radar chart for reasoning failure cost privacy volume latency context tool use and multimodal needs — Score the workflow first, then pick the cheapest model that clears the quality bar.

Example Stacks For Different Teams

Solo Creator Or Blogger

Use a frontier model for strategy, outline, angle selection, source skepticism, and final edit. Use smaller or open models for source summaries, title variants, formatting, metadata, social repurposing, and transcript cleanup. If you are building content workflows, pair this with a real publishing system rather than a folder of disconnected prompts. The goal is not more drafts. The goal is better finished work.

AI Startup

Use frontier models for product strategy, code architecture, agent planning, security review, and customer-facing high-stakes reasoning. Use open or smaller models for telemetry analysis, support drafts, internal tools, test generation, changelog drafts, PR summaries, and low-risk agent execution.

If you are in the “from code to customers” phase, the model is only part of the bottleneck. Kingy.ai has written about why distribution becomes the constraint for AI-built apps. Model routing helps margins, but positioning and adoption still decide whether the product matters.

Enterprise CIO

Use frontier models for high-stakes planning, cross-document synthesis, executive decision support, architecture review, and escalations. Use open, local, or smaller models for batch processing, internal Q&A, classification, compliance-controlled workflows, and repetitive document operations. The CIO version of AI maturity is not “everyone uses the biggest model.” It is observability, governance, routing, evaluation, and accountability.

Developer Team

Use frontier models for architecture, debugging unfamiliar failures, multi-repo planning, security review, and agent orchestration. Use smaller coding models for code completion, tests, documentation, PR summaries, refactors, type fixes, and routine migrations. For teams learning advanced AI workflows, the Kingy AI Codex course for beginners and the AI loops guide are useful companion reads.

Marketing Team

Use frontier models for campaign strategy, positioning, audience insight, offer design, and high-value creative direction. Use smaller models for repurposing, tagging, metadata, drafts, A/B variants, captions, and format conversions. A tool like the Kingy AI video prompt generator belongs in the volume layer; the strategy behind a campaign belongs closer to the frontier layer.

Legal Or Regulated Team

Use private or local models for first-pass extraction where appropriate. Use frontier models for issue spotting and structured review. Use qualified humans for interpretation, decisions, and accountability. This article is not legal, medical, or financial advice. In regulated domains, AI should support experts, not impersonate them.

Evaluation: The Router Needs Truth, Not Vibes

Model routing without evaluation is just vibes with a YAML file.

You need golden task sets: real examples of the work your organization does, with expected outputs, acceptance criteria, and edge cases. You need human review for ambiguous work. You need A/B tests between models. You need to measure cost per successful output, not just cost per token.

OpenAI’s evals documentation describes the basic pattern: compare model outputs against test inputs and ground truth labels. That idea is broader than any one provider. If your router sends tasks to cheaper models, you need to know whether quality held. If a provider updates a model, you need regression tests. If a workflow touches customers, you need escalation-rate tracking.

Track at least these:

Task success rate
Cost per successful output
Latency
Hallucination or unsupported-claim rate
Escalation rate
Human correction rate
User satisfaction
Business outcome metrics
Regression failures after model changes

Do not over-optimize cost before the workflow works. First prove the model can create value. Then route, compress, batch, cache, downgrade, localize, and fine-tune. OpenAI’s cost optimization guidance points to simple levers such as reducing requests, minimizing tokens, selecting smaller models, batch processing, and flex processing. Those levers are useful after you know what quality means.

Common Mistakes

The first mistake is using the most expensive model for every task. It feels safe. It is often just lazy.

The second mistake is choosing models from leaderboard screenshots. Benchmarks are useful, but your workflow is the benchmark that pays the bill.

The third mistake is ignoring privacy and data retention. A cheap hosted model is not cheap if it sends sensitive data somewhere your policy does not allow.

The fourth mistake is ignoring latency. The best model on a benchmark may be the wrong model inside a real-time customer workflow.

The fifth mistake is ignoring output consistency. A model that is brilliant but unpredictable can be worse than a smaller model that reliably returns the right schema.

The sixth mistake is confusing open-weight with fully open-source. Use the OSI definition as a sanity check and read the license.

The seventh mistake is forgetting human review for regulated or high-stakes work. If the output affects health, law, finance, employment, safety, or material customer commitments, the accountability layer is human.

The eighth mistake is building custom infrastructure too early. You do not need a grand internal model platform for a workflow nobody uses.

The ninth mistake is over-optimizing cost before proving the workflow creates value. A cheap useless workflow is still useless.

The tenth mistake is vendor-locking the entire app to one provider. Build for optionality where it matters.

The eleventh mistake is letting engineers choose models manually forever. Manual choice is fine during discovery. Production wants routing logic.

What This Means For Kingy AI Readers

For AI founders: build model-flexible products. The model market will keep moving. Your durable value is the workflow, dataset, UX, trust layer, distribution channel, and evaluation harness.

For marketers: stop spending frontier tokens on bulk rewrites. Use frontier models for taste, positioning, campaign strategy, and narrative. Use cheaper models for volume.

For developers: use frontier models for architecture, unfamiliar debugging, agent planning, and review. Do not burn premium reasoning on every trivial code question.

For enterprises: build routing and evaluation into the AI stack early. If you do not classify tasks, the invoice will classify them for you later.

For creators: use frontier models for originality, structure, voice, and final judgment. Then let cheaper models help with formatting, variants, clips, descriptions, and metadata. If you are modeling sponsored content economics, the AI sponsored video ROI calculator is another place where structured workflow beats model worship.

For AI buyers: use the Kingy AI model intelligence hub and the open-source AI and local LLM guide as starting points, but make the final decision against your own tasks.

The Bottom Line

The best AI stack is not one model. It is a routing system.

Frontier models are the executive layer. Open, local, smaller, and efficient hosted models are the operating layer. Evaluation is the truth layer. Human experts are the accountability layer.

The winners will not be the companies that spend the most tokens. They will be the companies that know which tokens are worth spending.

Sources And Further Reading

Kingy Launch Brief

The public Friday pilot has not sent its first issue yet. Join for a source-checked launch briefing with a clear try, watch or skip verdict, then check your inbox and confirm your address.

Free · Friday pilot · Double opt-in · Unsubscribe anytime