GPT-5 for GTM: Reasoning Tokens Change the Math
The GPT-5 API backlash is missing the point.
A lot of people have been disappointed with GPT-5’s performance via API. The complaints are predictable: it’s slow, it’s expensive, the output isn’t dramatically better than 4o for simple tasks. Some teams tested it as a drop-in replacement for their existing workflows and got worse results at higher cost.
We ran updated tests in Clay, and the results convinced me that people need to reframe how they think about this model.
GPT-5 Is Not the New 4o
GPT-5 is a reasoning model. We have to treat it as the new o3, not the new 4o.
This distinction matters because it changes everything about how you use it — when it’s worth the cost, how you structure your prompts, and what tasks you assign it.
A base model like 4o processes your input and generates output. It’s fast, cheap, and good enough for the majority of straightforward tasks. Summarize this text. Rewrite this paragraph. Extract these fields from this data.
A reasoning model like GPT-5 thinks before it responds. It decomposes problems, considers multiple approaches, evaluates its own reasoning, and arrives at answers through deliberation. This is more powerful for complex tasks. It’s also more expensive, because you’re paying for that thinking.
The Token Economics Shift
Here’s where people are getting burned. With base models, API billing is simple:
Cost = input tokens + output tokens
With reasoning models, there’s a third variable:
Cost = input tokens + output tokens + reasoning tokens
Reasoning tokens are the “thinking” the model does before it produces visible output. You don’t see them in the response. But you pay for them. And they can be substantial — sometimes several times larger than the actual output.
This means a prompt that cost $0.02 with 4o might cost $0.15 with GPT-5. Not because the output is longer, but because the model spent significant compute reasoning through the problem before writing its answer.
If you’re running that prompt across 10,000 prospects in a Clay table, you just went from $200 to $1,500. That’s not a rounding error. That’s a budget conversation.
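The math above is easy to sanity-check yourself. Here's a back-of-the-envelope sketch; the token counts and per-million-token prices are illustrative assumptions chosen to reproduce the numbers in this post, not actual OpenAI rates.

```python
# Back-of-the-envelope cost comparison for a batch run like the one above.
# Token counts and prices are illustrative assumptions, not real OpenAI rates.

def prompt_cost(input_tokens, output_tokens, reasoning_tokens,
                in_price_per_m, out_price_per_m):
    # Reasoning tokens never appear in the response, but they are billed
    # at the output rate, so they fold into the output side of the bill.
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * in_price_per_m
            + billed_output * out_price_per_m) / 1_000_000

# Same prompt, same visible output length; only the hidden reasoning differs.
base_cost = prompt_cost(2_000, 1_500, 0,
                        in_price_per_m=2.50, out_price_per_m=10.00)
gpt5_cost = prompt_cost(2_000, 1_500, 13_000,
                        in_price_per_m=2.50, out_price_per_m=10.00)

print(f"per row:  ${base_cost:.2f} vs ${gpt5_cost:.2f}")   # $0.02 vs $0.15
print(f"10k rows: ${base_cost * 10_000:,.0f} vs ${gpt5_cost * 10_000:,.0f}")
```

Note where the gap comes from: the visible output is identical in both cases; the entire difference is the 13,000 hidden reasoning tokens billed at the output rate.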
The Clay Problem
This cost issue is amplified by a specific limitation in Clay: you can’t control the API parameters for reasoning models.
When you call the OpenAI API directly, you can set parameters that control how much reasoning the model does. You can tell it to think less on simpler tasks and think more on complex ones. You can cap reasoning tokens. You have fine-grained control over the cost-quality tradeoff.
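As a sketch of what that control looks like when you own the API call: the parameter names below ("reasoning", "effort", "max_output_tokens") follow OpenAI's Responses API as we understand it at the time of writing; treat the exact names and allowed values as assumptions to verify against the current OpenAI docs.

```python
# Sketch: dialing reasoning depth per task when calling OpenAI directly.
# Parameter names follow our reading of the Responses API -- verify against
# current OpenAI documentation before relying on them.

def request_params(prompt: str, complexity: str) -> dict:
    # Map task complexity to a reasoning effort level: think less on
    # simple tasks, more on complex ones.
    effort = {"simple": "minimal", "moderate": "medium", "complex": "high"}[complexity]
    return {
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},  # controls how much hidden thinking happens
        "max_output_tokens": 4_000,       # caps total billed output, reasoning included
    }

# With a client in hand, the call would look like:
# client = OpenAI()
# response = client.responses.create(**request_params("Extract the company name", "simple"))
```

This is exactly the knob Clay doesn't expose: the effort level and the output cap are where the cost-quality tradeoff lives.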
Through Clay’s Claygent and AI formula columns, you don’t have that control. Clay calls the API with its own default parameters. If you select GPT-5 as your model, every single row in your table gets the full reasoning treatment, whether it needs it or not.
This means a simple task like “extract the company name from this URL” — something 4.1 Mini handles perfectly — will burn through reasoning tokens unnecessarily if you route it to GPT-5 in Clay. The output won’t be meaningfully better. The cost will be meaningfully higher.
Until Clay exposes reasoning parameter controls, you need to be very deliberate about which columns use which models. Don’t blanket-apply GPT-5 across your entire workflow.
When GPT-5 Actually Earns Its Cost
We’ve been testing GPT-5 across our GTM workflows and the pattern is clear: it’s worth it for tasks that require genuine reasoning, and overkill for tasks that are essentially pattern matching or text manipulation.
Where GPT-5 is worth it:
Account scoring. Evaluating whether a company is a good fit for your ICP requires synthesizing multiple signals — funding stage, team size, tech stack, recent hires, market position, competitive landscape — and making a judgment call. This is multi-step reasoning, exactly what GPT-5 is built for. The quality difference between a GPT-5 score and a 4o score is noticeable and measurable.
Trigger detection and prioritization. Given a set of signals about a company, determining which ones represent genuine buying triggers versus noise requires reasoning about causality and timing. A company hiring three SDRs could mean they’re scaling outbound (good trigger) or replacing turnover (neutral signal). GPT-5 is better at distinguishing these.
Persona analysis. Understanding a prospect’s likely priorities, challenges, and communication preferences based on their role, background, company context, and public content. This requires the model to build a mental model of the person and reason about what matters to them. GPT-5 produces noticeably richer and more accurate persona profiles.
Detailed personalization. Writing a genuinely personalized email that connects a specific signal to a specific pain point through a logical bridge. This is the highest-value reasoning task in outbound because it requires the model to chain together: signal identification, pain point mapping, offer relevance, and natural voice. GPT-5 does this better, and the reply rate difference pays for the extra tokens.
Where 4.1 Mini is still the right call:
Data extraction and formatting. Pulling structured fields from unstructured text — names, titles, company info, contact details. This is pattern matching, not reasoning. 4.1 Mini handles it accurately at a fraction of the cost.
Simple enrichment. Classifying companies by industry, normalizing job titles, categorizing content topics. Straightforward tasks with clear right answers.
Template-based copy. When you have a proven template and just need the model to fill in prospect-specific details. The structure is already defined; the model just needs to follow the pattern.
Boolean and filtering tasks. Does this company match these criteria? Is this email address likely valid? Does this job posting mention relevant keywords? Binary or low-complexity decisions.
The Practical Framework
Here’s how we’re structuring our workflows now:
Step 1: Classify the task. Is it fundamentally a reasoning task (synthesize, evaluate, judge, create) or a processing task (extract, classify, format, filter)?
Step 2: Match the model. Reasoning tasks get GPT-5. Processing tasks get 4.1 Mini. When in doubt, start with 4.1 Mini and upgrade to GPT-5 only if the output quality isn’t sufficient.
Step 3: Test the economics. Run a sample batch of 50-100 rows with each model. Compare the output quality and the total cost. Sometimes GPT-5 produces marginally better results that don’t justify 5-8x the cost. Sometimes the quality difference is dramatic and the cost is easily justified by better reply rates or more accurate scoring.
Step 4: Build your workflow with mixed models. A single Clay table might use 4.1 Mini for data extraction and normalization in early columns, then GPT-5 for scoring and personalization in later columns. This is how you control costs while getting the quality where it matters.
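The four steps above reduce to a simple routing rule. Here's a minimal sketch; the task categories mirror the framework, while the specific category names and the column layout are our own illustrative choices.

```python
# Minimal sketch of the task -> model routing described above.
# Category names and the example columns are illustrative, not prescriptive.

REASONING_TASKS = {"score", "evaluate", "judge", "synthesize", "personalize"}
PROCESSING_TASKS = {"extract", "classify", "format", "filter", "normalize"}

def pick_model(task: str) -> str:
    if task in REASONING_TASKS:
        return "gpt-5"
    if task in PROCESSING_TASKS:
        return "gpt-4.1-mini"
    # When in doubt, start cheap and upgrade only if quality falls short.
    return "gpt-4.1-mini"

# A single Clay table mixes models column by column:
columns = {
    "company_name": pick_model("extract"),    # cheap processing, early column
    "industry": pick_model("classify"),       # cheap processing, early column
    "icp_score": pick_model("score"),         # reasoning, later column
    "first_line": pick_model("personalize"),  # reasoning, later column
}
```

The default branch encodes Step 2's tiebreaker: unknown tasks start on the cheap model, and only the columns that demonstrably need reasoning pay for it.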
What This Means Going Forward
The era of one-model-fits-all is over. GPT-5 is not a universal upgrade. It’s a specialized tool for complex reasoning tasks that justifies its cost when applied correctly and wastes money when applied blindly.
The teams that figure out the right model routing for their specific workflows will have a meaningful cost and quality advantage over teams that either avoid GPT-5 entirely (missing quality gains on complex tasks) or apply it everywhere (burning budget on tasks that don’t need it).
We’re still experimenting with the exact boundaries. The balance between GPT-5 and the 4-series will keep shifting as pricing changes and Clay adds more parameter controls. But the framework holds: match the model to the task, test the economics, and be intentional about where you spend your reasoning tokens.