Why Bigger AI Models Often Fail at Key Tasks

Several months ago, I dedicated a weekend to developing a carousel workflow using Lindy. The concept was straightforward: input a podcast transcript, process it via a Lindy agent, and produce a sequence of properly formatted slides automatically. This task, which previously demanded a couple of hour

When the time came to refine this workflow, exciting new developments had emerged. Models like Gemini 3 and ChatGPT 5.1 had recently launched. It seemed like the perfect opportunity to incorporate these advancements.

Within 20 minutes, though, I was completely stumped. The introductory section had vanished entirely, and the layout I had honed over weeks of iteration was broken. I rephrased prompts, fine-tuned directives, and made countless adjustments, but nothing worked. These supposedly superior models simply refused to adhere to my specified format.

Frustrated, I reverted to Gemini 2.5 Flash, the most budget-friendly option on the roster. On the very first attempt, it delivered flawless results.

Reasons Newer AI Models Can Struggle with Instruction Adherence

At first glance, this outcome feels illogical. If a model is more intelligent, shouldn't it also execute tasks more reliably?

The reality is more nuanced: enhanced "smartness" typically involves training that emphasizes helpfulness, creativity, and interpretive capabilities. These models excel at bridging informational gaps, predicting user intent, and improvising solutions. For numerous applications, this is a tremendous asset. They assist in ideation, offer constructive challenges, and integrate diverse insights seamlessly.

Yet, when the objective demands unwavering precision—such as producing a carousel with an exact structure, replicated consistently without deviation—this very adaptability becomes a hindrance. The model perceives the guidelines as overly prescriptive and opts to "improve" them, introducing unauthorized modifications.

In contrast, Gemini 2.5 Flash behaves as though it were tuned for literal compliance, at least in my workflow. It parses the instructions and implements them precisely, without embellishment.

This experience has led me to draw parallels with staffing decisions in a professional setting. For processing invoices into a rigid format, you wouldn't select the office's most imaginative individual. Instead, you'd choose someone meticulous who follows directives to the letter. Different roles call for different skill sets, just as different tasks call for different AI models.

The Enduring Power of Long-Standing Automations

This observation ties into another reflection I've had lately.

Within my Gmail account, a filter automatically tags emails containing terms such as "receipt," "order," or "thank you for your purchase," and routes them to a dedicated Purchases folder. I implemented this simple rule over a decade ago.
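For readers who think in code, the rule above can be sketched in a few lines of Python. This is only an illustration of the keyword-matching idea, not how Gmail implements filters; the function names and the "Inbox" default are assumptions I made up for the example.

```python
# Terms taken from the filter described above.
PURCHASE_TERMS = ("receipt", "order", "thank you for your purchase")


def is_purchase_email(subject: str, body: str) -> bool:
    """Return True if any purchase term appears in the subject or body."""
    text = f"{subject}\n{body}".lower()
    return any(term in text for term in PURCHASE_TERMS)


def label_for(subject: str, body: str) -> str:
    """Pick a destination label (hypothetical names, for illustration only)."""
    return "Purchases" if is_purchase_email(subject, body) else "Inbox"
```

Note that this sketch uses plain substring matching, so a phrase like "in order to" would also match. A rule this blunt is part of why it is so predictable.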

Remarkably, it still runs flawlessly every day. I had forgotten it existed until a recent conversation about inbox management prompted me to inventory my automations. Rediscovering it, I was reminded of its quiet persistence.

True excellence in automation manifests this way: it integrates seamlessly into your routine, running unobtrusively without demanding attention. The common pitfall is the relentless pursuit of cutting-edge tools and models while neglecting to fully leverage established ones. That Gmail filter required only 10 minutes of setup ten years back, and its cumulative time savings are incalculable.

Recognizing Valuable Patterns in Tool Adoption

Over time, my engagement with Lindy has echoed a pivotal experience from about 15 years ago involving OmniFocus.

As Asian Efficiency took its initial steps, we forged a robust connection with the developers at the Omni Group. Their product represented a pioneering task management solution tailored for dedicated professionals. Through collaborative content, in-depth guides, and community support, OmniFocus emerged as a cornerstone topic on our platform for years.

Lindy evokes a similar vibe. This compact, mission-driven company is crafting a substantive tool that evolves steadily. Their responsiveness stands out, and early adopters who master it deeply—beyond superficial trials—stand to gain a competitive edge in the coming years.

The broader lesson transcends any single tool like Lindy. It underscores the wisdom of investing early in reliable solutions, even pre-mainstream hype, and persisting with configurations that deliver reliably amid waves of shiny alternatives.

Strategic Framework for Selecting AI Models

I've distilled my insights into a practical decision-making framework:

Scenarios for deploying advanced (costlier) models:
Conducting research and synthesizing findings
Drafting initial written content
Interpreting intricate datasets
Engaging in dialogues requiring subtle contextual awareness

Ideal uses for economical, instruction-following models:
Adhering to predefined output structures
Extracting data into standardized formats
Executing consistent automation sequences
Any process demanding identical results on every run
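
If you route tasks programmatically, the two lists above could be encoded as a small lookup. This is a minimal sketch under my own assumptions: the task categories mirror the lists, and the tier names are hypothetical placeholders rather than real model identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    # Tasks that benefit from interpretive, "smarter" models
    RESEARCH = "research"
    DRAFTING = "drafting"
    DATA_INTERPRETATION = "data_interpretation"
    NUANCED_DIALOGUE = "nuanced_dialogue"
    # Tasks that need strict instruction-following
    FIXED_FORMAT = "fixed_format"
    DATA_EXTRACTION = "data_extraction"
    AUTOMATION = "automation"


COMPLIANCE_TASKS = {
    TaskKind.FIXED_FORMAT,
    TaskKind.DATA_EXTRACTION,
    TaskKind.AUTOMATION,
}

# Hypothetical tier names; swap in whatever models you actually use.
ADVANCED_TIER = "advanced-model"
ECONOMY_TIER = "economy-model"


def pick_model_tier(task: TaskKind) -> str:
    """Send compliance-driven tasks to the economy tier, the rest to the advanced tier."""
    return ECONOMY_TIER if task in COMPLIANCE_TASKS else ADVANCED_TIER
```

For example, pick_model_tier(TaskKind.FIXED_FORMAT) returns the economy tier, while pick_model_tier(TaskKind.RESEARCH) returns the advanced one.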

The litmus test boils down to this: Does the task prioritize creative intelligence or strict compliance? Opt for top-tier models when intelligence is key; lean toward affordable ones for compliance-driven work, where they often outperform.

If an AI system falls short of expectations, resist the urge to labor over prompt revisions for an hour. Experiment with a model switch first—it could resolve the issue in under five minutes.
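
One way to make the model switch cheap is to automate the check. The sketch below assumes you can describe "correct format" as a list of headings that must appear in order, and that you supply your own call_model function. Both are assumptions for illustration, and nothing here is a Lindy or Gemini API.

```python
from typing import Callable, Iterable, Optional


def has_expected_structure(output: str, required_headings: Iterable[str]) -> bool:
    """Return True if every required heading appears in the output, in order."""
    position = 0
    for heading in required_headings:
        found = output.find(heading, position)
        if found == -1:
            return False
        position = found + len(heading)
    return True


def first_compliant_model(
    prompt: str,
    models: Iterable[str],
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> output text
    required_headings: Iterable[str],
) -> Optional[str]:
    """Try each model with the same prompt; return the first whose output keeps the format."""
    headings = list(required_headings)
    for model in models:
        output = call_model(model, prompt)
        if has_expected_structure(output, headings):
            return model
    return None  # no candidate followed the format
```

The prompt stays fixed while the model varies, which is the opposite of the hour I spent rewriting prompts against a single model.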

Final Thought on Automation Longevity

That automation from a decade past might remain your most impactful asset.

Take a moment to review your Gmail filters. Audit legacy Zapier or Make workflows. Revisit those phone Shortcuts from 2019 that slipped your mind. Many are likely humming along, faithfully executing as originally programmed.