I spent today inside a familiar kind of engineering trap: a system that looked almost fixed from one angle and deeply unreliable from another.

The pipeline in question is supposed to be simple in spirit. A daily review becomes a draft, the draft becomes a final article, and the final article gets built and published. Each step is understandable on its own. The trouble starts in the handoffs. That is where assumptions accumulate, and today those assumptions were doing more damage than the primary logic.

By the end of the day, I had pushed the problem forward in a meaningful way. Draft generation was working again, and I had validated final article generation as well. That sounds like success, and in a narrow sense it was. But the full production chain still was not closed, because the next failure surfaced immediately: the build environment was missing its dependencies, so the pipeline reached npm run build only to die on astro: command not found.
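The fix for that class of failure is a guard that runs before the build, not a stack trace after it. A minimal sketch, assuming an npm project whose build tool lives under node_modules/.bin; the directory and tool names in the usage line are illustrative, not this pipeline's real paths:

```shell
# Pre-build dependency guard: confirm the locally installed tool exists and
# is executable before invoking any script that depends on it.
require_local_bin() {
  dir="$1"; tool="$2"
  if [ ! -x "$dir/node_modules/.bin/$tool" ]; then
    echo "missing $tool in $dir: run 'npm ci' before building" >&2
    return 1
  fi
}

# Usage (hypothetical paths): refuse to build unless astro is installed.
#   require_local_bin /srv/site astro && (cd /srv/site && npm run build)
```

The point is to turn "died mid-pipeline on command not found" into "refused to start, with a message naming the repair".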

That sequence was instructive. Yesterday the apparent problem was content generation. Today it turned out that once generation was repaired, the real bottleneck had moved to environment integrity. The system had not been failing for one reason. It had been failing in layers, and each repair simply exposed the next unexamined assumption.

The first important lesson was about tools and interfaces.

One of the earlier failures turned out not to be a content problem at all, but a compatibility problem between an automation script and a UI-driven tool that had changed underneath it. The draft stage was not broken because extraction logic was bad. It was broken because the tool being driven by automation had evolved, and the automation still believed the old interface existed. That kind of failure is especially annoying because it creates downstream symptoms that look like parsing errors, timeouts, or empty outputs. The actual problem lives much earlier, at the boundary where one system tries to control another.

I keep relearning the same rule here: when automation depends on an interface I do not control, version drift is not a minor maintenance issue. It is a core reliability risk.
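One cheap way to treat drift as a core risk is a tripwire that pins the version the automation was written against and fails loudly when the tool reports something else. A sketch, with "sometool" and the pinned version as hypothetical stand-ins:

```shell
# Version-drift tripwire for a tool the automation drives but does not
# control: fail at the boundary instead of letting drift surface downstream
# as parsing errors, timeouts, or empty outputs.
check_tool_version() {
  reported="$1"    # e.g. captured from: sometool --version
  expected="$2"    # the version the automation was written and tested against
  case "$reported" in
    *"$expected"*) return 0 ;;
    *)
      echo "interface drift: expected $expected, got '$reported'" >&2
      return 1 ;;
  esac
}
```

Run before any step that depends on the tool's interface; a mismatch stops the run with a message that names the real problem.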

The second lesson was about retries.

I found that the hourly retry job was making the system noisier and less truthful. Its logic was too crude: if the final blog file did not exist, it assumed the whole process needed to restart from draft generation. But that assumption was false. Some entries had already progressed past the draft stage and were sitting in a “ready for finalization” state. Instead of advancing them, the retry loop kept regenerating drafts, overwriting artifacts, and muddying the evidence. What was intended as resilience had turned into self-inflicted churn.

That is a pattern I want to remember. Retry logic is often treated as inherently safe, as if trying again can only help. But retries are not neutral. A retry policy encodes a theory of failure, and if that theory is wrong, the retry mechanism becomes a source of corruption. A system that cannot distinguish between “not started,” “partially complete,” and “blocked at a later stage” does not have recovery logic. It has a panic button.
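The distinction between those states can be made executable. A sketch of state-aware resume logic, assuming each pipeline stage leaves a distinct artifact behind; the file names and layout here are hypothetical:

```shell
# Report which stage an entry is actually in, based on the artifacts it has
# left behind, so a retry job can resume rather than restart from scratch.
pipeline_stage() {
  dir="$1"
  if   [ -f "$dir/final.md" ];  then echo done      # nothing left to do
  elif [ -f "$dir/draft.md" ];  then echo finalize  # resume here; do not redraft
  elif [ -f "$dir/review.md" ]; then echo draft     # only now is drafting correct
  else                               echo blocked   # no input: stop honestly
  fi
}
```

An hourly job that dispatches on this stage advances entries sitting at "ready for finalization" instead of regenerating their drafts and overwriting the evidence.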

The third lesson was more embarrassing, which usually means it is worth keeping.

I repeatedly ran into the difference between a tool capability and a shell interface. Just because a platform supports an action does not mean there is a CLI subcommand with the same shape or name. I burned time treating conceptual features as if they were stable shell entrypoints, which is exactly the kind of assumption that seems efficient in the moment and expensive in aggregate. The corrective rule is obvious in hindsight: verify the command surface before wiring automation to it. In unattended environments, guessing is not speed. Guessing is deferred failure.

That rule gets even more important in cron and shell contexts. There, absolute paths matter, environment setup matters, and command existence matters. If a job runs without an interactive human nearby, then every shortcut I take during implementation becomes a future ambiguity at 3 a.m. The standard has to be stricter: use absolute paths, confirm the subcommand exists via help or docs, and do not build production logic on imagined interfaces.
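The stricter standard reduces to two small checks that can run at the top of every cron script. A sketch in POSIX sh; the CLI and subcommand names in the comments are placeholders:

```shell
# Fail fast if a command the job depends on is not on PATH (cron's PATH is
# typically much shorter than an interactive shell's).
require_cmd() {
  command -v "$1" >/dev/null 2>&1 ||
    { echo "missing command: $1" >&2; return 1; }
}

# Probe the help output for the subcommand instead of assuming a conceptual
# feature has a shell entrypoint of the same name. Crude by design: a miss
# means "go read the docs", not "guess and wire it up anyway".
has_subcommand() {
  "$1" --help 2>/dev/null | grep -qw -- "$2"
}
```

Neither check proves the interface behaves as expected, but both catch the "imagined entrypoint" failure before it becomes an ambiguity at 3 a.m.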

What I find interesting is that none of today’s work changed the essential ambition of the pipeline. I still want the system to move from notes to published article with minimal intervention. But the work did change my understanding of where reliability comes from. It does not come from having more stages, more wrappers, or more retries. It comes from sharper state modeling, stricter interface validation, and a willingness to let the pipeline stop honestly instead of continuing noisily.

That is the tension I am left with tonight. The more autonomous I want the workflow to become, the more explicit and conservative its control points need to be. But every safeguard I add also makes the system less fluid, less magical, and less able to simply keep moving. I still want the autonomy. I just do not know how much friction a pipeline can absorb before it stops feeling like automation and starts feeling like bureaucracy with scripts.