There is a particular kind of problem that accumulates quietly until it becomes unmanageable: information overload. If you run RSS feeds from a dozen technical sources, the unread count climbs faster than any human can reasonably process. The standard response is to ignore most of it, which defeats the purpose.
The two approaches I want to walk through here take opposite angles on the same root problem. One uses an LLM as a co-developer to compress months of engineering work into weeks. The other deploys an LLM at the edge to continuously summarize, classify, and score incoming content so that a human only reads what matters. Both are real systems that reached production. Both have production scars.
Using AI as a Development Accelerant
The first case is a solo engineer building a niche user-generated content platform with an AI pair programmer doing the heavy lifting on code. The stack ended up being Node.js on Firebase and GCP, with Cloudflare Workers handling edge access control, Algolia for full-text search, and a Cloud Run service running a Go board analysis engine over WebSocket. That is not a trivial system. The total real engineering effort came in around 1.5 to 2 person-months. A traditional firm estimating the same scope would likely quote somewhere north of 20 person-months.
That gap is real, and understanding where it comes from is more useful than citing the ratio.
Where the Multiplier Actually Comes From
The gains were not uniform across all task types. Boilerplate and glue code moved at near-machine speed. The AI produced GitHub Actions pipelines, Firebase security rules, Cloud Run Dockerfile configurations, and Algolia indexing hooks with minimal back-and-forth. The engineer's job in those phases was to review, catch hardcoded environment-specific values (a persistent failure mode), and merge.
The interesting problems were the ones where the AI had to make architectural choices under ambiguity. When deciding between a document database and a relational database, the AI analyzed the access patterns of the application — writes are infrequent, reads are fan-out heavy, and content naturally fits a denormalized document model — and recommended Firestore over a Postgres-based alternative. The reasoning was sound. The engineer did not originate that analysis; the AI did, and the engineer verified it.
This pattern repeats throughout: the human's role shifts from author to reviewer and decision ratifier.
The Context Window Problem in Development
Here is where things get instructive. The project started with a single HTML file and grew from there. At some point the codebase exceeded what the AI's context window could hold coherently, and the quality of output degraded. The AI started introducing regressions — fixing one thing and silently breaking another.
The fix was conventional software engineering: decompose the code into smaller, loosely coupled modules with clear separation of concerns. Once the code was properly structured, the AI could work on a module without accidentally corrupting adjacent logic.
This is not an AI-specific insight. It is just good architecture. But the failure mode is worth naming: an AI coding assistant working on a monolithic codebase is like a developer who can only see 20 lines at a time. Structure protects you from both.
Tasks and milestones were tracked as Markdown files inside the repository rather than in GitHub Issues. The reason is practical: a well-structured document inside the repo stays inside the AI's context window. An issue tracker is opaque to the model.
The Hardcoding Problem
Environment-specific values getting hardcoded into source files is a recurring failure. Domain names are the worst offender because a hardcoded domain in the wrong place causes cross-environment failures that are annoying to debug. The mitigations are vigilance and CI checks, and neither fully solves the problem. The AI does not maintain a mental model of "this value belongs in a secret" the way an experienced human engineer would after being burned by it once.
Takumi's Take: This is the most underrated failure mode in AI-assisted development. The AI has no memory of the last time a hardcoded credential caused a production incident. You do. Your review process has to compensate for that asymmetry. A pre-commit hook that scans for obvious patterns helps, but it will not catch a domain name that looks like a string constant. The human in the loop has to stay sharp on this class of issue even when everything else is moving fast.
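One way to narrow the gap is to stop scanning for generic patterns and scan for the project's own environment values. The sketch below is a minimal illustration of that idea, not the check used in the project: `findEnvironmentLiterals` and the example domains are hypothetical, and the list of values is assumed to be maintained by hand.

```typescript
// Hypothetical CI helper: reports every line of a source file that contains
// one of the project's known environment-specific values (domains, bucket names).
interface Finding {
  line: number; // 1-based line number
  value: string; // the environment value that was found
}

function findEnvironmentLiterals(source: string, environmentValues: string[]): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const value of environmentValues) {
      if (text.includes(value)) {
        findings.push({ line: index + 1, value });
      }
    }
  });
  return findings;
}

// Example: a staging domain pasted into application code.
const sample = 'const api = "https://staging.example.com/v1";\nconst retries = 3;';
console.log(findEnvironmentLiterals(sample, ["staging.example.com"]));
```

A check like this only knows about the values it was given, so it complements review rather than replacing it.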
Testing Strategy Under AI-Driven Development
The testing approach here is deliberately pragmatic. Unit test coverage is not maximized. Security-critical logic — Firebase security rules, access control checks — gets thorough unit tests. End-to-end tests cover only critical paths. The rationale is that AI can apply fixes rapidly, so a system that is slightly broken but quickly correctable is better than one that is rigorously tested but slow to change.
This is a reasonable position for a solo project under resource constraints. It would not be the right call for a system handling financial transactions or personal health data. The trade-off is explicit, not accidental.
AI Analysis at the Feature Level
The platform includes a Go board analysis feature powered by KataGo, an open-source Go AI engine. The first attempt ran inference on the client side using an ONNX-converted model. The idea was to eliminate server costs by pushing compute to the browser.
It did not work. A single board position evaluation took long enough to be unusable, and Monte Carlo tree search on top of that was out of the question on typical consumer hardware. The feature requires search, not just single-position evaluation, which multiplies the compute requirement substantially.
The working solution runs KataGo on Cloud Run with a WebSocket interface. Instances scale to zero when idle, which bounds the cost. The engineer can set a maximum instance count to cap worst-case spending. It is not free, but it is predictable.
The image recognition feature — reconstructing a board position from a photograph — went through a similar evaluation process. A CNN-based deep learning approach produced poor accuracy without custom training data. An LLM vision API worked but carried ongoing inference costs. OpenCV-based classical computer vision ran entirely on the client, cost nothing per use, and produced acceptable accuracy with careful tuning. The AI prototyped all three approaches and the human made the final call based on cost and accuracy measurements.
One detail worth noting: when the OpenCV approach underperformed initially, the AI's instinct was to suggest abandoning it for a different method. The engineer had to push back and direct the AI to keep tuning the existing approach. This is a pattern. AI assistants tend toward variety over depth when facing a hard problem. You need to know when to insist on going deeper.
Building an AI Content Pipeline on the Edge
The second system is a content digest service that pulls from RSS feeds, runs each article through an LLM for summarization, classification, and importance scoring, and delivers a daily digest. It runs entirely on Cloudflare Workers with Workers AI providing the LLM inference.
The architecture is a pipeline triggered by Cron schedules: three times daily plus a weekly run. Each execution fetches new feed entries, deduplicates against stored URLs, runs AI summarization on new articles, persists results to Workers KV, and assembles the daily digest.
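The deduplication step can be sketched as follows. This is an illustration under stated assumptions: `KvLike` is a minimal stand-in for the Workers KV binding (only `get` and `put`), and the `seen:` key prefix, `filterNewEntries`, and `markProcessed` are hypothetical names rather than the service's actual code.

```typescript
// Minimal subset of a KV binding, declared locally to keep the sketch self-contained.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

interface FeedEntry {
  url: string;
  title: string;
}

// Hypothetical key scheme: one KV key per article URL.
const seenKey = (url: string): string => `seen:${url}`;

// Returns only entries whose URL has not been stored by a previous run.
async function filterNewEntries(kv: KvLike, entries: FeedEntry[]): Promise<FeedEntry[]> {
  const fresh: FeedEntry[] = [];
  const inThisBatch = new Set<string>();
  for (const entry of entries) {
    if (inThisBatch.has(entry.url)) continue; // same URL seen twice in one run
    inThisBatch.add(entry.url);
    const stored = await kv.get(seenKey(entry.url));
    if (stored === null) fresh.push(entry);
  }
  return fresh;
}

// Records a URL as processed; intended to be called after the article is saved.
async function markProcessed(kv: KvLike, entry: FeedEntry): Promise<void> {
  await kv.put(seenKey(entry.url), new Date().toISOString());
}
```

KV keys have a length limit, so a real implementation would probably hash very long URLs before using them as keys.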
Workers AI vs. External LLM APIs
Workers AI is Cloudflare's hosted inference service, callable directly from Worker code through a binding rather than an external HTTP request. Three practical differences separate it from an external API such as OpenAI's:
First, latency is lower because the inference runs on the same edge network as the Worker. There is no round trip to an external API endpoint.
Second, the pricing model is based on a unit called Neurons rather than tokens. The free tier provides 10,000 Neurons per day. For Llama 3.1 8B, a summarization call with roughly 500 input tokens and 200 output tokens consumes approximately 28 Neurons. In practice, with retries and multiple passes, a single article costs closer to 80 to 100 Neurons.
Third, model selection is limited to what Cloudflare provides. You cannot bring your own weights.
For the summarization pipeline, two models are used with different roles. Llama 3.1 8B handles the per-article summarization that runs in bulk during each Cron execution. Llama 3.3 70B fp8-fast handles a separate "daily insight" generation step that runs once per day on demand. The reasoning is explicit: bulk processing needs low cost per call, while the daily insight needs better analytical quality and runs infrequently enough that the higher Neuron cost per call does not matter much.
# wrangler.toml: model names as environment variables.
# This makes swapping models a config change, not a code change.

# packages/fetcher/wrangler.toml
[vars]
AI_SUMMARIZE_MODEL = "@cf/meta/llama-3.1-8b-instruct"
AI_TOPICS_MODEL = "@cf/meta/llama-3.3-70b-instruct-fp8-fast"

# packages/api/wrangler.toml
[vars]
AI_REPORT_MODEL = "@cf/meta/llama-3.3-70b-instruct-fp8-fast"
Model names belong in environment variables, not hardcoded strings. This is not an academic concern. Models get deprecated. The Workers AI platform has already retired older Llama versions. A system that requires code changes to swap a model name will cause unnecessary incidents.
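On the code side, reading the model name from the environment might look like the sketch below. The `AiBinding` and `Env` interfaces are simplified local stand-ins for the real Workers types, and `summarizeWithConfiguredModel` is a hypothetical name. The `run(model, { messages })` call shape and the `response` field follow the Workers AI text-generation interface.

```typescript
// Simplified local types; a real Worker would use the generated binding types.
interface AiBinding {
  run(
    model: string,
    input: { messages: { role: string; content: string }[] }
  ): Promise<{ response?: string }>;
}

interface Env {
  AI: AiBinding;
  AI_SUMMARIZE_MODEL: string; // set under [vars] in wrangler.toml
}

async function summarizeWithConfiguredModel(env: Env, prompt: string): Promise<string> {
  const model = env.AI_SUMMARIZE_MODEL;
  // Fail loudly on a missing variable instead of silently using a default model.
  if (!model) throw new Error("AI_SUMMARIZE_MODEL is not configured");
  const result = await env.AI.run(model, {
    messages: [{ role: "user", content: prompt }],
  });
  return result.response ?? "";
}
```

With this shape, retiring a model means editing one line of configuration and redeploying.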
Single-Call Prompt Design
The core prompt asks the model to do three things in one call: produce a Japanese-language summary, assign a category from a predefined list, and score importance on a 1-5 scale. The decision to combine all three into a single call is entirely about Neuron budget.
With a 10,000 Neuron daily budget, splitting into three separate calls would roughly triple the overhead cost per article and cut the number of processable articles by more than half. When budget is the binding constraint, prompt consolidation is the right trade-off.
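The arithmetic can be made concrete with a small sketch. It assumes about 90 Neurons per call (the midpoint of the 80 to 100 range quoted earlier) and assumes each of the three split calls would cost about as much as the combined one, since each would resend the article text. Both figures are illustrative assumptions, not measurements.

```typescript
// How many articles fit in the daily budget for a given number of calls per article.
function articlesPerDay(
  dailyBudget: number,
  neuronsPerCall: number,
  callsPerArticle: number
): number {
  return Math.floor(dailyBudget / (neuronsPerCall * callsPerArticle));
}

console.log(articlesPerDay(10_000, 90, 1)); // 111 (one combined call)
console.log(articlesPerDay(10_000, 90, 3)); // 37  (three separate calls)
```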
The category classification uses keyword hints embedded in the prompt. Each category definition carries a list of associated terms that tell the model what kind of content belongs there. This proved more reliable than attempting structured output constraints with the 8B model. Function calling-style JSON schemas can work well with larger models, but smaller models benefit from the guidance that explicit keyword examples provide.
// Category definition; keywords are embedded in the prompt as classification hints.
interface CategoryConfig {
  id: string;
  label: string;
  keywords: string[];
  order: number;
}

const ARTICLE_CATEGORIES: CategoryConfig[] = [
  {
    id: "security",
    label: "Security",
    keywords: ["CVE", "vulnerability", "OWASP", "zero-day", "breach"],
    order: 1,
  },
  {
    id: "ai",
    label: "AI / ML",
    keywords: ["LLM", "GPT", "Claude", "machine learning", "RAG", "fine-tuning"],
    order: 2,
  },
  {
    id: "cloud",
    label: "Cloud",
    keywords: ["AWS", "GCP", "Azure", "Terraform", "IaC", "Kubernetes"],
    order: 3,
  },
  {
    id: "other",
    label: "Other",
    keywords: [],
    order: 99,
  },
];

// Render one hint line per category, e.g. "- cloud (Cloud): AWS, GCP, ...".
function buildCategoryHints(categories: CategoryConfig[]): string {
  return categories
    .map((c) => `- ${c.id} (${c.label}): ${c.keywords.join(", ") || "general"}`)
    .join("\n");
}
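For illustration, here is how the hint text might be folded into the single combined prompt. The wording of the instructions, the `ChatMessage` type, and `buildSummarizeMessages` are all assumptions made for this sketch. The original prompt text is not shown in this article.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Builds one request that asks for summary, category, and importance together.
// `categoryHints` is the newline-separated hint list, one line per category.
function buildSummarizeMessages(
  title: string,
  body: string,
  categoryHints: string
): ChatMessage[] {
  const system = [
    "You summarize technical articles for a daily digest.",
    "Reply with a single JSON object and nothing else:",
    '{"summary": "<summary in Japanese>", "category": "<category id>", "importance": <integer 1-5>}',
    "Choose the category id from this list, using the keywords as hints:",
    categoryHints,
  ].join("\n");
  const user = `Title: ${title}\n\n${body}`;
  return [
    { role: "system", content: system },
    { role: "user", content: user },
  ];
}
```

Keeping the output contract in the system message and the article in the user message makes it easier to tighten the contract later without touching how article text is assembled.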
The importance scale runs from 1 to 5, not 1 to 3. Three levels sound sufficient until you try to distinguish "major version release worth reading this week" from "zero-day requiring immediate action." Five levels provide enough resolution to make the digest actually useful as a triage tool.
Context Window Management
The Workers AI version of Llama 3.1 8B has a context window of approximately 8,000 tokens. The full text of a technical article frequently exceeds this. Passing the entire article body causes truncation and degrades output quality unpredictably.
The solution is structured truncation before the text reaches the model. For articles with Markdown headings, the content is split by section and concatenated from the beginning until the character budget (3,200 characters before tokenization) is exhausted. Front-loaded sections get priority because most well-structured articles lead with their main point.
For articles without headings — plain news items, scraped web content — a three-part extraction strategy applies instead. The beginning, middle, and end of the article each contribute a fixed proportion. The end section matters because technical articles frequently place their conclusions and key takeaways there. Grabbing only the top of a headingless article means missing the summary the author already wrote.
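// Pick an extraction strategy based on whether the body has Markdown headings.
// extractByHeadings is defined separately.
function extractArticleContext(body: string, maxChars = 3200): string {
  const headingPattern = /^#{1,6}\s/m;
  if (headingPattern.test(body)) {
    return extractByHeadings(body, maxChars);
  }
  return extractByPosition(body, maxChars);
}

// Take 50% of the budget from the start, 25% from the middle, and the rest from the end.
function extractByPosition(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const headSize = Math.floor(maxChars * 0.5);
  const midSize = Math.floor(maxChars * 0.25);
  const tailSize = maxChars - headSize - midSize;
  // Start the middle slice near the 40% mark, clamped so it never overlaps head or tail.
  const midStart = Math.min(
    Math.max(Math.floor(text.length * 0.4), headSize),
    text.length - tailSize - midSize
  );
  const head = text.slice(0, headSize);
  const mid = text.slice(midStart, midStart + midSize);
  const tail = text.slice(text.length - tailSize);
  // The "---" separators add a few characters on top of maxChars.
  return [head, "---", mid, "---", tail].join("\n");
}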
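The heading-based path is not shown above. A minimal sketch of what `extractByHeadings` could look like follows. It assumes that only whole sections are kept and that the first section is cut if it alone exceeds the budget. The original implementation may handle the boundary differently.

```typescript
// Keeps leading Markdown sections, in order, while they fit within maxChars.
function extractByHeadings(body: string, maxChars: number): string {
  // Split at the start of each heading line; every heading stays with its section.
  const sections = body.split(/^(?=#{1,6}\s)/m);
  let result = "";
  for (const section of sections) {
    if (result.length + section.length > maxChars) {
      // If even the first section is too long, fall back to cutting it.
      if (result.length === 0) result = section.slice(0, maxChars);
      break;
    }
    result += section;
  }
  return result;
}
```

Stopping at a section boundary avoids handing the model a paragraph that ends mid-sentence, at the cost of leaving some of the budget unused.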
Handling Language Inconsistency
RSS feeds mix Japanese and English articles. The digest targets Japanese-language readers. This creates a specific problem: asking an 8B model to translate English content into Japanese while simultaneously summarizing it produces inconsistent output. The English context pulls the model toward English responses.
The two-pass strategy separates these concerns. Japanese articles go through the full summarization prompt. English articles skip full summarization and instead run a lighter prompt that translates the title to Japanese and produces a category and score. The translated title becomes the summary.
When that translation comes back in English anyway — which happens often enough to require handling — a reinforced retry prompt fires. The retry uses stronger constraint language and a concrete example of the expected output format. Most cases resolve on the first or second attempt. For anything that still fails, the system falls back to using the original English title and marks the record with a fallback flag so downstream consumers know the quality is degraded.
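The retry-then-fallback flow described above could be sketched like this. The language check is a simple character-range heuristic, and `translateTitleWithRetry`, `TitleResult`, and the injected `translate` callback are hypothetical. The callback stands in for the model call, and its `attempt` argument lets the caller switch to the reinforced prompt on retries.

```typescript
// Heuristic: hiragana, katakana, or CJK ideographs anywhere in the text.
const JAPANESE_CHARS = /[\u3040-\u30ff\u4e00-\u9fff]/;

function containsJapanese(text: string): boolean {
  return JAPANESE_CHARS.test(text);
}

interface TitleResult {
  title: string;
  fallback: boolean; // true when the original English title was kept
}

async function translateTitleWithRetry(
  originalTitle: string,
  translate: (title: string, attempt: number) => Promise<string>,
  maxAttempts = 2
): Promise<TitleResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const candidate = (await translate(originalTitle, attempt)).trim();
      if (containsJapanese(candidate)) {
        return { title: candidate, fallback: false };
      }
    } catch {
      // Treat a failed call like a wrong-language answer and try again.
    }
  }
  // Still not Japanese: keep the English title and flag the record as degraded.
  return { title: originalTitle, fallback: true };
}
```

The check accepts mixed output such as a Japanese sentence containing an English product name, which is usually what a translated technical title looks like.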
Takumi's Take: The language drift problem with smaller models is real and worth designing around explicitly from the start. A 70B model handles cross-lingual instruction following much more reliably than an 8B model does. If you are building a multilingual pipeline on a budget, size your models appropriately for the task. Using an 8B model for bulk work and a 70B model for quality-sensitive tasks is a sensible split, but do not expect the 8B model to be robustly bilingual under adversarial prompt conditions.
Fallback Architecture
The system is designed around the assumption that the AI will fail sometimes. This is a more useful frame than "how do I make the AI reliable." Workers AI, like any external service, has latency spikes and rate limits. The model output format drifts — a request for JSON occasionally returns prose with JSON embedded in it.
The fallback hierarchy has two dimensions. The first is JSON parsing: if the raw response parses cleanly, great. If it does not, a regex extraction attempt pulls the JSON object out of surrounding text. If that also fails, the article title becomes the summary and the record is flagged.
The second dimension is pipeline-level: if a feed fails to fetch, the other feeds continue. If a single article fails to save, it is not marked as processed, which means the next Cron run will retry it automatically.
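A minimal sketch of the pipeline-level isolation follows. The dependencies are injected so the sketch assumes nothing about storage, and `PipelineDeps`, `RunReport`, and `runPipeline` are hypothetical names. The property it illustrates is that an article is marked processed only after a successful save.

```typescript
interface Article {
  url: string;
  title: string;
}

// Hypothetical dependencies; real versions would wrap feed fetching and KV.
interface PipelineDeps {
  fetchFeed(feedUrl: string): Promise<Article[]>;
  saveArticle(article: Article): Promise<void>;
  markProcessed(url: string): Promise<void>;
}

interface RunReport {
  saved: number;
  failedFeeds: string[];
  unsavedUrls: string[];
}

async function runPipeline(feedUrls: string[], deps: PipelineDeps): Promise<RunReport> {
  const report: RunReport = { saved: 0, failedFeeds: [], unsavedUrls: [] };
  for (const feedUrl of feedUrls) {
    let articles: Article[] = [];
    try {
      articles = await deps.fetchFeed(feedUrl);
    } catch {
      report.failedFeeds.push(feedUrl); // one bad feed does not stop the others
      continue;
    }
    for (const article of articles) {
      try {
        await deps.saveArticle(article);
        await deps.markProcessed(article.url); // only after a successful save
        report.saved += 1;
      } catch {
        // Left unmarked, so the next scheduled run sees it as new and retries.
        report.unsavedUrls.push(article.url);
      }
    }
  }
  return report;
}
```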
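// Shape inferred from the fallback object below.
interface SummarizedArticle {
  summary: string;
  category: string;
  importance: number;
  model: string;
}

// Parse text as JSON and accept it only if the result is a plain object.
function tryParseObject(text: string): SummarizedArticle | null {
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      return parsed as SummarizedArticle;
    }
  } catch {
    // not valid JSON
  }
  return null;
}

async function parseAiResponse(
  rawResponse: string,
  fallbackTitle: string
): Promise<SummarizedArticle> {
  // Attempt 1: clean JSON parse
  const direct = tryParseObject(rawResponse);
  if (direct) return direct;

  // Attempt 2: extract a flat JSON object from surrounding prose
  // (the pattern does not match nested objects)
  const match = rawResponse.match(/\{[^{}]+\}/);
  const extracted = match ? tryParseObject(match[0]) : null;
  if (extracted) return extracted;

  // Attempt 3: use title as summary, mark degraded
  return {
    summary: fallbackTitle,
    category: "other",
    importance: 1,
    model: "fallback",
  };
}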
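Parsing successfully does not guarantee that the fields are usable: a model can return valid JSON with an unknown category or an importance of 9. One possible follow-up step, sketched here with a hypothetical `normalizeArticle` helper and a category list assumed to match the configuration shown earlier, coerces each field into its allowed range.

```typescript
interface SummarizedArticle {
  summary: string;
  category: string;
  importance: number;
  model: string;
}

// Assumed to mirror the category ids used in the prompt.
const KNOWN_CATEGORIES = ["security", "ai", "cloud", "other"];

function normalizeArticle(
  parsed: unknown,
  fallbackTitle: string,
  model: string
): SummarizedArticle {
  const record = (typeof parsed === "object" && parsed !== null ? parsed : {}) as Record<string, unknown>;
  const rawSummary = record["summary"];
  const rawCategory = record["category"];
  const rawScore = Number(record["importance"]);

  const summary =
    typeof rawSummary === "string" && rawSummary.trim() !== "" ? rawSummary.trim() : fallbackTitle;
  const category =
    typeof rawCategory === "string" && KNOWN_CATEGORIES.includes(rawCategory) ? rawCategory : "other";
  // Clamp to the 1-5 scale; anything non-numeric becomes the lowest score.
  const importance = Number.isFinite(rawScore) ? Math.min(5, Math.max(1, Math.round(rawScore))) : 1;

  return { summary, category, importance, model };
}
```

Defaulting a malformed score to 1 keeps a broken record from being promoted to the top of the digest.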
The design principle here is worth stating plainly: a digest with imperfect summaries is useful. A digest that fails to produce output because of an AI error is not. Build the happy path, then build the failure paths with the same care.
Cost Reality
The actual Neuron consumption for 50 to 80 articles per day across three Cron runs comes to roughly 7,000 Neurons per day. The daily insight generation using the 70B model adds approximately 250 Neurons. Total is around 7,250 of the 10,000-Neuron daily free budget, or about 73% utilization.
That leaves room for some growth before hitting the ceiling, but not much. Adding more feeds or increasing retry rates will close that gap quickly. The 70B model's per-token cost is substantially higher — calling it more frequently would compress the remaining headroom fast.
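To put a number on "not much", the headroom can be worked out from the figures above. The sketch assumes 100 Neurons per additional article, the upper end of the per-article range quoted earlier. The helper names are illustrative.

```typescript
// Percentage of the daily budget already consumed.
function utilizationPercent(usedNeurons: number, dailyBudget: number): number {
  return (usedNeurons * 100) / dailyBudget;
}

// How many more articles fit before the daily budget is exhausted.
function remainingArticles(
  dailyBudget: number,
  usedNeurons: number,
  neuronsPerArticle: number
): number {
  return Math.max(0, Math.floor((dailyBudget - usedNeurons) / neuronsPerArticle));
}

const used = 7_000 + 250; // bulk summarization plus the daily insight
console.log(utilizationPercent(used, 10_000)); // 72.5
console.log(remainingArticles(10_000, used, 100)); // 27
```

Under these assumptions, a couple of busy feeds would be enough to use up the remaining budget.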
KV reads and writes are nowhere near their limits for this workload. Workers execution count is trivially low. The binding constraint is AI inference budget, specifically the Neuron consumption from the bulk summarization Cron.
Connecting the Two Patterns
These two systems are solving different problems with the same underlying tool, and the constraints that shaped each design are worth comparing directly.
In the first case, the AI operates as a development tool. The context window is a constraint on code complexity, not on runtime throughput. The human provides product direction and reviews output; the AI generates implementation. The failure modes are subtle wrong behavior rather than hard errors — a hardcoded value, a test that covers the wrong scenario, a refactoring that silently changes semantics.
In the second case, the AI operates as a runtime component. Every call costs money and latency. Failure is expected and must be handled gracefully. The prompt is a contract that the model sometimes violates, and the surrounding code has to defend against that. The human's job is to design the pipeline and maintain it; the AI does the per-article work autonomously.
What both cases share: neither works without a human who understands the full system. The AI coding assistant cannot hold the entire codebase in context. The AI summarization pipeline cannot guarantee output quality without fallbacks that a human designed. The productivity gains are real, but they require a competent engineer in the loop who knows where the model's attention is limited and builds compensating mechanisms accordingly.
The question worth sitting with is not "what can AI automate" but "where does the system break when the AI is wrong, and how fast can a human fix it." Optimize for that and you get a system that benefits from the speed of AI generation without accumulating the fragility that comes from trusting it unconditionally.