Choose Wisely: Models Should Follow Your Use Case.
Author(s): Dhanush Kandhan
Originally published on Towards AI.

A guy in my builders’ Discord group blew his entire Codex subscription in eleven days. Two weeks into the month, nothing left.

You know what he was building? A billing feature in his SaaS. Not a compiler. Not an operating system kernel. Not a real-time physics simulation. A billing page with subscriptions, invoices, and a Dodo Payments webhook that doesn’t send duplicate emails.

He said it with the exhausted pride of someone who just pushed to prod at 2 AM (we devs are batmans, right?). I nodded. I didn’t say anything. But inside I was doing the mental math.

I run my full AI stack (coding agent, agent workflows, browser automation, speech to text) for around $10 to $15 a month. And I ship. Regularly (my GitHub is proof of that). With billing features and everything.

That conversation is what this post is about.

The Benchmark Theater We All Fell For

Let me describe a pattern you’ve probably noticed. A big AI company or lab drops a new model or version. The announcement lands. Within hours, everyone on X is posting about it.

“Our model built a C compiler from scratch.”
“Our model achieved gold on the International Math Olympiad.”
“Our model solved problems that researchers said required human-level reasoning.”

[Image. Credits: Faiapp Meme Creator]

The posts get thousands of likes. Engineers screenshot the benchmark charts. Someone puts together a thread comparing it to the previous generation. Replies flood in from founders saying they’re switching immediately.

Then someone from Chennai quietly tries it on their actual codebase and reports back that it’s roughly the same as before for their use case. That tweet gets eleven likes.

I’m not mocking the benchmark results. Building a C compiler is impressive. Scoring gold on the IMO is legitimately hard. These results tell you something real about what the model is capable of in controlled settings.
But here is the question nobody asks loudly enough: when was the last time your actual work required an AI to build a C compiler?

Look at what you built last week. Probably a REST endpoint. A React component that talks to it. Some data validation logic. An email template. A webhook handler. A cron job that moves rows between two database tables. Maybe a RAG pipeline if you’re in the AI space. Something with auth. Something with payments.

You are not building compiler infrastructure. You are building software for users. Web apps. Mobile apps. Developer tools. Internal automation. The kind of work where each piece, taken individually, looks boring on a benchmark slide, but which collectively represents most of the software being written on earth today.

The benchmark score tells you the ceiling of what a model can achieve on curated academic tasks. It does not tell you whether the model is the right tool for your Monday morning standup’s ticket queue.

I learned this slowly. And expensively.

What “Open Source” Actually Means Here (It’s Not One Thing)

Before I get into the specific models, I need to clear up something that trips up engineers constantly. When someone says a model is “open source,” they usually mean one of two very different things, and conflating them leads to bad decisions.

The first is open weights. The actual model parameters, the billions of floating point numbers that encode what the model knows, are publicly available. You can download them. You can run them on your own hardware. You can fine-tune them on your own data. You can deploy them inside your own VPC and never send a single token to anyone else’s server. You can modify the architecture and release derivatives. Models like GLM-5.2, DeepSeek V4, Kimi K2.6, and Nemotron from NVIDIA are all open-weight models. The weights live on Hugging Face. Most of them ship under MIT licenses, which means you can use them commercially without paying anyone a licensing fee.
The second is what most of the subscription-based coding tools are: API access. You get to call their endpoint. The model runs on their servers. Their data retention policy applies to your prompts. Their pricing can change next quarter. If their infrastructure has issues on the day you have a demo, that is your problem too. You never see the weights. You cannot run it locally. The model is theirs; you are renting access.

The practical difference matters more than most engineers realize until they’ve felt it. With open weights, your inference cost is literally your compute. You can run the model through an inference provider such as OpenRouter or Together AI and pay per token with no monthly subscription, switching to a better model the day it ships. You can cache aggressively. You can self-host if the data sensitivity requires it. You are not locked into anyone’s pricing model.

There is also a comfortable middle path, which is what I run: open-weight models accessed through inference providers. Pay per token, no subscription, full flexibility to switch, and the per-token cost is typically a fraction of what the closed model APIs charge.

The Stack. For Real.

I’ve read too many “why I use open source models” posts that are basically just “open source good, closed source bad” with a Hugging Face link at the bottom. Useless. Let me be specific.

GLM-5.2 for Coding via OpenCode

When GLM-5.2 dropped from Z.ai, the Beijing-based lab that used to be called Zhipu AI, the X (Twitter) reaction was something. Aravind Srinivas posted about it. Guillermo Rauch appreciated it. The Artificial Analysis Intelligence Index scored it at 51 points, which put it above DeepSeek V4 Pro, Kimi K2.6, and even some Google models. On their GDPval-AA v2 metric, which is their best approximation of real agentic task performance, GLM-5.2 roughly matched GPT-5.5.

But you know how it goes. X (Twitter) energy is its own genre. I do not make infra decisions based on who gets quote-tweeted by whom.

So I used it.
On a $10/month OpenCode Go plan, using it daily. The billing feature I […]
