Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".
All I've observed is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?
Absolutely not. Not quite there, not even close in my experience.
But we should stop talking in 1s and 0s, especially with the marketing hype trains. There is a gradient of capabilities that agents have, and it really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.
But that totally collides with the current narrative, which flattens our work into something that is always the same and can be automated easily in every case. It's not!
That's why the debate is so polarized imo, there isn't a shared experience.
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
Most of my productivity in the last 2-3 months has been thanks to AI, though none of the code there is AI generated.
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that my project is modular enough where each file can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
I don't just copy-paste the AI's output, because it's almost always inefficient anyway, but I use its findings to manually clean up my shit. Maybe they're not that good with GDScript yet which is a bit of a jank language anyway.
So my main framework is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI sometimes: It just has to put existing blocks together.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", and only hit limits like 2 times.
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. It pushes you to do better, maybe one piece at a time like you said, with a well-modularized codebase. I think more devs should share experiences like that. We should push back on the marketing and on the narratives from people who "don't code anymore since X".
You are experiencing a reverse Dunning–Kruger effect.
For someone who had only dabbled in coding before, it used to be the AI building 80% and then a struggle to finish the remaining 20% when trying to build an app/website.
Now it's more like 97% and a struggle with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP-level things to completion with ease helps you stay engaged and motivated to continue and learn.
This is really my biggest worry when it gets to consumer AI. People already have a hard time informing themselves properly. Now we have technology that just boosts the already existing confirmation bias people have. It's sickening.
It’s the opposite: non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing you to get from idea to implementation through a different route (which can be easier or harder depending on the idea and the model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
My mother has started watching 100% AI generated stories on YouTube. They are good enough to be entertaining even if they include random errors like messing up the main character’s name.
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
HN has a mechanism that causes popular blogs to stay popular.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.
They definitely get something barebones up and running, but it's far from a fully fledged application.
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some code myself just to learn how the Enigma encryption machine worked, so that was writing in order to learn. But professionally, I stopped coding in November.
A pattern I've settled into is to write code but leave a TODO for every narrow thing I want the LLM to do for me. Then just tell the agent to fix the todos. It's often faster and easier to give "instructions" this way
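A minimal sketch of that TODO-driven workflow, assuming a marker convention of `TODO(agent):`. The marker name, the `find_todos` helper and the example functions are all made up for illustration; the comment above doesn't specify any of them. The helper just lists the narrow instructions an agent would be asked to resolve:

```python
import re

# Hypothetical marker convention; any unambiguous tag would work the same way.
TODO_PATTERN = re.compile(r"#\s*TODO\(agent\):\s*(.+)")

# Hand-written skeleton with one narrow instruction per TODO.
EXAMPLE_SOURCE = '''
def load_events(path):
    # TODO(agent): parse the CSV at path and return a list of dicts
    raise NotImplementedError

def dedupe(events):
    # TODO(agent): drop events that share the same (title, date) pair
    return events
'''


def find_todos(source: str) -> list[tuple[int, str]]:
    """Return (line_number, instruction) for every agent TODO in the source."""
    todos = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = TODO_PATTERN.search(line)
        if match:
            todos.append((line_number, match.group(1).strip()))
    return todos


if __name__ == "__main__":
    for line_number, instruction in find_todos(EXAMPLE_SOURCE):
        print(f"line {line_number}: {instruction}")
```

Keeping each TODO next to the code it concerns means the surrounding function signature and neighbours act as context, which is arguably why this feels faster than writing a separate prompt.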
This is _the_ question we must all be able to answer, so here goes my attempt - we all have access to the same tools. Before Stack Overflow it was forums and books/manuals, so it's always been about “getting there, showing up, figuring it out”.
Your hypothetical boss has other things to do than kick an LLM around at that price.
I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
How do you justify your salary given that you're just using an OSS compiler/editor that any of us could use for free in your role?
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
Usually I describe the problem, explore a bit with LLM iteratively. Then I switch to creating a plan when I have enough insight (and the LLM has it in context/same session as exploration), specifying all the things I'm trying to accomplish.
Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.
Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.
Sometimes using an LLM is theoretically slower than hand-rolling - if I just sat down and focused I could outperform the iteration and the waiting, especially considering how stupid agents are at running expensive builds/test suites (with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with LLM is that you have a bunch of thinking traces saved as a part of your iteration process - it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. Like so many people like to point out - writing code ends up being less and less of your time as you level up in your career.
Please see Ben Evans’ podcast for a good take on this. Coding is just one of the tasks you do in your job; it is not the job, or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value for the company. If that is automated then yes, sadly, your salary is not justified.
I agree, but the reality is that most people work to make a living, not to have fun. If you enjoy your job because you mostly get to write code in a tight feedback loop instead of doing the "hard" work of planning, writing and reviewing specs, balancing customer requirements, and the lot, you have a very privileged life. And those jobs are probably going to get fewer now.
It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.
To me, LLM's free up time for me so that I can spend time on the fun parts of coding. Less boilerplate, more focus on the interesting problems. This is no different from using high level languages. The problem domain is less around memory management and garbage collection and closer to the problem you're actually trying to solve.
But we’ve had tools to automate out the boilerplate for years. We don’t need ai for that. It’s seriously like we all forgot we could run one command and scaffold a project. AI isn’t even that great at it. Last I tried a month ago it used a really out of date version of nextjs and picked all sorts of random deps that weren’t in the plan.
I could have just used the next project scaffold tool and been on my way before the ai even started returning output.
I agree with this. I feel like there’s a false dichotomy right now in a lot of these discussions where one can only vibe code or only code by hand. It is possible to do both…
Someone competent using them is a requirement today, and for a while that will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.
You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to take over and figure it out.
They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?
Supply and demand. Not many people are good at programming and it's highly in demand.
The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.
it can be easily dismissed because "anyone can use the tool that costs $20" makes no meaningful sense.
this was always true; in fact, $20 is more than the $0 that notepad++ costs.
it's a flippant statement. Go down the line of any tool; its cost has basically nothing to do with the skill it takes to operate it. See basically everything. There's levels.
I have no idea what you're trying to say. If anyone really can vibe code then programming salaries are pretty much guaranteed to come down. The critical question is whether it really is true that anyone can do it, or if it still requires rare skill.
are you a programmer? it 100% requires skill. AI or not.
i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.
Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.
It will almost never converge on the general solution that will pass tests you haven't given it yet.
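A small illustration of that failure mode, using leap years as a stand-in problem. Everything here (the function names, the test sets) is made up for the example and not taken from any real agent session. The "overfit" version satisfies every test it was shown yet fails on inputs it was never shown:

```python
def is_leap_overfit(year: int) -> bool:
    # "Kinda works": satisfies the tests it was given, ignores the century rules.
    return year % 4 == 0


def is_leap_general(year: int) -> bool:
    # General Gregorian rule: divisible by 4, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Tests the tool was given while iterating.
GIVEN_TESTS = {2020: True, 2023: False, 2024: True}

# Tests nobody wrote down yet.
HELD_OUT_TESTS = {1900: False, 2000: True, 2100: False}


def passes(func, cases: dict[int, bool]) -> bool:
    """True if func returns the expected value for every case."""
    return all(func(year) == expected for year, expected in cases.items())


if __name__ == "__main__":
    for func in (is_leap_overfit, is_leap_general):
        print(func.__name__, passes(func, GIVEN_TESTS), passes(func, HELD_OUT_TESTS))
```

Both versions look identical if you only judge them by the given tests, which is the point: the test suite you hand over bounds what "working" means.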
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
While I certainly like parentheses highlighting and rainbow parentheses, I've programmed Clojure without syntax highlighting and while it’s not as nice as it would be with, it’s fine.
I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.
Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.
'Nail guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point they blended seamlessly into the roofer's process and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.
Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
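To make the difference between those two kinds of test concrete, here is a sketch with a made-up `parse_price` function (the function, the two source strings and both checkers are hypothetical). The source-matching check accepts a broken implementation, while the behavioural check executes it and catches the bug:

```python
import re

GOOD_SOURCE = '''
def parse_price(text):
    return float(text.replace("$", "").replace(",", ""))
'''

BROKEN_SOURCE = '''
def parse_price(text):
    text.replace(",", "")  # result discarded, so commas are never removed
    return float(text.replace("$", ""))
'''


def source_regex_test(source: str) -> bool:
    """Weak test: only checks that the source text mentions the comma handling."""
    return re.search(r'replace\(",", ""\)', source) is not None


def behavioural_test(source: str) -> bool:
    """Stronger test: load the function, call it, and compare the output."""
    namespace = {}
    exec(source, namespace)
    try:
        return namespace["parse_price"]("$1,234.50") == 1234.5
    except ValueError:
        return False


if __name__ == "__main__":
    for name, source in (("good", GOOD_SOURCE), ("broken", BROKEN_SOURCE)):
        print(name, source_regex_test(source), behavioural_test(source))
```

A quick way to spot the weak kind in generated test suites is to look for tests that read files or inspect source text but never call the function under test.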
While some people got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand-holding with the models. Games that are more interface-based, like point-and-click or something like Reigns, are easier though.
It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1M context is a huge difference (~30 min vs 2.5 hr between compactions significantly increases the scope of what I can get the AI to do before it goes stupid). The other big difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple of months ago, Claude was a lot more likely to either say "This is too much work, I'm not going to do all of it", tell me the idea was genius (and then pretend to do it), or do something equally useless.
It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with the Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new Flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs, etc., and the results for that have improved dramatically.
That’s me. Built a scraper to dump a list of images to a CSV for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch, which used to be a loooot of manual sifting.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed, to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and to prompt further to improve it based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, the poster gets dismissed.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. Reasonable if it's considered a dichotomy.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather as a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't.
For me, it has gone from a novelty, to good for some kinds of quick manual search, to "I guess it can debug some kinds of errors at times in very specific conditions", to "hey, I think I am getting a bit addicted to the autocomplete in the IDE, even if I don't use it for anything intelligent; it's becoming indispensable, but only this part", to "it's good for areas I lack expertise in", to "agentic sucks, I will stick with discussing algorithms and architecture with it on greenfield projects", to "holy shit, it can do agentic decently well now, though I am skeptical about giving it access in more than limited cases", to "I am getting close to letting it run free on my device in the not so distant future, I guess".
Some of these were big jumps, and at each point I was skeptical of growth. Every time, I thought the growth would slow down now: from the days of 2k context windows to millions now, from basic chat completion to working on complex adaptive systems, game-theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, and I don't think it could have accomplished alone all that it has done for me. But I do at times remain awake at night reflecting on my self-worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have written it all off as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
Pure vibe coding won't work. You need to define an excellent architecture, have great specs and a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, and do QA and some code review.
At every point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
I work at a company that deploys AI to enterprises
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slide decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
I use it a lot now for knocking up Grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers; it’s just much quicker driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce .docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
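For the markdown -> .docx step of a pipeline like that, a minimal sketch is below. It assumes the `pandoc` binary is installed and on PATH, and the helper names and file names are made up; only the basic `pandoc input.md -o output.docx` invocation is standard. The Copilot formatting steps are manual and not covered here:

```python
import subprocess
from pathlib import Path


def pandoc_docx_command(markdown_path: str) -> list[str]:
    """Build the pandoc invocation that writes a .docx next to the Markdown file."""
    source = Path(markdown_path)
    target = source.with_suffix(".docx")
    return ["pandoc", str(source), "-o", str(target)]


def convert(markdown_path: str) -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_docx_command(markdown_path), check=True)


if __name__ == "__main__":
    # Prints the command instead of running it, so the sketch works without pandoc.
    print(" ".join(pandoc_docx_command("report.md")))
```

Because the .docx is always regenerated from the .md file, the Markdown can stay the single version-controlled source, which matches the point about diffing the .md artefacts.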
With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.
If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
It was a classic deck by somebody who used way too much text and only bullets. It did a great job of presenting the content in a simpler and better-structured way: pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
If you don't want an LLM to write the words, surely you also want to decide on the data and graphs to show by yourself? Isn't that 90% of a presentation? The "looking nice" part doesn't matter as much, it could be black text on a white background and it would be fine.
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
Copilot Cowork in the M365 ecosystem. It inherits all the permissions from my account, has access to exchange to send me emails, and OneDrive to save each day’s summary for posterity and future refinement.
I think Claude Cowork through the Microsoft thing, which was Copilot but is now named M365 (or something?), is likely creating every PowerPoint presentation within our organisation at this point.
We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a Librechat where you can create and configure your own agents with the same filesystem access to your OneDrive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular Office365 Copilot, and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT), means that they'll likely be the company that sells us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data is private, which is very important for many companies, both from the IP protection perspective and because of the liability for exposing user/customer data (think of GDPR regulations in Europe).
I don't know who is behind LibreChat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.
As a former data scientist, I started to use code agents 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a code agent.
I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
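To make the "sanity checks with other places in the database" idea concrete, here's a minimal sketch of that kind of cross-check. Everything in it is made up for illustration: the `orders` and `daily_totals` tables, the sample rows and the function names are assumptions, not anything from the actual report described above.

```python
import sqlite3

# Hypothetical schema: an `orders` fact table and a separately maintained
# `daily_totals` rollup. Both names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, day TEXT, amount REAL);
    CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total REAL);
    INSERT INTO orders (day, amount) VALUES
        ('2025-01-01', 10.0), ('2025-01-01', 5.0), ('2025-01-02', 7.5);
    INSERT INTO daily_totals VALUES ('2025-01-01', 15.0), ('2025-01-02', 7.5);
""")


def reported_revenue(conn):
    """The headline figure that would go into the report."""
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


def sanity_checks(conn):
    """Cross-check the headline figure against an independent source."""
    problems = []
    rollup = conn.execute("SELECT SUM(total) FROM daily_totals").fetchone()[0]
    if abs(reported_revenue(conn) - rollup) > 1e-6:
        problems.append("orders total disagrees with daily_totals rollup")
    nulls = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
    ).fetchone()[0]
    if nulls:
        problems.append(f"{nulls} orders have a NULL amount")
    return problems


print(reported_revenue(conn), sanity_checks(conn))
```

An empty list from `sanity_checks` only means these two checks passed; which cross-checks are worth running depends entirely on the database in question.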
Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
Simon mentions further along in his article that given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), that it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
Whether it turns out to be a good change or not remains to be seen.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns; also, it's cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.
We're most likely entering a year or two of rapid vulnerability discovery and patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
Wouldn't it drive up the cost of finding vulnerabilities when all the low hanging fruit has already been scanned and patched? Like the new baseline for finding a vulnerability will be something an LLM couldn't find.
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
Why is there no talk about how the world is already run by AI by proxy? I.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs etc. I just don't seem to read about this; instead it's more about some nebulous specific date in the future.
December 2025 was the breakthrough for me.
January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss although 2x limits temporarily, not sure about Claude it's sort of okay still not as good as it felt before, slowly increasing limits with more compute and rebuilding good will.
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to
I only used Claude first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the 20$ subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the likes. Margins are slim though and the harness at these model sizes is the most important factor.
My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
what are your thoughts on Software engineer replacement. My team has already seen big reductions. Q/A team is gone. Software Engineer reduced by a third.
Scared for the future
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
If you're famous, you'll be fine. If you're in retiring age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
This is the magic question that I'm very eager to hear the answer to.
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
You will immediately notice the difference if you use it at the threshold.
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were to just watch them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.
By definition the differences between "best models" are small. It's tautology. If a model is significantly dumber than the others then it's not one of the best models.
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.
Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.
For other domains I am not sure, but I've heard from people like Cal Newport that the rate of improvement outside of code and math is not as impressive.
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
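To make that gap between 'code that compiles' and 'code that does what we want' concrete, here's a toy sketch. It is not a claim about how any lab actually computes rewards; the `compiles` and `behaves` checkers, the `add` task and the candidate snippet are all invented for illustration.

```python
def compiles(source: str) -> bool:
    """Signal 1: does the candidate parse and compile at all?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def behaves(source: str, cases) -> bool:
    """Signal 2: does the candidate's `add` return what we asked for?"""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        return all(namespace["add"](a, b) == want for a, b, want in cases)
    except Exception:
        return False


# Hypothetical model output: perfectly valid Python, wrong behaviour.
candidate = "def add(a, b):\n    return a - b\n"
cases = [(1, 2, 3), (0, 0, 0)]

print(compiles(candidate), behaves(candidate, cases))
```

The candidate gets full marks from the first checker and none from the second, which is the indirection the comment is pointing at.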
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
It depends on what you’re comparing it against. For $20, OpenAI is still probably the best value for SOTA models. In terms of limits, you can use GPT-5.4 instead of 5.5. The intelligence feels similar, but it’s cheaper. You can also experiment with other harnesses like pi. It’s lightweight but capable enough, and its token usage is definitely much more efficient.
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time.
I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.
Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".
All I've observed is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?
Absolutely not: not quite there, not even close in my experience.
But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have that really depends on the intricacies of the codebase you're working on, and I think everyone has yet to discover how to better apply these tools in their day to day work.
But that totally collides with the current narrative, that flattens out our work to be always the same and that can be automated easily in each case, it's not!
That's why the debate is so polarized imo, there isn't a shared experience
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
Most of my productivity in the last 2-3 months has been thanks to AI, though none of the code there is AI generated.
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that my project is modular enough where each file can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
I don't just copy-paste the AI's output, because it's almost always inefficient anyway, but I use its findings to manually clean up my shit. Maybe they're not that good with GDScript yet which is a bit of a jank language anyway.
So my main framework is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI sometimes: It just has to put existing blocks together.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", and only hit limits like 2 times.
Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png
For someone who had just dabbled in coding before, it went from the AI building 80% and me struggling to finish the last 20% when trying to build an app/website, to now more like 97% and struggling with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP-level things to completion with ease helps you stay engaged and motivated to continue and learn.
https://gemini.google.com/share/55e250c99693
https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...
Interesting that it does better at making the pelican pedal in the video generation than in image generation.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
They definitely get something barebones up and running, but it's far from a fully fledged application.
I did write some stuff myself just to learn how the Enigma encryption machine worked, so that one I wrote by hand in order to learn. But professionally, I stopped coding in November.
Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.
I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).
AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.
1. Spec -> plan -> code (all agent driven, maybe with grill-me or ultraplan)
2. Handwritten spec -> agent driven plan -> agent driven code
3. Agent driven spec -> vibed code -> Fix by handholding until ok-ish
4. Vibed throwaway prototypes -> extract useful patterns -> rewrite with handholding
5. Generate file structure with handholding -> manual TODO comments -> Fill in blanks with handholding
Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.
Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.
Sometimes using an LLM is theoretically slower than hand-rolling - if I just sat down and focused I could outperform the iteration and the waiting, especially considering how stupid agents are at running expensive builds/test suites (with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with LLM is that you have a bunch of thinking traces saved as a part of your iteration process - it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. Like so many people like to point out - writing code ends up being less and less of your time as you level up in your career.
But it's by far the most fun part and the only reason to take such a job...
It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.
I could have just used the next project scaffold tool and been on my way before the ai even started returning output.
What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"
It is extremely ignorant.
The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.
this was always true in fact $20 is more than the free it costs for notepad++
it's a flippant statement. Go down the line of any tool; its cost has basically nothing to do with skill difference to operate it. See basically everything. There's levels.
i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.
Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code
It will almost never converge on the general solution that will pass tests you haven't given it yet.
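A toy illustration of that failure mode, as a minimal Python sketch. The leap-year task, both function names, and the test tables are made up for the example; nothing here comes from a real agent transcript. The first function is the kind of answer that satisfies the tests it was shown, the second is the general rule.

```python
def is_leap_overfit(year: int) -> bool:
    # Hypothetical "kinda works" answer: enough to pass the given tests.
    return year % 4 == 0


def is_leap_general(year: int) -> bool:
    # The general Gregorian rule, including the century exceptions.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Tests the solution was written against.
GIVEN_TESTS = {2024: True, 2023: False, 2020: True}
# Tests nobody handed over yet.
HELD_OUT_TESTS = {1900: False, 2000: True, 2100: False}


def passes(fn, cases) -> bool:
    # True only if fn agrees with every expected value in the table.
    return all(fn(year) == expected for year, expected in cases.items())
```

Both versions pass `GIVEN_TESTS`; only `is_leap_general` passes `HELD_OUT_TESTS`, because the overfit version gets 1900 and 2100 wrong.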
This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.
Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.
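Paren balance is at least cheap to check mechanically before accepting generated Lisp. A rough Python sketch of such a gate, under a loud assumption: it counts raw characters only and ignores strings, comments, and character literals, so it is not a real reader. The function name is hypothetical.

```python
def parens_balanced(source: str) -> bool:
    # Track nesting depth; a close paren with nothing open is an
    # immediate failure, and anything left open at the end fails too.
    depth = 0
    for ch in source:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

For example, `parens_balanced("(define (sq x) (* x x))")` is true, while dropping the final paren makes it false.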
I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.
'Nail Guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point they blended seamlessly into the roofer's process and dramatically multiplied the work that could be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But for stuff that can be tested easily, where you have a firm grasp of what you want to achieve but not necessarily exactly how, I've been impressed.
> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.
> I do check the documents, and what they're doing. I also check the tests, some more thorough.
Sounds like programming, but with extra steps.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
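To make the distinction concrete, here is a small Python sketch of the two test styles. Everything in it is hypothetical: `slugify` is an invented function with a deliberate bug (it never replaces spaces), and the two check functions are illustrations, not output from any real agent.

```python
import re

# Invented code under test, kept as text so both styles can look at it.
SOURCE = '''
def slugify(title):
    # Deliberate bug: lowercases but never replaces spaces with dashes.
    return title.lower().strip()
'''

namespace = {}
exec(SOURCE, namespace)
slugify = namespace["slugify"]


def source_regex_check() -> bool:
    # "Regex the source code": only proves the text mentions .lower().
    return re.search(r"\.lower\(\)", SOURCE) is not None


def behaviour_check() -> bool:
    # "Execute functions and check outputs": calls the code for real.
    return slugify("Hello World") == "hello-world"
```

The source-regex check passes on the buggy function; the behaviour check fails, which is the only one of the two results that says anything about whether the code works.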
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.
I divide the work to fit within that 100k and use subagents for the tasks.
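One way that splitting could look, as a rough Python sketch. The 100k figure is the one from the comment; the greedy packing, the function name, and the per-task token estimates are all assumptions for illustration, and no real agent or subagent API is involved.

```python
from typing import List, Tuple

TOKEN_BUDGET = 100_000  # the "smart zone" figure quoted above


def pack_tasks(tasks: List[Tuple[str, int]],
               budget: int = TOKEN_BUDGET) -> List[List[str]]:
    """Greedily group (name, estimated_tokens) tasks, in order, into
    batches whose estimates sum to at most `budget`. A single task that
    is larger than the budget ends up alone in its own batch."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, tokens in tasks:
        # Start a new batch when adding this task would overflow.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical task list with made-up token estimates.
example = [("parser", 60_000), ("lexer", 30_000), ("cli", 40_000),
           ("docs", 50_000), ("tests", 20_000)]
plan = pack_tasks(example)
# plan == [["parser", "lexer"], ["cli", "docs"], ["tests"]]
```

Each inner list would then be handed to one subagent with a fresh context.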
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” and further prompt to improve it based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just run the experiment yourself: wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather as a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether the benefit is worth the effort and cost, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kinds of errors at times in very specific conditions, to hey I think I am getting a bit addicted to the autocomplete in the IDE even if I don't use it for anything intelligent, but it's becoming indispensable now, but only this part, to it's good for areas I lack expertise in, to agentic sucks, I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now, but I am skeptical about giving it access in more than limited cases, to now I am getting close to letting it run free on my device in the not so distant future, I guess. Some of these were big jumps, and at each point I was skeptical of further growth. Every time I thought the growth would slow down now, from the days of 2k context windows to millions now. From basic chat completion to working on complex adaptive systems, game-theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times remain awake at night reflecting on my self-worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have written it all off as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
It makes no sense to me.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
AI is a tool. Use it appropriately.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
I use it a lot now for knocking up grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers, it’s just much quicker driving them.
As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.
I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.
The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.
I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.
What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.
Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.
Classic deck of somebody that used way too much text and only bullets. It did a great job on that presenting the content in a more simple and better structured way. Pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.
The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
Thanks!
We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage in this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a LibreChat where you can create and configure your own agents with the same filesystem access to your OneDrive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular Office 365 Copilot and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft is fumbling the ball on AI, but being very good at enterprise customers (especially non-IT) means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the LibreChat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data is private, which is very important for many companies both from the IP protection perspective and because of the liability of exposing user/customer data (think of GDPR regulations in Europe).
I don't know who is behind LibreChat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose and if shit happens it is easier to sue them than some random VC-financed company from the USA.
Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.
Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.
I'm not sure that's true anymore considering how popular Simon's blog is
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
Well, a combination of that and believing that replication of test data is a good measure of progress.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns, and it's also cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
We're most likely entering a year or two of rapid vulnerability discovery and patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
You used to have a couple of days to close a breach, now it's 2 hours.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
Even operations and GTM are all at "professional" level (which I think is vaguely equivalent to 5x).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
https://github.com/openclaw/openclaw/pulse?period=daily
279 commits to main from 77 authors in the last 24 hours.
Why is there so much churn and how could you trust it with your data? This is changes in ONE day!
If these are useful changes, surely it’d be superhuman by now given months of this pace.
What are people using this for?
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.
Opus 4.5 hit that point in November.
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
In my limited experience they write test cases, test each story, do regression test, verify bugs from customers. All by hand.
At my current job I don't want to miss them.
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
Fundamentally, steering LLMs requires the same structured, logical thought process that is required to write code, regardless of abstraction level. Unlike what HN would have you believe this is not a skill that is equally distributed across the population.
But given the rapid pace at which this technology is evolving, "steering" may very well be ceded to the clankers. LLM agents are fantastic at logical reasoning & any inefficiencies relative to human experts can be circumvented by sheer compute.
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were to just watch them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.
I've certainly had things that Opus fixed using some kind of work around that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier to the Opus/GPT-5.5 tier is immediately obvious.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
Hmmm......
Does that suggest the uplift was only for things that are easily verifiable like code?
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
Other domains I am not sure about, but I've heard from people like Cal Newport that the rate of increase outside of code and math is not as impressive
It would support your point about the performance of 20GB local models.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
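A tiny sketch of that gap between 'code that compiles' and 'code that does what we want'. Assumptions, stated loudly: Python's built-in `compile` stands in for g++/javac/tsc here, and the `median` candidate is an invented example with a deliberate bug, not output from any model.

```python
# Invented candidate: syntactically fine, wrong for even-length input.
CANDIDATE = '''
def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]
'''


def compiles(source: str) -> bool:
    # The "compiler reward": does the text parse and compile at all?
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def does_what_we_want(source: str) -> bool:
    # The check we actually care about: run it and compare outputs.
    namespace = {}
    exec(source, namespace)
    median = namespace["median"]
    return median([3, 1, 2]) == 2 and median([1, 2, 3, 4]) == 2.5
```

`compiles(CANDIDATE)` is true while `does_what_we_want(CANDIDATE)` is false, since the candidate returns 3 for `[1, 2, 3, 4]`.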
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
Proof by existence?
https://gist.github.com/nlothian/50241d34a654fcf0caa280d4475...
Looks pretty good to me. ChatGPT in "Thinking" model.
Edit: I've added the Opus version on the same link.
https://chatgpt.com/share/e/6a0bf28b-e198-8012-9a88-c777d965...
When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.
Is the only choice to pay for the "max" plans?
Or just read so much about it that you bs your way through an interview and then use the company's resources?
Simon, I'm curious too how much you invest each month researching all the latest and greatest AI tech?
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
As time progresses one now has a yard stick to measure against progress. No more excuses - show me the money baby.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.