Indeed, that's what I kind of hinted at in https://news.ycombinator.com/item?id=46442195 (and coincidentally https://news.ycombinator.com/item?id=46437688 shortly after): OK, one can "generate" a "solution", and that's much easier than before... but until we can somehow verify that it actually does what it says it does (and we know about hallucinations and have no reason to believe this has changed), testing itself, especially of well-known "problems", is more and more important.
That being said, it doesn't answer the "why" in the first place, an even more important question. At least it does help somewhat in comparing with existing alternatives.
Folks think, they write code, they do their own localized evaluation and testing, then they commit, and then the rest of the (down|up)stream process begins.
LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step. Granted, most humans don't do this step as thoroughly and carefully as would be desirable (sometimes through laziness, sometimes because of a belief in (down|up)stream testing processes). But LLMs don't do it at all.
They absolutely can do that if you give them the tools. Seeing Claude (I use it with opencode agents) run curl and playwright to verify and then fix its implementation was a real 'wow' moment for me.
> LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step.
I'm not sure where this idea comes from. Just instruct it to write and run unit tests and document as it goes. All of the ones I've used will happily do so.
You still have to verify that the unit tests are valid, but that's still far less work than skipping them or writing the code/tests yourself.
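For instance, for a project like this you can ask the model to emit and execute something along these lines (a minimal sketch; cpu_t, cpu_step(), and the test harness are my own stand-ins, not anything from the article's code):

    #include <assert.h>
    #include <stdint.h>

    // Hypothetical 6502-style CPU state; the names are stand-ins.
    typedef struct { uint8_t a; uint16_t pc; uint8_t mem[65536]; } cpu_t;

    // Minimal stand-in stepper handling only the opcode under test;
    // the real emulator's fetch/decode/execute loop goes here.
    void cpu_step(cpu_t *c) {
        uint8_t op = c->mem[c->pc++];
        if (op == 0xA9) c->a = c->mem[c->pc++]; // LDA immediate
    }

    // LDA #$42 (opcode 0xA9) must load 0x42 into the accumulator
    // and advance the program counter past the 2-byte instruction.
    void test_lda_immediate(void) {
        cpu_t c = {0};
        c.mem[0] = 0xA9;
        c.mem[1] = 0x42;
        cpu_step(&c);
        assert(c.a == 0x42);
        assert(c.pc == 2);
    }

    int main(void) { test_lda_immediate(); return 0; }

Verifying that 0xA9 really is LDA immediate and that the flag/PC behavior matches the 6502 references is the part that stays on you.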
Heck, when Satya Nadella wanted to demonstrate Copilot coding, he had it emit an Altair emulator. I guess there's little room for creativity in 8-bit emulator design so LLMs can handle them well. https://thenewstack.io/from-basic-to-vibes-microsofts-50-yea...
It demonstrated the capabilities of an AI to a potentially on-the-fence audience while giving the author experience using the new tools/environment. That's solid value. I also just find it really cool to see that an AI did this.
It’s a shame that the source code isn’t commented and documented more. At the very least, it would be helpful to add some documentation for every CPU opcode being emulated.
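Even a short header per handler would go a long way. A sketch of what I mean, with made-up names rather than the actual codebase's structure (the flag semantics are the standard 6502 ones):

    #include <stdint.h>

    typedef struct {
        uint8_t a;              // accumulator
        uint8_t carry, z, v, n; // flag bits
        uint16_t pc;
        uint8_t mem[65536];
    } cpu_t;

    // 0x69: ADC immediate -- A = A + operand + carry. 2 cycles.
    // Flags: C = unsigned overflow, Z = (A == 0), N = bit 7 of A,
    // V = signed overflow (inputs share a sign that the result flips).
    // Note: the NES's 2A03 core has no decimal mode, so BCD is ignored.
    static void op_adc_imm(cpu_t *c) {
        uint8_t m = c->mem[c->pc++];
        uint16_t sum = (uint16_t)c->a + m + c->carry;
        c->v = (~(c->a ^ m) & (c->a ^ sum) & 0x80) != 0;
        c->carry = sum > 0xFF;
        c->a = (uint8_t)sum;
        c->z = (c->a == 0);
        c->n = (c->a >> 7) & 1;
    }

The V-flag subtlety and the missing decimal mode are exactly the kind of thing a reader can't easily reconstruct from bare code.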
Forbidding the LLM to write comments and docstrings (preferably enforced by a build step and commit hook) is one of the best "hacks" for using these things. An LLM cannot help but emit poisonous comments.
Meh. No human has written the horrors an LLM produces; at least I have yet to see a codebase like that. Let me attempt a theatrical reenactment:
    // Use buffer that is large enough to hold any possible value. Avoid using
    // JSON configuration, this optimizes codebase and prevents possible security exploits!
    size_t len = 32;

    // this function does not call "sort" utility using shell anymore, but instead
    // uses optimized library function "sort" for extreme performance improvement!!!
    void get_permutations() {
... and so on. It basically uses comments as a wall on which to scribble grandiose graffiti about its valiant conquests in following explicit instructions after the fifth repeat and not committing egregious violence against common sense.
Give it copy-paste / translate tasks and it’s a no-brainer (quite literally).
But the same can be said of humans.
The question here is: did it implement it because it read the available online documentation about the NES architecture, OR did it just see one too many such implementations?
Indeed, the 'cleanroom' standard always was: one team does the RE and writes a spec; another team that has never seen the original (and has signed statements with penalty clauses to prove it) then does the re-implementation. If you were to read the implementation, write the spec, and then write the re-implementation, that would definitely violate the standard for claiming an original work.
WASM, and the performance seems catastrophically bad (45 ms to render a frame on an M4 laptop). It would be much more impressive if Claude could optimize it into something someone would actually want to play. Compare this to a random hit from Google, https://jsnes.org/, which has sound, a much smaller payload, and runs really fast (<1 ms to render a frame).
The cost of slop is a >40x drop in performance? Pick any metric you care about for your domain; perhaps that's what you're going to lose. And is the effort to recover it practical with current vibe-coding strategies?
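Numbers like that are cheap to check, by the way. A sketch of the measurement, where nes_step_frame() is a hypothetical stand-in for whatever the emulator's per-frame entry point is:

    #include <stdio.h>
    #include <time.h>

    // Hypothetical stand-in for the emulator's per-frame entry point;
    // link against the real thing to measure it.
    void nes_step_frame(void) { /* ... */ }

    int main(void) {
        struct timespec t0, t1;
        const int frames = 600; // ~10 seconds of emulated time
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < frames; i++)
            nes_step_frame();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double total_ms = (t1.tv_sec - t0.tv_sec) * 1e3
                        + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%.2f ms/frame (budget at 60 fps: ~16.7 ms)\n", total_ms / frames);
        return 0;
    }

At 45 ms/frame it can't even hold 60 fps on an M4, which is the one metric an emulator has to hit.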
> it doesn't answer the "why" in the first place, an even more important question
Why would this be any different?
> You still have to verify that the unit tests are valid
That's what the author did when they ran it.
Until it's shown to work, it's just hearsay to me, coming from someone with a multi-billion-dollar horse in the race.
This endeavor had negative net value.
> did it implement it because it read the available online documentation about the NES architecture OR did it just see one too many such implementations?
GitHub alone has 4k+ NES emulator projects: https://github.com/search?q=nes%20emulator&type=repositories
This is more like "wow, it can quote training data".