Microgpt explained interactively

(growingswe.com)

157 points | by growingswe 14 hours ago

8 comments

politelemon 3 hours ago
> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.
Hey, I am able to see kamon, karai, anna, and anton in the dataset, so it'd be worth using some other names: https://raw.githubusercontent.com/karpathy/makemore/988aa59/...
- ayhanfuat 2 hours ago
  You are absolutely right. The whole post reads like it was AI-generated.
  - jsheard 2 hours ago
    The rate at which they are posting new articles on random subjects is also pretty indicative of a content mill.
    In 3 days they've covered machine learning, geometry, cryptography, file formats and directory services.
  - re 2 hours ago
    I didn't get that sense from the prose; it didn't have the usual LLM hallmarks to me, though I'm not enough of an expert in the space to pick up on inaccuracies/hallucinations.
    The "TRAINING" visualization does seem synthetic though, the graph is a bit too "perfect" and it's odd that the generated names don't update for every step.
    - oytis 7 minutes ago
      For me it was the prose that alarmed me. Short sentences, aggressive punctuation, desperately trying to keep you engaged. It is totally possible to ask the model to choose a different style - I think that's either the default or corresponds to the tastes of the content creators.
  - butterisgood 2 hours ago
    ISWYDT
- growingswe 2 hours ago
  Thanks, will fix
malnourish 2 hours ago
I read through this entire article. There was some value in it, but I found it to be very "draw the rest of the owl". It read like introductions to conceptual elements or even proper segues had been edited out. That said, I appreciated the interactive components.
- davidw 1 hour ago
  It started off nicely but before long you get
  "The MLP (multilayer perceptron) is a two-layer feed-forward network: project up to 64 dimensions, apply ReLU (zero out negatives), project back to 16"
  Which starts to feel pretty owly indeed.
  I think the whole thing could be expanded to cover some more of it in greater depth.
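For readers who find the quoted MLP sentence too compressed, here is a minimal sketch of what it describes, in plain Python. The sizes 16 and 64 come from the quote; everything else (the function names, the random weights, the omission of bias terms) is an assumption made for illustration and is not taken from the article or from Karpathy's code.

```python
import random

def linear(x, w):
    # Matrix-vector product: w holds one row of weights per output unit.
    # Bias terms are left out to keep the sketch short (an assumption).
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(x):
    # "Zero out negatives": keep positive values, replace the rest with 0.
    return [max(0.0, v) for v in x]

def make_matrix(n_out, n_in, rng):
    # Hypothetical helper: small random weights, only so the sketch runs.
    return [[rng.gauss(0, 0.08) for _ in range(n_in)] for _ in range(n_out)]

def mlp(x, w_up, w_down):
    hidden = relu(linear(x, w_up))  # project up: 16 -> 64, then ReLU
    return linear(hidden, w_down)   # project back: 64 -> 16

rng = random.Random(0)
w_up = make_matrix(64, 16, rng)
w_down = make_matrix(16, 64, rng)
x = [rng.gauss(0, 1) for _ in range(16)]

y = mlp(x, w_up, w_down)
print(len(x), len(y))
```

The input and output both have 16 numbers; only the intermediate `hidden` vector is wider, which is the "project up, then project back" shape the quote refers to.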
jmkd 1 hour ago
It says it's tailored for beginners, but I don't know what kind of beginner can parse multiple paragraphs like this:
"How wrong was the prediction? We need a single number that captures "the model thought the correct answer was unlikely." If the model assigns probability 0.9 to the correct next token, the loss is low (0.1). If it assigns probability 0.01, the loss is high (4.6). The formula is −log(p) where p is the probability the model assigned to the correct token. This is called cross-entropy loss."
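The numbers in that quoted paragraph can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the natural logarithm, which is what the quoted values 0.1 and 4.6 correspond to.

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    # Subtracting the max first avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # Loss is -log(p), where p is the probability given to the correct token.
    return -math.log(probs[target])

# Confident and correct: small loss.
print(round(cross_entropy([0.9, 0.1], 0), 2))    # 0.11
# Correct token judged very unlikely: large loss.
print(round(cross_entropy([0.01, 0.99], 0), 2))  # 4.61
```

The loss depends only on the probability assigned to the correct token, so it grows quickly as that probability approaches zero.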
love2read 1 hour ago
Is it becoming a thing to misspell and add grammatical mistakes on purpose to show that an LLM didn't write the blog post? I noticed several spelling mistakes in Karpathy's blog post that this article is based on and in this article.
- klysm 1 hour ago
  I expect this kind of counter signaling to become more common in the coming years.
- refulgentis 16 minutes ago
  People aren't gonna be happy I spell this out, but, Karpathy's not The Dude.
  He's got a big Twitter following so people assume something's going on or important, but he just isn't.
  Biggest thing he did in his career was feed Elon's Full Self Driving delusion for years and years and years.
  Note, then, how long he lasted at OpenAI, and how much time he spends on code golf.
  If you're angry to read this, please, take a minute and let me know the last time you saw something from him that didn't involve A) code golf B) coining phrases.
- efilife 52 minutes ago
  You just started to notice it
grey-area 1 hour ago
The original article from Karpathy: https://karpathy.github.io/2026/02/12/microgpt/
windowshopping 2 hours ago
The part that eludes me is how you get from this to the capability to debug arbitrary coding problems. How does statistical inference become reasoning?
For a long time, it seemed the answer was it doesn't. But now, using Claude Code daily, it seems it does.
- ferris-booler 1 hour ago
  IMO your question is the largest unknown in the ML research field (neural net interpretability is a related area), but the most basic explanation is "if we can always accurately guess the next 'correct' word, then we will always answer questions correctly".
  An enormous amount of research+eng work (most of the work of frontier labs) is being poured into making that 'correct' modifier happen, rather than just predicting the next token from 'the internet' (naive original training corpus). This work takes the form of improved training data (e.g. expert annotations), human-feedback finetuning (e.g. RLHF), and most recently reinforcement learning (e.g. RLVR, meaning RL with verifiable rewards), where the model is trained to find the correct answer to a problem without 'token-level guidance'. RL for LLMs is a very hot research area and very tricky to solve correctly.
- fc417fc802 1 hour ago
  Because it's not statistical inference on words or characters but rather stacked layers of statistical inference on ~arbitrarily complex semantic concepts which is then performed recursively.
  - love2read 50 minutes ago
    This answer makes sense if you know that LLMs have layers; if you don't, this answer is not super informative.
    If I were to describe this to a nontechnical person, I would say:
    LLMs are big stacks of layers of "understanders" that each teach the next guy something.
    Imagine you are making a large language model that has 4 layers. Each layer will talk to its immediate neighbor.
    The first layer will get the bare minimum; in the LLMs of today, that's groups of letters that commonly come up together, called "tokens". This layer will try to derive a bit of meaning to tell the next layer, such as grouping of letters into words.
    The next layer may be a little bit more semantic, for example interpreting that the word "hot" immediately followed by the word "dog" maps to a phrase "hot dog".
    The layer after that, becoming a bit more intelligent given that its predecessors have already had some chances at smaller interpretations, may now try to group words into bigger blobs, such as "i want a hot dog" as one combined phrase rather than a set of separated concepts.
    The final layer may do something even more intelligent afterward, like realize that this is a quote in a book.
    The point is that each layer tries to add a little meaning for the next layer.
    I want to stress this: the layers do not actually correspond to specific concepts the way I just expressed; the point is that each layer adds a bit more "semantic meaning" for the next layer.
ChrisArchitect 1 hour ago
Related:
Microgpt
https://news.ycombinator.com/item?id=47202708