LLMs can see and hear without any training

(github.com)

210 points | by T-A 329 days ago

21 comments

vessenes 329 days ago
I’ve read the paper and the skeptical comments here, to wit: it’s just an actor/critic pipeline by another name.
I’ll bite and say this is actually interesting — and the paper title is misleading.
What they’ve done here is hooked up a text-only LLM to multimodal critics, given it (mostly) an image diffusion generation task, and asked it to improve its prompting of the multimodal generation by getting a set of scores back.
This definitely works, based on their outputs. Which is to say, LLMs can, zero shot, with outside tool feedback, iteratively improve their prompting using only that tooling feedback.
Why is this interesting? Well, this did not work in the GPT-3 era; it seems to do so now. I see this as an interesting line to be added in the ‘model capabilities’ box as our models get larger and more sophisticated — the LLMs can perform some sort of internally guided search against a black box generator and use a black box scorer to improve at inference time.
That’s pretty cool. It’s also generalizable, and I think it is worth keeping in mind on the stack of possible approaches for, say, agentic coding, that you can use a critic to not just ‘improve’ generated output, but most likely do some guided search through output space.
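To make that loop concrete, here is a toy Python sketch. It is not the paper's code: `propose`, `generate`, and `score` are hypothetical stand-ins for the text-only LLM, the black-box generator, and the black-box scorer, and the toy versions below only shuffle words so the snippet runs on its own.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt, score) pairs seen so far


def guided_search(
    propose: Callable[[History], str],
    generate: Callable[[str], str],
    score: Callable[[str], float],
    steps: int = 20,
) -> Tuple[str, float]:
    """Refine a prompt using only scalar feedback; no weights are updated."""
    history: History = []
    best_prompt, best_score = "", float("-inf")
    for _ in range(steps):
        prompt = propose(history)   # proposer sees past prompts and scores only
        output = generate(prompt)   # black-box generator (assumed, e.g. diffusion)
        s = score(output)           # black-box scorer (assumed, e.g. CLIP-style)
        history.append((prompt, s))
        if s > best_score:
            best_prompt, best_score = prompt, s
    return best_prompt, best_score


# Toy stand-ins, invented purely for illustration.
TARGET = {"red", "umbrella", "rain", "street"}
VOCAB = sorted(TARGET | {"cat", "spoon", "tower", "blue", "night"})


def toy_propose(history: History, rng=random.Random(0)) -> str:
    # Take the best prompt so far and swap one word at random.
    if not history:
        return " ".join(rng.sample(VOCAB, 4))
    words = max(history, key=lambda h: h[1])[0].split()
    words[rng.randrange(len(words))] = rng.choice(VOCAB)
    return " ".join(words)


def toy_generate(prompt: str) -> str:
    return prompt  # identity "generator"


def toy_score(output: str) -> float:
    # Fraction of target words present in the output.
    return len(TARGET & set(output.split())) / len(TARGET)


best_prompt, best_score = guided_search(toy_propose, toy_generate, toy_score, steps=50)
print(best_prompt, best_score)
```

The only thing carried between iterations is the `history` list, which is the sense in which this is search at inference time rather than training.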
- jorvi 329 days ago
  > zero shot
  I really wish we would find a different term for this.
  Doing something always takes at least one attempt, i.e. "one shotting". "Zero shotting" is an oxymoron, which makes it a term that only creates more confusion rather than succinctly conveying something.
  - Izkata 329 days ago
    "One shot" is simply about the action itself, but it says nothing about how much preparation was done beforehand. "Zero shot" additionally implies without training or preparation.
    TCGs have a related "zero turn win" concept, where the opponent goes first and you win without getting a turn due to the set of cards you randomly drew and being able to activate them on the opponent's turn.
  - vessenes 329 days ago
    I think of a shot as an example, not a try: “One shot” is “One example”. Zero shot is “Zero examples”. I don’t love it, but I don’t hate it. Got a better word for it?
    - jorvi 329 days ago
      We already have a term for it in people, "intuited". When you are asked to intuit something, it usually implies an unfamiliarity with the subject matter.
      There is such entrenchment with terms, though, that it'll never get shifted to that. On top of that, it doesn't sound as interesting and dynamic as "zero shotting".
      - vessenes 327 days ago
        to be fair, it's also pretty long winded to say "pass @ 32 attempts to intuit" or "intuited after 6 examples"
    - saurik 329 days ago
      I mean... how about "example"? I feel as if, were I to give what you just said to someone a hundred years ago, with no context of AI training or even of the discussion, the very form of your response would lead to the answer "example" ;P.
      The issue with "shot" is that it is a term and part of an idiom that has been used for a very long time and, critically, is relevant to the same problem space in a much more intuitive way: to count the number of shots shot, not shots seen.
  - quantadev 329 days ago
    My favorite AI term to ridicule is the recent "Test Time Compute" nonsense, which has nothing whatsoever to do with testing. It literally just means "inference time".
    And if I hear someone say "banger", "cooking", "insane", or "crazy" one more time, I'm going to sledgehammer my computer. Can't someone under 40 please pick up a book and read? Yesterday Sam Altman tried to coin "Skillsmaxxing" in a tweet. I threw my coffee cup at my laptop.
    - ks2048 329 days ago
      Speaking of old-timers and "inference time" - there was a time when "inference" meant inferring parameters from data (i.e. training). And now it means "test-time". (or maybe the difference is if it's statistics community vs ML community).
      e.g. Bishop's textbook says:
      5.2.4 Inference and decision
      We have broken the classification problem down into two separate stages, the inference stage in which we use training data to learn a model for p(C_k|x) and the subsequent decision stage in which we use these posterior probabilities to make optimal class assignments.
      - quantadev 329 days ago
        I almost mentioned "inference" too, as an unfortunate word that stuck in a bad way, but it's tolerable since we can now just [falsely] claim that the AI is "inferring" what a prompt "means" in order to answer it.
        And speaking of word definitions: "Old Timer" is anyone with a decade more experience than you.
    - numeri 329 days ago
      It makes quite a lot of sense juxtaposed with "train time compute". The point being made is that a set budget can be split between paying for more training or more inference _at test time_ or rather _at the time of testing_ the model. The word "time" in "inference time" plays a slightly different role grammatically (noun, not part of an adverbial phrase), but comes out to mean the same thing.
      - quantadev 329 days ago
        Exactly right. The term "Test Time" had relevance in a certain context, and in a certain paper, but once people read the paper and saw the term they latched onto it, not realizing how totally non-descriptive and nonsensical it was when used outside that specific narrow context of genuinely "testing".
    - byteknight 329 days ago
      Get off my lawn is alive and well it seems
      - quantadev 329 days ago
        Speaking of worn out tropes, you just used the most common one of all. I'm sure it was a tough call for you to decide between that and a "boomer" quip.
  - BoredPositron 329 days ago
    We say Sure Shot.
  - airstrike 329 days ago
    It's a shot from position zero
    - nmstoker 329 days ago
      No it isn't. The number of shots (examples) is zero.
      - Zambyte 329 days ago
        You're both right.
  - hawk_ 329 days ago
    Array indexing can start at 0 or 1.
    - layer8 329 days ago
      For an array of zero shots, the indexing doesn’t matter.
- skydhash 329 days ago
  > I think it is worth keeping in mind on the stack of possible approaches for, say, agentic coding, that you can use a critic to not just ‘improve’ generated output, but most likely do some guided search through output space.
  The one issue I keep finding with those approaches is that there are already good tools for the problem, but we keep searching for wasteful approaches because of “natural language”, for something humans are not going to interact with without a good deal of training anyway.
  I do understand the hope of getting LLMs to do the bulk of the work, and then after audit, we fix the errors. But both audit and fixing will require the same mental energy as writing the code in the first place. And possibly more time.
  Specialist tools are always more expansive and offer more controls than general public tools. Most approaches to agentic coding offer general interfaces instead of specialized interfaces, but redirect you to a bespoke and badly designed specialized interface whenever you want to do anything useful.
  - vessenes 329 days ago
    I hear that. Counterpoint: if all you have is a Phillips-head screwdriver, all you have is a Phillips-head screwdriver. On the other hand, if all you have is a six-axis CNC mill, well, then you have a lot.
    I think of this less as audit misses, and more as developing a permanently useful tool. For open model weights, humanity will not (unless we’re talking real zombie apocalypse scenarios) lose these weights. They are an incredible global asset, so making them more generally useful and figuring out how to use them is super helpful.
    - skydhash 329 days ago
      Maybe they are useful. But I think there’s more usefulness in specialized databases and optimized approaches than in betting everything on big LLMs. Kind of like deriving linting rules and combining them with a rule engine to catch errors: efficient and useful, instead of continuously running a big LLM.
    - walleeee 329 days ago
      While it is hard to argue with the wisdom of crystallizing intellectual capital into our tools, I do wonder if these models might be as likely to diminish as to develop the person using them, in which case we trade an implement's iterative improvement for ours, in a way
      - vessenes 329 days ago
        Monks in the Middle Ages: “The Printing Press will destroy people’s ability to memorize.”
        This was accurate. But mostly humans gained from books. I think we will develop the social technology to use these tools over time; giving some things up and gaining others.
        If we don’t, the Amish can just take over and be like “Stupid English, using the devil’s weights.” :)
- nightski 329 days ago
  Are they using the same diffusion models as in the GPT-3 era? Meaning, is it the LLM that has improved or is it the diffusion model? I know it's probably a foolish take, but I am really skeptical of the "larger models will solve all our problems" line of thinking.
  - vessenes 329 days ago
    They don’t compare in the paper. I will say I experimented extensively with GPT-3 era LLMs on improving output by trying to guide early diffusion models with critical responses. It was a) not successful, and b) pretty clear to me that GPT-3 didn’t “get” what it was supposed to be doing, or didn’t have enough context to keep all this in mind, or couldn’t process it properly, or some such thing.
    This paper has ablations, although I didn’t read that section, so you could see where they say the effectiveness comes from. I bet, though, that it’s emergent from a bunch of different places.
    FWIW, I don’t think LLMs will solve all our problems, so I too am skeptical of that claim. I’m not skeptical of the slightly weaker “larger models have emergent capabilities and we are probably not done finding them as we scale up”.
    - tomrod 329 days ago
      > FWIW, I don’t think LLMs will solve all our problems, so I too am skeptical of that claim. I’m not skeptical of the slightly weaker “larger models have emergent capabilities and we are probably not done finding them as we scale up”.
      100% agree. I'd classify the time now as identifying the limits of what they can functionally do though, and it's a lot!
EncomLab 329 days ago
My photoresistor nightlight can "see" that it is dark and it "knows" to turn on the light - not only does it not have training, it does not have any code!
And if you think that is amazing, my bi-metallic strip thermostat "feels" the temperature and then modifies the environment because it "knows" if it's hot to turn on the A/C, and if it's cold to turn on the heat - no training or code!
All of this AI stuff is just unbelievably incredible - what a brave new world (of word games)!
- JoBrad 329 days ago
  The nightlight and thermostat's response to stimulus is nowhere near analyzing a picture of a clock tower and responding with "Image of a city's tallest, historic landmark with a sepia filter." To me, recognizing the umbrella in the spoon is one of the most impressive items they list.
  - EncomLab 329 days ago
    It's not the technology that is bad - it's the extreme anthropomorphizing language that's used to describe it.
    - horacemorace 329 days ago
      It might be bad if its behavior wasn’t so anthropomorphic.
  - bamboozled 329 days ago
    These devices are still "recognizing" something, which is quite interesting in itself.
nico 329 days ago
To people curious or skeptical if this could be called “seeing” or “hearing”, I recommend listening to the Batman podcast episode on NPR (https://www.npr.org/2015/01/23/379134306/batman-pt-1)
Through the story and experience of a blind man, they end up getting into the question of what it means to see.
The podcast is pretty straightforward, but it does end up showing that defining “seeing” is a philosophical question rather than one with a simple, obvious answer.
scribu 329 days ago
This seems to be a system to generate better prompts to be fed into a base multimodal model.
Interesting, but title is definitely clickbait.
- throwaway4aday 329 days ago
  They only did that for image generation. The more interesting part is that an LLM can approach or find the correct caption for an image, video or audio during test time with no training using only the score as a guide. It's essentially working blind almost like the game Marco Polo where the scorer is saying "warmer" or "colder" while the LLM is finding its way towards the goal. This is an example of emergent capabilities since there are no examples of this in the training data.
- matt123456789 329 days ago
  Actually, it's the name of the paper. And while the team also developed and released a system to elicit the behavior by doing what you described, it's entirely possible that the researchers thought the title to be the most important finding in their work.
- wangii 329 days ago
  Exactly! There is definitely something wrong with FAIR.
underdeserver 329 days ago
Paper: https://arxiv.org/pdf/2501.18096
- suddenlybananas 329 days ago
  I don't understand how the title relates to the content of this article at all. They're even using CLIP which definitely has been trained.
  - dragonwriter 329 days ago
    You don't have to train the LLM specifically for the tasks, and even the auxiliary tools aren't trained on the tasks they are used as scorers for (because they aren't doing the task, just evaluating how well the LLM is), so there is no task-specific training.
    - suddenlybananas 329 days ago
      Task-specific training sure, but the title implies that vision itself is not trained.
viraptor 329 days ago
That looks like a classic Actor/Critic setup, yet it's not mentioned even once in the paper. Am I missing some large difference here?
- dawnofdusk 329 days ago
  In actor/critic the actor and critic are normally learned, i.e., their weights are adjusted during the process. The paper is correct that their method is zero-shot, but it doesn't mention that their method is essentially equivalent to a few rounds of training but then discarding the training update.
  Anyone who works with deep architectures and momentum-based optimizers knows that the first few updates alone provide large improvements in loss. In this paper the breakthrough is that computing these first few updates at test time enables one to describe the algorithm as "without training" and therefore attract hype.
  - fc417fc802 329 days ago
    > discarding the training update
    But they aren't updating the model weights. They're iteratively updating the prompt. It's automating the process that humans use with generative models.
    Agreed that it's conceptually equivalent though.
- oneseven 329 days ago
  Yes, apparently they've developed new names: Generator and Scorer. This feels a bit like "Tai's Model" https://news.ycombinator.com/item?id=17863514
  - lukeinator42 329 days ago
    Haha "Tai's Model" is absolutely hilarious, that gave me a good chuckle. I checked and it currently is cited 568 times.
JoBrad 329 days ago
Exactly how little training is "without any"? I'm assuming that companies haven't been spending billions trying to train LLMs to better understand things when they can do it without any training.
qgin 328 days ago
Emergent capabilities have been one of the wildest developments in software. For most traditional programmers you learn quickly and with great pain that the computer only does what you explicitly program it to do, no more, no less, and unintended behavior is a bug (and if you’re lucky, an accidental feature).
But the idea that entire abilities just emerge from scale… I still have a hard time accepting it.
robocop_legacy 329 days ago
I think there is potentially a powerful method here. Specifically, the optimal context for a given task can be saved and a meta-learner can be trained to map the task to the context. This would allow fine-tuning a model for some specific task without retraining the LLM. For example, generating an SEM image of some material with a specified porosity and grain size.
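A rough Python sketch of the "save the context, then map task to context" idea, under heavy assumptions: the meta-learner here is only a nearest-neighbour lookup over a hypothetical bag-of-words task embedding, and `ContextStore`, `embed`, and the example strings are invented for illustration. A trained model would replace `lookup`.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Hypothetical task embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ContextStore:
    """Maps task descriptions to the best context found by an earlier search."""

    def __init__(self) -> None:
        self._items: List[Tuple[Counter, str]] = []

    def save(self, task: str, best_context: str) -> None:
        self._items.append((embed(task), best_context))

    def lookup(self, task: str) -> str:
        # Nearest neighbour stands in for a trained meta-learner (assumption).
        if not self._items:
            raise LookupError("no contexts saved yet")
        query = embed(task)
        return max(self._items, key=lambda item: cosine(query, item[0]))[1]


store = ContextStore()
store.save("SEM image porosity grain size", "saved context A")
store.save("caption an audio clip", "saved context B")
print(store.lookup("generate SEM image with 10% porosity"))
```

The LLM's weights never change in this sketch. Only the stored context is reused, which is what would make it "fine-tuning without retraining".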
v01rt 329 days ago
"without training" describes transfer learning with an actor / critic approach
TheCoreh 329 days ago
Is the LLM essentially playing "Wordle" with an external system that rates the quality of its output, gradually climbing the score ladder until it produces good results?
sega_sai 329 days ago
The paper certainly contradicts my expectation from the title. I.e. it does not present an LLM that can generate images without any access to images before.
jagged-chisel 329 days ago
Computers can receive input without any programming. Not sure what’s interesting here.
- amelius 329 days ago
  There's more to seeing and hearing than just receiving inputs.
  Anyway, this looks like a case of human trying to understand article without reading it.
- dragonwriter 329 days ago
  This isn't receiving input, it's generating output competitive with models with task-specific training.
  I’m guessing the iterative approach burns a lot of tokens, though that may not matter too much with 8B Llama as the LLM.
- fortran77 329 days ago
  Really? How?
  - skydhash 329 days ago
    The base layer is just electronic circuitry. As long as there is electricity it will do stuff (like a radio producing noise). A GPU or CPU is mostly software embedded in hardware.
  - barotalomey 329 days ago
    Primarily, processing input.
    - HaZeust 329 days ago
      Logic gates aren't coding? Could have fooled me!
alex1138 329 days ago
I just remember Zuck's comments about AI and how the idea of it dooming our species is a bit silly, etc
This is the wrong approach to take. At minimum you have to say things like "well yes we're always on the lookout for this kind of thing". With him? Not a care in the world
gitroom 329 days ago
pretty cool seeing models get a bit smarter each time - always makes me wonder how much of this is luck vs real skill tbh
3rdworldeng 329 days ago
Find me Jose Monkey will do that too :-)
v-rt 329 days ago
"without training" describes transfer learning
- v01rt 329 days ago
  hey what the hell? it said the username was taken?? bug???
  - HaZeust 329 days ago
    wut