Such low dimensionality of the LoRA vector must surely result in a close-to-linear modification to the KV calculation. This seems to me to imply that what we call "reasoning" is latent within the model. Pretty clear I didn't read the paper; I'm sure the authors address this.
Yes - some degree of reasoning appears to be latent in the structure of language itself. But models trained explicitly on reasoning-focused data still perform better than models trained only on general corpora.*
*At least up to 300B parameters, based on the models we’ve tested.
After a quick skim, my understanding is that this is more like a very compressed diff vector applied to a multi-billion-parameter model, so that the model can be 'retrained' to reason (score) better on a specific topic, e.g. math, which was used in the paper.
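To make the "compressed diff" idea concrete, here is a minimal sketch of a generic low-rank (LoRA-style) update in NumPy. It is not the paper's method: the sizes, the rank, and the helper names `lora_delta` and `trainable_fraction` are all made up for illustration. It only shows that the diff is tiny compared to the full weight matrix, and that for a single layer the adapted output is the base output plus a low-rank linear correction.

```python
import numpy as np

def lora_delta(B, A, alpha=1.0):
    """Low-rank weight update delta_W = alpha * B @ A (generic LoRA form)."""
    return alpha * (B @ A)

def trainable_fraction(d_out, d_in, rank):
    """Parameters in the two low-rank factors relative to the full matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 1          # hypothetical sizes, not from the paper

W = rng.standard_normal((d_out, d_in))  # frozen base weight
B = rng.standard_normal((d_out, rank))  # trainable factor
A = rng.standard_normal((rank, d_in))   # trainable factor
x = rng.standard_normal(d_in)           # one input vector

# Applying the diff to the weights ...
y_full = (W + lora_delta(B, A)) @ x
# ... equals the base output plus a low-rank correction computed separately.
y_split = W @ x + B @ (A @ x)

print(np.allclose(y_full, y_split))
print(trainable_fraction(d_out, d_in, rank))
```

With realistic layer widths in the thousands, the same ratio gets much smaller, which is the sense in which the diff is "very compressed". Note that this linearity holds per layer; stacked through nonlinearities, the overall effect on the model need not be linear.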
I agree, I don't think gradient descent is going to work in the long run for the kind of luxurious & automated communist utopia the technocrats are promising everyone.
[0]: cartesien.io or Salesforce's WebscaleRL
For some use cases it ranges from parity performance at 1/20th the cost to better performance at 1/10th the cost. The trade-off is, of course, narrow applicability.
Even some advanced math usually evolves by applying patterns found elsewhere to new topics.