I like Mistral. It hits the exact sweet spot between cost and my data staying in the EU, without a significant drop in quality, but man are their model naming conventions confusing af. They mention they have a model called Devstral 2, which is neither Codestral nor Devstral. I want to use it, but the API only lists devstral-2512, devstral-latest, devstral-medium-latest, devstral-medium-2507, devstral-small, devstral-small-2507.
I think devstral-latest should be it, no? So I write to support and get an answer 12 hours later that says oh no, Devstral 2 is definitely called Devstral 2, followed by a page of instructions on how to set it up in IntelliJ... generated with AI. The screens it refers to don't exist and never did.
My general impression is that they are not too interested in individual devs or in making the product suit their workflow. They want to be a B2B company and deliver a custom workflow per company.
Or it could just be a Google-like problem, where one part of a big company doesn't talk to the other.
You might be correct. For example, they have an IntelliJ plugin that allows integration without the AI Assistant, but it is only available to Enterprise customers.
Don't sleep on Mistral. Highly underrated as a general-service LLM. Cheaper, too. Their emphasis on bespoke modelling over generalized megaliths will pay off. There are all kinds of specialized datasets and restricted-access stores that can benefit from their approach, especially in the highly regulated EU.
Not everyone is obsessed with code generation. There is a whole world out there.
I also think that this is the best approach for businesses wanting to adopt AI to automate, streamline, etc their business.
The problem they have is that this is not a moat - their approach is easily reproducible.
If they can pull ahead in having the most number of pre-trained models (one for this ERP, one for that CRM, etc) and then being able to close sales to companies using these products and sell them on post-trained (give us your specific ERP customisations and we'll give you access to a model that is tailored to your business), then THAT is a moat.
But they need to do this without fanfare. Just close sales, and keep closing, basically. After all, even if other AI providers copy the process, the moat would already have been established for Mistral.
> My 2ct: Currently the moat may be that they are not US-American which is not reproducible by any of the US alternatives.
I hope you are right (I am in the process of finalising a product, and one of its top-5 selling points is "outside the jurisdiction of the US"), but in my experience, companies only pay lip service to ethics unless it hits their bottom line.
This moat doesn't seem to be much of a moat considering a non-US model doesn't even crack the top 5 by usage - except DeepSeek, which would be a strange choice for Europeans looking for data sovereignty.
> considering a non-US model doesn't even crack the top 5 by usage
How do you measure "usage" in an enterprise/commercial context where no usage data is available to you? I don't expect Mistral AI to make its money on OpenRouter.
> Their emphasis on bespoke modelling over generalized megaliths will pay off.
Isn't the entire deal with LLMs that they are trained as megaliths? How can bespoke modelling overcome the treasure trove of knowledge that megaliths can generically bring in, even in bespoke scenarios?
ChatGPT is already a small agent that receives your message and decides which agent needs to respond. Within those, agents can have sub agents (like when it does research).
When generating images most services will have a small agent that rewrites your request and hands it off to the generative image model.
So from the treasure trove point of view, optimized agents have their place. From companies building pipelines, they also have their place.
> ChatGPT is already a small agent that receives your message and decides which agent needs to respond.
Right, but this was done to value-optimize the product, i.e. try to always give you the shittiest (cheapest) model you can bear, because otherwise people would always choose the smartest (most expensive) model for any query.
Taking away the model choice from the user introduces a lot of ways to cut down costs, but one thing it does not do is make the product give users better/more reliable answers.
> Isn't the entire deal with LLMs that they are trained as megaliths? How can bespoke modelling overcome the treasure trove of knowledge that megaliths can generically bring in, even in bespoke scenarios?
Think of it as a base model (the megalith) which then has the weights adjusted towards a specific use-case (SAP, for example).
Agreed. I’ve used their platform to train smaller, specialized models. Something I could have done in Colab or some other tool, but their platform lets me just upload a training set, and as soon as it finishes I have a hosted model available at an endpoint. It obviously has some constraints compared to running the training yourself, but it also opens up the opportunity to way more people.
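For context, "just upload a training set" typically means a JSONL file of chat examples. A minimal sketch of preparing one (the `{"messages": [...]}` layout mirrors common fine-tuning formats, but the exact schema Mistral expects is an assumption here; check their fine-tuning docs):

```python
import json

def to_jsonl(examples):
    """Serialize (user, assistant) pairs into JSONL chat records.

    The {"messages": [...]} shape mirrors common fine-tuning
    formats; the exact field names a given provider expects
    should be taken from its documentation.
    """
    lines = []
    for user_msg, assistant_msg in examples:
        record = {
            "messages": [
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_msg},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

examples = [
    ("What is our refund window?", "30 days from delivery."),
    ("Who approves POs over 10k EUR?", "The finance director."),
]
jsonl = to_jsonl(examples)  # write this to a file, then upload it
```

From there the platform handles training and serves the finished model at a hosted endpoint, as described above.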
Indeed, but even for coding use cases, Vibe is more of a focused "refactor / write this function" aid than a "write me an app" tool, and it can work locally. For me that's a lot more valuable: an accelerator for my workflow where the developer stays in control and fully involved in the process.
It's definitely a topic of conversation on Reddit, etc. However, I agree that the push by EU companies (and countries) to reduce US dependence is hampered by the fact that US stuff is already embedded (Microsoft, but also Google, etc.), that many of these companies are transnational anyway (very few European companies operate solely inside the EU), and, most importantly, that just about every company will choose the option that does the job best for the right price (sovereignty is a distant second for most decision makers).
While few companies announce this publicly, I know from personal experience with corporate clients that many companies are preparing for Trump to use Big Tech as a bargaining chip.
And they should. Because the US is not behaving rationally at all.
>While few companies announce this publicly, I know from personal experience with corporate clients
Well I have even more personal experience that contradicts yours, and this isn't true at all. Everyone uses Claude / Gemini / OpenAI. Mistral isn't even on the table.
Proof: Most big EU companies use Claude or Gemini or OpenAI, not Mistral. That choice was made recently.
Things have changed in the loud echo chambers of the internet, maybe (but not really, since people were saying that EU data sovereignty was happening any time now since 2016).
My _feeling_ is that a lot of EU/European politicians have talked a lot more about the need to be independent from the US after Trump threatened Greenland. At least in the Nordic countries. Not only concerning data & privacy, but defence, communications, space, etc. All areas. The wheel has started to turn. You will not see it if you look around, but in 10 years' time, maybe more, Europe will have stopped depending on the US. And that will hit the US hard. We pay a lot of money for services to the US.
Mistral is doing some really great stuff lately. Sure, it's hard to compete with OpenAI and Anthropic and their models, but they are taking some interesting angles and designing their product in unique ways.
I like a lot what they are doing and I'll be watching them a lot more closely. I'd love to work for them btw!
> Pre-training allows organizations to build domain-aware models by learning from large internal datasets.
> Post-training methods allow teams to refine model behavior for specific tasks and environments.
How do you suppose this works? They say "pretraining" but I'm certain that the amount of clean data available in proper dataset format is not nearly enough to make a "foundation model". Do you suppose what they are calling "pretraining" is actually SFT and then "post-training" is ... more SFT?
There's no way they mean "start from scratch". Maybe they do something like generate a heckin bunch of synthetic data seeded from company data using one of their SOTA models -- which is basically equivalent to low-resolution distillation, I would imagine. Hmm.
I can imagine that, as usual, you start with a few examples and then instruct an LLM to synthesize more examples out of that, and train using that. Sounds horrible, but actually works fairly well in practice.
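A sketch of that bootstrap loop: show the model a handful of seeds and ask it for more in the same format. The actual model call is left out; only the (hypothetical) prompt construction is shown:

```python
def augmentation_prompt(seed_examples, n_new=5):
    """Build a prompt asking an LLM to synthesize more labeled
    examples in the style of a handful of seed examples."""
    shown = "\n".join(
        f"- input: {x!r} -> label: {y!r}" for x, y in seed_examples
    )
    return (
        f"Here are {len(seed_examples)} labeled examples:\n"
        f"{shown}\n"
        f"Generate {n_new} new examples in the same format, "
        "varying wording and edge cases but keeping labels consistent."
    )

seeds = [("disk full on /var", "ops"), ("invoice 4411 overdue", "finance")]
prompt = augmentation_prompt(seeds, n_new=10)
# send `prompt` to the LLM, parse its reply, filter bad rows, train on the rest
```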
I am rooting for Mistral and their different approach: not really competing on the largest, most advanced models, but instead doing custom engineering for customers and generally serving the needs of EU customers.
I found it to be the best model if you want to talk about philosophical topics. It has no problem going deep and technical, while other models tend to be afraid of overshooting the reader's comprehension.
Did they make significant improvements in OCR 3? The quality I was getting from Mistral OCR 2 was nowhere near as good as what I could get from just sending the same files to Claude Sonnet via an API call.
Mistral has been releasing some cool stuff. Definitely behind on frontier models, but they are working a different angle. Was just talking at work about how hard model training is for a small company, so we'd probably never do it. But with tools like this, and the new unsloth release, training feels more in reach.
I think it’s interesting what this approach suggests about who will profit from AI. I’m sceptical that having huge numbers of GPUs is a moat. After all, real humans – even geniuses – are trained on much much less data than the whole Internet. But proprietary and specialised data could very well be a moat. It’s hard to train a scientist/lawyer/analyst without reading a lot of science/law/finance. Companies’ proprietary data might encode a great deal of irreplaceable knowledge. Seems as if Mistral is taking this bet.
Interesting how Mistral is investing in training models for industry-specific use cases. With the commoditization of intelligence by base models, they're probably looking to create value from specialized verticals.
How many proprietary use cases truly need pre-training or even fine-tuning, as opposed to a RAG approach? And at what point does it make sense to pre-train/fine-tune? Curious.
You can fine-tune small, very fast and cheap-to-run specialized models, e.g. to react to logs, handle tool use and domain knowledge, possibly removing network LLM comms altogether, etc.
RAG basically gives the LLM a bunch of documents to search through for the answer.
What it doesn't do is make the model any better. Pre-training and fine-tuning improve the LLM's ability to reason about your task.
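A toy version of that retrieve-then-prompt loop, with bag-of-words cosine similarity standing in for a real embedding model (any production setup would use embeddings and a vector index instead):

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

docs = [
    "Refunds are issued within 30 days of delivery.",
    "The kernel scheduler uses red-black trees.",
    "Purchase orders over 10k EUR need director approval.",
]
context = retrieve("how long do refunds take", docs, k=1)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

The model itself is untouched; it just gets handed the right documents, which is exactly the distinction being drawn here.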
For coding use cases you may want a way to search for symbols themselves or do a plain text exact match for the name of a symbol to find the relevant documents to include. There is more to searching than building a basic similarity search.
Sorry, but who mentioned coding as a use case? My comment was general, not specific to coding, and I don't understand where you got the idea that I am arguing that a similarity-search engine would be a substitute for symbol search, or that symbol search is inferior to similarity search. Please don't put words in my mouth. My question was genuine, without making any presumptions.
Even with the coding use-case you would still likely want to build a similarity search engine because searching through plain symbols isn't enough to build a contextual understanding of higher-level concepts in the code.
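A sketch of that hybrid idea for code search: fuzzy similarity for higher-level concepts plus a hard boost for verbatim symbol matches (token overlap is a crude stand-in for embedding similarity here, and the boost constant is arbitrary):

```python
def overlap(a, b):
    """Crude stand-in for embedding similarity: shared lowercase tokens."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def hybrid_search(query, symbol, docs):
    """Rank docs by fuzzy overlap with the query, with a large
    boost for documents containing the symbol name verbatim."""
    def score(d):
        return overlap(query, d) + (100 if symbol in d else 0)
    return sorted(docs, key=score, reverse=True)

docs = [
    "def parse_config(path): loads YAML settings",
    "def parse_args(argv): CLI flag handling",
    "notes on configuration file layout and settings",
]
ranked = hybrid_search("where are the settings parsed", "parse_config", docs)
```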
And yet your blog says you think NFTs are alive. Curious.
But seriously, RAG/retrieval is thriving. It'll be part of the mix alongside long context, reranking, and tool-based context assembly for the foreseeable future.
> Of course you would have to set a temperature of 0 to prevent abuse from the operator, and also assume that an operator has access to the pre-prompt
Doesn't the fact that LLMs are still non-deterministic at a temperature of 0 render all of this moot? And why was I compelled to read a random blog post on the unsolved issue of validating natural language? It's SQL injection, except without a predetermined syntax to validate against, and thus an NP problem we've yet to solve.
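For reference, temperature just rescales the logits before softmax; at T = 0 sampling degenerates to argmax. Even then, real serving stacks can vary across runs (tie-breaking, floating-point reduction order, batching), which is the non-determinism being pointed at. A minimal sketch of the temperature part:

```python
import math

def softmax_t(logits, temperature):
    """Softmax over logits / T; T == 0 is treated as greedy argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
sampled = softmax_t(logits, 1.0)   # probability spread over all tokens
greedy = softmax_t(logits, 0.0)    # all mass on the top logit
```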
I don't think RAG is dead, and I don't think NFTs have any use; I think they are completely dead.
But the OP's blog is more about ZK than about NFTs, and crypto is the only place funding work on ZK. It's kind of a devil's bargain, but I've taken crypto money to work on privacy preserving tech before and would again.
I have no interest in anything crypto, but they are making a proposal about NFTs tied to AI (LLMs and verifiable machine learning) so they can make ownership decisions.
So it'd be alive in the making decisions sense, not in a "the technology is thriving" sense.
This is definitely the smart path for making $$ in AI. I noticed MongoDB is also going into this market with https://www.voyageai.com/ targeting business RAG applications and offering consulting for company-specific models.
Looks interesting. But how do you explore, test, or use it? The product page (https://mistral.ai/products/forge) also does not contain anything useful, just "Contact us".
Huh. I initially thought this was just another fine-tuning endpoint, but apparently they are partnering with customers on the pretraining side as well. But RL too? Jeez, RL environments are really hard to get right. Best wishes, I guess.
I cannot keep up with their products, model names and releases.
What is what for? Their marketing texts do not make sense for me.
Is there a nice overview somewhere?
I am a simple stupid Le Chat user with a small mind and the Tredict MCP Server connected to it (to Le Chat, not my mind), which works ok-ish. :-)
My bet is that the solution to continuous learning is external storage. There is a lot of talk about context engineering, but I have not seen anyone treating context as the main bottleneck and building a system around that.
This suggests that even "context engineering" is somewhat the wrong term, because context does not enter the LLM in some mysterious way: it goes through the prompt, and the whole model of passing chat history back and forth is not the most efficient use of a limited prompt.
"External Storage" whatever that is can not be the same as continous learning as it does not have the strong connections/capture the interdepencies of knowledge.
That said, I think we will see more efforts on the business side too: models that can help you build a knowledge base in some kind of standardized way that the model is trained to read, or that synthesize some sort of instructions on how to navigate your knowledge base.
Currently, e.g., Copilot tries to navigate a hot mess of an MS knowledge graph that is very different for each company. And due to its amnesia it has to repeat the discovery in every session. No wonder that does not work. We have to either standardize, or store somewhere (in the model, or in instructions), how to find information efficiently.
The key to making Copilot useful is taking the limited-context problem seriously enough. There are many dimensions to it: https://zby.github.io/commonplace/notes/context-efficiency-i... and it should be the starting point for designing systems that make extensive use of LLMs.
A knowledge base - something where the LLM knows how to find the knowledge it needs for a given task. I am working on this idea in https://zby.github.io/commonplace/
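A toy of the "model navigates an external store" idea: the knowledge base exposes a couple of tool-like calls, so only the one relevant note ever enters the prompt instead of the whole chat history (names and layout here are made up for illustration):

```python
class KnowledgeStore:
    """External memory an agent navigates via tool calls, so only
    the relevant note is loaded into the limited prompt window."""

    def __init__(self):
        self.notes = {}

    def put(self, topic, text):
        self.notes[topic] = text

    def list_topics(self):
        """Cheap call an agent can make first, to decide what to fetch."""
        return sorted(self.notes)

    def get(self, topic):
        return self.notes.get(topic, "(no note on this topic)")

kb = KnowledgeStore()
kb.put("deploy", "Releases go out Tuesdays; roll back with deploy.sh --undo.")
kb.put("oncall", "Escalate pages older than 15 minutes to the SRE lead.")
# An agent calls list_topics(), picks "deploy", then get("deploy"):
# one note in the prompt instead of the entire knowledge base.
```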
I thought that for pretraining to work and for reasoning to emerge you need internet-scale data. How can Forge achieve it with just internal company data (unless said company is AT&T or something)?
Note that any supervised fine-tuning following the Pretraining stage is just swapping the dataset and maybe tweaking some of the optimiser settings. Presumably they're talking about this kind of pre-RL fine-tuning instead of post-RL fine-tuning, and not about swapping out the Pretraining stage entirely.
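The "just swapping the dataset" point can be made concrete with a toy model: one SGD loop serves both phases, and only the data (and perhaps the learning rate) changes. A pure-Python sketch with a one-parameter linear model, nothing like a real training stack:

```python
def sgd_epoch(w, data, lr=0.1):
    """One pass of SGD on squared error for the model y = w * x.
    The identical update rule is used for both phases below; only
    the dataset fed in differs."""
    for x, y in data:
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w

pretrain_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # broad data: y = 2x
finetune_data = [(1.0, 3.0), (2.0, 6.0)]              # niche data: y = 3x

w = 0.0
for _ in range(50):
    w = sgd_epoch(w, pretrain_data)
w_pretrained = w          # converges to ~2.0

for _ in range(50):
    w = sgd_epoch(w, finetune_data, lr=0.05)
# same loop, swapped dataset: w drifts from ~2.0 toward ~3.0
```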
The future of AI is specialization, not just achieving benevolent knowledge as fast as we can at the expense of everything and everyone along the way. I appreciate and applaud this approach. I am looking into a similar product myself. Good stuff.
Ironically that was also the past of AI. In 2016 it was all about specialized models (not just training data, everything including architecture and model class/type) for specific tasks and that's the way things had been for a long time.
Are you suggesting that it's an aberration that from ~2019 to ~2026 the AI field has been working on general intelligence (I assume this is what you mean by "achieving benevolent knowledge")?
Personally I think it's remarkable how much a simple transformer model can do when scaled up in size. LLMs are an incredible feat of generalization. I don't see why the trajectory should change back towards specialization now.
https://nltimes.nl/2026/02/10/rabobank-ing-abn-amro-seek-eur...
https://www.theregister.com/2025/11/13/gartner_cio_cloud_sov...
https://www.independent.co.uk/news/world/europe/europe-zoom-...
https://www.theglobeandmail.com/business/commentary/article-...
https://sherwood.news/tech/europe-wants-to-break-up-with-us-...
Pre-training: refining the weights in an existing model using more training data.
Post-training: Adding some training data to the prompt (RAG, basically).
I have been finding Voxtral useful though.
https://generativehistory.substack.com/p/gemini-3-solves-han...
Which one's the best?
next, it sounds like it's going to be .eu
but what about ai.eu
It's certainly different data, but one could argue that real humans have been trained on 3.5 billion years of evolution data.
Disappointing.
Would love to take it for a spin, if that is even possible.
https://docs.mistral.ai/api/endpoint/deprecated/fine-tuning
It's feasible for small models, but I thought small models were not reliable for factual information?
Foundational:
- Pretraining
- Mid/post-training (SFT)
- RLHF or alignment post-training (RL)
And sometimes...
- Some more customer-specific fine-tuning.
... for humans.
Is it possible to retrain daily or hourly as info changes?