I'm not sure the direction should be to finetune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (i.e. who was the president between X and Y). They are also a little too lightweight to be used for translation.
If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
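To make "well documented" a bit more concrete, here's a rough sketch of the kind of per-artifact metadata such a dataset could carry. The field names are purely illustrative, not taken from any existing schema:

    # Hypothetical metadata record for one open cultural artifact.
    # None of these field names come from a real standard; they just show
    # the sort of provenance a future model trainer would want.
    record = {
        "text": "...",                     # the artifact itself
        "source": "Biblioteca Nacional",   # holding institution (example)
        "license": "CC-BY-4.0",            # must be open enough to train on
        "language": "pt-PT",               # European Portuguese, not just "pt"
        "period": "1930s",
        "genre": "newspaper",
    }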
It is definitely an interesting problem: Portugal is a small enough country that the total corpus of available texts in (non-Brazilian) Portuguese may well be too small to train on.
I don't think so. Portugal the country might be small, with a small population, but there are ~250 million Lusophones (native Portuguese speakers), making it the fifth-most spoken native language in the world; I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and their speakers understand each other, so it's not like text from one cannot be used to train a model for the other, or vice versa.
All in all, I don't think that's a major issue here.
The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).
I don't necessarily feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you get pretty much exactly that, except with a ton more available training data.
And if the first 80% don't bias the language after post-training (which I think is what you're claiming), why not go for English or a mixture of languages? That's essentially what they did by starting from EuroLLM.
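For the sake of argument, a minimal sketch of what that 80/20 mix could look like with Hugging Face datasets; the corpus names below are hypothetical placeholders, not real datasets:

    from datasets import load_dataset, interleave_datasets

    # Hypothetical corpora: a large Brazilian Portuguese corpus and a smaller
    # European Portuguese one (the dataset names are made up).
    pt_br = load_dataset("example/pt-br-corpus", split="train", streaming=True)
    pt_pt = load_dataset("example/pt-pt-corpus", split="train", streaming=True)

    # ~80% Brazilian / ~20% European Portuguese for base training; the
    # European-only slice would then be reused again for post-training.
    mixed = interleave_datasets([pt_br, pt_pt], probabilities=[0.8, 0.2], seed=42)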
Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” as “Guyanese English vs British English”. Fundamental points of grammar differ, the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind the vocabulary. Continental Portuguese people tend to find it easier to understand Brazilians than vice versa, largely due to mostly one-way cultural exports, but trying to roll both into a single model would create a creole at best.
This is how Europe thinks they can catch up on tech, by having the government fund vanity projects which will be made obsolete by more general techniques in 6 months.
What LLM isn't forced into a specific language? That'd be a weird language model no one could understand; you need to choose at least one language, ideally the one the creators speak.
Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.
To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.
E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish and Finnish.
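For a quick spot check along those lines, a sketch of querying gemma3:4b through a local Ollama instance (this assumes Ollama is running on its default port; the prompt is just an example):

    import requests

    # Ask for a reply in European Portuguese and eyeball the output for
    # Brazilian constructions (e.g. "você" + gerund instead of "estar a" + infinitive).
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:4b",
            "prompt": "Responde em português europeu: o que é um pastel de nata?",
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])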
It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.
Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.
And who knows what will happen to grammar?