I'm not sure the direction should be to finetune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (i.e. who was the president between X and Y). They are also a little too lightweight to be used for translation.
If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
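To make "well documented" a bit more concrete, here's a rough sketch of the kind of per-artifact metadata such a dataset could carry. The field names are purely illustrative, not taken from any existing schema:

    # Hypothetical metadata record for one open cultural artifact.
    # None of these field names come from a real standard; they just show
    # the sort of provenance a future model trainer would want.
    record = {
        "text": "...",                     # the artifact itself
        "source": "Biblioteca Nacional",   # holding institution (example)
        "license": "CC-BY-4.0",            # must be open enough to train on
        "language": "pt-PT",               # European Portuguese, not just "pt"
        "period": "1930s",
        "genre": "newspaper",
    }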
It is definitely an interesting problem: Portugal is a small enough country that the total corpus of available texts in (non-Brazilian) Portuguese may well be too small to train on.
I don't think so. Portugal the country might be small, with a small population, but there are ~250 million Lusophones (native Portuguese speakers), making it the fifth-most spoken native language in the world; I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and their speakers understand each other, so it's not like text from one cannot be used to train a model for the other, or vice versa.
All in all, I don't think that's a major issue here.
The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).
I don't necessarily feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you get pretty much exactly that, except with a ton more available training data.
And if the first 80% don't bias the language after post-training (which I think is what you're claiming), why not go for English or a mixture of languages? That's essentially what they did by starting from EuroLLM.
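For the sake of argument, a minimal sketch of what that 80/20 mix could look like with Hugging Face datasets; the corpus names below are hypothetical placeholders, not real datasets:

    from datasets import load_dataset, interleave_datasets

    # Hypothetical corpora: a large Brazilian Portuguese corpus and a smaller
    # European Portuguese one (the dataset names are made up).
    pt_br = load_dataset("example/pt-br-corpus", split="train", streaming=True)
    pt_pt = load_dataset("example/pt-pt-corpus", split="train", streaming=True)

    # ~80% Brazilian / ~20% European Portuguese for base training; the
    # European-only slice would then be reused again for post-training.
    mixed = interleave_datasets([pt_br, pt_pt], probabilities=[0.8, 0.2], seed=42)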
Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” as “Guyanese English vs British English”. Fundamental points of grammar differ, the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind the vocabulary. Continental Portuguese people tend to find it easier to understand Brazilians than vice versa, largely due to mostly one-way cultural exports, but trying to roll both into a single model would create a creole at best.
This is how Europe thinks they can catch up on tech, by having the government fund vanity projects which will be made obsolete by more general techniques in 6 months.
What LLM isn't forced into a specific language? That'd be a weird language model no one could understand; you need to choose at least one language, ideally the one the creators speak.
Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.
To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.
E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish and Finnish.
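For a quick spot check along those lines, a sketch of querying gemma3:4b through a local Ollama instance (this assumes Ollama is running on its default port; the prompt is just an example):

    import requests

    # Ask for a reply in European Portuguese and eyeball the output for
    # Brazilian constructions (e.g. "você" + gerund instead of "estar a" + infinitive).
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:4b",
            "prompt": "Responde em português europeu: o que é um pastel de nata?",
            "stream": False,
        },
        timeout=120,
    )
    print(resp.json()["response"])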
It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.
Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.
And who knows what will happen to grammar?