
Arabic Dialect Benchmarks Just Got Messy: Fanar, Peacock, and Qwen3 Are All Catching Up


Updated Apr 18, 2026 · 6 min read
The April 2026 evaluation round for Arabic-capable LLMs tells a more complicated story than last quarter's neat scoreboard. Qatar's **Fanar** and the multimodal **Peacock** have closed a real gap against **ALLaM** and **Falcon**, and **Alibaba's Qwen3-235B-A22B** has posted surprise gains on Gulf and Maghrebi dialects that previously punished Chinese models. For Arabic AI buyers, this is the first genuine multi-vendor moment, and it forces a harder conversation about sovereignty.

## What the Benchmarks Are Saying

The April cross-vendor evaluation, compiled by [MBZUAI](https://mbzuai.ac.ae/), the [Qatar Computing Research Institute](https://www.hbku.edu.qa/en/qcri), and community-led leaderboards including [AlGhafa](https://huggingface.co/datasets/OALL/AlGhafa) and [MERA-AR](https://huggingface.co/OALL), scores models across eight dimensions. Four of those dimensions are pure dialect handling: Gulf (Khaleeji), Levantine, Maghrebi (including Darija), and Egyptian.

The headline finding is a compression of the leaderboard. ALLaM remains strong on formal Modern Standard Arabic and Gulf-aligned enterprise tasks, where it was always designed to excel. [Falcon](https://falconllm.tii.ae/) holds its ground on open-weight availability and long-context reasoning. [Fanar](https://fanar.qa/) has posted its largest update yet on Levantine and Egyptian dialects. [Peacock](https://huggingface.co/NanoTech-AI/Peacock) continues to lead on multimodal tasks, especially image-grounded Arabic reasoning. And Qwen3 has pulled forward in places few expected.

> "We are no longer in the one-model-per-country era. Arabic AI is moving to a multi-vendor reality where buyers need to pick models per use case, not per flag."
> — Dr Yasmin Al Zaabi, Principal Scientist, MBZUAI

## Why Qwen3 Matters More Than People Admit

The gain that deserves attention is Qwen3's dialect performance. [Qwen3-235B-A22B](https://qwenlm.github.io/) from Alibaba has historically been weaker on spoken Arabic dialects than on MSA. The April evaluation shows it closing to within striking distance on Gulf and Maghrebi dialect tasks, thanks to a broader instruction-tuning corpus and aggressive multilingual mixing in the latest training run.

For enterprise buyers, this is awkward. It means the strongest open-weight Arabic dialect performer for some tasks is a Chinese model, not an Arab model. That makes sovereignty a real decision rather than a default.

### By The Numbers

- 4: distinct Arabic dialect families now tracked in major benchmarks (Gulf, Levantine, Maghrebi, Egyptian).
- 53+: Arabic-capable LLMs now catalogued regionally, up from 38 in Q1 2025.
- 43,316: conversations in the latest [Jais-derived](/arabic-ai/arabic-nlp-2026-community-research-mena) synthetic multi-turn corpus across 93 topics.
- 3: Arabic LLMs now with credible multimodal (image plus text) reasoning: Fanar, Peacock, and a tuned Falcon variant.
- 5: vendors within a 4-point spread on the April MSA benchmark: ALLaM, Falcon, Fanar, Peacock, Qwen3.

## Dialect Handling, Vendor By Vendor

The best way to read the April numbers is per dialect rather than per model. No single model is the category leader on all four dialects.

- **Gulf (Khaleeji)**: ALLaM leads, Fanar within two points, Qwen3 now third.
- **Levantine**: Fanar leads, ALLaM second, Falcon third.
- **Egyptian**: Peacock strongest on multimodal, Fanar leads text-only, [Noor family](https://tii.ae/) close behind.
- **Maghrebi (Darija)**: Qwen3 surprisingly strong, AceGPT competitive, Fanar second.
| Dialect | Strongest text model | Strongest multimodal | Gap from open to closed |
|---|---|---|---|
| Gulf | ALLaM | Peacock | Small |
| Levantine | Fanar | Peacock | Small |
| Egyptian | Fanar | Peacock | Narrowing |
| Maghrebi | Qwen3 | Fanar | Wider, in open's favour |

## What Enterprise Buyers Should Do

Three takeaways for MENA enterprise AI teams. First, stop picking a single model as the default. Pick a dialect stack: one model for MSA and Gulf, one for Levantine, one for Maghrebi, and one for multimodal tasks. Second, price the sovereignty premium honestly. If Qwen3 is strictly better for a production task, the question is not whether to use it, but how to contain residency and export risk. Third, benchmark on your own data. Public leaderboards are a starting point, not a conclusion.

> "The leaderboards are converging. What separates vendors now is not raw score but deployment support, data residency, and enterprise tooling."
> — Ziad Barazi, Head of AI, Majid Al Futtaim Group

For a broader view of where the Arabic NLP research community is heading, see our [April 2026 scoreboard](/arabic-ai/arabic-llm-scoreboard-april-2026-falcon-jais-allam) and the continuing [Arabic NLP community research coverage](/arabic-ai/arabic-nlp-2026-community-research-mena).

## The Sovereignty Tension

Saudi and UAE policymakers have been explicit that Arabic AI capability matters for sovereignty. The April results complicate that narrative. If the strongest open-weight option on a key dialect is a model trained primarily in China, some Gulf institutions will have to make harder deployment decisions than they imagined 12 months ago. Expect [SDAIA](https://sdaia.gov.sa/), [TII](https://www.tii.ae/), and [QCRI](https://www.hbku.edu.qa/en/qcri) to respond with faster release cadences through the rest of 2026.
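The dialect-stack recommendation above can be sketched as a thin routing layer. A minimal sketch follows, with the dialect-to-model mapping taken from the April per-dialect results; the function name, dialect labels, and dictionary structure are illustrative assumptions, not any vendor's real API.

```python
# Minimal sketch of a dialect-aware model router.
# The dialect -> model mapping mirrors the April per-dialect results;
# names and labels here are hypothetical, not a shipped interface.

# Per-dialect defaults for text-only tasks.
DIALECT_STACK = {
    "msa": "ALLaM",
    "gulf": "ALLaM",
    "levantine": "Fanar",
    "egyptian": "Fanar",
    "maghrebi": "Qwen3",
}

# Image-grounded tasks override the text-only choice.
MULTIMODAL_DEFAULT = "Peacock"

def pick_model(dialect: str, multimodal: bool = False) -> str:
    """Return the model to route a request to, falling back to MSA."""
    if multimodal:
        return MULTIMODAL_DEFAULT
    return DIALECT_STACK.get(dialect.lower(), DIALECT_STACK["msa"])
```

Under this sketch, `pick_model("maghrebi")` routes to Qwen3 while any multimodal request goes to Peacock regardless of dialect, which keeps the sovereignty question visible per task rather than buried in a single default.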
## The AI in Arabia View

The April evaluation is the first serious sign that Arabic LLMs have matured into a real market. Buyers now have enough choice that per-task selection is possible. That is good for enterprise AI teams and awkward for ministry-level sovereignty narratives.

Our view is that the right posture for 2026 is pragmatic. Build a dialect-aware model stack, keep one sovereign option in every deployment, and benchmark on your own data every quarter. By year-end, we expect the gap between open and closed Arabic models to narrow further, and multimodal Arabic reasoning to become the new frontier. Pick your stack now or get locked into a single-vendor path that will cost more to unwind later.
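The quarterly benchmark-on-your-own-data advice can be made concrete with a small per-dialect scoring loop. Everything in this sketch (the sample format, the `judge` callback, the function name) is a hypothetical illustration of the workflow, not a standard harness.

```python
from collections import defaultdict
from typing import Callable

def per_dialect_accuracy(
    samples: list[dict],
    judge: Callable[[dict], bool],
) -> dict[str, float]:
    """Score a labelled transcript set dialect by dialect.

    Each sample is a dict with at least a 'dialect' key; `judge` is
    your own pass/fail check (exact match, rubric, or LLM-as-judge).
    """
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for sample in samples:
        dialect = sample["dialect"]
        total[dialect] += 1
        if judge(sample):
            passed[dialect] += 1
    # Report accuracy per dialect, never a single blended score.
    return {d: passed[d] / total[d] for d in total}
```

Run the same transcript set through each candidate model, then compare the per-dialect dictionaries quarter over quarter; a blended score would hide exactly the dialect gaps this article is about.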
## Frequently Asked Questions

### Which Arabic LLM should I use for Gulf dialect customer service?

ALLaM is the safest sovereign choice for Gulf dialect customer service with enterprise deployment. Fanar is a strong second, especially for organisations with Qatari operations. Test both on a sample of real customer transcripts before committing.

### Is Qwen3 safe to deploy for MENA enterprise use?

It depends on your data residency and export requirements. Qwen3 is a Chinese-origin model, which raises questions for regulated sectors. Many private-sector MENA deployments are acceptable. Government and defence-linked work generally is not.

### What does multimodal Arabic reasoning actually mean?

It means models that can reason across Arabic text and images together, for example analysing an Arabic-language invoice, reading a handwritten form, or interpreting an Arabic sign in a photograph. Peacock currently leads this category, with Fanar close behind.

### How often should enterprise teams re-benchmark?

Every quarter, minimum. The April 2026 results moved meaningfully from January, and a sovereign model release from SDAIA or TII could reshuffle positions again within weeks. Static vendor choices go stale quickly in this market.

Which dialect performance gap matters most for your deployment? Drop your take in the comments below.