## Introduction
The Arab world's artificial intelligence ambitions rest on solid foundations - but only if we measure them accurately. As MENA-based language models mature and compete on the global stage, benchmarking becomes the referee in an increasingly competitive arena. Benchmarks like MMLU-AR, ArabicMMLU, HELM Arabic, and an Arabic-adapted MT-Bench have emerged as the critical tools for understanding where regional models stand. This article examines the current state of Arabic language model evaluation, revealing both impressive progress and critical gaps that demand attention.
## By The Numbers
- **62.3%**: Jais-chat (30B) accuracy on ArabicMMLU - the strongest open-source Arabic model performance
- **72.5%**: GPT-4's score on ArabicMMLU, setting the closed-model baseline
- **40 tasks, 14,575 questions**: The scope of ArabicMMLU, sourced from school exams across North Africa, the Levant, and the Gulf
- **7 major benchmarks**: HELM Arabic now evaluates models across AlGhafa, ArabicMMLU, EXAMS, MadinahQA, AraTrust, ALRAGE, and translated MMLU
## Why Benchmarks Matter for Arabic AI
Language models don't come with built-in quality metrics. MENA's vision of sovereign Arabic AI requires rigorous evaluation frameworks - not just marketing claims. Benchmarks serve three critical functions: they quantify model capability across diverse linguistic phenomena, they enable fair comparison between models trained under different conditions, and they identify blind spots before systems are deployed in high-stakes environments like healthcare, legal services, or government.
Arabic, with its rich morphology, complex syntax, and dialectal diversity across the region, demands benchmarks that reflect real-world complexity. A model trained on Classical Arabic might fail spectacularly on Egyptian colloquial text. Without benchmarks, we wouldn't know. Without transparency, we couldn't fix it.
The rise of standardized Arabic NLP benchmarks represents a maturation of the field. Where English-language benchmarking is well-established (MMLU, HELM, MT-Bench), Arabic has historically lagged. That gap is closing - rapidly.
## Current State of Arabic Benchmarks
The benchmarking landscape for Arabic language models has expanded dramatically since 2023. The field now recognizes four primary evaluation dimensions: knowledge and STEM reasoning (MMLU-AR, ArabicMMLU), language task capabilities (translation, summarization, named entity recognition), cultural and dialectal understanding (ACVA, NADI), and specialized domain performance (medical QA, legal text analysis).
**MMLU-AR and ArabicMMLU** lead the charge. ArabicMMLU, developed by MBZUAI, comprises 40 distinct tasks covering everything from science to social studies, with 14,575 multiple-choice questions sourced from authentic school exams across the entire MENA region. This isn't generic translated content - it's native Arabic, reflecting actual educational standards from Morocco to the UAE.
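To make the mechanics concrete, here is a minimal sketch of how a multiple-choice benchmark in the ArabicMMLU style is scored: each item pairs a question with candidate options and a gold option index, and the headline metric is plain accuracy. The item fields and the `model_answer` stand-in are illustrative assumptions, not the official evaluation harness; a real run would load the published dataset and call an actual model.

```python
# Minimal sketch of multiple-choice scoring in the ArabicMMLU style.
# Field names and model_answer() are illustrative assumptions; a real
# evaluation would load the published dataset and call an actual model.

def model_answer(question: str, choices: list[str]) -> int:
    """Stand-in for a real model call; returns the chosen option's index."""
    return 0  # trivially picks the first option - replace with your model

def accuracy(items: list[dict]) -> float:
    correct = sum(
        model_answer(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

# Toy item in the school-exam multiple-choice format.
sample = [{
    "question": "ما هي عاصمة المغرب؟",  # "What is the capital of Morocco?"
    "choices": ["الرباط", "الدار البيضاء", "فاس", "طنجة"],
    "answer": 0,  # Rabat
}]
print(f"accuracy = {accuracy(sample):.2f}")  # 1.00 for this toy item
```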
**HELM Arabic**, launched by Stanford's Center for Research on Foundation Models, brings holistic evaluation to the table. Rather than isolating a single benchmark, it evaluates models across seven scenarios simultaneously: knowledge and reasoning (AlGhafa, ArabicMMLU, EXAMS, MadinahQA, and translated MMLU), trustworthiness (AraTrust), and retrieval-augmented generation quality (ALRAGE).
**MT-Bench**, while originally designed for English multi-turn conversation evaluation, is increasingly adapted for Arabic. Its strength lies in evaluating instruction-following and dialogue quality - capabilities that matter far more in production systems than raw knowledge scores.
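For contrast, the MT-Bench recipe replaces gold answers with an LLM judge: a strong model reads the whole multi-turn exchange and emits a 1-10 rating. The sketch below shows the shape of that loop; the `judge_model` placeholder and prompt wording are assumptions, not the official MT-Bench code (which uses its own judge prompts and a GPT-4-class judge).

```python
# Sketch of MT-Bench-style LLM-as-judge scoring for a multi-turn Arabic
# dialogue. judge_model() is a placeholder; the real pipeline calls a
# strong judge model with MT-Bench's own prompts.
import re

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's Arabic replies below "
    "for helpfulness, fluency, and instruction-following on a scale of 1-10. "
    "Answer with 'Rating: [[N]]'.\n\nConversation:\n{conversation}"
)

def judge_model(prompt: str) -> str:
    # Placeholder for an API call to a judge model; returns a canned verdict.
    return "Rating: [[7]]"

def score_dialogue(turns: list[tuple[str, str]]) -> int:
    conversation = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in turns)
    verdict = judge_model(JUDGE_PROMPT.format(conversation=conversation))
    match = re.search(r"\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else -1  # -1 flags an unparseable verdict

# Two-turn exchange: capital of Morocco, then its currency.
print(score_dialogue([("ما عاصمة المغرب؟", "الرباط."), ("وما عملتها؟", "الدرهم المغربي.")]))
```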
The diversity of these benchmarks is intentional. A single score can hide critical weaknesses. **Arabic.AI's LLM-X** achieved the highest mean score across HELM Arabic benchmarks, but even top performers show dramatic performance variation across individual tasks - hinting at specialization rather than general competence (see the sanity check after the table below).
| Model | ArabicMMLU | AlGhafa | EXAMS | MadinahQA | AraTrust | ALRAGE | Mean Score |
|---|---|---|---|---|---|---|---|
| Arabic.AI LLM-X | 76.2% | 88.5% | 82.3% | 81.1% | 79.4% | 74.8% | 80.4% |
| Qwen3 235B | 74.1% | 85.2% | 80.1% | 78.9% | 77.2% | 72.1% | 77.9% |
| Jais-chat 30B | 62.3% | 71.4% | 68.2% | 69.8% | 65.3% | 61.2% | 66.4% |
| Falcon-Arabic | 68.9% | 79.1% | 75.4% | 74.2% | 71.6% | 68.3% | 72.9% |
| GPT-4 (baseline) | 72.5% | 85.1% | 79.8% | 80.2% | 78.9% | 73.5% | 78.3% |
| ALLaM 70B | 64.7% | 73.2% | 70.1% | 71.5% | 66.8% | 63.4% | 68.3% |
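To see how much a single mean can hide, the quick sanity check below recomputes each model's average and its best-to-worst spread from the published figures in the table; a spread above ten points is common even among the leaders.

```python
# Recompute each model's mean and per-benchmark spread from the table above.
# Columns: ArabicMMLU, AlGhafa, EXAMS, MadinahQA, AraTrust, ALRAGE.
scores = {
    "Arabic.AI LLM-X":  [76.2, 88.5, 82.3, 81.1, 79.4, 74.8],
    "Qwen3 235B":       [74.1, 85.2, 80.1, 78.9, 77.2, 72.1],
    "Jais-chat 30B":    [62.3, 71.4, 68.2, 69.8, 65.3, 61.2],
    "Falcon-Arabic":    [68.9, 79.1, 75.4, 74.2, 71.6, 68.3],
    "GPT-4 (baseline)": [72.5, 85.1, 79.8, 80.2, 78.9, 73.5],
    "ALLaM 70B":        [64.7, 73.2, 70.1, 71.5, 66.8, 63.4],
}
for model, s in scores.items():
    print(f"{model:17s} mean={sum(s)/len(s):5.1f}  spread={max(s)-min(s):4.1f}")
```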
## How MENA Models Are Performing
The results are mixed - inspiring in some dimensions, sobering in others. Open-source Arabic models have made genuine progress. **Jais-chat (30B)**, developed in the UAE by Inception (a G42 company) with MBZUAI and Cerebras Systems, established a high bar for open models at 62.3% accuracy on ArabicMMLU. This isn't world-class, but for a model developed specifically for Arabic by regional institutions, it demonstrates serious technical capability.
**Falcon-Arabic**, based on TII's Falcon architecture and fine-tuned for Arabic, improves to 68.9% on ArabicMMLU. Alibaba's **Qwen3 (235B)**, though not a regional model, reaches 74.1%, surpassing GPT-4 on that benchmark. The Falcon-Arabic gains matter: they suggest MENA can build competitive Arabic LLMs without foreign reliance.
But closed-model performance still leads. **GPT-4's** 72.5% accuracy reminds us that OpenAI's general multilingual training still outperforms most specialized Arabic training on raw benchmarks. **Arabic.AI's LLM-X**, a proprietary closed model, claims 76.2% - though independent verification remains limited.
The concerning pattern: models perform differently across benchmarks. Jais-chat excels at some tasks but struggles on others. This inconsistency suggests models are overfit to specific benchmark characteristics rather than achieving robust Arabic understanding. MT-Bench multi-turn dialogue results show even wider variance - some models handle complex Arabic conversations well, while others derail into code-switching or lose context.
THE AI IN ARABIA VIEW: MENA's models are closing the gap, but the gap itself is moving. Each quarter brings stronger baselines from OpenAI, Google, and Anthropic. The window to build competitive Arabic-first LLMs is real but finite. Investment in benchmarking infrastructure and regional model development must accelerate, or MENA will remain dependent on adapted multilingual models rather than truly sovereign AI.
## Gaps and Opportunities
The current benchmarking landscape reveals four critical gaps:
**First, temporal and news-based evaluation remains sparse.** Most benchmarks rely on static school exam questions. Real-world Arabic NLP needs to handle breaking news, social media discourse, and contemporary references. MMLU-AR hasn't been refreshed since 2023. Arabic models might be excellent at unchanging knowledge but brittle on current events.
**Second, dialectal coverage is incomplete.** ArabicMMLU includes questions from across MENA, but the distribution is uneven. Moroccan Darija, Jordanian Arabic, and Gulf dialects are underrepresented. Most benchmarks default to Modern Standard Arabic (MSA), leaving regional models vulnerable on local variants.
**Third, domain-specific benchmarks need expansion.** Medical Arabic (AraHealthQA exists but remains narrow), legal Arabic, and technical documentation are underrepresented. A model might excel on general knowledge but fail on domain-critical tasks.
**Fourth, multi-modal benchmarking is nascent.** As Arabic LLMs incorporate vision, we lack robust evaluation frameworks for Arabic image-to-text, cross-lingual visual reasoning, and culturally grounded multimodal understanding.
The opportunity: MENA institutions can lead here. Developing region-specific benchmarks for Emirati healthcare, Egyptian e-commerce, Saudi enterprise software, and Moroccan education would advance the field while building competitive advantage. Sovereign AI infrastructure requires sovereign evaluation.
## Frequently Asked Questions
### What's the difference between MMLU-AR and ArabicMMLU?
MMLU-AR is a direct Arabic translation of the original English MMLU benchmark, which can introduce translation artifacts. ArabicMMLU is natively developed in Arabic from authentic school exams across MENA. Both are useful, but ArabicMMLU better captures Arabic-specific knowledge.
### Why does GPT-4 still beat MENA's models on Arabic benchmarks?
GPT-4 was trained on massive multilingual data and fine-tuned extensively. Most MENA models are smaller, younger, and trained on more limited data. But the gap is closing - Qwen3 and Arabic.AI's LLM-X already match or exceed GPT-4 on several of the benchmarks above. And raw benchmark performance doesn't capture everything: specialized MENA models might excel at regional dialects, cultural understanding, or specialized domains where GPT-4 is weak.
### How important is MT-Bench compared to MMLU-AR?
MMLU-AR tests knowledge breadth; MT-Bench tests instruction-following and dialogue quality. In production systems, both matter. A model with high MMLU-AR but poor MT-Bench scores might ace exams but frustrate users in conversation. Leading MENA models need excellence in both.
### Is ArabicMMLU the final word on Arabic model quality?
No benchmark is final. ArabicMMLU is an excellent knowledge benchmark, but it doesn't capture creativity, reasoning, cultural sensitivity, multilingual code-switching, or real-time performance. Use benchmarks as data points, not oracles.
### What should MENA organizations prioritize: building models or building benchmarks?
Both. But building benchmarks might be the faster path to impact. Open-source benchmarks attract research, enable fair comparison, and create accountability. Regions that own their evaluation infrastructure have leverage over their AI destiny.
Drop your take in the comments below.