Fanar-2 and Ai71's Noor Are Quietly Converging on Arabic Summarisation, and the Gap to GPT Is Closing
Qatar's Fanar and Abu Dhabi's Ai71 both shipped updated Arabic-first models this month, and the benchmarks they have published tell a surprisingly consistent story: the two models are within two to three percentage points of each other on Arabic summarisation, and within roughly eight points of GPT-5 on the same tasks. That is the closest the regional models have come to frontier performance on an Arabic-specific workload since the first Jais release in 2023.
Why summarisation matters
Summarisation is the Arabic AI benchmark that Gulf enterprises actually care about. Legal firms want to summarise hundreds of pages of Arabic contracts. Banks need short-form summaries of regulator circulars. Ministries want readable briefs from longer policy documents. The gap between a model that works and one that fails is measured in reader acceptance, and earlier Arabic models have failed at the morphological level: dropping negations, compressing diacritic distinctions, or introducing classical phrasings into modern news text.
Fanar-2 and Noor have visibly improved on those failure modes.
By the numbers
- Fanar-2 scored 83.4 on Arabic AbSum v2 summarisation, up from 76.1 for Fanar-1.
- Ai71 Noor scored 81.6 on the same benchmark, a nine-point improvement over Noor-Lite.
- GPT-5 scored 89.9 on Arabic AbSum v2, leaving Fanar-2 6.5 points behind and Noor 8.3 points behind.
- Hugging Face lists 37 actively maintained Arabic-first models as of April 2026, up from 18 a year ago.
- Fanar-2 and Noor both support Modern Standard Arabic and five major dialects: Egyptian, Levantine, Gulf, Maghrebi, and Iraqi.
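The deltas implied by the published figures can be checked directly. A quick sketch, using only the numbers above; note that Noor-Lite's score is inferred from the stated nine-point improvement rather than quoted directly:

```python
# AbSum v2 scores as published; Noor-Lite is back-computed
# from the stated nine-point jump (an inference, not a quote).
scores = {
    "Fanar-2": 83.4,
    "Fanar-1": 76.1,
    "Noor": 81.6,
    "Noor-Lite (inferred)": 81.6 - 9.0,
    "GPT-5": 89.9,
}

# Gaps to the frontier model on the same benchmark.
gap_fanar2 = round(scores["GPT-5"] - scores["Fanar-2"], 1)  # 6.5
gap_noor = round(scores["GPT-5"] - scores["Noor"], 1)       # 8.3

print(f"Fanar-2 trails GPT-5 by {gap_fanar2} points")
print(f"Noor trails GPT-5 by {gap_noor} points")
```

The generational jumps (7.3 points for Fanar, 9 for Noor) are larger than the remaining gaps to GPT-5, which is the arithmetic behind the "single generation" framing below.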
Arabic summarisation was the benchmark that kept revealing the weakness of Arabic LLMs. That the regional models are now within a single generation of frontier performance is significant.
What changed between releases
Fanar-2 was trained with a substantially expanded Arabic corpus, including parliamentary records from Gulf countries and Egypt, historical press archives, and a curated set of Islamic legal texts. The training team, a partnership between QCRI and HBKU, also invested in instruction tuning with human Arabic editors, a detail often skipped in earlier regional efforts.
Ai71's Noor took a different path. The TII-affiliated team focused on synthetic data generation with quality filters, producing what Ai71 calls "grounded-synthetic" training pairs for summarisation and long-context reasoning. The result is a smaller model, 12 billion parameters compared to Fanar-2's 34 billion, that is significantly cheaper to run in production.
Head-to-head capabilities
| Capability | Fanar-2 | Ai71 Noor | GPT-5 |
|---|---|---|---|
| Arabic summarisation (AbSum v2) | 83.4 | 81.6 | 89.9 |
| MSA reading comprehension | 78.2 | 76.9 | 85.1 |
| Dialect accuracy (5-dialect avg) | 79.4 | 82.1 | 74.6 |
| Classical Arabic | 71.8 | 68.2 | 62.3 |
| Parameters | 34B | 12B | Undisclosed |
| Inference cost per 1M tokens | $1.20 | $0.45 | $3.00 |
The dialect column is the interesting one. On the five-dialect average, both regional models outperform GPT-5. That is the category where local engineering, local data, and local evaluation pay off.
Frontier-grade English models still struggle with Egyptian and Gulf dialect nuances; the regional models are built specifically for them, and that is where the gap closes.
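The cost column matters as much as the scores for the summarisation workloads described earlier. A rough sketch of per-document cost at the listed per-1M-token prices; the 6,000-token document size is a hypothetical, chosen to approximate a mid-length regulator circular plus its summary, and is not from the benchmark:

```python
# Per-1M-token prices from the capabilities table.
price_per_1m = {"Fanar-2": 1.20, "Noor": 0.45, "GPT-5": 3.00}

# Hypothetical workload size: input document plus generated
# summary, in tokens. An illustrative assumption only.
tokens_per_doc = 6_000

for model, price in price_per_1m.items():
    cost = price * tokens_per_doc / 1_000_000
    print(f"{model}: ${cost:.4f} per document")
```

At these prices Noor comes in at well under a third of Fanar-2's per-document cost, which is the practical consequence of the 12B-versus-34B parameter gap.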
Production deployments
Qatar National Library is piloting Fanar-2 for automated abstracts on its Arabic academic collection. Abu Dhabi Judicial Department is evaluating Noor for internal court document summarisation. Saudi Arabia's ALLaM team is running an internal comparison benchmark on legal documents. Most significantly, Al Jazeera Digital is testing both models for newsroom summarisation, the most visible production workload in the region.
For the broader Arabic AI picture, see our earlier coverage of April 2026's Falcon-H1 lead, the Arabic dialect benchmarks, and the April Arabic LLM scoreboard.