G42 Launches JAIS 70B To Champion Arabic NLP

Inception, a subsidiary of G42, has released JAIS 70B, a new large language model (LLM) aimed at applications such as customer service, content creation, and data analysis.

With 70 billion parameters, JAIS 70B brings what the company describes as unparalleled Arabic-English bilingual capabilities to the open-source community. The model was trained on a dataset of 370 billion tokens, 330 billion of which were Arabic, which Inception says is the largest Arabic dataset ever used to train an open-source foundation model.

Dr. Andrew Jackson, CEO of Inception, said, “AI is now a proven value-adding force, and large language models have been at the forefront of the AI adoption spike. JAIS was created to preserve Arabic heritage, culture, and language, and to democratise access to AI. Releasing JAIS 70B and this new family of models reinforces our commitment to delivering the highest quality AI foundation model for Arabic speaking nations.”

A comprehensive suite of models

In addition to JAIS 70B, Inception has unveiled a comprehensive suite of JAIS models designed to cater to a wide range of applications. This suite includes 20 models across eight different sizes, ranging from 590 million to 70 billion parameters. These models have been specifically fine-tuned for chat applications and trained on up to 1.6 trillion tokens, incorporating Arabic, English, and code data.

Neha Sengupta, Principal Applied Scientist at Inception, said, “For models up to 30 billion parameters, we successfully trained JAIS from scratch, consistently outperforming adapted models in the community. However, for models with 70 billion parameters and above, the computational complexity and environmental impact of training from scratch were significant. We made a choice to build JAIS 70B on the Llama2 model, allowing us to leverage the extensive knowledge base of an existing English model and develop a more efficient and sustainable solution.”

According to the company, JAIS 70B matches, and in specific cases exceeds, the English-language processing capabilities of Llama2, while particularly excelling at Arabic output. The development team extended the Llama2 tokeniser, doubling the model’s base vocabulary to make Arabic text processing more efficient. According to Sengupta, this adjustment “splits Arabic words less aggressively and makes training and inferencing cheaper” than the standard Llama2 model.
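To make that efficiency claim concrete, the sketch below compares how the base Llama2 tokeniser and the extended JAIS tokeniser split the same Arabic sentence. The Hugging Face model IDs are assumptions for illustration; check the JAIS page on Hugging Face for the exact identifiers.

```python
from transformers import AutoTokenizer

# Arabic sample: "Artificial intelligence is changing the way we interact with technology"
arabic_text = "الذكاء الاصطناعي يغير طريقة تفاعلنا مع التكنولوجيا"

# Base Llama2 tokenizer (gated on Hugging Face; requires an accepted licence)
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# JAIS tokenizer with the doubled vocabulary (assumed model ID)
jais_tok = AutoTokenizer.from_pretrained("inceptionai/jais-adapted-70b")

llama_ids = llama_tok.encode(arabic_text, add_special_tokens=False)
jais_ids = jais_tok.encode(arabic_text, add_special_tokens=False)

# Fewer tokens per sentence means less compute per word at both training
# and inference time, which is the saving Sengupta describes.
print(f"Llama2: {len(llama_ids)} tokens (vocab size {llama_tok.vocab_size})")
print(f"JAIS:   {len(jais_ids)} tokens (vocab size {jais_tok.vocab_size})")
```

A vocabulary with dedicated Arabic tokens splits words into fewer pieces, so the same Arabic text consumes fewer tokens end to end.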

Inception’s JAIS 70B follows the successful launches of JAIS-13B and JAIS-30B models, which have already set high benchmarks in the field. With JAIS 70B, Inception continues to push the boundaries of what’s possible in AI, particularly for under-served languages.

Researchers, developers, and businesses interested in leveraging the capabilities of JAIS 70B can download the models and access the technical paper and benchmarking results by visiting the dedicated JAIS page on Hugging Face.
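As a minimal getting-started sketch, the snippet below loads one of the chat-tuned models with the Hugging Face `transformers` library. The model ID, prompt, and generation settings are assumptions for illustration; the smaller family members follow the same pattern on more modest hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID; see the JAIS page on Hugging Face for exact identifiers.
model_id = "inceptionai/jais-adapted-70b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package and, for the 70B model,
# multiple GPUs; the smaller JAIS family members fit on a single card.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Arabic prompt: "Explain the benefits of solar energy."
prompt = "اشرح فوائد الطاقة الشمسية."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```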