OpenLLM France
Developing open-source, transparent AI with a French twist
Building digital commons for the French language to put AI building blocks in the hands of researchers, engineers and educators working on local use cases
About us
OpenLLM France is a research project funded by BPI France for a period of two years.
Born from the OpenLLM France community – a group of academic and industry stakeholders interested in truly open-source generative models – the consortium is composed of nine official partners.
These nine partners are supported by 12 associate partners.
Our philosophy
French Data Collection and Processing
Developing French corpora helps minimize biases from English, which is heavily overrepresented in open training datasets.
Open Redistribution of Training Data
We republish our datasets in the format used for training in order to ensure the auditability of our data and models.
Sharing of Model Weights Under Open Licenses
We share final and intermediate pretraining checkpoints to facilitate research and continual pretraining.
Open Publication of Training and Processing Code
Sharing code for training and data processing promotes interpretability and helps others get started on model training.
Our Research Topics
Multilinguality
From education to healthcare, AI in French-speaking countries requires strong French language skills, often overlooked by English-centric models. We provide resources tailored to French while advancing research on training French-language, bilingual, and multilingual models.
Clean data
Our work is guided by a commitment to data transparency and respect for intellectual property, in full compliance with European directives. While this approach may impact model performance, we believe that the long-term benefits of openly sharing training data, above all fostering future research and development, far outweigh that cost.
Multimodality
A wide variety of use cases leverage AI systems capable of understanding human speech and analyzing visual information such as graphs and tables. Together with our academic partners, we are exploring more advanced approaches to designing multimodal conversational agents.
Education
An important goal of our project is to improve the use of AI in education. This involves working with educators to develop models that support both teachers and learners in real-world scenarios and, above all, collaborating with experts to raise awareness of the risks associated with AI and to promote best practices.
Explore our resources
The Luciole family is our brand-new lineup of pre-trained language models. Just like Lucie 7B, the Luciole models were trained on approximately 30% French data.
Check out Luciole 1B, 8B, and 23B, as well as the training data, on Hugging Face. Our code for data processing and model training can be found on our GitHub repository.
Model Sizes
1B for edge use cases, 8B Mamba hybrid for better management of long contexts, and 23B for increased performance and reasoning.
Billion tokens
Carefully selected to strike a balance between quality and diversity, while retaining our commitment to openness and transparency.
Languages
A multilingual approach, with a particular focus on French and the major European languages, ensuring cultural and linguistic representation.
LUCIE 7B
Lucie-7B, our first foundation model trained from scratch, was also the first large French-focused foundation model, with French making up more than 30% of its training data.
To learn more about the Lucie family of models and their training data, check out our spaces on Hugging Face and GitHub.
Our Commitments to Energy-Efficient Generative AI
As part of our commitment to sustainable development, we conduct an environmental life-cycle analysis of our models based on the AFNOR methodology from the General Reference for Frugal AI. This assessment covers all stages of the process, from training to inference.


