Rian Touchent

PhD student in AI / NLP · Inria Paris - Team ALMAnaCH

Working on information extraction for clinical reports using LLMs. Previously an intern at Inria Sophia-Antipolis - Team STARS.

78 citations · h-index 5 · i10-index 3

Articles

2026

OntoBook: Ontology-Grounded Synthetic Textbooks for Medical Encoder Pretraining

KG-LLM @ LREC 2026

We present OntoBook, a method that converts medical ontology structure into pretraining signal for encoder language models. Our approach has three stages: random walks through ontology graphs capture hierarchical and causal relations between medical codes, a large language model reformulates these walks into fluent textbook-style prose, and the resulting text is used to train ModernCamemBERT, a 149M-parameter French encoder, with two objectives on the same data: masked language modeling and relation prediction between code pairs. On three French medical coding benchmarks (FRACCO, Cantemist-FR, Distemist-FR), OntoBook achieves significant improvements over MLM-only pretraining, with +2.5 micro-F1 on FRACCO and +8.0 micro-F1 on Distemist-FR. We find that alignment between objectives is necessary: misaligned training, where each task uses different data, causes a 30-point degradation. We release 1.3 million LLM-reformulated medical textbooks across three French ontologies (CIM-10, CCAM, ATC) and pretrained model checkpoints.
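The first stage described above, random walks over an ontology graph, can be illustrated with a minimal sketch. This is not the OntoBook implementation: the toy graph, the placeholder code names (`CODE_A`, `CODE_B`, ...), the relation labels, and the `random_walk` helper are all assumptions made for illustration.

```python
import random

# Toy ontology: each code maps to a list of (relation, neighbor) edges.
# Codes and relation names are hypothetical placeholders, not real
# CIM-10, CCAM, or ATC entries.
TOY_ONTOLOGY = {
    "CODE_A": [("is_a", "ROOT")],
    "CODE_B": [("is_a", "CODE_A"), ("caused_by", "CODE_C")],
    "CODE_C": [("is_a", "ROOT")],
    "ROOT": [],
}


def random_walk(graph, start, max_steps, rng):
    """Return one walk as a list of (source, relation, target) triples."""
    walk = []
    current = start
    for _ in range(max_steps):
        edges = graph.get(current, [])
        if not edges:
            break  # dead end: the current node has no outgoing edges
        relation, target = rng.choice(edges)
        walk.append((current, relation, target))
        current = target
    return walk


rng = random.Random(0)  # fixed seed so the walk is reproducible
walk = random_walk(TOY_ONTOLOGY, "CODE_B", max_steps=3, rng=rng)
for source, relation, target in walk:
    print(f"{source} --{relation}--> {target}")
```

In a pipeline like the one the abstract describes, each walk would then be handed to a language model to be rewritten as prose; that step is left out here.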


2026

A Causal Language Modeling Detour Improves Encoder Continued Pretraining

arXiv preprint

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
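The two-phase schedule described above (a CLM phase followed by an MLM decay) can be sketched as a simple step-to-objective mapping. This is a hedged illustration only: the `objective_at` helper and the `decay_fraction` parameter are hypothetical, and the default split is not a value taken from the paper.

```python
def objective_at(step, total_steps, decay_fraction=0.1):
    """Return the training objective ("clm" or "mlm") for a given step.

    The run starts with causal language modeling and ends with a masked
    language modeling decay phase. decay_fraction is a hypothetical knob
    giving the share of steps spent in the final MLM phase.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    decay_steps = round(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    return "clm" if step < decay_start else "mlm"


# Example: a 1,000-step run with the last 10% spent on MLM.
print(objective_at(0, 1000))
print(objective_at(950, 1000))
```

Setting `decay_fraction=0.5` would correspond to an MLM decay as long as the CLM phase, a setting the abstract also mentions.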

2026

Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

ACL 2026 · 4 citations

We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs with over 450K high-quality ones from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Hence, our dataset provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements, with clinical upsampling boosting performance by ~5% on MMLU ProfMed and educational quality filtering improving MedQA and MedMCQA by ~1%. Combinations of these techniques led to faster convergence, reaching the same performance with a third of the training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.
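The quality filtering and domain upsampling mentioned above can be sketched on a handful of annotated paragraphs. The field names (`domain`, `educational_score`), the threshold, the upsampling factor, and the `build_subset` helper are assumptions for illustration; they are not the dataset's actual schema or the settings used in the experiments.

```python
# Hypothetical annotated paragraphs; field names and values are illustrative.
paragraphs = [
    {"text": "p1", "type": "clinical case", "domain": "clinical", "educational_score": 4},
    {"text": "p2", "type": "study", "domain": "biomedical", "educational_score": 2},
    {"text": "p3", "type": "review", "domain": "biomedical", "educational_score": 5},
    {"text": "p4", "type": "clinical case", "domain": "clinical", "educational_score": 1},
]


def build_subset(records, min_score=None, upsample_domain=None, factor=1):
    """Filter by educational score, then repeat records of one domain.

    min_score: keep only records scoring at least this value (None = keep all).
    upsample_domain: domain whose records are repeated `factor` times.
    """
    subset = []
    for record in records:
        if min_score is not None and record["educational_score"] < min_score:
            continue  # drop paragraphs below the quality threshold
        copies = factor if record["domain"] == upsample_domain else 1
        subset.extend([record] * copies)
    return subset


filtered = build_subset(paragraphs, min_score=3)
upsampled = build_subset(paragraphs, upsample_domain="clinical", factor=3)
combined = build_subset(paragraphs, min_score=3, upsample_domain="clinical", factor=3)
print(len(filtered), len(upsampled), len(combined))
```

Repeating records is only one simple way to upsample; weighted sampling at training time would be an alternative design.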

2026

Gaperon: A Peppered English-French Generative Language Model Suite

ACL 2026 · 5 citations

We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination (continuing training on data mixes that include test sets) recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.

2024

CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data

LREC-COLING 2024 · 28 citations

Clinical data in hospitals are increasingly accessible for research through clinical data warehouses. However, these documents are unstructured, and it is therefore necessary to extract information from medical reports to conduct clinical studies. Transfer learning with BERT-like models such as CamemBERT has allowed major advances for French, especially for named entity recognition. However, these models are trained on plain language and are less efficient on biomedical data. Addressing this gap, we introduce CamemBERT-bio, a dedicated French biomedical model derived from a new public French biomedical dataset. Through continual pre-training of the original CamemBERT, CamemBERT-bio achieves an improvement of 2.54 points of F1-score on average across various biomedical named entity recognition tasks, reinforcing the potential of continual pre-training as an equally proficient yet less computationally intensive alternative to training from scratch. Additionally, we highlight the importance of using a standard evaluation protocol that provides a clear view of the current state-of-the-art for French biomedical models.

2024

CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

arXiv preprint · 20 citations

French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Hugging Face.

2023

CamemBERT-bio: Un modèle de langue français savoureux et meilleur pour la santé

TALN 2023 · 20 citations

French presentation of CamemBERT-bio, a biomedical language model adapted from CamemBERT for French clinical and biomedical text.

Releases

Hugging Face 25,531 downloads / month

models

Biomedical encoders

10,225 downloads/mo

French biomedical encoder models

camembert-bio-base · French biomedical encoder · 9,564
ModernCamemBERT-bio-base · French CLM-detour encoder · 134
ModernBERT-bio-base · English CLM-detour encoder · 88
camembert-bio-gliner-v0.1 · open biomedical NER · 52

large variants (2)

ModernCamemBERT-bio-large · large · 317
ModernBERT-bio-large · large · 70

model collection

CamemBERT 2.0 / ModernCamemBERT

12,881 downloads/mo

French encoder models

collection

camembertav2-base · DeBERTaV3 encoder · 7,912
moderncamembert-base · ModernBERT encoder · 3,276
camembertv2-base · RoBERTa encoder · 1,693

dataset + classifier

Biomed-Enriched

1,854 downloads/mo

Biomedical dataset enriched with LLM annotations, plus the classifier used to scale annotations to PMC-OA.

biomedical collection

Biomed-Enriched · dataset · 1,850
Biomed-Enriched-classifier · annotation classifier · 4

model collection

Gaperon

571 downloads/mo

French-English generative model suite.

collection

Gaperon-1125-8B · main 8B model · 306
Gaperon-1125-1B · 1B model · 220
Gaperon-1125-24B · 24B model · 2
gaperon-quality-classifier · quality classifier · 41

SFT variants (3)

Blog posts

2021

Google reviews insight extraction using NLP

This is my end-of-studies project. We scrape Google reviews of Renault dealers in France and use NLP to extract insights about customer satisfaction.

2021

Project Tuatara - Autonomous car competition

A friend and I are participating in an autonomous car competition. We are trying to make our car race as fast as possible using deep learning.

2021

Talking Face Video Generation using Deep Learning

This was a research internship at Inria, where I worked on talking face video generation using GANs.

2020

Image segmentation and image generation of shoes

This was my first technical internship. I had to segment the different parts of shoes and generate new shoes using GANs.