Francis Kulumba

I am a Ph.D. candidate in natural language processing at Inria Paris in the ALMAnaCH team, supervised by Laurent Romary. While awaiting my Ph.D. defense, I work as a Research Scientist at the French Ministry of Defense, where I contribute to the research and deployment of a domain-specific French embedding model for the needs of public administrations.

I study authorship attribution: given a text, can we identify who wrote it from distributional patterns in their writing alone? I approach this as a retrieval problem, training embedding models to map texts by the same author to nearby points in a shared space. To build and evaluate these models, I constructed HALvest, a multilingual scholarly corpus, and its contrastive derivative HALvest-Contrastive. The resulting models outperform baselines by a factor of four on stylometric retrieval. Beyond learning better representations, I want to understand what models learn: where authorship signal emerges inside encoder language models, and how an 8B-parameter decoder reroutes its output when a planted backdoor trigger is present.

I co-designed and taught an Advanced NLP graduate course at EPITA and served as a teaching assistant at Paris 1 Panthéon-Sorbonne.

Download my CV

Releases

🤗 … downloads / month.

resource		downloads/mo
almanach/halvest	`17-billion-token multilingual scholarly corpus`	…
almanach/halvest-contrastive	`authorship attribution benchmark`	…
almanach/camembertv2-base	`french RoBERTa-like encoder`	…
almanach/camembertav2-base	`french DeBERTav3-like encoder`	…

selected publications

HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction

Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary, and Florian Cafiero

2026

Deciding whether two pieces of text share an author is made difficult by topical confound: two writers covering the same topic often look more alike than one writer covering two topics. We tackle this with HALvest, a 17-billion-token multilingual corpus of open-access scholarly papers, and its English contrastive derivative HALvest-Contrastive, in which same-author passages are drawn from distinct papers within a field to minimize topical overlap. We also revisit how documents are compared. Authorship systems traditionally compress each document into a single vector. Instead, we keep a sequence of vectors and compare them with late interaction, then introduce Patch-Level Late Interaction (PLI), which compresses neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is subtle.
@misc{kulumba_halvest_2026,
  title = {HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction},
  author = {Kulumba, Francis and Antoun, Wissam and Vimont, Guillaume and Romary, Laurent and Cafiero, Florian},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.DL},
  url = {https://arxiv.org/abs/2407.20595},
}

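
The three ways of comparing documents mentioned in the HALvest-Contrastive abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes ColBERT-style MaxSim (averaged over query vectors) for late interaction and mean pooling over fixed-size runs of neighboring tokens for the patch step; all function names and the `patch_size` parameter are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def single_vector_score(query_tokens, doc_tokens):
    """Baseline: cosine similarity between the two mean-pooled documents."""
    return dot(normalize(mean_pool(query_tokens)), normalize(mean_pool(doc_tokens)))

def late_interaction_score(query_tokens, doc_tokens):
    """Assumed MaxSim: each query vector keeps its best cosine match, then average."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qi, dj) for dj in d) for qi in q) / len(q)

def to_patches(tokens, patch_size):
    """Assumed patching: mean-pool each run of `patch_size` neighboring tokens."""
    return [mean_pool(tokens[i:i + patch_size]) for i in range(0, len(tokens), patch_size)]

def patch_level_score(query_tokens, doc_tokens, patch_size=2):
    """Late interaction computed over patches instead of individual tokens."""
    return late_interaction_score(
        to_patches(query_tokens, patch_size), to_patches(doc_tokens, patch_size)
    )

# Toy 2-dimensional "token embeddings" for two short texts.
text_a = [[1.0, 0.0], [0.0, 1.0]]
text_b = [[1.0, 0.0], [1.0, 0.0]]
print(single_vector_score(text_a, text_b))
print(late_interaction_score(text_a, text_b))
print(patch_level_score(text_a, text_b, patch_size=2))
```

With a patch size equal to the sequence length, the patch-level score reduces to the single-vector score, and with a patch size of one it reduces to token-level late interaction, so in this sketch the patch size interpolates between the two granularities.
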
Language-Switching Triggers Take a Latent Detour Through Language Models

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, and Djamé Seddah

2026

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model’s natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model’s capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
@misc{kulumba_language_2026,
  title = {Language-Switching Triggers Take a Latent Detour Through Language Models},
  author = {Kulumba, Francis and Antoun, Wissam and Lasnier, Théo and Sagot, Benoît and Seddah, Djamé},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2605.18646},
}

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Francis Kulumba, Guillaume Vimont, Laurent Romary, and Florian Cafiero

2026

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.
@misc{kulumba_does_2026,
  title = {Where Does Authorship Signal Emerge in Encoder-Based Language Models?},
  author = {Kulumba, Francis and Vimont, Guillaume and Romary, Laurent and Cafiero, Florian},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2605.19908},
}