Francis Kulumba
Ph.D. Candidate in NLP · Research Scientist · ALMAnaCH, Inria Paris · Sorbonne Université
I am a Ph.D. candidate in natural language processing in the ALMAnaCH team at Inria Paris and Sorbonne Université, supervised by Laurent Romary. I am currently a Research Scientist at the French Ministry of Defense, where I contribute to the research and deployment of a domain-specific French embedding model tailored to the needs of administrations.
My research focuses on authorship attribution through learned representations of writing style, combining contrastive learning, information retrieval, and mechanistic interpretability. During my Ph.D., I built and released HALvest, a multilingual scholarly corpus, and its contrastive derivative HALvest-Contrastive. I trained embedding models that outperform baselines by a factor of four on stylometric retrieval. I also characterized where authorship signal emerges in encoder-based language models and traced the internal circuits of an 8B-parameter language model to explain how a planted backdoor trigger reroutes its output.
I also enjoy teaching. I co-designed and taught an Advanced NLP graduate course at EPITA, and served as a teaching assistant at Paris 1 Panthéon-Sorbonne.
Releases
🤗 … downloads / month.
| resource | description | downloads/mo |
|---|---|---|
| almanach/halvest | 17-billion-token multilingual scholarly corpus | … |
| almanach/halvest-contrastive | authorship attribution benchmark | … |
| almanach/camembertv2-base | French RoBERTa-like encoder | … |
| almanach/camembertav2-base | French DeBERTaV3-like encoder | … |