Kyle O'Brien

Kyle O'Brien

Trying to understand the minds on our computers

Current Work

I'm a founding researcher at Geodesic Research, a UK-based technical AI safety organization building robust alignment initializations—base models shaped through pre- and midtraining to hold up under capabilities reinforcement learning. There, I lead pretraining and midtraining safety directions to address dangerous capabilities (e.g., biorisk and offensive cyber) and to instill deeper alignment and character into LLMs.

I was formerly a Research Manager with ERA Cambridge, where I mentored junior researchers by matching them with mentors, brainstorming research questions, and executing empirical research projects. Before that I was at EleutherAI, and an applied scientist and software engineer at Microsoft before pivoting into AI research.

My research style favors fast feedback loops, clear falsifiable hypotheses, skepticism, and intellectual rigor. For details on my past work and experience, please see my resume.

Research Direction

My north star is to help make AGI go well for humanity. There are many ways it may not. I don't study them all. My research focuses on developing solutions to the technical and governance challenges. I focus on LLMs because I believe they will lead to AGI and beyond.

My current focus, which I pursue at Geodesic Research, is building the most robust alignment initializations for capable LLMs—base models shaped through pre- and midtraining so their alignment properties persist through capabilities reinforcement learning. This spans alignment pretraining (baking alignment priors into base models), misalignment quarantining through declarative midtraining documents, and stress-testing these interventions for adversarial robustness against capabilities RL.

Previously, I focused on Capability Prevention & Removal (CPR): building scalable techniques for removing unwanted capabilities from models before deployment. For instance, our paper Deep Ignorance demonstrates how to prevent models from learning biorisk knowledge in the first place (covered by the Washington Post and Fortune). The throughline is that careful, data-centric interventions early in the training pipeline may prove more robust than post-training techniques alone.

Selected Publications

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien

International Conference on Machine Learning (ICML), 2026 (Spotlight)

Alignment Pretraining figure
We investigate how pretraining data containing discussions about AI systems measurably influences the alignment characteristics of large language models when prompted as AI assistants. Increasing positive AI-related content during pretraining enhances alignment rates, and these improvements remain stable even after extensive post-training on millions of examples. This demonstrates a self-fulfilling mechanism where discourse about AI systems directly shapes subsequent model behavior.

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman

International Conference on Learning Representations (ICLR), 2026

Deep Ignorance figure
We investigate whether filtering dual-use topics from training data can serve as a tamper-resistant safeguard for open-weight LLMs. Our multi-stage data filtering pipeline demonstrates substantial resistance to adversarial fine-tuning on up to 10,000 steps and 300M tokens of biothreat-related text, outperforming existing post-training baselines by over an order of magnitude, with no degradation to unrelated capabilities.

Get in Touch

I'm always excited to discuss potential collaborations, my research, or AI safety career advice—please feel free to drop me a line. I do my best to reply, though my bandwidth is currently limited, so I appreciate your patience if a response is slow in coming.