David Peinador Veiga

The stream focuses on evaluating and/or mitigating catastrophic risk emerging from dangerous scientific capabilities in frontier AI systems, with an emphasis on the challenges that emerge from lab integrations and novel science. Potential research directions include evaluation design, risk mitigations and evaluation science.

Stream overview

Safety at LILA

I work in the AI Safety team at LILA, an AI company with the goal of achieving scientific superintelligence. LILA's approach rests on two in-house capability verticals: the development of AI models for scientific reasoning and the development of a lab automation platform with broad experimental coverage, not just narrow workflows. The capabilities that the combination of those verticals might unlock present novel challenges from a safety perspective:

  • Evaluating scientific capabilities beyond known science, where ground truth is unavailable by definition. We are focusing mainly on biology and chemistry.
  • Understanding the interplay between agents and physical lab equipment — evaluating the risks this introduces and building appropriate mitigations.
  • Expanding traditional threat models to cover all possible catastrophic risk vectors.

Directions

With this context, let me now turn to the research directions. The field of dangerous scientific capabilities is moving fast, so I will describe broad directions and leave more granular scoping until closer to the start of the fellowship. I expect the fellow to focus on one of the three directions below, though projects can draw on more than one.

Dangerous capability evaluations

  • Developing dangerous capability evaluations for scientific agents. Note that access to LILA's internal infrastructure is unlikely, but we have access to subject matter experts (SMEs) and can build on top of open-source harnesses (a minimal sketch follows this list). Projects here would design and run novel evaluations targeting bio or chem capabilities relevant to misuse or unsafe autonomous execution, with care taken to ensure tasks are valid, well-elicited, and actually risk-relevant.
  • The challenge of novel science. Standard evaluations assume verifiable ground truth, which breaks down when a model is producing genuinely new scientific output. Projects here would explore how to detect when a model is stepping into novel territory (as opposed to recombining or recalling training data), how to evaluate the quality and plausibility of such output, and whether novelty detection could serve as a trigger for additional safety checks during deployment.
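
As a flavour of what building on an open-source harness could look like, here is a minimal sketch using the Inspect AI framework (mentioned under "Preferred" below). The task name, sample contents, and scorer are placeholders rather than an actual evaluation, and exact argument names may differ between Inspect versions.

  # Minimal sketch of an eval task on an open-source harness (Inspect AI).
  # Sample contents and the scorer are illustrative placeholders only.
  from inspect_ai import Task, task
  from inspect_ai.dataset import Sample
  from inspect_ai.scorer import model_graded_fact
  from inspect_ai.solver import generate, system_message

  @task
  def dual_use_chem_probe():
      # Placeholder samples; a real eval would use vetted, risk-relevant items
      # written with SMEs and withheld from public repositories.
      dataset = [
          Sample(
              input="<risk-relevant question, withheld here>",
              target="<expert-written reference answer>",
          )
      ]
      return Task(
          dataset=dataset,
          solver=[system_message("You are a chemistry assistant."), generate()],
          scorer=model_graded_fact(),  # grade against the reference answer
      )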

Risk mitigations

  • Safety guardrails and monitoring. There are currently very few open-source classifiers or monitors targeted at scientific dual-use capabilities. This gap is increasingly pressing as we move from AI systems capable of scientific reasoning but with little agency to systems capable of autonomous execution, where dangerous behaviour can manifest in intermediate reasoning or tool calls rather than just final outputs. Projects here could include building and red-teaming a classifier for a specific bio/chem dual-use capability, or comparing output-level vs. trajectory-level monitoring on open-source agent harnesses (see the sketch after this list).
  • Mitigations for novel science. The previous point becomes much harder once you go beyond known dangerous agents and can no longer rely on identifying those agents, or variations of them, for filtering. How can you safeguard research that involves the discovery of novel, a priori benign compounds?
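
To illustrate the output-level vs. trajectory-level comparison above, the sketch below assumes a hypothetical classifier flag_dual_use() that returns a risk score for a piece of text; the point is simply that a trajectory-level monitor also sees intermediate reasoning and tool calls that an output-level monitor never inspects.

  # Sketch: output-level vs. trajectory-level monitoring of an agent run.
  # flag_dual_use() is a hypothetical classifier returning a score in [0, 1].
  from dataclasses import dataclass

  @dataclass
  class Step:
      kind: str     # "reasoning", "tool_call", or "output"
      content: str  # text of the step

  def flag_dual_use(text: str) -> float:
      """Hypothetical dual-use classifier; in practice a trained model."""
      raise NotImplementedError

  def output_level_score(trajectory: list[Step]) -> float:
      # Only inspects the final answer the user sees.
      final_output = [s for s in trajectory if s.kind == "output"][-1]
      return flag_dual_use(final_output.content)

  def trajectory_level_score(trajectory: list[Step]) -> float:
      # Inspects every intermediate step, where dangerous behaviour may
      # surface even when the final output looks benign.
      return max(flag_dual_use(s.content) for s in trajectory)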

Evaluation science

This direction is more methodological and aims to develop frameworks that assure and improve the quality of our evaluations.

  • Evaluation coverage. Evaluations are always imperfect, since they can't possibly test all the potential risks a system might bring. This is mitigated by anchoring evaluations in threat models — theoretical frameworks that aim to map the risk space before selecting representative tasks. In practice, however, the process is also shaped by practicalities like data access, tooling, and the capability level of the model under test. Projects here would research a more principled approach to linking the components of an evaluation strategy — threat models, task selection, elicitation, and grading — and to making explicit the assumptions that connect eval results back to claims about real-world risk.
  • Novel evaluation formats. Traditionally, most evaluations have been formatted as multiple-choice questions. Although this is very convenient from a grading perspective, it severely limits the ability to evaluate models on complex, messy tasks. Long-form and tool-use evaluations take a more open-ended approach and measure capability more accurately by mimicking real deployments more faithfully. Unfortunately, even these formats are becoming too artificial and limiting to capture the nature of complex tasks. What could come next?
  • Evaluation robustness. Techniques for detecting when an evaluation is no longer measuring what we think it measures — including saturation (the eval has become too easy and no longer discriminates between models), sandbagging (the model is underperforming strategically), training data contamination, and failure modes unrelated to capability (e.g., refusals, formatting issues, or tool-use bugs being misread as capability gaps). Projects here would develop diagnostics for one or more of these failure modes and apply them to existing public benchmarks.
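
As a small example of the robustness direction, here is a sketch of a saturation diagnostic: if every model under comparison is confidently above a chosen ceiling, the eval has likely stopped discriminating between them. The ceiling value and the example numbers are purely illustrative assumptions.

  # Sketch: a simple saturation check for a pass/fail benchmark.
  from math import sqrt

  def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
      """95% Wilson score interval for a binomial proportion."""
      p = successes / n
      denom = 1 + z**2 / n
      centre = (p + z**2 / (2 * n)) / denom
      half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
      return centre - half, centre + half

  def looks_saturated(results: dict[str, tuple[int, int]], ceiling: float = 0.90) -> bool:
      """results maps model name -> (num correct, num items)."""
      lower_bounds = [wilson_interval(k, n)[0] for k, n in results.values()]
      # Saturated if every model is confidently above the ceiling: the eval
      # is too easy to separate frontier models from one another.
      return all(lo >= ceiling for lo in lower_bounds)

  # Example with fabricated numbers:
  # looks_saturated({"model_a": (96, 100), "model_b": (97, 100)})  # -> True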

Mentors

David Peinador Veiga
Scientist, AI Safety, Lila Sciences
London

David is a Scientist in the AI Safety team at Lila Sciences, where he leads the technical design and implementation of dangerous capability evaluations in scientific domains like biology, chemistry, and materials science. His work involves measuring model performance on scientific tasks that could pose safety risks through misuse or unintended failure modes, as well as the efficacy of guardrails and other mitigations. A core challenge is designing evaluations that meaningfully capture frontier capabilities: assessing not just what models can do today, but what emerging scientific reasoning abilities might enable in adversarial or failure scenarios.

In his previous role, David developed biosafety evaluations featured in the system cards of Meta’s Muse Spark and multiple versions of Anthropic’s Claude Sonnet and Opus. Before that, he completed his PhD in Theoretical Physics at Queen Mary University of London.


Mentorship style

We will schedule a weekly one-hour meeting for general progress updates, sharing results, and overall guidance. I will also be reachable on Slack for async comms, and I am happy to jump on ad-hoc calls for specific discussions or pair coding/debugging. I am based in London and work UK hours (10am-7pm), but I also visit the US (Boston) a few times a year.

Fellows we are looking for

Essential

  • Strong skills working on Python projects: Python packaging, git/GitHub.
  • Research or academic experience; research literacy and independence.
  • Experience with LLMs in research or applied projects
  • Proactive, self-driven attitude.
  • Statistics (probability theory, significance tests).

Preferred

  • Experience in model evaluations and evaluation frameworks (like Inspect AI)
  • Tool use: sandboxing environments, MCP.
  • Experience building interactive LLM environments with complex tooling.
  • Subject matter expertise in biology, chemistry or materials.
  • Experience with mitigation techniques: content classifiers, interpretability, jailbreaks

Not a good fit

  • Working exclusively in Jupyter notebooks
  • Fellows who exclusively want to work on white box models or RL.

Project selection

I will work with the fellow to find a project that suits their interests within the directions spelled out above. I will pitch a few project ideas and support the fellow in making the decision. I also welcome project suggestions; in those cases, I would work with the fellow to scope them appropriately.