I’m interested in two broad projects focused on improving current detection efforts at SecureBio. The first is to characterize when AI-bio or general AI tools are actually useful for large-scale metagenomic detection, including tradeoffs between compute cost, sequencing cost, model type, model size, and pipeline stage. The second is to explore genomic language models as novelty detectors—for example, using perplexity-style metrics to flag surprising sequences—and to evaluate whether this approach can complement traditional bioinformatics systems in a cost-effective, sensitive, and interpretable way.
I'm excited about two broad projects, both focused on being directly useful for current detection efforts:
A sort of "phase diagram" characterizing the tradeoffs between compute cost, sequencing cost, tool types, etc. would be very valuable. Bonus points for extrapolating that diagram into the future or exploring how it changes as a function of model size, model capability, and sequencing cost. The actual math here is mostly arithmetic, but useful arithmetic! And arithmetic that requires a decent amount of deep understanding of both how metagenomic pathogen detection works and how different AI tools work.
It'd be harder but even more valuable to form a theory of which kinds of tools (general purpose LLMs, protein folding models, genomic language models, multi-track biological foundation models, ...) are in principle most promising for different parts of a metagenomic pathogen detection pipeline.
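To make the "phase diagram" idea concrete, here is a minimal sketch of the underlying arithmetic. Everything in it is an assumption for illustration: the `Scenario` fields, the function names, and any prices or throughputs passed in are hypothetical placeholders, not SecureBio figures. It only asks which cost term dominates at each point on a grid of sequencing prices and model throughputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical cost parameters; none of these come from real measurements."""
    seq_cost_per_gbase: float   # USD per gigabase sequenced
    gpu_hour_cost: float        # USD per GPU-hour
    bases_per_gpu_hour: float   # bases the model can score per GPU-hour


def cost_per_sample(s: Scenario, gbases: float, fraction_scored: float = 1.0):
    """Return (sequencing_cost, compute_cost) in USD for one sample.

    fraction_scored is the share of sequenced bases actually sent to the
    model, e.g. after a cheap upstream filter.
    """
    sequencing = s.seq_cost_per_gbase * gbases
    bases_scored = gbases * 1e9 * fraction_scored
    compute = s.gpu_hour_cost * bases_scored / s.bases_per_gpu_hour
    return sequencing, compute


def dominant_cost(s: Scenario, gbases: float, fraction_scored: float = 1.0) -> str:
    """Label the regime: which of the two cost terms is larger."""
    sequencing, compute = cost_per_sample(s, gbases, fraction_scored)
    return "compute" if compute > sequencing else "sequencing"


def phase_grid(seq_costs, throughputs, gpu_hour_cost, gbases=1.0, fraction_scored=1.0):
    """One row per sequencing price, one column per model throughput."""
    return [
        [
            dominant_cost(Scenario(sc, gpu_hour_cost, tp), gbases, fraction_scored)
            for tp in throughputs
        ]
        for sc in seq_costs
    ]


if __name__ == "__main__":
    # Placeholder values chosen only to show both regimes.
    for row in phase_grid([1.0, 10.0], [1e8, 1e9, 1e10], gpu_hour_cost=2.0):
        print(row)
```

Extrapolating the diagram then amounts to sweeping these inputs along assumed trends (cheaper sequencing, larger or faster models) and watching where the regime boundary moves. A fuller version would add terms such as storage and classical-pipeline compute.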
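The project I'm imagining starts with setting up such a detection system (a genomic language model that scores sequences with perplexity-style metrics and flags the surprising ones), then working to tune and characterize: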
This is very much an exploratory project: I don't have a cleanly labeled set of exactly the reads that should or should not be flagged; the question is to explore whether and how this kind of detection scheme provides a valuable complement to more traditional approaches.
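As a rough illustration of what a perplexity-style novelty score looks like, the sketch below uses a small k-mer Markov model as a stand-in for a genomic language model. This swap is an assumption made to keep the example self-contained; a real system would use a trained genomic language model. The function names, the smoothing constant, and the threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k=3):
    """Count next-base frequencies for every length-k context in the background."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts


def perplexity(seq, counts, k=3, alpha=1.0):
    """Per-base perplexity of seq under the k-mer model, with add-alpha smoothing.

    Assumes seq contains only A/C/G/T; ambiguous bases such as N would need
    separate handling in a real pipeline.
    """
    log_prob = 0.0
    n = 0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], Counter())
        p = (ctx[seq[i]] + alpha) / (sum(ctx.values()) + alpha * len(ALPHABET))
        log_prob += math.log(p)
        n += 1
    if n == 0:
        raise ValueError("sequence must be longer than k")
    return math.exp(-log_prob / n)


def flag_surprising(reads, counts, threshold, k=3):
    """Return the reads whose perplexity exceeds a (hypothetical) threshold."""
    return [r for r in reads if perplexity(r, counts, k) > threshold]


if __name__ == "__main__":
    background = ["ACGT" * 50]  # toy "expected" sequence content
    model = train_markov(background)
    reads = ["ACGTACGTACGTACGT", "GGGGGGGGGGGGGGGG"]
    for r in reads:
        print(r, round(perplexity(r, model), 2))
    print(flag_surprising(reads, model, threshold=2.0))
```

With no labeled set of reads to flag, the threshold cannot be fit directly. Much of the exploratory work would be in deciding how to set it and how to interpret what ends up above it.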
---
Note that by the time MATS starts, one of these projects will likely already be underway in a parallel fellowship.
Evan is a research scientist at SecureBio Detection, where he works on computational pipelines and detection methods for uncovering threats in deep metagenomic sequencing data. Prior to transitioning his career into biosecurity, he was the VP of data science and engineering at Zoba, a startup providing optimization services to the shared mobility industry. He holds a PhD in operations research and, in his free time, devotes lots of thought cycles to sourdough pizza.
By default, we'll mostly collaborate via a standing weekly meeting (~1 hour), wherein we'll discuss recent progress and next directions. I'm available via Slack for quick back-and-forth on ideas, sanity checks, and unblocking (data access, etc.), but will rely on the fellow to manage their own implementations, code review, debugging, etc.
Essential:
Preferred:
Not a good fit:
I'll decide which of the two broad project ideas we run with based on SecureBio Detection's needs, which fellows are matched with me, and similar factors. Within that broad project, I'll steer toward what I think is helpful, interesting, and relevant to SecureBio Detection, and I expect the fellow to have both the autonomy and the responsibility to pick concrete work directions.