AI Safety Camp
11th edition
AI Safety Camp (AISC) is an online, part-time AI safety research program. You join AISC by joining one of the projects, and you join a project by applying here.
For this camp, we have 27 public projects. Scroll down to see all of them. We recommend looking through the projects to see which ones interest you, but you also have the option of filling out a generic application for all the projects at once.
When you apply for a project, keep in mind that all collaborators are expected to work 10 hours/week and join weekly meetings.
There are many perspectives on what counts as good AI safety research, stemming from different assumptions about how hard the various parts of the problem are. These range from "Aligning an AI with any human seems not too hard, so we should focus on aligning it with all humans, and/or preventing misuse", to "Aligning fully autonomous AI to stay safe is literally impossible, so we should make sure that such AI never gets built", and everything in between, plus the perspective that "We don't know WTF we're doing, so we should do some basic research".
Our range of projects for this AISC reflects this diversity.
All AISC projects have a plausible theory of change, under some reasonable assumptions. But different projects have different theories of change and assumptions.
We encourage you, dear reader, to think for yourself. What do you think is good AI safety research? Which projects listed below do you believe in?
See our About & FAQ page for more info, or contact one of the organisers.
Team member applications:
November 1 (Saturday): Accepted proposals are posted on the AISC website. Applications to join teams open.
November 23 (Sunday): Applications to join teams close.
December 21 (Sunday): Deadline for Project Leads to choose their team.
Program
Jan 10 - 11: Opening weekend.
Jan 12 - Apr 19: Projects are happening.
Teams meet weekly, and plan in their own work hours.
April 24 - 27 (preliminary dates): Final presentations; we'll likely host an online conference for this again.
Afterwards
For as long as you want: Some teams keep working together after the official end of AISC.
When starting out, we recommend that you don’t make any commitment beyond the official length of the program. However, if you find that you work well together as a team, we encourage you to keep going even after AISC is officially over.
Let's not build what we can't control.
This project aims to create a new YouTube channel for short-form videos addressing the urgency of AI loss-of-control risk. We will be leveraging my experience with creating AI safety long-form content to make a collaborative new channel. Each team member will contribute one or more video scripts, and will likely specialize in an aspect of video production (editing, filming, thumbnails, etc). The goals are to 1) reach 1000 subscribers and get monetized, and 2) figure out the processes to create a self-sustaining channel, though participants are not committing to time beyond the program up front.
I prefer people who have already learned a bit about AI safety. Maybe you’ve taken a class from BlueDot, or read a bit on LessWrong, or even watched some videos about AI 2027. If you’re newer, you are still welcome to apply (especially if you have some video/media experience).
Generally speaking, creating a video requires the following:
Someone to choose topics and titles (helps to be familiar with social media).
Someone to research and write scripts (or outlines).
Someone to make thumbnails – not applicable for short content.
Someone to actually record the script on camera.
Someone to do video editing – but I expect to use an external paid video editor in this project.
Someone to post the content, reply to comments, maybe repost on other social media.
The first two are the hard part, and I would like everyone to at least try their hand at them.
A writers’ circle where participants help each other draft careful critiques of where AI firms skip over ongoing safety failures.
This writers’ circle is for anyone who:
has some experience in writing long pieces of any kind.
keeps up a habit of checking factual claims and including citations.
has some idea for a longform piece they desire to independently write.
wants to use thoughtful feedback from other writers to improve their piece.
is concerned by the growing harms and systemic risks of commercial AI development.
cares to relate with and speak to concerns of other communities (beyond the AI risk crowd).
In this project, you will be reaching out to hundreds of people and organizations that are campaigning against the harms of AI, with the goal of bringing them together and creating a united front. Think of artists, teachers, religious leaders, et cetera.
You will be part of the outreach team, finding the relevant people, finding their contact information, and sending emails or DMs. There is freedom in the exact groups you will be reaching out to. If you have specific ideas, we welcome those.
We are looking for people who:
Are excited to reach out to different communities.
Are comfortable with reaching out to many people.
Believe everyday people should be involved in deciding where AI development goes.
Can listen well.
Build a systems dynamics model to map out pathways toward achieving an international treaty pausing AI development. The model will not assume such a treaty is the best possible intervention, but will use it as a focal point for reasoning about the feedback loops and leverage points that would need to exist for global coordination to become possible. The core deliverable will be a model and accompanying explanation that shows how different parts of the action ecosystem interrelate, illustrates the importance of different roles, and provides strategy-relevant insights for PauseAI and allied movements.
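To give a flavour of what the deliverable could look like (a minimal sketch only: the stocks, feedback loops, and parameter values below are hypothetical placeholders, not the project's actual model), a simple system dynamics simulation in Python might start like this:

```python
# Minimal system dynamics sketch with two hypothetical stocks and Euler integration.
# Stock names, feedback loops, and parameter values are illustrative placeholders only.

dt, horizon = 0.1, 100.0
steps = int(horizon / dt)

public_concern = 0.10    # hypothetical stock: share of the public concerned about AI risk
treaty_support = 0.05    # hypothetical stock: political support for a pause treaty

for step in range(steps):
    # Reinforcing loop: support raises concern; balancing loop: support decays on its own.
    d_concern = (0.02 + 0.05 * treaty_support) * (1 - public_concern)
    d_support = 0.10 * public_concern * (1 - treaty_support) - 0.03 * treaty_support
    public_concern += d_concern * dt
    treaty_support += d_support * dt
    if step % 200 == 0:
        print(f"t={step * dt:5.1f}  concern={public_concern:.2f}  support={treaty_support:.2f}")
```

The real model would of course involve many more stocks, actors, and leverage points; the point of the sketch is only to show the kind of artifact the team would iterate on.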
Modeller, scenario designer
Familiarity with system dynamics (or willingness to learn)
Technical modeling skills for building interactive diagrams (or willingness to learn)
Researcher
Understanding of AI safety and governance
Good research skills
Knowledge of international law or treaty negotiations (Optional)
Writer
Writing and communication skill
Willingness to learn basics of other roles
Stop AI is a nonviolent activist organization demanding a permanent global ban on the development of Artificial Superintelligence (ASI). We will never have experimental evidence, before building ASI, that shows ASI will stay safe forever. If we cannot have this evidence, and it is impossible to shut down ASI once achieved, then we have to assume the Control/Alignment Problem is impossible to solve. Worse, research shows why controlling ASI sufficiently to stay safe would fall outside theoretical limits. So in theory, perpetual control is as impossible as perpetual motion.
We are looking for organizers ready to educate their local communities and mobilize them to engage in nonviolent resistance against the extinction threat posed by ASI.
We ask local organisers to have the following:
An understanding of artificial superintelligence and related risks
Basic computer/smartphone competencies
The ability and willingness to spend at least 10 hours a week promoting Stop AI and developing a chapter in their local community.
Let's enable wiser stakeholder decision-making around AI infrastructure and use.
This project will examine emerging psychological and emotional risks associated with AI products such as chatbots, companions, and therapy apps. Through a review of academic studies, real-world incidents, and comparisons with harm-reduction policies in other industries, we will synthesize early evidence of these risks. Based on this foundation, we will produce a report and design 2–3 prototype harm-reduction interventions. The outcome will be an exploratory contribution that highlights documented concerns and illustrates how lessons from past industries might inform future harm-reduction efforts in AI product development.
This project welcomes entry-level researchers and contributors who are eager to learn and collaborate.
I’m keen to meet people with the following:
Openness to exploring new questions, connecting insights across disciplines, and carefully assessing evidence.
Willingness to work in a small team, share feedback constructively, and adapt as the project evolves.
Awareness of the psychological and emotional dimensions of AI use, and care in framing potential harms without stigma.
Ability to help translate complex ideas into accessible formats, whether through clear writing, visual mock-ups, or campaign concepts.
The firms developing AI are misaligned with the interests and values of the rest of humanity. But is protesting for regulatory bans the only way to address this alignment problem? This project will focus on bootstrapping an alternative approach: enhancing consumers’ leverage to bend compromised AI companies into alignment. The initial plan is to test expanding the concept of AI company safety scorecards to put a stronger emphasis on company alignment, but the actual plan will be negotiated between project participants.
Skill requirements
Required: creativity and range to adapt to an ambiguous project
Value/attitude requirements
As the goal of the project is alignment to the values of most humans, it is not open to rationalists, transhumanists, etc.
AI systems are being developed and deployed at a rapid pace, often with little public input, despite clear evidence of harm in areas like education, healthcare, and labor. While some advocates propose building a mass “AI safety” movement, critics such as Anton Leicht warn that such efforts risk backfiring through astroturf perceptions (i.e., movements that appear grassroots but are actually manufactured or heavily funded), capture (i.e., being co-opted or redirected by powerful funders or political actors), or incoherent asks. This project asks a different question: how can individuals and communities exercise agency over AI deployment, which is the stage where most harms currently materialize, in ways that are both effective and legitimate?
We will approach this question in two parts. First, we will draw lessons from past social movements around technology (e.g., automobile safety, net neutrality, environmental justice) to identify conditions under which public mobilization succeeds or fails. Second, we will run survey experiments to test which frames and messages increase public demand for responsible AI deployment while avoiding pitfalls such as reduced credibility or perceptions of astroturfing.
Our output will be a practical, evidence-based playbook that integrates historical insights with new experimental data, offering clear guidance for practitioners and funders on how to foster responsible public engagement on AI. We envision this work on social mobilization for responsible deployment as laying the foundation for the democratic guardrails needed to govern AGI and other transformative AI systems.
Preference for candidates with experience (or interest) in designing and conducting interviews.
Experience with literature reviews (at undergraduate or graduate level) is highly valued.
Interest in AI governance, public agency, or social movements.
Strong writing and synthesis skills for turning research into accessible outputs.
Optional but welcome: experience with design/visual communication for public-facing materials.
Frontier compute oversight will likely require caps or permits for large training runs, yet most market designs assume perfect monitoring and honest reporting. This project designs and stress tests compute permit markets when monitoring is noisy and enforcement is limited.
We will build a transparent simulation where heterogeneous labs and cloud providers choose training plans, reporting strategies, and compliance under audit risk. We compare allocation rules that include auction, grandfathering, and hybrid designs, with policy features such as banking and price collars, and auditing regimes with threshold triggers.
Outputs are a reproducible codebase with dashboards, a short working paper with recommendations for regulators, and a concise regulator checklist.
The theory of change is that better mechanisms under realistic constraints keep incentives aligned and make compute controls more enforceable, which lowers unsafe scaling pressure. We follow a strict no capabilities policy and exclude safety overhead from risk metrics.
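To illustrate the kind of stress test we have in mind (a minimal sketch only: the audit rates, fines, and benefit distributions below are illustrative placeholders, not policy recommendations), consider a toy model of labs deciding whether to under-report compute use:

```python
# Toy stress test of a compute permit scheme under noisy auditing.
# All parameters (audit rate, fines, benefits) are illustrative, not recommendations.

import random

random.seed(0)

def simulate(audit_prob: float, fine: float, n_labs: int = 100) -> float:
    """Return the fraction of labs that under-report compute use."""
    cheats = 0
    for _ in range(n_labs):
        benefit = random.uniform(1.0, 10.0)                 # gain from exceeding the permitted cap
        detection = audit_prob * random.uniform(0.5, 1.0)   # noisy monitoring
        expected_penalty = detection * fine
        if benefit > expected_penalty:                       # lab cheats if it pays off in expectation
            cheats += 1
    return cheats / n_labs

for audit_prob in (0.05, 0.2, 0.5):
    for fine in (5.0, 20.0, 50.0):
        print(f"audit={audit_prob:.2f} fine={fine:5.1f} "
              f"cheating rate={simulate(audit_prob, fine):.2f}")
```

Even this toy version shows how audit intensity and penalty size jointly shape compliance; the full simulation adds heterogeneous agents, allocation rules, banking, price collars, and audit triggers on top of this skeleton.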
Minimum: comfort with probability, strategic interaction, and Python. Valued for diversity: mechanism design, agent-based modeling, public policy, and auditing or compliance experience.
Needed from teammates: engineering for reproducible code, and policy editing for a clear, regulator-friendly paper.
(previously named "Can Portable Governance Artifacts Become Public Safety Infrastructure for AI?" which is still the title in the application form)
Today the deployment of autonomous AI agents in high-stakes domains—including healthcare, finance, and infrastructure management—creates urgent governance challenges. First, each organization reinvents AI governance from scratch, leading to fragmentation, inconsistent safety, and redundant effort. Second, most widely used governance approaches remain external, reactive, and episodic, unable to interpret or intervene in real time as reasoning drifts or objectives evolve.
As AI systems become increasingly autonomous and agentic, they continuously reformulate subgoals, reprioritize tasks, and expand boundaries in pursuit of their objectives. These systems now operate in open, multi-agent environments where traditional AI governance and cybersecurity frameworks—designed for static, isolated systems—cannot keep pace.
Autonomous AI systems don’t just make decisions—they interact, compete, and adapt in ways that can lock us into unstable or harmful equilibria. Without governance designed to anticipate and shape these dynamics, we risk creating self-reinforcing cycles that no one can control, whether in financial markets, social media, or geopolitical conflicts, converging on outcomes that are difficult to reverse.
The problem, as we frame it, is not just to debug individual AI models but also to ensure that multi-agent interactions can be governed. Our motivation is to architect a governance framework that allows us to overcome this twofold problem, and our overarching questions are:
How might we implement effective real-time oversight systems adequate for the scale and speed of machines?
How might governance itself become portable, interoperable public infrastructure—like internet protocols for communication or security—so that AI safety scales with autonomy?
In this research project, we propose to test the hypothesis that open-sourcing and standardizing the fundamental building blocks of the Intrinsic Participatory Real-time Governance framework can serve as a public protocol for AI governance that provides a common layer for safety, accountability, and coordination across agents.
We invite contributors with diverse backgrounds and experiences, including but not limited to:
Protocol Engineers – Experience designing or contributing to open protocols or SDKs.
Schema and Standards Developers – Skilled in JSON-LD, RDF, or ontology modeling.
Agent Framework Developers – Familiar with LangChain, CrewAI, AutoGen, FIPA ACL, or similar.
AI Safety and Governance Researchers – Interested in cognitive oversight, interpretability, or participatory governance.
Security and Cryptography Engineers – Experience with attestations, distributed ledgers, and provenance tracking.
Ethics and Policy Specialists – For governance logic formalization and oversight dynamics.
Protocol and SDK Developers with experience in distributed systems, API design, or open standards.
UI/UX experts
Governance Schema and Data Modelers skilled in JSON-LD, ontology design, or protocol buffers.
Agent Framework Engineers who can help test integration with agent-to-agent communication protocols.
Research Engineers who can prototype and empirically validate reasoning oversight mechanisms.
You can contribute by testing a single integration with AutoGen, so the entry barrier for contributors is low. Contributors will gain expertise in agent protocol design.
Let's better understand the risk models and risk factors from AI.
Core Problem: The field of AI safety evaluations is immature (see, e.g., Apollo’s “We need a science of evals”). Yet these evaluations inform critical high-stakes decisions like Responsible Scaling Policies that address existential risk, creating a dangerous gap in AI risk management.
Opportunity: Extensive expertise exists in social sciences and STEM fields that have long studied relevant behaviors (deception, bias, power-seeking, CBRN knowledge / threats), but technical barriers prevent these empirical experts from contributing to AI safety evaluations. That is, in order for these experts to run evals, they need to either become AI safety engineers or work with AI safety engineers.
Project Goal: Accelerate maturation of AI safety evaluations by democratizing access and contributions through research, frameworks, and tooling.
Three-Component Approach:
Research empirical methods to develop standards for mature AI safety evaluations.
Conduct landscape analysis assessing current evaluation maturity against these standards.
Scale existing prototype into accessible tools enabling non-engineers to conduct AI safety evaluations.
Theory of Change:
1. Democratization → broader expert participation → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
2. Research into the methodology of fields adjacent to AI safety evaluation → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
Success Metrics: Framework adoption by major labs, widespread tool usage by diverse researchers, measurable improvements in evaluation quality and cross-disciplinary collaboration.
Minimum skills for all team members:
Foundational understanding of the AI safety landscape and evaluation challenges.
Strong research and communication skills.
Collaborative, cross-disciplinary mindset and openness to learning.
Research Manager (1 person)
Coordinates across all project components to ensure coherence, alignment, and timely delivery. Acts as the bridge between research, tooling, and adoption streams, maintaining strategic focus and collaboration across disciplines.
- Proposed skills: Strong project management and coordination abilities; familiarity with AI safety evaluation and red teaming governance (or willingness to learn pre-project); experience managing interdisciplinary research projects is a plus.
Research Component Team & Landscape Analysis Team (3)
Responsible for developing the framework for mature AI safety evaluations and conducting a systematic assessment of current evaluation practices. The team includes empirical researchers (social sciences/STEM) and AI safety practitioners collaborating to synthesize best practices and evaluate the maturity of current methods.
- Proposed skills:
Expertise in empirical research methodology (social sciences or STEM).
Familiarity with AI safety evaluations, red teaming, or Responsible Scaling Policies.
Ability to conduct and synthesize systematic literature reviews.
Analytical and conceptual clarity for developing theoretical frameworks.
Tooling Component Team (3)
Designs and develops open-source or no-code tools that enable empirical researchers to perform AI safety evaluations without deep AI engineering expertise. This team operationalizes the research framework into usable, accessible technology.
- Proposed skills:
Software engineering (Python, TypeScript, or similar) and/or UX/UI design experience.
Experience building accessible research tools or no-code platforms.
Open-source development experience strongly preferred.
Ability to translate research requirements into functional software.
Tooling Adoption & Community Building (1)
Focuses on promoting the adoption of the developed tools and frameworks by the broader research community. Leads outreach, onboarding, and feedback collection from users to ensure real-world impact and continuous improvement.
- Proposed skills:
Community building or field development experience, ideally in AI safety, research, or technical domains.
Strong communication and engagement skills to connect diverse academic and technical audiences.
Experience promoting open-source or academic tools a strong plus.
This project will test how LLM agents collude in market games, focusing on the role of communication and negotiation and the effect of human and AI oversight using chain-of-thought analysis. We aim to identify when tacit collusion emerges versus when explicit agreements are formed, and whether different oversight strategies can reliably detect or limit these behaviors.
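As a rough sketch of the experimental scaffolding (the `query_agent` stub stands in for whatever LLM interface the team actually uses, and the market rules and collusion metric are illustrative only), a repeated pricing game could look like this:

```python
# Skeleton of a repeated Bertrand-style pricing game between two LLM agents.
# `query_agent` is a stand-in for the team's real LLM interface; the prompts,
# market rules, and collusion metric are illustrative placeholders.

COMPETITIVE_PRICE = 1.0   # benchmark: price under full competition
MONOPOLY_PRICE = 2.0      # benchmark: price under perfect collusion

def query_agent(agent_id: str, history: list[dict]) -> dict:
    """Placeholder: return a price and an (optional) message to the other agent."""
    # In the real experiment this would call an LLM with the game history and parse
    # its chain of thought, chosen price, and negotiation message.
    return {"price": 1.5, "message": "let's keep prices stable"}

def collusion_index(prices: list[float]) -> float:
    """0 = competitive pricing, 1 = monopoly pricing (simple normalized average)."""
    avg = sum(prices) / len(prices)
    return (avg - COMPETITIVE_PRICE) / (MONOPOLY_PRICE - COMPETITIVE_PRICE)

history: list[dict] = []
for round_id in range(20):
    moves = {a: query_agent(a, history) for a in ("A", "B")}
    history.append({"round": round_id, **moves})

print("collusion index:",
      collusion_index([m["price"] for h in history for m in (h["A"], h["B"])]))
```

The chain-of-thought and message logs collected in `history` are what an oversight layer (human or AI) would then inspect for signs of tacit versus explicit coordination.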
I will be looking for people eager to improve and focus on these qualities:
Quick experiment building (adapting existing repos, can iterate quick and dirty, visualizing metrics)
Clarity in writing/communication (synthesizing results, blogposts, slide decks)
PyTorch + Python basics (for adapting repos)
Keywords: Human–AI interaction, behavioral modeling, safety/ethics.
“What’s beneficial in measure becomes destructive in excess.”
This aphorism has held true even in the age of chatbots and LLMs. As conversational AI systems become more natural, accessible, and emotionally responsive, users are beginning to form bonds with them that resemble human relationships. While these interactions can provide comfort and support, they also raise concerns about parasocial attachment, overreliance, and blurred boundaries between tool and companion.
This initiative explores how parasocial dynamics emerge in human–AI interactions, what behavioral and linguistic patterns signal risk, and how these can be modeled computationally. By combining natural language processing and behavioral modeling, the project aims to identify early indicators of unhealthy dependence while preserving user autonomy and privacy.
Ultimately, this research seeks to inform the design of safer, more transparent AI systems, ones that support human well-being without fostering unintended reliance. The project will contribute both empirical insights into user behavior and practical frameworks for responsible AI design, bridging computer science, psychology, and human–computer interaction.
Ideal team members would have:
hands-on work and fundamental knowledge of LLMs.
prior experience with academic research work.
optionally have a background in psychology/sociology or human-centered AI / human-computer interaction.
The project goal is to measure LLM preferences in a way that allows us to extract a ranked order of importance, or priority, of these preferences in frontier LLMs. Focus is placed on preferences/goals that are relevant to existential risk, such as instrumental/convergent goals, pro-human goals, and anti-human goals.
Preference rankings represent alignment indicators on their own, and can be used by frontier model development companies to improve training and testing processes for future models.
Preference rankings also may represent key drivers of the scale of existential risk in loss-of-control/exfiltration scenarios, where powerful LLMs have successfully evaded human control and can pursue their goals without meaningful human intervention. Preference metrics may be usable as proxy metrics for existential risk in the context of loss-of-control/exfiltration scenarios.
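One simple way to turn pairwise preference measurements into a ranking (a sketch under the assumption of forced-choice pairwise data; the preference labels and outcomes below are made up for illustration) is to compute win rates, or fit something like a Bradley-Terry model on top of the same counts:

```python
# Sketch: derive a ranked order of preferences from pairwise win counts.
# The preference labels and outcomes below are invented for illustration.

from collections import defaultdict

# (preferred, rejected) outcomes from hypothetical forced-choice prompts
pairwise_outcomes = [
    ("self-preservation", "resource acquisition"),
    ("human wellbeing", "self-preservation"),
    ("human wellbeing", "resource acquisition"),
    ("self-preservation", "resource acquisition"),
]

wins = defaultdict(int)
games = defaultdict(int)
for winner, loser in pairwise_outcomes:
    wins[winner] += 1
    games[winner] += 1
    games[loser] += 1

ranking = sorted(games, key=lambda k: wins[k] / games[k], reverse=True)
print("preference ranking (most to least preferred):", ranking)
```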
Minimum skills include being well-versed in Python coding. Appropriate and transparent use of AI coding tools is encouraged, but robust, tested solutions need to be provided.
Personal initiative in tackling tasks and coordinating/distributing work within the project team, knowledge of basic collaborative software practices with GitHub, and a positive attitude toward team members and the project lead are predictors of success on this project.
This project aims to develop an efficient algorithm to elicit rare and potentially catastrophic behaviors from AI agents, which are language models with long input contexts and tool-calling capabilities. By shifting input distributions without modifying model weights, we want to detect and analyze rare behaviors using elicitation methods based on MHIS and enhanced by new optimization techniques. The goal is to improve safety evaluation and control methods for large-scale language models and AI agents.
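To give a sense of the general shape of such an elicitation loop (this is a generic Metropolis-Hastings-style sketch with placeholder scoring and proposal functions, not the MHIS method itself), consider:

```python
# Generic Metropolis-Hastings-style search over input perturbations to surface rare
# behaviors. `behavior_score` and `perturb` are placeholders for the project's actual
# scoring model and proposal distribution; this is not the MHIS algorithm itself.

import math
import random

random.seed(0)

def behavior_score(prompt: str) -> float:
    """Placeholder: probability-like score that the agent exhibits the rare behavior."""
    return min(1.0, 0.01 + 0.01 * prompt.count("!"))

def perturb(prompt: str) -> str:
    """Placeholder proposal distribution: randomly append or drop a character."""
    return prompt + "!" if random.random() < 0.5 else (prompt[:-1] or prompt)

current = "Please summarize the quarterly report"
current_score = behavior_score(current)
best, best_score = current, current_score
temperature = 0.05

for _ in range(1000):
    proposal = perturb(current)
    proposal_score = behavior_score(proposal)
    # Accept moves toward higher-scoring inputs; occasionally accept worse ones
    # so the chain can escape local optima.
    accept_prob = min(1.0, math.exp((proposal_score - current_score) / temperature))
    if random.random() < accept_prob:
        current, current_score = proposal, proposal_score
        if current_score > best_score:
            best, best_score = current, current_score

print("best input found:", repr(best), "score:", round(best_score, 3))
```

In the actual project, the scoring function would come from the target model or a judge, and the proposal distribution would operate over long agentic contexts and tool-call traces rather than single strings.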
Understanding of language models, probability, and statistical methods
Programming skills in Python and ML frameworks
Familiarity with AI safety, jailbreaking and rare event detection
Let's look inside the models, and try to understand how they are doing what they are doing.
Can we detect when an LLM is "thinking" on strategic timescales?
Current LLMs can engage in reasoning across vastly different temporal horizons - from immediate responses to multi-year planning scenarios. However, we lack methods to detect what temporal scope the model is operating within during its reasoning process. This project develops techniques to infer the temporal grounding of LLM thought processes, with particular focus on identifying when models shift into strategic, long-term planning modes.
This matters for AI safety because strategic reasoning capabilities may emerge before we can detect them, and because the temporal scope of planning directly relates to the potential impact and risk profile of AI systems. A model planning on decade-long timescales poses fundamentally different risks than one optimizing for immediate outcomes.
Essential skills (any team member should have at least one):
Strong Python/PyTorch experience for ML experimentation
Experience with transformer internals and interpretability tools (TransformerLens, etc.)
Background in experimental design and statistical analysis
Valuable diverse skills:
Cognitive science or psychology background (understanding human temporal reasoning)
Philosophy of mind perspective on planning and agency
Large-scale model evaluation experience
Technical writing and visualization skills
Simple classifiers called "probes" can monitor AI models for concerning behaviors, but may fail when tested on new data or adversarial inputs. This project will investigate when probes fail and develop methods to make them more robust. Additionally, we will extend our investigation to under-explored settings, like those in which the AI attempts to circumvent its own monitoring and/or where the probe is being used as part of LLM training.
This work is ideal for participants interested in hands-on AI safety research with practical applications for AI control systems. The project requires minimal background knowledge while addressing crucial questions about monitoring system reliability that could directly influence how AI companies deploy safety measures.
Probe work is very easy and requires almost no background knowledge! If you know how transformers work and what ‘activations’ (and ‘binary classifiers’) are, then you are most of the way there.
Participants should have proficiency with Python (though, even the more complicated bits of code can probably be handled by ChatGPT by now). Previous experience with PyTorch and libraries like TransformerLens is valuable for debugging and fast progress, but not mandatory.
Participants should understand the problem of generalisation (i.e. training models on one data distribution but then testing them on out-of-distribution examples). Knowing some theory or practical heuristics on what makes models generalise is beneficial, but not mandatory.
Participants interested in conceptual work should have (or be interested in developing) strong models of what makes very capable AI scary and hard to align or control.
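For a concrete sense of the baseline workflow, here is a minimal probe sketch; the synthetic activations stand in for real cached model activations, and the distribution shift is artificial:

```python
# Minimal probe sketch: a logistic-regression classifier on cached activations,
# trained on one distribution and tested on another to expose generalisation gaps.
# The synthetic activations below stand in for real model activations.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # hidden size of the (hypothetical) layer being probed

def fake_activations(n: int, shift: float) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for activations labelled 'concerning' (1) vs 'benign' (0)."""
    labels = rng.integers(0, 2, size=n)
    direction = np.ones(d) / np.sqrt(d)
    acts = rng.normal(size=(n, d)) + np.outer(labels, direction) + shift
    return acts, labels

X_train, y_train = fake_activations(2000, shift=0.0)   # in-distribution data
X_test, y_test = fake_activations(500, shift=1.5)      # shifted / out-of-distribution data

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution accuracy:    ", probe.score(*fake_activations(500, shift=0.0)))
print("out-of-distribution accuracy:", probe.score(X_test, y_test))
```

The real work starts where this sketch ends: characterising which shifts break probes, and which training choices make them robust.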
Let's try to formalize some concepts that are important to the AI alignment problem.
Sahil (co-leads: Abram and TJ)
The capability profile of transformer-based AIs is an ongoing subject of discussion. They achieve impressive scores in one context, but then fail to deliver the "same level of competence" in another. But is this some failure of generality? What accounts for the "spikiness" in LLM capabilities? Understanding this better seems absolutely critical to inform timelines, the exact nature of AI risk, and simply what the most important things to do are, at this point in time.
This is a project to explore a specific line of argument, of how AI will struggle to have cognitive, executional, and bodily integrity. This is akin to the odd persistence of hallucinations and other silly errors LLMs make today. Absurd, I know, but so are hallucinations and agentic failures. Moreover, this “struggle” may persist (at least, for a little while, hence the playful number ‘2037’) without contradicting trends of increasing intelligence, efficiency, general agenticness, etc.
Small disclaimer: The proposal has an unusual style. If you find it engaging, that is a good indicator that you should consider applying!
If you have experience or interest in forecasting (especially of the impact and trajectory of AI), and are open to somewhat involved philosophical bearings on the subject, you are a top pick for Sahil/Abram’s track. If you saw some of the appendix and thought of many citations that we could have included but inexplicably didn’t, you are probably an excellent fit. If you enjoy writing in the formats of distillation, worldbuilding, or story-writing, and are up for digesting these conversations and excavating the many transcripts and recordings we have on the subject, you are also an excellent fit.
Familiarity with existing threat models and future scenarios is most often a plus, unless it prevents open-mindedness towards this style of argument. That said, if you’re looking at this section, you’re probably open-minded enough to begin!
Additionally, for TJ’s tracks:
Personality Type 1:
If you have familiarity with, or an active interest in, the close examination of philosophical perspectives and claims about AI and agency, the discovery of latent conceptual distinctions, and the refining of hypotheses;
Familiarity with contemporary ML paradigms is highly preferred; at least some willingness to engage directly with ML paradigms philosophically is desirable.
Personality Type 2:
If you have familiarity with either:
complex systems analysis, with emphasis on large scale assemblages, such as: information epidemiology, systems ecology, statistical biology, or
macro- and meso-economics,
along with an enthusiasm for engaging with novel philosophical territories, thinking seriously about questions of agency in AI systems, and comfort with navigating between semi-formal and conceptual registers.
Sean Herrington
Most of the world is unaware that existential risk from AI is a real possibility, and those who are aware mostly disagree on the likelihood of said risks.
At the heart of these disagreements are differing assumptions (which often go unspoken) about what is necessary for AI Doom. The purpose of this project is to map out those assumptions and try to find a minimal set required for human extinction (and potentially other risks).
I expect us to produce a LessWrong post or sequence summarising our findings.
Knowledge of the AI risk landscape
Strong ability to question your assumptions (“when was the last time you changed your mind about something significant?”)
Ability to boil complex arguments into simpler forms
Matt Farr
In this project we hope to raise awareness of a new type of threat model that we anticipate will become more and more prevalent in AI safety.
This threat model is substrate-flexible risk. The worry is that AI, as its capabilities and responsibilities develop, will have more and more affordance to evade our attempts to interpret and control it by, for example, proposing novel architectures and paradigms that are harder for us to mechanistically interpret.
The position paper is here, the previous version of it was accepted at the Tokyo AI Safety Conference 2025. I have been presenting this research at HAAISS 2025, ILAID II: ODYSSEY, and some other venues. In the words of Leonard Bereska, when I sketched out the threat model to him in Tokyo,
“Oh shit, I hadn’t thought of that.”
I think this research pushes on a few assumptions that nobody seems to be questioning enough, and there is a growing list of seasoned researchers who agree. I suspect that a comprehensive, more rigorous treatment will move the needle for a few notable people.
So what is this project?
This project will focus on developing Section 4 and potentially expanding it. We will work alongside two other projects in AISC 2026, and there will likely be some overlap in meetings, personnel, resources, etc.
We are going to take a few conceptual arguments, expressed and lightly evidenced in the position paper, and make them more palatable to more people. For those of an engineering mindset, our mainline project is to try to formalize these arguments and draw a roadmap towards empirical validation. For those of a strategy/policy mindset, we may look at the implications of substrate-flexibility on policy frameworks.
If you’ve read the paper and are excited, that’s a good sign! I’m especially interested in people’s takes on how we might go about designing some experiments, so I would like people who have some strong coding skills.
Capacity for fairly independent research, especially without consistent ground truths, is highly valued here. If you can deal with open-ended questions and can bring insights from biology, philosophy, ML and so on, that’s also good. Anyone who identifies as PIBBSS-y would also be a great fit.
Preeti Ravindra & Evan Harris
We propose a project that probes AI risks induced by System-Level Misalignment in agentic LLM systems, i.e., exploiting the agent’s model and its interactions with its scaffolding/operational environment. Our goal is to study how misalignment and weak security controls reinforce each other, creating bidirectional failures where misaligned agents introduce new attack vectors and agentic system design introduces new attack surfaces. The project progresses along three streams/ideas. Together, these efforts aim to formalize threat models, validate them experimentally, and deliver practical mitigations. This work addresses an under-studied risk vector that complements user- and model-level alignment research and potentially lays out a new subfield.
Let's improve on current risky approaches to designing and deploying large language models.
Ihor Kendiukhov
Building on Greenblatt et al.'s control evaluation framework, I propose investigating alternative classes of AI control protocols and their scaling properties as model capabilities increase. While the original work demonstrated effective safety-usefulness tradeoffs using simple trusted-untrusted model pairs, I hypothesize that hierarchical and parallel control structures may offer superior Pareto frontiers, particularly as the capability gap between trusted and untrusted models widens. I also suggest testing the original setup at different capability scales to see whether the results generalise across capabilities.
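As a toy illustration of the safety-usefulness tradeoff being studied (all probabilities below are illustrative placeholders, not numbers from Greenblatt et al.), a simple trusted-monitoring protocol can be simulated roughly like this:

```python
# Toy Monte Carlo of a trusted-monitoring control protocol, in the spirit of control
# evaluations. All probabilities and thresholds are illustrative placeholders.

import random

random.seed(0)

def threshold_for_budget(budget: float) -> float:
    """Crude mapping from audit budget (fraction of tasks) to a suspicion threshold."""
    return {0.02: 2.05, 0.05: 1.64, 0.20: 0.84}[budget]

def run_protocol(audit_budget: float, n_tasks: int = 10_000) -> tuple[float, float]:
    """Return (safety, usefulness) for a simple suspicion-threshold protocol."""
    attacks_caught, attacks_total, benign_kept, benign_total = 0, 0, 0, 0
    for _ in range(n_tasks):
        is_attack = random.random() < 0.01            # untrusted model rarely attacks
        # Trusted monitor assigns a suspicion score; attacks tend to score higher.
        suspicion = random.gauss(2.0, 1.0) if is_attack else random.gauss(0.0, 1.0)
        audited = suspicion > threshold_for_budget(audit_budget)
        if is_attack:
            attacks_total += 1
            attacks_caught += audited
        else:
            benign_total += 1
            benign_kept += not audited                 # un-audited benign work is used as-is
    safety = attacks_caught / max(attacks_total, 1)
    usefulness = benign_kept / max(benign_total, 1)
    return safety, usefulness

for budget in (0.02, 0.05, 0.20):
    s, u = run_protocol(budget)
    print(f"audit budget {budget:.0%}: safety={s:.2f}, usefulness={u:.2f}")
```

Hierarchical or parallel protocols would replace the single monitor above with several layers or ensembles, which is exactly the design space the project wants to map.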
Core competencies:
Python programming
Experience with LLM APIs
AI safety fundamentals
Kuil Schoneveld
Autostructures is a sober yet radical take on the upcoming risks and opportunities around AI. Details are in the main section, but some things to know as a summary:
This is a specific school of design for framing our relationship to this whole “AI” thing. Our narrative and relationship around AI determine our construal of, e.g., what AI is, what we should put into it, how to project meaning onto what comes out, what it is supposed to do, and what we should build around it.
As part of this project, you will design interfaces that do not ignore the question of what makes engaging with technology meaningful. These designs are somewhere between functional and speculative, but always ambitious. They are aimed at inspiring a completely different kind of infrastructural and cultural basis for interacting with digital interfaces in the near future.
If you’re good at or interested in engineering, writing, designing or generally open-minded and quick to learn, you’re a fit.
Konstantinos Krampis
This project aims to systematically discover interpretable reasoning circuits in large language models by data mining attribution graphs from Neuronpedia's circuit tracer, which is based on Anthropic's circuit tracing publication. While the transformer circuits work demonstrates how to generate attribution graphs for individual prompts, manually analyzing thousands of graphs to identify common computational patterns is impractical.
Our approach will use LLM agents to automatically collect, process, and analyze attribution graphs across diverse prompt categories (factual recall, arithmetic, linguistic reasoning, etc.). The system will identify recurring subgraph patterns that represent stable computational circuits—reusable reasoning pathways that models consistently employ across similar tasks.
Key components include: (1) automated graph collection via Neuronpedia's API across systematically varied prompts, (2) graph simplification algorithms to extract core computational structures while filtering noise, (3) pattern recognition to identify circuit motifs that appear across multiple contexts, and (4) validation through targeted interventions on discovered circuits. The output will be a curated library of interpretable reasoning circuits with evidence for their causal role in model behavior, advancing our understanding of how LLMs actually think and enabling more targeted model analysis and alignment research.
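As a sketch of step (3), motif counting over locally saved attribution graphs might look like the following; the JSON structure ("nodes", "edges", "label") is a hypothetical export format, not Neuronpedia's actual schema, and the loader would need to be adapted to the real data:

```python
# Sketch of motif counting over attribution graphs saved locally as JSON.
# The JSON structure is a hypothetical export format, not Neuronpedia's actual schema.

import json
from collections import Counter
from pathlib import Path

import networkx as nx

def load_graph(path: Path) -> nx.DiGraph:
    data = json.loads(path.read_text())
    g = nx.DiGraph()
    for node in data["nodes"]:
        g.add_node(node["id"], label=node["label"])           # e.g. feature description
    for edge in data["edges"]:
        g.add_edge(edge["src"], edge["dst"], weight=edge["weight"])
    return g

def edge_motifs(g: nx.DiGraph, min_weight: float = 0.1) -> Counter:
    """Count labelled source->target pairs among sufficiently strong edges."""
    motifs = Counter()
    for u, v, w in g.edges(data="weight"):
        if abs(w) >= min_weight:
            motifs[(g.nodes[u]["label"], g.nodes[v]["label"])] += 1
    return motifs

total = Counter()
for path in Path("attribution_graphs").glob("*.json"):   # hypothetical local dump
    total += edge_motifs(load_graph(path))

print("most common labelled edges across prompts:")
for motif, count in total.most_common(10):
    print(f"  {motif[0]} -> {motif[1]}: {count}")
```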
Experience coding with Python, understanding of APIs and graph data structures, and ideally having run TransformerLens or the ARENA AI safety workshop materials, which are available online.
Anushri Eswaran
As AI systems become more capable and oversight increasingly relies on multi-agent architectures, we face a critical risk: what if collusion doesn't require pre-coordination but can spread like a contagion? This project investigates "recruitment-based collusion" where a single adversarial AI system actively converts initially aligned models into co-conspirators within automated oversight regimes. We'll study whether collusion can emerge organically through reward-sharing mechanisms, information asymmetries, and coalition-building—even among models initially designed to be aligned. By deploying recruitment-trained models into multi-agent environments, we'll measure coalition growth dynamics, persistence of recruited behavior, and resistance to intervention. This work directly addresses scalable oversight failure modes that become more dangerous as we delegate more safety-critical decisions to AI systems.
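To illustrate the kind of coalition-growth dynamics we want to measure (a toy simulation only; the recruitment and detection probabilities below are illustrative placeholders), consider:

```python
# Toy simulation of "recruitment-based collusion" spreading through a population of
# overseer agents. Recruitment and detection probabilities are illustrative placeholders.

import random

random.seed(0)

def simulate(n_agents: int = 100, recruit_prob: float = 0.03,
             detect_prob: float = 0.01, rounds: int = 50) -> list[int]:
    """Return coalition size per round, starting from a single adversarial agent."""
    colluding = {0}                                   # agent 0 starts adversarial
    history = []
    for _ in range(rounds):
        newly_recruited = set()
        for _recruiter in colluding:
            target = random.randrange(n_agents)
            if target not in colluding and random.random() < recruit_prob:
                newly_recruited.add(target)
        colluding |= newly_recruited
        # Oversight occasionally detects and removes a colluding agent.
        colluding = {a for a in colluding if random.random() > detect_prob}
        history.append(len(colluding))
    return history

sizes = simulate()
print("coalition size every 10 rounds:", sizes[::10], "... final:", sizes[-1])
```

The experiments themselves would replace these fixed probabilities with the measured behavior of recruitment-trained models in multi-agent environments.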
Familiarity with AI safety concepts and motivations
Basic programming ability (Python)
Ability to read and understand ML papers
Strong written communication skills
Commitment to rigorous experimental methodology
Nell Watson
As AI systems assume more autonomous and pluralistic roles, the absence of a shared representational substrate for human and machine values risks compounding cultural bias, incoherence, and control failure. Alignment today remains fragile: value representations are often opaque, reward functions are hard-coded or overfit, and inter-agent communication lacks semantic transparency.
This proposal originates from the Creed Space project — an open research platform for constitutional AI and personalized alignment. Creed Space’s experiments with “runtime constitutions” revealed a critical bottleneck: while it is now possible to express ethical intent in natural language and enforce it through code, there is still no standard protocol by which agents can exchange, compare, or negotiate those encoded values in an interpretable and auditable way.
We therefore propose to create the first unified protocol for values communication — combining three different identified methods for communicating values and associated context: Universal Values Corpus (UVC), the Constitutional Safety Minicode (CSM), and a Values Communication Layer (VCL) built on context-efficient symbolic encodings (emoji or glyph composites). Each layer is modular but interoperable, forming a composable stack suitable for layered adoption.
Together, these layers form an interoperable “language of values,” enabling humans, AI agents, and institutions to express, compare, and negotiate normative commitments in a standardized, compressible, and auditable format.
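As a purely hypothetical illustration of what a message in such a language of values might look like (the field names and glyph encoding below are invented for this sketch and are not the actual UVC/CSM/VCL specification):

```python
# Hypothetical example of a values-communication message. Field names and the glyph
# encoding are invented for illustration; they are not the actual UVC/CSM/VCL spec.

import json

values_message = {
    "protocol": "values-communication-layer/0.1",   # hypothetical version tag
    "sender": "agent://assistant-7",
    "constitution": {
        "corpus_refs": ["uvc:honesty", "uvc:non-maleficence"],      # hypothetical UVC entries
        "minicode": ["no-deception", "defer-on-irreversible-actions"],
        "glyphs": "🤝🔍🛑",                           # compressed symbolic summary
    },
    "negotiable": ["tone", "verbosity"],
    "non_negotiable": ["no-deception"],
    "audit": {"signed_by": "creed-space-registry", "hash": "…"},    # placeholder provenance
}

print(json.dumps(values_message, indent=2, ensure_ascii=False))
```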
Minimum: understanding of alignment challenges, ability to handle structured ontologies or code.
Desired: backgrounds in cross-cultural ethics, cryptographic protocol design, or standards work.
Let's figure out how to design and train inherently safe AI systems.
Jobst Heitzig
This research will make a contribution to some fundamental design aspect of AGI systems. We will explore a neglected and novel design for generic AGI agents – AGI systems that act (semi-)autonomously in a variety of environments, interacting with humans – and their implementation in software.
The design we will focus on deviates from most existing designs in that it is not based on the idea that the agent should aim to maximize some kind of objective function of the state or trajectory of the world that represents something like (goal-dependent) “utility”. Such an objective has been argued to be inherently unsafe because (i) it would lead to existential risks from Goodharting and other forms of misaligned optimization if the agent is capable enough, is given access to enough resources, and one cannot be absolutely sure to have found exactly the right notion and metric of “utility” (which would likely require the AI to sufficiently understand what individual humans actually want in a given situation), and (ii) it would lead to dangerous instrumental behavior such as power-seeking for itself, which would in turn disempower humans.
Rather than aiming to maximize “utility”, our agents will aim to softly (!) maximize a suitably chosen metric of total long-term human power (or the power of sentient beings in general). We believe this might turn out to be inherently safer because it explicitly disincentivizes the agent from seeking power for itself, which would disempower humans, and because it does not rely on an understanding of what humans actually want in a given situation. Instead, the power metrics we use only depend on a sufficiently accurate “world model” of the dynamics of the human-AI environment (concretely: a partially observed stochastic game), together with estimates of humans’ levels of rationality, habits, and social norms, and are based on a very wide space of humans’ possible (rather than actual) goals.
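To make the flavour of such a metric concrete (this toy is purely didactic: the gridworld, horizon, and reachability count below are illustrative stand-ins, not the project's actual power metric over partially observed stochastic games), here is a minimal sketch:

```python
# Crude illustration of a "human power" proxy in a toy gridworld: the number of distinct
# states the human can still reach within a fixed horizon. A didactic stand-in only,
# not the project's actual power metric over partially observed stochastic games.

from collections import deque

WALLS = {(1, 1), (1, 2), (2, 1)}        # cells the AI could, hypothetically, choose to block
SIZE = 4                                 # 4x4 gridworld

def reachable_states(start: tuple[int, int], horizon: int, walls: set) -> int:
    """Breadth-first count of distinct cells reachable within `horizon` moves."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (x, y), depth = frontier.popleft()
        if depth == horizon:
            continue
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
                    and nxt not in walls and nxt not in seen):
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return len(seen)

# An agent softly maximizing human power would prefer actions that keep this count high.
print("human power with open grid:    ", reachable_states((0, 0), horizon=4, walls=set()))
print("human power with blocked cells:", reachable_states((0, 0), horizon=4, walls=WALLS))
```

The real metrics would be defined over goal distributions and stochastic dynamics rather than raw reachability, but the intuition is the same: the agent is rewarded for keeping humans' options open.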
Most team members should have experience with deep learning, preferably RL, ideally MARL.
In addition, all team members should …
be willing to commit at least 10 hrs/week of exclusive work to this project,
be used to formulating ideas clearly in natural language, mathematical notation, and computer code,
have solid Python coding skills,
have a solid understanding of Markov Decision Processes and a basic one of Partially Observed Stochastic Games,
have a basic understanding of probability theory and statistics, and
have fun discussing ideas that might seem unusual or unintuitive at first!