Ideas
These are open questions across reasoning/interpretability that I find interesting and important.
- How does in-context learning update internal representations? Does ICL implicitly implement something like gradient descent within the forward pass, even though the weights themselves are frozen?
- “Data Drives Behavior” → How does training on particular types of data amplify or suppress corresponding model behaviors?
- How can we steer fine-tuning so that models acquire desired internal circuits while suppressing the emergence of unwanted or unsafe mechanisms?
- What mechanisms underlie OOD reasoning failures, and can we build models whose internal representations remain stable and trustworthy under distribution shift?
- How can we design models capable of continual, in-context learning that refine their internal representations through interaction with their environment?
- What methods could move us beyond the next-token prediction paradigm and deepen a model’s creative output?
- How do a model’s internal representations and implicit world model give rise to its goals, preferences, and emergent personality-like behaviors?
- As we scale test-time compute for advanced models (scratchpads, chain-of-thought, tree search), what new internal mechanisms emerge, and how can we interpret or control them?
Recent Papers I’ve enjoyed reading!
Reasoning:
- Recursive Language Models (Zhang et al., 2025)
- Continuous Autoregressive Language Models (Shao et al., 2025)
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (Zhu et al., 2025)
- Parallel trade-offs in human cognition and neural networks: The dynamic interplay between in-context and in-weight learning (Russin et al., 2025)
Interpretability:
- On the Biology of a Large Language Model (Lindsey et al., 2025)
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models (Karvonen et al., 2024)
- The Platonic Representation Hypothesis (Huh et al., 2024)
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (Kantamneni et al., 2025)
Alignment/Safety:
- Subliminal Learning (Cloud et al., 2025)
- Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples (Souly et al., 2025)
- Eliciting Secret Knowledge From Language Models (Cywinski et al., 2025)
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models (Chen et al., 2025)