ideas | Ashwin Saraswatula

These are open questions across reasoning/interpretability that I find interesting and important.

How does in-context learning (ICL) update internal representations? Does the forward pass implement something like implicit gradient descent during ICL, even though the weights themselves stay fixed?
“Data Drives Behavior”: how does training on particular types of data increase or decrease particular model behaviors?
How can we steer fine-tuning so that models acquire desired internal circuits while suppressing the emergence of unwanted or unsafe mechanisms?
What mechanisms underlie OOD reasoning failures, and can we build models whose internal representations remain stable and trustworthy under distribution shift?
How can we design models capable of continual, in-context learning that refine their internal representations through interaction with their environment?
What methods could move us beyond the next-token prediction paradigm and deepen a model’s creative output?
How do a model’s internal representations and implicit world model give rise to its goals, preferences, and emergent personality-like behaviors?
As we scale test-time compute for advanced models (scratchpads, chain-of-thought, tree search), what new internal mechanisms emerge, and how can we interpret or control them?

Recent papers I’ve enjoyed reading!

Reasoning:

Interpretability:

Alignment/Safety: