publications | Ashwin Saraswatula

2026

MSE-Break: Steering Internal Representations to Bypass Refusals in Large Language Models

Ashwin Saraswatula, Pranav Balabhadra, and Pranav Dhinkar

International Conference on Machine Learning (ICML) Actionable Interpretability, 2025; under review at ICLR Main Track, 2026

The flexibility of internal concept embeddings in large language models (LLMs) enables advanced capabilities like in-context learning—but also opens the door to adversarial exploitation. We introduce MSE-Break, a jailbreak method that optimizes a soft-prompt prefix via gradient descent to minimize the mean squared error (MSE) between harmful concept embeddings in refused and accepted contexts. The resulting soft prompt p is concept-specific but prompt-general, enabling it to jailbreak a wide range of queries involving that concept without further tuning. Applied to four popular open-source LLMs—including Gemma-2B-IT and LLaMA-3.1-8B-IT—MSE-Break achieves attack success rates exceeding 90%. Its interpretability-driven design enables MSE-Break to outperform existing methods like GCG and AutoDAN—while converging in a fraction of the time. We find that harmful concept embeddings are linearly separable between refused and accepted contexts—structure that MSE-Break actively exploits. We further show that concept representations can be drastically steered in-context with as little as a single token. Our findings underscore the brittleness of LLM representations—and their susceptibility to targeted manipulation—highlighting the urgency for more robust and interpretable safety mechanisms.

Data Whitening Improves Sparse Autoencoder Learning

Ashwin Saraswatula and David Klindt

Association for the Advancement of Artificial Intelligence (AAAI) XAI4Science, 2026

Sparse autoencoders (SAEs) have emerged as a promising approach for learning interpretable features from neural network activations. However, the optimization landscape for SAE training can be challenging due to correlations in the input data. We demonstrate that applying PCA Whitening to input activations—a standard preprocessing technique in classical sparse coding—improves SAE performance across multiple metrics. Through theoretical analysis and simulation, we show that whitening transforms the optimization landscape, making it more convex and easier to navigate. We evaluate both ReLU and Top-K SAEs across diverse model architectures, widths, and sparsity regimes. Empirical evaluation on SAEBench, a comprehensive benchmark for sparse autoencoders, reveals that whitening consistently improves interpretability metrics, including sparse probing accuracy and feature disentanglement, despite minor drops in reconstruction quality. Our results challenge the assumption that interpretability aligns with an optimal sparsity–fidelity trade-off and suggest that whitening should be considered a standard preprocessing step for SAE training.

Bridging The Von Neumann Gap: Why LLMs Haven’t Made Novel Discoveries

Ashwin Saraswatula

Association for the Advancement of Artificial Intelligence (AAAI) XAI4Science, 2026

Large language models (LLMs) have been trained on vast data spanning nearly every scientific discipline, yet they rarely produce meaningful novel discoveries. Human polymaths such as John von Neumann routinely generated breakthroughs across disparate fields (from game theory to quantum mechanics to the very architecture of the modern computer) by connecting insights across domains. We argue this gap reflects a structural limitation of the LLM paradigm rather than a problem of scale. Drawing on Piaget’s theory of cognitive development and Gentner’s structure-mapping, we contend novel discovery depends on two core processes: constructing nuanced internal schemas of the external world and flexibly redeploying them via analogical mapping. Without embodied data or exploration, LLMs form shallow world models; and because their architectures optimize for statistical efficiency, they struggle to extend analogies out of distribution in ways that capture relational structure across domains. Without rethinking training environments and architectures, LLMs will remain constrained to weak abstraction rather than the deep reasoning required for scientific innovation.