Blog | Kunsh Singh

cat ~/thoughts/*.md

BL0G_FEED

Notes from the rabbit holes — interpretability, ML, and whatever I'm nerding out on

class.data.load(535)

Long-form, occasionally over-caffeinated write-ups. Click in.

interp.decode(activations)

Jun 7, 2026 · 11 min

Reading a Model's Mind

How Natural Language Autoencoders translate a model's internal state into plain English

Anthropic trained two LLMs to talk to each other in activation-space and accidentally built a tool that translates a model's internal state into plain English. Here's how Natural Language Autoencoders work, why I can't stop thinking about them, and where I'd take them next.

Interpretability · NLA · Autoencoders · RL · Alignment

read.entry()→