2026-06-13 –, Salon B (Rocky Linux Ballroom)
Sparse autoencoders reveal how large language models encode meaning beyond individual neurons, extracting interpretable features—from DNA and code errors to multilingual concepts—and advancing practical mechanistic interpretability.
An overview of sparse autoencoders as a tool for decoding hidden representations in language models, showing how they uncover meaningful, human-interpretable features and circuits.