Cracking the Black Box: Using Sparse Autoencoders to Decode What Language Models Really Learn
Yulia Kuznetsova
Sparse autoencoders reveal how large language models encode meaning beyond individual neurons, extracting interpretable features—from DNA and code errors to multilingual concepts—and advancing practical mechanistic interpretability.
Artificial Intelligence
Salon B (Rocky Linux Ballroom)