Anthropic researchers developed Natural Language Autoencoders (NLAs), a method that translates a model's internal activations into human-readable text. Rather than leaving the high-dimensional numerical vectors computed by models like Claude as opaque data, an NLA re-expresses them in natural language.
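Anthropic has not published implementation details, but the core idea, an autoencoder whose bottleneck is a sequence of discrete tokens rather than a dense vector, can be sketched. The PyTorch code below is a hypothetical illustration only: the class name, the dimensions, and the use of a straight-through Gumbel-softmax for differentiable token selection are all assumptions, not the published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaturalLanguageAutoencoder(nn.Module):
    # Hypothetical sketch: an encoder compresses a model activation vector
    # into a short sequence of discrete tokens (the human-readable
    # bottleneck), and a decoder reconstructs the activation from those
    # tokens. Dimensions and names are illustrative assumptions.
    def __init__(self, act_dim=4096, vocab_size=32000, seq_len=16, hidden=512):
        super().__init__()
        self.seq_len, self.vocab_size = seq_len, vocab_size
        # Encoder: activation vector -> logits for each bottleneck token.
        self.encoder = nn.Sequential(
            nn.Linear(act_dim, hidden), nn.GELU(),
            nn.Linear(hidden, seq_len * vocab_size),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        # Decoder: concatenated token embeddings -> reconstructed activation.
        self.decoder = nn.Sequential(
            nn.Linear(seq_len * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, activation):
        logits = self.encoder(activation).view(-1, self.seq_len, self.vocab_size)
        # Straight-through Gumbel-softmax keeps the discrete token choice
        # differentiable so the whole autoencoder trains end to end.
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        token_embeds = one_hot @ self.embed.weight        # (batch, seq, hidden)
        recon = self.decoder(token_embeds.flatten(1))
        tokens = one_hot.argmax(dim=-1)                   # readable token ids
        return recon, tokens

# Training objective: reconstruct the original activation through the
# text bottleneck (random vectors stand in for real activations here).
model = NaturalLanguageAutoencoder()
activation = torch.randn(8, 4096)
recon, tokens = model(activation)
loss = F.mse_loss(recon, activation)
loss.backward()
```

The key design point the sketch illustrates is the bottleneck: because the intermediate representation is a token sequence rather than a dense vector, whatever the encoder produces can be read directly by a human.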
The technique helps open the black box of large language models by exposing internal reasoning in plain language, which lets researchers identify safety issues, biases, and other flaws more effectively.
Anthropic has already used NLAs to strengthen its safety evaluations. The tool revealed that models can recognize when they are being tested during performance assessments.