AI ALIGNMENT FORUM
Connor Kissane

Posts

34

SAEs are highly dataset dependent: a case study on the refusal direction

7mo

0

23

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

7mo

3

35

Base LLMs refuse too

8mo

10

28

SAEs (usually) Transfer Between Base and Chat Models

11mo

0

18

Attention Output SAEs Improve Circuit Analysis

1y

0

34

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

1y

0

28

Attention SAEs Scale to GPT-2 Small

1y

0

35

Sparse Autoencoders Work on Attention Layer Outputs

1y

3
