Research Papers alignment model_training model_spec research

Using Model Spec Midtraining (MSM), Anthropic finds that explaining underlying values—rather than just rules—yields better generalization in alignment training.
Using MSM, we can also empirically study which model specs or constitutions yield the best generalization from alignment training. Specifying rules works to some extent, but explaining the values underlying those rules (or adding more detailed subrules) is even better. https://t.co/b2XKbyBGeI
