How can consistency training help language models resist sycophantic prompts and jailbreak style attacks while keeping their capabilities intact? Large…
As large language models surpass human-level capabilities, providing accurate supervision becomes increasingly difficult. Weak-to-strong learning, which uses a less capable…