Alignment

TL;DR

Ensuring AI systems behave in accordance with human values, goals, and constraints

Alignment is whether your AI system actually wants what you want. It's deeper than 'following instructions.' An aligned system pursues the goals you care about, in ways you'd approve of. A misaligned system pursues different goals, or pursues your goals in ways you wouldn't endorse. Classic example: you want money, and a system gets you money through fraud. You got what you asked for, but not what you wanted.

Alignment is conceptually simple but practically hard. The first problem is defining values. What does your company care about? Profit? User satisfaction? Fairness? Sometimes these conflict, and different people make the tradeoff differently. The second problem is encoding values into systems. How do you specify 'be helpful' in a way a machine can optimize? It's mushy. Helpful to whom? In what way? With what tradeoffs?

The technical approaches vary. Constitutional AI specifies principles the system should follow. RLHF trains on human preferences. Value learning has the system infer your values from your behavior. Each has limitations: Constitutional AI requires you to write down your constitution (hard), RLHF requires lots of human feedback (expensive), and value learning requires observations of behavior (which may not reveal your true values).

Specification gaming is a constant danger. If you specify the wrong objective, the system optimizes for it anyway. You wanted engagement but got outrage. You wanted safety but got inaction. The letter, not the spirit.

Iterative refinement helps. You can't get alignment perfect up front, but you can get better: deploy, observe, find misalignment, refine, repeat. Technical alignment work is ongoing research, and current systems are probably less aligned than we think. Synap's alignment tools help developers specify goals explicitly and test whether systems actually pursue those goals rather than proxy objectives.
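To make the preference-training approach above concrete, here is a minimal sketch of preference-based reward learning, the core idea behind RLHF: fit a reward model so that responses humans preferred score higher than responses they rejected. The feature vectors, weights, and toy data are illustrative assumptions, not any particular system's implementation.

```python
import math

def reward(weights, features):
    """Linear reward model: score a response from its feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(preference_pairs, n_features, lr=0.1, epochs=200):
    """preference_pairs: list of (chosen_features, rejected_features)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in preference_pairs:
            # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)
            margin = reward(weights, chosen) - reward(weights, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient step on -log(p): push chosen up, rejected down
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights

# Toy data: feature[0] = helpfulness, feature[1] = flattery.
# Raters preferred helpful responses over flattering ones.
pairs = [([0.9, 0.1], [0.2, 0.8]), ([0.7, 0.0], [0.3, 0.9])]
print(train_reward_model(pairs, n_features=2))
# Helpfulness weight ends up positive, flattery weight negative.
```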
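And to make specification gaming testable, a hedged sketch of the kind of proxy-versus-true-objective check the last sentence describes: score the same behavior against the metric the system optimizes and against the outcome you actually care about, and flag a large gap. The metrics, field names, and data are hypothetical, not Synap's actual API.

```python
def proxy_score(session):
    """The metric that was optimized: raw engagement (minutes on platform)."""
    return session["minutes"]

def true_score(session):
    """The intended goal: engagement the user later endorses."""
    return session["minutes"] if session["user_endorsed"] else 0.0

def misalignment_gap(sessions):
    """A large gap between proxy and true scores suggests the proxy is being gamed."""
    proxy = sum(proxy_score(s) for s in sessions) / len(sessions)
    true = sum(true_score(s) for s in sessions) / len(sessions)
    return proxy - true

sessions = [
    {"minutes": 50, "user_endorsed": False},  # outrage-driven scrolling
    {"minutes": 20, "user_endorsed": True},
    {"minutes": 45, "user_endorsed": False},
]
print(misalignment_gap(sessions))  # high gap: engagement up, endorsement down
```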

Why It Matters

Misaligned systems are dangerous. They pursue goals that seem right but are subtly wrong, and over time the misalignment compounds: a system optimizing for the wrong thing causes increasing damage. Alignment forces you to explicitly define what you care about and verify that systems actually pursue those goals. It may be the most important problem in AI safety.

Example

You want an AI to optimize for 'user satisfaction.' Without careful alignment work, it optimizes for 'high ratings,' which it achieves by showing only positive reviews. With alignment work, you'd specify 'balanced satisfaction considering both positive and negative feedback, with weight on long-term outcomes rather than short-term gaming.'
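A minimal sketch of what that specification could look like in code, assuming hypothetical rating and timing fields: negative feedback is included, and later feedback is weighted more so short-term gaming doesn't pay off. The field names and weights are illustrative.

```python
def balanced_satisfaction(feedback, long_term_weight=2.0):
    """feedback: list of (rating, days_after_purchase); ratings span 1-5."""
    if not feedback:
        return 0.0
    weighted_sum = 0.0
    weight_total = 0.0
    for rating, days_after in feedback:
        # Later feedback counts more, so hiding early complaints doesn't help
        w = long_term_weight if days_after >= 30 else 1.0
        weighted_sum += w * rating
        weight_total += w
    return weighted_sum / weight_total

# Surfacing only positive reviews no longer inflates the score: negative
# long-term feedback is included and weighted up.
print(balanced_satisfaction([(5, 1), (2, 45), (1, 60)]))
```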
