designs and provides at least one case of a provably beneficial system in the sense
introduced above. The overall approach resembles mechanism-design problems in
economics, wherein one incentivizes other agents to behave in ways beneficial to the
designer. The key difference here is that we are building one of the agents in order to
benefit the other.
There are reasons to think this approach may work in practice. First, there is
abundant written and filmed information about humans doing things (and other humans
reacting). Technology to build models of human preferences from this storehouse will
presumably be available long before superintelligent AI systems are created. Second,
there are strong, near-term economic incentives for robots to understand human
preferences: If one poorly designed domestic robot cooks the cat for dinner, not realizing
that its sentimental value outweighs its nutritional value, the domestic-robot industry will
be out of business.
There are obvious difficulties, however, with an approach that expects a robot
to learn underlying preferences from human behavior. Humans are irrational,
inconsistent, weak-willed, and computationally limited, so their actions don’t always
reflect their true preferences. (Consider, for example, two humans playing chess.
Usually, one of them loses, but not on purpose!) So robots can learn from nonrational
human behavior only with the aid of much better cognitive models of humans.
Furthermore, practical and social constraints will prevent all preferences from being
maximally satisfied simultaneously, which means that robots must mediate among
conflicting preferences—something that philosophers and social scientists have struggled
with for millennia. And what should robots learn from humans who enjoy the suffering
of others? It may be best to zero out such preferences in the robots’ calculations.
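One common modeling strategy in the preference-learning literature is to treat the human as noisily rational: better options are chosen more often, but not always. The short Python sketch below is a minimal illustration of that idea under invented assumptions, not a method described in this document; the scenario, numbers, and function names are all hypothetical. It infers how heavily a human weights sentimental value against nutritional value from choices that are mostly, but not perfectly, consistent.

    # Minimal sketch of preference learning under "noisy rationality":
    # the robot assumes the human usually, but not always, picks the option
    # with higher utility (a Boltzmann / softmax choice model) and infers
    # how heavily the human weights sentimental value against nutrition.
    # The scenario, numbers, and names are illustrative only.

    import math

    def choice_prob(utilities, chosen, beta=2.0):
        """Softmax probability that a noisily rational human picks `chosen`."""
        exps = [math.exp(beta * u) for u in utilities]
        return exps[chosen] / sum(exps)

    def log_likelihood(weight, observations):
        """How well a candidate sentimental-value weight explains the choices.
        Each option is a (nutritional value, sentimental value) pair."""
        total = 0.0
        for options, chosen in observations:
            utilities = [n + weight * s for n, s in options]
            total += math.log(choice_prob(utilities, chosen))
        return total

    # Hypothetical observed choices:
    #  A: plain dinner vs. cooking the cat; the human always declines the cat.
    #  B: a nutritious meal with a small sentimental cost vs. a bland snack;
    #     the human accepts the small cost, so the weight cannot be unbounded.
    situation_a = [(0.3, 0.0), (0.9, -5.0)]
    situation_b = [(0.8, -0.1), (0.2, 0.0)]
    observations = [(situation_a, 0)] * 20 + [(situation_b, 0)] * 10

    # Maximum-likelihood estimate over a coarse grid of candidate weights.
    candidates = [w / 10 for w in range(0, 51)]
    best = max(candidates, key=lambda w: log_likelihood(w, observations))
    print(f"inferred weight on sentimental value: {best:.1f}")

Note that the cat refusals alone would push the estimated weight arbitrarily high; it is the second, lower-stakes choice that pins it down, a small instance of the trade-off mediation discussed above.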
Finding a solution to the AI control problem is an important task; it may be,
in Bostrom’s words, “the essential task of our age.” Up to now, AI research has focused
on systems that are better at making decisions, but this is not the same as making better
decisions. No matter how excellently an algorithm maximizes, and no matter how
accurate its model of the world, a machine’s decisions may be ineffably stupid in the eyes
of an ordinary human if its utility function is not well aligned with human values.
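The claim that flawless optimization of the wrong objective still yields foolish decisions can be made concrete with a toy sketch, echoing the domestic-robot example above. The Python below is purely illustrative; the options and values are invented, and the "aligned" objective simply adds the term the proxy leaves out.

    # Toy illustration of objective misalignment: the same exact maximizer
    # produces very different decisions depending on whether the objective
    # captures everything the humans care about. All values are invented.

    def best_option(options, utility):
        """Return the option that exactly maximizes the given utility function."""
        return max(options, key=utility)

    # Candidate dinners: (name, nutritional value, sentimental value)
    options = [
        ("lentils", 0.6, 0.0),
        ("leftover pizza", 0.4, 0.0),
        ("the family cat", 0.9, -100.0),
    ]

    def proxy(option):
        # Misspecified objective: nutrition only.
        return option[1]

    def aligned(option):
        # Objective that also counts what the family actually values.
        return option[1] + option[2]

    print("proxy objective picks:  ", best_option(options, proxy)[0])
    print("aligned objective picks:", best_option(options, aligned)[0])

The optimization is identical in both cases; only the objective differs, and with it the quality of the decision as a human would judge it.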
This problem requires a change in the definition of AI itself—from a field
concerned with pure intelligence, independent of the objective, to a field concerned with
systems that are provably beneficial for humans. Taking the problem seriously seems
likely to yield new ways of thinking about AI, its purpose, and our relationship to it.