HOUSE_OVERSIGHT_016837.jpg

Extraction Summary

People: 5
Organizations: 0
Locations: 0
Events: 0
Relationships: 2
Quotes: 2

Document Information

Type: Book page or academic report snippet
File Size: 2.48 MB
Summary

The text discusses the concept of Cooperative Inverse-Reinforcement Learning (CIRL), a framework designed to align machine actions with human preferences through a game-theoretic approach involving partial information. Using a hypothetical example of agents named Harriet and Robby, it illustrates how uncertainty about preferences encourages cooperation and teaching, and further applies this framework to solve the "off-switch problem" by incentivizing robots to allow themselves to be deactivated.

People (5)

Harriet (human agent in the CIRL example)
Robby (robot agent in the CIRL example)
Norbert Wiener (referenced via "Wiener's problem")
Alan Turing (referenced in connection with the off-switch problem)
Hadfield-Menell (cited author of "The Off-Switch Game")

Relationships (2)

Harriet is a human agent in the CIRL example
Robby is a robot agent in the CIRL example

Key Quotes (2)

"The machine may learn more about human preferences as it goes along, of course, but it will never achieve complete certainty."
Source
HOUSE_OVERSIGHT_016837.jpg
Quote #1
"A robot that’s uncertain about human preferences actually benefits from being switched off, because it understands that the human will press the off-switch to prevent the robot from doing something counter to those preferences."
Source
HOUSE_OVERSIGHT_016837.jpg
Quote #2

Full Extracted Text

Complete text extracted from the document (3,825 characters)

enough time and unlimited visual aids, a human could express a preference (or
indifference) when offered a choice between two future lives laid out before him or her in
all their aspects. (This idealization ignores the possibility that our minds are composed of
subsystems with incompatible preferences; if true, that would limit a machine’s ability to
optimally satisfy our preferences, but it doesn’t seem to prevent us from designing
machines that avoid catastrophic outcomes.) The formal problem F to be solved by the
machine in this case is to maximize human future-life preferences subject to its initial
uncertainty as to what they are. Furthermore, although the future-life preferences are
hidden variables, they’re grounded in a voluminous source of evidence—namely, all of
the human choices ever made. This formulation sidesteps Wiener’s problem: The
machine may learn more about human preferences as it goes along, of course, but it will
never achieve complete certainty.
A more precise definition is given by the framework of cooperative inverse-
reinforcement learning, or CIRL. A CIRL problem involves two agents, one human and
the other a robot. Because there are two agents, the problem is what economists call a
game. It is a game of partial information, because while the human knows the reward
function, the robot doesn’t—even though the robot’s job is to maximize it.
A simple example: Suppose that Harriet, the human, likes to collect paper
clips and staples and her reward function depends on how many of each she has. More
precisely, if she has p paper clips and s staples, her degree of happiness is θp + (1-θ)s,
where θ is essentially an exchange rate between paper clips and staples. If θ is 1, she
likes only paper clips; if θ is 0, she likes only staples; if θ is 0.5, she is indifferent
between them; and so on. It’s the job of Robby, the robot, to produce the paper clips and
staples. The point of the game is that Robby wants to make Harriet happy, but he doesn’t
know the value of θ, so he isn’t sure how many of each to produce.
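
A minimal Python sketch of the reward function described above; the function name and the example bundles are illustrative, not taken from the source:

```python
def harriet_reward(p, s, theta):
    # Harriet's happiness for p paper clips and s staples, where theta is
    # the exchange rate between the two items.
    return theta * p + (1 - theta) * s

# Illustrative check at theta = 0.49 (a slight preference for staples):
print(harriet_reward(90, 0, 0.49))   # 44.1 for ninety paper clips
print(harriet_reward(0, 90, 0.49))   # 45.9 for ninety staples
print(harriet_reward(50, 50, 0.49))  # 50.0 for fifty of each
```
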
Here’s how the game works. Let the true value of θ be 0.49—that is, Harriet
has a slight preference for staples over paper clips. And let’s assume that Robby has a
uniform prior belief about θ—that is, he believes θ is equally likely to be any value
between 0 and 1. Harriet now gets to do a small demonstration, producing either two
paper clips or two staples or one of each. After that, the robot can produce either ninety
paper clips, or ninety staples, or fifty of each. You might think that Harriet, who prefers
staples to paper clips, should produce two staples. But in that case, Robby’s rational
response would be to produce ninety staples (with a total value to Harriet of 45.9), which
is a less desirable outcome for Harriet than fifty of each (total value 50.0). The optimal
solution of this particular game is that Harriet produces one of each, so then Robby
makes fifty of each. Thus, the way the game is defined encourages Harriet to “teach”
Robby—as long as she knows that Robby is watching carefully.
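
The same analysis can be reproduced in a short sketch, under the simplifying assumption that Robby keeps a uniform prior over θ and reads each demonstration as evidence about it (two staples suggesting θ below 0.5, two paper clips suggesting θ above 0.5, one of each suggesting θ near 0.5). The posterior means and names below are illustrative assumptions, not from the source:

```python
TRUE_THETA = 0.49

def value(p, s, theta):
    # Harriet's reward for p paper clips and s staples.
    return theta * p + (1 - theta) * s

# Robby's posterior mean for theta after each possible demonstration.
POSTERIOR_MEAN = {"two staples": 0.25, "one of each": 0.50, "two paper clips": 0.75}
# Robby's production options, written as (paper clips, staples).
OPTIONS = {"ninety paper clips": (90, 0), "ninety staples": (0, 90), "fifty of each": (50, 50)}

for demo, est in POSTERIOR_MEAN.items():
    # Robby picks the option that looks best under his estimate of theta ...
    plan = max(OPTIONS, key=lambda name: value(*OPTIONS[name], est))
    # ... and Harriet's actual payoff is evaluated at the true theta.
    print(f"{demo:16s} -> {plan:19s} -> value to Harriet {value(*OPTIONS[plan], TRUE_THETA):.1f}")
# two staples      -> ninety staples      -> value to Harriet 45.9
# one of each      -> fifty of each       -> value to Harriet 50.0
# two paper clips  -> ninety paper clips  -> value to Harriet 44.1
```
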
Within the CIRL framework, one can formulate and solve the off-switch
problem—that is, the problem of how to prevent a robot from disabling its off-switch.
(Turing may rest easier.) A robot that’s uncertain about human preferences actually
benefits from being switched off, because it understands that the human will press the
off-switch to prevent the robot from doing something counter to those preferences. Thus
the robot is incentivized to preserve the off-switch, and this incentive derives directly
from its uncertainty about human preferences.⁷
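
A minimal numeric sketch of this incentive, loosely following the one-shot off-switch game in the cited Hadfield-Menell et al. paper and reusing Harriet and Robby for concreteness. The Gaussian belief over the action's value, and the assumption that Harriet presses the switch exactly when the proposed action would hurt her, are illustrative simplifications, not from the source:

```python
import random
random.seed(0)

# Robby proposes an action whose value U to Harriet he is uncertain about;
# the Gaussian below is an illustrative stand-in for his belief over U.
samples = [random.gauss(0.1, 1.0) for _ in range(100_000)]

act_anyway = sum(samples) / len(samples)                             # act now: E[U]
switch_self_off = 0.0                                                # shut down: 0
defer_to_harriet = sum(max(u, 0.0) for u in samples) / len(samples)  # wait: E[max(U, 0)]

print(f"act without asking    : {act_anyway:6.3f}")
print(f"switch self off       : {switch_self_off:6.3f}")
print(f"defer, keep the switch: {defer_to_harriet:6.3f}")
# Deferring wins whenever Robby is genuinely uncertain about the sign of U,
# because Harriet presses the off-switch exactly when the action would hurt her.
```
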
The off-switch example suggests some templates for controllable-agent
⁷ See Hadfield-Menell et al., “The Off-Switch Game,” https://arxiv.org/pdf/1611.08219.pdf.
