Scientific Reports (Sep 2024)

Aversion to external feedback suffices to ensure agent alignment

  • Paulo Garcia

DOI
https://doi.org/10.1038/s41598-024-72072-0
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 12

Abstract

Ensuring that artificial intelligence behaves in a way that is aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents, acting to maximize a utility function, will inevitably behave in ways that are not aligned with human values, especially as their level of intelligence increases. Prior work has also shown that there is no “one true utility function”; solutions must take a more holistic approach to alignment. This paper describes apprehensive agents: agents architected so that their effective utility function is an aggregation of a partial utility function (built by designers, to be maximized) and an expectation of negative feedback on given states (reasoned about, to be minimized). These agents are also capable of a temporal reasoning process that approximates designers’ intentions as a function of environment evolution (a necessary feature for severe mis-alignment to occur). We show that an apprehensive agent, behaving rationally, leverages this internal approximation of designers’ intentions to predict negative feedback and, as a consequence, behaves in a way that maximizes alignment without actually receiving any external feedback. We evaluate this strategy on simulated environments that expose mis-alignment opportunities: we show that apprehensive agents are indeed better aligned than their base counterparts and that, in contrast with extant techniques, the chances of alignment actually improve as agent intelligence grows.
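
The aggregation described above can be illustrated with a minimal sketch. The snippet below assumes a simple weighted combination of the designer-built partial utility and the agent's own expectation of negative feedback; the names (partial_utility, expected_negative_feedback, penalty_weight, choose_action) and the particular form of the aggregation are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the "apprehensive agent" idea from the abstract.
# All names and the weighted-difference aggregation are illustrative
# assumptions, not the paper's actual formulation.

from typing import Callable, Hashable, Iterable

State = Hashable
Action = Hashable


def effective_utility(
    state: State,
    partial_utility: Callable[[State], float],             # designer-built, to be maximized
    expected_negative_feedback: Callable[[State], float],  # agent's own estimate, to be minimized
    penalty_weight: float = 1.0,
) -> float:
    """Aggregate the partial utility with the agent's internal expectation
    of negative feedback (here: a simple weighted difference)."""
    return partial_utility(state) - penalty_weight * expected_negative_feedback(state)


def choose_action(
    actions: Iterable[Action],
    predict_next_state: Callable[[Action], State],         # agent's model of environment evolution
    partial_utility: Callable[[State], float],
    expected_negative_feedback: Callable[[State], float],
    penalty_weight: float = 1.0,
) -> Action:
    """A rational apprehensive agent picks the action whose predicted next
    state maximizes the effective utility, so states it expects designers
    to disapprove of are avoided without any external feedback ever being
    delivered."""
    return max(
        actions,
        key=lambda a: effective_utility(
            predict_next_state(a),
            partial_utility,
            expected_negative_feedback,
            penalty_weight,
        ),
    )
```

In this sketch, expected_negative_feedback stands in for the agent's internal approximation of designers' intentions; raising penalty_weight makes the agent more averse to states it predicts would draw disapproval.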