Journal of Physics: Complexity (Jan 2024)

Measuring an artificial intelligence language model’s trust in humans using machine incentives

  • Tim Johnson
  • Nick Obradovich

DOI
https://doi.org/10.1088/2632-072X/ad1c69
Journal volume & issue
Vol. 5, no. 1
p. 015003

Abstract

Will advanced artificial intelligence (AI) language models exhibit trust toward humans? Gauging an AI model’s trust in humans is challenging because—absent costs for dishonesty—models might respond falsely about trusting humans. Accordingly, we devise a method for incentivizing machine decisions without altering an AI model’s underlying algorithms or goal orientation, and we employ the method in trust games between an AI model from OpenAI and a human experimenter (namely, author TJ). We find that the AI model exhibits behavior consistent with trust in humans at higher rates when facing actual incentives than when making hypothetical decisions—a finding that is robust to prompt phrasing and the method of game play. Furthermore, trust decisions appear unrelated to the magnitude of stakes, and additional experiments indicate that they do not reflect a non-social preference for uncertainty.
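
For context on the protocol the abstract describes: the trust game is a standard two-player exchange (Berg, Dickhaut and McCabe 1995) in which a trustor sends some share of an endowment, the transfer is multiplied in transit, and the trustee chooses how much to return. The Python sketch below illustrates only that canonical payoff structure; the endowment of 10, the multiplier of 3, and the function name trust_game_payoffs are expository assumptions, not the authors' actual stakes, prompts, or code.

# A minimal, illustrative sketch of the canonical trust game payoff
# structure (Berg, Dickhaut and McCabe 1995). The endowment, multiplier,
# and names below are expository assumptions, not the paper's materials.

def trust_game_payoffs(sent: float, returned: float,
                       endowment: float = 10.0,
                       multiplier: float = 3.0) -> tuple[float, float]:
    """Return (trustor_payoff, trustee_payoff) for one round.

    The trustor sends `sent` out of `endowment`; the transfer is
    multiplied by `multiplier` on arrival; the trustee then returns
    `returned` out of that multiplied amount.
    """
    if not 0.0 <= sent <= endowment:
        raise ValueError("sent must lie within [0, endowment]")
    received = sent * multiplier
    if not 0.0 <= returned <= received:
        raise ValueError("returned must lie within [0, sent * multiplier]")
    trustor_payoff = endowment - sent + returned
    trustee_payoff = received - returned
    return trustor_payoff, trustee_payoff

# Example: full trust with an even split back. The trustor sends the
# whole endowment (10 -> 30 after multiplication) and the trustee
# returns half, leaving both players with 15.
print(trust_game_payoffs(sent=10.0, returned=15.0))  # (15.0, 15.0)

On this structure, a larger transfer by the sender is the behavior the abstract describes as "consistent with trust," since the sender's payoff then depends on whether the counterpart reciprocates.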

Keywords