Fundamental Research (Nov 2021)

A commentary of Multi-skilled AI in MIT Technology Review 2021

  • Rongrong Ji

Journal volume & issue
Vol. 1, no. 6
pp. 844 – 845

Abstract

Towards the end of 2012, artificial intelligence (AI) scientists first figured out how to impart “vision” to neural networks. Later, they also learned how to enable neural networks to mimic human reasoning, hearing, speaking, and writing. Although AI now matches or even surpasses humans at specific tasks, it still lacks the “flexibility” of the human brain, i.e., the ability to apply skills learned in one situation to another. Taking cues from the growth process of children, we consider the following question: if senses and language can be combined, and AI can collect and process information at a level closer to that of humans, will it be able to develop an understanding of the world? The answer is yes. “Multi-modal” systems, which can simultaneously acquire human senses and language, yield significantly stronger AI and make it easier for AI to adapt to new situations and solve new problems. Hence, such algorithms can be used to solve more complex problems, or be implanted into robots for communication and collaboration with humans in our daily lives. In September 2020, researchers from the Allen Institute for AI (AI2) created a model that could generate images from captions, thus demonstrating the ability of the algorithm to associate words with visual information. In November of that year, scientists from the University of North Carolina at Chapel Hill developed a method of incorporating images into existing language models, which significantly enhanced the ability of the model to comprehend text. Early in 2021, OpenAI extended GPT-3 and released two visual language models: one associates the objects in an image with the words in its description, and the other generates a digital image based on a combination of concepts it has learned. The progress made by “multi-modal” systems will, in the long run, help break through the limits of AI.
It will not only unlock new AI applications, but also make these applications safer and more reliable. More sophisticated multi-modal systems will also aid the development of more advanced robot assistants. Ultimately, multi-modal systems may prove to be the first AI that we can trust.①

① Original source in Chinese: R. Ji, Multi-skilled AI, Bulletin of National Natural Science Foundation of China 35 (3) (2021) 413-415.