A supervised machine learning workflow for the reduction of highly dimensional biological data

Linnea K. Andersen; Benjamin J. Reading

Artificial Intelligence in the Life Sciences (Jun 2024)

Affiliations

Linnea K. Andersen: Department of Applied Ecology, North Carolina State University, Raleigh, NC, USA
Benjamin J. Reading: Department of Applied Ecology, North Carolina State University, Raleigh, NC, USA; Pamlico Aquaculture Field Laboratory, North Carolina State University, Aurora, NC, USA; Corresponding author at: Department of Applied Ecology, North Carolina State University, 100 Eugene Brooks Avenue, Box 7617, Raleigh, NC 27695, USA.

Journal volume & issue: Vol. 5, p. 100090

Abstract

Recent technological advancements have revolutionized research capabilities across the biological sciences by enabling the collection of large datasets that provide a broader picture of systems, from the cellular to the ecosystem level, at a finer resolution. The rapid rate at which these data are generated has exacerbated bottlenecks in study design and data analysis, especially because conventional methods that rely on traditional statistical tests and assumptions are not suitable or sufficient for highly dimensional data (i.e., more than 1,000 variables). Applying machine learning techniques to large data analysis is one promising and increasingly popular solution. However, limited expertise in interpreting the results of machine learning models to gain meaningful biological insight poses a great challenge. To address this challenge, this article provides a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce these large datasets to the variables (attributes) most determinant of experimental and/or observed conditions, along with a general overview of data analysis and machine learning approaches and related considerations. The workflow presented here has been beta-tested with great success and is recommended for incorporation into analysis pipelines for large data as a standardized approach to reducing data dimensionality. Moreover, the workflow is flexible, and the underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.

Published in Artificial Intelligence in the Life Sciences

ISSN: 2667-3185 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science: Science (General)
Website: https://www.journals.elsevier.com/artificial-intelligence-in-the-life-sciences
