Decoding the gene-disease associations in type 2 diabetes: A curated dataset for text mining-based classification

Sushrutha Raj; Sushmitha Raj; Vindhya Namdeo; Alok Srivastava

Data in Brief (Jun 2024)

Affiliations

Sushrutha Raj: Amity Institute of Integrative Sciences and Health, Amity University Haryana, Amity Education Valley, Gurgaon 122413, India
Sushmitha Raj: Sri Innovation and Research Foundation, Ghaziabad 201009, India
Vindhya Namdeo: Sri Innovation and Research Foundation, Ghaziabad 201009, India
Alok Srivastava: Sri Innovation and Research Foundation, Ghaziabad 201009, India; L V Prasad Eye Institute, Hyderabad 500034, Telangana, India; Corresponding author.

Journal volume & issue: Vol. 54, p. 110418

Abstract

Type 2 diabetes (T2D) has a substantial impact on mortality. According to 2023 statistics, more than half a billion individuals are affected by T2D, making it one of the top 10 leading causes of death worldwide. Multiple factors contribute to the onset of T2D, including obesity, poor diet and lifestyle, and mutations in specific genes. Among these factors, genetics is pivotal. Because genes significantly influence the initiation and progression of the various phases of T2D, we focus on the association between T2D and genes. In this article, we present a curated standard disease-gene association dataset consisting of evidence (reference) sentences that carry disease-gene association information. Each sentence is assigned to one of four classes: Yes, No, Ambiguous, and X, corresponding to positive, negative, ambiguous, and not-related disease-gene associations, respectively. To build the dataset, we downloaded T2D-related abstracts from PubMed using EDirect and pre-processed them to extract the reference sentences. These sentences then underwent two-fold manual validation to compile the disease-gene association data. The dataset serves as reference data for training text mining-based classifiers of biological literature. Such classifiers can then be used to predict the classes of published literature, not only for T2D but also for a wide range of other diseases and their complications. The positively linked genes compiled from these predictions can in turn be used for in-depth system-level analysis of T2D.
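
The pre-processing step described above, extracting reference sentences from downloaded abstracts and preparing them for four-class labelling, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the sentence splitter, the term lists (`DISEASE_TERMS`, `GENE_SYMBOLS`), and the function names are assumptions chosen for illustration, and the PubMed download via EDirect is not shown. Only the four class names and their meanings come from the abstract.

```python
import re

# Label scheme taken from the abstract: class name -> meaning.
LABELS = {
    "Yes": "Positive",
    "No": "Negative",
    "Ambiguous": "Ambiguous",
    "X": "Not related",
}

# Hypothetical term lists, for illustration only; the real dataset
# is not restricted to these disease names or gene symbols.
DISEASE_TERMS = ("type 2 diabetes", "t2d")
GENE_SYMBOLS = ("TCF7L2", "KCNJ11", "PPARG")


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def reference_sentences(abstract):
    """Return one row per (gene, sentence) pair that also mentions the disease.

    The label is left as None: in the workflow described above it is
    assigned later by manual validation, not by this code.
    """
    rows = []
    for sentence in split_sentences(abstract):
        lowered = sentence.lower()
        if not any(term in lowered for term in DISEASE_TERMS):
            continue
        for gene in GENE_SYMBOLS:
            # Whole-word, case-sensitive match on the gene symbol.
            if re.search(rf"\b{re.escape(gene)}\b", sentence):
                rows.append({"gene": gene, "sentence": sentence, "label": None})
    return rows


if __name__ == "__main__":
    example = (
        "Variants in TCF7L2 are associated with type 2 diabetes. "
        "Obesity is a risk factor. "
        "PPARG was not linked to T2D in this cohort."
    )
    for row in reference_sentences(example):
        print(row["gene"], "|", row["sentence"])
```

A curator would then replace each `None` with one of the keys of `LABELS`, and the labelled rows would form the training data for a text classifier.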

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
