Code4ML: a large-scale dataset of annotated Machine Learning code

Anastasia Drozdova; Ekaterina Trofimova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin

doi:10.7717/peerj-cs.1230

PeerJ Computer Science (Feb 2023)

Affiliations

Anastasia Drozdova: Department of Computer Science, NRU Higher School of Economics, Moscow, Russia
Ekaterina Trofimova: Department of Computer Science, NRU Higher School of Economics, Moscow, Russia
Polina Guseva: Department of Computer Science, NRU Higher School of Economics, Moscow, Russia
Anna Scherbakova: Department of Computer Science, NRU Higher School of Economics, Moscow, Russia
Andrey Ustyuzhanin: Department of Computer Science, NRU Higher School of Economics, Moscow, Russia

DOI: https://doi.org/10.7717/peerj-cs.1230
Journal volume & issue: Vol. 9, p. e1230

Abstract

Data scientists increasingly use program code as a data source, for purposes ranging from semantic code classification to automatic program generation. However, machine learning models are of limited use on code unless the code snippets are annotated. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface designed for that purpose. The Code4ML dataset can help address a number of software engineering and data science challenges through a data-driven approach, for example semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
