TY - GEN
T1 - A Dataset for Analysis of Quality Code and Toxic Comments
AU - Sayago-Heredia, Jaime
AU - Chango Sailema, Gustavo
AU - Pérez-Castillo, Ricardo
AU - Piattini, Mario
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
AB - Software development has an important human aspect: the feelings of developers are known to have a significant impact on software development and could affect quality, productivity and developer performance. In this study, we begin the process of finding, understanding and relating these affects to software quality. We propose a code quality and sentiments dataset: a clean set of commits, code quality metrics and toxic sentiments from 19 projects obtained from GitHub. The dataset combines the messages of the commits present in GitHub with quality metrics from SonarQube. Using this information, we apply machine learning techniques with the ML.NET tool to identify toxic developer sentiments in commits that could affect code quality. We analyzed 218K commits from the 19 selected projects. The analysis of the projects took 120 days. We also describe the process of building the tool and retrieving the data. The dataset will be used to investigate in depth the factors that affect developers’ emotions and whether these factors are related to code quality over the life cycle of a software project. In addition, code quality will be estimated as a function of developer sentiments.
KW - Commits
KW - GitHub
KW - Sentiments analysis
KW - Software Engineering
KW - Software quality
KW - SonarQube
KW - Toxic comment classification
UR - http://www.scopus.com/inward/record.url?scp=85147994474&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-24985-3_41
DO - 10.1007/978-3-031-24985-3_41
M3 - Conference contribution
AN - SCOPUS:85147994474
SN - 9783031249846
T3 - Communications in Computer and Information Science
SP - 559
EP - 574
BT - Applied Technologies - 4th International Conference, ICAT 2022, Revised Selected Papers
A2 - Botto-Tobar, Miguel
A2 - Zambrano Vizuete, Marcelo
A2 - Montes León, Sergio
A2 - Torres-Carrión, Pablo
A2 - Durakovic, Benjamin
PB - Springer Science and Business Media Deutschland GmbH
T2 - 4th International Conference on Applied Technologies, ICAT 2022
Y2 - 23 November 2022 through 25 November 2022
ER -