Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Apr 2024)

Censoring training samples using regularization of connectivity relations of class objects

  • Nikolay A. Ignatev,
  • Davrbek X. Tursunmurotov

DOI
https://doi.org/10.17586/2226-1494-2024-24-2-322-329
Journal volume & issue
Vol. 24, no. 2
pp. 322 – 329

Abstract

The censoring of training datasets is considered with regard to the specific implementation of nearest neighbor algorithms. Censoring relies on the set of boundary objects of classes under a given metric and serves two purposes: finding and removing noise objects, and analyzing the cluster structure of the training sample with respect to connectivity. Special conditions for removing noise objects and forming a precedent base for training algorithms are explored. Recognition of objects using such a base should provide higher accuracy with minimal computational cost compared with the original dataset. Necessary and sufficient conditions for selecting noise objects from the set of boundary objects have been developed. The necessary condition for a boundary object to belong to the noise set is specified as a restriction (threshold) on the ratio of the distances from the object to the nearest object of its own class and to the nearest object of the complement of that class. The search for a minimum coverage of the training dataset by standards is based on the analysis of the cluster structure. The standards are objects of the sample. Objects are grouped according to the structure of their connectivity relations over a system of hyperspheres. Groups are formed from the centers (dataset objects) of hyperspheres whose intersection contains boundary objects. The compactness measure is calculated as the average number of training dataset objects, excluding noise, attracted by one standard of the minimum coverage. The connection between the generalizing ability of machine learning algorithms and the value of the compactness measure is analyzed. The presence of this connection is justified by a criterion (regularizer) for selecting the number and composition of the set of noise objects. Optimal regularization coefficients are defined as threshold values for removing noise objects.
The relationship between the value of the compactness measure of the training dataset and the generalizing ability of recognition algorithms is shown. The relationship was identified using the standards of the minimum coverage of the sample, from which the precedent base was formed. Recognition accuracy using the precedent base was found to be higher than that using the original dataset. The minimum composition of the precedent base includes the descriptions of the standards and the parameters of local metrics. When data normalization procedures are used, additional parameters are required. Analysis of the values of the compactness measure is useful for detecting overfitting of algorithms associated with the dimension of the feature space. Recognition based on precedents minimizes the computational cost of nearest neighbor algorithms. Recommendations are given for developing models in the field of information security and for processing and interpreting sociological research data. For information security, a precedent base is formed to identify DDoS attacks. For sociology, it is proposed to obtain new knowledge by analyzing the values of indicators of noise objects and by interpreting the results of dividing respondents into non-overlapping groups with respect to the connectivity of objects. The configurations of the groups with respect to connectivity are not known in advance. There is no point in calculating their centers, which may lie outside the configurations. To explain the contents of the groups, it is proposed to use the standards of the minimum coverage.
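
Illustrative sketch (not taken from the article). The necessary condition on the distance ratio and the compactness measure described in the abstract can be illustrated roughly as follows. Everything specific here is an assumption: the function names are hypothetical, the metric is taken to be Euclidean, the ratio is taken as (distance to nearest own-class object) / (distance to nearest other-class object), an object is flagged when the ratio exceeds the threshold, and the ratio is computed for every object rather than only for boundary objects. The article's sufficient condition and its procedure for building the minimum coverage are not reproduced.

```python
import numpy as np


def nearest_distance_ratios(X, y):
    """Ratio d_same / d_other for every object (assumed direction of the ratio).

    d_same  : distance to the nearest other object of the same class
    d_other : distance to the nearest object of any other class
    Assumes Euclidean distance, at least two classes, and at least
    two objects per class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an object is not its own neighbor
    same = y[:, None] == y[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_other = np.where(~same, dist, np.inf).min(axis=1)
    return d_same / d_other


def noise_candidates(X, y, threshold):
    """Indices of objects whose ratio exceeds the (assumed) threshold."""
    return np.flatnonzero(nearest_distance_ratios(X, y) > threshold)


def compactness(n_objects, n_noise, n_standards):
    """Average number of non-noise objects per standard of the coverage."""
    return (n_objects - n_noise) / n_standards


# Toy data: the last object carries label 1 but lies inside class 0.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.2, 0.5]]
y = [0, 0, 1, 1, 1]
candidates = noise_candidates(X, y, threshold=2.0)
```

In this toy sample only the mislabeled last object exceeds the threshold of 2.0. The threshold plays the role of the regularization coefficient mentioned in the abstract; how its optimal value is chosen is described in the article, not here.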

Keywords