Telecom (Jul 2024)
Training from Zero: Forecasting of Radio Frequency Machine Learning Data Quantity
Abstract
The data used during training in any given application space are directly tied to the performance of the system once deployed. While there are many other factors that are attributed to producing high-performance models based on the Neural Scaling Law within Machine Learning, there is no doubt that the data used to train a system provide the foundation from which to build. One of the underlying heuristics used within the Machine Learning space is that having more data leads to better models, but there is no easy answer to the question, “How much data is needed to achieve the desired level of performance?” This work examines a modulation classification problem in the Radio Frequency domain space, attempting to answer the question of how many training data are required to achieve a desired level of performance, but the procedure readily applies to classification problems across modalities. The ultimate goal is to determine an approach that requires the lowest amount of data collection to better inform a more thorough collection effort to achieve the desired performance metric. By focusing on forecasting the performance of the model rather than the loss value, this approach allows for a greater intuitive understanding of data volume requirements. While this approach will require an initial dataset, the goal is to allow for the initial data collection to be orders of magnitude smaller than what is required for delivering a system that achieves the desired performance. An additional benefit of the techniques presented here is that the quality of different datasets can be numerically evaluated and tied together with the quantity of data, and ultimately, the performance of the architecture in the problem domain.
Keywords