Otoscopy is a diagnostic procedure for visualizing the external ear canal and eardrum, facilitating the detection of various ear pathologies and conditions. Timely otoscopy image classification offers significant advantages, including early detection, reduced patient anxiety, and personalized treatment plans. This paper introduces OTONet, a novel framework specifically tailored for otoscopy image classification. It leverages octave 3D convolution and a combination of feature and region-focus modules to create an accurate and robust classification system capable of distinguishing between various otoscopic conditions. The architecture is designed to efficiently capture and process the spatial and feature information present in otoscopy images. On a public otoscopy dataset, OTONet reaches a classification accuracy of 99.3% and an F1 score of 99.4% across 11 classes of ear conditions. A comparative analysis demonstrates that OTONet surpasses other established machine learning models, including ResNet50, ResNet50v2, VGG16, DenseNet169, and ConvNeXtTiny, across various evaluation metrics. This research contributes to improved diagnostic accuracy, reduced human error, and expedited diagnostics, and it shows potential for telemedicine applications.