Scintillation-based X-ray imaging enables convenient visual observation of absorption contrast with standard digital cameras, a capability critical to a variety of science and engineering disciplines. Image quality is usually improved either optically, with more efficient scintillators, or electronically, with neural-network postprocessing. Here, we propose to overcome this intrinsic separation of the optical transmission process and the electronic computation process by integrating imaging and postprocessing into a single fused optical–electronic convolutional autoencoder, realized by affixing a designable optical convolutional metasurface to the scintillator. In this way, the convolutional autoencoder is connected directly to the down-conversion process, decreasing both optical information loss and training cost. We demonstrate feature-specific enhancement of incoherent images that applies to multi-class samples without additional data precollection. Hard X-ray experiments validate the enhancement of textural and regional features achieved by adjusting the optical metasurface, with a signal-to-noise ratio improvement of up to 11.2 dB. We anticipate that our framework will advance the fundamental understanding of X-ray imaging and prove useful for number recognition and bioimaging applications.
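The fused architecture described above can be illustrated with a minimal sketch: in incoherent imaging, the metasurface's fixed convolution acts on intensity via the system's point-spread function (PSF), and the electronic autoencoder then processes the optically convolved image. The Gaussian PSF, image sizes, and the identity placeholder for the learned network below are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

def psf_convolve(intensity, psf):
    """Incoherent optical convolution: the output intensity is the object
    intensity convolved with the PSF. Implemented here via FFT with
    circular boundary conditions; the PSF is normalized so total
    intensity is conserved."""
    psf = psf / psf.sum()
    H = np.fft.fft2(np.fft.ifftshift(psf))
    return np.real(np.fft.ifft2(np.fft.fft2(intensity) * H))

def gaussian_psf(shape, sigma):
    """Hypothetical metasurface PSF, modeled as a centered Gaussian."""
    yy, xx = np.indices(shape)
    cy, cx = (shape[0] - 1) / 2.0, (shape[1] - 1) / 2.0
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))

# Toy "scene": visible-light intensity after scintillator down-conversion.
rng = np.random.default_rng(0)
scene = rng.random((64, 64))

# Optical layer: the metasurface applies a fixed convolution in hardware,
# before any detector readout or digitization.
optical_out = psf_convolve(scene, gaussian_psf((64, 64), sigma=2.0))

# Electronic layer: the trainable autoencoder consumes optical_out;
# sketched here as an identity placeholder for the learned network.
reconstruction = optical_out.copy()
```

The key design point is that the first convolutional layer is evaluated by light propagation itself, so only the (already convolved) intensity reaches the camera, and the electronic network trains on top of that fixed optical front end.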