A vision transformer (ViT)-based deep neural network was applied to classify flavonoid glycoside isomers from their electrospray ionization tandem mass spectrometry (ESI-MS/MS) spectra. The model classified flavonoid isomers with various substitution patterns (3-O, 6-C, 7-O, 8-C, 4′-O) and multiple glycosides, achieving over 80% accuracy during training. In addition, experimental spectra of flavonoid glycoside standards were acquired with different adducts, and the model performed robustly across these experimental conditions. These results suggest that ViT-based computer vision models are promising tools for analyzing mass spectrometry data.
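
The abstract does not say how spectra are presented to the vision transformer. As a minimal sketch of one possible encoding, the example below bins an MS/MS peak list onto a fixed m/z grid, normalizes it to the base peak, and splits it into equal-width patches that a ViT-style model could embed as tokens. The function name `spectrum_to_patches`, the m/z range, the bin count, the patch size, and the example peak list are all illustrative assumptions, not details taken from this work.

```python
import numpy as np


def spectrum_to_patches(mz, intensity, mz_min=50.0, mz_max=1050.0,
                        n_bins=1000, patch_size=50):
    """Bin a peak list onto a fixed m/z grid and split it into patches.

    All defaults are hypothetical choices for illustration only.
    Returns an array of shape (n_bins // patch_size, patch_size).
    """
    if n_bins % patch_size != 0:
        raise ValueError("n_bins must be divisible by patch_size")

    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Discard peaks that fall outside the chosen m/z window.
    mask = (mz >= mz_min) & (mz < mz_max)

    # Map each remaining peak to the index of its m/z bin.
    idx = ((mz[mask] - mz_min) / (mz_max - mz_min) * n_bins).astype(int)

    # Keep the most intense peak in each bin.
    binned = np.zeros(n_bins)
    np.maximum.at(binned, idx, intensity[mask])

    # Normalize to the base peak so spectra are comparable.
    base_peak = binned.max()
    if base_peak > 0:
        binned /= base_peak

    # Each row is one patch, i.e. one token before linear embedding.
    return binned.reshape(n_bins // patch_size, patch_size)


# Hypothetical peak list (m/z, intensity), for demonstration only.
patches = spectrum_to_patches([153.02, 285.04, 447.09], [20.0, 100.0, 45.0])
print(patches.shape)
```

With these assumed defaults, each spectrum becomes 20 patches of 50 bins. A real pipeline might instead render spectra as 2D images or use a different binning resolution.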