Abstract Social interactions are important for us, humans, as social creatures. Emotions play an important part in social interactions. They usually express meanings along with the spoken utterances to the interlocutors. Automatic facial expressions recognition is one technique to automatically capture, recognise, and understand emotions from the interlocutor. Many techniques proposed to increase the accuracy of emotions recognition from facial cues. Architecture such as convolutional neural networks demonstrates promising results for emotions recognition. However, most of the current models of convolutional neural networks require an enormous computational power to train and process emotional recognition. This research aims to build compact networks with depthwise separable layers while also maintaining performance. Three datasets and three other similar architectures were used to be compared with the proposed architecture. The results show that the proposed architecture performed the best among the other architectures. It achieved up to 13% better accuracy and 6–71% smaller and more compact than the other architectures. The best testing accuracy achieved by the architecture was 99.4%.