Owing to its powerful self-attention mechanism, the Transformer network has achieved considerable success across many sequence modeling tasks and has become one of the most popular methods in text-to-speech (TTS). Vanilla self-attention excels at capturing long-range dependencies but struggles to model stable short-range dependencies, which are crucial for speech synthesis, where local audio signals are highly correlated. To address this problem, we propose the hybrid lightweight convolution (HLC), which fully exploits the local structure of a sequence, and combine it with self-attention to improve Transformer-based TTS. Experimental results show that our modified model achieves better performance in both objective and subjective evaluations. We also demonstrate that a more compact TTS model can be built by combining self-attention with the proposed HLC. Moreover, this method is potentially adaptable to other sequence modeling tasks.
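
For intuition about the local modeling described above, the following is a minimal sketch of a plain lightweight convolution: a depthwise convolution over time whose kernels are softmax-normalized and shared across groups of channels. It assumes that HLC builds on this operation, as its name suggests. The abstract does not specify the "hybrid" component, so it is not reproduced here. The names `softmax` and `lightweight_conv`, the zero padding at sequence edges, and the list-of-lists layout are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(row):
    """Normalize one raw kernel so its taps are positive and sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def lightweight_conv(x, weights):
    """Apply a lightweight convolution along the time axis.

    x:       list of T frames, each a list of C channel values.
    weights: list of H raw kernels, each of odd width K.
             C must be divisible by H; channels in the same group
             share one kernel (hypothetical layout for illustration).
    """
    T, C = len(x), len(x[0])
    H, K = len(weights), len(weights[0])
    assert C % H == 0 and K % 2 == 1
    kernels = [softmax(w) for w in weights]
    half = K // 2
    group = C // H
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            kernel = kernels[c // group]
            for k in range(K):
                s = t + k - half  # source frame inside the local window
                if 0 <= s < T:    # frames outside the sequence count as zero
                    out[t][c] += kernel[k] * x[s][c]
    return out


# Usage: 4 frames, 2 channels, 1 shared kernel of width 3.
frames = [[1.0, 2.0] for _ in range(4)]
smoothed = lightweight_conv(frames, [[0.0, 0.0, 0.0]])
```

Each output frame depends only on a window of K neighboring frames, which is the kind of stable short-range dependency that self-attention alone is said to model poorly. The parameter count is H × K, independent of the number of channels, which is consistent with the goal of a more compact model.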