Up to now, the Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training on large-scale general corpora. Moreover, due to the complexity of natural language, we are witnessing a race toward increasingly massive models, which is not sustainable. In contrast, some works argue for the benefit of diversifying tasks in the pre-training phase, while others focus on efficient ways to share knowledge positively between tasks. Such efficient task diversification could give rise to new pre-training paradigms that enable richer language representations while requiring fewer parameters than the classical MLM pre-training paradigm. In this talk, we review these two lines of work, namely multi-task pre-training and effective knowledge sharing between tasks, and show some of their implications for the Arabic language and its dialects.