Recently, the analysis of deep neural networks has been an active topic of research. While previous work has mostly relied on so-called 'probing tasks' and has made some interesting observations, an explanation of the process behind the observed behavior has been lacking.
I attempt to explain more generally why such behavior is observed by characterizing how the learning objective determines the information flow in the model. In particular, I consider how the representations of individual tokens in the Transformer evolve between layers under different learning objectives: machine translation (MT), language modeling (LM), and masked language modeling (MLM, aka BERT). I look at this task from the information bottleneck perspective on learning in neural networks.
I will show that:
- LMs gradually forget the past when forming predictions about the future;
- for MLMs, the evolution proceeds in two stages of context encoding and token reconstruction;
- MT representations are gradually refined with context, but less processing happens overall.
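To give a concrete sense of the kind of layer-wise analysis involved, here is a minimal sketch that tracks how a single token's representation changes from layer to layer in a toy Transformer encoder. It uses cosine similarity between consecutive layers as a simple proxy for "amount of processing"; the actual study relies on mutual-information estimates from the information bottleneck framework, and the model here is a randomly initialized stand-in, not a trained MT/LM/MLM model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_layers, seq_len = 64, 4, 10

# Toy stack of Transformer encoder layers (untrained, for illustration only).
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    for _ in range(n_layers)
)

x = torch.randn(1, seq_len, d_model)  # random stand-in token embeddings

# Collect each token's representation after every layer.
states = [x]
h = x
with torch.no_grad():
    for layer in layers:
        h = layer(h)
        states.append(h)

# Cosine similarity of one token's vector between consecutive layers:
# lower similarity means the representation changed more at that layer.
tok = 0
sims = [
    torch.cosine_similarity(states[i][0, tok], states[i + 1][0, tok], dim=0).item()
    for i in range(n_layers)
]
print([round(s, 3) for s in sims])
```

In a trained model, plotting such per-layer change curves for MT, LM, and MLM objectives is one way to visualize the differing information flow the findings above describe.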