Market participants often want to infer sentiment from a text: news, analyst reports, earnings call transcripts, regulatory filings, etc. The “old way” to do it is to use a lexicon, like the Loughran-McDonald one, whereas the “new way” is to use embeddings trained in an unsupervised manner, relying on the quality of a “language model”. I will challenge these embeddings: can they really understand the polarity of a text? For that, I need a natural probabilistic generative model for embeddings, so that I can generate synthetic texts and assess the identifiability of the embeddings. It shows that they are not always good at telling synonyms from antonyms. Moreover, by training embeddings on different corpora (Wikipedia, headlines of financial news, bodies of financial news), I will provide evidence not only that it may be impossible for them to capture sentiment polarities, but also that they may attach polarities to terms you would want to be neutral, like names of companies.
For more details, the paper is here: https://arxiv.org/abs/2103.09813
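
As a minimal illustration of the synonym/antonym issue (this is my own quick check, not the paper's generative-model analysis), one can compare cosine similarities of word pairs in off-the-shelf embeddings; the pre-trained GloVe model name and the word pairs below are assumptions chosen for illustration:

```python
import gensim.downloader as api

# Load pre-trained GloVe vectors (one of gensim's standard downloadable models).
wv = api.load("glove-wiki-gigaword-100")

# If embeddings captured polarity, a word should be much closer to its synonym
# than to its antonym. In practice the two similarities are often comparable,
# because antonyms occur in near-identical contexts.
pairs = [
    ("increase", "rise"),      # synonyms
    ("increase", "decrease"),  # antonyms
    ("profit",   "gain"),      # synonyms
    ("profit",   "loss"),      # antonyms
]
for w1, w2 in pairs:
    print(f"cos({w1}, {w2}) = {wv.similarity(w1, w2):.3f}")
```

On such pairs, the synonym and antonym similarities typically come out in the same range, which is the kind of ambiguity the paper studies more systematically with synthetic texts.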