On the Hierarchical Information in a Single Contextualised Word Representation (Student Abstract)
Contextual word embeddings produced by neural language models, such as BERT or ELMo, have seen widespread application and performance gains across many Natural Language Processing tasks, suggesting rich linguistic features encoded in their representations. This work aims to investigate to what extent any linguistic hierarchical information is encoded into a single contextual embedding. Using labelled constituency trees, we train simple linear classifiers on top of single contextualised word representations for ancestor sentiment analysis tasks at multiple constituency levels of a sentence. To assess the presence of hierarchical information throughout the networks, the linear classifiers are trained using representations produced by each intermediate layer of BERT and ELMo variants. We show that with no fine-tuning, a single contextualised representation encodes enough syntactic and semantic sentence-level information to significantly outperform a non-contextual baseline for classifying 5-class sentiment of its ancestor constituents at multiple levels of the constituency tree. Additionally, we show that both LSTM and transformer architectures trained on similarly sized datasets achieve similar levels of performance on these tasks. Future work looks to expand the analysis to a wider range of NLP tasks and contextualisers.