Advances in Natural Language Processing (NLP) have accelerated the adoption of these approaches across domains where text data harbors rich information content. The field has also achieved success in applications to domain-specific languages, such as those used in scientific publications, medical records, or engagements with healthcare professionals. However, one challenge remains: limited sampling of the available knowledge space, whether due to difficulties in data collection or the high cost of labeling. Not surprisingly, some of the most interesting answers lie at the edge of, or outside, the distribution. Identifying novel protein–compound pairs and finding rare information in publications or patient data are all limited by this imbalance in data sampling. In this talk, I will first describe our collaborative NLP platform, which powers not only use case delivery but also our research. Building on one of our use cases, I will then summarize our recent research that aims to address the "needle in a haystack" problem through approaches beyond supervised learning.