The talk will describe synthetic data generation for Natural Language Processing (NLP), with an application to a Named Entity Recognition (NER) task. There are two main parts in the talk:

- The definition of Probabilistic Context-Free Grammars (PCFGs), using graphical examples and short NLTK code snippets to convey their functionality and flexibility. PCFGs are grammars that consist of terminal and non-terminal symbols; a set of probabilistic rules describes how the grammar symbols are combined to form PCFG productions. Defining such grammar rules is at the core of this synthetic data generation approach (see the first sketch below).

- The application of PCFGs to generating NER training data massively and cheaply, without specialised hardware. The core of the application is defining a PCFG that combines the entities of our choice in valid ways and then applying slot filling. In this context, the talk will use NER as an example task whose goal is to recognise concepts like “Action”, “Object”, “Time”, and “Location”. Using a PCFG, we can describe the order of these concepts and their co-occurrence, and then sample productions like [“Action”, “Object”, “Time”, “Location”] or [“Object”, “Location”]. In the last step, we apply slot filling to these productions using vocabulary lists for each concept (e.g., “Show me [Action] restaurants [Object] from last week [Time] in Lausanne, Switzerland [Location]”). This approach generalises trivially to custom concepts and makes it very fast to bootstrap data for a new application (see the second sketch below).

The method and the lessons learned that this talk proposes have been successfully applied to train a NER system, without tailored or manually annotated data, that is used in production in Salesforce Search. It suggests an effective way to generate data that respects underlying distributions while controlling for concept sparsity via the probabilities assigned to each grammar rule.
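As a taste of the first part, here is a minimal sketch of how such a grammar can be defined with NLTK's `PCFG.fromstring`. The concept inventory (Action, Object, Time, Location) comes from the abstract above; the specific rules and probabilities are illustrative, not necessarily the ones used in the talk.

```python
import nltk

# A PCFG over NER concepts: each rule carries a probability, and the
# probabilities of all rules sharing a left-hand side must sum to 1.0.
# Rule probabilities control how often each concept appears in samples.
grammar = nltk.PCFG.fromstring("""
    S -> ACTION OBJECT TIME LOCATION [0.4]
    S -> ACTION OBJECT [0.3]
    S -> OBJECT LOCATION [0.3]
    ACTION -> 'Action' [1.0]
    OBJECT -> 'Object' [1.0]
    TIME -> 'Time' [1.0]
    LOCATION -> 'Location' [1.0]
""")

print(grammar.start())  # S
for production in grammar.productions():
    print(production)   # e.g. S -> ACTION OBJECT TIME LOCATION [0.4]
```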
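For the second part, a minimal sketch of sampling concept sequences from the grammar and slot-filling them with vocabulary lists. NLTK does not ship a probabilistic sampler for PCFGs, so the `sample` helper below is a hypothetical implementation, and the vocabulary lists are toy examples based on the abstract's “Show me restaurants from last week in Lausanne, Switzerland” query.

```python
import random
import nltk

# Same toy grammar as above (rules and probabilities are illustrative).
grammar = nltk.PCFG.fromstring("""
    S -> ACTION OBJECT TIME LOCATION [0.4]
    S -> ACTION OBJECT [0.3]
    S -> OBJECT LOCATION [0.3]
    ACTION -> 'Action' [1.0]
    OBJECT -> 'Object' [1.0]
    TIME -> 'Time' [1.0]
    LOCATION -> 'Location' [1.0]
""")

def sample(grammar, symbol=None):
    """Sample one terminal sequence from a PCFG, choosing a production
    for each nonterminal according to the rule probabilities."""
    if symbol is None:
        symbol = grammar.start()
    productions = grammar.productions(lhs=symbol)
    production = random.choices(
        productions, weights=[p.prob() for p in productions]
    )[0]
    tokens = []
    for sym in production.rhs():
        if isinstance(sym, nltk.grammar.Nonterminal):
            tokens.extend(sample(grammar, sym))
        else:
            tokens.append(sym)  # a terminal, i.e. a concept label
    return tokens

# Hypothetical vocabulary lists; a production system would use far larger ones.
VOCAB = {
    "Action": ["Show me", "Find", "List"],
    "Object": ["restaurants", "hotels", "emails"],
    "Time": ["from last week", "from yesterday"],
    "Location": ["in Lausanne, Switzerland", "in Geneva"],
}

def slot_fill(concepts, vocab):
    """Replace each sampled concept slot with a random surface form,
    keeping the concept label as the NER annotation for that span."""
    return [(random.choice(vocab[c]), c) for c in concepts]

for _ in range(3):
    print(slot_fill(sample(grammar), VOCAB))
# e.g. [('Show me', 'Action'), ('restaurants', 'Object'),
#       ('from last week', 'Time'), ('in Lausanne, Switzerland', 'Location')]
```

Note that adjusting the probabilities on the `S` rules directly changes how often each concept occurs in the generated data, which is the sparsity-control lever the abstract mentions.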
Download the slides for this talk (PDF).