We introduce DeepTone, a novel data-efficient representation learning framework for real-time latent speaker state characterisation.
We will share our experience designing, training, and deploying DeepTone in the cloud and on the edge, and outline our vision for the future of privacy-preserving, computationally efficient acoustic conversation modelling on embedded devices.
Details
Many of the emotions and behaviours that play a critical role in human communication are uniquely contained in the voice. Today, systems that attempt to recognise these emotions and behaviours either rely on noisy, lossy text models applied to transcribed audio, or on fully supervised representations learned from voice, which require a large, extensively labelled corpus of the target emotions and behaviours.
With DeepTone, we introduce a framework that learns rich speaker state representations from small amounts of data with even fewer behavioural labels, and queries this latent space in real time to unlock a new level of accuracy in acoustic intonation modelling.
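To make the idea concrete, the sketch below shows one way such a setup could look in PyTorch: a small encoder (which would be pretrained on unlabelled audio) maps acoustic features to a latent speaker-state vector, and a tiny probe trained on only a few behavioural labels is queried on that vector, one short window at a time. All names, layer sizes, and architecture choices here are illustrative assumptions, not DeepTone's actual design.

```python
# Hypothetical sketch (not DeepTone's architecture): an encoder maps a
# window of log-mel features to a latent speaker-state vector, and a
# lightweight probe -- trainable with only a handful of behavioural
# labels -- is queried on that latent space.
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Maps a window of log-mel features to a fixed-size latent vector."""
    def __init__(self, n_mels: int = 64, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time -> one vector per window
        )
        self.proj = nn.Linear(128, latent_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames)
        pooled = self.net(mel).squeeze(-1)    # (batch, 128)
        return self.proj(pooled)              # (batch, latent_dim)

class StateProbe(nn.Module):
    """Tiny classifier queried on the frozen latent space; needs few labels."""
    def __init__(self, latent_dim: int = 128, n_states: int = 4):
        super().__init__()
        self.head = nn.Linear(latent_dim, n_states)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.head(z).softmax(dim=-1)

# Real-time-style query: encode one ~1 s window of features and read off
# per-state probabilities. Only the probe ever sees behavioural labels.
encoder, probe = LatentEncoder().eval(), StateProbe().eval()
with torch.no_grad():
    mel_window = torch.randn(1, 64, 100)      # 1 window, 64 mels, 100 frames
    state_probs = probe(encoder(mel_window))  # (1, n_states)
print(state_probs)
```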
Efficient training and optimisation have allowed us to port this framework from cloud GPU instances to low-power personal devices, pioneering rich acoustic conversation modelling on embedded systems. In this talk, we will share the lessons we learned along the way, as well as illustrative business applications of DeepTone.
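As an illustration of the kind of optimisation step involved in such a port (and not DeepTone's actual deployment pipeline), the sketch below applies post-training dynamic quantisation to a stand-in model and exports it with TorchScript so it can run on a CPU-only embedded target; the module and file name are placeholders.

```python
# Hypothetical sketch of one common cloud-to-edge export step:
# post-training dynamic quantisation of the linear layers, followed by
# TorchScript export so the model runs without a Python runtime.
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in for a trained encoder + probe
    nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 4)
).eval()

# Quantise Linear weights to int8 to cut memory footprint and CPU cost.
quantised = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Trace and serialise for a lightweight, Python-free runtime on device.
example = torch.randn(1, 128)
traced = torch.jit.trace(quantised, example)
traced.save("speaker_state_edge.pt")
```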