class: middle, center, title-slide
Lecture: Communication
Prof. Gilles Louppe
[email protected]
Can you talk to an artificial agent? Can it understand what you say?
- Machine translation
- Speech recognition
- Text-to-speech synthesis
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[ .kol-1-3[Machine translation:] .kol-1-4.center[Hello, my name is HAL.] .kol-2-12.center[$\rightarrow$] .kol-1-4.center[Bonjour, mon nom est HAL.] ] .grid[ .kol-1-3[Speech recognition:] .kol-1-4.center[.width-100[]] .kol-2-12.center[$\rightarrow$] .kol-1-4.center[Hello, my name is HAL.] ] .grid[ .kol-1-3[Text-to-speech synthesis:] .kol-1-4.center[Hello, my name is HAL.] .kol-2-12.center[$\rightarrow$] .kol-1-4.center[.width-100[]] ]
class: middle
class: middle
Automatic translation of text from one natural language (the source) to another (the target), while preserving the intended meaning.
.exercise[How would you engineer a machine translation system?]
???
Expect the students to come up with a dictionary-based solution.
class: middle
.center[Natural languages are not 1:1 mappings of each other!]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.center[To obtain a correct translation, one must decide
whether "it" refers to the soccer ball or to the window.
Therefore, one must understand physics as well as language.]
.footnote[Image credits: CS188, UC Berkeley.]
.center.width-100[![](figures/archives-lec-communication/data-driven-mt.png)]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Translation systems must model the source and target languages, but systems vary in the type of models they use.
- Some systems analyze the source language text all the way into an interlingua knowledge representation and then generate sentences in the target language from that representation.
- Other systems are based on a transfer model. They keep a database of translation rules and, whenever a rule matches, they translate directly. Transfer can occur at the lexical, syntactic or semantic level.
class: middle
To translate an English sentence $e$ into a French sentence $f$, we model the translation probability $P(f|e)$ directly.
- The translation model $P(f|e)$ is learned from a bilingual corpus, i.e. a collection of parallel texts, each an English/French pair.
- Most of the English sentences to be translated will be novel, but will be composed of phrases that have been seen before.
- The corresponding French phrases will be reassembled to form a French sentence that makes sense.
???
phrase = locution
class: middle
Given an English source sentence $e$:
- Break $e$ into phrases $e_1, ..., e_n$.
- For each phrase $e_i$, choose a corresponding French phrase $f_i$. We use the notation $P(f_i|e_i)$ for the phrasal probability that $f_i$ is a translation of $e_i$.
- Choose a permutation of the phrases $f_1, ..., f_n$. For each $f_i$, we choose a distortion
$$d_i = \text{start}(f_i) - \text{end}(f_{i-1}) - 1,$$
which is the number of words that phrase $f_i$ has moved with respect to $f_{i-1}$; positive for moving to the right, negative for moving to the left.
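To make the distortion concrete, here is a minimal Python sketch that computes $d_i$ from the start/end positions of each French phrase in the output sentence (positions are purely illustrative, and we assume the usual convention $\text{end}(f_0) = 0$).

```python
# Distortion d_i = start(f_i) - end(f_{i-1}) - 1, with the convention end(f_0) = 0.
def distortions(spans):
    """spans[i] = (start, end) word positions of phrase f_i in the French sentence."""
    d, prev_end = [], 0
    for start, end in spans:
        d.append(start - prev_end - 1)
        prev_end = end
    return d

print(distortions([(1, 2), (3, 5), (6, 8)]))  # monotone order: [0, 0, 0]
print(distortions([(4, 5), (1, 3), (6, 8)]))  # reordered phrases: [3, -5, 2]
```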
class: middle
class: middle
We define the probability $P(f, d|e)$ that the sequence of phrases $f = f_1, ..., f_n$ with distortions $d = d_1, ..., d_n$ is a translation of $e$.

Assuming that each phrase translation and each distortion is independent of the others, we have
$$P(f, d|e) = \prod_i P(f_i|e_i) P(d_i).$$
- The best $f$ and $d$ cannot be found through enumeration because of the combinatorial explosion.
- Instead, local beam search with a heuristic that estimates probability has proven effective at finding a nearly-most-probable translation.
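As an illustration, the sketch below scores one candidate segmentation under this factorization; the phrase table and the distortion penalty are toy assumptions, not values estimated from data.

```python
import math

phrase_table = {                       # toy phrasal probabilities P(f_i | e_i)
    ("my name", "mon nom"): 0.7,
    ("my name", "je m'appelle"): 0.3,
    ("is HAL", "est HAL"): 0.9,
}

def p_distortion(d, alpha=0.5):        # assumed geometric penalty on |d_i|
    return (1 - alpha) * alpha ** abs(d)

def log_score(candidate):
    """candidate = list of (english phrase, french phrase, distortion d_i)."""
    return sum(math.log(phrase_table[e, f]) + math.log(p_distortion(d))
               for e, f, d in candidate)

print(log_score([("my name", "mon nom", 0), ("is HAL", "est HAL", 0)]))
```

A real decoder would search over segmentations and permutations with beam search instead of scoring a single hand-picked candidate.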
???
With maybe 100 French phrases for each English phrase in the corpus, there are $100^5$ different 5-phrase translations of a 5-phrase sentence, and $5!$ reorderings for each of them.
class: middle
All that remains is to learn the phrasal and distortion probabilities:
- Find parallel texts.
- Segment into sentences.
- Align sentences.
- Align phrases.
- Extract distortions.
- Improve estimates with expectation-maximization.
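A minimal sketch of the final estimation step, assuming phrase alignment has already produced a list of aligned (English, French) phrase pairs; the pairs below are invented for illustration.

```python
from collections import Counter

aligned_phrases = [                    # output of the (hypothetical) phrase alignment step
    ("my name", "mon nom"), ("my name", "mon nom"),
    ("my name", "je m'appelle"), ("is", "est"),
]
pair_counts = Counter(aligned_phrases)
english_counts = Counter(e for e, _ in aligned_phrases)

def phrasal_prob(f, e):
    """Maximum likelihood estimate of P(f | e) = count(e, f) / count(e)."""
    return pair_counts[e, f] / english_counts[e]

print(phrasal_prob("mon nom", "my name"))  # 2/3
```

In practice these raw counts would then be refined with expectation-maximization, as in the last step above.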
Modern machine translation systems are all based on neural networks of various types, often architected as compositions of
- recurrent networks for sequence-to-sequence learning,
- convolutional networks for modeling spatial dependencies,
- transformer networks.
class: middle
.grid[
.kol-1-2[
- Encoder: a bidirectional RNN, producing a set of annotation vectors $h_i$.
- Decoder: attention-based.
]
]
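A NumPy sketch of one attention step, under simplifying assumptions (a bilinear scoring function and random placeholder values), showing how the decoder turns the annotation vectors $h_i$ into a context vector.

```python
import numpy as np

def attention_step(s_prev, H, W):
    """s_prev: previous decoder state (d,); H: annotation vectors, one row per source position."""
    scores = H @ W @ s_prev                  # alignment scores (assumed bilinear form)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # attention weights over source positions
    return alpha @ H, alpha                  # context vector, weights

H = np.random.randn(5, 8)                    # 5 source positions, 8-dim annotations
context, alpha = attention_step(np.random.randn(8), H, np.random.randn(8, 8))
print(alpha.round(2), context.shape)         # weights sum to 1; context has shape (8,)
```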
class: middle
class: middle, center, black-slide
.grid[
.kol-5-12.center[
My name is HAL.]
]
Speech recognition can be viewed as an instance of the problem of finding the most likely sequence of hidden state variables given a sequence of observations.
- In this case, the (hidden) state variables are the words and the observations are sounds.
- The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors $\mathbf{y}_{1:T}$ in a process called feature extraction.
- The decoder attempts to find the sequence of words $\mathbf{w}_{1:L} = w_1, ..., w_L$ which is the most likely given the sequence of acoustic vectors $\mathbf{y}_{1:T}$:
$$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T}).$$
class: middle
Since $P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T})$ is difficult to model directly, Bayes' rule is used to transform the problem into the equivalent one of finding
$$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) P(\mathbf{w}_{1:L}),$$
where
- the likelihood $p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L})$ is the acoustic model;
- the prior $P(\mathbf{w}_{1:L})$ is the language model.
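A toy illustration of this decomposition: hypothesis scores are combined in the log domain as $\log p(\mathbf{y}|\mathbf{w}) + \log P(\mathbf{w})$ (all numbers below are invented).

```python
import math

candidates = {                               # word sequence: (acoustic log-lik., LM log-prior)
    "my name is HAL": (-42.0, math.log(1e-4)),
    "mine aim is hal": (-41.5, math.log(1e-7)),
}
best = max(candidates, key=lambda w: sum(candidates[w]))
print(best)  # the language model prior favors the first hypothesis despite its lower likelihood
```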
class: middle
class: middle
- The feature extraction seeks to provide a compact representation $\mathbf{y}_{1:T}$ of the speech waveform.
- This form should minimize the loss of information that discriminates between words.
- One of the most widely used encoding schemes is based on mel-frequency cepstral coefficients (MFCCs).
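For instance, with the librosa library (an assumed dependency, with a hypothetical file name), 13 MFCCs per frame can be extracted roughly as follows.

```python
import librosa

y, sr = librosa.load("hello_my_name_is_hal.wav", sr=16000)   # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)       # ~25 ms frames, ~10 ms hop at 16 kHz
print(mfcc.shape)  # (13, T): one 13-dimensional acoustic vector per frame
```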
class: middle
.center[MFCC calculation.]
.footnote[Image credits: Giampiero Salvi, 2016. DT2118.]
???
- Pre-emphasis: amplify the high frequencies.
- Windowing: split the signal into short-time frames.
- FFT: calculate the frequency spectrum and compute the power spectrum (periodogram).
- Filter banks: apply triangular filters (around 40) on a Mel-scale to the power spectrum to extract frequency bands.
- The Mel-scale aims to mimic the non-linear human ear perception of sound, by being more discriminative at lower frequencies and less discriminative at higher frequencies.
- Decorrelate the filter bank coefficients through a Discrete Cosine Transform.
class: middle
.center[Feature extraction from the signal in the time domain to MFCCs.]
.footnote[Image credits: Haytham Fayek, 2016.]
class: middle
A spoken word $w$ is decomposed into a sequence of $K_w$ basic sounds called base phones.
- This sequence is called its pronunciation $\mathbf{q}^{w}_{1:K_w} = q_1, ..., q_{K_w}$.
- Pronunciations are related to words through pronunciation models defined for each word.
- e.g., "Artificial intelligence" is pronounced /ɑːtɪˈfɪʃ(ə)l ɪnˈtɛlɪdʒ(ə)ns/.
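A toy pronunciation model is just a lexicon mapping each word to its base-phone sequence; the ARPAbet-like symbols below are illustrative only.

```python
lexicon = {
    "artificial":   ["AA", "R", "T", "IH", "F", "IH", "SH", "AH", "L"],
    "intelligence": ["IH", "N", "T", "EH", "L", "IH", "JH", "AH", "N", "S"],
}

def pronunciation(words):
    """Concatenate the per-word phone sequences q^w_{1:K_w}."""
    return [q for w in words for q in lexicon[w]]

print(pronunciation(["artificial", "intelligence"]))
```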
class: middle
class: middle
class: middle
Each base phone $q$ is represented by a continuous density hidden Markov model, where
- the transition probability parameter $a_{ij}$ corresponds to the probability of making the particular transition from state $s_i$ to $s_j$;
- the output sensor models are Gaussians $b_j(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mu^{(j)}, \Sigma^{(j)})$ and relate state variables $s_j$ to MFCCs $\mathbf{y}$.
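A minimal sketch of such a phone HMM with Gaussian sensor models, and of the forward recursion that computes the likelihood of an observation sequence by summing over state sequences; all parameter values are invented placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[0.6, 0.4, 0.0],                       # transition probabilities a_ij
              [0.0, 0.7, 0.3],                       # (left-to-right topology)
              [0.0, 0.0, 1.0]])
mus = [np.zeros(2), np.ones(2), 2 * np.ones(2)]      # mu^(j), toy 2-dim acoustic vectors
b = [multivariate_normal(m, np.eye(2)) for m in mus] # Gaussian sensor models b_j

def forward_likelihood(Y, start=np.array([1.0, 0.0, 0.0])):
    """p(y_{1:T} | phone model): forward algorithm, summing over state sequences."""
    alpha = start * np.array([bj.pdf(Y[0]) for bj in b])
    for y in Y[1:]:
        alpha = (alpha @ A) * np.array([bj.pdf(y) for bj in b])
    return alpha.sum()

print(forward_likelihood(np.array([[0.1, -0.2], [0.9, 1.1], [2.2, 1.8]])))
```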
class: middle
The full acoustic model can now be defined as a composition of pronunciation models with individual phone models:
$$
\begin{aligned}
p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) &= \sum_{\mathbf{Q}} p(\mathbf{y}_{1:T} | \mathbf{Q}) P(\mathbf{Q} | \mathbf{w}_{1:L})
\end{aligned}
$$
where the summation is over all valid pronunciation sequences $\mathbf{Q}$ for $\mathbf{w}_{1:L}$.
class: middle
Given the composite HMM formed by concatenating all the constituent pronunciation models, the acoustic likelihood $p(\mathbf{y}_{1:T}|\mathbf{Q})$ is obtained by summing over all possible state sequences through this composite model.
From this formulation, all model parameters can be efficiently estimated from a corpus of training utterances with expectation-maximization.
class: middle
The prior probability of a word sequence $\mathbf{w}_{1:L} = w_1, ..., w_L$ is modeled with an N-gram language model,
$$P(\mathbf{w}_{1:L}) = \prod_{k=1}^{L} P(w_k | w_{k-1}, ..., w_{k-N+1}).$$
The N-gram probabilities are estimated from training texts by counting N-gram occurrences to form maximum likelihood estimates.
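For example, a bigram model ($N = 2$) estimated by counting on a made-up training text:

```python
from collections import Counter

corpus = "my name is HAL . my name is not HAL .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w, prev):
    """Maximum likelihood estimate of P(w_k | w_{k-1})."""
    return bigrams[prev, w] / unigrams[prev]

print(p_bigram("is", "name"))   # 2/2 = 1.0
print(p_bigram("not", "is"))    # 1/2 = 0.5
```

In practice, smoothing is needed to handle N-grams that never occur in the training texts.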
class: middle
The composite model corresponds to an HMM, from which the most likely state sequence can be decoded efficiently, e.g. with the Viterbi algorithm.
By construction, states are associated with phones and words, so that the most likely word sequence $\hat{\mathbf{w}}_{1:L}$ can be read off the decoded state sequence.
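A sketch of Viterbi decoding on such an HMM: the same recursion as the forward algorithm, but with a max instead of a sum, plus back-pointers. The transition and emission log-probabilities below are random placeholders rather than a real composite model.

```python
import numpy as np

def viterbi(log_A, log_B):
    """log_A: (S, S) transition log-probs; log_B: (T, S) per-frame emission log-probs."""
    T, S = log_B.shape
    delta = np.full(S, -np.inf); delta[0] = log_B[0, 0]   # assume we start in state 0
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A                   # best way into each state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                     # most likely state sequence

rng = np.random.default_rng(0)
print(viterbi(np.log(np.full((3, 3), 1 / 3)), rng.normal(size=(6, 3))))
```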
Modern speech recognition systems are now based on end-to-end deep neural network architectures trained on large corpora of data.
.grid[ .kol-2-3[
- Recurrent neural network with
- one or more convolutional input layers,
- followed by multiple recurrent layers,
- and one fully connected layer before a softmax layer.
- Total of 35M parameters.
- Same architecture for both English and Mandarin. ] .kol-1-3[.width-100[]] ]
.footnote[Image credits: Amodei et al, 2015. arXiv:1512.02595.]
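A heavily simplified PyTorch sketch in the spirit of this architecture (convolutional front-end, recurrent layers, per-frame character probabilities); it is a toy stand-in, not the actual Deep Speech 2 implementation, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class SpeechNet(nn.Module):
    def __init__(self, n_features=161, n_hidden=256, n_chars=29):
        super().__init__()
        self.conv = nn.Conv1d(n_features, n_hidden, kernel_size=11, stride=2, padding=5)
        self.rnn = nn.GRU(n_hidden, n_hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, x):                  # x: (batch, n_features, time) spectrogram frames
        h = torch.relu(self.conv(x))       # temporal convolution halves the time axis
        h, _ = self.rnn(h.transpose(1, 2)) # stacked bidirectional recurrent layers
        return self.fc(h).log_softmax(-1)  # per-frame character log-probabilities

out = SpeechNet()(torch.randn(1, 161, 200))
print(out.shape)                           # (1, 100, 29), e.g. suitable for CTC training
```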
class: middle, black-slide
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/IFPwMKbdQnI?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>Deep Speech 2 ]
class: middle
class: middle
.grid[
.kol-5-12.center[
My name is HAL.]
.kol-2-12.center[$\rightarrow$]
]
The Tacotron 2 system is a sequence-to-sequence neural network architecture for text-to-speech. It consists of two components:
- a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence;
- a WaveNet vocoder which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.
class: middle
.footnote[Image credits: Shen et al, 2017. arXiv:1712.05884.]
class: middle
- The Tacotron 2 architecture produces mel spectrograms as outputs, which remain to be synthesized as waveforms.
- This last step can be performed by another autoregressive neural model, such as WaveNet, which transforms mel-scale spectrograms into high-fidelity waveforms.
.center[ .width-30[![](figures/archives-lec-communication/mel-to-wave.png)] .width-50[![](figures/archives-lec-communication/wavenet.png)] ]
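As an illustration of this intermediate representation, an 80-band log-mel spectrogram can be computed with librosa (the file name and parameters are assumptions, not those of the Tacotron 2 paper).

```python
import librosa
import numpy as np

y, sr = librosa.load("hello_my_name_is_hal.wav", sr=22050)    # hypothetical recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                     n_fft=1024, hop_length=256)
log_mel = np.log(mel + 1e-6)    # frames of this (80, T) array condition the vocoder
print(log_mel.shape)
```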
class: middle
Audio samples at
class: middle, black-slide
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/7gh6_U7Nfjs?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>Google Assistant: Soon in your smartphone. ]
- Natural language understanding is one of the most important subfields of AI.
- Machine translation, speech recognition and text-to-speech synthesis are instances of sequence-to-sequence problems.
- All of these problems can be tackled with traditional statistical inference methods, but this requires sophisticated engineering.
- State-of-the-art methods are now based on neural networks.
class: end-slide, center count: false
The end.