
class: middle, center, title-slide

Introduction to Artificial Intelligence

Lecture: Communication



Prof. Gilles Louppe
[email protected]


Today

.center.width-30[]

Can you talk to an artificial agent? Can it understand what you say?

  • Machine translation
  • Speech recognition
  • Text-to-speech synthesis

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

Sequence-to-sequence mapping

.grid[ .kol-1-3[Machine translation:] .kol-1-4.center[Hello, my name is HAL.] .kol-2-12.center[$\rightarrow$] .kol-1-4.center[Bonjour, mon nom est HAL.] ] .grid[ .kol-1-3[Speech recognition:] .kol-1-4.center[.width-100[]] .kol-2-12.center[$\rightarrow$] .kol-1-4.center[Hello, my name is HAL.] ] .grid[ .kol-1-3[Text-to-speech synthesis:] .kol-1-4.center[Hello, my name is HAL.] .kol-2-12.center[$\rightarrow$] .kol-1-4.center[.width-100[]] ]


class: middle

Machine translation


class: middle

.center.width-100[]

Machine translation

Automatic translation of text from one natural language (the source) to another (the target), while preserving the intended meaning.

.exercise[How would you engineer a machine translation system?]

???

Expect the students to come up with a dictionary-based solution.


class: middle

Issue of dictionary lookups

.center.width-80[]

.center[Natural languages are not 1:1 mappings of each other!]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

.center.width-100[]

.center[To obtain a correct translation, one must decide
whether "it" refers to the soccer ball or to the window.

Therefore, one must understand physics as well as language.]


History

.center.width-100[]

.footnote[Image credits: CS188, UC Berkeley.]


Data-driven machine translation


.center.width-100[![](figures/archives-lec-communication/data-driven-mt.png)]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

Machine translation systems

Translation systems must model the source and target languages, but systems vary in the type of models they use.

  • Some systems analyze the source language text all the way into an interlingua knowledge representation and then generate sentences in the target language from that representation.
  • Other systems are based on a transfer model. They keep a database of translation rules and, whenever a rule matches, translate directly. Transfer can occur at the lexical, syntactic, or semantic level.

class: middle

.center.width-100[]


Statistical machine translation

To translate an English sentence $e$ into a French sentence $f$, we seek the string of words $f^*$ such that $$f^* = \arg\max_f P(f|e).$$

  • The translation model $P(f|e)$ is learned from a bilingual corpus, i.e. a collection of parallel texts, each an English/French pair.
  • Most of the English sentences to be translated will be novel, but will be composed of phrases that have been seen before.
  • The corresponding French phrases will be reassembled to form a French sentence that makes sense.

???

phrase = locution


class: middle

Given an English source sentence $e$, finding a French translation $f$ is a matter of three steps:

  • Break $e$ into phrases $e_1, ..., e_n$.
  • For each phrase $e_i$, choose a corresponding French phrase $f_i$. We use the notation $P(f_i|e_i)$ for the phrasal probability that $f_i$ is a translation of $e_i$.
  • Choose a permutation of the phrases $f_1, ..., f_n$. For each $f_i$, we choose a distortion $$d_i = \text{start}(f_i) - \text{end}(f_{i-1}) - 1,$$ which is the number of words that phrase $f_i$ has moved with respect to $f_{i-1}$; positive for moving to the right, negative for moving to the left (see the sketch below).
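A minimal sketch of this distortion computation, assuming phrase positions are 1-indexed and $\text{end}(f_0) = 0$; each span is the (start, end) position in the French output of the phrase corresponding to $e_i$:

```python
def distortions(f_spans):
    # d_i = start(f_i) - end(f_{i-1}) - 1, with end(f_0) = 0.
    # f_spans: (start, end) word positions of each French phrase, in the
    # order of the English phrases e_1, ..., e_n they translate.
    d, prev_end = [], 0
    for start, end in f_spans:
        d.append(start - prev_end - 1)
        prev_end = end
    return d

print(distortions([(1, 2), (3, 5), (6, 6)]))  # [0, 0, 0]: phrases kept in order
print(distortions([(1, 2), (5, 6), (3, 4)]))  # [0, 2, -4]: second phrase moved right
```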

class: middle

.center.width-100[]


class: middle

We define the probability $P(f,d|e)$ that the sequence of phrases $f$ with distortions $d$ is a translation of the sequence of phrases $e$.

Assuming that each phrase translation and each distortion is independent of the others, we have $$P(f,d|e) = \prod_i P(f_i | e_i) P(d_i).$$

  • The best $f$ and $d$ cannot be found through enumeration because of the combinatorial explosion.
  • Instead, local beam search with a heuristic that estimates probability has proven effective at finding a nearly-most-probable translation.
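As a sketch, a candidate translation would be scored as follows during the search; `P_phrase` and `P_dist` are assumed lookup tables estimated from a parallel corpus, and beam search would call this on partial hypotheses:

```python
import math

def log_p_f_d_given_e(phrase_pairs, d, P_phrase, P_dist):
    # log P(f, d | e) = sum_i [ log P(f_i | e_i) + log P(d_i) ], under the
    # independence assumptions above. P_phrase maps (e_i, f_i) pairs to
    # phrasal probabilities; P_dist maps distortion values to probabilities.
    return sum(math.log(P_phrase[(e, f)]) + math.log(P_dist[d_i])
               for (e, f), d_i in zip(phrase_pairs, d))
```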

???

With maybe 100 French phrases for each English phrase in the corpus, there are $100^5$ different 5-phrase translations, and $5!$ reorderings for each of those.


class: middle

All that remains is to learn the phrasal and distortion probabilities:

  1. Find parallel texts.
  2. Segment into sentences.
  3. Align sentences.
  4. Align phrases.
  5. Extract distortions.
  6. Improve estimates with expectation-maximization.
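A minimal sketch of turning the phrase alignments of step 4 into phrasal probabilities by relative frequency; EM (step 6) would then refine these initial estimates:

```python
from collections import Counter

def phrase_probabilities(aligned_phrases):
    # Relative-frequency estimates P(f|e) from phrase-aligned parallel text.
    joint = Counter(aligned_phrases)
    marginal = Counter(e for e, _ in aligned_phrases)
    return {(e, f): c / marginal[e] for (e, f), c in joint.items()}

pairs = [("my name", "mon nom"), ("my name", "mon nom"), ("my name", "je m'appelle")]
P = phrase_probabilities(pairs)
print(P[("my name", "mon nom")])  # 0.666...
```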

Neural machine translation

Modern machine translation systems are all based on neural networks of various types, often built as compositions of

  • recurrent networks for sequence-to-sequence learning,
  • convolutional networks for modeling spatial dependencies,
  • transformer networks.

.center.width-70[]


class: middle

Attention-based recurrent neural network

.grid[ .kol-1-2[

  • Encoder: bidirectional RNN, producing a set of annotation vectors $h_i$.
  • Decoder: attention-based.
    • Compute attention weights $\alpha_{ij}$.
    • Compute the weighted sum of the annotation vectors, as a way to align the input words to the output words.
    • Decode using the context vector, the embedding of the previous output word and the hidden state. ] .kol-1-2[ .center.width-100[] ] ]
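A minimal numpy sketch of one decoding step's attention, assuming additive (Bahdanau-style) scoring; shapes and parameter names are illustrative:

```python
import numpy as np

def additive_attention(s_prev, H, W, U, v):
    # s_prev: (d,) previous decoder state; H: (T, d) annotation vectors h_i;
    # W, U: (k, d) projection matrices; v: (k,) scoring vector.
    scores = np.tanh(s_prev @ W.T + H @ U.T) @ v  # alignment scores, shape (T,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # attention weights alpha_ij
    context = alpha @ H                           # weighted sum of annotations
    return context, alpha
```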

class: middle

Speech recognition


class: middle, center, black-slide

.center.width-100[]


Recognition as inference

.grid[ .kol-5-12.center[ $\mathbf{y}_{1:T}$ .width-100[]] .kol-2-12.center[
$\rightarrow$] .kol-5-12.center[ $\mathbf{w}_{1:L}$
My name is HAL.] ]

Speech recognition can be viewed as an instance of the problem of finding the most likely sequence of state variables $\mathbf{w}_{1:L}$, given a sequence of observations $\mathbf{y}_{1:T}$.

  • In this case, (hidden) state variables are the words and the observations are sounds.

  • The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors $\mathbf{y}_{1:T}$ in a process called feature extraction.

  • The decoder attempts to find the sequence of words $\mathbf{w}_{1:L} = w_1, ..., w_L$ which is the most likely given the sequence $\mathbf{y}_{1:T}$: $$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T})$$


class: middle

Since $P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T})$ is difficult to model directly, Bayes' rule is used to solve the equivalent problem $$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) P(\mathbf{w}_{1:L}),$$ where

  • the likelihood $p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L})$ is the acoustic model;
  • the prior $P(\mathbf{w}_{1:L})$ is the language model.
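As a sketch, decoding then amounts to maximizing the combined score over candidate word sequences; `acoustic_loglik` and `lm_logprob` are hypothetical stand-ins for the two models:

```python
def decode(candidates, y, acoustic_loglik, lm_logprob):
    # argmax over candidate word sequences w of log p(y|w) + log P(w).
    return max(candidates, key=lambda w: acoustic_loglik(y, w) + lm_logprob(w))
```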

class: middle

.center.width-90[]


class: middle

Feature extraction

  • Feature extraction seeks to provide a compact representation $\mathbf{y}_{1:T}$ of the speech waveform.
  • This representation should minimize the loss of the information that discriminates between words.
  • One of the most widely used encoding schemes is based on mel-frequency cepstral coefficients (MFCCs), as sketched below.
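A minimal sketch with librosa, assuming a 16 kHz mono recording on disk ("utterance.wav" is a hypothetical file):

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, T): one vector y_t per frame
print(mfccs.shape)
```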

class: middle

.center.width-100[

MFCC calculation.]

.footnote[Image credits: Giampiero Salvi, 2016. DT2118.]

???

  • Pre-emphasis: amplify the high frequencies.
  • Windowing: split the signal into short-time frames.
  • FFT: calculate the frequency spectrum and compute the power spectrum (periodogram).
  • Filter banks: apply triangular filters (around 40) on a mel scale to the power spectrum to extract frequency bands.
    • The mel scale aims to mimic the non-linear human ear perception of sound, being more discriminative at lower frequencies and less discriminative at higher frequencies.
  • Decorrelate the filter bank coefficients through a discrete cosine transform.
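A numpy sketch of the first steps above (the mel filter banks and the final DCT are omitted); frame and step lengths are typical but illustrative:

```python
import numpy as np

def mfcc_front_end(signal, sr, frame_ms=25, step_ms=10, n_fft=512):
    # Pre-emphasis: amplify the high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Windowing: split into overlapping short-time frames.
    frame_len, step = int(sr * frame_ms / 1000), int(sr * step_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // step
    frames = np.stack([emphasized[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # FFT -> power spectrum (periodogram), one row per frame.
    return np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
```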

class: middle

.center.width-90[]

$$\downarrow$$

.center.width-90[]

.center[Feature extraction from the signal in the time domain to MFCCs.]

.footnote[Image credits: Haytham Fayek, 2016.]


class: middle

Acoustic model

A spoken word $w$ is decomposed into a sequence of $K_w$ basic sounds called base phones (such as vowels or consonants).

  • This sequence is called its pronunciation $\mathbf{q}^{w}_{1:K_w} = q_1, ..., q_{K_w}$.
  • Pronunciations are related to words through pronunciation models, one defined for each word.
  • e.g. "Artificial intelligence" is pronounced /ɑːtɪˈfɪʃ(ə)l ɪnˈtɛlɪdʒ(ə)ns/.
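As a sketch, a pronunciation model can be thought of as a lexicon mapping each word to its valid phone sequences (the phone symbols here are illustrative, not a full lexicon):

```python
pronunciations = {
    "artificial": [["aa", "t", "ih", "f", "ih", "sh", "ax", "l"]],
    "tomato": [["t", "ax", "m", "aa", "t", "ow"],
               ["t", "ax", "m", "ey", "t", "ow"]],  # multiple valid pronunciations
}
```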

class: middle

.center.width-100[]


class: middle

.center.width-100[]


class: middle

.center.width-60[]

Each base phone $q$ is represented by a phone model, defined as a three-state continuous density HMM, where

  • the transition probability parameter $a_{ij}$ corresponds to the probability of making the particular transition from state $s_i$ to $s_j$;
  • the output sensor models are Gaussians $b_j(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mu^{(j)}, \Sigma^{(j)})$ and relate state variables $s_j$ to MFCCs $\mathbf{y}$.
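A toy sketch of one such phone model; all parameter values are illustrative, not trained:

```python
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[0.7, 0.3, 0.0],   # a_ij: probability of the transition s_i -> s_j
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 0.9]])  # remaining mass leaves the phone model
means = [np.zeros(13), np.ones(13), 2.0 * np.ones(13)]  # mu_j over 13-dim MFCC vectors
covs = [np.eye(13)] * 3                                 # Sigma_j

def b(j, y):
    # Output density b_j(y) = N(y; mu_j, Sigma_j).
    return multivariate_normal.pdf(y, mean=means[j], cov=covs[j])
```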

class: middle

The full acoustic model can now be defined as a composition of pronunciation models with individual phone models: $$ \begin{aligned} p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) &= \sum_{\mathbf{Q}} p(\mathbf{y}_{1:T} | \mathbf{Q}) P(\mathbf{Q} | \mathbf{w}_{1:L}) \end{aligned} $$ where the summation is over all valid pronunciation sequences for $\mathbf{w}_{1:L}$, $\mathbf{Q}$ is a particular sequence $\mathbf{q}^{w_1}, ..., \mathbf{q}^{w_L}$ of pronunciations, $$ \begin{aligned} P(\mathbf{Q}|\mathbf{w}_{1:L}) &= \prod_{l=1}^L P(\mathbf{q}^{w_l}|w_l) \end{aligned}$$ as given by the pronunciation model, and where $\mathbf{q}^{w_l}$ is a valid pronunciation for word $w_l$.


class: middle

Given the composite HMM formed by concatenating all the constituent pronunciations $\mathbf{q}^{w_1}, ..., \mathbf{q}^{w_L}$ and their corresponding base phones, the acoustic likelihood is given by $$ p(\mathbf{y}_{1:T}|\mathbf{Q}) = \sum_\mathbf{s} p(\mathbf{s},\mathbf{y}_{1:T}|\mathbf{Q}) $$ where $\mathbf{s} = s_0, ..., s_{T+1}$ is a state sequence through the composite model and $$p(\mathbf{s},\mathbf{y}_{1:T}|\mathbf{Q}) = a_{s_0, s_1} \prod_{t=1}^T b_{s_t}(\mathbf{y}_t) a_{s_t s_{t+1}}.$$

From this formulation, all model parameters can be efficiently estimated from a corpus of training utterances with expectation-maximization.
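As a sketch, the sum over state sequences can be computed with the forward algorithm in $O(TS^2)$ rather than by enumeration (in practice this is done in log space for numerical stability):

```python
import numpy as np

def acoustic_likelihood(b, A, entry, exit_p):
    # b: (T, S) output densities b_j(y_t); A: (S, S) transitions a_ij;
    # entry, exit_p: (S,) probabilities of entering/leaving the composite model.
    alpha = entry * b[0]
    for t in range(1, len(b)):
        alpha = (alpha @ A) * b[t]  # sum over predecessor states
    return alpha @ exit_p           # p(y_{1:T} | Q)
```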


class: middle

N-gram language model

The prior probability of a word sequence $\mathbf{w} = w_1, ..., w_L$ is given by $$P(\mathbf{w}) = \prod_{l=1}^L P(w_l | w_{l-1}, ..., w_{l-N+1}).$$

The N-gram probabilities are estimated from training texts by counting N-gram occurrences to form maximum likelihood estimates.
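A minimal sketch for bigrams ($N = 2$), with sentence boundary markers added:

```python
from collections import Counter

def bigram_mle(corpus):
    # Maximum likelihood estimates P(w_l | w_{l-1}) by counting occurrences.
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(u, w): c / unigrams[u] for (u, w), c in bigrams.items()}

P = bigram_mle([["my", "name", "is", "hal"], ["my", "name", "is", "dave"]])
print(P[("is", "hal")])  # 0.5
```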


class: middle

Decoding

The composite model corresponds to an HMM, from which the most-likely state sequence $\mathbf{s}$ can be inferred using (a variant of) Viterbi.

By construction, states $\mathbf{s}$ relate to phones, phones to pronunciations, and pronunciations to words.
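A minimal log-space Viterbi sketch over the composite model (array shapes are assumptions, matching the forward-algorithm sketch above):

```python
import numpy as np

def viterbi(log_b, log_A, log_entry):
    # log_b: (T, S) log output densities; log_A: (S, S); log_entry: (S,).
    T, S = log_b.shape
    delta = log_entry + log_b[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (S, S): from state i to state j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]         # backtrace the best state sequence
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```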


Neural speech recognition

Modern speech recognition systems are now based on end-to-end deep neural network architectures trained on large corpora of data.

.grid[ .kol-2-3[

Deep Speech 2

  • Recurrent neural network with
    • one or more convolutional input layers,
    • followed by multiple recurrent layers,
    • and one fully connected layer before a softmax layer.
  • Total of 35M parameters.
  • Same architecture for both English and Mandarin. ] .kol-1-3[.width-100[]] ]

.footnote[Image credits: Amodei et al, 2015. arXiv:1512.02595.]
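A minimal PyTorch sketch in the spirit of this architecture; layer sizes, depths, and the output alphabet are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):
    def __init__(self, n_features=161, n_hidden=512, n_chars=29):
        super().__init__()
        # Convolutional input layer over spectrogram frames.
        self.conv = nn.Conv1d(n_features, n_hidden, kernel_size=11, stride=2, padding=5)
        # Multiple recurrent layers.
        self.rnn = nn.GRU(n_hidden, n_hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        # Fully connected layer before the softmax over characters.
        self.fc = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, x):  # x: (batch, time, features)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.rnn(h)
        return self.fc(h).log_softmax(dim=-1)  # per-frame character log-probs (CTC-style)
```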


class: middle, black-slide

.center[

<iframe width="640" height="400" src="https://www.youtube.com/embed/IFPwMKbdQnI?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>

Deep Speech 2 ]


class: middle

Text-to-speech synthesis


class: middle

.grid[ .kol-5-12.center[ $\mathbf{w}_{1:L}$
My name is HAL.] .kol-2-12.center[
$\rightarrow$] .kol-5-12.center[ $\mathbf{y}_{1:T}$ .width-100[]] ]


Tacotron 2

The Tacotron 2 system is a sequence-to-sequence neural network architecture for text-to-speech. It consists of two components:

  • a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence;
  • a WaveNet vocoder which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.
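As a sketch, the full system composes these two components (hypothetical interfaces, not the actual Tacotron 2 or WaveNet APIs):

```python
def text_to_speech(text, feature_net, vocoder):
    mel = feature_net(text)  # characters -> mel spectrogram frames
    return vocoder(mel)      # mel frames -> time-domain waveform samples
```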

class: middle

.width-80.center[]

.footnote[Image credits: Shen et al, 2017. arXiv:1712.05884.]


class: middle

WaveNet

  • The Tacotron 2 architecture produces mel spectrograms as outputs, which remain to be synthesized into waveforms.
  • This last step can be performed by another autoregressive neural model, such as WaveNet, which transforms mel-scale spectrograms into high-fidelity waveforms.

.center[ .width-30[![](figures/archives-lec-communication/mel-to-wave.png)] .width-50[![](figures/archives-lec-communication/wavenet.png)] ]

class: middle

Audio samples at


class: middle, black-slide

.center[

<iframe width="640" height="400" src="https://www.youtube.com/embed/7gh6_U7Nfjs?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>

Google Assistant: Soon on your smartphone. ]


Summary

  • Natural language understanding is one of the most important subfields of AI.
  • Machine translation, speech recognition and text-to-speech synthesis are instances of sequence-to-sequence problems.
  • All of these problems can be tackled with traditional statistical inference methods, but this requires sophisticated engineering.
  • State-of-the-art methods are now based on neural networks.

class: end-slide, center count: false

The end.