class: middle, center, title-slide

Introduction to Artificial Intelligence

Lecture: Communication

Prof. Gilles Louppe
[email protected]



Can you talk to an artificial agent? Can it understand what you say?

  • Machine translation
  • Speech recognition
  • Text-to-speech synthesis

.footnote[Image credits: CS188, UC Berkeley.]

class: middle

Sequence-to-sequence mapping

.grid[ .kol-1-3[Machine translation:][Hello, my name is HAL.][$\rightarrow$][Bonjour, mon nom est HAL.] ] .grid[ .kol-1-3[Speech recognition:][.width-100[]][$\rightarrow$][Hello, my name is HAL.] ] .grid[ .kol-1-3[Text-to-speech synthesis:][Hello, my name is HAL.][$\rightarrow$][.width-100[]] ]

class: middle

Machine translation

class: middle


Machine translation

Automatic translation of text from one natural language (the source) to another (the target), while preserving the intended meaning.

.exercise[How would you engineer a machine translation system?]


Expect the students to come up with a dictionary-based solution.

class: middle

Issue of dictionary lookups


.center[Natural languages are not 1:1 mappings of each other!]

.footnote[Image credits: CS188, UC Berkeley.]

class: middle


.center[To obtain a correct translation, one must decide
whether "it" refers to the soccer ball or to the window.

Therefore, one must understand physics as well as language.]



.footnote[Image credits: CS188, UC Berkeley.]

Data-driven machine translation


.footnote[Image credits: CS188, UC Berkeley.]

class: middle

Machine translation systems

Translation systems must model the source and target languages, but systems vary in the type of models they use.

  • Some systems analyze the source language text all the way into an interlingua knowledge representation and then generate sentences in the target language from that representation.
  • Other systems are based on a transfer model. They keep a database of translation rules and, whenever a rule matches, they translate directly. Transfer can occur at the lexical, syntactic or semantic level.

class: middle


Statistical machine translation

To translate an English sentence $e$ into a French sentence $f$, we seek the string of words $f^*$ such that $$f^* = \arg\max_f P(f|e).$$

  • The translation model $P(f|e)$ is learned from a bilingual corpus, i.e. a collection of parallel texts, each an English/French pair.
  • Most of the English sentences to be translated will be novel, but will be composed of phrases that have been seen before.
  • The corresponding French phrases will be reassembled to form a French sentence that makes sense.


phrase = locution

class: middle

Given an English source sentence $e$, finding a French translation $f$ is a matter of three steps:

  • Break $e$ into phrases $e_1, ..., e_n$.
  • For each phrase $e_i$, choose a corresponding French phrase $f_i$. We use the notation $P(f_i|e_i)$ for the phrasal probability that $f_i$ is a translation of $e_i$.
  • Choose a permutation of the phrases $f_1, ..., f_n$. For each $f_i$, we choose a distortion $$d_i = \text{start}(f_i) - \text{end}(f_{i-1}) - 1,$$ which is the number of words that phrase $f_i$ has moved with respect to $f_{i-1}$; positive for moving to the right, negative for moving to the left.
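As a small illustration of the distortion formula, the sketch below computes $d_i$ for a made-up phrase ordering. The function name `distortions`, the `(start, end)` span encoding and the convention $\text{end}(f_0) = 0$ are assumptions made for this example only.

```python
def distortions(spans):
    """Compute d_i = start(f_i) - end(f_{i-1}) - 1 for each French phrase.

    `spans` lists the (start, end) word positions (1-indexed, inclusive) of
    the French phrases f_1..f_n, in the order of the English phrases e_1..e_n.
    Assumption: end(f_0) = 0, so a first phrase starting at word 1 has d = 0.
    """
    result = []
    prev_end = 0
    for start, end in spans:
        result.append(start - prev_end - 1)
        prev_end = end
    return result


# Hypothetical example: the French counterparts of e_1, e_2, e_3 occupy
# words 1-2, 5-6 and 3-4 of the French sentence, respectively.
print(distortions([(1, 2), (5, 6), (3, 4)]))  # [0, 2, -4]
```

The second phrase has jumped two words to the right, and the third has moved back four words to the left. A translation that keeps the English order has all distortions equal to zero.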

class: middle


class: middle

We define the probability $P(f,d|e)$ that the sequence of phrases $f$ with distortions $d$ is a translation of the sequence of phrases $e$.

Assuming that each phrase translation and each distortion is independent of the others, we have $$P(f,d|e) = \prod_i P(f_i | e_i) P(d_i).$$

  • The best $f$ and $d$ cannot be found through enumeration because of the combinatorial explosion.
  • Instead, local beam search with a heuristic that estimates probability has proven effective at finding a nearly-most-probable translation.
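To make the factorization concrete, here is a minimal sketch of the score that a search procedure would compare between candidates. Everything in it is illustrative: the phrase table entries are invented, and the geometric-style distortion model `distortion_prob` is an assumption, since the slides do not specify the form of $P(d_i)$.

```python
import math

# Hypothetical phrasal probabilities P(f_i | e_i); the numbers are invented.
PHRASE_TABLE = {
    ("hello", "bonjour"): 0.8,
    ("my name", "mon nom"): 0.5,
    ("is HAL", "est HAL"): 0.9,
}


def distortion_prob(d, alpha=0.5):
    """Assumed distortion model: P(d) proportional to alpha^|d|.

    The factor (1 - alpha) / (1 + alpha) makes P sum to 1 over all integers.
    """
    return (1 - alpha) / (1 + alpha) * alpha ** abs(d)


def log_score(phrase_pairs, distortions):
    """Return log P(f, d | e) = sum_i [log P(f_i | e_i) + log P(d_i)]."""
    return sum(
        math.log(PHRASE_TABLE[pair]) + math.log(distortion_prob(d))
        for pair, d in zip(phrase_pairs, distortions)
    )


pairs = list(PHRASE_TABLE)
monotone = log_score(pairs, [0, 0, 0])    # phrases kept in English order
reordered = log_score(pairs, [0, 2, -4])  # same phrases, shuffled
print(math.exp(monotone), math.exp(reordered))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. Under this assumed distortion model, the reordered candidate scores lower than the monotone one.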


With maybe 100 French phrases for each English phrase in the corpus, there are $100^5$ different 5-phrase translations, and $5!$ reorderings for each of those.
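Taking these figures at face value, the size of the search space for a single 5-phrase sentence works out to

$$100^5 \times 5! = 10^{10} \times 120 = 1.2 \times 10^{12}$$

candidate translations, which is why exhaustive enumeration is not an option.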

class: middle

All that remains is to learn the phrasal and distortion probabilities:

  1. Find parallel texts.
  2. Segment into sentences.
  3. Align sentences.
  4. Align phrases.
  5. Extract distortions.
  6. Improve estimates with expectation-maximization.
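As a rough sketch of where the phrasal probabilities could come from once phrases are aligned (step 4), the code below uses plain relative-frequency counting. This is an assumption about a reasonable starting point before the expectation-maximization refinement of step 6, not a description of a specific system; the function name `estimate_phrase_table` and the toy alignments are hypothetical.

```python
from collections import Counter


def estimate_phrase_table(aligned_pairs):
    """Relative-frequency estimate of P(f | e) from aligned phrase pairs.

    `aligned_pairs` is a list of (english_phrase, french_phrase) tuples.
    P(f | e) = count(e, f) / count(e).
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(e for e, _ in aligned_pairs)
    return {
        (e, f): count / source_counts[e]
        for (e, f), count in pair_counts.items()
    }


# Toy aligned phrase pairs (invented for the example).
alignments = [
    ("the", "le"),
    ("the", "la"),
    ("the", "le"),
    ("house", "maison"),
]
table = estimate_phrase_table(alignments)
print(table[("the", "le")], table[("house", "maison")])
```

Distortion probabilities could be estimated in the same counting fashion from the extracted distortions of step 5.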

Neural machine translation

Modern machine translation systems are all based on neural networks of various types, often built as compositions of

  • recurrent networks for sequence-to-sequence learning,
  • convolutional networks for modeling spatial dependencies,
  • transformer networks.


class: middle

Attention-based recurrent neural network

.grid[ .kol-1-2[

  • Encoder: bidirectional RNN, producing a set of annotation vectors $h_i$.
  • Decoder: attention-based.
    • Compute attention weights $\alpha_{ij}$.
    • Compute the weighted sum of the annotation vectors, as a way to align the input words to the output words.
    • Decode using the context vector, the embedding of the previous output word and the hidden state. ] .kol-1-2[ .center.width-100[] ] ]
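The attention step can be sketched in a few lines of plain Python. Note the assumption: the score between the decoder state and each annotation vector $h_j$ is a simple dot product here, swapped in for brevity; the slides do not specify the scoring function, and attention-based RNN translators typically learn it with a small network. The names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attend(decoder_state, annotations):
    """Return attention weights and the context vector for one decoding step.

    Assumption: dot-product scores between the decoder state and each
    annotation vector h_j (a learned alignment model is more common).
    """
    scores = [
        sum(s_k * h_k for s_k, h_k in zip(decoder_state, h_j))
        for h_j in annotations
    ]
    alphas = softmax(scores)
    dim = len(annotations[0])
    # Context vector: weighted sum of the annotation vectors.
    context = [
        sum(a * h_j[k] for a, h_j in zip(alphas, annotations))
        for k in range(dim)
    ]
    return alphas, context


# Toy example with two input words and 2-dimensional annotations.
alphas, context = attend([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(alphas, context)
```

The weights sum to one, so the context vector is a convex combination of the annotations, dominated here by the first input word.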

class: middle

Speech recognition

class: middle, center, black-slide


Recognition as inference

.grid[[ $\mathbf{y}_{1:T}$ .width-100[]][
$\rightarrow$][ $\mathbf{w}_{1:L}$
My name is HAL.] ]

Speech recognition can be viewed as an instance of the problem of finding the most likely sequence of state variables $\mathbf{w}_{1:L}$, given a sequence of observations $\mathbf{y}_{1:T}$.

  • In this case, (hidden) state variables are the words and the observations are sounds.

  • The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors $\mathbf{y}_{1:T}$ in a process called feature extraction.

  • The decoder attempts to find the sequence of words $\mathbf{w}_{1:L} = w_1, ..., w_L$ which is the most likely given the sequence $\mathbf{y}_{1:T}$: $$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T})$$

class: middle

Since $P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T})$ is difficult to model directly, Bayes' rule is then used to solve the equivalent problem $$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) P(\mathbf{w}_{1:L}),$$ where

  • the likelihood $p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L})$ is the acoustic model;
  • the prior $P(\mathbf{w}_{1:L})$ is the language model.
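The interplay between the two terms can be illustrated with a toy decoder that picks, among a handful of candidate transcriptions, the one maximizing the sum of the acoustic log-likelihood and the language model log-prior. The candidates and all numbers below are invented for the example, and the function name `decode` is hypothetical; a real decoder searches a vastly larger space.

```python
import math


def decode(candidates, acoustic_loglik, language_logprior):
    """Return argmax_w [log p(y | w) + log P(w)] over a finite candidate set."""
    return max(
        candidates,
        key=lambda w: acoustic_loglik[w] + language_logprior[w],
    )


candidates = ["recognize speech", "wreck a nice beach"]

# Invented scores: the acoustics slightly prefer the second candidate...
acoustic_loglik = {"recognize speech": -10.0, "wreck a nice beach": -9.5}
# ...but the language model finds the first one far more plausible.
language_logprior = {
    "recognize speech": math.log(1e-4),
    "wreck a nice beach": math.log(1e-7),
}

best = decode(candidates, acoustic_loglik, language_logprior)
acoustic_only = max(candidates, key=lambda w: acoustic_loglik[w])
print(best, "|", acoustic_only)
```

In this made-up case the prior overturns the acoustic preference, which is the role of the language model in the product.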

class: middle


class: middle

Feature extraction

  • The feature extraction seeks to provide a compact representation $\mathbf{y}_{1:T}$ of the speech waveform.
  • This form should minimize the loss of information that discriminates between words.
  • One of the most widely used encoding schemes is based on mel-frequency cepstral coefficients (MFCCs).

class: middle


.center[MFCCs calculation.]

.footnote[Image credits: Giampiero Salvi, 2016. DT2118.]


  • Pre-emphasis: amplify the high frequencies.
  • Windowing: split the signal into short-time frames.
  • FFT: calculate the frequency spectrum and compute the power spectrum (periodogram).
  • Filter banks: apply triangular filters (around 40) on a Mel-scale to the power spectrum to extract frequency bands.
    • The Mel-scale aims to mimic the non-linear human ear perception of sound, by being more discriminative at lower frequencies and less discriminative at higher frequencies.
  • Decorrelate the bank coefficients through a Discrete Cosine Transform.
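To show how the Mel-scale concentrates resolution at low frequencies, the sketch below places filter centers at equal intervals on the Mel axis and maps them back to Hz. It assumes one common definition of the scale, $m = 2595 \log_{10}(1 + f/700)$, which is not given in the slides; the function names, the 40 filters and the 0 to 8000 Hz range are illustrative choices.

```python
import math


def hz_to_mel(f):
    """Convert a frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def filter_centers(n_filters, f_min, f_max):
    """Center frequencies (Hz) of filters equally spaced on the Mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


centers = filter_centers(40, 0.0, 8000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print(round(centers[0]), round(centers[-1]))
print(round(gaps[0]), round(gaps[-1]))
```

The spacing between consecutive centers grows with frequency, so low frequencies are covered by many narrow filters and high frequencies by a few wide ones.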

class: middle




.center[Feature extraction from the signal in the time domain to MFCCs.]

.footnote[Image credits: Haytham Fayek, 2016.]

class: middle

Acoustic model

A spoken word $w$ is decomposed into a sequence of $K_w$ basic sounds called base phones (such as vowels or consonants).

  • This sequence is called its pronunciation $\mathbf{q}^{w}_{1:K_w} = q_1, ..., q_{K_w}$.
  • Pronunciations are related to words through pronunciation models defined for each word.
  • e.g. "Artificial intelligence" is pronounced /ɑːtɪˈfɪʃ(ə)l ɪnˈtɛlɪdʒ(ə)ns/.

class: middle


class: middle


class: middle


Each base phone $q$ is represented by a phone model defined as a three-state continuous density HMM, where

  • the transition probability parameter $a_{ij}$ corresponds to the probability of making the particular transition from state $s_i$ to $s_j$;
  • the output sensor models are Gaussians $b_j(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mu^{(j)}, \Sigma^{(j)})$ and relate state variables $s_j$ to MFCCs $\mathbf{y}$.

class: middle

The full acoustic model can now be defined as a composition of pronunciation models with individual phone models: $$ \begin{aligned} p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) &= \sum_{\mathbf{Q}} P(\mathbf{y}_{1:T} | \mathbf{Q}) P(\mathbf{Q} | \mathbf{w}_{1:L}) \end{aligned} $$ where the summation is over all valid pronunciation sequences for $\mathbf{w}_{1:L}$, $\mathbf{Q}$ is a particular sequence $\mathbf{q}^{w_1}, ..., \mathbf{q}^{w_L}$ of pronunciations, $$ \begin{aligned} P(\mathbf{Q}|\mathbf{w}_{1:L}) &= \prod_{l=1}^L P(\mathbf{q}^{w_l}|w_l) \end{aligned}$$ as given by the pronunciation model, and where $\mathbf{q}^{w_l}$ is a valid pronunciation for word $w_l$.

class: middle

Given the composite HMM formed by concatenating all the constituent pronunciations $\mathbf{q}^{w_1}, ..., \mathbf{q}^{w_L}$ and their corresponding base phones, the acoustic likelihood is given by $$ p(\mathbf{y}_{1:T}|\mathbf{Q}) = \sum_\mathbf{s} p(\mathbf{s},\mathbf{y}_{1:T}|\mathbf{Q}) $$ where $\mathbf{s} = s_0, ..., s_{T+1}$ is a state sequence through the composite model and $$p(\mathbf{s},\mathbf{y}_{1:T}|\mathbf{Q}) = a_{s_0, s_1} \prod_{t=1}^T b_{s_t}(\mathbf{y}_t) a_{s_t s_{t+1}}.$$

From this formulation, all model parameters can be efficiently estimated from a corpus of training utterances with expectation-maximization.
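The sum over state sequences need not be enumerated: the forward algorithm computes it in time linear in $T$. Below is a minimal sketch for a single three-state left-to-right model with non-emitting entry and exit states, matching the $s_0$ and $s_{T+1}$ of the formula. For brevity it assumes scalar observations with one-dimensional Gaussians instead of MFCC vectors, and all parameter values are invented.

```python
import math


def gaussian(y, mean, var):
    """Density of a 1-D Gaussian, standing in for b_j(y)."""
    return math.exp(-((y - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def acoustic_likelihood(ys, a, emissions):
    """Forward algorithm: p(y_{1:T} | Q) = sum over state paths of p(s, y | Q).

    State 0 is the non-emitting entry state, states 1..n emit, and state
    n + 1 is the non-emitting exit state. `a[i][j]` is the transition
    probability from state i to state j; `emissions[j - 1]` is the
    (mean, variance) of emitting state j.
    """
    n = len(emissions)
    emitting = range(1, n + 1)
    # alpha[j - 1] = p(y_1..y_t, s_t = j), initialised with the entry transition.
    alpha = [a[0][j] * gaussian(ys[0], *emissions[j - 1]) for j in emitting]
    for y in ys[1:]:
        alpha = [
            sum(alpha[i - 1] * a[i][j] for i in emitting)
            * gaussian(y, *emissions[j - 1])
            for j in emitting
        ]
    # Close every path with a transition into the exit state.
    return sum(alpha[j - 1] * a[j][n + 1] for j in emitting)


# Invented three-state left-to-right phone model.
A = [
    [0.0, 1.0, 0.0, 0.0, 0.0],  # entry -> state 1
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],  # state 3 -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
EMISSIONS = [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)]
observations = [0.1, 1.9, 2.2, 4.1]

print(acoustic_likelihood(observations, A, EMISSIONS))
```

The same recursion, with expected counts accumulated along the way, is the building block of the expectation step when estimating the parameters.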

class: middle

N-gram language model

The prior probability of a word sequence $\mathbf{w} = w_1, ..., w_L$ is given by $$P(\mathbf{w}) = \prod_{l=1}^L P(w_l | w_{l-1}, ..., w_{l-N+1}).$$

The N-gram probabilities are estimated from training texts by counting N-gram occurrences to form maximum likelihood estimates.
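For $N = 2$, the counting procedure can be sketched as follows. The tiny corpus, the `<s>` start-of-sentence marker and the function names are assumptions made for the example; practical systems also apply smoothing so that unseen N-grams do not receive zero probability.

```python
from collections import Counter


def bigram_mle(sentences):
    """Maximum likelihood bigram estimates P(w_l | w_{l-1}) from raw counts."""
    bigrams = Counter()
    contexts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return {
        (prev, word): count / contexts[prev]
        for (prev, word), count in bigrams.items()
    }


def sentence_prob(sentence, model):
    """P(w) as the product of bigram probabilities (0 for unseen bigrams)."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= model.get((prev, word), 0.0)
    return prob


corpus = ["my name is HAL", "my name is Dave", "my dog is HAL"]
model = bigram_mle(corpus)
print(model[("my", "name")])                   # 2/3
print(sentence_prob("my name is HAL", model))  # 1 * 2/3 * 1 * 2/3 = 4/9
```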

class: middle


The composite model corresponds to an HMM, from which the most-likely state sequence $\mathbf{s}$ can be inferred using (a variant of) Viterbi.

By construction, states $\mathbf{s}$ relate to phones, phones to pronunciations, and pronunciations to words.
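A minimal sketch of Viterbi decoding on a toy HMM is given below. It is deliberately simplified: two phone-like states, discrete observation symbols instead of Gaussian densities over MFCCs, and invented probabilities. The state names and the `viterbi` signature are hypothetical.

```python
import math


def viterbi(observations, states, log_init, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[s] = log-probability of the best path ending in state s.
    best = {s: log_init[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        step, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + log_trans[r][s])
            step[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best = step
        backpointers.append(pointers)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def logs(table):
    """Element-wise log of a dict of probabilities."""
    return {k: math.log(v) for k, v in table.items()}


states = ["ah", "l"]
log_init = logs({"ah": 0.9, "l": 0.1})
log_trans = {"ah": logs({"ah": 0.6, "l": 0.4}), "l": logs({"ah": 0.1, "l": 0.9})}
log_emit = {"ah": logs({"x": 0.8, "y": 0.2}), "l": logs({"x": 0.3, "y": 0.7})}

path, logp = viterbi(["x", "x", "y", "y"], states, log_init, log_trans, log_emit)
print(path)  # ['ah', 'ah', 'l', 'l']
```

In the full recognizer the decoded states would then be mapped back to phones, pronunciations and finally words.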

Neural speech recognition

Modern speech recognition systems are now based on end-to-end deep neural network architectures trained on large corpora of data.

.grid[ .kol-2-3[

Deep Speech 2

  • Recurrent neural network with
    • one or more convolutional input layers,
    • followed by multiple recurrent layers,
    • and one fully connected layer before a softmax layer.
  • Total of 35M parameters.
  • Same architecture for both English and Mandarin. ] .kol-1-3[.width-100[]] ]

.footnote[Image credits: Amodei et al, 2015. arXiv:1512.02595.]

class: middle, black-slide


<iframe width="640" height="400" src="" frameborder="0" volume="0" allowfullscreen></iframe>

.center[Deep Speech 2]

class: middle

Text-to-speech synthesis

class: middle

.grid[[ $\mathbf{w}_{1:L}$
My name is HAL.][
$\rightarrow$][ $\mathbf{y}_{1:T}$ .width-100[]] ]

Tacotron 2

The Tacotron 2 system is a sequence-to-sequence neural network architecture for text-to-speech. It consists of two components:

  • a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence;
  • a Wavenet vocoder which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.

class: middle

.footnote[Image credits: Shen et al, 2017. arXiv:1712.05884.]

class: middle


  • The Tacotron 2 architecture produces mel spectrograms as outputs, which remain to be synthesized as waveforms.
  • This last step can be performed through another autoregressive neural model, such as Wavenet, to transform mel-scale spectrograms into high-fidelity waveforms.

.center[ .width-30[![](figures/archives-lec-communication/mel-to-wave.png)] .width-50[![](figures/archives-lec-communication/wavenet.png)] ]

class: middle

Audio samples at

class: middle, black-slide


<iframe width="640" height="400" src="" frameborder="0" volume="0" allowfullscreen></iframe>

.center[Google Assistant: Soon in your smartphone.]


  • Natural language understanding is one of the most important subfields of AI.
  • Machine translation, speech recognition and text-to-speech synthesis are instances of sequence-to-sequence problems.
  • All three problems can be tackled with traditional statistical inference methods, but doing so requires sophisticated engineering.
  • State-of-the-art methods are now based on neural networks.

class: end-slide, center
count: false

The end.