Generalize sequence labeler and allow re-use of embeddings for labeling #798
Conversation
jlibovicky commented Feb 22, 2019
- sequence labeler can now use multiple inputs
- fixed bug with max length (max_length must be undefined when using SequenceLabeler, #790)
- labeler can reuse word embeddings for classification
Fix the tests + a few nitpicks.
Looks fine, I'm just not sure about tf.stop_gradient. I've never played with it myself, so I don't know.
If you've also used this with BERT, an example would be nice.
reshaped_states = tf.reshape(states, [-1, embedding_dim])
reshaped_logits = tf.matmul(
    reshaped_states, embeddings, transpose_b=True, name="logits")
Don't we want biases?
We don't add them elsewhere with tie_embeddings either, see autoregressive.py:225.
Right, I hadn't noticed that EmbeddingsLabeler always ties the embeddings.
def logits(self) -> tf.Tensor:
    embeddings = self.embedded_sequence.embedding_matrix
    if not self.train_embeddings:
        embeddings = tf.stop_gradient(embeddings)
Can you really do this? Given that this is effectively the last layer, won't it happen that no information flows back into the rest of the network during backprop (and so nothing gets learned)?
It won't flow through this local embeddings variable, but it does flow through states.
Right, now I see it.
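A minimal sketch of the point above (toy shapes, illustrative names; not the actual labeler code): tf.stop_gradient only cuts the path through the embedding matrix, while the same logits still backpropagate into states and thus into the encoder.

import tensorflow as tf

states = tf.random_normal([8 * 20, 512])      # [batch * time, dim], stands in for encoder states
embeddings = tf.random_normal([30000, 512])   # [vocab, dim], stands in for the embedding matrix

frozen = tf.stop_gradient(embeddings)         # block gradients into the embeddings only
logits = tf.matmul(states, frozen, transpose_b=True)
loss = tf.reduce_sum(logits)

d_states, d_embeddings = tf.gradients(loss, [states, embeddings])
# d_states is an ordinary tensor: the encoder still receives gradients through `states`.
# d_embeddings is None: the path through the embedding matrix is cut.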
name="state_to_word_b", | ||
shape=[len(self.vocabulary)], | ||
initializer=tf.zeros_initializer()) | ||
def concatenated_inputs(self) -> tf.Tensor: |
If I have multiple encoders that work over sequences of different lengths, this will crash on the concat, won't it?
Shouldn't such a situation rather be handled via FactoredSequence/FactoredEncoder?
That's true.
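A minimal illustration of the concern above (made-up shapes): concatenating encoder states along the feature axis only works when the time dimensions agree.

import tensorflow as tf

enc_a = tf.zeros([8, 20, 256])   # [batch, time=20, dim]
enc_b = tf.zeros([8, 20, 128])   # same time dimension
enc_c = tf.zeros([8, 17, 128])   # different time dimension

ok = tf.concat([enc_a, enc_b], axis=2)   # fine, shape [8, 20, 384]
# tf.concat([enc_a, enc_c], axis=2)      # ValueError at graph construction: 20 vs. 17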
The tests are still failing + I didn't get an answer to the BERT question (it's enough to say no, you don't have to add the examples).
I overlooked the note about BERT; we haven't used this with BERT... We have an ordinary language model that we put this labeler on top of, and it is then almost the same thing as a recurrent decoder.
So, hopefully fixed now.
One more thing I hadn't noticed before (train_xents):
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=self.train_targets, logits=self.logits)

# loss is now of shape [batch, time]. Need to mask it now by
# element-wise multiplication with weights placeholder
- weighted_loss = loss * self.train_mask
- return tf.reduce_sum(weighted_loss)
+ return loss * self.train_mask
Right, I'm also not happy with the format of train_xents here.
Here it returns shape (batch, time), but in the autoregressive decoders it returns shape (batch).
It would be good to agree on what should be returned (either return the per-sequence mean here, or in autoregressive it should be enough to turn off average_across_timesteps in self.train_xents).
In the long run there should of course be a single common "decoder" ancestor with abstract methods like loss, xents and so on, but I'd leave that for another PR.
In Marian this is configurable ("ce-mean" vs "ce-mean-words"); newer models use the loss averaged over all words in the batch, but the default is the per-sentence average.
For the labeler it makes more sense to have the loss for each label separately, whereas for a decoder it is probably better per sentence, but I agree it should be unified. So it is probably better to return (batch, time), which you can then average however you like.
I'm also for the second option, but that means also turning off the switch in AutoregressiveDecoder now. I don't know how that will affect established training setups (I assume only minimally, but I haven't measured it). If you run a comparison, feel free to merge it that way.
On the other hand, I have already verified that taking the mean over sentences and then over the batch works in the sequence labeler, so in this PR I'd rather do that (and open an issue for resolving it properly).
In any case it would be good to have this unified in master already, before it gets forgotten. The key point is that it significantly reduces the amount of hackery in other components, which otherwise need strange workarounds like shape checks, supported-decoder lists and the like.
Does the autoregressive decoder have some switch? I only see that it does a reduce_mean on {train,runtime}_xents (instead of taking the mean only over valid positions, but otherwise it's the same).
I mean this:
https://github.com/ufal/neuralmonkey/blob/master/neuralmonkey/decoders/autoregressive.py#L291
seq2seq.sequence_loss has an average_across_timesteps switch, which is currently True. If it were turned off, it would match the train_xents format in the labeler.
Now I see it...
Done. The perplexity runner also had to be changed, because it relied on the averages over time.
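A sketch of the corresponding perplexity computation over per-position cross-entropies (illustrative only, not the actual PerplexityRunner code): sum the masked cross-entropies, normalize by the number of valid tokens, then exponentiate.

import tensorflow as tf

xents = tf.placeholder(tf.float32, [None, None])  # [batch, time]
mask = tf.placeholder(tf.float32, [None, None])   # [batch, time]

# mean cross-entropy per valid token, then exponentiate to get perplexity
token_mean_xent = tf.reduce_sum(xents * mask) / tf.reduce_sum(mask)
perplexity = tf.exp(token_mean_xent)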
I don't much like the solution in PerplexityRunner, but given that PR #801 will remove it entirely, OK.