self.word_embeddings = self.add_weight(
    "weight",
    shape=[self.vocab_size, self.hidden_size],
    initializer=get_initializer(self.initializer_range))
words_embeddings = tf.gather(self.word_embeddings, input_ids)
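In that snippet, self.word_embeddings is a table of shape [vocab_size, hidden_size], and tf.gather simply picks one row of it per token id. A minimal standalone sketch of that lookup, with made-up sizes:

import tensorflow as tf

vocab_size, hidden_size = 5, 3
word_embeddings = tf.random.normal([vocab_size, hidden_size])  # the lookup table
input_ids = tf.constant([[0, 2, 1, 2]])                        # [batch, seq_len] = [1, 4]
words_embeddings = tf.gather(word_embeddings, input_ids)       # [1, 4, 3]: one row per id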
An embedding layer is faster because it is essentially equivalent to a dense layer that makes simplifying assumptions.
Imagine a word-to-embedding layer with these weights:
w = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 0.0, 0.1, 0.2]]
A Dense layer will treat these as actual weights with which to perform matrix multiplication. An Embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], the 1st is w[1], etc.
As an example, take the weights above and this sentence:
[0, 2, 1, 2]
A naive Dense-based net needs to convert that sentence to a 1-hot encoding
[[1, 0, 0],
[0, 0, 1],
[0, 1, 0],
[0, 0, 1]]
then do a matrix multiplication
[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
[0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
[0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]
=
[[0.1, 0.2, 0.3, 0.4],
[0.9, 0.0, 0.1, 0.2],
[0.5, 0.6, 0.7, 0.8],
[0.9, 0.0, 0.1, 0.2]]
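To verify the arithmetic, here is a minimal sketch of that dense path in TensorFlow, using the weights and sentence from the example:

import tensorflow as tf

w = tf.constant([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8],
                 [0.9, 0.0, 0.1, 0.2]])
sentence = tf.constant([0, 2, 1, 2])

# Dense path: one-hot encode the token ids, then matrix-multiply by the weights.
one_hot = tf.one_hot(sentence, depth=3)  # shape [4, 3]
dense_result = tf.matmul(one_hot, w)     # shape [4, 4], the matrix shown above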
However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get

[w[0],
 w[2],
 w[1],
 w[2]]

which is exactly the result of the matrix multiplication above.
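Continuing the sketch above with the same w and sentence, the embedding path is just that row lookup, and it produces the same tensor:

# Embedding path: index rows of w directly with the token ids, skipping the
# one-hot encoding and all the multiplications by zero, which is where the
# speedup comes from.
embedding_result = tf.gather(w, sentence)               # shape [4, 4]
print(tf.reduce_all(embedding_result == dense_result))  # True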