End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks *[34]*.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as *[17, 18]* and *[9]*.
## 3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure *[5, 2, 35]*. Here, the encoder maps an input sequence of symbol representations $(x_{1},...,x_{n})$ to a sequence of continuous representations $\mathbf{z}=(z_{1},...,z_{n})$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_{1},...,y_{m})$ of symbols one element at a time. At each step the model is auto-regressive *[10]*, consuming the previously generated symbols as additional input when generating the next.
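To make the auto-regressive factorization concrete, the sketch below shows a greedy decoding loop in Python; the `encode`/`decode` methods and the BOS/EOS token ids are assumed stand-ins for illustration, not an interface defined in the paper.

```python
# A minimal sketch of the auto-regressive generation loop (greedy decoding).
# `model.encode`, `model.decode`, and the bos_id/eos_id token ids are
# hypothetical stand-ins, not an interface from the paper.
def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=100):
    z = model.encode(src_tokens)      # z = (z_1, ..., z_n): continuous representations
    ys = [bos_id]                     # previously generated symbols
    for _ in range(max_len):
        scores = model.decode(z, ys)  # scores over the vocabulary for the next symbol
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        ys.append(next_id)            # feed the prediction back in as additional input
        if next_id == eos_id:
            break
    return ys[1:]                     # drop the start symbol
```

Because each step conditions only on $\mathbf{z}$ and the previously generated symbols, the output sequence $(y_{1},...,y_{m})$ is produced one element at a time, exactly as the auto-regressive formulation above describes.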

___
*Figure description:* An architectural diagram of the Transformer's encoder-decoder structure. The left half shows the encoder: a stack of N× identical layers, each with a Multi-Head Attention sub-layer and a Feed Forward sub-layer, each followed by Add & Norm (a residual connection wrapping the sub-layer, then layer normalization). The right half shows the decoder: a stack of N× layers with Masked Multi-Head Attention, Multi-Head Attention over the encoder output, and Feed Forward sub-layers, also with Add & Norm. Positional Encoding is added to the Input Embedding (encoder) and the Output Embedding (decoder); the decoder consumes the outputs shifted right, and a final Linear layer followed by Softmax produces the Output Probabilities.
___
Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
### 3.1 Encoder and Decoder Stacks
**Encoder:** The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection *[11]* around each of the two sub-layers, followed by layer normalization *[1]*. That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
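As a concrete illustration of the sub-layer wiring $\text{LayerNorm}(x + \text{Sublayer}(x))$, here is a minimal PyTorch sketch of one encoder layer and a stack of $N = 6$ of them. It is not the paper's reference implementation; the head count, feed-forward width, and dropout rate are illustrative defaults rather than values fixed in this section.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and position-wise feed-forward sub-layers,
    each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        # Sub-layer 1: multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward, residual connection, layer norm.
        return self.norm2(x + self.ff(x))

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])  # N = 6 identical layers
```

The decoder layers follow the same residual-plus-normalization pattern, with an additional masked self-attention sub-layer and an attention sub-layer over the encoder output, as shown in Figure 1.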