AttentionMML

class keras_mml.layers.transformer.AttentionMML

Multi-headed attention layer that mostly avoids matrix multiplications.

Unlike the Keras implementation, this layer does not implement the multi-headed attention described in the Attention Is All You Need paper. Instead, it follows the description of the token mixer in Scalable MatMul-free Language Modeling (see section 3.3.1), using GRUMML as the attention mechanism.

num_heads: Number of attention heads.

out_dim: Output dimension.

fully_mml: Whether to use fully matmul-less layers in the attention mechanism.

__init__(num_heads, out_dim, fully_mml=True, **kwargs)

Initializes a new instance of the layer.

Parameters:

num_heads (int) – Number of attention heads.
out_dim (int) – Output dimension.
fully_mml (bool, default: True) – Whether to use fully matmul-less layers in the attention mechanism.
**kwargs – Keyword arguments for keras.Layer.

Raises:

ValueError – If the number of heads is not a positive integer.
ValueError – If the output dimension is not a positive integer.
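
The two error conditions above can be illustrated with a minimal sketch. The helper name validate_attention_args is hypothetical and is not part of keras_mml; the actual checks and messages inside the layer may differ.

```python
def validate_attention_args(num_heads, out_dim):
    """Hypothetical helper mirroring the documented ValueError conditions.

    This is an illustration only, not the library's own validation code.
    """
    # Reject anything that is not an int greater than zero.
    if not isinstance(num_heads, int) or num_heads <= 0:
        raise ValueError(f"num_heads must be a positive integer, got {num_heads!r}")
    if not isinstance(out_dim, int) or out_dim <= 0:
        raise ValueError(f"out_dim must be a positive integer, got {out_dim!r}")


# Valid arguments pass silently.
validate_attention_args(num_heads=4, out_dim=64)
```

Under these assumptions, a call such as validate_attention_args(0, 64) raises ValueError before any weights would be created.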

build(input_shape)

Builds the layer.

Parameters:

input_shape (Tuple[int, int, int]) – Shape of the input.

call(inputs)

Applies the layer to the inputs.

Parameters:

inputs (Float[ndarray, 'batch_size sequence_length features']) – Inputs into the layer.

Returns:

Float[ndarray, 'batch_size sequence_length out_dim'] – Transformed inputs.

compute_output_shape(input_shape)

Computes the output shape of the layer.

Parameters:

input_shape (Tuple[int, int, int]) – Shape of the input into the layer.

Returns:

Tuple[int, int, int] – Shape of the output.
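
Going by the shapes documented for call, the output keeps the batch and sequence dimensions of the input and replaces the feature dimension with out_dim. A minimal sketch of that mapping follows. The function attention_output_shape is a hypothetical stand-in, not the library's method, and allowing None for the batch size is an assumption based on common Keras practice.

```python
from typing import Optional, Tuple


def attention_output_shape(
    input_shape: Tuple[Optional[int], int, int], out_dim: int
) -> Tuple[Optional[int], int, int]:
    """Hypothetical stand-in for the documented shape mapping.

    (batch_size, sequence_length, features) -> (batch_size, sequence_length, out_dim)
    """
    batch_size, sequence_length, _features = input_shape
    # Only the last (feature) dimension changes.
    return (batch_size, sequence_length, out_dim)


shape = attention_output_shape((2, 16, 64), out_dim=32)
```

Here shape is (2, 16, 32): the batch size of 2 and sequence length of 16 are preserved, and the 64 input features are replaced by the 32 output dimensions.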