AttentionMML

class keras_mml.layers.transformer.AttentionMML

Multi-headed attention layer that mostly avoids matrix multiplications.

Unlike the Keras implementation, this layer does not implement multi-headed attention as described in the Attention Is All You Need paper. Instead, it follows the description of the token mixer in Scalable MatMul-free Language Modeling (see section 3.3.1), using GRUMML as the attention mechanism.
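The idea of mixing tokens without an attention score matrix can be illustrated with an element-wise gated recurrence. The sketch below is an assumption based on a reading of the cited paper, not the code of this layer or of GRUMML: the function name gated_recurrence is hypothetical, and the projections that would produce the forget gates and candidate states are left out.

```python
from typing import List


def gated_recurrence(
    forget_gates: List[List[float]], candidates: List[List[float]]
) -> List[List[float]]:
    """Hypothetical sketch: h_t = f_t * h_{t-1} + (1 - f_t) * c_t, element-wise.

    Both arguments have shape (sequence_length, features). No matrix
    multiplication is involved; every operation acts on one feature at a time.
    """
    hidden = [0.0] * len(candidates[0])  # assumed zero initial state
    states = []
    for f_t, c_t in zip(forget_gates, candidates):
        # Blend the previous state with the candidate, feature by feature.
        hidden = [f * h + (1.0 - f) * c for f, h, c in zip(f_t, hidden, c_t)]
        states.append(hidden)
    return states


states = gated_recurrence(
    forget_gates=[[0.5, 0.0], [0.5, 1.0]],
    candidates=[[2.0, 4.0], [4.0, 8.0]],
)
print(states)  # [[1.0, 4.0], [2.5, 4.0]]
```

A forget gate of 1.0 keeps the previous state unchanged, while 0.0 replaces it with the candidate, which is how information from earlier tokens is carried forward.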

num_heads

Number of attention heads.

out_dim

Output dimension.

fully_mml

Whether to use fully matmul-less layers in the attention mechanism.

__init__(num_heads, out_dim, fully_mml=True, **kwargs)

Initializes a new instance of the layer.

Parameters:
  • num_heads (int) – Number of attention heads.

  • out_dim (int) – Output dimension.

  • fully_mml (bool, default: True) – Whether to use fully matmul-less layers in the attention mechanism.

  • **kwargs – Keyword arguments for keras.Layer.

Raises:
  • ValueError – If the number of heads is not a positive integer.

  • ValueError – If the output dimension is not a positive integer.
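The two ValueError conditions above could be checked along the following lines. This is a minimal sketch, not the library's actual validation code: the helper names is_positive_int and validate_attention_args are hypothetical, and rejecting bool values is an assumption made here for strictness.

```python
def is_positive_int(value) -> bool:
    """Return True only for integers greater than zero."""
    # Assumption: bool is rejected even though it subclasses int in Python.
    if isinstance(value, bool):
        return False
    return isinstance(value, int) and value > 0


def validate_attention_args(num_heads, out_dim) -> None:
    """Hypothetical check mirroring the documented ValueError conditions."""
    if not is_positive_int(num_heads):
        raise ValueError(
            f"Number of heads must be a positive integer, got {num_heads!r}"
        )
    if not is_positive_int(out_dim):
        raise ValueError(
            f"Output dimension must be a positive integer, got {out_dim!r}"
        )


validate_attention_args(num_heads=4, out_dim=64)  # passes silently

try:
    validate_attention_args(num_heads=0, out_dim=64)
except ValueError as error:
    print(error)
```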

build(input_shape)

Builds the layer.

Parameters:

input_shape (Tuple[int, int, int]) – Shape of the input.

call(inputs)

Applies the layer to the inputs.

Parameters:

inputs (Float[ndarray, 'batch_size sequence_length features']) – Inputs into the layer.

Returns:

Float[ndarray, 'batch_size sequence_length out_dim'] – Transformed inputs.

compute_output_shape(input_shape)

Computes the output shape of the layer.

Parameters:

input_shape (Tuple[int, int, int]) – Shape of the input into the layer.

Returns:

Tuple[int, int, int] – Shape of the output.
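Given the documented return type of call, the output shape presumably keeps the batch and sequence dimensions and replaces the feature dimension with out_dim. The sketch below illustrates that relationship in plain Python; the function attention_output_shape is a hypothetical stand-in, not the method itself.

```python
from typing import Tuple


def attention_output_shape(
    input_shape: Tuple[int, int, int], out_dim: int
) -> Tuple[int, int, int]:
    """Hypothetical stand-in: swap the last (feature) dimension for out_dim."""
    batch_size, sequence_length, _features = input_shape
    return (batch_size, sequence_length, out_dim)


shape = attention_output_shape((2, 16, 32), out_dim=64)
print(shape)  # (2, 16, 64)
```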