GRUMML

class keras_mml.layers.recurrent.GRUMML

Gated Recurrent Unit (GRU) layer, mostly without matrix multiplications.

The implementation of this layer mostly follows the \(\mathrm{MLGRU}\) implementation in Scalable MatMul-free Language Modeling (see section 3.3.1). It differs from \(\mathrm{MLGRU}\) by allowing \(\mathbf{g}_t\) and \(\mathbf{o}_t\) to be computed with regular matrix multiplications, rather than only with matmul-free ternary weights. The fully_mml attribute controls whether all weights are ternary.
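To make the "matmul-free" idea concrete, here is a minimal pure-Python sketch of a vector-matrix product with a ternary matrix. Because every entry is -1, 0, or 1, the product needs only additions and subtractions. The helper name `ternary_matvec` is hypothetical and not part of the library, and the sketch ignores any scaling factors or activation quantization that the actual layer may apply.

```python
def ternary_matvec(x, w):
    """Compute x @ w for a ternary matrix w using only additions and subtractions.

    x: list of floats of length n_in.
    w: list of n_in rows, each a list of n_out entries drawn from {-1, 0, 1}.
    """
    n_out = len(w[0])
    out = [0.0] * n_out
    for x_i, row in zip(x, w):
        for j, w_ij in enumerate(row):
            if w_ij == 1:
                out[j] += x_i   # +1 entry: add the input
            elif w_ij == -1:
                out[j] -= x_i   # -1 entry: subtract the input
            # 0 entry: the input contributes nothing
    return out


x = [0.5, -1.0, 2.0]
w = [[1, 0],
     [-1, 1],
     [0, -1]]
y = ternary_matvec(x, w)
print(y)
```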

Specifically, we perform the following recurrence steps.

\[\begin{aligned}
\mathbf{f}_t &= \sigma(\mathbf{x}_t\mathbf{W}_f + \mathbf{b}_f)\\
\mathbf{c}_t &= \tau(\mathbf{x}_t\mathbf{W}_c + \mathbf{b}_c)\\
\mathbf{h}_t &= \mathbf{f}_t\odot\mathbf{h}_{t-1} + (1-\mathbf{f}_t)\odot\mathbf{c}_t\\
\mathbf{g}_t &= \sigma(\mathbf{x}_t\mathbf{W}_g + \mathbf{b}_g)\\
\mathbf{o}_t' &= \mathbf{g}_t\odot\mathbf{h}_t\\
\mathbf{o}_t &= \mathbf{o}_t'\mathbf{W}_o + \mathbf{b}_o
\end{aligned}\]

where

  • \(\mathbf{W}_f\) and \(\mathbf{W}_c\) are ternary weights (and so do not use matrix multiplications during their operation);

  • \(\mathbf{W}_g\) and \(\mathbf{W}_o\) are either ternary weights (when fully_mml is set) or regular weight matrices;

  • \(\sigma\) is the recurrent_activation (e.g., Sigmoid activation); and

  • \(\tau\) is the activation (e.g., SiLU activation).
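The recurrence above can be sketched for a single timestep and a single (unbatched) input vector as follows. This is an illustrative pure-Python sketch, not the library's implementation: the function `grumml_step`, the helper `affine`, and the parameter dictionary layout are all hypothetical, it uses ordinary multiplications even for the ternary matrices, and it assumes a single head with sigmoid and SiLU as the two activations.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def silu(v):
    # SiLU(a) = a * sigmoid(a)
    return [a / (1.0 + math.exp(-a)) for a in v]


def affine(x, w, b):
    """Return x @ w + b for a row vector x, a matrix w (one row per input), and a bias b."""
    return [sum(x_i * row[j] for x_i, row in zip(x, w)) + b[j] for j in range(len(b))]


def grumml_step(x_t, h_prev, p):
    """Apply one step of the recurrence; returns (o_t, h_t)."""
    f_t = sigmoid(affine(x_t, p["W_f"], p["b_f"]))            # forget gate
    c_t = silu(affine(x_t, p["W_c"], p["b_c"]))               # candidate state
    h_t = [f * h + (1.0 - f) * c for f, h, c in zip(f_t, h_prev, c_t)]
    g_t = sigmoid(affine(x_t, p["W_g"], p["b_g"]))            # output gate
    o_prime = [g * h for g, h in zip(g_t, h_t)]
    o_t = affine(o_prime, p["W_o"], p["b_o"])                 # output projection
    return o_t, h_t


# Toy parameters: 2 input features, 2 units. W_f and W_c are ternary.
params = {
    "W_f": [[1, 0], [0, -1]], "b_f": [0.0, 0.0],
    "W_c": [[0, 1], [1, 0]], "b_c": [0.0, 0.0],
    "W_g": [[0.3, -0.2], [0.1, 0.4]], "b_g": [0.0, 0.0],
    "W_o": [[1.0, 0.0], [0.0, 1.0]], "b_o": [0.0, 0.0],
}
o_t, h_t = grumml_step([0.5, -1.5], [0.0, 0.0], params)
print(o_t, h_t)
```

Note that \(\mathbf{h}_t\) depends on \(\mathbf{h}_{t-1}\) only through an element-wise product, so the hidden-state update itself involves no matrix multiplication.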

units

Dimensionality of the output space.

fully_mml

Whether to use matmul-free (ternary) operations for all weight matrices, including \(\mathbf{W}_g\) and \(\mathbf{W}_o\).

num_heads

Number of heads to use when performing the recurrent step.

activation

Activation function to use.

recurrent_activation

Activation function to use for the recurrent step.

use_bias

Whether to use a bias vector for the layer.

weights_initializer

Initializer for the gates’ matrices. Used for the linear transformation of the inputs.

bias_initializer

Initializer for the bias vector.

weights_regularizer

Regularizer function applied to the gates’ matrices.

bias_regularizer

Regularizer function applied to the bias vector.

weights_constraint

Constraint function applied to the gates’ matrices.

bias_constraint

Constraint function applied to the bias vector.

__init__(units, fully_mml=False, num_heads=1, activation='silu', recurrent_activation='sigmoid', use_bias=True, weights_initializer='glorot_uniform', bias_initializer='zeros', weights_regularizer=None, bias_regularizer=None, weights_constraint=None, bias_constraint=None, **kwargs)

Initializes a new instance of the layer.

Parameters:
  • units (int) – Dimensionality of the output space.

  • fully_mml (bool, default: False) – Whether to use matmul-free (ternary) operations for all weight matrices, including \(\mathbf{W}_g\) and \(\mathbf{W}_o\).

  • num_heads (int, default: 1) – Number of heads to use for the recurrent step. See HGRN2: Gated Linear RNNs with State Expansion, section 3.2, for details on the multi-headed variant.

  • activation (str, default: 'silu') – Activation function to use.

  • recurrent_activation (str, default: 'sigmoid') – Activation function to use for the recurrent step.

  • use_bias (bool, default: True) – Whether to use a bias vector for the layer.

  • weights_initializer (str, default: 'glorot_uniform') – Initializer for the gates’ matrices. Used for the linear transformation of the inputs.

  • bias_initializer (str, default: 'zeros') – Initializer for the bias vector.

  • weights_regularizer (Optional[str], default: None) – Regularizer function applied to the gates’ matrices.

  • bias_regularizer (Optional[str], default: None) – Regularizer function applied to the bias vector.

  • weights_constraint (Optional[str], default: None) – Constraint function applied to the gates’ matrices.

  • bias_constraint (Optional[str], default: None) – Constraint function applied to the bias vector.

  • **kwargs – Keyword arguments for keras.Layer.

Raises:
  • ValueError – If units is not a positive integer.

  • ValueError – If num_heads is not a positive integer.

  • ValueError – If num_heads does not divide units.
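The three error conditions above can be sketched as a standalone check. The function `check_grumml_args` is a hypothetical illustration and not the library's code. The returned per-head size (`units // num_heads`) is also an assumption about how the units are shared among heads.

```python
def check_grumml_args(units, num_heads=1):
    """Validate units and num_heads; return the assumed number of units per head."""
    if not isinstance(units, int) or units <= 0:
        raise ValueError(f"units must be a positive integer, got {units!r}")
    if not isinstance(num_heads, int) or num_heads <= 0:
        raise ValueError(f"num_heads must be a positive integer, got {num_heads!r}")
    if units % num_heads != 0:
        raise ValueError(f"num_heads ({num_heads}) must divide units ({units})")
    return units // num_heads


print(check_grumml_args(units=8, num_heads=2))

try:
    check_grumml_args(units=8, num_heads=3)
except ValueError as err:
    print(f"rejected: {err}")
```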

call(sequences, initial_state=None, mask=None, training=False)

Runs the layer on a batch of input sequences.

Parameters:
  • sequences (Float[ndarray, 'batch_size timesteps features']) – Inputs into the layer.

  • initial_state (Optional[List], default: None) – List of initial state tensors to be passed to the first call of the cell. If not provided, zero-filled initial state tensors are created.

  • mask (Optional[Any], default: None) – Binary tensor indicating whether a given timestep should be masked. An individual True entry indicates that the corresponding timestep should be utilized, while a False entry indicates that the corresponding timestep should be ignored.

  • training (bool, default: False) – Indicates whether the layer should behave in training mode or in inference mode. This argument is passed to the cell when calling it.

Returns:

Float[ndarray, 'batch_size timesteps'] – Transformed inputs.
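The roles of initial_state and mask can be illustrated with a small pure-Python scan over one unbatched sequence. This is a sketch under stated assumptions, not the layer's implementation: `scan` and `toy_step` are hypothetical names, the step is a running sum standing in for the real cell, and a masked timestep is assumed to carry the previous state forward unchanged, as is conventional for Keras RNN layers.

```python
def toy_step(x_t, h_prev):
    # Stand-in for the real cell: a running sum of the inputs.
    return [h + x for h, x in zip(h_prev, x_t)]


def scan(step, sequence, units, initial_state=None, mask=None):
    """Run step over a sequence and return the state after every timestep."""
    # With no initial state, start from a zero-filled state.
    h = [0.0] * units if initial_state is None else list(initial_state)
    states = []
    for t, x_t in enumerate(sequence):
        # A True mask entry means the timestep is used; False means it is skipped.
        if mask is None or mask[t]:
            h = step(x_t, h)
        states.append(list(h))
    return states


sequence = [[1.0], [10.0], [100.0]]
masked = scan(toy_step, sequence, units=1, mask=[True, False, True])
seeded = scan(toy_step, sequence, units=1, initial_state=[5.0])
print(masked)
print(seeded)
```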

classmethod from_config(config)

Creates the layer from the given configuration.

Parameters:

config (Dict[str, Any]) – Configuration dictionary.

Returns:

GRUMML – Created instance.

get_config()

Gets the configuration for the layer.

Returns:

Dict[str, Any] – Layer configuration.
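Together, get_config and from_config allow a layer to be rebuilt from a plain dictionary. The round trip can be sketched with a stand-in class. `ToyLayer` is hypothetical and keeps only three of the constructor arguments. The real layer's configuration presumably contains more keys, including those of the base keras.Layer.

```python
class ToyLayer:
    """Minimal stand-in showing the get_config / from_config round trip."""

    def __init__(self, units, fully_mml=False, num_heads=1):
        self.units = units
        self.fully_mml = fully_mml
        self.num_heads = num_heads

    def get_config(self):
        # Everything needed to call __init__ again, as plain Python values.
        return {
            "units": self.units,
            "fully_mml": self.fully_mml,
            "num_heads": self.num_heads,
        }

    @classmethod
    def from_config(cls, config):
        return cls(**config)


layer = ToyLayer(units=8, num_heads=2)
clone = ToyLayer.from_config(layer.get_config())
print(clone.get_config())
```

The clone is a new object with the same hyperparameters; weights are not part of the configuration in this sketch.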