GRUMML
- class keras_mml.layers.recurrent.GRUMML
Gated Recurrent Unit (GRU) layer, mostly without matrix multiplications.
The implementation of this layer mostly follows the \(\mathrm{MLGRU}\) implementation in Scalable MatMul-free Language Modeling (see section 3.3.1). We differ from \(\mathrm{MLGRU}\) by allowing the projections that produce \(\mathbf{g}_t\) and \(\mathbf{o}_t\) to use regular matrix multiplications, rather than only matmul-free ternary weights. Whether every projection uses ternary weights is controlled by the fully_mml attribute.
Specifically, we perform the following recurrence steps.
\[\begin{aligned} \mathbf{f}_t &= \sigma(\mathbf{x}_t\mathbf{W}_f + \mathbf{b}_f)\\ \mathbf{c}_t &= \tau(\mathbf{x}_t\mathbf{W}_c + \mathbf{b}_c)\\ \mathbf{h}_t &= \mathbf{f}_t\odot\mathbf{h}_{t-1} + (1-\mathbf{f}_t)\odot\mathbf{c}_t \\ \mathbf{g}_t &= \sigma(\mathbf{x}_t\mathbf{W}_g + \mathbf{b}_g)\\ \mathbf{o}_t' &= \mathbf{g}_t\odot\mathbf{h}_t\\ \mathbf{o}_t &= \mathbf{o}_t'\mathbf{W}_o + \mathbf{b}_o \end{aligned}\]
where
\(\mathbf{W}_f\) and \(\mathbf{W}_c\) are ternary weights (and so do not use matrix multiplications during their operation);
\(\mathbf{W}_g\) and \(\mathbf{W}_o\) are either ternary weights (when fully_mml is enabled) or regular weight matrices;
\(\sigma\) is the recurrent_activation (e.g., sigmoid activation); and
\(\tau\) is the activation (e.g., SiLU activation).
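The recurrence steps above can be sketched in plain NumPy for a single, unbatched sequence. This is an illustrative sketch rather than the layer's actual code: the helper names (`ternarize`, `gru_mml_sketch`), the absmean-style ternarization, and the single-head setting are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def silu(x):
    return x * sigmoid(x)


def ternarize(w, eps=1e-5):
    # Assumed absmean-style rounding to {-1, 0, +1}; the layer's own
    # quantizer may differ.
    scale = np.mean(np.abs(w)) + eps
    return np.clip(np.round(w / scale), -1, 1)


def gru_mml_sketch(x, params, h0=None):
    """Run the recurrence over x of shape (timesteps, features).

    Returns the stacked outputs o_t and the final hidden state h_T.
    """
    units = params["W_f"].shape[1]
    h = np.zeros(units) if h0 is None else h0
    outputs = []
    for x_t in x:
        # `@` is used for brevity; with ternary weights the product
        # reduces to additions and subtractions of input entries.
        f_t = sigmoid(x_t @ params["W_f"] + params["b_f"])  # forget gate
        c_t = silu(x_t @ params["W_c"] + params["b_c"])     # candidate state
        h = f_t * h + (1.0 - f_t) * c_t                     # hidden state
        g_t = sigmoid(x_t @ params["W_g"] + params["b_g"])  # output gate
        o_t = (g_t * h) @ params["W_o"] + params["b_o"]     # output projection
        outputs.append(o_t)
    return np.stack(outputs), h


rng = np.random.default_rng(0)
features, units, timesteps = 4, 6, 5
params = {
    # W_f and W_c are always ternary.
    "W_f": ternarize(rng.normal(size=(features, units))),
    "W_c": ternarize(rng.normal(size=(features, units))),
    # W_g and W_o stay as regular matrices here (fully_mml disabled).
    "W_g": rng.normal(size=(features, units)),
    "W_o": rng.normal(size=(units, units)),
    "b_f": np.zeros(units),
    "b_c": np.zeros(units),
    "b_g": np.zeros(units),
    "b_o": np.zeros(units),
}
outputs, h_last = gru_mml_sketch(rng.normal(size=(timesteps, features)), params)
print(outputs.shape)
```

To mimic the fully ternary case in this sketch, `W_g` and `W_o` would also be passed through `ternarize`.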
- units
Dimensionality of the output space.
- fully_mml
Whether to use matmul-free operations for all the layers.
- num_heads
Number of heads to use when performing the recurrent step.
- activation
Activation function to use.
- recurrent_activation
Activation function to use for the recurrent step.
- use_bias
Whether to use a bias vector for the layer.
- weights_initializer
Initializer for the gates’ matrices. Used for the linear transformation of the inputs.
- bias_initializer
Initializer for the bias vector.
- weights_regularizer
Regularizer function applied to the gates’ matrices.
- bias_regularizer
Regularizer function applied to the bias vector.
- weights_constraint
Constraint function applied to the gates’ matrices.
- bias_constraint
Constraint function applied to the bias vector.
- __init__(units, fully_mml=False, num_heads=1, activation='silu', recurrent_activation='sigmoid', use_bias=True, weights_initializer='glorot_uniform', bias_initializer='zeros', weights_regularizer=None, bias_regularizer=None, weights_constraint=None, bias_constraint=None, **kwargs)
Initializes a new instance of the layer.
- Parameters:
units (int) – Dimensionality of the output space.
fully_mml (bool, default: False) – Whether to use matmul-free operations for all the layers.
num_heads (int, default: 1) – Number of heads to use for the recurrent step. See HGRN2: Gated Linear RNNs with State Expansion, section 3.2, for details on the multi-headed variant.
activation (str, default: 'silu') – Activation function to use.
recurrent_activation (str, default: 'sigmoid') – Activation function to use for the recurrent step.
use_bias (bool, default: True) – Whether to use a bias vector for the layer.
weights_initializer (str, default: 'glorot_uniform') – Initializer for the gates’ matrices. Used for the linear transformation of the inputs.
bias_initializer (str, default: 'zeros') – Initializer for the bias vector.
weights_regularizer (Optional[str], default: None) – Regularizer function applied to the gates’ matrices.
bias_regularizer (Optional[str], default: None) – Regularizer function applied to the bias vector.
weights_constraint (Optional[str], default: None) – Constraint function applied to the gates’ matrices.
bias_constraint (Optional[str], default: None) – Constraint function applied to the bias vector.
**kwargs – Keyword arguments for keras.Layer.
- Raises:
ValueError – If the units provided is not a positive integer.
ValueError – If the number of heads to use is not a positive integer.
ValueError – If the number of heads does not divide the units provided.
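The three error conditions listed above can be illustrated with a small stand-alone check. The function name `check_units_and_heads`, the error messages, and the returned per-head size are hypothetical and only meant to show the conditions; the layer's own validation code may be organized differently.

```python
def check_units_and_heads(units, num_heads):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    if not isinstance(units, int) or units <= 0:
        raise ValueError(f"units must be a positive integer, got {units!r}")
    if not isinstance(num_heads, int) or num_heads <= 0:
        raise ValueError(f"num_heads must be a positive integer, got {num_heads!r}")
    if units % num_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must divide units ({units}) evenly"
        )
    # Assumption: each head works on an equal slice of the units.
    return units // num_heads


print(check_units_and_heads(8, 2))
```

For instance, `units=8` with `num_heads=3` would be rejected under these rules because 3 does not divide 8.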
- call(sequences, initial_state=None, mask=None, training=False)
Calling method of the layer.
- Parameters:
sequences (Float[ndarray, 'batch_size timesteps features']) – Inputs into the layer.
initial_state (Optional[List], default: None) – List of initial state tensors to be passed to the first call of the cell. If not provided, zero-filled initial state tensors are created.
mask (Optional[Any], default: None) – Binary tensor indicating whether a given timestep should be masked. An individual True entry indicates that the corresponding timestep should be utilized, while a False entry indicates that the corresponding timestep should be ignored.
training (bool, default: False) – Indicates whether the layer should behave in training mode or in inference mode. This argument is passed to the cell when calling it.
- Returns:
Float[ndarray, 'batch_size timesteps'] – Transformed inputs.
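The roles of mask and initial_state can be illustrated with a toy scan over one sequence. This sketch assumes the convention commonly used by Keras recurrent layers, where an ignored timestep leaves the state unchanged; whether GRUMML handles masked timesteps in exactly this way is an assumption, and `masked_scan` is a hypothetical helper, not part of the library.

```python
import numpy as np


def masked_scan(step, sequence, mask, initial_state):
    """Apply `step` at timesteps where mask is True; otherwise carry the state.

    Assumed behaviour for illustration only.
    """
    state = initial_state
    states = []
    for x_t, keep in zip(sequence, mask):
        if keep:
            state = step(x_t, state)
        states.append(state)
    return np.stack(states)


# Toy step function: a running sum of the inputs.
sequence = np.array([[1.0], [2.0], [3.0], [4.0]])
mask = np.array([True, False, True, True])
states = masked_scan(lambda x_t, s: s + x_t, sequence, mask, np.zeros(1))
print(states.ravel())
```

In this toy run the second timestep is skipped, so its input never enters the running sum and the state from the first timestep is repeated.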