fusionlab.encoders.vit.vit module

class fusionlab.encoders.vit.vit.MLPBlock(hidden_size, mlp_dim, dropout_rate=0.0, act=<class 'torch.nn.modules.activation.GELU'>)

Bases: Module

A multi-layer perceptron block, based on: “Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>”

__init__(hidden_size, mlp_dim, dropout_rate=0.0, act=<class 'torch.nn.modules.activation.GELU'>)

Parameters:

hidden_size (int) – dimension of hidden layer.
mlp_dim (int) – dimension of feedforward layer. If 0, hidden_size will be used.
dropout_rate (float) – fraction of the input units to drop. Defaults to 0.0.
act (Module) – activation type and arguments. Defaults to nn.GELU.
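
To make the roles of hidden_size and mlp_dim concrete, the sketch below counts the learnable parameters of such a block. It assumes the usual ViT feedforward layout (a linear layer from hidden_size to mlp_dim, the activation, then a linear layer back to hidden_size, both with biases); that layout is an assumption taken from the cited paper, not read from the fusionlab source. The helper name mlp_block_param_count is hypothetical.

```python
def mlp_block_param_count(hidden_size: int, mlp_dim: int) -> int:
    """Count weights and biases of a two-layer MLP block (assumed layout)."""
    # Documented fallback: mlp_dim == 0 means "use hidden_size".
    inner = mlp_dim or hidden_size
    # Assumed: Linear(hidden_size -> inner), activation, Linear(inner -> hidden_size).
    first = hidden_size * inner + inner          # weights + biases
    second = inner * hidden_size + hidden_size   # weights + biases
    return first + second


print(mlp_block_param_count(768, 3072))  # 4722432
print(mlp_block_param_count(768, 0))     # 1181184 (mlp_dim falls back to 768)
```

Dropout and the activation add no parameters, so they do not appear in the count.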

forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the forward pass must be defined within this function, call the Module instance itself rather than forward() directly: calling the instance runs any registered hooks, while calling forward() directly silently skips them.
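
The difference between calling the instance and calling forward() can be illustrated with a small stand-in. TinyModule below is a hypothetical, heavily simplified imitation of the dispatch that torch.nn.Module performs; it is not the real class, and real forward hooks receive the inputs as a tuple and may also replace the output.

```python
class TinyModule:
    """Hypothetical stand-in for torch.nn.Module's call dispatch."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        # Hooks are called as hook(module, input, output) after forward().
        self._forward_hooks.append(hook)

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        out = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, out)
        return out


seen = []
m = TinyModule()
m.register_forward_hook(lambda module, inp, out: seen.append(out))

m(3)          # goes through __call__, so the hook records the output
m.forward(3)  # bypasses __call__, so the hook is skipped
print(seen)   # [6]
```

The same reasoning applies to MLPBlock, TransformerBlock and ViT: prefer model(x) over model.forward(x).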

training: bool

class fusionlab.encoders.vit.vit.TransformerBlock(hidden_size, mlp_dim, num_heads, dropout_rate=0.0, qkv_bias=False, save_attn=False)

Bases: Module

A transformer block, based on: “Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>”

__init__(hidden_size, mlp_dim, num_heads, dropout_rate=0.0, qkv_bias=False, save_attn=False)

Parameters:

hidden_size (int) – dimension of hidden layer.
mlp_dim (int) – dimension of feedforward layer.
num_heads (int) – number of attention heads.
dropout_rate (float, optional) – fraction of the input units to drop. Defaults to 0.0.
qkv_bias (bool, optional) – whether to apply a bias term to the qkv linear layer. Defaults to False.
save_attn (bool, optional) – whether to make the attention matrix accessible. Defaults to False.
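
The relationship between hidden_size and num_heads can be sketched as follows, assuming the standard multi-head attention convention in which hidden_size is split evenly across the heads. Whether fusionlab validates divisibility in this way is not stated here; the helper head_dim is hypothetical.

```python
def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head feature size under an even split of hidden_size (assumed)."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads


print(head_dim(768, 12))  # 64
```

With the ViT defaults (hidden_size=768, num_heads=12), each head works on 64 features.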

forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

training: bool

class fusionlab.encoders.vit.vit.ViT(in_channels, img_size, patch_size, hidden_size=768, mlp_dim=3072, num_layers=12, num_heads=12, pos_embed='conv', dropout_rate=0.0, spatial_dims=2, qkv_bias=False, save_attn=False)

Bases: Module

Vision Transformer (ViT), based on: “Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>”

ViT supports TorchScript, but only with PyTorch versions after 1.8.

source code: Project-MONAI/MONAI

__init__(in_channels, img_size, patch_size, hidden_size=768, mlp_dim=3072, num_layers=12, num_heads=12, pos_embed='conv', dropout_rate=0.0, spatial_dims=2, qkv_bias=False, save_attn=False)

Parameters:

in_channels (int) – number of input channels.
img_size (Union[Sequence[int], int]) – size of the input image.
patch_size (Union[Sequence[int], int]) – size of each patch.
hidden_size (int, optional) – dimension of hidden layer. Defaults to 768.
mlp_dim (int, optional) – dimension of feedforward layer. Defaults to 3072.
num_layers (int, optional) – number of transformer blocks. Defaults to 12.
num_heads (int, optional) – number of attention heads. Defaults to 12.
pos_embed (str, optional) – position embedding layer type. Defaults to “conv”.
num_classes (int, optional) – number of classes if classification is used. Defaults to 2. (Not present in the signature above.)
dropout_rate (float, optional) – fraction of the input units to drop. Defaults to 0.0.
spatial_dims (int, optional) – number of spatial dimensions. Defaults to 2.
qkv_bias (bool, optional) – whether to apply a bias to the qkv linear layer in the self-attention block. Defaults to False.
save_attn (bool, optional) – whether to make the attention matrix in the self-attention block accessible. Defaults to False.
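
How img_size, patch_size and spatial_dims interact can be sketched by computing the patch grid, which gives the number of tokens the transformer blocks process. The sketch assumes non-overlapping patches and an image size divisible by the patch size along every axis; the helper patch_grid is hypothetical and does not call fusionlab.

```python
from math import prod
from typing import Sequence, Tuple, Union

Size = Union[Sequence[int], int]


def patch_grid(img_size: Size, patch_size: Size, spatial_dims: int = 2) -> Tuple[int, ...]:
    """Number of patches along each spatial axis (assumes non-overlapping patches)."""

    def expand(value: Size) -> Tuple[int, ...]:
        # A single int is repeated for every spatial dimension.
        return (value,) * spatial_dims if isinstance(value, int) else tuple(value)

    img, patch = expand(img_size), expand(patch_size)
    if len(img) != spatial_dims or len(patch) != spatial_dims:
        raise ValueError("img_size and patch_size must match spatial_dims")
    if any(i % p for i, p in zip(img, patch)):
        raise ValueError("img_size must be divisible by patch_size on every axis")
    return tuple(i // p for i, p in zip(img, patch))


grid = patch_grid(224, 16)                 # 2D image, 16x16 patches
print(grid, prod(grid))                    # (14, 14) 196
print(prod(patch_grid((96, 96, 96), 16, spatial_dims=3)))  # 216
```

Under these assumptions, a 224x224 input with 16x16 patches yields 196 patch tokens, each embedded into hidden_size features.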

forward(x, return_features=False)

Defines the computation performed at every call.

Should be overridden by all subclasses.

training: bool

fusionlab.encoders.vit.vit.VisionTransformer: alias of ViT