Massively Multimodal
Masked Modeling

4M pull figure.

4M enables training versatile multimodal and multitask models that can perform a diverse set of vision tasks out of the box, as well as multimodal conditional generation. This, coupled with the models' ability to perform in-painting, enables powerful image editing capabilities. These generalist models transfer well to a broad range of downstream tasks or to novel modalities, and can be easily fine-tuned into more specialized variants.


Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision.

We take a step in this direction and propose a multimodal training scheme called 4M, short for Massively Multimodal Masked Modeling. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities — including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities through mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens.

4M leads to models that exhibit several key capabilities:

  1. they can perform a diverse set of vision tasks out of the box,
  2. they excel when fine-tuned for unseen downstream tasks or new input modalities, and
  3. they can function as generative models that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities.

Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains. Please see our GitHub repository for code and pre-trained models.

Modalities overview

4M enables training a single model on tens of diverse modalities. The resulting model can generate any of the modalities from any subset of them.

Introductory video (6 min).

Papers & code

4M: Massively Multimodal Masked Modeling

1EPFL   2Apple
* Equal contribution
NeurIPS 2023 (spotlight)

This paper introduces the 4M framework for training multimodal and multitask models and applies it to a diverse set of tokenized modalities including text, images, geometric, and semantic modalities, as well as neural network feature maps. We investigate the capabilities of 4M models through a series of transfer experiments, and study key design choices in an extensive ablation.

Scaling Vision Models to Tens of Tasks and Modalities

1EPFL   2Apple
* Equal contribution
arXiv 2024

We scale 4M to 21 diverse types of modalities, including human poses and shape, SAM instances, and metadata, and propose modality-specific tokenization approaches. We successfully scale up the training to 3 billion parameter models, demonstrate co-training on vision and language, and showcase strong out-of-the-box vision capabilities. On this page, we already showcase some results from a 4M-21 XL model trained on all 21 modalities. Stay tuned for more, coming soon!

4M method overview

4M is a multimodal and multitask training scheme that enables the training of versatile any-to-any models, i.e., models capable of predicting or generating any modality from any subset of the other modalities. 4M models can perform a wide variety of vision tasks out of the box and excel when fine-tuned to novel downstream tasks.

4M training

By tokenizing modalities into sequences of discrete tokens, we can train a single unified Transformer encoder-decoder on a diverse set of modalities, including text, images, geometric, and semantic modalities, as well as neural network feature maps. 4M is trained by mapping one random subset of tokens to another. Please see the animated visualization below for an overview of the 4M training scheme.

(Left): 4M is a framework for training multimodal and multitask models that operate on tokenized versions of multiple image-like modalities (such as RGB, depth, etc.) and sequence modalities (such as captions and bounding boxes). (Right): The 4M training objective consists of training a Transformer encoder-decoder to predict a randomly selected subset of tokens, which is sampled from all modalities, based on another random subset of tokens.
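The masking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the per-modality share of the input and target budgets is drawn from a symmetric Dirichlet distribution (the "Input α & Target α" ablation suggests such a parameter), and that targets are drawn from tokens not used as inputs. The names `sample_masks`, `num_input`, `num_target`, and `alpha` are hypothetical.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_masks(tokens_per_modality, num_input, num_target, alpha=1.0, seed=0):
    """Pick disjoint random input/target token indices for each modality.

    tokens_per_modality maps a modality name to its number of tokens.
    Returns two dicts (inputs, targets) mapping names to sorted index lists.
    Budgets are upper bounds: shares are floored, so totals never exceed them.
    """
    rng = random.Random(seed)
    names = list(tokens_per_modality)
    in_share = dirichlet(alpha, len(names), rng)    # assumed sampling scheme
    tgt_share = dirichlet(alpha, len(names), rng)
    inputs, targets = {}, {}
    for name, p_in, p_tgt in zip(names, in_share, tgt_share):
        n = tokens_per_modality[name]
        order = list(range(n))
        rng.shuffle(order)
        n_in = min(n, int(p_in * num_input))
        n_tgt = min(n - n_in, int(p_tgt * num_target))
        inputs[name] = sorted(order[:n_in])
        targets[name] = sorted(order[n_in:n_in + n_tgt])
    return inputs, targets


inputs, targets = sample_masks(
    {"rgb": 196, "depth": 196, "caption": 32}, num_input=128, num_target=128, alpha=0.5
)
```

A small `alpha` concentrates the budget on few modalities per sample, while a large `alpha` spreads it more evenly; the encoder would see only `inputs`, and the decoder would be trained to predict `targets`.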

Multimodal chained generation

The trained 4M models can be used to generate any modality from any combination of other modalities, and are able to perform prediction from partial inputs. When predicting multiple modalities from one, rather than predicting each individually, 4M can be used to predict them one by one, always looping fully generated modalities back into the input and conditioning the generation of subsequent modalities on them. As a consequence, all training modalities can be predicted in a self-consistent manner. Please see the animation below for an illustration of how multimodal chained generation is performed with 4M.

This simplified example illustrates the generation of a full RGB image from a partial RGB and bounding box input using the MaskGIT decoding scheme, followed by autoregressive generation of a caption. Note that through chaining (i.e. using fully generated modalities as conditioning when generating subsequent modalities), we can predict multiple modalities in a self-consistent manner. This is in contrast to independently generating each modality from the original conditioning, where each generated output is consistent with the input but not necessarily with other outputs. Generated tokens can be turned back into images, text, and other modalities, using the detokenizers.
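The chaining loop itself is simple and can be sketched as follows. This is an illustrative outline under assumptions: `predict` stands in for a full per-modality decoding routine (e.g. MaskGIT-style or autoregressive), and `toy_predict` is a hypothetical placeholder that only records what it was conditioned on.

```python
def chained_generation(predict, conditioning, target_order):
    """Generate modalities one by one, feeding each result back as input.

    predict(context, modality) returns the tokens for `modality` given the
    dict `context` of already available modalities.
    """
    context = dict(conditioning)          # do not mutate the caller's dict
    generated = {}
    for modality in target_order:
        tokens = predict(context, modality)
        generated[modality] = tokens
        context[modality] = tokens        # loop the output back into the input
    return generated


def toy_predict(context, modality):
    """Placeholder model: returns the names of the modalities it conditioned on."""
    return sorted(context)


outputs = chained_generation(
    toy_predict, {"caption": ["a", "cat"]}, ["rgb", "depth", "normals"]
)
# outputs["normals"] lists caption, depth and rgb: every earlier output was
# available as conditioning when the last modality was generated.
```

Generating each modality independently would correspond to calling `predict(conditioning, modality)` without updating `context`, which is why those outputs need not agree with one another.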


Using the 4M training objective, we can train a single model on tens of highly diverse modalities including several semantic and geometric modalities, feature maps from recent state of the art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata and color palettes.


Tokenization consists of converting modalities and tasks into sequences or sets of discrete tokens, thereby unifying the representation space of all modalities. This is critical for training large multimodal models. We use different tokenization approaches for modalities with different characteristics; see the figure below for an overview.

Tokenization overview

As illustrated above, we employ suitable tokenization schemes for different modalities based on their format and performance. For image-like modalities and feature maps, we leverage spatial VQ-VAEs, with optional diffusion decoders for detail-rich modalities like RGB. We tokenize non-spatial modalities, e.g. parameterized poses, using VQ-VAEs with MLP encoders and decoders. All sequence modalities are encoded as text using a WordPiece tokenizer. The shown examples are real tokenizer reconstructions.
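To make the idea of discrete tokens concrete, the core quantization step of a VQ-VAE can be sketched as a nearest-neighbor lookup in a codebook. This is a toy sketch, not the tokenizers used for 4M: the encoder and decoder networks are omitted, and the tiny 2-D `codebook` and `features` are made up for illustration.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in vectors
    ]


def dequantize(token_ids, codebook):
    """Look the token ids up again; a decoder would map these to the modality."""
    return [codebook[i] for i in token_ids]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # hypothetical 3-entry codebook
features = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]    # stand-in encoder outputs
tokens = quantize(features, codebook)               # [1, 2, 0]
```

The resulting integer ids are what a Transformer would consume and predict; in a spatial VQ-VAE there would be one such id per position of the encoder's feature grid.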

Steerable multimodal generation

One of the consequences of 4M's multimodal masked modeling objective is that it enables steerable multimodal generation of any modality given any subset of (partial) other modalities. This not only enables an exciting level of control over the generation process, but also allows for predicting multiple tasks in a consistent manner through chained generation (as demonstrated above).

We invite you to explore the capabilities of 4M models by interacting with the interactive visualizations below. They are best viewed on a desktop computer.

Any-to-any generation

4M allows for generating any modality from any subset of other modalities in a self-consistent manner. We achieve this level of consistency by looping predicted modalities back into the input when generating subsequent ones. Remarkably, 4M is able to do this without needing any of the loss-balancing or architectural modifications commonly used in multitask learning. Below, we showcase the generation of all modalities given one input modality, but as we show in other visualizations further below, 4M can also effectively integrate information from multiple inputs.

4M can generate all modalities from a given input modality using chained generation. Notice the high consistency among the predictions of all modalities for an input. Each row starts from a different modality coming from the same data sample.

RGB-to-all generation

In this example, we illustrate the generation of several modalities from an RGB image. As demonstrated, 4M is able to generate all modalities in a self-consistent manner.

Hint: Use the buttons to explore different images.

Fine-grained generation & editing

Because 4M supports any-to-any prediction and generation from partial inputs, it can perform fine-grained multimodal generation and editing, as shown in the examples below. Key to this is that certain modalities, such as semantic segmentation or depth maps, can serve as intermediate steps in the generation process: they ground the generation of subsequent modalities while allowing high-level semantic edits to be made on them.

Multimodal editing

4M's in-painting and any-to-any prediction abilities unlock a suite of multimodal generation and editing capabilities, which allow for fine-grained creative control.

In the following example, we showcase the generation of an RGB image conditioned on captions and bounding boxes, and vary the position and shape of only one bounding box. Besides giving users a higher degree of control, notice how the model is able to make sense of unusual bounding box inputs (e.g. the bicycle above a bed is turned into a painting).

Caption input

Bounding boxes input

RGB prediction

Hint: Drag the slider to change the position of the bounding box input. Use the buttons to explore different images.

In this example, we take a semantic segmentation map extracted from a reference image, and show how 4M resolves edits of a single class in the segmentation map. Notice how semantically stable the fixed parts are, but how they can change in appearance (e.g. the mountains in the first image become snow too when the class of the ground changes to snow).

Semantic input

RGB prediction

Hint: Drag the slider to change the semantic input. Use the buttons to explore different images.

Multimodal guidance

We are able to perform compositional generation by weighting different conditions by different amounts. This allows users to control precisely how strongly or weakly a generated output should follow each condition, as shown in the example below. It can further be used to avoid the generation of certain undesired concepts via negative weighting.
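One common way to weight several conditions is a classifier-free-guidance-style combination of logits, sketched below. Whether 4M uses exactly this formula is an assumption; the function names and the toy logits are hypothetical. Each condition contributes its difference from the unconditional prediction, scaled by its weight, so a negative weight pushes the output away from that condition.

```python
import math


def guided_logits(uncond, cond_logits, weights):
    """Combine logits as uncond + sum_i w_i * (cond_i - uncond)."""
    out = list(uncond)
    for name, logits in cond_logits.items():
        w = weights[name]
        out = [o + w * (c - u) for o, c, u in zip(out, logits, uncond)]
    return out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


uncond = [0.0, 0.0, 0.0]                       # logits with no conditioning
cond = {
    "caption": [1.0, 0.0, 0.0],                # logits given only the caption
    "depth": [0.0, 1.0, 0.0],                  # logits given only the depth map
}
logits = guided_logits(uncond, cond, {"caption": 2.0, "depth": -1.0})
probs = softmax(logits)                        # token 0 favored, token 1 suppressed
```

With a single condition and weight 1.0 the formula reduces to the plain conditional logits, which is a useful sanity check.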

Caption input
(Fixed weight: 2.0)

RGB prediction

Depth input
(Guidance weight: 49)

Hint: Drag the slider to change the guidance weight of the depth conditioning. Use the buttons to explore different images.

Steerable data generation

We extracted different kinds of image, semantic, and geometric metadata from the RGB images and various pseudo labels, and trained a 4M model on all of them. This enables a large degree of controllability over the generation process, as shown in the interactive visualization below. Since 4M can generate multiple modalities in a self-consistent manner, this has exciting potential for future research into steerable dataset generation.

Hint: Select parts of the histograms to filter the data. Click on unselected parts of the histograms to reset the selection.

SAM clutter score (# of SAM instances)
COCO clutter score (# of COCO instances)
Crowdedness score (# of human instances)
Semantic diversity (# of unique classes)
Instance diversity (# of unique instance classes)
Objectness score (% of object pixels)
Walkability score (% of walkable pixels)
Occlusion score (% of occlusion edge pixels)

Geometric complexity (normals ang. variance)
Original image width
Original image height
Image brightness
Image contrast
Image saturation
Image colorfulness
Image entropy
Displayed modalities: RGB, Surface Normals, Semantic Segmentation, Humans

Improved text understanding

We observe that 4M models trained on a larger variety of modalities and co-trained on a large text corpus (here denoted 4M-21) exhibit a higher degree of text understanding than 4M-7, which is trained on a smaller set of modalities. This can be observed both when conditioning on T5-XXL embeddings (a common technique) and when inputting raw captions.

4M pre-trained on a larger variety of modalities and co-trained on a text corpus exhibits improved text understanding capabilities.

Multimodal retrieval

4M enables multimodal retrieval by predicting the global embeddings of DINOv2 and ImageBind models from any subset of the input modalities. This unlocks new retrieval capabilities that were not possible with the vanilla DINOv2 and ImageBind models. Below we exemplify this for two cases, namely retrieving RGB images from any modality (Any-to-RGB) and retrieving any modality from RGB (RGB-to-any).
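Once a global embedding has been predicted for a query, retrieval reduces to a nearest-neighbor search in embedding space. The sketch below assumes cosine similarity as the ranking criterion and uses made-up 2-D embeddings; in practice the query vector would be the embedding predicted by 4M from the query modality, and the gallery would hold embeddings of the candidate items. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, gallery, k=3):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(
        gallery.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


gallery = {                       # hypothetical gallery of image embeddings
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
    "img_d": [-1.0, 0.0],
}
top3 = retrieve_top_k([1.0, 0.0], gallery, k=3)   # ["img_a", "img_b", "img_c"]
```

Because the query embedding can be predicted from a depth map, a caption, or any other subset of modalities, the same search routine serves every query type.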

Any-to-RGB Retrieval

As shown below, we can retrieve RGB images from distinctly different query modalities. Note that each query modality constrains the retrieved RGBs differently (e.g. semantically or geometrically).

Depth Query

RGB Retrieval (Top-3)

Hint: Drag the slider to change the query modality. Use the buttons to explore different query instances.

RGB-to-any retrieval

Likewise, given an RGB image as a query input, we can retrieve any other modality.

RGB input

RGB Retrieval (Top-3)

Hint: Drag the slider to change the retrieved modality. Use the buttons to explore different query images.

Transfers & ablations

We perform an extensive set of ablations and transfer experiments to understand the impact of different design choices of 4M training and to showcase its transfer capabilities. We show that 4M is a scalable and versatile training method that transfers well to a wide range of downstream tasks.


We transfer 4M models of different sizes to ImageNet-1K classification, COCO object detection and instance segmentation, ADE20K semantic segmentation, and NYUv2 depth estimation. 4M outperforms the baselines on all tasks except ImageNet-1K classification, where it is surpassed by DeiT III, a specialized model. In contrast to 4M, all of the baselines employed data augmentations to achieve their results.

Method                    Training data   ImageNet-1K      COCO        COCO         ADE20K     NYUv2 depth
                                          (Top 1 acc. ↑)   (APbox ↑)   (APmask ↑)   (mIoU ↑)   (δ1 acc. ↑)
MAE B                     IN-1K           84.2             48.3        41.6         46.1       89.1
DeiT III B                IN-21K          85.4             46.1        38.5         49.0       87.4
MultiMAE B                IN-1K           84.0             44.1        37.8         46.2       89.0
4M-B (RGB → RGB only)     CC12M           82.8             42.3        36.6         38.3       80.4
4M-B (RGB → CLIP only)    CC12M           83.4             46.6        39.9         43.0       85.7
4M-B                      CC12M           84.5             49.7        42.7         50.1       92.0
MAE L                     IN-1K           86.8             52.8        45.3         51.8       93.6
DeiT III L                IN-21K          87.0             48.7        41.1         52.0       89.6
4M-L                      CC12M           86.6             53.7        46.4         53.4       94.4
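The last column of the table reports an accuracy for NYUv2 depth estimation. Assuming it is the threshold accuracy commonly written δ1, i.e. the percentage of pixels whose predicted depth is within a factor of 1.25 of the ground truth, it could be computed as in the sketch below. The function name and the toy values are illustrative only.

```python
def delta1_accuracy(pred, target, threshold=1.25):
    """Percentage of valid pixels with max(pred/target, target/pred) < threshold."""
    valid = [(p, t) for p, t in zip(pred, target) if p > 0 and t > 0]
    hits = sum(1 for p, t in valid if max(p / t, t / p) < threshold)
    return 100.0 * hits / len(valid)


pred = [1.0, 2.0, 3.0, 4.0]       # toy predicted depths in meters
target = [1.1, 2.0, 4.0, 8.0]     # toy ground-truth depths in meters
score = delta1_accuracy(pred, target)   # 50.0: the first two pixels pass
```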


We perform an extensive ablation study to understand key design choices of 4M training, including what modalities to pre-train on, how to sample from them for the multimodal masking scheme, and how many input and output tokens should be used. We measure the effect of different design choices by transferring the trained models to a diverse set of downstream tasks and reporting the average loss. We note that training 4M on all modalities as both inputs and targets results in a generalist model that transfers best to novel tasks and modalities on average. Please see our NeurIPS paper appendix for further ablations.

Ablation plots (Avg. Loss ↓): Training inputs & targets; Input α & Target α; Input Tokens; Target Tokens.

We further show promising scaling trends in terms of dataset size, training duration, and model size.

Scaling plots (Avg. Loss ↓): Dataset Size; Train Tokens; Model Size.


We thank Elmira Amirloo Abolfathi, Andrei Atanov, Hanlin Goh, Yuri Gorokhov, Kimi Huang, Javier Movellan, Hosna Oyarhoseini, Fernando Serrano Garcia, Feng Tang, and Qian Zhou for their help with the project.