In A Path Towards Autonomous Machine Intelligence (2022), Yann LeCun introduced JEPA, a new architecture for self-supervised representation learning. Over the next couple of years, work from Meta showed how this architecture can be made to work well for learning features from various modalities, such as audio (A-JEPA), images (I-JEPA), and most recently (in 2024) videos (V-JEPA). In this post, we will walk through some of the ideas around JEPAs.
ViTs
Vision Transformers (ViTs) have emerged as a very versatile architecture for training image models, especially when embeddings, language grounding, or multi-modality is required. Architecturally, ViTs are very similar to the Transformers used in LLMs, with different input processing to allow for image inputs. Instead of tokenization and embedding, images (or videos) are split into patches that are fed sequentially to a transformer encoder:
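As a concrete sketch of the patchification step, here is a minimal numpy version (ignoring the learned linear projection and position embeddings that a real ViT adds on top):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    h, w = H // patch_size, W // patch_size
    patches = image.reshape(h, patch_size, w, patch_size, C)
    # Reorder so each patch's pixels are contiguous, then flatten per patch.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(h * w, patch_size * patch_size * C)
    return patches

# A 224x224 RGB image with 16x16 patches becomes a sequence of 196 "tokens",
# each of dimension 16*16*3 = 768, before the learned linear projection.
tokens = patchify(np.zeros((224, 224, 3)), 16)
print(tokens.shape)  # (196, 768)
```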

ViTs form the backbone of most of the JEPAs.
Energy-based Models
The original paper introducing JEPA suggests EBMs as a versatile framework for “world models” that is flexible enough to represent almost all SSL architectures, probabilistic or otherwise. The idea is to learn an “energy” function E(x,y) that is high when x and y are incompatible and low when they are compatible. The function can also have latents z and can have multiple modes.
Notes:
- Intuitively, this is similar to the idea in RL where we sometimes learn a value function Q(s, a) instead of a direct policy a = π(s).
- This is an optional unifying framework, but one can very easily move from it to a prediction task via an argmin: ŷ = argmin_y E(x, y). The V-JEPA paper does not use this terminology.
- An intuitive example: a car at a fork can take one of two paths, and two different future snapshots can each have low energy depending on the latent z, i.e., which direction the car took.
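The fork example can be made concrete with a toy energy function. Here f and g are hypothetical stand-ins for learned encoders, and the latent z shifts the prediction, giving the landscape multiple modes:

```python
import numpy as np

f = np.tanh  # stand-in encoder for x (illustrative only)
g = np.tanh  # stand-in encoder for y

def energy(x, y, z):
    # E(x, y, z) is low when y is a compatible continuation of x under latent z.
    return float(np.sum((f(x) + z - g(y)) ** 2))

# Car-at-a-fork intuition: two different futures can each have low energy,
# under the latent that encodes which branch was taken.
x = np.array([0.5, -0.2])
y_left, y_right = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
z_left, z_right = np.array([0.8, 0.0]), np.array([-0.8, 0.0])

# Moving from the EBM view to a prediction task: argmin over candidates.
pred = min([y_left, y_right], key=lambda y: energy(x, y, z_left))
```

With `z_left`, the argmin picks `y_left`; with `z_right`, it would pick `y_right` — the latent resolves the multi-modality.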

Self-Supervised Learning
Self-supervised learning (SSL) of visual representations has been explored a lot. The goal here is to learn from large-scale, unlabeled, real-world data and get universal models that can be efficiently used/fine-tuned for downstream tasks.
The two standard methods for SSL are:
- Generative: automatically mask a section of the data (such as some pixels) and train a model to reconstruct it from the remaining parts. These are also called prediction-based methods. (For language, transformer-based LLMs trained on next-token prediction fall under this category.)
- Discriminative: train an encoder to differentiate datapoints in the representation space. These are also called invariance-based methods. (For language, BERT-style embedding models fall under this category.)
Both have their own pros and cons, and their own challenges that need to be solved. Some considerations are:
- Generative (or predictive) methods spend a lot of model capacity on “fine reconstruction”, such as learning pixel-level details, which makes it harder to learn abstract semantics.
- Discriminative methods rely on hand-crafted manipulations or transforms, which introduce inductive biases that make it hard to solve downstream tasks requiring different features.
One common discriminative way to learn representations is to train a (joint) encoder to have similar embeddings for different views (or crops or transformations) of the same image. (see fig. 1.d)
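A minimal sketch of this invariance objective, with a random linear map standing in for the trained encoder and additive noise standing in for crops/augmentations (both are illustrative assumptions, not the method of any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))  # toy "encoder"

def encode(x):
    h = x @ W
    return h / np.linalg.norm(h)  # unit-normalize the embedding

x = rng.standard_normal(8)
view1 = x + 0.05 * rng.standard_normal(8)  # "augmentation" = small perturbation
view2 = x + 0.05 * rng.standard_normal(8)

# Training maximizes the similarity between embeddings of the two views.
similarity = float(encode(view1) @ encode(view2))
```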

Representation Collapse
Representation collapse is a common issue with most joint-embedding-based SSL techniques: the model hits the trivial solution E(x, y) = c for all inputs, i.e., a flat energy landscape, while optimizing a loss such as L = E(x, y) - E(x, y'). Various methods have been proposed to prevent such a collapse:
- Explicit regularization (such as the variance of embeddings here).
- Architectural constraints, such as stopping the gradients of one encoder (as here), momentum-based updates, or asymmetric prediction heads (as in SimSiam).
- This work attempts to theoretically study such collapses and come up with a learning-dynamics-based method to prevent collapses in simpler settings.
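As a sketch of the first idea (explicit variance regularization, in the style of VICReg), the hinge penalty below is near-maximal for constant embeddings and near zero for healthy ones:

```python
import numpy as np

def variance_penalty(embeddings: np.ndarray, target_std: float = 1.0, eps: float = 1e-4) -> float:
    """Penalize embedding dimensions whose batch std falls below a target,
    pushing back against the constant-output (collapsed) solution."""
    std = np.sqrt(embeddings.var(axis=0) + eps)                # per-dimension std over the batch
    return float(np.mean(np.maximum(0.0, target_std - std)))  # hinge: only low-variance dims pay

rng = np.random.default_rng(0)
collapsed = np.ones((32, 8))             # constant embeddings: near-maximal penalty
healthy = rng.standard_normal((32, 8))   # roughly unit variance: near-zero penalty
```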
Joint-Embedding Predictive Architecture (JEPA)

The main advantage of JEPA is that it performs predictions in representation space, eschewing the need to predict every detail of y, and enabling the elimination of irrelevant details by the encoders. More precisely, the main advantage of this architecture for representing multi-modal dependencies is twofold:
(1) the encoder function s_y = Enc(y) may possess invariance properties that will make it produce the same s_y for a set of different y. This makes the energy constant over this set and allows the model to capture complex multi-modal dependencies;
(2) The latent variable z, when varied over a set Z, can produce a set of plausible predictions Pred(s_x, Z) = {s_yz = Pred(s_x, z) ∀z ∈ Z}. If x is a video clip of a car approaching a fork in the road, s_x and s_y may represent the position, orientation, velocity and other characteristics of the car before and after the fork, respectively, ignoring irrelevant details such as the trees bordering the road or the texture of the sidewalk. z may represent whether the car takes the left branch or the right branch of the road.
It takes a number of techniques/tricks to effectively train JEPA architectures for various modalities. As a case study, we will look at the details of the most recent and most versatile JEPA model to date, video JEPA.
Case Study: V-JEPA
The "simple question" V-JEPA aims to answer:

> How effective is feature prediction as a standalone objective for unsupervised learning from video with modern tools?

Training Objective
They train their visual encoder \(E_\theta(.)\) to satisfy the constraint that representations computed from one part of the video, y, should be predictable from representations computed from another part of the video, x. The predictor network \(P_\phi(.)\), which maps the representation of x to the representation of y, is trained simultaneously with the encoder, and is provided specification of the spatio-temporal positions of y through the conditioning variable z ← ∆_y.


Naively implementing the objective as the regression

\[
\min_{\theta, \phi} \, \lVert P_\phi(E_\theta(x), \Delta_y) - E_\theta(y) \rVert_1
\]

would admit a trivial solution where the encoder outputs a constant representation regardless of its input (collapse).
Collapse Prevention
They use the following modified objective to prevent representation collapse:

\[
\min_{\theta, \phi} \, \lVert P_\phi(E_\theta(x), \Delta_y) - \mathrm{sg}(\overline{E}_\theta(y)) \rVert_1
\]

where \(\mathrm{sg}(\cdot)\) denotes a stop-gradient operation, which does not backpropagate through its argument, and \(\overline{E}_{\theta}(\cdot)\) is an exponential moving average of the network \(E_\theta(\cdot)\).
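A toy sketch of this collapse-avoiding update, with linear "networks" standing in for the ViT encoder \(E_\theta\), the predictor \(P_\phi\), and the EMA target encoder. This is illustrative only, not the paper's code; the paper uses an L1 loss, while L2 is used here to keep each gradient one line:

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = 0.1 * rng.standard_normal((4, 4))   # online encoder E_theta
W_ema = W_enc.copy()                         # EMA target encoder
W_pred = 0.1 * rng.standard_normal((4, 4))  # predictor P_phi
momentum, lr = 0.99, 0.05

def step(x, y):
    global W_enc, W_pred, W_ema
    sx = x @ W_enc                  # representation of the visible part x
    target = y @ W_ema              # sg(EMA-encoder(y)): no gradient flows through this
    err = sx @ W_pred - target
    g_pred = sx.T @ err                    # grad of 0.5*||err||^2 w.r.t. W_pred
    g_enc = x.T @ (err @ W_pred.T)         # ...and w.r.t. W_enc (stop-grad on target)
    W_pred -= lr * g_pred
    W_enc -= lr * g_enc
    W_ema = momentum * W_ema + (1 - momentum) * W_enc  # EMA update of the target
    return float(np.mean(np.abs(err)))

x = rng.standard_normal((8, 4))  # "visible" part
y = x                             # toy setup: predict representations of the same content
losses = [step(x, y) for _ in range(200)]
```

Only the online encoder and predictor receive gradients; the target drifts slowly via the EMA, which is what breaks the symmetry that would otherwise allow a constant-output solution.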
Theoretical Motivation (copied from the paper)

Training Details
- Data: they combine several publicly available datasets to get a corpus of 2M videos of 16 frames each (called VideoMix2M).
- Model backbone: ViT-H (~630M params) or ViT-L (~310M params) for the encoder, and ViT-B (~80M params) for the predictors.
- Optimization: AdamW for both, 90K iterations with learning-rate decay. The full list of hyperparameters is available here.
Evaluation Details
- Tasks: they use a subset of the VideoGLUE benchmark to test for various capabilities; specifically, action recognition on Kinetics-400 to test appearance-based understanding, motion classification on Something-Something-v2 to test temporal understanding, and action localization on AVA to test motion localization.
- Attentive probing: since the prediction objective is unnormalized, rather than pooling the frozen backbone's output features with a linear operation (averaging), they employ a learnable, non-linear pooling strategy. Specifically, when evaluating the frozen pretrained backbone on downstream tasks, they learn a cross-attention layer with a learnable query token. The output of the cross-attention layer is added back to the query token (residual connection), then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm and finally a linear classifier.
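The attentive-probing dataflow can be sketched as follows (shapes only: in practice the query and projections are trained while the backbone stays frozen, and the MLP/LayerNorm/classifier head is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 16, 8                               # token dim, number of backbone tokens
tokens = rng.standard_normal((N, D))       # frozen backbone output
query = rng.standard_normal((1, D))        # learnable query token
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Cross-attention: one query attends over all N tokens.
attn = softmax((query @ Wq) @ (tokens @ Wk).T / np.sqrt(D))  # (1, N) weights
pooled = attn @ (tokens @ Wv)                                # (1, D) pooled feature
pooled = pooled + query                                      # residual connection
# ...then a two-layer MLP with GeLU, a LayerNorm, and a linear classifier.
```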
Results
The key takeaways are captured by the following:


Future Directions
Firstly, the V-JEPA model has some limitations that, if addressed, could help it become a universal backbone for a number of practical image and video tasks: more diverse pre-training datasets (to improve OOD performance), more context (to handle long-range videos), and semantic information (such as combining with language embeddings).
Secondly, JEPA still carries the inductive bias of certain types of manual masks, which seems harder to remove for video tasks. Specifically, without contiguous and consistent spatio-temporal masking, the task becomes "too easy" and the model doesn't learn anything useful.
Thirdly, how can we make the feature representations that JEPA learns correspond to human-interpretable concepts and hierarchies? Can hierarchical JEPA work?
Lastly, can we get this to work for language? The setting looks ripe: energy landscapes have multiple modes, and predictive methods need to learn fine-grained, token-level details, which takes compute/capacity away from abstract semantics.
……………………………………………..
Thank you! Please cite this blogpost as:
@misc{golechha_hike_2024,
author = {Satvik Golechha},
title = {A Hike around JEPAs},
howpublished = {\url{https://7vik.io/2024/02/29/a-hike-around-jepas/}},
month = feb,
day = 29,
year = {2024},
}