Technical Trigger

The SentenceTransformer class in the Sentence Transformers library now supports multimodal embedding and reranker models: you can finetune existing models or train from a fresh vision-language model (VLM) checkpoint. The processor_kwargs and model_kwargs parameters control preprocessing and model loading, respectively.

Developer / Implementation Hook

With the updated SentenceTransformer class, developers can finetune existing multimodal embedding models or train one from a fresh VLM checkpoint. The Router module composes separate encoders for different modalities (for example, one route for text queries and another for image documents), enabling more flexible and efficient architectures. The tomaarsen/llamaindex-vdr-en-train-preprocessed dataset is available for training and evaluating such models.

The Structural Shift

The paradigm is shifting from text-only retrieval to multimodal retrieval, in which models understand and retrieve relevant document pages from both text and image queries.

Early Warning — Act Before Mainstream

To take advantage of this update, GEO practitioners can:

* Use the SentenceTransformer class to finetune existing multimodal embedding models for improved performance in Visual Document Retrieval tasks
* Utilize the Router module to compose separate encoders for different modalities, such as text and image
* Train and evaluate multimodal models using the tomaarsen/llamaindex-vdr-en-train-preprocessed dataset