Technical Trigger
The SentenceTransformer class in the Sentence Transformers library now supports multimodal embedding and reranker models: you can finetune an existing multimodal model or start from a fresh vision-language model (VLM) checkpoint. The processor_kwargs and model_kwargs parameters control preprocessing and model loading, respectively.
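A minimal sketch of loading a multimodal model with those parameters. The kwargs values and the idea of passing a model name are illustrative assumptions, not a verbatim recipe from the release notes; the import is lazy so the helper can be defined without the library installed.

```python
def load_multimodal_model(name: str):
    """Load a multimodal embedding model with explicit preprocessing and
    weight-loading options (the kwargs shown are illustrative assumptions)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    return SentenceTransformer(
        name,
        model_kwargs={"torch_dtype": "bfloat16"},  # how the checkpoint is loaded
        processor_kwargs={},  # image/text preprocessing options for the processor
    )

# Usage (downloads the checkpoint on first call):
# model = load_multimodal_model("some-org/some-multimodal-checkpoint")
# embeddings = model.encode(["a text query"])
```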
Developer / Implementation Hook
Developers can finetune existing multimodal embedding models with the updated SentenceTransformer class or train from a fresh VLM checkpoint. The Router module composes separate encoders for different input types (for example, text queries and image documents), allowing more flexible model architectures. The tomaarsen/llamaindex-vdr-en-train-preprocessed dataset is available for training and evaluating such models.
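Composing per-input-type encoders with Router can be sketched as follows. This is a sketch under the assumption that the Sentence Transformers v5 Router API (Router.for_query_document) applies; the model name is an illustrative placeholder, and a real multimodal setup would route documents through an image-capable branch.

```python
def build_routed_model():
    """Compose separate encoder branches for queries and documents with the
    Router module (sketch; model name and branch choices are assumptions)."""
    from sentence_transformers import SentenceTransformer  # lazy imports
    from sentence_transformers.models import Pooling, Router, Transformer

    # Text branch: a transformer encoder followed by mean pooling.
    text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
    pooling = Pooling(text_encoder.get_word_embedding_dimension(), "mean")

    # Route "query" and "document" inputs through their own module stacks;
    # in a multimodal setup the document branch would use an image encoder.
    router = Router.for_query_document(
        query_modules=[text_encoder, pooling],
        document_modules=[text_encoder, pooling],
    )
    return SentenceTransformer(modules=[router])
```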
The Structural Shift
The paradigm is shifting from text-only retrieval to multimodal retrieval, in which models retrieve relevant document pages from both text and image queries.
Early Warning — Act Before Mainstream
To take advantage of this update, GEO practitioners can:
* Use the SentenceTransformer class to finetune existing multimodal embedding models for improved performance in Visual Document Retrieval tasks
* Use the Router module to compose separate encoders for different modalities, such as text and image
* Train and evaluate multimodal models using the tomaarsen/llamaindex-vdr-en-train-preprocessed dataset
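The training step above can be sketched with the standard Sentence Transformers trainer. The base model name, loss choice, and trainer arguments are illustrative assumptions; only the dataset identifier comes from the text.

```python
def finetune_on_vdr(model_name: str, output_dir: str):
    """Finetune an embedding model on the preprocessed VDR dataset
    (sketch; hyperparameters and loss are illustrative assumptions)."""
    from datasets import load_dataset  # lazy imports
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    model = SentenceTransformer(model_name)
    train_dataset = load_dataset(
        "tomaarsen/llamaindex-vdr-en-train-preprocessed", split="train"
    )
    # In-batch negatives loss, a common default for retrieval finetuning.
    loss = MultipleNegativesRankingLoss(model)
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
    )
    SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_dataset, loss=loss
    ).train()
```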