Evaluating Visual Adapters: MIVPG Performance on Single and Multi-Image Inputs

Table of Links

- Abstract and 1. Introduction
- 2. Related Work
  - 2.1. Multimodal Learning
  - 2.2. Multiple Instance Learning
- 3. Methodology
  - 3.1. Preliminaries and Notations
  - 3.2. Relations between Attention-based VPG and MIL
  - 3.3. MIVPG for Multiple Visual Inputs
  - 3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios
- 4. Experiments and 4.1. General Setup
  - 4.2. Scenario 1: Samples with Single Image
  - 4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
  - 4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered, and 4.5. Case Study
- Conclusion and References
- Supplementary Material
  - A. Detailed Architecture of QFormer
  - B. Proof of Proposition
  - C. More Experiments

4. Experiments

To assess the effectiveness of our proposed approach, we conduct evaluations across three scenarios:

- each sample comprises a single image, and its patches are naturally treated as instances;
- each sample includes multiple images, and we use a single, general embedding for each image;
- each sample contains multiple images, with each image contributing multiple patches.

4.1. General Setup

We initialize our model from BLIP2 [22] with FLAN-T5-XL, and MIVPG is initialized with weights from QFormer. The model consists of a frozen language model and a frozen visual model; during training, we update only the MIVPG (see the sketch at the end of this article). The visual encoder, ViT-G, encodes images into patch embeddings, with images resized to 224 × 224. In our experiments, we observed that unfreezing the visual encoder does not yield additional improvements on small datasets. Further details can be found in the supplementary material, Section C.1.

:::info
Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).
:::

:::info
This paper is available on arXiv under the CC BY 4.0 DEED (Attribution 4.0 International) license.
:::
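For readers who want a concrete picture of the parameter-freezing recipe in Section 4.1, the PyTorch sketch below shows one way to keep the visual encoder and language model frozen while passing only the MIVPG parameters to the optimizer. The wrapper class, its module arguments, and the learning rate are illustrative assumptions for this sketch, not the authors' released code.

```python
import torch
from torch import nn

class MIVPGBLIP2(nn.Module):
    """Hypothetical wrapper mirroring the Section 4.1 recipe:
    frozen ViT-G visual encoder, frozen FLAN-T5-XL language model,
    and a trainable MIVPG initialized from QFormer weights."""

    def __init__(self, visual_encoder: nn.Module, mivpg: nn.Module, language_model: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder   # ViT-G; images resized to 224 x 224 and encoded into patch embeddings
        self.mivpg = mivpg                     # trainable visual prompt generator
        self.language_model = language_model   # FLAN-T5-XL, kept frozen

        # Freeze everything except the MIVPG.
        for module in (self.visual_encoder, self.language_model):
            for p in module.parameters():
                p.requires_grad = False
        for p in self.mivpg.parameters():
            p.requires_grad = True

    def trainable_parameters(self):
        # Only parameters with requires_grad=True (i.e., the MIVPG) are handed to the optimizer.
        return [p for p in self.parameters() if p.requires_grad]


# Toy instantiation with placeholder modules, only to demonstrate the freezing pattern.
model = MIVPGBLIP2(nn.Identity(), nn.Linear(8, 8), nn.Identity())
optimizer = torch.optim.AdamW(model.trainable_parameters(), lr=1e-4)  # learning rate is illustrative, not reported here
print(sum(p.numel() for p in model.trainable_parameters()))  # confirms only MIVPG parameters are updated
```

In practice, the placeholder `nn.Identity()` and `nn.Linear(8, 8)` modules would be replaced by the actual BLIP2 components; the freezing pattern itself is the only part this sketch is meant to convey.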