BLIP-2: Vision-Language Pre-training
Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.
When to use BLIP-2
Use BLIP-2 when:
- You need high-quality image captioning with natural descriptions
- You are building visual question answering (VQA) systems
- You require zero-shot image-text understanding without task-specific training
- You want to leverage LLM reasoning for visual tasks
- You are building multimodal conversational AI
- You need image-text retrieval or matching
Key features:
- Q-Former architecture: Lightweight query transformer bridges vision and language
- Frozen backbone efficiency: No need to fine-tune large vision/language models
- Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
- Zero-shot capabilities: Strong performance without task-specific training
- Efficient training: Only trains Q-Former (~188M parameters)
- State-of-the-art results: The BLIP-2 paper reports outperforming Flamingo-80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters
Use alternatives instead:
- LLaVA: For instruction-following multimodal chat
- InstructBLIP: For improved instruction-following (BLIP-2 successor)
- GPT-4V/Claude 3: For production multimodal chat (proprietary)
- CLIP: For simple image-text similarity without generation
- Flamingo: For few-shot visual learning
Quick start
Installation
pip install transformers accelerate torch Pillow
pip install salesforce-lavis
Basic image captioning
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# Load the processor and model once; float16 halves memory use on GPU
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)
# With no text prompt, the model generates a caption for the image
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
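The captioning steps above can be wrapped in a small reusable function. This is only a sketch: `caption_image` is a hypothetical helper name, not part of transformers, and it assumes `model` and `processor` are the objects loaded above. It reads the device and dtype from the model so the inputs always match wherever the model was placed.

```python
def caption_image(model, processor, image, prompt=None, max_new_tokens=50):
    """Return one caption for a PIL image (hypothetical helper, not a library API)."""
    kwargs = {"images": image, "return_tensors": "pt"}
    if prompt is not None:
        # Optional text prefix that the model continues, e.g. "a photo of"
        kwargs["text"] = prompt
    # Move the inputs to the model's device and cast float tensors to its dtype
    inputs = processor(**kwargs).to(model.device, model.dtype)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text.strip()
```

Calling `caption_image(model, processor, image)` gives an unconditioned caption, while passing `prompt="a photo of"` steers how the caption begins.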
Visual question answering
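# Reuses `image`, `processor`, and `model` from the captioning example
# OPT-based checkpoints are typically prompted in the "Question: ... Answer:" format
question = "What color is the car in this image?"
prompt = f"Question: {question} Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)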
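BLIP-2 checkpoints are commonly prompted with a `Question: ... Answer:` template, and earlier turns can be prepended to give the model conversational context. The following is a minimal sketch of such a prompt builder; `build_vqa_prompt` is a hypothetical helper, and the exact template is a convention taken from public examples rather than a requirement of the model.

```python
def build_vqa_prompt(question, history=()):
    """Format a question, plus optional earlier (question, answer) turns."""
    # Each earlier turn is rendered as a completed question/answer pair
    turns = [f"Question: {q.strip()} Answer: {a.strip()}." for q, a in history]
    # The final turn ends at "Answer:" so the model continues from there
    turns.append(f"Question: {question.strip()} Answer:")
    return " ".join(turns)


print(build_vqa_prompt("What color is the car?"))
print(build_vqa_prompt("Why?", history=[("Which city is this?", "singapore")]))
```

The resulting string is passed as the `text` argument of the processor in place of the raw question.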
Using LAVIS library
import torch
from lavis.models import load_model_and_preprocess
from PIL import Image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Returns the model plus dictionaries of image and text preprocessors
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=True,
    device=device
)
# Preprocess the image and add a batch dimension
image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)
# Captioning: no prompt
caption = model.generate({"image": image})
print(caption)
# VQA: wrap the processed question in the "Question: ... Answer:" format
question = txt_processors["eval"]("What is in this image?")
answer = model.generate({"image": image, "prompt": f"Question: {question} Answer:"})
print(answer)
Core concepts
Architecture overview
BLIP-2 Architecture:
┌─────────────────────────────────────────────────────────────┐
│                          Q-Former                           │
│  ┌─────────────────────────────────────────────────────┐    │
│  │       Learned Queries (32 queries × 768 dim)        │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                               │
│  ┌──────────────────────────▼──────────────────────────┐    │
│  │         Cross-Attention with Image Features         │    │
│  └──────────────────────────┬──────────────────────────┘    │
│                             │                               │
│  ┌──────────────────────────▼──────────────────────────┐    │
│  │         Self-Attention Layers (Transformer)         │    │
│  └──────────────────────────┬──────────────────────────┘    │
└─────────────────────────────┼───────────────────────────────┘
                              │
┌─────────────────────────────▼───────────────────────────────┐
│  Frozen Vision Encoder      │  Frozen LLM                   │
│  (ViT-G/14 from EVA-CLIP)   │  (OPT or FlanT5)              │
└─────────────────────────────┴───────────────────────────────┘
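The key property of the learned queries is that their number fixes the size of the visual summary handed to the LLM, no matter how many image patches the vision encoder produces. The toy sketch below illustrates this with single-head cross-attention in plain Python. It is a deliberate simplification and not the actual Q-Former: the dimensions are tiny stand-ins for 32 × 768, and the learned projections, multiple heads, and self-attention layers are all omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_feats):
    """Each query returns a weighted average of the image feature vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every image token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in image_feats]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, image_feats))
                    for j in range(dim)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
queries = random_matrix(4, 8, rng)        # toy stand-in for the 32 x 768 queries
few_patches = random_matrix(10, 8, rng)   # toy image features, 10 tokens
many_patches = random_matrix(50, 8, rng)  # toy image features, 50 tokens

out_few = cross_attention(queries, few_patches)
out_many = cross_attention(queries, many_patches)
print(len(out_few), len(out_few[0]))    # 4 8
print(len(out_many), len(out_many[0]))  # 4 8
```

Both outputs have one row per query, which is why the LLM always receives the same number of visual tokens regardless of image token count.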
Model variants
| Model | LLM Backend | Size | Use Case |
|---|---|---|---|
| blip2-opt-2.7b | OPT-2.7B | ~4GB | General captioning, VQA |
| blip2-opt-6.7b | OPT-6.7B | ~8GB | Better reasoning |
| blip2-flan-t5-xl | FlanT5-XL | ~5GB | Instruction following |
| blip2-flan-t5-xxl | FlanT5-XXL | ~13GB | Best quality |
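As a rough illustration of how the table might drive model selection, the sketch below picks the largest variant whose listed size fits a memory budget. The sizes are copied from the table above and are approximate; real memory use also depends on dtype, batch size, and generation length. `pick_variant` is a hypothetical helper, and "largest that fits" is only a crude proxy for quality.

```python
# Approximate sizes in GB, copied from the model variants table
VARIANTS = {
    "blip2-opt-2.7b": 4,
    "blip2-flan-t5-xl": 5,
    "blip2-opt-6.7b": 8,
    "blip2-flan-t5-xxl": 13,
}


def pick_variant(budget_gb):
    """Return the largest listed variant that fits the budget, or None."""
    fitting = [(size, name) for name, size in VARIANTS.items() if size <= budget_gb]
    return max(fitting)[1] if fitting else None


print(pick_variant(6))   # blip2-flan-t5-xl
print(pick_variant(24))  # blip2-flan-t5-xxl
print(pick_variant(3))   # None
```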
Q-Former components
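| Component | Description | Parameters |
|---|---|---|
| Learned queries | Fixed set of learnable embeddings | 32 × 768 |
| Image transformer | Cross-attention to vision features | ~108M |
| Text transformer | Self-attention for text | ~108M |
| Linear projection | Maps to LLM dimension | Varies |
The image and text transformers share their self-attention layers, so their parameter counts overlap; the Q-Former totals about 188M parameters.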
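The "Varies" entry for the linear projection follows from the hidden size of the chosen LLM: a single linear layer from the 768-dimensional query outputs to the LLM dimension has 768 × d weights plus d biases. The sketch below works through that arithmetic. The hidden sizes are assumptions taken from the publicly documented OPT and T5 configurations, and the presence of a bias term is also an assumption.

```python
QFORMER_DIM = 768   # width of each query output
NUM_QUERIES = 32    # number of learned queries

# Assumed LLM hidden sizes, from the public OPT and T5 model configurations
LLM_HIDDEN = {
    "OPT-2.7B": 2560,
    "OPT-6.7B": 4096,
    "FlanT5-XL": 2048,
    "FlanT5-XXL": 4096,
}


def projection_params(llm_hidden, qformer_dim=QFORMER_DIM):
    """Weights plus biases of one linear layer from qformer_dim to llm_hidden."""
    return qformer_dim * llm_hidden + llm_hidden


for name, hidden in LLM_HIDDEN.items():
    # The LLM receives NUM_QUERIES visual tokens, each of width `hidden`
    print(f"{name}: {NUM_QUERIES} x {hidden} visual tokens, "
          f"{projection_params(hidden):,} projection parameters")
```

In every case the projection is a few million parameters, small next to the Q-Former itself.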
Advanced usage
Batch processing
from PIL import Image
import torch
# Reuses `processor` and `model` from the quick start
images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]
questions = [
    "What is shown in this image?",
    "Describe the scene.",
    "What colors are prominent?",
    "Is there a person in this image?"
]
prompts = [f"Question: {q} Answer:" for q in questions]
# Decoder-only LLMs such as OPT generally expect left padding for batched generation
processor.tokenizer.padding_side = "left"
inputs = processor(
    images=images,
    text=prompts,
    return_tensors="pt",
    padding=True
).to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a.strip()}\n")
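The batch above is hard-coded to four images. For larger collections, one option is to split the inputs into fixed-size chunks so each forward pass stays within GPU memory. A minimal sketch, where `chunked` is a hypothetical helper and the batch size of 4 is an arbitrary choice:

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


paths = [f"image_{i}.jpg" for i in range(10)]
batches = list(chunked(paths, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Each chunk of paths would then be opened, passed through the processor, and generated on exactly as in the batch example.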
Controlling generation
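# Beam search combined with sampling: more varied output, not deterministic
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    min_length=20,
    num_beams=5,
    no_repeat_ngram_size=2,   # block repeated bigrams
    top_p=0.9,                # nucleus sampling threshold
    temperature=0.7,          # values below 1.0 sharpen the distribution
    do_sample=True,           # required for top_p and temperature to take effect
)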
generated_ids = model.generate(
**inputs,
max_new_tokens