CocoIndex
Overview
CocoIndex is an ultra-performant real-time data transformation framework for AI with incremental processing. This skill enables building indexing flows that extract data from sources, apply transformations (chunking, embedding, LLM extraction), and export to targets (vector databases, graph databases, relational databases).
Core capabilities:
- Write indexing flows - Define ETL pipelines using Python
- Create custom functions - Build reusable transformation logic
- Operate flows - Run and manage flows using CLI or Python API
Key features:
- Incremental processing (only processes changed data)
- Live updates (continuously sync source changes to targets)
- Built-in functions (text chunking, embeddings, LLM extraction)
- Multiple data sources (local files, S3, Azure Blob, Google Drive, Postgres)
- Multiple targets (Postgres+pgvector, Qdrant, LanceDB, Neo4j, Kuzu)
For detailed documentation: https://cocoindex.io/docs/
Search documentation: https://cocoindex.io/docs/search?q=url%20encoded%20keyword
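The search URL above expects the keyword to be percent-encoded. A minimal sketch of building it with the Python standard library follows; the helper name `docs_search_url` is hypothetical and not part of CocoIndex.

```python
from urllib.parse import quote

# Base of the documentation search endpoint quoted above.
DOCS_SEARCH_BASE = "https://cocoindex.io/docs/search"


def docs_search_url(keyword: str) -> str:
    """Return the docs search URL for a keyword, percent-encoding it."""
    # safe="" also encodes "/" so the keyword cannot alter the query string.
    return f"{DOCS_SEARCH_BASE}?q={quote(keyword, safe='')}"


if __name__ == "__main__":
    print(docs_search_url("vector index"))
```

Spaces are encoded as `%20`, which matches the form shown in the search link.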
When to Use This Skill
Use when users request:
- "Build a vector search index for my documents"
- "Create an embedding pipeline for code/PDFs/images"
- "Extract structured information using LLMs"
- "Build a knowledge graph from documents"
- "Set up live document indexing"
- "Create custom transformation functions"
- "Run/update my CocoIndex flow"
Flow Writing Workflow
Step 1: Understand Requirements
Ask clarifying questions to understand:
Data source:
- Where is the data? (local files, S3, database, etc.)
- What file types? (text, PDF, JSON, images, code, etc.)
- How often does it change? (one-time, periodic, continuous)
Transformations:
- What processing is needed? (chunking, embedding, extraction, etc.)
- Which embedding model? (SentenceTransformer, OpenAI, custom)
- Any custom logic? (filtering, parsing, enrichment)
Target:
- Where should results go? (Postgres, Qdrant, Neo4j, etc.)
- What schema? (fields, primary keys, indexes)
- Vector search needed? (specify similarity metric)
Step 2: Set Up Dependencies
Guide user to add CocoIndex with appropriate extras to their project based on their needs:
Required dependency:
- cocoindex - Core functionality, CLI, and most built-in functions
Optional extras (add as needed):
- cocoindex[embeddings] - For SentenceTransformer embeddings (when using SentenceTransformerEmbed)
- cocoindex[colpali] - For ColPali image/document embeddings (when using ColPaliEmbedImage or ColPaliEmbedQuery)
- cocoindex[lancedb] - For LanceDB target (when exporting to LanceDB)
- cocoindex[embeddings,lancedb] - Multiple extras can be combined
What's included:
- Base package: Core functionality, CLI, most built-in functions, Postgres/Qdrant/Neo4j/Kuzu targets
- embeddings extra: SentenceTransformers library for local embedding models
- colpali extra: ColPali engine for multimodal document/image embeddings
- lancedb extra: LanceDB client library for LanceDB vector database support
Users can install using their preferred package manager (pip, uv, poetry, etc.) or add to pyproject.toml.
For installation details: https://cocoindex.io/docs/getting_started/installation
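The extras listed above combine into a single requirement string. As a rough sketch, the hypothetical helper below (not part of CocoIndex) builds that string from the extras a flow needs and rejects names that are not in the list above.

```python
# Extras named in this guide; anything else is treated as a typo.
KNOWN_EXTRAS = ("embeddings", "colpali", "lancedb")


def cocoindex_requirement(extras=()):
    """Return a requirement string such as 'cocoindex[embeddings,lancedb]'."""
    requested = set(extras)
    unknown = sorted(requested - set(KNOWN_EXTRAS))
    if unknown:
        raise ValueError(f"Unknown cocoindex extras: {', '.join(unknown)}")
    # Keep a stable order and drop duplicates.
    selected = [name for name in KNOWN_EXTRAS if name in requested]
    if not selected:
        return "cocoindex"
    return f"cocoindex[{','.join(selected)}]"


if __name__ == "__main__":
    print(cocoindex_requirement(["lancedb", "embeddings"]))
```

The resulting string can be handed to whichever package manager the user prefers, for example `pip install "cocoindex[embeddings,lancedb]"`.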
Step 3: Set Up Environment
Check existing environment first:
- Check if COCOINDEX_DATABASE_URL exists in environment variables
  - If not found, use default: postgres://cocoindex:cocoindex@localhost/cocoindex
- For flows requiring LLM APIs (embeddings, extraction):
  - Ask user which LLM provider they want to use:
    - OpenAI - Both generation and embeddings
    - Anthropic - Generation only
    - Gemini - Both generation and embeddings
    - Voyage - Embeddings only
    - Ollama - Local models (generation and embeddings)
  - Check if the corresponding API key exists in environment variables
  - If not found, ask user to provide the API key value
  - Never create simplified examples without an LLM - always get the proper API key and use the real LLM functions
Guide user to create .env file:
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
VOYAGE_API_KEY=pa-...
For more LLM options: https://cocoindex.io/docs/ai/llm
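The environment checks described in this step can be sketched with the standard library alone. The helper names below are hypothetical. The provider-to-variable mapping is an assumption read off the .env example above (in particular, pairing Gemini with GOOGLE_API_KEY), and Ollama is left out because no key for it appears there.

```python
import os

DEFAULT_DATABASE_URL = "postgres://cocoindex:cocoindex@localhost/cocoindex"

# Assumed mapping, taken from the variable names in the .env example.
PROVIDER_API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "voyage": "VOYAGE_API_KEY",
}


def resolve_database_url(env=None):
    """Return COCOINDEX_DATABASE_URL if set, otherwise the default URL."""
    env = os.environ if env is None else env
    return env.get("COCOINDEX_DATABASE_URL") or DEFAULT_DATABASE_URL


def missing_api_key(provider, env=None):
    """Return the name of the unset API key variable, or None if it is set."""
    env = os.environ if env is None else env
    var = PROVIDER_API_KEY_VARS[provider.lower()]
    return None if env.get(var) else var


if __name__ == "__main__":
    print("Database URL:", resolve_database_url())
    missing = missing_api_key("openai")
    if missing:
        print(f"{missing} is not set; ask the user for its value.")
```

Passing an explicit `env` mapping instead of relying on `os.environ` keeps the checks easy to exercise without touching the real environment.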
Create basic project structure:
from dotenv import load_dotenv
import cocoindex

@cocoindex.flow_def(name="FlowName")
def my_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    pass

if __name__ == "__main__":
    load_dotenv()
    cocoindex.init()
    my_flow.update()
Step 4: Write the Flow
Follow this structure:
@cocoindex.flow_def(name="DescriptiveName")
def flow_name(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # 1. Import source data into the top-level data scope
    data_scope["source_name"] = flow_builder.add_source(
        cocoindex.sources.SourceType(...)
    )

    # 2. Create a collector for the rows to export
    collector = data_scope.add_collector()

    # 3. Iterate over rows, transform, and collect
    with data_scope["source_name"].row() as item:
        item["new_field"] = item["existing_field"].transform(
            cocoindex.functions.FunctionName(...)
        )
        ...
        # Nested iteration (for example, chunks within a document)
        with item["nested_table"].row() as nested_item:
            nested_item["embedding"] = nested_item["text"].transform(...)
            collector.collect(
                field1=nested_item["field1"],
                field2=item["field2"],
                generated_id=cocoindex.GeneratedField.UUID
            )

    # 4. Export at the top level, outside the row iterations
    collector.export(
        "target_name",
        cocoindex.targets.TargetType(...),
        primary_key_fields=["field1"],
        vector_indexes=[...]
    )
Key principles:
- Each source creates a field in the top-level data scope
- Use .row() to iterate through table data
- CRITICAL: Always assign transformed data to row fields. Use item["new_field"] = item["existing_field"].transform(...), NOT local variables like new_field = item["existing_field"].transform(...)
- Transformations create new fields without mutating existing data
- Collectors gather data from any scope level
- Export must happen at top level (not within row iterations)
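To make these principles concrete, here is a sketch of a text embedding flow. It assumes the LocalFile source, the SplitRecursively and SentenceTransformerEmbed functions, the Postgres target, and VectorIndexDef behave as in the CocoIndex quickstart, and that the embeddings extra is installed. The flow name, directory name, model name, and chunk sizes are illustrative choices, and the wrapper function `define_text_embedding_flow` is hypothetical; it defers the import so the snippet loads even where cocoindex is absent.

```python
def define_text_embedding_flow():
    """Define and return a flow that chunks and embeds local Markdown files."""
    # Deferred import: cocoindex is only needed once the flow is defined.
    import cocoindex

    @cocoindex.flow_def(name="TextEmbeddingSketch")
    def text_embedding_flow(
        flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
    ):
        # Each source becomes a field in the top-level data scope.
        data_scope["documents"] = flow_builder.add_source(
            cocoindex.sources.LocalFile(path="markdown_files")  # illustrative path
        )
        doc_embeddings = data_scope.add_collector()

        with data_scope["documents"].row() as doc:
            # Assign the result to a row field, not a local variable.
            doc["chunks"] = doc["content"].transform(
                cocoindex.functions.SplitRecursively(),
                language="markdown",
                chunk_size=2000,
                chunk_overlap=500,
            )
            with doc["chunks"].row() as chunk:
                chunk["embedding"] = chunk["text"].transform(
                    cocoindex.functions.SentenceTransformerEmbed(
                        model="sentence-transformers/all-MiniLM-L6-v2"
                    )
                )
                # The collector gathers fields from both scope levels.
                doc_embeddings.collect(
                    filename=doc["filename"],
                    location=chunk["location"],
                    text=chunk["text"],
                    embedding=chunk["embedding"],
                )

        # Export at the top level, outside the row iterations.
        doc_embeddings.export(
            "doc_embeddings",
            cocoindex.targets.Postgres(),
            primary_key_fields=["filename", "location"],
            vector_indexes=[
                cocoindex.VectorIndexDef(
                    field_name="embedding",
                    metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
                )
            ],
        )

    return text_embedding_flow
```

Call `define_text_embedding_flow()` once per process, since each call registers a flow under the same name. In a real project the decorated function would more likely sit at module level, as in the basic project structure from Step 3.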
Common mistakes to avoid:
❌ Wrong: Using local variables for transformations
with data_scope["files"].row() as file:
    summary = file["content"].transform(...)
    summaries_collector.collect(filename=file["filename"], summary=summary)
✅ Correct: Assigning to row fields
with data_scope["files"].row() as file:
    file["summary"] = file["content"].transform(...)
    summaries_collector.collect(filename=file["filename"], summary=file["summary"])
❌ Wrong: Creating unnecessary dataclasses to mirror flow fields
from dataclasses import dataclass

@dataclass
class FileSummary:
    filename: str
    summary: str
    embedding: list[float]
Step 5: Design the Flow Solution
IMPORTANT: The patterns listed below are common starting points, not an exhaustive list of scenarios. When user requirements don't match an existing pattern:
- Combine elements from multiple patterns - Mix and match sources, transformations, and targets creatively
- Review additional examples - See https://github.com/cocoindex-io/cocoindex?tab=readme-ov-file#-examples-and-demo for diverse real-world use cases (face recognition, multimodal search, product recommendations, patient form extraction, etc.)
- Think from first principles - Use the core APIs (sources, transforms, collectors, exports) and apply common sense to solve novel problems
- Be creative - CocoIndex is flexible; unique combinations of components can solve unique problems
Common starting patterns (use references for detailed examples):
For text embedding: Load references/flow_patterns.md a