GrepAI Chunking Configuration
This skill covers how GrepAI splits code files into chunks for embedding, and how to optimize chunking for your codebase.
When to Use This Skill
- Optimizing search accuracy
- Adjusting for code style (verbose vs. concise)
- Troubleshooting search results
- Understanding how indexing works
What is Chunking?
Chunking is the process of splitting source files into smaller segments for embedding:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Large Source File โ
โ (1000+ tokens) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโ
โ Chunk 1 โ โ Chunk 2 โ โ Chunk 3 โ
โ ~512 โ โ ~512 โ โ ~512 โ
โ tokens โ โ tokens โ โ tokens โ
โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโ
โ
Each chunk gets
its own embedding
Why Chunking Matters
Embedding models have optimal input sizes:
- Too large chunks: Less precise search results
- Too small chunks: Lost context, fragmented results
- Just right: Good balance of precision and context
Configuration
Basic Settings
chunking:
size: 512
overlap: 50
Understanding Parameters
Chunk Size
The target number of tokens per chunk.
| Size |
Effect |
| 256 |
More precise, less context |
| 512 |
Balanced (default) |
| 1024 |
More context, less precise |
Overlap
Tokens shared between adjacent chunks. Preserves context at boundaries.
| Overlap |
Effect |
| 0 |
No overlap, may lose context at boundaries |
| 50 |
Standard overlap (default) |
| 100 |
More context, larger index |
Visualization
With size=512 and overlap=50:
File: auth.go (1000 tokens)
Chunk 1: tokens 1-512
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ func Login(user, pass)... โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
50 token overlap
โ
Chunk 2: tokens 463-974
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ ...validate credentials... โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ
50 token overlap
โ
Chunk 3: tokens 925-1000
โโโโโโโโโโโโโโโโ
โ ...return โ
โโโโโโโโโโโโโโโโ
Recommended Settings by Language
Verbose Languages (Java, C#)
chunking:
size: 768
overlap: 75
Concise Languages (Go, Python)
chunking:
size: 512
overlap: 50
Very Concise (Rust, Zig)
chunking:
size: 384
overlap: 40
Recommended Settings by Codebase
Small Functions (Microservices)
chunking:
size: 384
overlap: 40
Large Classes (Monolith)
chunking:
size: 768
overlap: 100
Mixed Codebase
chunking:
size: 512
overlap: 50
How Tokens are Counted
GrepAI uses approximate token counting:
- ~4 characters = 1 token (for English text)
- Code varies based on identifiers and syntax
Example:
func calculateTotal(items []Item) float64 {
total := 0.0
for _, item := range items {
total += item.Price * float64(item.Quantity)
}
return total
}
โ 45 tokens
Impact on Index Size
Larger overlap = more chunks = larger index:
| Size |
Overlap |
Chunks per 10K tokens |
Index Impact |
| 512 |
0 |
~20 |
Smallest |
| 512 |
50 |
~22 |
Standard |
| 512 |
100 |
~24 |
+10% |
| 256 |
50 |
~44 |
+100% |
Impact on Search Quality
Too Small Chunks (size: 128)
Query: "authentication middleware"
Result: "...c.AbortWithStatus(401)..."
(Fragment, missing context)
Just Right (size: 512)
Query: "authentication middleware"
Result: "func AuthMiddleware() gin.HandlerFunc {
return func(c *gin.Context) {
token := c.GetHeader("Authorization")
if token == "" {
c.AbortWithStatus(401)
return
}
// validate token...
}
}"
(Complete function with context)
Too Large Chunks (size: 2048)
Query: "authentication middleware"
Result: "// Multiple unrelated functions...
func AuthMiddleware()... (your match)
func LoggingMiddleware()...
func CORSMiddleware()..."
(Too much noise)
Experimentation
Testing Different Settings
- Try smaller chunks for more precise results:
chunking:
size: 384
overlap: 40
- Re-index:
rm .grepai/index.gob
grepai watch
- Test with searches:
grepai search "your query"
- Adjust and repeat until satisfied.
Comparing Results
Before changing settings, save a search result:
grepai search "authentication" > before.txt
After changing settings and re-indexing:
grepai search "authentication" > after.txt
diff before.txt after.txt
Chunk Boundaries
GrepAI tries to split at logical boundaries:
- Empty lines (function/class boundaries)
- Closing braces
- Statement ends
This means actual chunk sizes may vary slightly from the target.
Best Practices
- Start with defaults: 512/50 works well for most codebases
- Adjust based on code style: Verbose = larger, concise = smaller
- Test with real queries: See what your searches return
- Re-index after changes: Must regenerate embeddings
- Consider overlap: Don't set to 0 unless index size is critical
Common Issues
โ Problem: Search results are too fragmented
โ
Solution: Increase chunk size:
chunking:
size: 768
โ Problem: Search results have too much irrelevant context
โ
Solution: Decrease chunk size:
chunking:
size: 384
โ Problem: Results miss related code at function boundaries
โ
Solution: Increase overlap:
chunking:
overlap: 100
โ Problem: Index is too large
โ
Solutions:
- Decrease overlap
- Increase chunk size
- Add more ignore patterns
Output Format
Chunking status:
โ
Chunking Configuration
Size: 512 tokens
Overlap: 50 tokens
Index Statistics:
- Total files: 245
- Total chunks: 1,234
- Avg chunks/file: 5.0
- Avg chunk size: 478 tokens
Recommendations:
- Current settings are balanced
- Consider size: 384 for more precise results
- Consider size: 768 for more context