Phoenix - AI Observability Platform
Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.
When to use Phoenix
Use Phoenix when:
- Debugging LLM application issues with detailed traces
- Running systematic evaluations on datasets
- Monitoring production LLM systems in real-time
- Building experiment pipelines for prompt/model comparison
- Running self-hosted observability without vendor lock-in
Key features:
- Tracing: OpenTelemetry-based trace collection for any LLM framework
- Evaluation: LLM-as-judge evaluators for quality assessment
- Datasets: Versioned test sets for regression testing
- Experiments: Compare prompts, models, and configurations
- Playground: Interactive prompt testing with multiple models
- Open-source: Self-hosted with PostgreSQL or SQLite
Use alternatives instead:
- LangSmith: Managed platform with LangChain-first integration
- Weights & Biases: Deep learning experiment tracking focus
- Arize Cloud: Managed Phoenix with enterprise features
- MLflow: General ML lifecycle, model registry focus
Quick start
Installation
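# Full platform
pip install arize-phoenix
# With the embeddings extra (quoted so the shell does not expand the brackets)
pip install "arize-phoenix[embeddings]"
# Lightweight packages that can be installed on their own
pip install arize-phoenix-otel      # OpenTelemetry setup helpers
pip install arize-phoenix-evals     # evaluation library
pip install arize-phoenix-client    # client for a running Phoenix server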
Launch Phoenix server
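import phoenix as px
# Start a local Phoenix server in the background
session = px.launch_app()
# Open the UI inline (in a notebook), or print its URL to open it in a browser
session.view()
print(session.url)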
Command-line server (production)
# Default storage (SQLite)
phoenix serve
# PostgreSQL storage: set the database URL before starting the server
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
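Before starting the server, it can help to sanity-check the database URL. The sketch below is a hypothetical helper (`describe_database_url` is not part of Phoenix) that uses only the standard library to report which backend a URL points at. It assumes SQLAlchemy-style URLs whose scheme begins with `postgresql` or `sqlite`.

```python
from urllib.parse import urlsplit

# Backends named in this document; treated here as the only accepted ones (assumption)
SUPPORTED_BACKENDS = {"postgresql", "sqlite"}


def describe_database_url(url: str) -> str:
    """Return the backend name for a database URL, or raise ValueError."""
    scheme = urlsplit(url).scheme
    # SQLAlchemy-style URLs may also name a driver, e.g. "postgresql+psycopg"
    backend = scheme.split("+", 1)[0]
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported database backend: {scheme!r}")
    return backend


print(describe_database_url("postgresql://user:pass@host/db"))
```

A check like this can run in a deployment script so that a mistyped URL fails early rather than at server startup.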
Basic tracing
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry to export traces to the Phoenix server
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Instrument the OpenAI SDK so that its calls are traced
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Use OpenAI as usual; this call is traced automatically
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
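The `endpoint` value above combines the server host, the port passed to `phoenix serve`, and the `/v1/traces` path. A minimal sketch of building it from its parts follows; `traces_endpoint` is a hypothetical helper introduced here for illustration, not a Phoenix function.

```python
def traces_endpoint(host: str = "localhost", port: int = 6006, secure: bool = False) -> str:
    """Build the HTTP trace collection URL for a Phoenix server."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{port}/v1/traces"


# Matches the endpoint used in the tracing example above
print(traces_endpoint())
```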
Core concepts
Traces and spans
A trace represents a complete execution flow, while spans are individual operations within that trace.
from phoenix.otel import register
from opentelemetry import trace

tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# `query`, `retriever`, and `llm` are placeholders for your own application objects
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans nest under the enclosing span
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
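To make the trace and span relationship concrete, the following toy model builds the same three-span tree without any tracing library. `ToySpan` and `render` are illustrative names invented for this sketch; they are not OpenTelemetry or Phoenix types.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySpan:
    """One operation in a trace; nested operations are stored as children."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["ToySpan"] = field(default_factory=list)

    def child(self, name: str) -> "ToySpan":
        span = ToySpan(name=name)
        self.children.append(span)
        return span


def render(span: ToySpan, depth: int = 0) -> List[str]:
    """Return the span tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# The root span and everything beneath it together form one trace
root = ToySpan("process_query")
root.child("retrieve_context")
root.child("generate_response")
print("\n".join(render(root)))
```

The root span corresponds to the outer `with` block, and each nested `with` block becomes a child of it.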
Projects
Projects organize related traces:
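import os
# Option 1: set the project through an environment variable
os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"
# Option 2: pass the project name when registering the tracer
from phoenix.otel import register
tracer_provider = register(project_name="experiment-v2")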
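When the project name can come from either place, an application may want one function that decides which value to use. The sketch below is a hypothetical helper, not Phoenix code. It assumes an explicit argument should take precedence over `PHOENIX_PROJECT_NAME`, and that `"default"` is an acceptable fallback name.

```python
import os
from typing import Optional


def resolve_project_name(explicit: Optional[str] = None, fallback: str = "default") -> str:
    """Pick a project name: explicit argument, then environment, then fallback."""
    if explicit:
        return explicit
    # An unset or empty variable falls through to the fallback (assumed behavior)
    return os.environ.get("PHOENIX_PROJECT_NAME") or fallback
```

The result can then be passed as `project_name` when registering the tracer.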
Framework instrumentation
OpenAI
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChain
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor
tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
LlamaIndex
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
Anthropic
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor
tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
Evaluation framework
Built-in evaluators
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify,
)

# Model used as the judge
eval_model = OpenAIModel(model="gpt-4o")

# Check whether the output is supported by the reference text
hallucination_eval = HallucinationEvaluator(eval_model)
# evaluate() takes a single record whose keys match the template variables
results = hallucination_eval.evaluate(
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "reference": "Paris is the capital of France.",
    }
)
Custom evaluators
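import pandas as pd
from phoenix.evals import llm_classify

HELPFULNESS_TEMPLATE = """
Evaluate if the response is helpful for the given question.
Question: {input}
Response: {output}
Is this response helpful? Answer 'helpful' or 'not_helpful'.
"""

def evaluate_helpfulness(input_text, output_text):
    # llm_classify scores tabular data: one row per example, with
    # columns named after the template variables
    data = pd.DataFrame([{"input": input_text, "output": output_text}])
    result = llm_classify(
        data,
        model=eval_model,  # the OpenAIModel created in the previous example
        template=HELPFULNESS_TEMPLATE,
        rails=["helpful", "not_helpful"],
    )
    return result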
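The `rails` argument restricts the evaluator's answer to a fixed set of labels. As a simplified illustration of the idea (not Phoenix's actual matching logic, which may differ), the sketch below maps a raw model reply onto one of the allowed labels. The `snap_to_rails` name and the `NOT_PARSABLE` sentinel are assumptions made for this example.

```python
from typing import Sequence

# Label returned when the reply matches none of the rails (assumed sentinel)
NOT_PARSABLE = "NOT_PARSABLE"


def snap_to_rails(raw_output: str, rails: Sequence[str]) -> str:
    """Map a raw model reply onto one of the allowed labels."""
    # Drop surrounding whitespace, quotes, and trailing periods; ignore case
    cleaned = raw_output.strip().strip("'\".").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return NOT_PARSABLE


rails = ["helpful", "not_helpful"]
print(snap_to_rails(" Helpful. ", rails))
print(snap_to_rails("'not_helpful'", rails))
print(snap_to_rails("It is somewhat helpful", rails))
```

Exact matching is a deliberate choice here: a substring test would misread `not_helpful` as `helpful`, because one label contains the other.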
Run evaluations on dataset
from phoenix import Client
from phoenix.evals import run_evals
client = Client()
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
)
eval_results