AI Orchestration Frameworks: The Definitive Guide
Building a single LLM call is easy. Building a reliable, multi-step AI system that chains calls, manages state, handles errors, integrates tools, and runs in production? That's orchestration - and it's where frameworks earn their keep.
This guide is a comprehensive deep-dive into every major AI orchestration framework in 2026. We cover LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, and Haystack - with real, working code examples, honest comparisons, and production-ready patterns. Whether you're building a simple RAG pipeline or a multi-agent system, this is your reference.
1. What is AI Orchestration?
AI orchestration is the practice of coordinating multiple AI components - LLM calls, tool invocations, data retrievals, and decision logic - into a coherent workflow that accomplishes a goal. Think of it as the conductor of an AI orchestra: each instrument (model, tool, database) plays its part, but someone needs to manage the timing, sequencing, and error recovery.
Why You Need It
A single openai.chat.completions.create() call gets you surprisingly far. But real applications quickly need more:
- Chaining LLM calls: One model's output feeds into another's input. Summarize → analyze → recommend.
- State management: Tracking conversation history, intermediate results, and workflow progress across multiple steps.
- Error handling: Retries, fallback models, graceful degradation when an API is down or a model hallucinates.
- Tool integration: Letting the LLM call APIs, query databases, execute code, and search the web.
- Structured output: Parsing LLM responses into typed objects your application can actually use.
- Observability: Tracing every step so you can debug, evaluate, and optimize.
Simple vs. Complex Orchestration
Not every project needs a framework. Here's the spectrum:
Orchestration Complexity Spectrum:

Simple ─────────────────────────────────────────────────────────────▶ Complex

Single LLM call → Chain of calls → RAG pipeline → Agent loop → Multi-agent system
       │               │                │             │                │
  No framework     Maybe LCEL       LangChain     LangGraph      CrewAI/AutoGen
     needed         is enough      or Haystack    or custom        or custom
When to Use a Framework vs. Roll Your Own
Use a framework when: You need RAG, agents, multi-step workflows, or multi-agent coordination. The framework handles state management, tool calling protocols, streaming, and the dozens of edge cases you'd otherwise discover in production.
Roll your own when: You have a simple chain of 2-3 LLM calls, you need maximum control over every detail, or you're building something the frameworks don't support well. A few functions with httpx and pydantic can be cleaner than importing a framework for a simple use case.
Here's what "rolling your own" looks like for a simple chain:
# Simple orchestration without a framework
import openai
from pydantic import BaseModel

client = openai.OpenAI()

class Analysis(BaseModel):
    summary: str
    sentiment: str
    key_topics: list[str]

def analyze_text(text: str) -> Analysis:
    # Step 1: Summarize
    summary_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize this text in 2 sentences."},
            {"role": "user", "content": text}
        ]
    )
    summary = summary_resp.choices[0].message.content

    # Step 2: Analyze the summary (structured output)
    analysis_resp = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Analyze this summary."},
            {"role": "user", "content": summary}
        ],
        response_format=Analysis
    )
    return analysis_resp.choices[0].message.parsed
This works fine for simple cases. But once you need streaming, retries, tool calling, memory, or tracing - you'll be rebuilding what frameworks already provide.
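To see how quickly that rebuilding adds up, here is a minimal sketch of just the retry-with-fallback piece in plain Python. The `flaky_primary` and `reliable_fallback` functions are hypothetical stand-ins for provider SDK calls; a real implementation would also distinguish retryable errors (timeouts, rate limits) from permanent ones:

```python
import time

def with_retries_and_fallback(primary, fallback, prompt,
                              max_attempts=3, base_delay=1.0):
    """Call primary(prompt) with exponential backoff; if every attempt
    fails, try fallback(prompt) the same way."""
    last_error = None
    for model in (primary, fallback):
        for attempt in range(max_attempts):
            try:
                return model(prompt)
            except Exception as e:
                last_error = e
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"All models failed: {last_error}")

# Simulate a flaky primary model and a reliable fallback
calls = {"n": 0}

def flaky_primary(prompt):
    calls["n"] += 1
    raise TimeoutError("rate limited")

def reliable_fallback(prompt):
    return f"fallback answer to: {prompt}"

print(with_retries_and_fallback(flaky_primary, reliable_fallback,
                                "Explain orchestration", base_delay=0))
```

And this still ignores streaming, partial-failure logging, and per-model timeouts - which is exactly the surface area frameworks cover for you.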
2. LangChain
LangChain is the most widely adopted AI orchestration framework. It provides a modular toolkit for building LLM-powered applications: prompts, models, output parsers, document loaders, retrievers, and chains - all composable via the LangChain Expression Language (LCEL).
Core Concepts
- Chains: Sequences of operations. Prompt → Model → Parser.
- Prompts: Templated messages with variables. Support system/user/assistant roles.
- Output Parsers: Convert raw LLM text into structured data (strings, JSON, Pydantic models).
- Document Loaders: Ingest data from PDFs, web pages, databases, APIs, and 100+ sources.
- Text Splitters: Break documents into chunks for embedding and retrieval.
- Retrievers: Search vector stores, keyword indexes, or hybrid systems to find relevant context.
LCEL: The Pipe Operator
LCEL lets you compose components with the | (pipe) operator. Each component's output becomes the next component's input - like Unix pipes for AI:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | ChatOpenAI(model="gpt-4o") | StrOutputParser()
result = chain.invoke({"input": "Explain microservices"})
print(result)
The beauty of LCEL is that every chain automatically supports .invoke(), .stream(), .batch(), and .ainvoke() - sync, streaming, batch, and async - with zero extra code.
# Streaming - works automatically with LCEL
for chunk in chain.stream({"input": "Explain microservices"}):
    print(chunk, end="", flush=True)

# Batch - process multiple inputs in parallel
results = chain.batch([
    {"input": "Explain microservices"},
    {"input": "Explain serverless"},
    {"input": "Explain containers"},
])
Document Loading + Splitting
LangChain has 100+ document loaders. Here's a common pattern - load a PDF, split it into chunks, and embed it:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load
loader = PyPDFLoader("architecture-guide.pdf")
docs = loader.load()

# Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)

# Embed and store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)
print(f"Indexed {len(chunks)} chunks")
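The chunk_size/chunk_overlap mechanics are worth internalizing. A simplified sliding-window sketch - the real RecursiveCharacterTextSplitter additionally tries each separator in the hierarchy before falling back to a hard cut, but the overlap behavior is the same idea:

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with its neighbor so context at chunk
    boundaries is never lost."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                          # 4 chunks for 2500 chars
print(chunks[0][-200:] == chunks[1][:200])  # True - the overlap is shared
```

Overlap costs extra tokens at embedding time, but it means a sentence straddling a boundary is retrievable from either neighboring chunk.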
Retriever Chain (RAG)
Once you have a vector store, build a RAG chain that retrieves relevant context before answering:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't have that information."

Context:
{context}"""),
    ("user", "{question}")
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

answer = rag_chain.invoke("What are the key architectural principles?")
print(answer)
Structured Output with Pydantic
Force the LLM to return typed, validated data using Pydantic models:
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class CodeReview(BaseModel):
    """Structured code review output."""
    issues: list[str] = Field(description="List of issues found")
    severity: str = Field(description="Overall severity: low, medium, high")
    suggestions: list[str] = Field(description="Improvement suggestions")
    score: int = Field(description="Code quality score 1-10", ge=1, le=10)

llm = ChatOpenAI(model="gpt-4o")
structured_llm = llm.with_structured_output(CodeReview)

review = structured_llm.invoke(
    "Review this code: def add(a,b): return a+b"
)
print(f"Score: {review.score}/10")
print(f"Severity: {review.severity}")
for issue in review.issues:
    print(f"  - {issue}")
This is one of LangChain's killer features - .with_structured_output() works across providers (OpenAI, Anthropic, Google) and handles the JSON schema generation, function calling, and validation automatically.
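Under the hood, structured output is three steps: derive a JSON schema from the model class, pass it through the provider's function-calling or JSON mode, and validate the reply. A stdlib-only sketch of the validation step - the raw_reply string and REQUIRED_FIELDS table here are illustrative stand-ins for what Pydantic does automatically:

```python
import json

# What a provider might return in JSON mode for the CodeReview schema
raw_reply = ('{"issues": ["no type hints"], "severity": "low", '
             '"suggestions": ["add docstring"], "score": 7}')

REQUIRED_FIELDS = {
    "issues": list,
    "severity": str,
    "suggestions": list,
    "score": int,
}

def validate_review(raw: str) -> dict:
    """Parse the model's JSON reply and check each required field's type -
    roughly the validation with_structured_output performs via Pydantic."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    if not 1 <= data["score"] <= 10:
        raise ValueError("score out of range")
    return data

review = validate_review(raw_reply)
print(review["score"])  # 7
```

When validation fails in the real stack, LangChain can surface the Pydantic error back to the model for a retry - another thing you would otherwise hand-roll.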
3. LangGraph
LangGraph is LangChain's framework for building stateful, multi-step agent workflows as graphs. While LCEL handles linear chains, LangGraph handles cycles, conditional branching, and persistent state - everything you need for real agents.
Core Concepts
- State: A TypedDict that flows through the graph. Every node reads and writes to it.
- Nodes: Functions that take state, do work, and return state updates.
- Edges: Connections between nodes. Can be unconditional or conditional.
- Conditional Edges: Route to different nodes based on state (the "if/else" of graphs).
- Checkpointing: Persist state between runs for long-running workflows and human-in-the-loop.
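Before the LangGraph API, it helps to see the mental model in plain Python: nodes are functions over a state dict, and a router reads the state to pick the next node. This is a toy sketch with hypothetical node names, not LangGraph code:

```python
def agent(state: dict) -> dict:
    # Pretend the LLM requests a tool until it has a result
    if state.get("tool_result") is None:
        state["next"] = "tools"
    else:
        state["answer"] = f"Done, using {state['tool_result']}"
        state["next"] = "end"
    return state

def tools(state: dict) -> dict:
    state["tool_result"] = "search results"
    state["next"] = "agent"  # unconditional edge back to the agent
    return state

NODES = {"agent": agent, "tools": tools}

def run_graph(state: dict, entry: str = "agent") -> dict:
    """Walk the graph: each node updates state, the 'next' field routes."""
    node = entry
    while node != "end":
        state = NODES[node](state)
        node = state["next"]  # conditional routing driven by state
    return state

final = run_graph({})
print(final["answer"])
```

LangGraph formalizes exactly this loop - plus reducers for merging state updates, checkpointing, and streaming - which is why the real API below will look familiar.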
Graph Flow Visualization
Multi-Step Agent Graph:

                ┌─────────┐
                │  START  │
                └────┬────┘
                     │
                     ▼
              ┌──────────────┐
      ┌──────▶│  Agent LLM   │◀─────────────────┐
      │       └──────┬───────┘                  │
      │              │                          │
      │       ┌──────▼───────┐                  │
      │       │    Router    │                  │
      │       │ (conditional)│                  │
      │       └──┬────┬───┬──┘                  │
      │    tool  │    │   │  end                │
      │    call  │    │   │                     │
      │          ▼    │   ▼                     │
      │    ┌───────┐  │  ┌─────┐                │
      │    │ Tools │  │  │ END │                │
      │    └───┬───┘  │  └─────┘                │
      │        │      │  human                  │
      └────────┘      │  review                 │
                      ▼                         │
              ┌──────────────┐                  │
              │    Human     │──── approved ────┘
              │    Review    │
              └──────────────┘
Building a Complete Agent
Here's a full LangGraph agent with tool calling, conditional routing, and state management:
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from langchain_core.tools import tool

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next_step: str

# Define tools
@tool
def search_web(query: str) -> str:
    """Search the web for current information."""
    # In production, use a real search API
    return f"Search results for: {query}"

@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    try:
        result = eval(expression)  # Use a safe evaluator in production
        return str(result)
    except Exception as e:
        return f"Error: {e}"

tools = [search_web, calculate]
llm = ChatOpenAI(model="gpt-4o").bind_tools(tools)

# Node: call the LLM
def agent_node(state: AgentState) -> dict:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

# Node: execute tool calls
def tool_node(state: AgentState) -> dict:
    last_message = state["messages"][-1]
    results = []
    for call in last_message.tool_calls:
        tool_fn = {t.name: t for t in tools}[call["name"]]
        result = tool_fn.invoke(call["args"])
        results.append(
            ToolMessage(content=result, tool_call_id=call["id"])
        )
    return {"messages": results}

# Conditional edge: should we call tools or finish?
def should_continue(state: AgentState) -> str:
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "end"

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue, {
    "tools": "tools",
    "end": END
})
graph.add_edge("tools", "agent")  # After tools, go back to agent

app = graph.compile()

# Run it
result = app.invoke({
    "messages": [HumanMessage(content="What is 42 * 17, and search for the latest Python release")]
})
for msg in result["messages"]:
    print(f"{msg.type}: {msg.content[:100]}")
Human-in-the-Loop with interrupt_before
LangGraph's killer feature: pause execution before a node, let a human review, then resume:
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()

# Compile with interrupt - pauses BEFORE the tools node
app_with_hitl = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["tools"]
)

# Start the conversation
config = {"configurable": {"thread_id": "review-123"}}
result = app_with_hitl.invoke(
    {"messages": [HumanMessage(content="Search for competitor pricing")]},
    config=config
)

# Execution pauses here - the agent wants to call tools
# A human reviews the pending tool calls:
pending_calls = result["messages"][-1].tool_calls
print("Agent wants to call:")
for call in pending_calls:
    print(f"  {call['name']}({call['args']})")

# Human approves - resume execution
# (pass None to continue from where we left off)
final_result = app_with_hitl.invoke(None, config=config)
print(final_result["messages"][-1].content)
Checkpointing with MemorySaver
Checkpointing persists the full graph state, enabling long-running workflows, crash recovery, and time-travel debugging:
from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.postgres import PostgresSaver

# In-memory (development)
memory_saver = MemorySaver()

# PostgreSQL (production)
# postgres_saver = PostgresSaver.from_conn_string(
#     "postgresql://user:pass@localhost/langgraph"
# )

app = graph.compile(checkpointer=memory_saver)

# Each thread_id maintains separate state
config_a = {"configurable": {"thread_id": "user-alice"}}
config_b = {"configurable": {"thread_id": "user-bob"}}

# Alice's conversation
app.invoke({"messages": [HumanMessage(content="Hello, I'm Alice")]}, config_a)

# Bob's conversation (completely separate state)
app.invoke({"messages": [HumanMessage(content="Hello, I'm Bob")]}, config_b)

# Continue Alice's conversation - state is preserved
app.invoke({"messages": [HumanMessage(content="What's my name?")]}, config_a)
# Agent correctly responds "Alice" because state is checkpointed
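Conceptually, a checkpointer is just a map from thread_id to saved state, written after every step. This toy in-memory version - with a hypothetical invoke helper standing in for a graph step - shows the contract MemorySaver and PostgresSaver both fulfill:

```python
class InMemoryCheckpointer:
    """Toy checkpointer: one saved state per thread_id."""
    def __init__(self):
        self._store: dict[str, dict] = {}

    def load(self, thread_id: str) -> dict:
        return self._store.get(thread_id, {"messages": []})

    def save(self, thread_id: str, state: dict) -> None:
        self._store[thread_id] = state

def invoke(checkpointer, thread_id: str, user_message: str) -> dict:
    state = checkpointer.load(thread_id)      # resume prior state
    state["messages"] = state["messages"] + [user_message]
    checkpointer.save(thread_id, state)       # persist after the step
    return state

cp = InMemoryCheckpointer()
invoke(cp, "user-alice", "Hello, I'm Alice")
invoke(cp, "user-bob", "Hello, I'm Bob")
state = invoke(cp, "user-alice", "What's my name?")
print(state["messages"])  # Alice's two messages; Bob's thread is untouched
```

The real checkpointers additionally version every step (enabling time-travel) and serialize the full graph state, not just messages.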
LangGraph is the right choice when you need cycles (agent loops), conditional branching, persistent state, or human-in-the-loop. For linear chains, stick with LCEL.
4. CrewAI
CrewAI takes a fundamentally different approach: instead of graphs and chains, you define agents with roles, assign them tasks, and organize them into crews. It's the most intuitive framework for multi-agent systems - think of it as assembling a team of specialists.
Core Concepts
- Agent: A persona with a role, goal, backstory, and tools. Each agent is a specialist.
- Task: A specific job assigned to an agent, with a description and expected output.
- Crew: A team of agents working together on tasks, with a defined process.
- Process: How tasks are executed - sequential (one after another) or hierarchical (a manager delegates).
- Tools: Functions agents can call - search, scrape, file I/O, APIs, custom tools.
Full Crew with 3 Agents
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, WebsiteSearchTool

# Agent 1: Researcher
researcher = Agent(
    role="Senior Research Analyst",
    goal="Find comprehensive information about {topic}",
    backstory="Expert researcher with 10 years' experience in technology "
              "analysis. Known for thorough, accurate research.",
    tools=[SerperDevTool(), WebsiteSearchTool()],
    llm="gpt-4o",
    verbose=True
)

# Agent 2: Writer
writer = Agent(
    role="Technical Content Writer",
    goal="Write an engaging, accurate article about {topic}",
    backstory="Award-winning technical writer who makes complex topics "
              "accessible. Focuses on practical, actionable content.",
    llm="gpt-4o",
    verbose=True
)

# Agent 3: Editor
editor = Agent(
    role="Senior Editor",
    goal="Review and polish the article for publication",
    backstory="Meticulous editor with a keen eye for accuracy, clarity, "
              "and engagement. Ensures content meets publication standards.",
    llm="gpt-4o",
    verbose=True
)

# Define tasks
research_task = Task(
    description="Research {topic} thoroughly. Find key trends, statistics, "
                "expert opinions, and practical examples. Cover at least "
                "5 different aspects of the topic.",
    expected_output="A detailed research brief with sources, key findings, "
                    "statistics, and expert quotes.",
    agent=researcher
)

writing_task = Task(
    description="Using the research brief, write a 1500-word article about "
                "{topic}. Include an introduction, 5 main sections, code "
                "examples where relevant, and a conclusion.",
    expected_output="A complete, well-structured article in markdown format.",
    agent=writer
)

editing_task = Task(
    description="Review the article for accuracy, clarity, grammar, and "
                "engagement. Fix any issues and add suggestions for "
                "improvement. Ensure all claims are supported by the research.",
    expected_output="The final, polished article ready for publication.",
    agent=editor
)

# Assemble the crew - sequential process
crew = Crew(
    agents=[researcher, writer, editor],
    tasks=[research_task, writing_task, editing_task],
    process=Process.sequential,
    verbose=True
)

# Run it
result = crew.kickoff(inputs={"topic": "AI orchestration frameworks in 2026"})
print(result)
Hierarchical Process
In hierarchical mode, a manager agent automatically delegates tasks to the best-suited agent:
# Hierarchical - a manager coordinates the agents
hierarchical_crew = Crew(
    agents=[researcher, writer, editor],
    tasks=[research_task, writing_task, editing_task],
    process=Process.hierarchical,
    manager_llm="gpt-4o",
    verbose=True
)

result = hierarchical_crew.kickoff(
    inputs={"topic": "Kubernetes security best practices"}
)
Custom Tools
Build your own tools for agents to use:
from crewai.tools import BaseTool
from pydantic import BaseModel, Field
import requests

class GitHubSearchInput(BaseModel):
    query: str = Field(description="Search query for GitHub repositories")

class GitHubSearchTool(BaseTool):
    name: str = "GitHub Repository Search"
    description: str = "Search GitHub for repositories matching a query"
    args_schema: type[BaseModel] = GitHubSearchInput

    def _run(self, query: str) -> str:
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": query, "sort": "stars", "per_page": 5},
            headers={"Accept": "application/vnd.github.v3+json"}
        )
        repos = resp.json().get("items", [])
        results = []
        for repo in repos:
            results.append(
                f"- {repo['full_name']} ({repo['stargazers_count']} ★): "
                f"{repo['description']}"
            )
        return "\n".join(results) or "No repositories found."

# Use it
researcher_with_github = Agent(
    role="Open Source Analyst",
    goal="Find and analyze relevant open source projects",
    backstory="Expert in evaluating open source software quality and adoption.",
    tools=[GitHubSearchTool(), SerperDevTool()],
    llm="gpt-4o"
)
CrewAI shines when you can naturally decompose a problem into roles. If you find yourself thinking "I need a researcher, a writer, and an editor," CrewAI is your framework.
5. AutoGen
AutoGen is Microsoft's framework for building multi-agent conversations. The core idea: agents talk to each other to solve problems. One agent proposes a solution, another critiques it, a third executes code to verify - all through natural conversation.
Core Concepts
- AssistantAgent: An LLM-powered agent that generates responses and code.
- UserProxyAgent: Represents the human user. Can auto-execute code or ask for human input.
- GroupChat: A conversation with 3+ agents, managed by a GroupChatManager that decides who speaks next.
- Code Execution: Agents can write and execute code in sandboxed environments (local or Docker).
Two-Agent Conversation
from autogen import AssistantAgent, UserProxyAgent

# Configuration for the LLM
llm_config = {
    "model": "gpt-4o",
    "api_key": "your-api-key",  # or use OAI_CONFIG_LIST
}

# The assistant - generates solutions
assistant = AssistantAgent(
    name="coding_assistant",
    system_message="You are a helpful AI assistant. Solve tasks using Python code. "
                   "When you write code, put it in ```python blocks. "
                   "Reply TERMINATE when the task is done.",
    llm_config=llm_config
)

# The user proxy - executes code and provides feedback
user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",  # Auto-execute without asking
    max_consecutive_auto_reply=5,
    code_execution_config={
        "work_dir": "coding_output",
        "use_docker": False  # Set True for sandboxed execution
    }
)

# Start the conversation
user_proxy.initiate_chat(
    assistant,
    message="Create a Python script that fetches the top 10 trending "
            "GitHub repositories and saves them to a CSV file."
)
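The loop behind initiate_chat is simpler than it looks: the two agents alternate turns until the assistant says TERMINATE or the auto-reply budget runs out. A pure-Python sketch with a scripted assistant (the run_chat function and its canned replies are illustrative, not AutoGen APIs):

```python
def run_chat(assistant_turns, max_consecutive_auto_reply=5):
    """Alternate assistant/proxy turns until TERMINATE appears or the
    proxy's auto-reply budget is spent."""
    transcript = []
    for turn, reply in enumerate(assistant_turns):
        if turn >= max_consecutive_auto_reply:
            break  # budget exhausted - hand back to the human
        transcript.append(("assistant", reply))
        if "TERMINATE" in reply:
            break  # assistant signals the task is done
        transcript.append(("user_proxy", "executed code, output: ok"))
    return transcript

log = run_chat([
    "Here is a first draft of the script.",
    "Output looks right. TERMINATE",
])
for speaker, text in log:
    print(f"{speaker}: {text}")
```

In real AutoGen the proxy's turn is not canned - it actually executes any code blocks in the assistant's reply and feeds stdout/stderr back, which is what drives the iteration.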
Group Chat with 3+ Agents
Multiple agents collaborate through conversation - each with a different expertise:
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"model": "gpt-4o"}

# Planner - breaks down tasks
planner = AssistantAgent(
    name="planner",
    system_message="You are a project planner. Break down complex tasks into "
                   "clear, actionable steps. Do not write code.",
    llm_config=llm_config
)

# Coder - writes the code
coder = AssistantAgent(
    name="coder",
    system_message="You are an expert Python developer. Write clean, "
                   "well-documented code based on the plan. Put code in "
                   "```python blocks.",
    llm_config=llm_config
)

# Reviewer - reviews code quality
reviewer = AssistantAgent(
    name="reviewer",
    system_message="You are a senior code reviewer. Review code for bugs, "
                   "security issues, and best practices. Be specific and "
                   "constructive. Say APPROVE if the code is ready.",
    llm_config=llm_config
)

# Executor - runs the code
executor = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "group_output", "use_docker": False}
)

# Create the group chat
group_chat = GroupChat(
    agents=[planner, coder, reviewer, executor],
    messages=[],
    max_round=15,
    speaker_selection_method="auto"  # LLM decides who speaks next
)

manager = GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config
)

# Kick off the conversation
executor.initiate_chat(
    manager,
    message="Build a REST API with FastAPI that has CRUD endpoints for a "
            "todo list with SQLite storage. Include input validation."
)
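With "auto" selection, the manager LLM picks the next speaker each round; AutoGen also supports simpler strategies like "round_robin", which just cycles through the agents in order. The cycling variant is easy to picture without the framework:

```python
def round_robin_speakers(agents, max_round):
    """Yield speakers in fixed rotation - a sketch of what a
    'round_robin' speaker-selection strategy does, without the LLM."""
    for i in range(max_round):
        yield agents[i % len(agents)]

order = list(round_robin_speakers(
    ["planner", "coder", "reviewer", "executor"], max_round=6
))
print(order)
```

The trade-off: round-robin is predictable and cheap, while "auto" adds one manager-LLM call per round but can skip agents who have nothing useful to say.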
Code Execution in Docker
For production safety, run agent-generated code in Docker containers:
from autogen.coding import DockerCommandLineCodeExecutor

# Create a Docker-based executor
docker_executor = DockerCommandLineCodeExecutor(
    image="python:3.12-slim",
    timeout=60,
    work_dir="./docker_output"
)

user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config={"executor": docker_executor}
)

# Now any code the assistant writes runs in an isolated container
# - no access to your host filesystem or network (unless configured)
AutoGen excels at open-ended problem solving where agents need to iterate through conversation. The conversation-first approach makes it natural for tasks like "build this feature" where planning, coding, reviewing, and testing happen in a back-and-forth flow.
6. Semantic Kernel
Semantic Kernel is Microsoft's SDK for integrating AI into applications. It's the enterprise choice - first-class C#/.NET support, strong typing, plugin architecture, and deep Azure integration. It also supports Python and Java.
Core Concepts
- Kernel: The central orchestrator that manages services, plugins, and memory.
- Plugins: Collections of functions (native code or LLM prompts) that the kernel can invoke.
- Planners: Automatically compose plugins into multi-step plans to achieve a goal.
- Memory: Built-in semantic memory for storing and retrieving information by meaning.
- Connectors: Integrations with OpenAI, Azure OpenAI, Hugging Face, and other providers.
Python Example
import asyncio
import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.functions import kernel_function

# Initialize the kernel
kernel = sk.Kernel()
kernel.add_service(
    OpenAIChatCompletion(
        ai_model_id="gpt-4o",
        service_id="chat"
    )
)

# Define a plugin with native functions
class ContentPlugin:
    @kernel_function(
        name="summarize",
        description="Summarize text to a specified length"
    )
    def summarize(self, text: str, max_words: int = 100) -> str:
        # This would normally call the LLM via the kernel
        return f"Summary of ({len(text)} chars) in {max_words} words"

    @kernel_function(
        name="translate",
        description="Translate text to a target language"
    )
    def translate(self, text: str, language: str = "Spanish") -> str:
        return f"Translated to {language}: {text}"

# Add the plugin
kernel.add_plugin(ContentPlugin(), plugin_name="content")

# kernel.invoke is a coroutine, so run it inside an event loop
async def main():
    result = await kernel.invoke(
        kernel.get_function("content", "summarize"),
        text="Long article text here...",
        max_words=50
    )
    print(result)

asyncio.run(main())
C# Example
using System.ComponentModel;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Build the kernel
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",
    endpoint: "https://your-resource.openai.azure.com/",
    apiKey: "your-api-key"
);
var kernel = builder.Build();

// Register the plugin (defined below)
kernel.Plugins.AddFromType<WeatherPlugin>();

// Auto function calling - the LLM decides which plugins to use
var settings = new OpenAIPromptExecutionSettings
{
    ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions
};

var chatService = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory();
history.AddUserMessage("What's the weather in Seattle and should I bring an umbrella this week?");

var response = await chatService.GetChatMessageContentAsync(
    history, settings, kernel
);
Console.WriteLine(response.Content);

// Plugin definition (type declarations go after top-level statements)
public class WeatherPlugin
{
    [KernelFunction, Description("Get the current weather for a city")]
    public string GetWeather(
        [Description("The city name")] string city)
    {
        // In production, call a real weather API
        return $"The weather in {city} is 72°F and sunny.";
    }

    [KernelFunction, Description("Get a 5-day forecast")]
    public string GetForecast(
        [Description("The city name")] string city)
    {
        return $"5-day forecast for {city}: Sunny, Cloudy, Rain, Sunny, Sunny";
    }
}
Semantic Kernel is the right choice for enterprise .NET shops, teams already on Azure, or projects that need strong typing and plugin architecture. The C# experience is significantly more polished than the Python SDK.
7. Haystack
Haystack by deepset is a framework purpose-built for document processing and RAG pipelines. While other frameworks bolt on RAG as a feature, Haystack makes it the core abstraction. If your primary use case is search, question answering, or document intelligence, Haystack deserves serious consideration.
Core Concepts
- Components: Modular building blocks - converters, splitters, embedders, retrievers, generators, rankers.
- Pipelines: DAGs (directed acyclic graphs) of components. Data flows through connected components.
- Document Stores: Storage backends for documents and embeddings - Elasticsearch, Weaviate, Pinecone, Chroma, pgvector.
- Converters: Turn files (PDF, DOCX, HTML, Markdown) into Haystack Document objects.
RAG Pipeline Example
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import (
    OpenAIDocumentEmbedder,
    OpenAITextEmbedder
)
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import (
    InMemoryEmbeddingRetriever
)
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.in_memory import InMemoryDocumentStore

# --- Indexing Pipeline ---
# InMemoryDocumentStore pairs with the InMemoryEmbeddingRetriever below;
# for persistence, swap in Chroma/Weaviate/etc. with its matching retriever.
document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", PyPDFToDocument())
indexing.add_component("splitter", DocumentSplitter(
    split_by="sentence",
    split_length=3,
    split_overlap=1
))
indexing.add_component("embedder", OpenAIDocumentEmbedder(
    model="text-embedding-3-small"
))
indexing.add_component("writer", DocumentWriter(
    document_store=document_store
))

# Connect the components
indexing.connect("converter", "splitter")
indexing.connect("splitter", "embedder")
indexing.connect("embedder", "writer")

# Run indexing
indexing.run({"converter": {"sources": ["report.pdf"]}})

# --- Query Pipeline ---
template = """Answer the question based on the context below.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:"""

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", OpenAITextEmbedder(
    model="text-embedding-3-small"
))
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(
    document_store=document_store,
    top_k=5
))
query_pipeline.add_component("prompt", PromptBuilder(template=template))
query_pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o"))

# Connect query components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt.documents")
query_pipeline.connect("prompt", "llm")

# Run a query
result = query_pipeline.run({
    "text_embedder": {"text": "What were the key findings?"},
    "prompt": {"question": "What were the key findings?"}
})
print(result["llm"]["replies"][0])
Haystack's explicit pipeline wiring is more verbose than LangChain's LCEL, but it's also more transparent - you can see exactly how data flows between components. The framework excels at document-heavy workloads where you need fine-grained control over ingestion, splitting, embedding, and retrieval.
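The connect-by-name model is easy to emulate, which makes it easy to reason about. A toy pipeline - the ToyPipeline class and its lambda components are illustrative, assuming a strictly linear DAG where each component's whole output feeds the next input:

```python
class ToyPipeline:
    """Named components plus explicit wiring, run in connection order."""
    def __init__(self):
        self._components = {}
        self._order = []

    def add_component(self, name, fn):
        self._components[name] = fn

    def connect(self, src, dst):
        # Record execution order from the declared connections
        if src not in self._order:
            self._order.append(src)
        self._order.append(dst)

    def run(self, first_input):
        data = first_input
        for name in self._order:
            data = self._components[name](data)  # output feeds next input
        return data

pipe = ToyPipeline()
pipe.add_component("converter", lambda path: f"text of {path}")
pipe.add_component("splitter", lambda text: text.split())
pipe.add_component("counter", lambda chunks: len(chunks))
pipe.connect("converter", "splitter")
pipe.connect("splitter", "counter")
print(pipe.run("report.pdf"))  # 3
```

Real Haystack pipelines go further - components declare typed input/output sockets (hence "text_embedder.embedding" → "retriever.query_embedding"), and the DAG can branch and merge - but the explicit-wiring philosophy is the same.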
8. Framework Comparison
Here's an honest, side-by-side comparison of every framework we've covered:
| Framework | Best For | Learning Curve | Agent Support | Multi-Agent | Streaming | Production Ready | Languages | Community |
|---|---|---|---|---|---|---|---|---|
| LangChain | RAG, chains, general orchestration | Medium | ✅ Good (via LangGraph) | ⚠️ Basic | ✅ Native | ✅ Yes | Python, JS/TS | 🔥 Largest (95k+ ⭐) |
| LangGraph | Complex agents, stateful workflows | High | ✅ Excellent | ✅ Yes | ✅ Native | ✅ Yes | Python, JS/TS | Growing (10k+ ⭐) |
| CrewAI | Multi-agent teams, role-based tasks | Low | ✅ Excellent | ✅ Core feature | ⚠️ Limited | ⚠️ Maturing | Python | Fast-growing (25k+ ⭐) |
| AutoGen | Conversational agents, code generation | Medium | ✅ Excellent | ✅ Core feature | ✅ Yes | ⚠️ Maturing | Python, .NET | Large (35k+ ⭐) |
| Semantic Kernel | Enterprise, .NET, Azure integration | Medium-High | ✅ Good | ⚠️ Basic | ✅ Yes | ✅ Yes | C#, Python, Java | Solid (22k+ ⭐) |
| Haystack | Document processing, RAG, search | Medium | ⚠️ Basic | ❌ No | ✅ Yes | ✅ Yes | Python | Established (18k+ ⭐) |
9. Choosing the Right Framework
The "best" framework depends entirely on your use case. Here's a decision guide:
Decision Flowchart
Which Framework Should You Use?

What are you building?
│
├── Simple chain (prompt → LLM → parse)?
│   └── ✅ LangChain LCEL - minimal overhead, great DX
│
├── RAG / document search pipeline?
│   ├── Need fine-grained control over ingestion?
│   │   └── ✅ Haystack - purpose-built for document pipelines
│   └── Need RAG + agents + other features?
│       └── ✅ LangChain - broadest ecosystem
│
├── Complex agent with loops and branching?
│   ├── Need human-in-the-loop?
│   │   └── ✅ LangGraph - checkpointing + interrupt_before
│   └── Need persistent state across sessions?
│       └── ✅ LangGraph - built-in checkpointing
│
├── Multi-agent team collaboration?
│   ├── Role-based (researcher, writer, editor)?
│   │   └── ✅ CrewAI - most intuitive role-based API
│   └── Conversation-based (agents discuss and iterate)?
│       └── ✅ AutoGen - conversation-first design
│
├── Enterprise / .NET shop?
│   └── ✅ Semantic Kernel - best C# support, Azure integration
│
└── Not sure / prototyping?
    └── ✅ LangChain - largest ecosystem, most examples, easiest to start
Practical Recommendations
- Starting out? Begin with LangChain LCEL. It has the most tutorials, examples, and community support. You can always add LangGraph when you need agents.
- Building agents? LangGraph gives you the most control. CrewAI is faster to prototype with but harder to customize.
- Multi-agent systems? CrewAI for role-based teams, AutoGen for conversation-based collaboration. Try both - the right choice depends on your mental model.
- Enterprise .NET? Semantic Kernel is the only serious option. The C# SDK is excellent.
- Document-heavy workloads? Haystack's pipeline model gives you the most control over ingestion and retrieval.
- Mixing frameworks? Totally valid. Use LangChain for RAG, LangGraph for agent orchestration, and LangSmith for observability. They're designed to work together.
10. Production Patterns
Getting a demo working is easy. Getting it to run reliably in production is where the real engineering happens. Here are the patterns that matter.
Error Handling & Retries
LLM APIs fail. Rate limits hit. Models hallucinate. Build resilience from day one:
from langchain_openai import ChatOpenAI
from tenacity import retry, stop_after_attempt, wait_exponential
import logging

logger = logging.getLogger(__name__)

# Pattern 1: LangChain's built-in retry
llm_with_retry = ChatOpenAI(
    model="gpt-4o",
    max_retries=3,
    request_timeout=30
)

# Pattern 2: Custom retry with tenacity for complex logic
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    reraise=True
)
def robust_chain_invoke(chain, input_data):
    try:
        return chain.invoke(input_data)
    except Exception as e:
        logger.warning(f"Chain failed, retrying: {e}")
        raise
Fallback Models
If your primary model is down or rate-limited, fall back to an alternative:
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Primary: GPT-4o, Fallback: Claude 3.5 Sonnet
primary = ChatOpenAI(model="gpt-4o", max_retries=2)
fallback = ChatAnthropic(model="claude-3-5-sonnet-20241022", max_retries=2)

# .with_fallbacks() tries the next model if the first fails
llm = primary.with_fallbacks([fallback])

# Works transparently - your chain doesn't know which model responded
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"input": "Explain orchestration"})
# If GPT-4o is down, automatically uses Claude
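Conceptually, the fallback pattern is just a sequential try/except over an ordered list of model callables. A framework-free sketch of the idea (the `flaky_primary` and `steady_fallback` functions below are stand-ins for real API calls, not SDK code):

```python
from typing import Callable, List


def invoke_with_fallbacks(models: List[Callable[[str], str]], prompt: str) -> str:
    """Try each model in order; return the first successful response."""
    errors = []
    for call in models:
        try:
            return call(prompt)
        except Exception as exc:  # in practice, catch specific API error types
            errors.append(exc)
    raise RuntimeError(f"All {len(models)} models failed: {errors}")


# Stand-in callables for illustration
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary is down")


def steady_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


# The primary raises, so the second model handles the request
print(invoke_with_fallbacks([flaky_primary, steady_fallback], "Explain orchestration"))
```

The key design point is that the caller only sees a single callable; which model actually answered is an internal detail, which is exactly the property `.with_fallbacks()` gives you.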
Streaming Responses
Don't make users wait for the full response. Stream tokens as they're generated:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("user", "{input}")
])
chain = prompt | ChatOpenAI(model="gpt-4o") | StrOutputParser()
# Sync streaming
for chunk in chain.stream({"input": "Write a haiku about Python"}):
    print(chunk, end="", flush=True)

# Async streaming (for web frameworks like FastAPI)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/chat")
async def chat(question: str):
    async def generate():
        async for chunk in chain.astream({"input": question}):
            yield chunk
    return StreamingResponse(generate(), media_type="text/plain")
Caching
Identical prompts should return cached results - saves money and latency:
from langchain_core.globals import set_llm_cache
from langchain_community.cache import SQLiteCache, RedisCache
# SQLite cache (development)
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))
# Redis cache (production)
# from redis import Redis
# set_llm_cache(RedisCache(redis_=Redis(host="localhost", port=6379)))
# Now identical calls are cached automatically
llm = ChatOpenAI(model="gpt-4o")
# First call: hits the API (~1-2s)
result1 = llm.invoke("What is 2+2?")
# Second call: returns from cache (~1ms)
result2 = llm.invoke("What is 2+2?")
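Under the hood, LLM caching is plain memoization keyed on the full request. A minimal in-memory sketch, assuming a deterministic key built from the model name and prompt (`fake_api` is a placeholder for a real API call):

```python
import hashlib
import json

_cache: dict = {}


def cached_call(model: str, prompt: str, call_api) -> str:
    """Return a cached response if this exact (model, prompt) pair was seen before."""
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(prompt)  # only hit the API on a cache miss
    return _cache[key]


# Placeholder API that counts how often it is actually invoked
calls = []
def fake_api(prompt: str) -> str:
    calls.append(prompt)
    return f"answer: {prompt}"


cached_call("gpt-4o", "What is 2+2?", fake_api)  # miss: calls the API
cached_call("gpt-4o", "What is 2+2?", fake_api)  # hit: served from the dict
```

Note the key covers everything that affects the output; a production cache would also include temperature and other sampling parameters, which is why LangChain hashes the serialized request.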
Observability with LangSmith
You can't debug what you can't see. LangSmith traces every step of your chains and agents:
# Set environment variables to enable LangSmith tracing
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-orchestration-app"
# That's it - all LangChain/LangGraph calls are now traced
# Every chain.invoke() logs:
# - Input/output at each step
# - Latency per component
# - Token usage and cost
# - Error traces with full context
# - Tool call arguments and results
# Custom tracing for non-LangChain code
from langsmith import traceable
@traceable(name="custom_processing")
def process_results(raw_data: str) -> dict:
    # Your custom logic here
    processed = {"summary": raw_data[:100], "length": len(raw_data)}
    return processed
# This function now appears in LangSmith traces alongside LangChain calls
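If you are not using LangSmith, the core of a `@traceable`-style decorator is small: wrap the function, record its name, latency, and success/failure, and ship that record somewhere. A minimal stand-in (here the records just go into a local `TRACES` list rather than a tracing backend):

```python
import functools
import time

TRACES = []  # in production, these records would be sent to a tracing backend


def traced(name: str):
    """Decorator that records name, latency, and outcome of each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                TRACES.append({"name": name, "ok": True,
                               "latency_s": time.perf_counter() - start})
                return result
            except Exception:
                TRACES.append({"name": name, "ok": False,
                               "latency_s": time.perf_counter() - start})
                raise
        return wrapper
    return decorator


@traced("custom_processing")
def summarize(raw_data: str) -> dict:
    return {"summary": raw_data[:100], "length": len(raw_data)}


summarize("hello world")  # appends one trace record
```

Real tracing systems add trace/span IDs so nested calls link together, but the wrap-time-and-record pattern is the same.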
Rate Limiting
Protect yourself from runaway costs and API rate limits:
from langchain_core.rate_limiters import InMemoryRateLimiter
# Limit to 10 requests per second
rate_limiter = InMemoryRateLimiter(
    requests_per_second=10,
    check_every_n_seconds=0.1,
    max_bucket_size=20  # allow short bursts of up to 20 requests
)

llm = ChatOpenAI(
    model="gpt-4o",
    rate_limiter=rate_limiter
)

# Now even batch operations respect the rate limit
results = llm.batch([
    f"Question {i}" for i in range(100)
])  # Automatically throttled to 10 req/s
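`InMemoryRateLimiter` is a token bucket: tokens refill continuously at `requests_per_second`, the bucket holds at most `max_bucket_size` tokens (the burst allowance), and each request consumes one token, blocking when the bucket is empty. A minimal sketch of the algorithm:

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            # Refill based on time elapsed since the last check
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep roughly until the next token arrives
            time.sleep((1 - self.tokens) / self.rate)


bucket = TokenBucket(rate=10, capacity=20)
bucket.acquire()  # the first 20 calls pass immediately (burst), then ~10/s
```

The burst capacity is why `max_bucket_size=20` lets a quiet client fire 20 requests at once without waiting, while sustained traffic still averages 10 requests per second.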
Putting It All Together
A production-ready chain combines all these patterns:
import os
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.globals import set_llm_cache
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_community.cache import SQLiteCache
# Observability
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "production-app"
# Caching
set_llm_cache(SQLiteCache(database_path=".cache.db"))
# Rate limiting
limiter = InMemoryRateLimiter(requests_per_second=10)
# Model with fallback + retry + rate limiting
primary = ChatOpenAI(model="gpt-4o", max_retries=3, rate_limiter=limiter)
fallback = ChatAnthropic(model="claude-3-5-sonnet-20241022", max_retries=3)
llm = primary.with_fallbacks([fallback])
# The chain
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant. Be concise."),
("user", "{input}")
])
chain = prompt | llm | StrOutputParser()
# This chain now has: retries, fallback, caching, rate limiting, and tracing
result = chain.invoke({"input": "Explain AI orchestration"})
What's Next
AI orchestration is evolving fast. The frameworks are converging on common patterns - state graphs, tool calling, structured output - while differentiating on developer experience and ecosystem. Here's where to go from here:
- Start with one framework. LangChain + LangGraph covers 90% of use cases. Master it before exploring alternatives.
- Build incrementally. Start with a simple LCEL chain. Add RAG when you need context. Add agents when you need autonomy. Add multi-agent when a single agent can't handle the complexity.
- Invest in observability early. Set up LangSmith (or an alternative like Langfuse) from day one. You'll thank yourself when debugging production issues.
- Test with evals, not vibes. Build evaluation datasets for your specific use case. Automated evals catch regressions that manual testing misses.
- Production patterns matter more than framework choice. Retries, fallbacks, caching, rate limiting, and observability are what separate demos from production systems.
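An eval doesn't have to be elaborate to be useful: a dataset of (input, expected) pairs plus a scoring function already catches most regressions. A minimal sketch, where `run_chain` is a hypothetical stand-in for your real `chain.invoke()` call:

```python
def run_chain(question: str) -> str:
    """Stand-in for a real chain call; replace with chain.invoke(...)."""
    canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "")


EVAL_SET = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]


def run_evals(chain_fn, dataset) -> float:
    """Fraction of cases whose output contains the expected answer."""
    passed = sum(
        1 for case in dataset
        if case["expected"].lower() in chain_fn(case["input"]).lower()
    )
    return passed / len(dataset)


score = run_evals(run_chain, EVAL_SET)
print(f"Eval score: {score:.0%}")
```

Substring matching is the crudest possible scorer; for open-ended outputs you would swap in an LLM-as-judge or embedding-similarity check, but the harness shape stays the same, so you can run it in CI and fail the build when the score drops.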
Want More?
Check out our Agentic AI Workflows guide for a deep-dive into agent architectures, or explore our AI tools comparison to pick the right foundation models for your orchestration layer.