Platform
Complete AI infrastructure
Fourteen pillars that cover every layer of the stack, from model serving to enterprise compliance.
Inference
High-performance model serving with OpenAI compatibility
- OpenAI-compatible API
- 10+ open-source models (Llama, Mistral, DeepSeek, Qwen)
- Streaming & non-streaming responses
- Structured outputs with JSON Schema constrained decoding
- Vision / multimodal inputs (JPEG, PNG, GIF, WebP)
- Extended thinking with configurable token budgets
- Responses API — agentic multi-turn tool orchestration
- Conversation threads with persistent state
- Background mode — async agentic jobs with webhook notifications
- Prompt prefix caching with a 90% discount on cached tokens
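The bullets above map onto a standard OpenAI-style /chat/completions call. A minimal sketch using only the standard library; the base URL and model name are illustrative placeholders, not documented values:

```python
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder; substitute the real endpoint

def build_chat_request(prompt: str, model: str = "llama-3.1-70b-instruct",
                       stream: bool = False) -> dict:
    """Assemble an OpenAI-compatible /chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # True yields server-sent events instead of one JSON body
    }

def send(body: dict, api_key: str) -> dict:
    """POST the body to the chat completions endpoint."""
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("Summarize our Q3 report in two sentences.")
# reply = send(body, api_key="YOUR_API_KEY")
```

Because the request and response shapes follow the OpenAI convention, the same body works through the official OpenAI SDKs once the client's base URL is changed.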
RAG & Knowledge
End-to-end retrieval-augmented generation pipelines
- Knowledge Base management
- Hybrid search (semantic + BM25 keyword matching)
- Cross-encoder reranking (BGE, Cohere)
- Citations with source references and confidence scores
- Multiple chunking strategies (recursive, semantic, sentence-window)
- Data source connectors (S3, databases, URLs, Confluence, Notion)
- Embedding models (BGE Large, E5, Cohere Embed v3)
- Metadata filtering & facets
Audio APIs
Speech-to-text and text-to-speech with per-minute pricing
- Speech-to-text with Whisper Large v3 (98+ languages)
- Text-to-speech with Kokoro (natural-sounding voices)
- OpenAI-compatible /audio/transcriptions and /audio/speech endpoints
- Multiple audio formats (mp3, opus, aac, flac, wav)
- Per-minute transcription and per-character speech pricing
Image Generation
Generate images from text with state-of-the-art models
- FLUX.1 Schnell for fast, high-quality image generation
- Multiple sizes (256x256, 512x512, 1024x1024)
- Base64 or URL output formats
- OpenAI-compatible /images/generations endpoint
- Scales to zero when idle for cost efficiency
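An image request follows the OpenAI /images/generations shape. A minimal sketch; the model identifier string is a placeholder:

```python
def build_image_request(prompt: str, size: str = "1024x1024",
                        fmt: str = "b64_json", n: int = 1) -> dict:
    """OpenAI-compatible /images/generations request body."""
    assert size in {"256x256", "512x512", "1024x1024"}  # supported sizes
    return {
        "model": "flux.1-schnell",   # placeholder model id
        "prompt": prompt,
        "n": n,
        "size": size,
        "response_format": fmt,      # "b64_json" inline, or "url"
    }

req = build_image_request("a red fox in watercolor", size="512x512")
```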
Realtime API
Bidirectional WebSocket for real-time voice conversations
- Bidirectional WebSocket protocol
- Server-side voice activity detection (VAD)
- Streaming speech-to-text and text-to-speech
- OpenAI-compatible realtime protocol
- Low-latency voice conversations
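Realtime sessions are driven by JSON events exchanged over the WebSocket. A sketch of the OpenAI-style `session.update` event that turns on server-side VAD; the exact event schema here is an assumption based on that protocol:

```python
import json

def session_update(voice: str = "alloy", vad: bool = True) -> dict:
    """OpenAI-style realtime 'session.update' event enabling server-side
    voice activity detection; field names are assumptions."""
    return {
        "type": "session.update",
        "session": {
            "voice": voice,  # placeholder voice id
            "turn_detection": {"type": "server_vad"} if vad else None,
        },
    }

frame = json.dumps(session_update())  # sent as a text frame over the WebSocket
```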
Code Execution
Secure Python sandbox for data analysis and computation
- Python 3.12 with data science packages pre-installed
- gVisor-secured sandbox isolation
- Configurable timeouts per plan (up to 300s)
- Chart generation and file output
- Scales to zero when idle
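A sandboxed execution call boils down to shipping source code plus a timeout within the plan's ceiling. The endpoint shape below is hypothetical; field names are assumptions:

```python
def build_exec_request(code: str, timeout_s: int = 30) -> dict:
    """Hypothetical code-execution request body; field names are assumptions."""
    assert timeout_s <= 300  # highest per-plan ceiling
    return {
        "language": "python",        # Python 3.12 runtime
        "code": code,
        "timeout_seconds": timeout_s,
    }

job = build_exec_request("import pandas as pd; print(pd.__version__)", timeout_s=60)
```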
Intelligent Routing
Automatically route prompts to the optimal model
- Complexity-based model selection with model: "auto"
- Save up to 30% on costs with zero code changes
- Custom routing rules via console
- A/B testing across models
- Fallback chains for high availability
MCP Tool Integration
Connect external tools via the Model Context Protocol
- Managed MCP server registry
- Models can call APIs, query databases, access live data
- Standardized tool interface for all models
- Custom MCP server deployment
- Tool-use with structured function calling
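Structured function calling uses the standard OpenAI tools format, which is also how MCP-backed tools surface to the model. A sketch with a hypothetical `get_weather` tool:

```python
def weather_tool() -> dict:
    """Tool definition in the OpenAI function-calling format.
    The tool name and parameters are hypothetical."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

body = {
    "model": "llama-3.1-70b-instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": [weather_tool()],
}
```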
Embeddings & Reranking
Dedicated endpoints for vector search pipelines
- BGE Large EN v1.5 and E5 Large v2 embedding models
- Cohere Embed v3 for premium embedding quality
- BGE Reranker and Cohere Rerank v3 cross-encoders
- OpenAI-compatible /embeddings and /rerank endpoints
- Batch embedding for high-throughput ingestion
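Both endpoints take small JSON bodies: /embeddings accepts a string or a list (lists enable batch ingestion), and /rerank pairs a query with candidate documents for cross-encoder scoring. Model id strings are placeholders:

```python
def build_embeddings_request(texts: list[str],
                             model: str = "bge-large-en-v1.5") -> dict:
    """OpenAI-compatible /embeddings body; pass a list for batch ingestion."""
    return {"model": model, "input": texts}

def build_rerank_request(query: str, documents: list[str],
                         top_n: int = 3, model: str = "bge-reranker") -> dict:
    """/rerank body: the cross-encoder scores each document against the query
    and returns the top_n best matches."""
    return {"model": model, "query": query, "documents": documents, "top_n": top_n}
```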
Structured Outputs & Batches
Constrained decoding and large-scale batch processing
- JSON Schema enforcement for guaranteed valid output
- Batch API for submitting thousands of requests at reduced cost
- Automatic retries and progress tracking for batch jobs
- Type-safe responses for data extraction pipelines
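JSON Schema enforcement attaches a schema to a normal chat request via the OpenAI-style `response_format` field, and constrained decoding guarantees the reply parses against it. A data-extraction sketch with a hypothetical invoice schema:

```python
# Hypothetical target schema for an extraction pipeline.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
    "additionalProperties": False,
}

def build_structured_request(text: str) -> dict:
    """Chat body with JSON Schema constrained decoding
    (OpenAI response_format convention)."""
    return {
        "model": "llama-3.1-70b-instruct",  # placeholder model id
        "messages": [{"role": "user",
                      "content": f"Extract the invoice fields:\n{text}"}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice",
                            "schema": invoice_schema,
                            "strict": True},
        },
    }
```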
Security & Moderation
Enterprise-grade security and content safety
- Email/password, Google & GitHub OAuth authentication
- SAML 2.0 SSO (Okta, Azure AD, Google Workspace)
- SCIM user provisioning
- IP allowlisting
- API key scopes and audit logging
- Content moderation with per-org guardrail policies
- Topic deny-lists and category thresholds
- Webhook events for all async operations (14 event types)
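Webhook consumers should authenticate each delivery before acting on it. A common pattern is an HMAC-SHA256 signature over the raw payload, compared in constant time; the header name and signing scheme here are assumptions, not documented behavior:

```python
import hashlib
import hmac

def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    """Constant-time HMAC-SHA256 check of a webhook delivery.
    The hex-digest scheme is an assumption about the signing format."""
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```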
Billing & Usage
Transparent pricing with full visibility into spend
- Pay-as-you-go credit system
- Transparent per-model pricing
- Stripe-powered payments
- Usage analytics & dashboards
- Spending limits & alerts
- Thinking tokens billed at 50% of standard output token rate
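The thinking-token rule makes a request's cost easy to estimate: thinking tokens accrue at half the output rate. A sketch with illustrative placeholder rates, not published prices:

```python
def request_cost(input_tok: int, output_tok: int, thinking_tok: int,
                 in_rate: float = 0.50, out_rate: float = 1.50) -> float:
    """Estimated USD cost for one request. Rates are illustrative
    placeholders, expressed per 1M tokens. Thinking tokens bill
    at 50% of the standard output rate."""
    per_million = 1_000_000
    return (input_tok * in_rate
            + output_tok * out_rate
            + thinking_tok * out_rate * 0.5) / per_million
```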
Developer Experience
First-class tooling for every stack
- Python & Node.js SDKs
- OpenAI SDK compatible (just change base URL)
- LangChain, LlamaIndex, Haystack, DSPy, CrewAI integrations
- Prompt playground
- API explorer
Enterprise
Built for teams with demanding requirements
- SAML SSO + SCIM provisioning
- Multi-tenant isolation
- Custom rate limits
- Dedicated GPU clusters
- VPC peering & private endpoints
- Custom model fine-tuning with LoRA
- Dedicated account manager & SLA up to 99.99%
Compare Plans
Feature comparison
See exactly what is included in every plan.
| Feature | Lite | Developer | Pro | Enterprise |
|---|---|---|---|---|
| Open-source models | Community | All models | All models | All + custom |
| Rate limit (RPM) | 500 | 1,000 | 3,000 | 10,000+ |
| Knowledge Bases | 5 | 25 | 100 | Unlimited |
| Vector storage | 1 GB | 25 GB | 100 GB | Unlimited |
| Document storage | 5 GB | 100 GB | 500 GB | Unlimited |
| Streaming & tool calling | | | | |
| Embeddings (BGE, E5, Cohere) | | | | |
| Reranking (BGE, Cohere) | | | | |
| Prompt caching | | | | |
| Structured Outputs (JSON Schema) | | | | |
| Extended Thinking / Reasoning | | | | |
| Vision / Multimodal Inputs | | | | |
| Audio: Speech-to-Text (Whisper) | | | | |
| Audio: Text-to-Speech (Kokoro) | | | | |
| Image Generation (FLUX.1) | | | | |
| Realtime API (WebSocket) | | | | |
| Code Execution (Python sandbox) | 30s max | 120s max | 180s max | 300s max |
| Intelligent Routing (model: auto) | | | | |
| MCP Tool Integration | | | | |
| Batch Processing API | | | | |
| Content Moderation & Guardrails | | | | |
| Responses API (Agentic) | | | | |
| Webhook Events | | | | |
| Google & GitHub OAuth | | | | |
| SSO authentication | | | | |
| SAML SSO / SCIM | | | | |
| IP allowlisting | | | | |
| Audit logging | | | | |
| Usage analytics | Basic | Full | Full | Full + export |
| Usage discount | 0% | 5% | 10% | Custom |
| Spending limits & alerts | | | | |
| Support | Community | Email + Discord | Priority email | Dedicated + SLA |
| Uptime SLA | — | 99.9% | 99.95% | 99.99% |
| Fine-tuning | | | | |
| VPC peering & private endpoints | | | | |
| Dedicated GPU clusters | | | | |
