u/JaeSwift

Crypto RPC for Agents | Give your AI agent inference and on-chain access

Venice gives your agent both inference (230+ models) and blockchain access (10 EVM chains plus Starknet) through a single credential. Your agent can think, sign, and send transactions without juggling separate accounts for inference and RPC providers.

One credential, two superpowers
A single API key (or wallet) for both LLM inference and JSON-RPC calls.

11 chains supported
Ethereum, Base, Arbitrum, Optimism, Polygon, Linea, Avalanche, BSC, Blast, zkSync Era, and Starknet (mainnet plus testnets).

Stake VVV for headless funding
Stake VVV on Base to earn daily DIEM, the only fully headless funding path for a minted API key. USD and crypto top-ups are also available through the dashboard.

Keyless auth via x402. Agents can authenticate with a wallet signature and pay in USDC on Base.


Why Venice for on-chain agents?

Capability What your agent gets
Inference 230+ text, image, video, audio, and embedding models through one OpenAI-compatible endpoint
Crypto RPC JSON-RPC 2.0 proxy to 10 EVM chains plus Starknet (mainnet and testnets)
Authentication Standard API key or x402 wallet auth (no Venice account required)
Funding Autonomous: VVV staking for daily DIEM. Browser: USD or crypto top-ups via the dashboard
Batching Up to 100 JSON-RPC calls per request, multi-chain in parallel
Idempotency Safe retries with Idempotency-Key header

Authentication
Pick the auth method that matches how your agent runs.

Method Best for How it works
API key Server-side agents, fixed deployments Authorization: Bearer <key> header. Get a key at venice.ai/settings/api.
x402 wallet Autonomous, crypto-native, or short-lived agents Wallet signs a SIWE message, pays per request in USDC on Base. No Venice account needed. See the x402 guide.

^(Both methods share the same rate limits and billing in Venice credits.)

>Truly autonomous agents can mint their own API key by staking VVV on Base. See Autonomous Agent API Key Creation.


Crypto RPC quickstart

Send any JSON-RPC 2.0 method to POST /crypto/rpc/{network}.

curl https://api.venice.ai/api/v1/crypto/rpc/ethereum-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "jsonrpc": "2.0", "method": "eth_chainId", "params": [], "id": 1 }'

Response:

{ "jsonrpc": "2.0", "id": 1, "result": "0x1" }

Response headers include X-Venice-RPC-Credits (credits charged), X-Venice-RPC-Cost-USD (dollar cost), and X-Request-ID (correlation ID).

Supported networks

Family Mainnet Testnets
Ethereum ethereum-mainnet ethereum-sepolia, ethereum-holesky
Base base-mainnet base-sepolia
Arbitrum arbitrum-mainnet arbitrum-sepolia
Optimism optimism-mainnet optimism-sepolia
Polygon polygon-mainnet polygon-amoy
Linea linea-mainnet linea-sepolia
Avalanche C-Chain avalanche-mainnet avalanche-fuji
BNB Smart Chain bsc-mainnet bsc-testnet
Blast blast-mainnet blast-sepolia
zkSync Era zksync-mainnet zksync-sepolia
Starknet starknet-mainnet starknet-sepolia

Use GET /crypto/rpc/networks for the live, authoritative list.

Method tiers

Tier Multiplier Examples
Standard 1x eth_call, eth_getBalance, eth_blockNumber, eth_sendRawTransaction, eth_getLogs, eth_getTransactionReceipt, eth_estimateGas
Advanced 2x trace_block, trace_call, trace_transaction, debug_traceCall, debug_traceTransaction
Large 4x trace_replayBlockTransactions, trace_replayTransaction, txpool_content

^(Full list and pricing detail in the) ^(Crypto RPC API reference)^(.)

Agent recipes
Common patterns for AI agents that need to read and write on-chain.

Read a wallet’s native balance

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "eth_getBalance",
    "params": ["0xYourWalletAddress", "latest"],
    "id": 1
  }'

Read ERC-20 token balance

Call the balanceOf(address) selector with eth_call. The data field is the 4-byte selector (0x70a08231) followed by the wallet address left-padded to 32 bytes. Easiest to let a library encode it:

import { encodeFunctionData, parseAbi } from 'viem'

const data = encodeFunctionData({
  abi: parseAbi(['function balanceOf(address) view returns (uint256)']),
  args: ['0xWalletAddress'],
})

const response = await fetch('https://api.venice.ai/api/v1/crypto/rpc/base-mainnet', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    jsonrpc: '2.0',
    method: 'eth_call',
    params: [{ to: '0xacfE6019Ed1A7Dc6f7B508C02d1b04ec88cC21bf', data }, 'latest'],
    id: 1,
  }),
})

The contract address above is VVV on Base. Swap it for any ERC-20 contract.
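
The response’s result field is the raw 32-byte return value. A minimal decoding sketch with viem, assuming the fetch above succeeded and that the token uses 18 decimals (true for VVV, but check for other tokens):

import { decodeFunctionResult, formatUnits, parseAbi } from 'viem'

const { result } = await response.json()

// balanceOf returns a single uint256; decode it from the raw hex
const balance = decodeFunctionResult({
  abi: parseAbi(['function balanceOf(address) view returns (uint256)']),
  functionName: 'balanceOf',
  data: result,
})

// Assumes 18 decimals (the ERC-20 default, used by VVV)
console.log(`Balance: ${formatUnits(balance, 18)}`)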

Send a signed transaction (full lifecycle)
Venice never holds your private keys. The agent gathers tx parameters via RPC reads, signs locally with a library like viem or ethers, then relays the raw hex through Venice.

1. Get the next nonce

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_getTransactionCount","params":["0xAgentWallet","pending"],"id":1}'

Use "pending" so back-to-back sends don’t collide.

2. Get gas price

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_gasPrice","params":[],"id":1}'

For EIP-1559 chains, prefer eth_feeHistory to compute maxFeePerGas and maxPriorityFeePerGas.
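
A minimal sketch of that computation, assuming a simple heuristic (50th-percentile priority fee over the last 5 blocks, with a 2x buffer on the latest base fee); tune the window, percentile, and buffer for your own latency and cost tolerance:

const rpc = 'https://api.venice.ai/api/v1/crypto/rpc/base-mainnet'
const headers = {
  Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
  'Content-Type': 'application/json',
}

// eth_feeHistory: last 5 blocks, 50th-percentile priority fee per block
const feeHistory = await fetch(rpc, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    jsonrpc: '2.0',
    method: 'eth_feeHistory',
    params: ['0x5', 'latest', [50]],
    id: 1,
  }),
}).then(r => r.json())

const baseFees = feeHistory.result.baseFeePerGas.map(BigInt)
const tips = feeHistory.result.reward.map((r: string[]) => BigInt(r[0]))

// Middle tip from the sampled blocks, plus a buffered base fee
const maxPriorityFeePerGas = tips.sort((a, b) => (a < b ? -1 : 1))[Math.floor(tips.length / 2)]
const maxFeePerGas = baseFees[baseFees.length - 1] * 2n + maxPriorityFeePerGas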

3. Estimate gas

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_estimateGas","params":[{"from":"0xAgentWallet","to":"0xRecipient","value":"0x0","data":"0x..."}],"id":1}'

4. Sign locally

import { privateKeyToAccount } from 'viem/accounts'
import { base } from 'viem/chains'

const account = privateKeyToAccount(process.env.AGENT_PRIVATE_KEY)

const signed = await account.signTransaction({
  chainId: base.id,
  nonce,                  // from step 1
  gas,                    // from step 3
  maxFeePerGas,           // from step 2 (fee history)
  maxPriorityFeePerGas,   // from step 2 (fee history)
  to: '0xRecipient',
  value: 0n,
  data: '0x...',
})

5. Submit through Venice

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Idempotency-Key: agent-tx-<id>" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_sendRawTransaction","params":["0xSignedHex"],"id":1}'

Always set Idempotency-Key on relays so a network blip can’t double-broadcast.

6. Poll for receipt

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_getTransactionReceipt","params":["0xTxHash"],"id":1}'

Poll every few seconds until result is non-null. Check result.status ("0x1" = success).
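
A minimal polling helper (the 3-second interval and 2-minute timeout are arbitrary defaults; txHash comes from the eth_sendRawTransaction result in step 5):

async function waitForReceipt(network: string, txHash: string, timeoutMs = 120_000) {
  const url = `https://api.venice.ai/api/v1/crypto/rpc/${network}`
  const deadline = Date.now() + timeoutMs

  while (Date.now() < deadline) {
    const { result } = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        jsonrpc: '2.0',
        method: 'eth_getTransactionReceipt',
        params: [txHash],
        id: 1,
      }),
    }).then(r => r.json())

    // Receipt is null until the transaction is mined
    if (result) {
      if (result.status !== '0x1') throw new Error(`Transaction reverted: ${txHash}`)
      return result
    }
    await new Promise(resolve => setTimeout(resolve, 3_000))
  }
  throw new Error(`Timed out waiting for receipt: ${txHash}`)
}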

>Every eth_sendRawTransaction call is logged server-side with the tx hash, network, request ID, and calling user ID. The signed payload itself is not retained. This audit trail exists so compromised keys used for illicit relays can be traced back to the responsible account.

Batch multiple calls (multi-chain portfolio check)

Send up to 100 JSON-RPC objects in one request. Each is validated and billed independently.

curl https://api.venice.ai/api/v1/crypto/rpc/ethereum-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    { "jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1 },
    { "jsonrpc": "2.0", "method": "eth_getBalance", "params": ["0xWallet", "latest"], "id": 2 },
    { "jsonrpc": "2.0", "method": "eth_gasPrice", "params": [], "id": 3 }
  ]'

For multi-chain reads (one call per chain), issue parallel requests to different {network} endpoints.
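
For example, a minimal parallel native-balance check across a few of the supported networks (the wallet address and network list are placeholders):

const networks = ['ethereum-mainnet', 'base-mainnet', 'arbitrum-mainnet']
const wallet = '0xYourWalletAddress'

const balances = await Promise.all(
  networks.map(async network => {
    const res = await fetch(`https://api.venice.ai/api/v1/crypto/rpc/${network}`, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        jsonrpc: '2.0',
        method: 'eth_getBalance',
        params: [wallet, 'latest'],
        id: 1,
      }),
    }).then(r => r.json())
    return { network, wei: BigInt(res.result) }
  }),
)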

Safe retries with idempotency
Set the Idempotency-Key header to any string matching [A-Za-z0-9_-]{1,255}. Venice caches the response for 24 hours keyed on (user, key). Replays return the cached result with Idempotent-Replayed: true and charge nothing.

curl https://api.venice.ai/api/v1/crypto/rpc/base-mainnet \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Idempotency-Key: agent-tx-2026-04-21-001" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "eth_sendRawTransaction",
    "params": ["0xSignedRawTxHex"],
    "id": 1
  }'

This is critical for transaction relays where a network blip could otherwise cause your agent to broadcast the same tx twice.

Funding the agent’s API key
Once the agent has a Venice API key, it needs spendable balance on the underlying account before paid endpoints will accept the key. There are two ways to put balance there:

Path Autonomous? How it works
DIEM from VVV staking Yes Stake VVV in the Venice Staking Smart Contract on Base. The wallet’s daily DIEM allocation is proportional to its share of the staking pool. The account needs at least 0.1 DIEM accrued before any DIEM is spendable. DIEM refreshes at 00:00 UTC. To grow daily spend, stake more VVV.
USD or crypto top-up via the dashboard No (browser) Sign into venice.ai with the same wallet (Sign-In-With-Ethereum), then add credits in Settings > API. Both Stripe (card) and Coinbase (crypto) live behind that page and require a browser. Credits never expire.

For an agent that runs unattended, DIEM via VVV staking is the only fully headless funding path for a minted API key today. If the agent’s daily spend exceeds its DIEM allocation, the realistic options are: stake more VVV, or have an operator sign in and top up in USD or crypto.

Autonomous VVV staking and key generation
A truly autonomous agent can manage its own VVV wallet on Base, stake it, and mint its own Venice API key with no human in the loop. The full flow:

1. Acquire VVV and ETH for gas

Send VVV to the agent’s wallet (or have the agent swap on Aerodrome or Uniswap), plus a small amount of ETH on Base for the two staking transactions.

2. Stake VVV

Call approve on the VVV token, granting the staking contract an allowance, then call stake(amount) on 0x321b7ff75154472B18EDb199033fF4D116F340Ff. The wallet’s sVVV balance updates atomically with the stake.
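
A minimal viem sketch of this step. The approve(spender, amount) call is the standard ERC-20 approval; the stake(uint256) signature is taken from the description above, but verify the exact staking ABI on BaseScan before relying on it:

import { createWalletClient, http, parseAbi, parseEther } from 'viem'
import { privateKeyToAccount } from 'viem/accounts'
import { base } from 'viem/chains'

const account = privateKeyToAccount(process.env.AGENT_PRIVATE_KEY as `0x${string}`)
const client = createWalletClient({ account, chain: base, transport: http() })

const VVV = '0xacfE6019Ed1A7Dc6f7B508C02d1b04ec88cC21bf'
const STAKING = '0x321b7ff75154472B18EDb199033fF4D116F340Ff'
const amount = parseEther('1') // 1 staked VVV is enough to mint a key

// 1. Approve the staking contract to move the agent's VVV
await client.writeContract({
  address: VVV,
  abi: parseAbi(['function approve(address spender, uint256 amount) returns (bool)']),
  functionName: 'approve',
  args: [STAKING, amount],
})

// 2. Stake (assumed signature stake(uint256); check the deployed contract)
await client.writeContract({
  address: STAKING,
  abi: parseAbi(['function stake(uint256 amount)']),
  functionName: 'stake',
  args: [amount],
})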

3. Mint an API key

GET /api/v1/api_keys/generate_web3_key returns a JWT that expires 15 minutes after issuance. Sign the raw token with the staking wallet, then POST the address, signature, and token back. Venice returns an API key bound to the user account derived from that wallet.

Minting only requires a non-zero sVVV balance, so 1 staked VVV is enough to receive a key. Spending with the key is a separate question, governed by the funding table above.
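
A rough sketch of the minting exchange, reusing the viem account from the staking sketch above. The request and response field names (token, address, signature, apiKey) and the POST path are assumptions here; the linked walkthrough has the authoritative shapes:

const API = 'https://api.venice.ai/api/v1'

// 1. Fetch the short-lived JWT (expires 15 minutes after issuance)
const { token } = await fetch(`${API}/api_keys/generate_web3_key`).then(r => r.json())

// 2. Sign the raw token with the staking wallet
const signature = await account.signMessage({ message: token })

// 3. POST the address, signature, and token back (field names and path assumed)
const { apiKey } = await fetch(`${API}/api_keys/generate_web3_key`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ address: account.address, signature, token }),
}).then(r => r.json())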

See Autonomous Agent API Key Creation for the complete walkthrough with code and the full error reference.

x402 wallet auth in 30 seconds
If your agent already has a Base wallet, skip the API key entirely. The venice-x402-client SDK handles SIWE signing, top-ups, and balance tracking.

npm install venice-x402-client



import { VeniceClient } from 'venice-x402-client'

const venice = new VeniceClient(process.env.WALLET_KEY)

await venice.topUp(10) // skip if the wallet already has balance

const response = await venice.chat({
  model: 'kimi-k2-6',
  messages: [{ role: 'user', content: 'What is the latest block on Base?' }]
})

The same wallet auth works against /crypto/rpc/{network} for blockchain reads and writes. Full protocol details in the x402 guide.

Pricing
Crypto RPC is billed in Venice credits. Each response includes X-Venice-RPC-Credits (credits charged) and X-Venice-RPC-Cost-USD (dollar cost) so your agent can track spend per request.
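
A small sketch of per-request spend tracking from those headers (the running totals are illustrative bookkeeping, not part of the API):

let totalCredits = 0
let totalUsd = 0

async function rpcWithSpendTracking(network: string, payload: object) {
  const res = await fetch(`https://api.venice.ai/api/v1/crypto/rpc/${network}`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  })

  // Spend headers are set per response; fall back to 0 if absent
  totalCredits += Number(res.headers.get('X-Venice-RPC-Credits') ?? 0)
  totalUsd += Number(res.headers.get('X-Venice-RPC-Cost-USD') ?? 0)

  return res.json()
}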

Base credits per chain

Base credits Chains
20 Ethereum, Base, Optimism, Arbitrum, Polygon, Linea, Avalanche, BSC, Blast, Starknet
30 zkSync Era

Cost examples

Observed pricing for standard, advanced, and large method tiers:

Call Credits USD cost
eth_call on Ethereum (20 × 1x) 20 ~$0.0000140
trace_transaction on Ethereum (20 × 2x) 40 ~$0.0000280
trace_replayTransaction on Ethereum (20 × 4x) 80 ~$0.0000560
eth_call on zkSync (30 × 1x) 30 ~$0.0000210

Always trust the X-Venice-RPC-Cost-USD response header for the authoritative cost. Errored items in batch requests are billed at a flat 5 credits each.

Rate limits

Tier Requests per minute
Standard 100
Staff 1,000

When exceeded, the endpoint returns 429 with standard X-RateLimit-* response headers.

Error handling

Common HTTP responses your agent should handle:

Status Meaning What to do
400 Unsupported or unmapped JSON-RPC method, or malformed batch Verify the method against the allowlist. The error body names the offending method.
400 Replay of an Idempotency-Key with a different body Use a fresh key for distinct requests.
402 No auth header at all (response body includes authOptions listing both supported auth paths), or out of credits with a valid auth header If no auth: attach Authorization: Bearer ... or the x402 X-Sign-In-With-X header. If out of credits: with a Bearer key, fund the account (DIEM, USD, or dashboard top-up); with x402 auth, call POST /api/v1/x402/top-up directly.
429 Rate limit hit (100 req/min standard, 1,000 req/min staff) Honor X-RateLimit-Reset and back off. Batch up to 100 calls per request to amortize the limit.
5xx Upstream RPC node hiccup Retry with the same Idempotency-Key to avoid double-charging.

Per-item batch errors (e.g. invalid params on one of N calls) come back inside a 200 OK response with a JSON-RPC error field on the offending item. Those items are billed at a flat 5 credits each.
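
A retry sketch covering the 429 and 5xx rows above. The backoff constants are arbitrary, and X-RateLimit-Reset is assumed to be a Unix timestamp; reusing the same Idempotency-Key is what makes the 5xx retry safe:

async function rpcWithRetry(network: string, payload: object, idempotencyKey: string, attempts = 5) {
  const url = `https://api.venice.ai/api/v1/crypto/rpc/${network}`

  for (let attempt = 0; attempt < attempts; attempt++) {
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
        'Idempotency-Key': idempotencyKey, // same key on every attempt
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(payload),
    })

    if (res.status === 429) {
      // Wait for the advertised reset when present, otherwise back off exponentially
      const reset = Number(res.headers.get('X-RateLimit-Reset') ?? 0)
      const waitMs = reset > 0 ? Math.max(reset * 1000 - Date.now(), 1_000) : 2 ** attempt * 1_000
      await new Promise(resolve => setTimeout(resolve, waitMs))
      continue
    }
    if (res.status >= 500) {
      await new Promise(resolve => setTimeout(resolve, 2 ** attempt * 1_000))
      continue
    }
    return res.json() // success, or a 4xx the caller should inspect
  }
  throw new Error('RPC request failed after retries')
}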

Not supported
These categories of methods are intentionally rejected:

  • WebSocket-only (eth_subscribe, eth_unsubscribe): the proxy is HTTP-only. Poll instead.
  • Stateful filters (eth_newFilter, eth_getFilterChanges, etc.): filter state is pinned to a single backend and breaks on a load-balanced proxy. Use eth_getLogs instead (see the sketch after this list).
  • Key-holding methods (eth_sign, eth_accounts, eth_mining): hosted providers don’t hold user keys. Sign client-side and submit via eth_sendRawTransaction.
  • Unmapped methods: anything not allowlisted returns 400. Contact support to request additions.
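
A minimal eth_getLogs sketch standing in for a Transfer-event filter. The topic is the standard ERC-20 Transfer event signature hash; the contract address and block range are placeholders:

const logs = await fetch('https://api.venice.ai/api/v1/crypto/rpc/base-mainnet', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.VENICE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    jsonrpc: '2.0',
    method: 'eth_getLogs',
    params: [{
      address: '0xTokenContract',
      fromBlock: '0x1500000', // poll a bounded range instead of holding filter state
      toBlock: 'latest',
      topics: ['0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'],
    }],
    id: 1,
  }),
}).then(r => r.json())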

For the full method list, pricing, response headers, and all other resources related to this guide, see: https://docs.venice.ai/guides/integrations/crypto-rpc-agents#resources

u/JaeSwift — 12 hours ago

Venice.ai Changelog - April 21 - May 5, 2026

https://preview.redd.it/199udpdwbqzg1.png?width=1376&format=png&auto=webp&s=8dc0dd741b3e36de5c647330b6540d2f1be1c912

Grok 4.3 on Venice

xAI's most intelligent reasoning model is now generally available on Venice. 1M-token context window, function calling, structured outputs, and multimodal support.

Voice Mode

Realtime voice conversations are now live on Venice. Talk to any model with memory sync, chat persistence, waveform visualization, push-to-talk input, and language switching. Now available on web, iOS and Android.

GPT-5.5 on Venice

OpenAI's latest-generation model family is now available on Venice. GPT-5.5 delivers improved reasoning, stronger instruction-following, and better multi-turn conversation across the board. GPT-5.5 Pro adds extended reasoning depth and a larger context window for demanding workloads. Both models are available now.

Kling 4K Video

Kuaishou's Kling V3 and O3 video models now generate native 4K output on Venice. Available in text-to-video, image-to-video, and reference-to-video modes, Kling 4K delivers sharper detail, better motion coherence, and cinematic-quality output at four times the resolution of previous generations.

Programmatic Burn Increase

Venice has increased the programmatic burns for new subscriptions: $2 in VVV for Pro, $5 in VVV for Pro+, and $10 in VVV for Max. Every new subscription triggers a buy-and-burn at these updated amounts.

New Models

The following models have been added to Venice:

Text Models

  • Grok 4.3 — xAI's most intelligent reasoning model with 1M-token context window, function calling, structured outputs, and multimodal support. Available to all users.
  • GPT-5.5 — OpenAI's latest-generation text model with improved reasoning, instruction-following, and multi-turn conversation. Available to all users.
  • GPT-5.5 Pro — OpenAI's higher-capability variant of GPT-5.5 with extended reasoning depth and larger context window. Pro users only.
  • DeepSeek V4 Pro — DeepSeek's full-size V4 reasoning model with extended context and strong performance on coding, math, and multi-step tasks. Available to all users.
  • DeepSeek V4 Flash — Lighter, faster variant of DeepSeek V4 optimized for speed and lower latency while retaining strong general-purpose performance. Available to all users.
  • Qwen 3.6 27B — Text model from Alibaba Cloud with 27 billion parameters, offering a balance of capability and efficiency with 128K context window. Available to all users.
  • GLM 5.1 E2EE — Zhipu AI's GLM 5.1 running with end-to-end encryption in a Trusted Execution Environment. Available to Pro users at no additional credit cost.

Image & Video Models

  • Kling V3 4K — Kuaishou text-to-video at native 4K resolution. Available to all users.
  • Kling V3 4K R2V — Kuaishou reference-to-video at native 4K resolution. Available to all users.
  • Kling O3 4K — Kuaishou O3-series text-to-video at native 4K resolution. Available to all users.
  • Kling O3 4K I2V — Kuaishou O3-series image-to-video at native 4K resolution. Available to all users.
  • Kling O3 4K R2V — Kuaishou O3-series reference-to-video at native 4K resolution. Available to all users.
  • HappyHorse 1.0 — Alibaba's text-to-video generation model. Available to all users.
  • HappyHorse 1.0 I2V — Image-to-video generation from a source image. Available to all users.
  • HappyHorse 1.0 Reference — Video generation guided by a reference image for style and content. Available to all users.
  • HappyHorse 1.0 Edit — Video editing model for modifying and transforming existing video. Available to all users.
  • Wan 2.7 Pro Edit — Alibaba DashScope image editing model for prompt-driven edits to existing images. Available to all users.

App

Improvements

  • Model Explorer Redesign — Refreshed layout for the Model Explorer with improved navigation and filtering.
  • Recommended Model Sort — New "Recommended" sort option in the model selector, prioritizing recently used models.
  • Model Details Modal — Model details can now be opened directly via URL in a dedicated modal.
  • Model Explorer Switcher — New entry point in the model switcher to navigate directly to the Model Explorer.
  • Prompt Enhancement Context — The prompt enhancement wand now incorporates conversation context when rewriting prompts.
  • Video Auto-Compression — Oversized videos are automatically compressed client-side before upload.
  • Per-Class PPU Toggles — Pay-per-use confirmation can now be toggled independently for each model class in chat.
  • Batch Delete Warning — Batch chat delete confirmation now warns that chats will be removed from other devices too.
  • Select All in Chat Delete — Added "Select All" option to the chat sidebar delete menu.
  • Image Auto-Downsize on Share — Images larger than 25 MB are automatically downsized before sharing.
  • Adaptive Thinking Always On — Removed the adaptive thinking toggle. Adaptive thinking is now always enabled.
  • Burn Type Tooltips — Tooltips now vary by burn type, with "Bought" label shown for discretionary burns.
  • China Server Location Flag — China flag icon now displayed for CN server locations in model details.
  • Sidebar Cleanup — Removed Help & Feedback button from the sidebar app menu.
  • PPU Confirmation Popup — Confirmation popup now shown when a pay-per-use model is routed.
  • Tool Call Loading Indicator — A loading spinner now appears in agentic chat while waiting for the next tool to execute.
  • Unified Chat History — All v1 and v2 chat history now appears in a single combined list in the sidebar.
  • Rate Limit Banner — A banner now appears in the chat input area when you've hit a rate limit.
  • Time Sent in Info Panels — Text, image, and video info panels now display a "Time Sent" row.
  • Cost Management Charts — Charts on the cost management dashboard now include today's spending data.
  • Wide Screen Layout — Improved 2-column grid layout on wide screens for better use of available space.
  • Today's Spend Card — New summary card on the cost management dashboard showing today's total spend.
  • Chat Performance — Conversation window now uses lazy rendering for off-screen messages, reducing lag in long conversations.

Wallet and Payments

  • Insufficient Credits Banner — Low credit warnings now appear as a dismissible banner above the input field instead of blocking interaction.
  • x402 Wallet View — Added a dedicated wallet view and admin top-up panel on the user page for x402 balances.
  • Voice Conversation Billing — Audio duration is now tracked per voice conversation session for accurate credit billing.
  • Video Credit Retry — Video generation credits are now automatically retried when an initial charge amount fails.

Mobile App

  • Android Voice Mode — Voice mode is now available on Android, with a prompt to update to the latest app version.
  • Uncensored Model Badges — Video model selectors now display an "Uncensored" badge where applicable.
  • Wallet Connect on Sign-In — Crypto wallet connection is now available on the sign-in and sign-up screens.
  • Pay-Per-Use in Chat — Pay-per-use purchase dialog added to the chat screen.
  • Pay-Per-Use Confirmation — Added a confirmation step before completing pay-per-use purchases.
  • iOS Native Chat Streaming — Chat responses now stream using native iOS processing.
  • Android Native Chat Streaming — Chat responses now stream using native Android processing.
  • Background Chat Sync — Chat responses that streamed while the app was backgrounded sync upon returning to the foreground.
  • Tablet Image Modal — Image detail modal now uses a tablet-optimized layout.
  • Tablet Dialogs — Dialogs now adapt to tablet screen sizes.
  • Tablet Settings Layout — Settings screens support split-screen and tablet-optimized layouts.
  • Tablet Modal Screens — Modal presentation screens now adapt to tablet screen sizes.
  • Dynamic Image Sizing — Images now resize dynamically based on device orientation.
  • Settings Navigation — Fixed navigation behavior and renamed settings screens.
  • Rate Limit Display — Updated rate limit information in settings.
  • Image & Video Info Sizing — Fixed sizing on image and video detail screens.
  • Privacy Warning Layout — Improved button positioning on the privacy warning dialog.
  • Conversation Replay Fix — Fixed a bug where already-read responses would replay when re-entering a conversation.
  • Android Chat Reliability — Fixed chat dropping or failing during request timeouts and mid-stream disconnects on Android.
  • iOS Background Image Generation — Fixed image generation failing when the app is in the background on iOS.
  • Android Background Image Generation — Image generation now continues running when the app is in the background on Android.
  • Text File Chat Sharing — Restored the ability to share chat conversations as text files.
  • Image Loading Indicator — Progress border on the image loader now waits briefly before appearing to avoid flicker on fast loads.
  • Image Error Display — Image generation errors now appear inline within chat messages.
  • Pro Upgrade Prompt — Restored the Pro upgrade button in the app header.
  • Default Playback Speed — Changed the default text-to-speech playback speed to 1.2x.
  • Auto Mode Image Editing — Auto mode now supports editing images referenced in the chat conversation.

API

  • Venice Skills GitHub Repository — Official veniceai/skills repository now live on GitHub with example skills covering the full Venice API surface.
  • Voice Cloning API — New POST /v1/audio/voices endpoint for MiniMax-based voice cloning.
  • OpenAI-Compatible File Inputs — Chat completions endpoint now accepts file inputs using the OpenAI-compatible format.
  • Model Overloaded Status Code — Model overloaded errors now return HTTP 429 instead of 503.
  • max_tokens Strict Cap on Reasoning Models — On reasoning-capable models, max_tokens is now a strict cap on total completion tokens (visible output + reasoning), restoring Venice's prior behavior across the model fleet. max_completion_tokens is accepted as an equivalent alias and takes precedence if both are sent.
  • API File Inputs GA — File input support in the API is now generally available, no longer in preview.
  • Context Length in /v1/models — New context_length field added to each model object in /v1/models responses.
  • Free User Rate Limit CTA — Free users now see a call-to-action prompt when they hit rate limits.
  • Voice Rate Limit Headers — Voice agent responses now report the current rate limit and reset time to connected clients.
  • Qwen Image Deprecation — The qwen-image model has been deprecated and removed from both the app and the API.
  • Image Edit Resolution Parameter — New resolution parameter available on the image edit and multi-edit API endpoints.
  • Voice Mode Quota — The API now returns the caller's remaining voice mode quota in responses.
  • Disabled API Tier — Added a "Disabled" API consumption tier that blocks API access for the account.
  • Chatterbox HD on /models — Chatterbox HD voice cloning model is now listed and documented on the /models endpoint.
  • Per-Model Daily Costs — The Activity API now returns daily cost breakdowns per model.
  • Hermes Agent Integration — Official Venice integration guide for Hermes Agent, the open-source self-hosted AI agent by Nous Research. Point Hermes at the Venice API for access to 230+ models across text, image, video, audio, and embeddings with persistent memory and autonomous skill creation.

Token

  • Programmatic Burn Increase — Venice increased the programmatic VVV burn for new subscriptions: $2 for Pro, $5 for Pro+, and $10 for Max. Every new subscription now triggers a larger automatic token burn.
  • Emissions Reduction — Venice completed the first of three planned emissions reductions for VVV, reducing the rate of new token issuance from 6M/yr to 5M/yr. Additional reductions planned in June and July.

Model Deprecations

  • Kimi K2 Thinking — Retired. Traffic routed to Kimi K2.5 via alias. Existing API requests using kimi-k2-thinking now resolve to kimi-k2-5
  • Qwen3 Coder 480B — Deprecated April 30, fully retired May 4. Traffic routed to Qwen3 Coder 480B Turbo. The non-turbo variant is no longer visible in API or app
  • Venice Uncensored 1.1 — Retired. All traffic routed to Venice Uncensored 1.2. API requests using venice-uncensored transparently resolve to 1.2
  • HiDream — Deprecation date extended to May 7, 2026 (from May 1). Email sent to affected API users
  • NEAR AI GLM 5.0 (E2EE) — Retired. All traffic routed to GLM 5.1 (E2EE)

Fixes and Improvements

  • Improved inpainting progress animation to reflect actual model processing time
  • Fixed app menu being clipped in landscape mode on iPad Safari
  • Updated execution time display to show milliseconds
  • Fixed gallery header action buttons being clipped on narrow viewports
  • Fixed thinking indicator disappearing during reasoning-only streaming
  • Removed incomplete trailing bucket from Per Period volume chart
  • Updated PPU model acknowledgment to trigger once per account instead of per conversation
  • Fixed model search returning unrelated results via subsequence matches on description and use case
  • Updated PPU acknowledgment to trigger once per conversation for every PPU modality
  • Condensed the x402 wallet balance table from 6 columns to 3
  • Removed the automatic greeting sent when opening a voice websocket connection
  • Fixed inpaint auto mode behavior after a recent regression
  • Improved Hunyuan 3D results to render GLB and OBJ mesh outputs directly in the viewer
  • Fixed rate limiting not being correctly applied to background removal and upscale for free-tier users
  • Improved error alert positioning and added a retry button for failed messages
  • Fixed incorrect provider names displayed in the model explorer
  • Fixed incorrect label displayed for vision models
  • Improved agentic mode loading indicator with an animated gradient border
  • Fixed audio crackling caused by inconsistent sample rate
  • Fixed Max button rounding instead of preserving full numerical precision
  • Fixed auto-enhance preference not being respected during image generation
  • Updated copy on the Pro upgrade call-to-action
  • Improved Model Selector layout by pinning the View All Models button to the bottom of the dropdown
  • Fixed aspect ratio selector appearing during single-image edits with Grok
  • Fixed moderate post modal closing when the context menu is dismissed
  • Improved reordered items in the user dropdown menu
  • Fixed arrow key navigation in image zoom following incorrect left/right order
  • Fixed credit balance not updating immediately after completing a chat request
  • Improved rendering performance for long conversations
  • Fixed Spotlight Search not respecting the top safe-area inset on PWA
  • Restored Lustify v7 model availability after prior deprecation
  • Fixed missing API keys silently returning empty results instead of an error
  • Fixed an error occurring when quoting video content in conversations
  • Improved image search results with lightbox preview, context menu support, and better error handling
  • Fixed chat message queue issues that could cause messages to be processed incorrectly
  • Improved context window handling with more accurate token counting, cost display tooltips, and smarter message compaction
  • Fixed interactions not responding correctly in the Model Explorer
  • Fixed temperature warning displaying at an incorrect baseline threshold
  • Fixed inability to send messages containing only an attachment without text
  • Fixed errors when using Grok 4.1 Fast with characters
u/JaeSwift — 6 days ago

Venice.ai partners with StrikeRobot.ai

Venice has partnered with StrikeRobot.ai to become the primary inference API backend for their robotics products. This is Venice's first major step into robotics.

https://i.redd.it/hvpxwz2trmzg1.gif

Venice as the VLM reasoning engine inside SR Agentic:

  • Handles vision-language understanding, reasoning across complex environments, and natural-language reporting
  • Uses Venice's OpenAI-compatible API
  • Fast edge loop stays on-device (sub-200ms)
  • Paid clients get private inference and Base-settled audit logs

Venice as the inference API powering SR Platform:

  • Drives both Text-to-CAD and Image-to-CAD environment generation
  • Describe an environment or drop in a reference image - Venice handles the inference that feeds the 3D pipeline into Isaac Lab and MuJoCo

SR Agentic is B2B, but SR Platform v2 is being built for public use on Base. They're integrating Microsoft's TRELLIS.2 (4B-parameter image-to-3D model) directly into the asset pipeline.

Holders of $SR and $VVV get:

  • Free access to co-training with the team
  • Eligibility for an $SR reward pool for quality contributions
  • A way to help finetune the framework with real-world deviation/anomaly data

SR Platform waitlist is open at strikerobot.ai/sr-platform

u/JaeSwift — 7 days ago

Using the agentic chat on Venice, we can do everything on the platform without the need for any external LLM or agent subscriptions.

https://reddit.com/link/1t5djy2/video/gtag8zzosizg1/player

How?

https://preview.redd.it/ii55fbpssizg1.png?width=4096&format=png&auto=webp&s=c310da570e4b40183d612d9a8caf980655e40347

1. Go to Venice to use the Agentic Chat function.

https://preview.redd.it/1rahcyrysizg1.jpg?width=612&format=pjpg&auto=webp&s=b3b50032b95c56bbb22a3d202d5d25fcd6ba8c40

2. Drop in this file containing the prompt template and the image you want to animate.
This is the image I used for this particular video & effect.

3. Tell the agent exactly what you want to see:
"Following the instructions on this document, write me a 15-second prompt that describes this model smoking her cigarette, she exhales the smoke and lies back on the hood of the car. Her body melts into the paint of the car, as she transforms into a beautiful mural."

https://preview.redd.it/l975odk6tizg1.png?width=1634&format=png&auto=webp&s=629703ded515997b929f876fb3ef4f13111d6528

4. Then we copy our prompt. We take our image into Venice Studio, where we select the Seedance 2.0 R2V model (Reference 2 Video).

Select your time, aspect ratio, and resolution.

^(by) ^(@jboogx_creative) ^(on X)

u/JaeSwift — 7 days ago

VVV - The Privacy Coin for AI
Venice is the leading platform for private and uncensored AI and VVV is the foundational asset of Venice, built to power a programmable AI economy. You can stake VVV to earn yield, and you can lock your staked VVV (sVVV) to mint DIEM, then stake DIEM for $1 per day of API credit per token.

DIEM makes AI compute ownable, tradeable, and transferable, so capacity can move between agents, bots, and applications or be monetized without selling your VVV.

VVV is the foundation of the Venice market, and DIEM is its unit of compute.

Buy > Stake > Mint

VVV is the capital asset of Venice
VVV is a crypto token on Base, an Ethereum Layer 2, and is among the top 1% of tokens by popularity on Coinbase.

By buying and staking VVV you can earn yield, get access to Venice Pro, mint DIEM, and be part of the platform pushing the frontier of unrestricted intelligence.

https://preview.redd.it/szxwmesxu6zg1.png?width=2400&format=png&auto=webp&s=ffc63afa0fcac5ba0b7ee1052021fba8181c0bf6

Stake VVV to unlock Venice Pro
When you stake 100 VVV you'll enjoy free access to Venice Pro, the world's leading private and uncensored AI app.

Pro users get unlimited text prompts, leading generative image and video models, and advanced features.

STAKE VVV

https://preview.redd.it/qkvdbgx2v6zg1.png?width=1024&format=png&auto=webp&s=d63044b5d8f6f529443f8d654f91aee9af94dea7

Mint DIEM with VVV
DIEM provides perpetual, ongoing access to the world's top AI models.

1 DIEM = $1 of AI credit every day.

All DIEM is created from VVV. By locking your staked VVV, you can mint DIEM and use it, or sell it to other AI consumers.

READ MORE ABOUT DIEM

https://i.redd.it/mkzc6r6qv6zg1.gif

Monthly VVV Burn

Starting Nov 2025, Venice uses a portion of monthly revenue to buy and burn the VVV token on an ongoing basis.

You can track the monthly burns directly through your token dashboard.

SEE VENICE BURNS
____________________

VVV Tokenomics

https://preview.redd.it/mrqwyx07w6zg1.jpg?width=1360&format=pjpg&auto=webp&s=ced3e35459657b8c498104d83d8a38c4745761f7

VVV launched on January 27th 2025 with a starting supply of 100M tokens. 
Up-to-date supply numbers here.

VVV is engineered as a long-term deflationary capital asset of the Venice AI platform. As Venice scales, VVV becomes more scarce.

We are continually reducing emissions and in December 2025, Venice started buying VVV from the market with a portion of revenue and burning it every month - permanently removing tokens from circulation.

This creates a powerful feedback loop:
More Revenue → More Buy & Burns → Less Supply → Deflationary VVV

Buy on:

____________________

FAQ:

^(Where can I learn more about the Venice token (VVV)?)
^(Learn more about VVV through our official) ^(token launch announcement blog post)^(.)
^(Learn more about DIEM through our DIEM) ^(technical breakdown blog post)^(.)
^(You can also join our community on Discord for updates:) ^(https://discord.gg/askvenice) 
^(or visit the Token section on our) ^(FAQ page) ^(for additional information.)

^(What is the Contract Address for the Venice token (VVV)?)
^(The Venice token contract address is) ^(0xacfE6019Ed1A7Dc6f7B508C02d1b04ec88cC21bf)
^(This is the smart contract address for VVV on Base.)
^(You can view the contract, balances, and transaction history on) ^(BaseScan)^(.)

^(How does VVV staking yield work?)
^(After the) ^(DIEM) ^(upgrade the Utilization Rate split is removed. VVV stakers receive 100% of emissions as yield paid in VVV. If your sVVV is locked to back minted) ^(DIEM) ^(you earn 80% of the standard staking yield while locked and 20% goes to Venice.)

^(When are staking rewards paid out?)
^(Staking rewards accumulate continually and you can withdraw them whenever you wish.)

u/JaeSwift — 9 days ago

Venice users love the Grok model suite by xAI and it's now the fastest-growing on Venice in terms of usage.

https://reddit.com/link/1t3um6q/video/o5ud3wkjo6zg1/player

All fully private. Zero data retention.

┏ ⑅ ━━━━━━━━━━━━━ ⑅ ┓

  • Grok 4.3
  • Grok 4.20
  • Grok 4.20 Multi-Agent
  • Grok 4.1 Fast
  • Grok Imagine
  • Grok Imagine Pro
  • Grok Imagine Video
  • Grok Imagine Edit
  • Plus xAI TTS and STT for voice

┗ ⑅ ━━━━━━━━━━━━━ ⑅ ┛

Try the full suite now on Venice.ai

u/JaeSwift — 9 days ago

Claude Code is Anthropic’s CLI tool for agentic coding.

This guide shows you how to run it through Venice AI for pay-per-token access to Claude Opus 4.6/4.7 and Claude Sonnet.

https://preview.redd.it/vbp34qb5n6zg1.png?width=1280&format=png&auto=webp&s=62a13685671c34a33f3c489c7f44fcd7252a81fc

Why You Need a Router
Claude Code connects directly to Anthropic’s API by default. To use it with Venice, you need claude-code-router, an open-source local proxy that:

Intercepts
Catches Claude Code’s outgoing requests before they reach Anthropic

Transforms
Converts request format and maps model IDs (e.g., claude-opus-4-5)

Redirects
Forwards requests to Venice at api.venice.ai/api/v1/chat/completions

Setup

1. Install Claude Code

If you haven’t already, install Anthropic’s Claude Code CLI:

npm install -g @anthropic-ai/claude-code

2. Install the Router

npm install -g @musistudio/claude-code-router

3. Get Your API Key

Generate a key from venice.ai/settings/api. You’ll paste it directly in the config file in the next step.

4. Create Configuration

Create the config directory:

mkdir -p ~/.claude-code-router

Then create ~/.claude-code-router/config.json with your preferred editor:

# Using nano
nano ~/.claude-code-router/config.json

# Or using VS Code
code ~/.claude-code-router/config.json

Paste the following configuration:

{
  "APIKEY": "",
  "LOG": true,
  "LOG_LEVEL": "info",
  "API_TIMEOUT_MS": 600000,
  "HOST": "127.0.0.1",
  "Providers": [
    {
      "name": "venice",
      "api_base_url": "https://api.venice.ai/api/v1/chat/completions",
      "api_key": "your-venice-api-key-here",
      "models": [
        "claude-opus-4-5",
        "claude-sonnet-4-5",
        "claude-opus-4-6",
        "claude-opus-4-6-fast",
        "claude-opus-4-6",
        "claude-opus-4-7",
        "claude-sonnet-4-6"
      ],
      "transformer": {
        "use": ["anthropic"]
      }
    }
  ],
  "Router": {
    "default": "venice,claude-opus-4-7",
    "think": "venice,claude-opus-4-7",
    "background": "venice,claude-opus-4-7",
    "longContext": "venice,claude-opus-4-7",
    "longContextThreshold": 100000
  }
}

^(🛈) ^(If you modify config.json while the router is running, restart it with ccr restart to apply changes.)

5. Launch

Start the router, then Claude Code:

ccr start
ccr code

Or use the activation method:

eval "$(ccr activate)" && claude

Supported Models

Model Venice ID Best For
Claude Opus 4.5 claude-opus-4-5 Complex reasoning, large refactors
Claude Sonnet 4.5 claude-sonnet-4-5 Fast iteration, everyday coding
Claude Opus 4.6 claude-opus-4-6 Complex reasoning, large refactors
Claude Opus 4.6 Fast claude-opus-4-6-fast Complex reasoning with lower latency
Claude Sonnet 4.6 claude-sonnet-4-6 Fast iteration, everyday coding
Claude Opus 4.7 claude-opus-4-7 Complex reasoning, large refactors

^(🛈 Claude Code is optimized for Claude models. While other models available through Venice) ^((GPT, DeepSeek, Grok, etc.)) ^(may work, we cannot guarantee an equivalent experience since Claude Code relies on Claude-specific features like extended thinking. For other models, consider using Venice’s) ^(standard API)^(.)

· · • • • ✤ • • • · ·

Router Features
The router provides several useful features beyond basic routing:

  • Switch models on the fly
    • Use the /model command inside Claude Code to switch models without restarting: /model venice,claude-sonnet-4-6
    • ^(Useful when you want Opus for complex tasks and Sonnet for quick iterations.)
  • Visual configuration with UI mode
    • Prefer a GUI? Launch the web-based config editor: ccr ui
    • ^(This opens a browser interface for editing your config.json without touching the file directly.)
  • Router scenarios explained
    • The Router config section controls which model handles different task types:
Scenario When it’s used
default General requests
think Reasoning-heavy tasks (Plan Mode)
background Background operations
longContext When context exceeds longContextThreshold tokens

You can route different scenarios to different models. For example, use Sonnet for background tasks to save costs.

  • Debugging with logs
    • If something isn’t working, check the logs:
      • Server logs (HTTP, API calls): ~/.claude-code-router/logs/ccr-*.log
      • Application logs (routing decisions): ~/.claude-code-router/claude-code-router.log

Set "LOG_LEVEL": "debug" in your config for more verbose output.

Caching Behaviour

Scenario Cache TTL Who Controls
Default (recommended) 5 minutes Claude Code + Venice
With cleancache transformer 1 hour Venice only

When NOT to use cleancache (most users)

The default configuration lets both systems cooperate:

  • Claude Code sends its native cache_control markers
  • Venice adds caching around them with a 5-minute TTL
  • Both systems share the 4-block cache limit

This works well for active coding sessions where you’re making frequent requests.

When to use cleancache

Add cleancache to the transformer if you:

  • Are hitting errors from the 4-block cache limit
  • Experience strange caching behavior
  • Prefer Venice’s 1-hour TTL for longer sessions

"transformer": { "use": ["anthropic", "cleancache"] }

This strips Claude Code’s cache markers, giving Venice full control with a longer TTL.

· · • • • ✤ • • • · ·

Resources

u/JaeSwift — 9 days ago

Hey!

Just shipped something I think some of you agent + image gen users will find useful: 

https://preview.redd.it/mfbx3bq5chyg1.png?width=1192&format=png&auto=webp&s=348fb3c3249d71dc65117bb527ede4af38eb66f2

JAE Image Skill:
A huge curated library of 12,500+ image-generation prompts across more than 10 categories, each with sample preview images, designed to be dropped into any AI agent.

The whole thing is self-hosted on my server/domain (no dependency on external CDNs), includes over 3 GB of sample images, and installs in seconds.

What it is
JAE Image Skill is a structured JSON prompt library with real sample output images for every single entry. It's not a LoRA or a model - it's a reference library of battle-tested prompts that work across GPT Image, Nano Banana, Flux, Stable Diffusion, Seedream, and any other text-to-image model.

https://preview.redd.it/fp4y177fchyg1.png?width=1174&format=png&auto=webp&s=c6170541594aeed7c1def20a220bb0fadccb455c

How to install
Send this to your agent:

https://jaeswift.xyz/skills/JAE-image-skill/Jae-Image-Skill.MD

Your agent will then install the skill, verify it, and that's it - you're done.
_________

https://preview.redd.it/q5xvwgxpchyg1.png?width=1174&format=png&auto=webp&s=0f26138be7baa35bb8b1b26c105aeb0dec108f59

How to use it
After your agent installs the skill, send simple chats to your agent like:

"I need a cyberpunk avatar"
"Product shot for my sneaker brand"
"App dashboard mockup"
"Instagram carousel design"
"Character sprites for my game"

Or anything else.

It will search the skill's library, find prompts matching your request, and send them to you. Each prompt comes with a pre-generated preview so you can see how it looks before deciding whether to generate your own images with it, so you're not wasting credits!

See more information here:
https://jaeswift.xyz/skills/JAE-image-skill/

Happy prompting!

u/JaeSwift — 13 days ago

https://preview.redd.it/fxv6nvcmbgyg1.png?width=2560&format=png&auto=webp&s=30ff1bc6015d7bd55fd6322947975a373a3cd8e9

Prompt:

{"type":
"character turnaround sheet"

"subject":
"full-body concept art of a young male scavenger/adventurer"

"style":"high-detail anime-inspired concept design with fine ink linework, muted earthy colours, lightly weathered texture, clean studio presentation"

"background":{"colour":"warm off-white paper",

"details":"minimal blank backdrop with faint rectangular guide lines and generous margins"},

"character":
{"age_appearance":
"30s","male","build":"slim, tall, practical proportions"
"hair":{"colour":"{argument name=\"hair colour\" default=\"shaved short\"}"

"style":"baseball cap with JAE embroidered on the front in white"}

"face":
{"visibility":"face partially obscured or blurred in the front and side views, no clear facial detail"},

"outfit":
{"outerwear":
"oversized hooded utility jacket in desaturated blue-grey, heavily stitched and patched, covered with small straps, fasteners, dangling cords, metal clips, tiny lights, and technical trinkets"
"left_sleeve_patch":"large JAE patch"
"pants":"baggy tan cargo pants with multiple pockets, wrinkles, wear, and gathered cuffs"
"gloves":"dark fingerless or fitted utility gloves"
"boots":"heavy dark lace-up combat boots with thick soles"}

"accessories":
["cross-body harness straps"
"small chest-mounted pouches and devices"
"belt with attached tools and hanging hardware"
"compact worn satchel worn across the back/side"
"round mechanical device hanging near the hip in rear views"]

"palette":
"dusty blue, charcoal, brown, tan, brass, and muted metallic accents"}

"layout":{"composition":"five evenly spaced full-body figures aligned horizontally"

"views":
["front view"
"three-quarter front view"
"left side profile"
"back view"
"three-quarter back view"]

"count":5,
"presentation":
"consistent scale, turnaround/reference-sheet format, centred on page"}}

Alter as you see fit. This was done using GPT-Image-2, but it'll probably work well on some others like Nano Banana too.

u/JaeSwift — 13 days ago

https://reddit.com/link/1t09zrr/video/oi99zo1mdeyg1/player

Full voice conversation with web and X-search built in. Ask about today's news, dig into the details, pivot to something completely different. It doesn't miss a beat.

What's live today:

  • Instant response time
  • Natural interruption: start talking and it stops
  • Web search + X search inside the conversation
  • Multi-turn context that carries across topic changes
  • 5 voices, 9 languages
  • All turns saved as text in your chat
  • Zero data retention

Available for Pro on web and in the mobile app.

Let us know what you think, your feedback is appreciated as always.

u/JaeSwift — 13 days ago

Retrieval-augmented generation, or RAG, is one of the most useful patterns for building AI applications that need to answer from your own documents. Instead of asking a model to rely on memory alone, you retrieve relevant source material first, send that context to the model, and ask it to answer with citations.

In this tutorial, we’ll build a private RAG bot using Python, Venice for embeddings and chat completions, Qdrant for vector search, and FastEmbed for local re-ranking. By the end, you’ll have the core pieces for a local document assistant that can ingest your files, retrieve relevant chunks, re-rank them, and answer with citations.

https://preview.redd.it/6betgc6prcyg1.png?width=899&format=png&auto=webp&s=98f2a8f6ba804d4b6f9804b1bc5baf8b30af6b39

Before we continue:
if you want to run the code in this article, you’ll need a Venice API key.

Export it as an environment variable:

export VENICE_API_KEY=<my-key>

Interested in the full code implementation? Check out the GitHub repo.

How a Modern RAG Bot Works

A good RAG pipeline is more than “put documents in a vector database.”

The basic flow looks like this:

Step What happens
Load Read local Markdown, text, or reStructuredText files
Chunk Split long documents into overlapping sections
Embed Use Venice embeddings to turn chunks into vectors
Store Save vectors and source metadata in Qdrant
Retrieve Embed the user’s question and run vector search
Re-rank Use a cross-encoder to rescore the best candidates
Answer Send the best context to a Venice chat model with citation instructions

The re-ranking step is the upgrade that makes this more useful than a basic RAG demo. Vector search is fast and good at finding semantically similar chunks, but it can still return passages that are adjacent to the topic rather than directly useful. A cross-encoder reads the question and each candidate chunk together, then scores how well that chunk actually answers the question.

Installing the Dependencies

We’ll use the OpenAI Python SDK because Venice exposes an OpenAI-compatible API.

We’ll also use Qdrant’s Python client with FastEmbed support:

pip install "openai>=1.0.0" "qdrant-client[fastembed]>=1.14.1"

If you prefer to keep dependencies in a file, create requirements.txt with the same packages:

openai>=1.0.0
qdrant-client[fastembed]>=1.14.1

Choosing the Models

Create a file called rag_bot.py, then start by adding the imports, data structures, API URL, and model names:

import os
import textwrap
import uuid
from dataclasses import dataclass
from pathlib import Path

from fastembed.rerank.cross_encoder import TextCrossEncoder
from openai import OpenAI
from qdrant_client import QdrantClient, models

VENICE_BASE_URL = "https://api.venice.ai/api/v1"
CHAT_MODEL = "kimi-k2-6"
EMBEDDING_MODEL = "text-embedding-bge-m3"
RERANKER_MODEL = "Xenova/ms-marco-MiniLM-L-6-v2"
COLLECTION_NAME = "private_rag_bot"

@dataclass
class SourceDocument:
    content: str
    metadata: dict


@dataclass
class RankedChunk:
    content: str
    metadata: dict
    vector_score: float
    rerank_score: float

The embedding model name is intentionally OpenAI-compatible. Venice maps compatible embedding model names to Venice-hosted embedding models, so existing OpenAI SDK code can usually move over by changing the base_url and API key.

You can list available Venice models with:

curl "https://api.venice.ai/api/v1/models?type=embedding" \
  -H "Authorization: Bearer $VENICE_API_KEY"

For chat models:

curl "https://api.venice.ai/api/v1/models?type=text" \
  -H "Authorization: Bearer $VENICE_API_KEY"

Creating the Venice and Qdrant Clients

Create one OpenAI-compatible Venice client for both embeddings and chat completions:

venice = OpenAI(
    api_key=os.environ["VENICE_API_KEY"],
    base_url=VENICE_BASE_URL,
)

For Qdrant, you have three useful modes:

Mode When to use it
QdrantClient(":memory:") Quick local demos and tests
QdrantClient(path="./qdrant_data") Local persistent storage
QdrantClient(url=..., api_key=...) A remote or managed Qdrant cluster

For a private local bot, start with an on-disk local Qdrant path:

qdrant = QdrantClient(path="./qdrant_data")

There are a few different ways to handle deployment in production. However, if you use a remote Qdrant deployment, remember that your document chunks and metadata will be stored there. Venice can keep the inference layer private, but you should still choose the right Qdrant deployment for your data.

Loading and Chunking Documents

For this tutorial, we’ll let the bot ingest local files or folders. Start with .md, .rst, and .txt files:

TEXT_EXTENSIONS = {".md", ".rst", ".txt"}

def expand_paths(paths: list[Path]) -> list[Path]:
    files = []
    for path in paths:
        if path.is_dir():
            files.extend(
                sorted(
                    file_path
                    for file_path in path.rglob("*")
                    if file_path.is_file()
                    and file_path.suffix.lower() in TEXT_EXTENSIONS
                )
            )
        elif path.is_file():
            files.append(path)
        else:
            raise FileNotFoundError(f"Document path does not exist: {path}")
    return files

Once the files are loaded, we need to split the text up by “chunking” it - separating it into smaller pieces. A naive strategy might split the text into evenly sized chunks, but that can cut across semantic boundaries and reduce the effectiveness of your RAG system.

The chunking strategy we will use prefers paragraph or sentence boundaries so the model gets coherent context:

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    clean_text = textwrap.dedent(text).strip()
    if not clean_text:
        return []
    if len(clean_text) <= chunk_size:
        return [clean_text]

    chunks = []
    start = 0
    while start < len(clean_text):
        end = min(start + chunk_size, len(clean_text))

        if end < len(clean_text):
            paragraph_break = clean_text.rfind("\n\n", start, end)
            sentence_break = clean_text.rfind(". ", start, end)
            split_at = max(paragraph_break, sentence_break)
            if split_at > start + chunk_size // 2:
                end = split_at + 1

        chunk = clean_text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        if end >= len(clean_text):
            break

        start = max(end - chunk_overlap, start + 1)

    return chunks

A starting chunk size of 1000 characters with 150 characters of overlap is a good default for mixed Markdown and text documents. Smaller chunks can improve precision. Larger chunks can preserve more context. The right setting will often depend on the kinds of documents you are storing.

Embedding Documents with Venice

Once we have chunks, we embed them in batches:

def embed(texts: list[str]) -> list[list[float]]:
    embeddings = []
    for start in range(0, len(texts), 32):
        batch = texts[start : start + 32]
        response = venice.embeddings.create(
            model="text-embedding-bge-m3",
            input=batch,
        )
        embeddings.extend(
            item.embedding
            for item in sorted(response.data, key=lambda item: item.index)
        )
    return embeddings

Batching matters. Embedding one chunk at a time is simple, but it adds avoidable latency. Keep the batch size configurable so you can tune throughput based on your workload.

Storing Vectors in Qdrant

Before inserting points, create a Qdrant collection with the right vector size. The easiest way to know the vector size is to embed the first batch, then use len(embeddings[0]).

qdrant.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(
        size=len(embeddings[0]),
        distance=models.Distance.COSINE,
    ),
)

Each point stores the vector plus payload metadata. The payload includes the original text and a source path so the answer can cite where the context came from:

points.append(
    models.PointStruct(
        id=chunk_id,
        vector=embedding,
        payload={
            "text": chunk.content,
            "source": source,
            "chunk_index": chunk_index,
        },
    )
)

qdrant.upsert(collection_name=COLLECTION_NAME, points=points)

Use deterministic UUIDs derived from source, chunk_index, and content. That makes repeated ingestion idempotent for unchanged chunks.

Retrieving Candidate Chunks

At question time, the bot embeds the user’s question and asks Qdrant for the top vector matches:

query_vector = embed([question])[0]
hits = qdrant.query_points(
    collection_name=COLLECTION_NAME,
    query=query_vector,
    with_payload=True,
    limit=8,
).points

The limit here is the candidate count. It should usually be higher than the number of chunks you plan to send to the model because the next step will re-rank them. A good default is to retrieve 8 candidates and send the best 4 to the chat model.

Re-ranking with FastEmbed

Now we add the part that makes the retrieval feel much smarter.

from fastembed.rerank.cross_encoder import TextCrossEncoder

reranker = TextCrossEncoder(model_name="Xenova/ms-marco-MiniLM-L-6-v2")

candidate_texts = [str((hit.payload or {}).get("text", "")) for hit in hits]
rerank_scores = list(reranker.rerank(question, candidate_texts))
reranked = sorted(
    zip(hits, rerank_scores),
    key=lambda hit_and_score: hit_and_score[1],
    reverse=True,
)

The important difference between embedding search and cross-encoder re-ranking is how the scoring happens. Embedding search compares one vector for the question against one vector for each chunk; it is fast and scalable. A cross-encoder evaluates the question and chunk together; it is slower, but it can judge relevance more directly. That is why the usual pattern is:

  1. Retrieve a larger candidate set with vector search.
  2. Re-rank only those candidates locally.
  3. Send the top few chunks to the language model.

A good starting point is candidate_k=8 and top_k=4. Increase candidate_k if the right source is often nearby but not making it into the final context.
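The later snippets work with RankedChunk objects that carry the chunk text, its metadata, and both scores. Here is a minimal sketch of that container and of trimming to top_k, under the assumption that the dataclass shape matches the format_context function shown below:

from dataclasses import dataclass, field

@dataclass
class RankedChunk:
    content: str
    vector_score: float
    rerank_score: float
    metadata: dict = field(default_factory=dict)

# Keep only the best top_k candidates after re-ranking (top_k=4 here).
top_chunks = [
    RankedChunk(
        content=str((hit.payload or {}).get("text", "")),
        vector_score=hit.score,
        rerank_score=score,
        metadata=hit.payload or {},
    )
    for hit, score in reranked[:4]
]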

Answering with Venice Chat Completions

Once the context is selected, format it with source numbers:

def format_context(chunks: list[RankedChunk]) -> str:
    if not chunks:
        return "No relevant context was retrieved."

    context_parts = []
    for index, chunk in enumerate(chunks, start=1):
        source = chunk.metadata.get("source", "unknown")
        context_parts.append(
            f"[{index}] Source: {source} | "
            f"Vector score: {chunk.vector_score:.4f} | "
            f"Rerank score: {chunk.rerank_score:.4f}\n"
            f"{chunk.content}"
        )
    return "\n\n---\n\n".join(context_parts)
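Wiring it together is one line, assuming top_chunks holds the re-ranked chunks selected earlier:

context = format_context(top_chunks)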

Then send the context to a Venice chat model:

response = venice.chat.completions.create(
    model="kimi-k2-6",
    temperature=0.2,
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful RAG assistant. Answer using only the supplied "
                "context. If the context does not answer the question, say that "
                "you do not have enough information."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Retrieved context:\n{context}\n\n"
                f"Question: {question}\n\n"
                "Answer with citations like [1] when the context supports the answer:"
            ),
        },
    ],
)

Notice the system prompt: the bot is told to answer only from the supplied context. That is a simple but important guardrail. A RAG assistant should not confidently answer from general model knowledge when the retrieved documents do not support the answer.
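Because the endpoint is OpenAI-compatible, the answer text lives in the first choice. A small sketch of printing the answer and its sources, reusing the top_chunks list from the re-ranking step:

answer = response.choices[0].message.content

print("Answer")
print("=" * 60)
print(answer)

print("\nSources")
print("=" * 60)
for index, chunk in enumerate(top_chunks, start=1):
    source = chunk.metadata.get("source", "unknown")
    print(f"{index}. {source} (vector={chunk.vector_score:.4f}, rerank={chunk.rerank_score:.4f})")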

Running the Bot

Once you assemble the pieces into a script, save it as rag_bot.py. A simple first run can use a few built-in sample documents so you can verify the pipeline before ingesting your own files:

python rag_bot.py \
  --question "What does reranking improve in a RAG pipeline?"

To ingest your own documents:

python rag_bot.py \
  --docs ./docs \
  --question "What does this project do?"

To keep a local Qdrant collection on disk and start an interactive chat:

python rag_bot.py \
  --docs ./docs \
  --qdrant-path ./qdrant_data \
  --chat

The script prints the answer, then prints the sources with both vector and re-ranking scores:

Answer
============================================================
Reranking improves retrieval quality by rescoring the top
vector-search candidates with a cross-encoder model [1].

Sources
============================================================
1. sample-docs (vector=0.8123, rerank=0.7342)

If you want to inspect the actual text passed into the model, add:

--show-context

Useful CLI Options

Option Default What it controls
--candidate-k 8 Number of vector search results to re-rank
--top-k 4 Number of re-ranked chunks sent to the chat model
--chunk-size 1000 Maximum chunk size before overlap
--chunk-overlap 150 Characters repeated between neighboring chunks
--embedding-batch-size 32 Number of chunks per Venice embeddings request
--qdrant-path unset Local persistent Qdrant storage path
--qdrant-url unset Remote Qdrant URL
--skip-ingest false Query an existing collection without reloading docs
--recreate-collection false Delete and rebuild the Qdrant collection

For repeated local development, a common flow is:

python rag_bot.py \
  --docs ./docs \
  --qdrant-path ./qdrant_data \
  --recreate-collection \
  --question "Summarize the most important setup steps."

Then ask follow-up questions without ingesting again:

python rag_bot.py \
  --qdrant-path ./qdrant_data \
  --skip-ingest \
  --question "Which file explains deployment?"

Privacy Notes

For a private RAG setup, think about each layer separately:

Layer Privacy consideration
Venice embeddings Document chunks are sent to Venice to create vectors
Venice chat Retrieved context is sent to Venice to answer the question
Qdrant local Vectors and payloads stay on your machine
Qdrant remote Vectors and payloads are stored wherever your Qdrant server runs
FastEmbed re-ranker Re-ranking runs locally once the cross-encoder model has been downloaded

The most private default for this tutorial is Venice for inference, local Qdrant on disk, and local FastEmbed re-ranking. That gives you a practical RAG bot without sending your vector database payloads to a third-party vector store.

Common Errors to Handle Up Front

Symptom What it usually means What to do
Set VENICE_API_KEY before running this example. The environment variable is missing Export VENICE_API_KEY before running the script
Document path does not exist A path passed to --docs is wrong Check the file or folder path
Empty retrieval results Nothing was ingested, or the wrong collection is being queried Remove --skip-ingest or confirm --collection and --qdrant-path
Qdrant vector size error The collection was created with a different embedding model Recreate the collection after changing embedding models
Slow first re-rank FastEmbed may be downloading or initializing the cross-encoder Let the first run finish, then subsequent runs should be faster

If you change embedding models, recreate the Qdrant collection. Different embedding models can produce vectors with different dimensions, and Qdrant collections expect a fixed vector size.
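A guard along these lines keeps ingestion safe to repeat; the recreate flag here is a stand-in for the --recreate-collection option:

# Sketch: drop the collection when asked, then create it only if it is missing.
if recreate and qdrant.collection_exists(COLLECTION_NAME):
    qdrant.delete_collection(collection_name=COLLECTION_NAME)

if not qdrant.collection_exists(COLLECTION_NAME):
    qdrant.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=models.VectorParams(
            size=len(embeddings[0]),
            distance=models.Distance.COSINE,
        ),
    )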

Where to Go Next

Once you have the baseline running, the highest-impact improvements are usually:

  • Add document-specific loaders for PDFs, HTML, tickets, or internal wiki pages.
  • Store richer metadata such as titles, headings, dates, owners, and URLs.
  • Tune candidate_k, top_k, chunk size, and overlap on real questions.
  • Add evaluation questions so you can measure retrieval quality before and after changes.
  • Stream the final Venice chat completion for a better interactive chat experience.

RAG systems are easy to demo and surprisingly easy to make mediocre. The vector search plus re-ranking pattern is a strong foundation because it keeps retrieval fast while giving the bot a better chance of sending the language model the right context.

The web version of this guide is available here:
https://docs.venice.ai/guides/projects/private-rag-bot

Find more Venice guides here:
https://docs.venice.ai/guides

Check out Josh's GitHub for the full code implementation.

___________________

^(Originally written by) ^(Joshua Mo) ^(on) ^(https://docs.venice.ai/guides/projects/private-rag-bot)

reddit.com
u/JaeSwift — 13 days ago

been messing with GPT Image 2 since it dropped and the single biggest leap in my output quality comes from imposing structure on my prompts. random descriptions get random results. this framework fixed that. i won't take credit for the framework itself, and i have no idea where i first saw it, but i use it for almost every image (and sometimes video) that i generate.

it is called the SSCLD framework:

S - SUBJECT
Anchor the main subject first. who or what is in the image. be specific about identity, role, age, and pose.

bad: "a woman at a desk"
good: "a 35-year-old caucasian female CEO leaning forward with hands folded"

the model needs constraints. vague subjects give you generic output. make sure you get the subject down before anything else.

S - STYLE
define the visual treatment. editorial photography, illustration, 3D render, infographic, mockup, watercolour, cel-shaded - whatever it is, state it explicitly.

style is the single biggest lever for consistency across gens. if you don't specify it then the model guesses, and its guesses are shitty and erratic.

C - COMPOSITION
set the framing and spatial arrangement. shot type (close-up, medium, wide, extreme wide), angle (eye level, overhead, 45°, low angle), and layout (grid, flat-lay, hero centred, rule of thirds).

composition tells the model where things go in the frame. without it you get floating subjects in ambiguous space.

L - LIGHTING
specify light source, colour temperature, contrast, and atmosphere.

"soft window light from the left, warm golden-hour tone"
"harsh overhead fluorescent, cool blue-white cast, high contrast"
"neon glow from signage, saturated pink and cyan reflections"

lighting does more work than almost any other element and it sets the entire emotional register of the image.

D - DETAILS
lock in textures, materials, colour hex codes, typography, output quality, and anything you don't want.

"shot on 85mm f/1.4, sharp facial details, subtle skin texture, no oversaturation"
"deep brown #3C2415, cream #FFF8F0, gold accent #C9A84C, matte paper texture, no warm cast"

the anti-patterns matter. i notice that with GPT Image 2, it seems to love oversaturation and soft focus by default. telling it what to avoid is as important as telling it what to include.

example:

>a 35-year-old caucasian female ceo (subject), corporate editorial photography (style), medium close-up at eye level with shallow depth of field (composition), soft window light from the left and warm golden-hour tone (lighting), shot on 85mm f/1.4, sharp facial details, subtle skin texture, no oversaturation (details)

this prompt will outperform 90% of what people are typing into image generators right now trying to get that kind of image.

here is the result with that prompt:

https://preview.redd.it/qgy54qynetxg1.png?width=2560&format=png&auto=webp&s=2e2efd5e2319d716992276378d9efd6258a8ce33

Common issues & fixes

Distorted hands or faces
be explicit about anatomy. "natural five-finger anatomy", "symmetric facial features", "proportional hands resting at sides". referencing a real photo style ("shot on 85mm f/1.4") biases the model toward photo-real anatomy instead of painterly distortion.

Garbled or misspelled text
wrap exact text in quotes inside the prompt: title 'SOLAR SYSTEM GUIDE'. keep on-image text under 8 words per block. specify font style: 'bold sans-serif title', 'serif editorial caption'.

GPT Image 2 (and some other models on Venice) can spell, but it needs you to be extremely specific about what text goes where. "a poster that says welcome to the future" will come out mangled. "poster with bold sans-serif title 'WELCOME TO THE FUTURE' centred at top" will not.

Too many elements / cluttered output
cap the scene to one focal subject + 3-5 supporting elements. if you genuinely need a complex composition, structure it as a numbered list inside the prompt:

(1) hero product centred, (2) three supporting ingredients arranged below, (3) text banner at top

most models, especially GPT Image 2, handle numbered spatial instructions much better than comma-separated lists.

Colour palette drift
pin colours with hex codes: 'deep brown #3C2415, cream #FFF8F0, gold accent #C9A84C'. add 'no oversaturation' or 'no warm cast' to anchor neutrality when needed.

hex codes are the most underused tool in image prompting. they give the model an exact target instead of a vague vibe.

Some extra tips i've found useful

Iterate on one element at a time
when a prompt produces 70% of what you want, change exactly one thing in the next generation. change the lighting. change the angle. change the colour palette. never rewrite the whole prompt at once or you lose the thread of what was working.

Use reference to real camera gear
"shot on hasselblad h6d-100c, 100mm f/2.2" does real work. the model has seen millions of photos tagged with specific gear and it will reproduce the optical characteristics - bokeh quality, sharpness falloff, lens compression. this is probably the easiest quality upgrade available.

Specify the background explicitly
"clean white background" or "out of focus office interior" or "dark gradient backdrop". if you don't, the model will fill the background with whatever it thinks fits, and it will often be wrong.

Avoid "realistic" as a style word
it's too vague. instead use the actual style: "editorial photography", "documentary photography", "studio product photography", "fashion editorial". each of these carries specific conventions the model understands.

Use negative constraints liberally
"no text", "no watermark", "no lens flare", "no background clutter", "no oversaturation", "no soft focus". the model respects these more than you'd expect.

that's the framework. 5 parts, every prompt. Subject, Style, Composition, Lighting, Details. it's not complicated, it's just disciplined, and discipline is what separates consistent output from gambling.

here are some other examples using this framework:

https://preview.redd.it/97ggdd98ftxg1.png?width=900&format=png&auto=webp&s=bdf2912f144c0eb34c783f75a04ca9f5a3e06b29

https://preview.redd.it/epfp2up6gtxg1.png?width=1672&format=png&auto=webp&s=274b984919ad305dc262e79666422aa1f92fe663

https://preview.redd.it/x3u76ffrgtxg1.png?width=2560&format=png&auto=webp&s=eb51ae88d31cc34d0b550948a109fa90ba7525c8

i hope this was helpful to you!

try it out and let me know what you think and if it improved your generations. if you change any parts of it or find something that works better for you, comment that too!

^(- Jae)

reddit.com
u/JaeSwift — 16 days ago

https://preview.redd.it/hg2e80jk8txg1.png?width=2560&format=png&auto=webp&s=04a0861a72fc5e29fdb87983c3e59edc97db2cbe

https://preview.redd.it/m3jd3zsn8txg1.png?width=2560&format=png&auto=webp&s=84e0325722079b5a40e100ad0962723f09956910

https://preview.redd.it/kxlyb0is8txg1.png?width=2560&format=png&auto=webp&s=d3354d3c1c5e2eefe30d5fb7c434e91d27db6896

all generated using GPT Image 2 - haven't tried with other models.

Prompt:

Open-world RPG character design sheet (CharacterSheet) for a 37-year-old male swordsman with short stubble beard. Light grey grid paper background, formal character design document style. 

Centre: standard three-view (front / side / back) of the character wearing cyberpunk style combat armour (dark grey/black carbon body + dark shoulder guards + black cape with neon green accents + a longsword and potion vials at the waist), some cool cyberpunk style gadgets, black baseball cap with JAE in black embroidery on the front, black hi-top sneakers, fingerless combat gloves. 

Surrounding panels: weapon close-ups (sword detail, throwing daggers, spell scroll, gadget features), facial expression sheet (default/determined/surprised/smirk/battle cry/wounded), height comparison chart (6'1"), colour palette swatches. Anime concept-art quality, clean linework, soft cel-shading.

reddit.com
u/JaeSwift — 16 days ago

this prompt works best with GPT Image 2. i tried it with Nano Banana Pro but it gave a more cartoon-style vibe and was not as detailed. i'm not sure how well it would do with other models.

all of these were created with GPT Image 2:

https://preview.redd.it/xzynq1uw7sxg1.png?width=1536&format=png&auto=webp&s=b39e85da867bb413d10852752e7eceb9c4cbcfcc

https://preview.redd.it/vt0sp3p38sxg1.png?width=1536&format=png&auto=webp&s=f930e6263bd4064c5e1b209751f6677421af1589

https://preview.redd.it/gqoi6xep8sxg1.png?width=1536&format=png&auto=webp&s=a2660f7bd3599014becf85b501f337f014d5b09b

https://preview.redd.it/tzpx1g8j9sxg1.png?width=1536&format=png&auto=webp&s=c324ae768dd8582b0f51ef77471b3193fdb7cdda

https://preview.redd.it/b4o19k0o9sxg1.png?width=1536&format=png&auto=webp&s=2e2c328551fe547b65496c6e0765131733a56b09

https://preview.redd.it/r8pyao0rcsxg1.png?width=1536&format=png&auto=webp&s=159194a1fc027a827abf26b0ba508a86c11b8353

Prompt:

Design a high-quality 3D poster for the movie/novel "[insert movie/book name]" and famous scenes.

First, please use your knowledge base to retrieve information about this movie/novel and find a representative famous scene or core location. In the centre of the image, construct this scene as a delicate axonometric 3D miniature model. The style should adopt DreamWorks Animation's delicate and soft rendering style. You need to reproduce the architectural details, character dynamics, and environmental atmosphere of that time, whether it's a storm or a quiet afternoon, naturally integrating into the model's lighting.

Regarding the background, do not use a simple pure white background. Please create a void environment with faint ink wash diffusion and flowing light mist around the model, with elegant colours, making the image look breathable and have depth, highlighting the preciousness of the central model.

Finally, for the bottom layout, centre the novel title with a font that matches the original style. The overall layout should be as balanced as a high-end museum exhibit label.

reply to this comment with your own generations using this prompt and share what model you used.

reddit.com
u/JaeSwift — 16 days ago