Back to Home

Gemini Native API

Google Gemini's native protocol endpoint. Lets you use Google's official SDKs (google-genai / @google/genai) against Tokensmart with zero code changes, accessing the full Gemini multimodal feature set.

πŸ’‘ Already using the OpenAI SDK? You can keep calling /v1/chat/completions for Gemini too β€” just set model to gemini-3.5-flash or any other Gemini model. You don't have to switch to this endpoint.
Use Gemini Native when: you're already on Google's official SDK / you need Gemini-only features (precise thinkingBudget, fine-grained safetySettings, avgLogprobs confidence output) / you want zero-translation token accounting.
Stick with the OpenAI format when: you call multiple providers (GPT + Claude + Gemini) from the same codebase / you're migrating from OpenAI and don't want to swap SDKs.

Endpoints

PurposeEndpoint
Non-streamingPOST /v1beta/models/{model}:generateContent
Streaming (SSE)POST /v1beta/models/{model}:streamGenerateContent?alt=sse

Replace {model} with a model ID like gemini-3.5-flash.

Authentication (3 methods)

Pick any one. Header methods are recommended over query param:

1. Authorization header (recommended)

Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx

2. x-api-key header

x-api-key: pk_live_xxxxxxxxxxxxxxxx

3. ?key= query parameter (Google SDK default)

POST /v1beta/models/{model}:generateContent?key=pk_live_xxxxxxxxxxxxxxxx

⚠️ With ?key= the API key appears in the URL and may be captured by CDN logs or browser history. Prefer the header methods in production.

Request body

Fully follows the Google generateContent reference. Common fields:

FieldTypeRequiredDescription
contentsarrayβœ“Conversation history. Each item has role (user / model) and a parts array
systemInstructionobjectβœ—System instruction. Shape: {parts: [{text: "..."}]}
generationConfigobjectβœ—Output parameters (maxOutputTokens, temperature, topP, topK, thinkingConfig, ...)
toolsarrayβœ—Function calling tool definitions
toolConfigobjectβœ—Tool calling mode configuration
safetySettingsarrayβœ—Fine-grained safety threshold control
cachedContentstringβœ—Reference an existing cached content resource (creation endpoint not yet supported)

Example: plain text

curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:generateContent \
  -H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      { "parts": [{ "text": "Hi, briefly introduce yourself." }] }
    ],
    "generationConfig": {
      "maxOutputTokens": 512
    }
  }'

Response:

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{ "text": "Hello! I'm Gemini..." }]
    },
    "finishReason": "STOP",
    "avgLogprobs": -1.23
  }],
  "usageMetadata": {
    "promptTokenCount": 10,
    "candidatesTokenCount": 50,
    "totalTokenCount": 60,
    "thoughtsTokenCount": 0
  },
  "modelVersion": "gemini-3.5-flash"
}

Example: streaming

curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:streamGenerateContent?alt=sse \
  -H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{ "parts": [{ "text": "Tell me a one-sentence story." }] }],
    "generationConfig": { "maxOutputTokens": 1024 }
  }'

Returns SSE format. Each chunk:

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"Once upon..."}]}}],"modelVersion":"gemini-3.5-flash"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" a time..."}]},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":8,"candidatesTokenCount":15,"totalTokenCount":23}}

The last chunk carries the complete usageMetadata. The Google SDK accumulates this automatically.

Example: multimodal (image understanding)

Add an inline_data part (base64-encoded image) to the parts array:

IMG_B64=$(base64 -w 0 photo.jpg)

curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:generateContent \
  -H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d "{
    \"contents\": [{
      \"parts\": [
        { \"text\": \"What is in this image?\" },
        { \"inline_data\": { \"mime_type\": \"image/jpeg\", \"data\": \"$IMG_B64\" } }
      ]
    }]
  }"

Supported MIME types: image/jpeg, image/png, image/webp, image/heic, image/heif.

Note: total request body size limit is 30MB. The Files API for huge files (video / PDF) is not yet supported.

Example: precise thinking budget

Thinking-capable models (e.g. gemini-3.5-flash) expose explicit reasoning token control:

{
  "contents": [{ "parts": [{ "text": "..." }] }],
  "generationConfig": {
    "maxOutputTokens": 2048,
    "thinkingConfig": {
      "thinkingBudget": 1024
    }
  }
}

thinkingBudget semantics:

ValueBehavior
0Disable thinking, respond immediately
> 0Exact thinking budget in tokens
-1Unlimited, model decides

πŸ’‘ Thinking models can burn significant tokens on reasoning by default. If you only need short answers, explicitly set thinkingBudget: 0 or a small value β€” otherwise max_tokens may be exhausted by reasoning and the visible output gets cut off.

Using Google's official SDK

This is the headline value of the Gemini Native endpoint β€” change one baseUrl line to migrate from Google official.

Python (google-genai)

from google import genai

client = genai.Client(
    api_key="pk_live_xxxxxxxxxxxxxxxx",
    http_options={"base_url": "https://api.tokensmart.ai"},
)

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Hi",
)
print(response.text)

Node.js (@google/genai)

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({
  apiKey: "pk_live_xxxxxxxxxxxxxxxx",
  httpOptions: { baseUrl: "https://api.tokensmart.ai" },
});

const response = await ai.models.generateContent({
  model: "gemini-3.5-flash",
  contents: "Hello",
});
console.log(response.text);

Application code is unchanged. SDK behavior, field names, error handling β€” all preserve Google's official semantics.

Token billing breakdown

Gemini's usageMetadata is billed as follows:

FieldBilling rate
promptTokenCountModel's input_price
candidatesTokenCountoutput_price (visible output)
thoughtsTokenCountoutput_price (reasoning is also output)
cachedContentTokenCountcache_read_price (much lower than input_price)

Multimodal image input tokens appear under promptTokensDetails[modality=IMAGE] and are billed at input_price (same rate as text tokens).

Error response format

Errors come back in Google's native shape:

{
  "error": {
    "code": 404,
    "message": "Model 'xxx' is not available",
    "status": "NOT_FOUND"
  }
}

Common errors:

HTTPstatusMeaning
401UNAUTHENTICATEDInvalid or missing API key
403PERMISSION_DENIEDKey has no access to this model, or account suspended
404NOT_FOUNDModel does not exist or has been retired
402FAILED_PRECONDITIONInsufficient balance
429RESOURCE_EXHAUSTEDRate limit or concurrent connection limit triggered
502UNAVAILABLEUpstream gateway failure
501UNIMPLEMENTEDEndpoint not yet implemented

Endpoint coverage status

Google endpointTokensmart status
:generateContent non-streamβœ… Fully supported
:streamGenerateContent streamβœ… Fully supported (SSE)
:countTokens pre-count❌ Not yet implemented
:embedContent / :batchEmbedContents❌ Not implemented β€” embedding users please use the OpenAI-compatible endpoints
Google-format GET /v1beta/models❌ Use /v1/models (OpenAI format) instead
Files API (/v1beta/files)❌ Not yet β€” for large files use inline_data (30MB cap)
Cached Content explicit creation❌ Not yet implemented
Imagen text-to-image (:predict)❌ For image generation use /v1/images/generations
Batch async (:batchGenerateContent)❌ Not yet implemented

Supported Gemini models

See the model list for the current set of available models. Every gemini-* model is callable through the Gemini Native endpoint.

If a model is available on both the OpenAI-compatible and Gemini Native endpoints, either protocol works equally well β€” billing and rate limits are identical across both.