Gemini Native API

Google Gemini's native protocol endpoint. Lets you use Google's official SDKs (google-genai / @google/genai) against Tokensmart with zero code changes, accessing the full Gemini multimodal feature set.

💡 Already using the OpenAI SDK? You can keep calling /v1/chat/completions for Gemini too — just set model to gemini-3.5-flash or any other Gemini model. You don't have to switch to this endpoint.
Use Gemini Native when: you're already on Google's official SDK / you need Gemini-only features (precise thinkingBudget, fine-grained safetySettings, avgLogprobs confidence output) / you want zero-translation token accounting.
Stick with the OpenAI format when: you call multiple providers (GPT + Claude + Gemini) from the same codebase / you're migrating from OpenAI and don't want to swap SDKs.

Endpoints

Purpose	Endpoint
Non-streaming	`POST /v1beta/models/{model}:generateContent`
Streaming (SSE)	`POST /v1beta/models/{model}:streamGenerateContent?alt=sse`

Replace {model} with a model ID like gemini-3.5-flash.

Authentication (3 methods)

Pick any one. Header methods are recommended over query param:

1. Authorization header (recommended)

Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx

2. x-api-key header

x-api-key: pk_live_xxxxxxxxxxxxxxxx

3. `?key=` query parameter (Google SDK default)

POST /v1beta/models/{model}:generateContent?key=pk_live_xxxxxxxxxxxxxxxx

⚠️ With ?key= the API key appears in the URL and may be captured by CDN logs or browser history. Prefer the header methods in production.

Request body

Fully follows the Google generateContent reference. Common fields:

Field	Type	Required	Description
`contents`	array	✓	Conversation history. Each item has `role` (`user` / `model`) and a `parts` array
`systemInstruction`	object	✗	System instruction. Shape: `{parts: [{text: "..."}]}`
`generationConfig`	object	✗	Output parameters (maxOutputTokens, temperature, topP, topK, thinkingConfig, ...)
`tools`	array	✗	Function calling tool definitions
`toolConfig`	object	✗	Tool calling mode configuration
`safetySettings`	array	✗	Fine-grained safety threshold control
`cachedContent`	string	✗	Reference an existing cached content resource (creation endpoint not yet supported)

Example: plain text

curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:generateContent \
  -H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      { "parts": [{ "text": "Hi, briefly introduce yourself." }] }
    ],
    "generationConfig": {
      "maxOutputTokens": 512
    }
  }'

Response:

{
  "candidates": [{
    "content": {
      "role": "model",
      "parts": [{ "text": "Hello! I'm Gemini..." }]
    },
    "finishReason": "STOP",
    "avgLogprobs": -1.23
  }],
  "usageMetadata": {
    "promptTokenCount": 10,
    "candidatesTokenCount": 50,
    "totalTokenCount": 60,
    "thoughtsTokenCount": 0
  },
  "modelVersion": "gemini-3.5-flash"
}

Example: streaming

curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:streamGenerateContent?alt=sse \
  -H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{ "parts": [{ "text": "Tell me a one-sentence story." }] }],
    "generationConfig": { "maxOutputTokens": 1024 }
  }'

Returns SSE format. Each chunk:

data: {"candidates":[{"content":{"role":"model","parts":[{"text":"Once upon..."}]}}],"modelVersion":"gemini-3.5-flash"}

data: {"candidates":[{"content":{"role":"model","parts":[{"text":" a time..."}]},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":8,"candidatesTokenCount":15,"totalTokenCount":23}}

The last chunk carries the complete usageMetadata. The Google SDK accumulates this automatically.

Example: multimodal (image understanding)

Add an inline_data part (base64-encoded image) to the parts array:

IMG_B64=$(base64 -w 0 photo.jpg)

curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:generateContent \
  -H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d "{
    \"contents\": [{
      \"parts\": [
        { \"text\": \"What is in this image?\" },
        { \"inline_data\": { \"mime_type\": \"image/jpeg\", \"data\": \"$IMG_B64\" } }
      ]
    }]
  }"

Supported MIME types: image/jpeg, image/png, image/webp, image/heic, image/heif.

Note: total request body size limit is 30MB. The Files API for huge files (video / PDF) is not yet supported.

Example: precise thinking budget

Thinking-capable models (e.g. gemini-3.5-flash) expose explicit reasoning token control:

{
  "contents": [{ "parts": [{ "text": "..." }] }],
  "generationConfig": {
    "maxOutputTokens": 2048,
    "thinkingConfig": {
      "thinkingBudget": 1024
    }
  }
}

thinkingBudget semantics:

Value	Behavior
`0`	Disable thinking, respond immediately
`> 0`	Exact thinking budget in tokens
`-1`	Unlimited, model decides

💡 Thinking models can burn significant tokens on reasoning by default. If you only need short answers, explicitly set thinkingBudget: 0 or a small value — otherwise max_tokens may be exhausted by reasoning and the visible output gets cut off.

Using Google's official SDK

This is the headline value of the Gemini Native endpoint — change one baseUrl line to migrate from Google official.

Python (`google-genai`)

from google import genai

client = genai.Client(
    api_key="pk_live_xxxxxxxxxxxxxxxx",
    http_options={"base_url": "https://api.tokensmart.ai"},
)

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Hi",
)
print(response.text)

Node.js (`@google/genai`)

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({
  apiKey: "pk_live_xxxxxxxxxxxxxxxx",
  httpOptions: { baseUrl: "https://api.tokensmart.ai" },
});

const response = await ai.models.generateContent({
  model: "gemini-3.5-flash",
  contents: "Hello",
});
console.log(response.text);

Application code is unchanged. SDK behavior, field names, error handling — all preserve Google's official semantics.

Token billing breakdown

Gemini's usageMetadata is billed as follows:

Field	Billing rate
`promptTokenCount`	Model's input_price
`candidatesTokenCount`	output_price (visible output)
`thoughtsTokenCount`	output_price (reasoning is also output)
`cachedContentTokenCount`	cache_read_price (much lower than input_price)

Multimodal image input tokens appear under promptTokensDetails[modality=IMAGE] and are billed at input_price (same rate as text tokens).

Error response format

Errors come back in Google's native shape:

{
  "error": {
    "code": 404,
    "message": "Model 'xxx' is not available",
    "status": "NOT_FOUND"
  }
}

Common errors:

HTTP	status	Meaning
401	`UNAUTHENTICATED`	Invalid or missing API key
403	`PERMISSION_DENIED`	Key has no access to this model, or account suspended
404	`NOT_FOUND`	Model does not exist or has been retired
402	`FAILED_PRECONDITION`	Insufficient balance
429	`RESOURCE_EXHAUSTED`	Rate limit or concurrent connection limit triggered
502	`UNAVAILABLE`	Upstream gateway failure
501	`UNIMPLEMENTED`	Endpoint not yet implemented

Endpoint coverage status

Google endpoint	Tokensmart status
`:generateContent` non-stream	✅ Fully supported
`:streamGenerateContent` stream	✅ Fully supported (SSE)
`:countTokens` pre-count	❌ Not yet implemented
`:embedContent` / `:batchEmbedContents`	❌ Not implemented — embedding users please use the OpenAI-compatible endpoints
Google-format `GET /v1beta/models`	❌ Use `/v1/models` (OpenAI format) instead
Files API (`/v1beta/files`)	❌ Not yet — for large files use `inline_data` (30MB cap)
Cached Content explicit creation	❌ Not yet implemented
Imagen text-to-image (`:predict`)	❌ For image generation use `/v1/images/generations`
Batch async (`:batchGenerateContent`)	❌ Not yet implemented

Supported Gemini models

See the model list for the current set of available models. Every gemini-* model is callable through the Gemini Native endpoint.

If a model is available on both the OpenAI-compatible and Gemini Native endpoints, either protocol works equally well — billing and rate limits are identical across both.