Gemini Native API
Google Gemini's native protocol endpoint. Lets you use Google's official SDKs (google-genai / @google/genai) against Tokensmart with zero code changes, accessing the full Gemini multimodal feature set.
π‘ Already using the OpenAI SDK? You can keep calling
/v1/chat/completionsfor Gemini too β just setmodeltogemini-3.5-flashor any other Gemini model. You don't have to switch to this endpoint.
Use Gemini Native when: you're already on Google's official SDK / you need Gemini-only features (precisethinkingBudget, fine-grainedsafetySettings,avgLogprobsconfidence output) / you want zero-translation token accounting.
Stick with the OpenAI format when: you call multiple providers (GPT + Claude + Gemini) from the same codebase / you're migrating from OpenAI and don't want to swap SDKs.
Endpoints
| Purpose | Endpoint |
|---|---|
| Non-streaming | POST /v1beta/models/{model}:generateContent |
| Streaming (SSE) | POST /v1beta/models/{model}:streamGenerateContent?alt=sse |
Replace {model} with a model ID like gemini-3.5-flash.
Authentication (3 methods)
Pick any one. Header methods are recommended over query param:
1. Authorization header (recommended)
Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx
2. x-api-key header
x-api-key: pk_live_xxxxxxxxxxxxxxxx
3. ?key= query parameter (Google SDK default)
POST /v1beta/models/{model}:generateContent?key=pk_live_xxxxxxxxxxxxxxxx
β οΈ With
?key=the API key appears in the URL and may be captured by CDN logs or browser history. Prefer the header methods in production.
Request body
Fully follows the Google generateContent reference. Common fields:
| Field | Type | Required | Description |
|---|---|---|---|
contents | array | β | Conversation history. Each item has role (user / model) and a parts array |
systemInstruction | object | β | System instruction. Shape: {parts: [{text: "..."}]} |
generationConfig | object | β | Output parameters (maxOutputTokens, temperature, topP, topK, thinkingConfig, ...) |
tools | array | β | Function calling tool definitions |
toolConfig | object | β | Tool calling mode configuration |
safetySettings | array | β | Fine-grained safety threshold control |
cachedContent | string | β | Reference an existing cached content resource (creation endpoint not yet supported) |
Example: plain text
curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:generateContent \
-H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d '{
"contents": [
{ "parts": [{ "text": "Hi, briefly introduce yourself." }] }
],
"generationConfig": {
"maxOutputTokens": 512
}
}'
Response:
{
"candidates": [{
"content": {
"role": "model",
"parts": [{ "text": "Hello! I'm Gemini..." }]
},
"finishReason": "STOP",
"avgLogprobs": -1.23
}],
"usageMetadata": {
"promptTokenCount": 10,
"candidatesTokenCount": 50,
"totalTokenCount": 60,
"thoughtsTokenCount": 0
},
"modelVersion": "gemini-3.5-flash"
}
Example: streaming
curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:streamGenerateContent?alt=sse \
-H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d '{
"contents": [{ "parts": [{ "text": "Tell me a one-sentence story." }] }],
"generationConfig": { "maxOutputTokens": 1024 }
}'
Returns SSE format. Each chunk:
data: {"candidates":[{"content":{"role":"model","parts":[{"text":"Once upon..."}]}}],"modelVersion":"gemini-3.5-flash"}
data: {"candidates":[{"content":{"role":"model","parts":[{"text":" a time..."}]},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":8,"candidatesTokenCount":15,"totalTokenCount":23}}
The last chunk carries the complete usageMetadata. The Google SDK accumulates this automatically.
Example: multimodal (image understanding)
Add an inline_data part (base64-encoded image) to the parts array:
IMG_B64=$(base64 -w 0 photo.jpg)
curl https://api.tokensmart.ai/v1beta/models/gemini-3.5-flash:generateContent \
-H "Authorization: Bearer pk_live_xxxxxxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d "{
\"contents\": [{
\"parts\": [
{ \"text\": \"What is in this image?\" },
{ \"inline_data\": { \"mime_type\": \"image/jpeg\", \"data\": \"$IMG_B64\" } }
]
}]
}"
Supported MIME types: image/jpeg, image/png, image/webp, image/heic, image/heif.
Note: total request body size limit is 30MB. The Files API for huge files (video / PDF) is not yet supported.
Example: precise thinking budget
Thinking-capable models (e.g. gemini-3.5-flash) expose explicit reasoning token control:
{
"contents": [{ "parts": [{ "text": "..." }] }],
"generationConfig": {
"maxOutputTokens": 2048,
"thinkingConfig": {
"thinkingBudget": 1024
}
}
}
thinkingBudget semantics:
| Value | Behavior |
|---|---|
0 | Disable thinking, respond immediately |
> 0 | Exact thinking budget in tokens |
-1 | Unlimited, model decides |
π‘ Thinking models can burn significant tokens on reasoning by default. If you only need short answers, explicitly set
thinkingBudget: 0or a small value β otherwisemax_tokensmay be exhausted by reasoning and the visible output gets cut off.
Using Google's official SDK
This is the headline value of the Gemini Native endpoint β change one baseUrl line to migrate from Google official.
Python (google-genai)
from google import genai
client = genai.Client(
api_key="pk_live_xxxxxxxxxxxxxxxx",
http_options={"base_url": "https://api.tokensmart.ai"},
)
response = client.models.generate_content(
model="gemini-3.5-flash",
contents="Hi",
)
print(response.text)
Node.js (@google/genai)
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({
apiKey: "pk_live_xxxxxxxxxxxxxxxx",
httpOptions: { baseUrl: "https://api.tokensmart.ai" },
});
const response = await ai.models.generateContent({
model: "gemini-3.5-flash",
contents: "Hello",
});
console.log(response.text);
Application code is unchanged. SDK behavior, field names, error handling β all preserve Google's official semantics.
Token billing breakdown
Gemini's usageMetadata is billed as follows:
| Field | Billing rate |
|---|---|
promptTokenCount | Model's input_price |
candidatesTokenCount | output_price (visible output) |
thoughtsTokenCount | output_price (reasoning is also output) |
cachedContentTokenCount | cache_read_price (much lower than input_price) |
Multimodal image input tokens appear under promptTokensDetails[modality=IMAGE] and are billed at input_price (same rate as text tokens).
Error response format
Errors come back in Google's native shape:
{
"error": {
"code": 404,
"message": "Model 'xxx' is not available",
"status": "NOT_FOUND"
}
}
Common errors:
| HTTP | status | Meaning |
|---|---|---|
| 401 | UNAUTHENTICATED | Invalid or missing API key |
| 403 | PERMISSION_DENIED | Key has no access to this model, or account suspended |
| 404 | NOT_FOUND | Model does not exist or has been retired |
| 402 | FAILED_PRECONDITION | Insufficient balance |
| 429 | RESOURCE_EXHAUSTED | Rate limit or concurrent connection limit triggered |
| 502 | UNAVAILABLE | Upstream gateway failure |
| 501 | UNIMPLEMENTED | Endpoint not yet implemented |
Endpoint coverage status
| Google endpoint | Tokensmart status |
|---|---|
:generateContent non-stream | β Fully supported |
:streamGenerateContent stream | β Fully supported (SSE) |
:countTokens pre-count | β Not yet implemented |
:embedContent / :batchEmbedContents | β Not implemented β embedding users please use the OpenAI-compatible endpoints |
Google-format GET /v1beta/models | β Use /v1/models (OpenAI format) instead |
Files API (/v1beta/files) | β Not yet β for large files use inline_data (30MB cap) |
| Cached Content explicit creation | β Not yet implemented |
Imagen text-to-image (:predict) | β For image generation use /v1/images/generations |
Batch async (:batchGenerateContent) | β Not yet implemented |
Supported Gemini models
See the model list for the current set of available models. Every gemini-* model is callable through the Gemini Native endpoint.
If a model is available on both the OpenAI-compatible and Gemini Native endpoints, either protocol works equally well β billing and rate limits are identical across both.