Chat API

POST https://api.rana.ai/v1/chat/completions

The Chat API lets you analyze existing text, generate new text, and hold dynamic, multi-turn conversations with a variety of open-source LLMs.

Request Body


Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| messages | Array | An array of messages comprising the conversation history. Each message is an object with a role and content. Supported roles are: system, user, and assistant. |
| model | String | ID of the model to use for the completion (e.g., DeepSeek-R1-Distill-Llama-8B-q4f32_1-MLC). |
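For reference, a minimal request body using only the required parameters (the model ID here is one of the IDs listed under Available Model IDs below):

{
  "model": "Llama-3.2-1B-Instruct-q4f32_1-MLC",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ]
}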

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| frequency_penalty | Number | 0 | Penalty applied to frequently used tokens; a value between -2.0 and 2.0. |
| logit_bias | Object | null | Maps token IDs to bias values between -100 and 100 to adjust the likelihood of each token being selected. |
| max_completion_tokens | Integer | null | Maximum number of tokens that can be generated for a completion, including visible and reasoning tokens. |
| n | Integer | 1 | Number of chat completion choices to generate. |
| presence_penalty | Number | 0 | Penalty applied to tokens that have already appeared; a value between -2.0 and 2.0. |
| stream | Boolean | false | If true, partial message deltas are streamed back as they are generated. |
| stop | String or Array | null | Up to 4 sequences at which the API stops generating further tokens. |
| temperature | Number | 1 | Sampling randomness; a value between 0 and 2, where lower values produce more focused output. |
| tools | Array | null | A list of functions the model may call during generation. |
| tool_choice | String or Object | none | Controls tool selection behavior. Options: none, auto, required. |
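This document does not define the schema for tools; the sketch below assumes the widely used OpenAI-style function-calling format, and get_weather is a hypothetical function shown only for illustration:

{
  "model": "Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC",
  "messages": [
    { "role": "user", "content": "What is the weather in Paris?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Hypothetical: returns the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" }
          },
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}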

Available Model IDs


DeepSeek Models
| Model Name | Description |
| --- | --- |
| DeepSeek-R1-Distill-Qwen-7B-q4f16_1-MLC | DeepSeek R1 Distilled Qwen 7B Model (4-bit quantized, 16-bit float) |
| DeepSeek-R1-Distill-Qwen-7B-q4f32_1-MLC | DeepSeek R1 Distilled Qwen 7B Model (4-bit quantized, 32-bit float) |
| DeepSeek-R1-Distill-Llama-8B-q4f32_1-MLC | DeepSeek R1 Distilled Llama 8B Model (4-bit quantized, 32-bit float) |
| DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC | DeepSeek R1 Distilled Llama 8B Model (4-bit quantized, 16-bit float) |

Llama Models
| Model Name | Description |
| --- | --- |
| Llama-3.2-1B-Instruct-q4f32_1-MLC | Llama 3.2 1B Instruct Model (4-bit quantized, 32-bit float) |
| Llama-3.2-1B-Instruct-q4f16_1-MLC | Llama 3.2 1B Instruct Model (4-bit quantized, 16-bit float) |
| Llama-3.2-1B-Instruct-q0f32-MLC | Llama 3.2 1B Instruct Model (32-bit float) |
| Llama-3.2-1B-Instruct-q0f16-MLC | Llama 3.2 1B Instruct Model (16-bit float) |
| Llama-3.2-3B-Instruct-q4f32_1-MLC | Llama 3.2 3B Instruct Model (4-bit quantized, 32-bit float) |
| Llama-3.2-3B-Instruct-q4f16_1-MLC | Llama 3.2 3B Instruct Model (4-bit quantized, 16-bit float) |
| Llama-3.1-8B-Instruct-q4f32_1-MLC-1k | Llama 3.1 8B Instruct Model (4-bit quantized, 32-bit float, 1k context) |
| Llama-3.1-8B-Instruct-q4f16_1-MLC-1k | Llama 3.1 8B Instruct Model (4-bit quantized, 16-bit float, 1k context) |
| Llama-3.1-8B-Instruct-q4f32_1-MLC | Llama 3.1 8B Instruct Model (4-bit quantized, 32-bit float) |
| Llama-3.1-8B-Instruct-q4f16_1-MLC | Llama 3.1 8B Instruct Model (4-bit quantized, 16-bit float) |
| Llama-3-8B-Instruct-q4f32_1-MLC-1k | Llama 3 8B Instruct Model (4-bit quantized, 32-bit float, 1k context) |
| Llama-3-8B-Instruct-q4f16_1-MLC-1k | Llama 3 8B Instruct Model (4-bit quantized, 16-bit float, 1k context) |
| Llama-3-8B-Instruct-q4f32_1-MLC | Llama 3 8B Instruct Model (4-bit quantized, 32-bit float) |
| Llama-3-8B-Instruct-q4f16_1-MLC | Llama 3 8B Instruct Model (4-bit quantized, 16-bit float) |
| Llama-3-70B-Instruct-q3f16_1-MLC | Llama 3 70B Instruct Model (3-bit quantized, 16-bit float) |
| Llama-3.1-70B-Instruct-q3f16_1-MLC | Llama 3.1 70B Instruct Model (3-bit quantized, 16-bit float) |
| Llama-2-7b-chat-hf-q4f32_1-MLC-1k | Llama 2 7B Chat Model (4-bit quantized, 32-bit float, 1k context) |
| Llama-2-7b-chat-hf-q4f16_1-MLC-1k | Llama 2 7B Chat Model (4-bit quantized, 16-bit float, 1k context) |
| Llama-2-7b-chat-hf-q4f32_1-MLC | Llama 2 7B Chat Model (4-bit quantized, 32-bit float) |
| Llama-2-7b-chat-hf-q4f16_1-MLC | Llama 2 7B Chat Model (4-bit quantized, 16-bit float) |
| Llama-2-13b-chat-hf-q4f16_1-MLC | Llama 2 13B Chat Model (4-bit quantized, 16-bit float) |

Mistral & Hermes Models
| Model Name | Description |
| --- | --- |
| Mistral-7B-Instruct-v0.3-q4f16_1-MLC | Mistral 7B Instruct v0.3 (4-bit quantized, 16-bit float) |
| Mistral-7B-Instruct-v0.3-q4f32_1-MLC | Mistral 7B Instruct v0.3 (4-bit quantized, 32-bit float) |
| Mistral-7B-Instruct-v0.2-q4f16_1-MLC | Mistral 7B Instruct v0.2 (4-bit quantized, 16-bit float) |
| Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC | Hermes 2 Pro Llama 3 8B Model (4-bit quantized, 16-bit float) |
| Hermes-2-Pro-Llama-3-8B-q4f32_1-MLC | Hermes 2 Pro Llama 3 8B Model (4-bit quantized, 32-bit float) |
| Hermes-2-Pro-Mistral-7B-q4f16_1-MLC | Hermes 2 Pro Mistral 7B Model (4-bit quantized, 16-bit float) |
| OpenHermes-2.5-Mistral-7B-q4f16_1-MLC | OpenHermes 2.5 Mistral 7B Model (4-bit quantized, 16-bit float) |
| NeuralHermes-2.5-Mistral-7B-q4f16_1-MLC | NeuralHermes 2.5 Mistral 7B Model (4-bit quantized, 16-bit float) |

Phi Models
| Model Name | Description |
| --- | --- |
| Phi-3-mini-4k-instruct-q4f16_1-MLC | Phi-3 Mini 4K Instruct Model (4-bit quantized, 16-bit float) |
| Phi-3-mini-4k-instruct-q4f32_1-MLC | Phi-3 Mini 4K Instruct Model (4-bit quantized, 32-bit float) |
| Phi-3-mini-4k-instruct-q4f16_1-MLC-1k | Phi-3 Mini 4K Instruct Model (4-bit quantized, 16-bit float, 1k context) |
| Phi-3-mini-4k-instruct-q4f32_1-MLC-1k | Phi-3 Mini 4K Instruct Model (4-bit quantized, 32-bit float, 1k context) |
| phi-2-q4f16_1-MLC | Phi-2 Model (4-bit quantized, 16-bit float) |
| phi-2-q4f32_1-MLC | Phi-2 Model (4-bit quantized, 32-bit float) |
| phi-2-q4f16_1-MLC-1k | Phi-2 Model (4-bit quantized, 16-bit float, 1k context) |
| phi-2-q4f32_1-MLC-1k | Phi-2 Model (4-bit quantized, 32-bit float, 1k context) |
| phi-1_5-q4f16_1-MLC | Phi-1.5 Model (4-bit quantized, 16-bit float) |
| phi-1_5-q4f32_1-MLC | Phi-1.5 Model (4-bit quantized, 32-bit float) |
| phi-1_5-q4f16_1-MLC-1k | Phi-1.5 Model (4-bit quantized, 16-bit float, 1k context) |
| phi-1_5-q4f32_1-MLC-1k | Phi-1.5 Model (4-bit quantized, 32-bit float, 1k context) |

RedPajama Models
| Model Name | Description |
| --- | --- |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC | RedPajama Chat 3B Model (4-bit quantized, 16-bit float) |
| RedPajama-INCITE-Chat-3B-v1-q4f32_1-MLC | RedPajama Chat 3B Model (4-bit quantized, 32-bit float) |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC-1k | RedPajama Chat 3B Model (4-bit quantized, 16-bit float, 1k context) |
| RedPajama-INCITE-Chat-3B-v1-q4f32_1-MLC-1k | RedPajama Chat 3B Model (4-bit quantized, 32-bit float, 1k context) |

SmolLM Models
| Model Name | Description |
| --- | --- |
| SmolLM2-1.7B-Instruct-q4f16_1-MLC | SmolLM2 1.7B Instruct Model (4-bit quantized, 16-bit float) |
| SmolLM2-1.7B-Instruct-q4f32_1-MLC | SmolLM2 1.7B Instruct Model (4-bit quantized, 32-bit float) |
| SmolLM2-360M-Instruct-q0f16-MLC | SmolLM2 360M Instruct Model (16-bit float) |
| SmolLM2-360M-Instruct-q0f32-MLC | SmolLM2 360M Instruct Model (32-bit float) |
| SmolLM2-360M-Instruct-q4f16_1-MLC | SmolLM2 360M Instruct Model (4-bit quantized, 16-bit float) |
| SmolLM2-360M-Instruct-q4f32_1-MLC | SmolLM2 360M Instruct Model (4-bit quantized, 32-bit float) |
| SmolLM2-135M-Instruct-q0f16-MLC | SmolLM2 135M Instruct Model (16-bit float) |
| SmolLM2-135M-Instruct-q0f32-MLC | SmolLM2 135M Instruct Model (32-bit float) |

TinyLlama Models
| Model Name | Description |
| --- | --- |
| TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC | TinyLlama Chat 1.1B v1.0 Model (4-bit quantized, 16-bit float) |
| TinyLlama-1.1B-Chat-v1.0-q4f32_1-MLC | TinyLlama Chat 1.1B v1.0 Model (4-bit quantized, 32-bit float) |
| TinyLlama-1.1B-Chat-v1.0-q4f16_1-MLC-1k | TinyLlama Chat 1.1B v1.0 Model (4-bit quantized, 16-bit float, 1k context) |
| TinyLlama-1.1B-Chat-v1.0-q4f32_1-MLC-1k | TinyLlama Chat 1.1B v1.0 Model (4-bit quantized, 32-bit float, 1k context) |
| TinyLlama-1.1B-Chat-v0.4-q4f16_1-MLC | TinyLlama Chat 1.1B v0.4 Model (4-bit quantized, 16-bit float) |
| TinyLlama-1.1B-Chat-v0.4-q4f32_1-MLC | TinyLlama Chat 1.1B v0.4 Model (4-bit quantized, 32-bit float) |
| TinyLlama-1.1B-Chat-v0.4-q4f16_1-MLC-1k | TinyLlama Chat 1.1B v0.4 Model (4-bit quantized, 16-bit float, 1k context) |
| TinyLlama-1.1B-Chat-v0.4-q4f32_1-MLC-1k | TinyLlama Chat 1.1B v0.4 Model (4-bit quantized, 32-bit float, 1k context) |

Qwen2 Models
| Model Name | Description |
| --- | --- |
| Qwen2-0.5B-Instruct-q4f16_1-MLC | Qwen2 0.5B Instruct Model (4-bit quantized, 16-bit float) |
| Qwen2-0.5B-Instruct-q0f16-MLC | Qwen2 0.5B Instruct Model (16-bit float) |
| Qwen2-0.5B-Instruct-q0f32-MLC | Qwen2 0.5B Instruct Model (32-bit float) |
| Qwen2-1.5B-Instruct-q4f16_1-MLC | Qwen2 1.5B Instruct Model (4-bit quantized, 16-bit float) |
| Qwen2-1.5B-Instruct-q4f32_1-MLC | Qwen2 1.5B Instruct Model (4-bit quantized, 32-bit float) |
| Qwen2-7B-Instruct-q4f16_1-MLC | Qwen2 7B Instruct Model (4-bit quantized, 16-bit float) |
| Qwen2-7B-Instruct-q4f32_1-MLC | Qwen2 7B Instruct Model (4-bit quantized, 32-bit float) |

Other Models
| Model Name | Description |
| --- | --- |
| gemma-2b-it-q4f16_1-MLC | Gemma 2B Instruct Model (4-bit quantized, 16-bit float) |
| gemma-2b-it-q4f32_1-MLC | Gemma 2B Instruct Model (4-bit quantized, 32-bit float) |
| gemma-2b-it-q4f16_1-MLC-1k | Gemma 2B Instruct Model (4-bit quantized, 16-bit float, 1k context) |
| gemma-2b-it-q4f32_1-MLC-1k | Gemma 2B Instruct Model (4-bit quantized, 32-bit float, 1k context) |
| stablelm-2-zephyr-1_6b-q4f16_1-MLC | StableLM 2 Zephyr 1.6B Model (4-bit quantized, 16-bit float) |
| stablelm-2-zephyr-1_6b-q4f32_1-MLC | StableLM 2 Zephyr 1.6B Model (4-bit quantized, 32-bit float) |
| stablelm-2-zephyr-1_6b-q4f16_1-MLC-1k | StableLM 2 Zephyr 1.6B Model (4-bit quantized, 16-bit float, 1k context) |
| stablelm-2-zephyr-1_6b-q4f32_1-MLC-1k | StableLM 2 Zephyr 1.6B Model (4-bit quantized, 32-bit float, 1k context) |
| WizardMath-7B-V1.1-q4f16_1-MLC | WizardMath 7B Model (4-bit quantized, 16-bit float) |

Response Format (Non-Streaming)

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "Llama-3.2-1B-Instruct-q4f32_1-MLC",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
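To extract just the assistant's reply on the command line, you can pipe the response through jq (assuming jq is installed):

curl -s http://localhost:6969/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-q4f32_1-MLC",
    "messages": [{"role": "user", "content": "Hello!"}]
  }' \
  | jq -r '.choices[0].message.content'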

Response Format (Streaming)

{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"Llama-3.2-1B-Instruct-q4f32_1-MLC", "choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"Llama-3.2-1B-Instruct-q4f32_1-MLC", "choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}]}

....

{"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"Llama-3.2-1B-Instruct-q4f32_1-MLC", "choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}

Examples

Local Non-Streaming Request (cURL)

curl http://localhost:6969/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": false,
    "model": "Llama-3.2-3B-Instruct-q4f16_1-MLC",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you tell me the meaning of life?"
      }
    ]
  }'

Local Streaming Request (cURL)

curl --http1.1 -N http://localhost:6969/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "stream": true,
    "model": "Llama-3.2-3B-Instruct-q4f16_1-MLC",
    "messages": [
      {
        "role": "assistant",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you tell me the meaning of life?"
      }
    ]
  }'

Remote Request (cURL)

curl --http1.1 -N https://api.rana.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "stream": true,
    "model": "Llama-3.2-3B-Instruct-q4f16_1-MLC",
    "messages": [
      {
        "role": "assistant",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Can you tell me the meaning of life?"
      }
    ]
  }'