DINO-X MCP

![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)

The official DINO-X MCP Server, powered by the DINO-X and Grounding DINO models, brings fine-grained object detection and image understanding to your multimodal applications.

Why DINO-X MCP?

With DINO-X MCP, you get:

- Fine-Grained Understanding: full-image detection, object detection, and region-level descriptions.
- Structured Outputs: object categories, counts, locations, and attributes for VQA and multi-step reasoning tasks.
- Composable: works with other MCP servers to build end-to-end visual agents or automation pipelines.

Transport Modes

DINO-X MCP supports two transport modes:

| Feature | STDIO (default) | Streamable HTTP |
| --- | --- | --- |
| Runtime | Local | Local or Cloud |
| Transport | Standard I/O | HTTP (streaming responses) |
| Input source | `file://` and `https://` | `https://` only |
| Visualization | Supported (saves annotated images locally) | Not supported (for now) |
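
The "Input source" row can be expressed as a small check. The TypeScript below is an illustrative sketch only: `TransportMode`, `ALLOWED_SCHEMES`, and `isAllowedImageUri` are hypothetical names and are not taken from the DINO-X MCP codebase.

```typescript
// Hypothetical helper mirroring the "Input source" row of the table.
type TransportMode = "stdio" | "http";

// URI schemes each transport accepts, as listed in the table.
const ALLOWED_SCHEMES: Record<TransportMode, readonly string[]> = {
  stdio: ["file:", "https:"],
  http: ["https:"],
};

function isAllowedImageUri(uri: string, mode: TransportMode): boolean {
  let parsed: URL;
  try {
    parsed = new URL(uri);
  } catch {
    return false; // not a valid absolute URI
  }
  // URL.protocol includes the trailing colon, e.g. "https:".
  return ALLOWED_SCHEMES[mode].includes(parsed.protocol);
}

console.log(isAllowedImageUri("file:///tmp/cat.jpg", "stdio")); // true
console.log(isAllowedImageUri("file:///tmp/cat.jpg", "http")); // false
```

In practice this means that an image on your local disk has to be uploaded somewhere reachable over `https://` before it can be used with the Streamable HTTP transport.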

Quick Start

1. Prepare an MCP client

Any MCP-compatible client works.

2. Get your API key

Apply for a key on the DINO-X platform: Request API Key (new users receive a free quota).

3. Configure MCP

Option A: Official Hosted Streamable HTTP (Recommended)

Add the following to your MCP client config, replacing the placeholder with your API key:

*Configuration content*

Option B: Use the NPM package locally (STDIO)

Install Node.js first

- Download the installer from nodejs.org
- Or install it from the command line:

```bash
# macOS / Linux
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
# or
wget -qO- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash

# load nvm into current shell (choose the one you use)
source ~/.bashrc || true
source ~/.zshrc  || true

# install and use LTS Node.js
nvm install --lts
nvm use --lts
```

```powershell
# Windows (one of the following)
winget install OpenJS.NodeJS.LTS
# or with Chocolatey (in admin PowerShell)
iwr -useb https://raw.githubusercontent.com/chocolatey/chocolatey/master/chocolateyInstall/InstallChocolatey.ps1 | iex
choco install nodejs-lts -y
```

Configure your MCP client:

*Configuration content*

Note: Replace `your-api-key-here` with your real key.

Option C: Run from source locally

Make sure Node.js is installed (see Option B), then:

```bash
# clone
git clone https://github.com/IDEA-Research/DINO-X-MCP.git
cd DINO-X-MCP

# install deps
npm install

# build
npm run build
```

Configure your MCP client:

*Configuration content*
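
As a rough sketch of what a source-based STDIO entry might look like, the TypeScript below builds a client config object. It rests on several assumptions: that your client reads a JSON file with a top-level `mcpServers` map (a layout used by a number of MCP clients), that the server is started with `node build/index.js` as in the CLI examples in this README, and that the key is passed via `DINOX_API_KEY`. The server label `dinox-mcp`, the function `buildSourceConfig`, and the checkout path are hypothetical.

```typescript
// Assumption: the client reads a JSON config with a top-level "mcpServers" map.
interface StdioServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

interface ClientConfig {
  mcpServers: Record<string, StdioServerEntry>;
}

// Hypothetical helper: repoDir is the directory where DINO-X-MCP was cloned and built.
function buildSourceConfig(repoDir: string, apiKey: string): ClientConfig {
  return {
    mcpServers: {
      // "dinox-mcp" is an arbitrary label chosen for this sketch.
      "dinox-mcp": {
        command: "node",
        args: [`${repoDir}/build/index.js`],
        env: { DINOX_API_KEY: apiKey },
      },
    },
  };
}

// Print the JSON you would paste into the client's config file.
console.log(JSON.stringify(buildSourceConfig("/path/to/DINO-X-MCP", "your-api-key-here"), null, 2));
```

Check your client's documentation for the exact file location and schema before relying on this layout.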

CLI Flags & Environment Variables

Common flags

- `--http`: start in Streamable HTTP mode (STDIO is the default)
- `--stdio`: force STDIO mode
- `--dinox-api-key=...`: set the API key
- `--enable-client-key`: allow clients to pass their API key in the URL as `?key=` (Streamable HTTP only)
- `--port=<number>`: HTTP port (default 3020), e.g. `--port=8080`

Environment variables

- `DINOX_API_KEY` (conditionally required): DINO-X platform API key; can be omitted when the key is passed with `--dinox-api-key` or supplied by clients via `--enable-client-key`
- `IMAGE_STORAGE_DIRECTORY` (optional, STDIO): directory to save annotated images
- `AUTH_TOKEN` (optional, HTTP): if set, clients must send `Authorization: Bearer <token>`

Examples:

```bash
# STDIO (local)
node build/index.js --dinox-api-key=your-api-key

# Streamable HTTP (server provides a shared API key)
node build/index.js --http --dinox-api-key=your-api-key

# Streamable HTTP (custom port)
node build/index.js --http --dinox-api-key=your-api-key --port=8080

# Streamable HTTP (require client-provided API key via URL)
node build/index.js --http --enable-client-key
```
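
To make the interaction between the flags and environment variables concrete, here is a minimal TypeScript sketch of how they could resolve into one options object. It is not the server's actual parsing code. `ServerOptions` and `resolveOptions` are hypothetical names, and two behaviours are assumptions: `--stdio` wins if both mode flags are given, and `--dinox-api-key` takes precedence over `DINOX_API_KEY`.

```typescript
// Hypothetical options shape; field names are chosen for this sketch only.
interface ServerOptions {
  mode: "stdio" | "http";
  port: number;
  apiKey: string | undefined;
  enableClientKey: boolean;
}

function resolveOptions(argv: string[], env: Record<string, string | undefined>): ServerOptions {
  // Return the value of a "--name=value" flag, or undefined if absent.
  const valueOf = (name: string): string | undefined => {
    const prefix = `--${name}=`;
    const hit = argv.find((a) => a.startsWith(prefix));
    return hit === undefined ? undefined : hit.slice(prefix.length);
  };

  // Assumption: --stdio wins if both mode flags are present.
  const mode = argv.includes("--http") && !argv.includes("--stdio") ? "http" : "stdio";

  // Default port 3020, as documented for --port.
  const portText = valueOf("port");
  const port = portText === undefined ? 3020 : Number.parseInt(portText, 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port: ${portText}`);
  }

  return {
    mode,
    port,
    // Assumption: the flag takes precedence over the environment variable.
    apiKey: valueOf("dinox-api-key") ?? env.DINOX_API_KEY,
    enableClientKey: argv.includes("--enable-client-key"),
  };
}

console.log(resolveOptions(["--http", "--port=8080", "--enable-client-key"], {}));
```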

Client config when using ?key=:

*Configuration content*
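
When the server runs with `--enable-client-key`, each client appends its own key to the server URL. The sketch below shows one way to build such a URL in TypeScript. The function name `withClientKey` is hypothetical, and the address `http://localhost:3020/mcp` (in particular the `/mcp` path) is an assumption for illustration; use the address your server actually listens on.

```typescript
// Hypothetical helper: append the client's API key as the ?key= query parameter.
function withClientKey(serverUrl: string, apiKey: string): string {
  const url = new URL(serverUrl);
  url.searchParams.set("key", apiKey); // the value is URL-encoded automatically
  return url.toString();
}

// Assumed local address; 3020 is the documented default port.
console.log(withClientKey("http://localhost:3020/mcp", "your-api-key"));
// prints "http://localhost:3020/mcp?key=your-api-key"
```

Keep in mind that a key carried in a URL can end up in proxy and server logs.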

Using AUTH_TOKEN with a gateway that injects Authorization: Bearer <token>:

```bash
AUTH_TOKEN=my-token node build/index.js --http --enable-client-key
```
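
The `AUTH_TOKEN` rule (if set, the client must send `Authorization: Bearer <token>`) amounts to a check like the one sketched below. This is an illustration of the documented behaviour, not the server's code. `isAuthorized` is a hypothetical name.

```typescript
// Hypothetical check mirroring the documented AUTH_TOKEN behaviour.
function isAuthorized(
  authorizationHeader: string | undefined,
  authToken: string | undefined,
): boolean {
  // AUTH_TOKEN is optional: when it is unset, no header is required.
  if (authToken === undefined || authToken === "") return true;
  // Otherwise the header must be exactly "Bearer <token>".
  return authorizationHeader === `Bearer ${authToken}`;
}

console.log(isAuthorized("Bearer my-token", "my-token")); // true
console.log(isAuthorized(undefined, "my-token")); // false
console.log(isAuthorized(undefined, undefined)); // true
```

A production server would likely prefer a constant-time comparison over `===` for the token.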

Client example with supergateway:

*Configuration content*

Tools

| Capability | Tool ID | Transport | Input | Output |
| --- | --- | --- | --- | --- |
| Full-scene object detection | `detect-all-objects` | STDIO / HTTP | Image URL | Category + bbox + (optional) captions |
| Text-prompted object detection | `detect-objects-by-text` | STDIO / HTTP | Image URL + English nouns (dot-separated for multiple, e.g., `person.car`) | Target object bbox + (optional) captions |
| Human pose estimation | `detect-human-pose-keypoints` | STDIO / HTTP | Image URL | 17 keypoints + bbox + (optional) captions |
| Visualization | `visualize-detection-result` | STDIO only | Image URL + detection results array | Local path to annotated image |
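
MCP clients normally issue tool calls for you, but it can help to see what a call to `detect-objects-by-text` might look like on the wire. The sketch below builds the dot-separated prompt and wraps it in an MCP `tools/call` JSON-RPC request. The argument names `imageUrl` and `textPrompt` are hypothetical; check the input schema the server reports via `tools/list` for the real ones. `buildTextPrompt` and `detectByTextRequest` are likewise names invented for this example.

```typescript
// Shape of an MCP "tools/call" JSON-RPC request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Join English nouns with dots, e.g. ["person", "car"] -> "person.car".
function buildTextPrompt(nouns: string[]): string {
  const cleaned = nouns.map((n) => n.trim()).filter((n) => n.length > 0);
  if (cleaned.length === 0) throw new Error("at least one noun is required");
  return cleaned.join(".");
}

function detectByTextRequest(id: number, imageUrl: string, nouns: string[]): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "detect-objects-by-text",
      // Hypothetical argument names; the real schema may differ.
      arguments: { imageUrl, textPrompt: buildTextPrompt(nouns) },
    },
  };
}

console.log(JSON.stringify(detectByTextRequest(1, "https://example.com/street.jpg", ["person", "car"])));
```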

🎬 Use Cases

| 🎯 Scenario | 📝 Input | ✨ Output |
| --- | --- | --- |
| Detection & Localization | 💬 Prompt: `Detect and visualize the fire areas in the forest` + 🖼️ input image | |
| Object Counting | 💬 Prompt: `Please analyze this warehouse image, detect all the cardboard boxes, count the total number` + 🖼️ input image | |
| Feature Detection | 💬 Prompt: `Find all red cars in the image` + 🖼️ input image | |
| Attribute Reasoning | 💬 Prompt: `Find the tallest person in the image, describe their clothing` + 🖼️ input image | |
| Full Scene Detection | 💬 Prompt: `Find the fruit with the highest vitamin C content in the image` + 🖼️ input image | Answer: Kiwi fruit (93mg/100g) |
| Pose Analysis | 💬 Prompt: `Please analyze what yoga pose this is` + 🖼️ input image | |
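
Scenarios such as Object Counting and Attribute Reasoning come down to simple post-processing of the structured detection results. The TypeScript sketch below shows the idea under stated assumptions: the `Detection` shape (a `category` string plus a `bbox` in `[xmin, ymin, xmax, ymax]` order) is hypothetical and may not match the server's actual output format, and the sample data is made up.

```typescript
// Hypothetical result shape; the real output format may differ.
interface Detection {
  category: string;
  bbox: [number, number, number, number]; // assumed [xmin, ymin, xmax, ymax]
}

// Count how many detections fall into each category.
function countByCategory(detections: Detection[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of detections) {
    counts.set(d.category, (counts.get(d.category) ?? 0) + 1);
  }
  return counts;
}

// Return the detection of the given category with the greatest bbox height.
function tallest(detections: Detection[], category: string): Detection | undefined {
  let best: Detection | undefined;
  for (const d of detections) {
    if (d.category !== category) continue;
    if (best === undefined || d.bbox[3] - d.bbox[1] > best.bbox[3] - best.bbox[1]) {
      best = d;
    }
  }
  return best;
}

// Made-up sample data for illustration.
const sample: Detection[] = [
  { category: "cardboard box", bbox: [10, 10, 60, 50] },
  { category: "cardboard box", bbox: [70, 12, 120, 55] },
  { category: "person", bbox: [200, 20, 240, 180] },
  { category: "person", bbox: [260, 40, 300, 170] },
];

console.log(countByCategory(sample).get("cardboard box")); // 2
console.log(tallest(sample, "person")?.bbox[0]); // 200
```

Bounding-box height in pixels is only a proxy for real-world height, so a comparison like `tallest` is most meaningful when the subjects stand at a similar distance from the camera.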