Crawl-MCP: Unofficial MCP Server for crawl4ai

⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.

A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.

🌟 Key Features

🔍 Google Search Integration - 7 optimized search genres with Google official operators
🔍 Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
🌐 Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
🤖 AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
🎬 YouTube Integration: Extract video transcripts and summaries without API keys
⚡ Production Ready: 21 specialized tools with comprehensive error handling

🚀 Quick Start

Prerequisites (Required First)

Install system dependencies for Playwright:

Linux/macOS:

sudo bash scripts/prepare_for_uvx_playwright.sh

Windows (as Administrator):

scripts/prepare_for_uvx_playwright.ps1

Installation

UVX (Recommended - Easiest):

# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp

Claude Desktop Setup

Add to your claude_desktop_config.json:

*Configuration content*

For Japanese interface:

*Configuration content*

📖 Documentation

Topic	Description
Installation Guide	Complete installation instructions for all platforms
API Reference	Full tool documentation and usage examples
Configuration Examples	Platform-specific setup configurations
HTTP Integration	HTTP API access and integration methods
Advanced Usage	Power user techniques and workflows
Development Guide	Contributing and development setup

Language-Specific Documentation

English: docs/ directory
日本語: docs/ja/ directory

🛠️ Tool Overview

Web Crawling

crawl_url - Single page crawling with JavaScript support
deep_crawl_site - Multi-page site mapping and exploration
crawl_url_with_fallback - Robust crawling with retry strategies
batch_crawl - Process multiple URLs simultaneously

AI-Powered Analysis

intelligent_extract - Semantic content extraction with custom instructions
auto_summarize - LLM-based summarization for large content
extract_entities - Pattern-based entity extraction (emails, phones, URLs, etc.)

Media Processing

process_file - Convert PDFs, Office docs, ZIP archives to markdown
extract_youtube_transcript - Multi-language transcript extraction
batch_extract_youtube_transcripts - Process multiple videos

Search Integration

search_google - Genre-filtered Google search with metadata
search_and_crawl - Combined search and content extraction
batch_search_google - Multiple search queries with analysis

🎯 Common Use Cases

Content Research:

search_and_crawl → intelligent_extract → structured analysis

Documentation Mining:

deep_crawl_site → batch processing → comprehensive extraction

Media Analysis:

extract_youtube_transcript → auto_summarize → insight generation

Competitive Intelligence:

batch_crawl → extract_entities → comparative analysis

🚨 Quick Troubleshooting

Installation Issues:

Run system diagnostics: Use get_system_diagnostics tool
Re-run setup scripts with proper privileges
Try development installation method

Performance Issues:

Use wait_for_js: true for JavaScript-heavy sites
Increase timeout for slow-loading pages
Enable auto_summarize for large content

Configuration Issues:

Check JSON syntax in claude_desktop_config.json
Verify file paths are absolute
Restart Claude Desktop after configuration changes

🏗️ Project Structure

Original Library: crawl4ai by unclecode
MCP Wrapper: This repository (walksoda)
Implementation: Unofficial third-party integration

📄 License

This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.

🤝 Contributing

See our Development Guide for contribution guidelines and development setup instructions.

🔗 Related Projects

crawl4ai - The underlying web crawling library
Model Context Protocol - The standard this server implements
Claude Desktop - Primary client for MCP servers

Crawl MCP