MLX Knife
An ollama-like CLI tool for MLX models on HuggingFace (pull, rm, list, show, serve, etc.).
A lightweight, ollama-like CLI for managing and running MLX models on Apple Silicon. CLI-only tool designed for personal, local use - perfect for individual developers and researchers working with MLX models.
Note: MLX Knife is designed as a command-line interface tool only. While some internal functions are accessible via Python imports, only CLI usage is officially supported.
Current Version: 1.0.1 (August 2025)
Features
Core Functionality
- List & Manage Models: Browse your HuggingFace cache with MLX-specific filtering
- Model Information: Detailed model metadata including quantization info
- Download Models: Pull models from HuggingFace with progress tracking
- Run Models: Native MLX execution with streaming and chat modes
- Health Checks: Verify model integrity and completeness
- Cache Management: Clean up and organize your model storage
Local Server & Web Interface
- OpenAI-Compatible API: Local REST API with /v1/chat/completions, /v1/completions, and /v1/models
- Web Chat Interface: Built-in HTML chat interface with markdown rendering
- Single-User Design: Optimized for personal use, not multi-user production environments
- Conversation Context: Full chat history maintained for follow-up questions
- Streaming Support: Real-time token streaming via Server-Sent Events
- Configurable Limits: Set default max tokens via the --max-tokens parameter
- Model Hot-Swapping: Switch between models per conversation
- Tool Integration: Compatible with OpenAI-compatible clients (Cursor IDE, etc.)
Run Experience
- Direct MLX Integration: Models load and run natively without subprocess overhead
- Real-time Streaming: Watch tokens generate with proper spacing and formatting
- Interactive Chat: Full conversational mode with history tracking
- Memory Insights: See GPU memory usage after model loading and generation
- Dynamic Stop Tokens: Automatic detection and filtering of model-specific stop tokens
- Customizable Generation: Control temperature, max_tokens, top_p, and repetition penalty
- Context-Managed Memory: Context manager pattern ensures automatic cleanup and prevents memory leaks
- Exception-Safe: Robust error handling with guaranteed resource cleanup
Installation
Via PyPI (Recommended)
pip install mlx-knife
Via GitHub (Development)
pip install git+https://github.com/mzau/mlx-knife.git
Requirements
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.9 or newer (the Python shipped with macOS works)
- 8GB+ RAM recommended, plus enough free RAM to hold the model you want to run
Python Compatibility
MLX Knife has been comprehensively tested and verified on:
✅ Python 3.9.6 (native macOS) - Primary target
✅ Python 3.10-3.13 - Fully compatible
All versions include full MLX model execution testing with real models.
Install from Source
# Clone the repository
git clone https://github.com/mzau/mlx-knife.git
cd mlx-knife

# Install in development mode
pip install -e .

# Or install normally
pip install .

# Install with development tools (ruff, mypy, tests)
pip install -e ".[dev,test]"
Install Dependencies Only
pip install -r requirements.txt
Quick Start
CLI Usage
# List all MLX models in your cache
mlxk list

# Show detailed info about a model
mlxk show Phi-3-mini-4k-instruct-4bit

# Download a new model
mlxk pull mlx-community/Mistral-7B-Instruct-v0.3-4bit

# Run a model with a prompt
mlxk run Phi-3-mini "What is the capital of France?"

# Start interactive chat
mlxk run Phi-3-mini

# Check model health
mlxk health
Web Chat Interface
MLX Knife includes a built-in web interface for easy model interaction:
# Start the OpenAI-compatible API server
mlxk server --port 8000 --max-tokens 4000

# Open the web chat interface in your browser
open simple_chat.html
Features:
- No installation required - Pure HTML/CSS/JS
- Real-time streaming - Watch tokens appear as they're generated
- Model selection - Choose any MLX model from your cache
- Conversation history - Full context for follow-up questions
- Markdown rendering - Proper formatting for code, lists, tables
- Mobile-friendly - Responsive design works on all devices
Local API Server Integration
The MLX Knife server provides OpenAI-compatible endpoints for local development and personal use:
# Start local server (single-user, no authentication)
mlxk server --host 127.0.0.1 --port 8000

# Test with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "Phi-3-mini-4k-instruct-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'

# Integration with development tools (community-tested):
# - Cursor IDE: Set API URL to http://localhost:8000/v1
# - LibreChat: Configure as custom OpenAI endpoint
# - Open WebUI: Add as local OpenAI-compatible API
# - SillyTavern: Add as OpenAI API with custom URL
Note: Tool integrations are community-tested. Some tools may require specific configuration or have compatibility limitations. Please report issues via GitHub.
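Because the endpoints follow the OpenAI schema, the official openai Python client can also talk to the local server. The sketch below assumes the openai package (v1.0+) is installed and that Phi-3-mini-4k-instruct-4bit is already in your cache; the API key is a placeholder since the local server uses no authentication.

from openai import OpenAI

# Point the client at the local mlxk server; the key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming request
response = client.chat.completions.create(
    model="Phi-3-mini-4k-instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# Streaming request (tokens arrive via Server-Sent Events)
stream = client.chat.completions.create(
    model="Phi-3-mini-4k-instruct-4bit",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()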
Command Reference
Available Commands
list - Browse Models
mlxk list                   # Show MLX models only (short names)
mlxk list --verbose         # Show MLX models with full paths
mlxk list --all             # Show all models with framework info
mlxk list --all --verbose   # All models with full paths
mlxk list --health          # Include health status
mlxk list Phi-3             # Filter by model name
mlxk list --verbose Phi-3   # Show detailed info (same as show)
show - Model Details
mlxk show <model>           # Display model information
mlxk show <model> --files   # Include file listing
mlxk show <model> --config  # Show config.json content
pull - Download Models
mlxk pull <model>        # Download from HuggingFace
mlxk pull <org>/<model>  # Full model path
run - Execute Models
mlxk run <model> "prompt"              # Single prompt (minimal output)
mlxk run <model> "prompt" --verbose    # Show loading, memory, and stats
mlxk run <model>                       # Interactive chat
mlxk run <model> "prompt" --no-stream  # Batch output
mlxk run <model> --max-tokens 1000     # Custom length
mlxk run <model> --temperature 0.9     # Higher creativity
mlxk run <model> --no-chat-template    # Raw completion mode
rm - Remove Models
mlxk rm <model>          # Delete a model
mlxk rm <model> --force  # Skip confirmation
health - Check Integrity
mlxk health          # Check all models
mlxk health <model>  # Check specific model
server - Start API Server
mlxk server                             # Start on localhost:8000
mlxk server --port 8001                 # Custom port
mlxk server --host 0.0.0.0 --port 8000  # Allow external access
mlxk server --max-tokens 4000           # Set default max tokens (default: 2000)
mlxk server --reload                    # Development mode with auto-reload
Command Aliases
After installation, these commands are equivalent:
- mlxk (recommended)
- mlx-knife
- mlx_knife
Project Structure
mlx_knife/
├── __init__.py # Package metadata and version
├── cli.py # Command-line interface and argument parsing
├── cache_utils.py # Core model management functionality
├── mlx_runner.py # Native MLX model execution
├── server.py # OpenAI-compatible API server with FastAPI
├── hf_download.py # HuggingFace download integration
├── throttled_download_worker.py # Background download worker
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
├── simple_chat.html # Built-in web chat interface
└── README.md # This file
Module Overview
- cli.py: Entry point handling command parsing and dispatch
- cache_utils.py: Model discovery, metadata extraction, and cache operations
- mlx_runner.py: MLX model loading, token generation, and streaming
- server.py: FastAPI-based REST API server with OpenAI compatibility
- simple_chat.html: Standalone web chat interface for immediate use
- hf_download.py: Robust downloading with progress tracking
- throttled_download_worker.py: Prevents network overload during downloads
Configuration
Cache Location
By default, models are stored in ~/.cache/huggingface/hub. Configure with:
# Set custom cache location
export HF_HOME="/path/to/your/cache"

# Example: External SSD
export HF_HOME="/Volumes/ExternalSSD/models"
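To verify where downloads will land, you can ask huggingface_hub directly; this assumes the huggingface_hub package is installed in the same environment (typical for any setup that downloads models from HuggingFace):

import os
from huggingface_hub import constants

# HF_HOME (if set) determines where the hub cache lives.
print("HF_HOME:", os.environ.get("HF_HOME", "<not set>"))
print("Resolved cache directory:", constants.HF_HUB_CACHE)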
Model Name Expansion
Short names are automatically expanded for MLX models:
- Phi-3-mini-4k-instruct-4bit → mlx-community/Phi-3-mini-4k-instruct-4bit
- Models already containing / are used as-is
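The rule boils down to a simple check. The helper below is a hypothetical illustration, not the actual cache_utils.py implementation:

def expand_model_name(name: str) -> str:
    # Hypothetical sketch: names without an org prefix are assumed to live
    # under mlx-community; anything containing "/" is passed through unchanged.
    return name if "/" in name else f"mlx-community/{name}"

print(expand_model_name("Phi-3-mini-4k-instruct-4bit"))  # mlx-community/Phi-3-mini-4k-instruct-4bit
print(expand_model_name("org/custom-model"))             # org/custom-model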
Advanced Usage
Generation Parameters
# Creative writing (high temperature, diverse output)
mlxk run Mistral-7B "Write a story" --temperature 0.9 --top-p 0.95

# Precise tasks (low temperature, focused output)
mlxk run Phi-3-mini "Extract key points" --temperature 0.3 --top-p 0.9

# Long-form generation
mlxk run Mixtral-8x7B "Explain quantum computing" --max-tokens 2000

# Reduce repetition
mlxk run model "prompt" --repetition-penalty 1.2
Working with Specific Commits
# Use specific model version
mlxk show model@commit_hash
mlxk run model@commit_hash "prompt"
Non-MLX Model Handling
The tool automatically detects framework compatibility:
# Attempting to run PyTorch model
mlxk run bert-base-uncased
# Error: Model bert-base-uncased is not MLX-compatible (Framework: PyTorch)!
# Use MLX-Community models: https://huggingface.co/mlx-community
Testing
MLX Knife includes comprehensive test coverage across all supported Python versions.
Quick Start
Prerequisites:
- Apple Silicon Mac (M1/M2/M3)
- Python 3.9+
- At least one MLX model: mlxk pull mlx-community/Phi-3-mini-4k-instruct-4bit
Run Tests:
pip install -e ".[test]"
pytest
Why Local Testing?
MLX requires Apple Silicon hardware and real models (4GB+) for testing. This is standard for MLX projects and ensures tests reflect real-world usage.
For detailed testing documentation, development workflows, and multi-Python verification, see TESTING.md.
Part of the BROKE Ecosystem 🦫
MLX Knife is the first component of BROKE Cluster, our research project for intelligent LLM routing across heterogeneous Apple Silicon networks.
- Use MLX Knife: For single Mac setups (available now)
- Use BROKE Cluster: For multi-Mac environments (in development)
Technical Details
Token Decoding
MLX Knife uses context-aware decoding to handle tokenizers that encode spaces as separate tokens:
# Sliding window approach maintains context for proper spacing
window_tokens = generated_tokens[-10:]  # Last 10 tokens
window_text = tokenizer.decode(window_tokens)
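For illustration, here is a self-contained, simplified variant of the same idea (a full-prefix diff rather than the 10-token window used in mlx_runner.py). It assumes the transformers package and uses the gpt2 tokenizer purely as an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # arbitrary example tokenizer
token_ids = tokenizer.encode("The capital of France is Paris.")

# Decode the accumulated ids at each step and emit only the newly completed text,
# so spaces that belong to the next token are never lost.
emitted = ""
for i in range(1, len(token_ids) + 1):
    text = tokenizer.decode(token_ids[:i])
    chunk = text[len(emitted):]
    emitted = text
    print(chunk, end="", flush=True)
print()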
Stop Token Detection
Stop tokens are dynamically extracted from each model's tokenizer:
- Primary: tokenizer.eos_token
- Secondary: tokenizer.pad_token (if different)
- Additional: Special tokens containing 'end', 'stop', or 'eot'
- Common tokens verified as single-token entities
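A hypothetical sketch of these rules, written against the transformers tokenizer API (illustrative only, not the actual mlx_runner.py code):

from transformers import AutoTokenizer

def collect_stop_tokens(model_path: str) -> set:
    # Illustrative only: applies the rules listed above to a transformers tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    candidates = {tokenizer.eos_token}
    if tokenizer.pad_token and tokenizer.pad_token != tokenizer.eos_token:
        candidates.add(tokenizer.pad_token)
    for token in tokenizer.all_special_tokens:
        if any(key in token.lower() for key in ("end", "stop", "eot")):
            candidates.add(token)
    # Keep only candidates that really encode to a single token id
    return {
        t for t in candidates
        if t and len(tokenizer.encode(t, add_special_tokens=False)) == 1
    }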
Memory Management
- Context Managers: Automatic resource cleanup with Python context managers
- Exception-Safe: Model cleanup guaranteed even on errors
- Baseline Tracking: Memory captured before model loading
- Real-time Monitoring: GPU memory tracking via mlx.core.get_active_memory()
- Memory Statistics: Detailed usage displayed after generation
- Leak Prevention: Automatic mx.clear_cache() and garbage collection
# Context manager pattern (automatic cleanup)
with MLXRunner(model_path) as runner:
    response = runner.generate_batch(prompt)
# Model automatically cleaned up here
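To make the pattern concrete, here is a minimal, hypothetical runner built on mlx_lm that follows the same idea. It is a sketch under those assumptions, not MLX Knife's actual MLXRunner class:

import gc
import mlx.core as mx
from mlx_lm import load, generate

class SimpleRunner:
    """Hypothetical context-managed runner; illustrative only."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.model = None
        self.tokenizer = None

    def __enter__(self):
        self.model, self.tokenizer = load(self.model_path)
        return self

    def generate_batch(self, prompt: str, max_tokens: int = 200) -> str:
        return generate(self.model, self.tokenizer, prompt=prompt, max_tokens=max_tokens)

    def __exit__(self, exc_type, exc, tb):
        # Drop references and clear the MLX buffer cache, even if an exception occurred.
        self.model = None
        self.tokenizer = None
        mx.clear_cache()
        gc.collect()
        return False  # never suppress exceptions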
Troubleshooting
Model Not Found
# If model isn't found, try full path
mlxk pull mlx-community/Model-Name-4bit

# List available models
mlxk list --all
Performance Issues
- Ensure sufficient RAM for model size
- Close other applications to free memory
- Use smaller quantized models (4-bit recommended)
Streaming Issues
- Some models may have spacing issues - this is handled automatically
- Use --no-stream for batch output if needed
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Quick Start:
- Fork and clone the repository
- Install with development tools: pip install -e ".[dev,test]"
- Make your changes and add tests
- Run tests locally on Apple Silicon: pytest
- Check code style: ruff check mlx_knife/ --fix
- Submit a pull request
We prioritize compatibility with Python 3.9 (native macOS) but welcome contributions tested on any version 3.9+.
Security
For security concerns, please see SECURITY.md or contact us at broke@gmx.eu.
MLX Knife runs entirely locally - no data is sent to external servers except when downloading models from HuggingFace.
License
MIT License - see LICENSE file for details
Copyright (c) 2025 The BROKE team 🦫
Acknowledgments
- Built for Apple Silicon using the MLX framework
- Models hosted by the MLX Community on HuggingFace
- Inspired by ollama's user experience
Made with ❤️ by The BROKE team
Version 1.0.1 | August 2025
🔮 Next: BROKE Cluster for multi-node deployments