Absortio

Email → Summary → Bookmark → Email

GitHub - merveenoyan/smol-vision: Recipes for shrinking, optimizing, customizing cutting edge vision models. 💜

Excerpt

Recipes for shrinking, optimizing, customizing cutting edge vision models. 💜

Content

Quantization/ONNX: Faster and Smaller Zero-shot Object Detection with Optimum. Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools.
VLM Fine-tuning: Fine-tune PaliGemma. Fine-tune state-of-the-art vision language backbone PaliGemma using transformers.
Intro to Optimum/ORT: Optimizing DETR with 🤗 Optimum. A soft introduction to exporting vision models to ONNX and quantizing them.
Model Shrinking: Knowledge Distillation for Computer Vision. Knowledge distillation for image classification.
Quantization: Fit in vision models using Quanto. Fit in vision models to smaller hardware using quanto.
Speed-up: Faster foundation models with torch.compile. Improving latency for foundation models using torch.compile.
[NEW] VLM Fine-tuning: Fine-tune Florence-2. Fine-tune Florence-2 on DocVQA dataset.
VLM Fine-tuning: QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2. QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset.
VLM Fine-tuning (Script): QLoRA Fine-tune IDEFICS3 on VQAv2. QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset.
[NEW] VLM Fine-tuning: Grounded Fine-tuning. Grounded fine-tuning for vision-language models.
[NEW] Vision Model Fine-tuning: Fine-tune DINOv3. Fine-tune DINOv3 for vision tasks.
Multimodal RAG: Multimodal RAG using ColPali and Qwen2-VL. Learn to retrieve documents and pipeline to RAG without hefty document processing using ColPali through Byaldi and do the generation with Qwen2-VL.
Multimodal Retriever Fine-tuning: Fine-tune ColPali for Multimodal RAG. Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case.
Any-to-Any Fine-tuning: Fine-tune Gemma-3n for all modalities (audio-text-image). Fine-tune Gemma-3n model to handle any modality: audio, text, and image.
Any-to-Any RAG: Any-to-Any (Video) RAG with OmniEmbed and Qwen. Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen.
Speed-up/Memory Optimization: Vision language model serving using TGI (SOON). Explore speed-ups and memory improvements for vision-language model serving with text-generation inference.
Quantization/Optimum/ORT: All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON). End-to-end model optimization using Optimum.
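
Several of the notebooks listed above rely on quantization to shrink models. As a rough illustration of the underlying idea only, the sketch below applies symmetric per-tensor int8 quantization to a small list of floats using the Python standard library. It is not the Optimum, ONNX Runtime, or quanto API; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and chosen just for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of a non-empty list of floats.

    Hypothetical helper for illustration; real toolkits handle
    calibration, per-channel scales, and operator fusion as well.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any positive scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Map each float to the nearest integer code and clamp to [-127, 127].
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


# Made-up weights, standing in for one small tensor of a model.
weights = [0.42, -1.27, 0.05, 0.9, -0.33]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as one signed byte plus a single shared scale, which is where the memory saving comes from; the price is a rounding error of at most half a quantization step per value.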

Source: GitHub