---
title: Markit_v2
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
build_script: build.sh
startup_script: setup.sh
pinned: false
hf_oauth: true
---

# Document to Markdown Converter with RAG Chat

**Author: Anse Min** | [🤗 Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2) | [GitHub](https://github.com/ansemin/Markit_v2) | [LinkedIn](https://www.linkedin.com/in/ansemin/)

A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation).

## 🎥 Demo Video

**[▶️ Watch Full Demo (YouTube)](https://www.youtube.com/watch?v=PmXu3Si6hXo)** *Complete walkthrough of Markit's flagship features including multi-document processing, RAG chat, and advanced retrieval strategies*

Table of contents

- [Demo Video](#-demo-video)
- [Live Demos](#-live-demos)
- [System Overview](#-system-overview)
- [Environment Setup](#-environment-setup)
- [Local Development](#-local-development)
- [Technical Details](#-technical-details)

## 🎬 Live Demos

### 1. Multi-Document Processing (Flagship Feature)

**What it does:** Process up to 5 files simultaneously (20MB combined) with 4 intelligent processing types:

- **🔗 Combined**: Merge documents with smart duplicate removal
- **📑 Individual**: Separate sections per document with clear organization
- **📈 Summary**: Executive overview + detailed analysis of all documents
- **⚖️ Comparison**: Cross-document analysis with similarities/differences tables

**Why it matters:** Industry-leading multi-document processing that compares and contrasts information across different files, handles mixed file types seamlessly, and recognizes relationships across document boundaries.

*Industry-leading multi-document processing with 4 intelligent processing types*
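The batch limits above (5 files, 20MB combined) and the 10MB per-file limit listed under Environment Setup can be checked before any parsing starts. The following is a minimal sketch, not the project's actual validation code: the function name `validate_batch` and the error messages are hypothetical, and the constants simply mirror the documented defaults.

```python
# Hypothetical pre-flight check; constants mirror the documented defaults.
MAX_BATCH_FILES = 5                  # MAX_BATCH_FILES
MAX_FILE_SIZE = 10 * 1024 * 1024     # MAX_FILE_SIZE, 10MB per file
MAX_BATCH_SIZE = 20 * 1024 * 1024    # MAX_BATCH_SIZE, 20MB combined


def validate_batch(sizes: dict[str, int]) -> list[str]:
    """Return a list of problems for a batch; an empty list means it is acceptable.

    `sizes` maps each file name to its size in bytes.
    """
    problems = []
    if not sizes:
        problems.append("no files provided")
    if len(sizes) > MAX_BATCH_FILES:
        problems.append(f"too many files: {len(sizes)} > {MAX_BATCH_FILES}")
    for name, size in sizes.items():
        if size > MAX_FILE_SIZE:
            problems.append(f"{name} exceeds the per-file limit")
    if sum(sizes.values()) > MAX_BATCH_SIZE:
        problems.append("combined size exceeds the batch limit")
    return problems
```

Collecting every problem instead of failing on the first one lets a UI report all issues with an upload in a single message.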

### 2. Single Document Conversion Flow

**What it does:** Convert PDFs, Office documents, images, and more to Markdown using 5 powerful parsers:

- **Gemini Flash**: AI-powered understanding with high accuracy
- **Mistral OCR**: Fastest processing with document understanding
- **Docling**: Open source with advanced PDF table recognition
- **GOT-OCR**: Mathematical/scientific documents to LaTeX
- **MarkItDown**: High accuracy for CSV/XML and broad format support

**Why it matters:** Perfect table preservation creates enhanced markdown tables for superior RAG context, unlike standard PDF text extraction.

*Choose the right parser for your specific needs and document types*
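The parser strengths listed above can be read as a simple decision rule. The sketch below is only an illustration of that reading: `suggest_parser`, its `has_math` and `has_tables` flags, and the Gemini Flash fallback are all assumptions, and the application itself lets you pick the parser in the UI.

```python
from pathlib import Path


def suggest_parser(filename: str, has_math: bool = False, has_tables: bool = False) -> str:
    """Suggest a parser name from the file extension and two content hints.

    Hypothetical helper: the rules paraphrase the parser list above and
    the final fallback is an assumption, not documented behavior.
    """
    ext = Path(filename).suffix.lower()
    if ext in {".csv", ".xml"}:
        return "MarkItDown"      # high accuracy for CSV/XML
    if has_math:
        return "GOT-OCR"         # mathematical/scientific documents to LaTeX
    if ext == ".pdf" and has_tables:
        return "Docling"         # advanced PDF table recognition
    return "Gemini Flash"        # assumed general-purpose default
```

For example, `suggest_parser("paper.pdf", has_math=True)` suggests GOT-OCR, while a plain `report.docx` falls through to the assumed default.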

### 3. RAG Chat System in Action

**What it does:** Chat with your converted documents using 4 advanced retrieval strategies:

- **🎯 Similarity**: Traditional semantic similarity using embeddings
- **🔀 MMR**: Diverse results with reduced redundancy
- **🔍 BM25**: Traditional keyword-based retrieval
- **🔗 Hybrid**: Combines semantic + keyword search (recommended)

**Why it matters:** Ask for markdown tables in chat responses (impossible with standard PDF RAG), get streaming responses with document context, and easily clear data directly from the interface.

*Advanced RAG system with 4 retrieval strategies for optimal document search*
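This README does not say how the Hybrid strategy merges semantic and keyword results. One common approach is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the project's implementation. The function name, the chunk IDs, and the constant `k=60` (a conventional RRF default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in, so
    chunks ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic = ["c1", "c2", "c3"]   # e.g. from embedding similarity
keyword = ["c3", "c1", "c4"]    # e.g. from BM25
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

RRF works on ranks only, so it avoids having to normalize embedding distances and BM25 scores onto a shared scale.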

### 4. Query Ranker Analysis

**What it does:** Interactive document search with:

- **Real-time ranking** of document chunks with confidence scores
- **Method comparison** to test different retrieval strategies
- **Adjustable results** (1-10) with responsive slider control
- **Transparent scoring** with actual ChromaDB similarity scores

**Why it matters:** Provides complete transparency into how your RAG system finds and ranks information, helping you optimize retrieval strategies.

### 5. GOT-OCR LaTeX Processing

*GOT-OCR LaTeX demo*

**What it does:** Advanced LaTeX processing for mathematical and scientific documents:

- **Native LaTeX output** with no LLM conversion for maximum accuracy
- **Mathpix rendering** using the same library as official GOT-OCR demo
- **RAG-compatible chunking** that preserves LaTeX structures and mathematical tables
- **Professional display** with proper mathematical formatting

**Why it matters:** Perfect for research papers, scientific documents, and academic content with complex equations and structured data.

## 🎯 System Overview

*Complete workflow from document upload to intelligent RAG chat interaction*

## 🔧 Environment Setup

### Required API Keys

```bash
GOOGLE_API_KEY=your_gemini_api_key_here    # For Gemini Flash parser and RAG chat
OPENAI_API_KEY=your_openai_api_key_here    # For embeddings and AI descriptions
MISTRAL_API_KEY=your_mistral_api_key_here  # For Mistral OCR parser (optional)
```

### Key Configuration Options

```bash
DEBUG=true               # Enable debug logging
MAX_FILE_SIZE=10485760   # 10MB per file limit
MAX_BATCH_FILES=5        # Maximum files for multi-document processing
MAX_BATCH_SIZE=20971520  # 20MB combined limit for batch processing
CHUNK_SIZE=1000          # Document chunk size for Markdown content
RETRIEVAL_K=4            # Number of documents to retrieve for RAG
```

## 🚀 Local Development

### Quick Start

```bash
# Clone repository
git clone https://github.com/ansemin/Markit_v2
cd Markit_v2

# Create environment file
cp .env.example .env
# Edit .env with your API keys

# Install dependencies
pip install -r requirements.txt

# Run application
python app.py                           # Full environment setup (HF Spaces compatible)
python run_app.py                       # Local development (faster startup)
python run_app.py --clear-data-and-run  # Testing with clean data
```

### Data Management

**Two ways to clear data:**

1. **UI Method**: Chat tab → "🗑️ Clear All Data" button (works in both local and HF Space)
2. **CLI Method**: `python run_app.py --clear-data-and-run`

**What gets cleared:** Vector store embeddings, chat history, and session data

## 🔍 Technical Details

### Retrieval Strategy Performance

| Method | Best For | Accuracy |
|--------|----------|----------|
| **🎯 Similarity** | General semantic questions | Good |
| **🔀 MMR** | Diverse perspectives | Good |
| **🔍 BM25** | Exact keyword searches | Medium |
| **🔗 Hybrid** | Most queries (recommended) | **Excellent** |

### Core Technologies

- **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
- **RAG System**: OpenAI embeddings + ChromaDB vector store + Gemini 2.5 Flash
- **UI Framework**: Gradio with modular component architecture
- **GPU Support**: ZeroGPU integration for HF Spaces

### Smart Content-Aware Chunking

- **Markdown chunking**: Preserves tables and code blocks
- **LaTeX chunking**: Preserves mathematical tables, environments, and structures
- **Automatic format detection**: Optimal chunking strategy per document type

## Credits

- [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
- [Docling](https://github.com/DS4SD/docling) by IBM Research
- [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) by StepFun
- [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
- [Gradio](https://gradio.app/) for the UI framework

---

**🚀 [Try it live on Hugging Face Spaces](https://huggingface.co/spaces/Ansemin101/Markit_v2)**