---
title: LoomRAG
emoji: πŸ†
colorFrom: indigo
colorTo: pink
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: 🧠 Multimodal RAG that "weaves" together text and images πŸͺ‘
---

# 🌟 LoomRAG: Multimodal Retrieval-Augmented Generation for AI-Powered Search

![GitHub stars](https://img.shields.io/github/stars/NotShrirang/LoomRAG?style=social)
![GitHub forks](https://img.shields.io/github/forks/NotShrirang/LoomRAG?style=social)
![GitHub commits](https://img.shields.io/github/commit-activity/t/NotShrirang/LoomRAG)
![GitHub issues](https://img.shields.io/github/issues/NotShrirang/LoomRAG)
![GitHub pull requests](https://img.shields.io/github/issues-pr/NotShrirang/LoomRAG)
![GitHub](https://img.shields.io/github/license/NotShrirang/LoomRAG)
![GitHub last commit](https://img.shields.io/github/last-commit/NotShrirang/LoomRAG)
![GitHub repo size](https://img.shields.io/github/repo-size/NotShrirang/LoomRAG)
<a href="https://huggingface.co/spaces/NotShrirang/LoomRAG"><img src="https://img.shields.io/badge/Streamlit%20App-red?style=flat-rounded-square&logo=streamlit&labelColor=white"/></a>

**LoomRAG** is a Multimodal Retrieval-Augmented Generation (RAG) system that uses OpenAI's CLIP model for cross-modal retrieval and semantic search. Users submit text queries and retrieve both text and image results through shared vector embeddings. The system includes an annotation interface for building custom datasets, supports CLIP fine-tuning with configurable parameters for domain-specific applications, and accepts image and PDF uploads for retrieval through a Streamlit-based interface.

Experience the project in action:

[![LoomRAG Streamlit App](https://img.shields.io/badge/Streamlit%20App-red?style=for-the-badge&logo=streamlit&labelColor=white)](https://huggingface.co/spaces/NotShrirang/LoomRAG)

---

## πŸ“Έ Implementation Screenshots

| ![Screenshot 2025-01-01 184852](https://github.com/user-attachments/assets/ad79d0f0-d200-4a82-8c2f-0890a9fe8189) | ![Screenshot 2025-01-01 222334](https://github.com/user-attachments/assets/7307857d-a41f-4f60-8808-00d6db6e8e3e) |
| ---------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Data Upload Page                                                                                                 | Data Search / Retrieval                                                                                          |
|                                                                                                                  |                                                                                                                  |
| ![Screenshot 2025-01-01 222412](https://github.com/user-attachments/assets/e38273f4-426b-444d-80f0-501fa9563779) | ![Screenshot 2025-01-01 223948](https://github.com/user-attachments/assets/21724a92-ef79-44ae-83e6-25f8de29c45a) |
| Data Annotation Page                                                                                             | CLIP Fine-Tuning                                                                                                 |

---

## ✨ Features

- πŸ”„ **Cross-Modal Retrieval**: Search text to retrieve both text and image results using deep learning
- 🌐 **Streamlit Interface**: Provides a user-friendly web interface for interacting with the system
- πŸ“€ **Upload Options**: Allows users to upload images and PDFs for AI-powered processing and retrieval
- 🧠 **Embedding-Based Search**: Uses OpenAI's CLIP model to align text and image embeddings in a shared latent space
- πŸ” **Augmented Text Generation**: Enhances text results using LLMs for contextually rich outputs
- 🏷️ **Image Annotation**: Enables users to annotate uploaded images through an intuitive interface
- 🎯 **CLIP Fine-Tuning**: Supports custom model training with configurable parameters including test dataset split size, learning rate, optimizer, and weight decay
- πŸ”¨ **Fine-Tuned Model Integration**: Seamlessly load and utilize fine-tuned CLIP models for enhanced search and retrieval

---

## πŸ—οΈ Architecture Overview

1. **Data Indexing**:

   - Text, images, and PDFs are preprocessed and embedded using the CLIP model
   - Embeddings are stored in a vector database for fast and efficient retrieval

2. **Query Processing**:

   - Text queries are converted into embeddings for semantic search
   - Uploaded images and PDFs are processed and embedded for comparison
   - The system performs a nearest neighbor search in the vector database to retrieve relevant text and images

3. **Response Generation**:

   - For text results: Optionally refined or augmented using a language model
   - For image results: Directly returned or enhanced with image captions
   - For PDFs: Extracts text content and provides relevant sections

4. **Image Annotation**:

   - Dedicated annotation page for managing uploaded images
   - Support for creating and managing multiple datasets simultaneously
   - Flexible annotation workflow for efficient data labeling
   - Dataset organization and management capabilities

5. **Model Fine-Tuning**:
   - Custom CLIP model training on annotated images
   - Configurable training parameters for optimization
   - Integration of fine-tuned models into the search pipeline
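
Steps 1 and 2 can be sketched in a few lines. The snippet below is a minimal illustration in which a hypothetical `embed` function stands in for the CLIP encoder and a plain NumPy matrix stands in for the FAISS index:

```python
import numpy as np

def embed(item: str, dim: int = 512) -> np.ndarray:
    """Hypothetical stand-in for a CLIP encoder: a deterministic unit-norm vector."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Indexing: embed every item and stack the vectors (FAISS would store these).
corpus = ["a cat on a sofa", "sunset over mountains", "a bowl of ramen"]
index = np.stack([embed(doc) for doc in corpus])  # shape: (n_items, dim)

def search(query: str, k: int = 2) -> list[str]:
    """Nearest-neighbour search: on unit-norm vectors, cosine similarity is a dot product."""
    sims = index @ embed(query)   # similarity of the query to every indexed item
    top = np.argsort(-sims)[:k]   # indices of the k most similar items
    return [corpus[i] for i in top]
```

In the actual system the same kind of lookup serves both text queries and embedded image/PDF uploads, since CLIP places all modalities in one latent space.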
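
The augmentation in step 3 amounts to packing the retrieved chunks into a prompt for the language model; `build_prompt` below is illustrative, not the project's actual template:

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble retrieved context and the user query into a single LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```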
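
The fine-tuning in step 5 minimizes CLIP's symmetric contrastive objective over batches of matched image-text pairs. Below is a NumPy sketch of that loss (the actual training loop runs in PyTorch with the real encoders, and CLIP's temperature is a learnable parameter rather than a fixed constant):

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix.
    Row i of img_emb and txt_emb is assumed to be a matched pair; rows unit-norm."""
    logits = (img_emb @ txt_emb.T) / temperature  # (batch, batch) similarities
    n = len(logits)

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()  # targets on the diagonal

    return (xent(logits) + xent(logits.T)) / 2
```

Lowering this loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is what adapts CLIP to a domain-specific dataset.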

---

## πŸš€ Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/NotShrirang/LoomRAG.git
   cd LoomRAG
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

---

## πŸ“– Usage

1. **Running the Streamlit Interface**:

   - Start the Streamlit app:

     ```bash
     streamlit run app.py
     ```

   - Access the interface in your browser to:
     - Submit natural language queries
     - Upload images or PDFs to retrieve contextually relevant results
     - Annotate uploaded images
     - Fine-tune CLIP models with custom parameters
     - Use fine-tuned models for improved search results

2. **Example Queries**:
   - **Text Query**: "sunset over mountains"  
     Output: An image of a sunset over mountains along with descriptive text
   - **PDF Upload**: Upload a PDF of a scientific paper  
     Output: Extracted key sections or contextually relevant images

---

## βš™οΈ Configuration

- πŸ“Š **Vector Database**: Uses FAISS for efficient similarity search
- πŸ€– **Model**: Uses OpenAI CLIP for neural embedding generation
- ✍️ **Augmentation**: Optional LLM-based augmentation for text responses
- πŸŽ›οΈ **Fine-Tuning**: Configurable parameters for model training and optimization

---

## πŸ—ΊοΈ Roadmap

- [x] Fine-tuning CLIP for domain-specific datasets
- [ ] Adding support for audio and video modalities
- [ ] Improving the re-ranking system for better contextual relevance
- [ ] Enhanced PDF parsing with semantic section segmentation

---

## 🀝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any feature requests or bug fixes.

---

## πŸ“„ License

This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.

---

## πŸ™ Acknowledgments

- [OpenAI CLIP](https://openai.com/research/clip)
- [FAISS](https://github.com/facebookresearch/faiss)
- [Hugging Face](https://huggingface.co/)