jatinmehra committed on
Commit
5ba2999
·
1 Parent(s): 3878b6f

feat: add README_hf.md and update workflow to prepare README for Hugging Face


Fixes the Space configuration issue caused by the repository README: Hugging Face Spaces reads a Space's configuration from the YAML front matter of README.md, so the workflow now copies README_hf.md (which carries that front matter) over README.md before pushing.

In short: add a workaround README.md for Hugging Face.

Files changed (2)
  1. .github/workflows/Push_to_hf.yaml +4 -0
  2. README_hf.md +197 -0
.github/workflows/Push_to_hf.yaml CHANGED
```diff
@@ -16,6 +16,10 @@ jobs:
         with:
           fetch-depth: 0
 
+      - name: Prepare Hugging Face README
+        run: |
+          cp README_hf.md README.md
+
       - name: Push to Hugging Face Space
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
```
README_hf.md ADDED
---
license: mit
title: CRAWLGPT
sdk: docker
emoji: 💻
colorFrom: pink
colorTo: blue
pinned: true
short_description: A powerful web content crawler with LLM-powered RAG.
---
# CRAWLGPT 🤖

A powerful web content crawler with LLM-powered summarization and chat capabilities. CRAWLGPT extracts content from URLs, stores it in a vector database (FAISS), and enables natural language querying of the stored content. It combines modern web crawling technology with advanced language models to help you extract, analyze, and interact with web content intelligently.

## 🌟 Features

- **Web Crawling**
  Async-based crawling powered by [crawl4ai](https://pypi.org/project/crawl4ai/) and Playwright.
  Includes configurable rate limiting and content validation.

- **Content Processing**
  Automatically chunks large texts, generates embeddings, and summarizes text via the Groq API.

- **Chat Interface**
  Streamlit-based UI with a user-friendly chat panel.
  Supports summarized or full-text retrieval (RAG) for context injection.

- **Data Management**
  Stores content in a local or in-memory vector database (FAISS) for efficient retrieval (a minimal retrieval sketch follows this list).
  Tracks usage metrics and supports import/export of system state.

- **Testing**
  Comprehensive unit and integration tests using Python's `unittest` framework.

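The Content Processing and Data Management features above amount to a standard embed-and-retrieve loop: chunk the crawled text, embed the chunks, store them in FAISS, and pull the closest chunks back at question time. The sketch below illustrates that loop with the pinned `sentence-transformers` and `faiss-cpu` packages; it is not code from CRAWLGPT itself, and the `all-MiniLM-L6-v2` model name is an assumption rather than the project's configured embedder.

```python
# Illustrative embed-and-retrieve loop (not CRAWLGPT's own classes).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; CRAWLGPT may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Pretend these are chunks produced from a crawled page.
chunks = [
    "CRAWLGPT crawls web pages asynchronously with crawl4ai and Playwright.",
    "Extracted text is chunked, embedded, and stored in a FAISS index.",
    "At chat time, the most relevant chunks are injected into the LLM prompt.",
]

embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])  # in-memory vector store
index.add(embeddings)

# Retrieve the chunks closest to a user question (top 2 hits).
question = "How does CRAWLGPT store crawled content?"
query = model.encode([question], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query, 2)
context = "\n".join(chunks[i] for i in ids[0])
print(context)  # this context would be passed to the Groq chat model
```

In the app itself this logic presumably lives in the core modules (`LLMBasedCrawler.py`, `DatabaseHandler.py`) listed in the project structure below, with the retrieved context handed to the Groq chat model.
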
## 🎥 Demo
### [Deployed APP 🚀🤖](https://huggingface.co/spaces/jatinmehra/CRAWL-GPT-CHAT)

[streamlit-chat_app video.webm](https://github.com/user-attachments/assets/ae1ddca0-9e3e-4b00-bf21-e73bb8e6cfdf)

_Example of CRAWLGPT in action!_

## 🔧 Requirements

- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.

## 🚀 Quick Start

1. Clone the Repository:

   ```
   git clone https://github.com/Jatin-Mehra119/CRAWLGPT.git
   cd CRAWLGPT
   ```

2. Run the Setup Script:

   ```
   python -m setup_env
   ```

   _This script installs dependencies, creates a virtual environment, and prepares the project._

3. Update Your Environment Variables:

   - Create or modify the `.env` file.
   - Add your Groq API key and Ollama API key (a short sanity check for these keys follows these steps). Learn how to get API keys.

   ```
   GROQ_API_KEY=your_groq_api_key_here
   OLLAMA_API_TOKEN=your_ollama_api_key_here
   ```

4. Activate the Virtual Environment:

   ```
   source .venv/bin/activate  # On Unix/macOS
   .venv\Scripts\activate     # On Windows
   ```

5. Run the Application:

   ```
   python -m streamlit run src/crawlgpt/ui/chat_app.py
   ```

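Before launching the UI, you may want to confirm that the keys from step 3 are actually being picked up. The snippet below is a hedged sanity check, not part of CRAWLGPT: it only reuses the two variable names from the README plus the pinned `python-dotenv` and `groq` packages, and the Groq model name is an assumption.

```python
# Quick sanity check for the .env file created in step 3 (illustrative only).
import os

from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # reads GROQ_API_KEY / OLLAMA_API_TOKEN from the project-root .env

if not os.getenv("GROQ_API_KEY"):
    raise SystemExit("GROQ_API_KEY is missing; add it to .env before running the app")

client = Groq(api_key=os.environ["GROQ_API_KEY"])
reply = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model name, not CRAWLGPT's setting
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply.choices[0].message.content)
```

If this prints a reply, the Streamlit app should be able to reach Groq with the same key.
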
## 📦 Dependencies

### Core Dependencies

- `streamlit==1.41.1`
- `groq==0.15.0`
- `sentence-transformers==3.3.1`
- `faiss-cpu==1.9.0.post1`
- `crawl4ai==0.4.247`
- `python-dotenv==1.0.1`
- `pydantic==2.10.5`
- `aiohttp==3.11.11`
- `beautifulsoup4==4.12.3`
- `numpy==2.2.0`
- `tqdm==4.67.1`
- `playwright>=1.41.0`
- `asyncio>=3.4.3`

### Development Dependencies

- `pytest==8.3.4`
- `pytest-mockito==0.0.4`
- `black==24.2.0`
- `isort==5.13.0`
- `flake8==7.0.0`

## 🏗️ Project Structure

```
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/
│       │   ├── DatabaseHandler.py
│       │   ├── LLMBasedCrawler.py
│       │   └── SummaryGenerator.py
│       ├── ui/
│       │   ├── chat_app.py
│       │   └── chat_ui.py
│       └── utils/
│           ├── content_validator.py
│           ├── data_manager.py
│           ├── helper_functions.py
│           ├── monitoring.py
│           └── progress.py
├── tests/
│   └── test_core/
│       ├── test_database_handler.py
│       ├── test_integration.py
│       ├── test_llm_based_crawler.py
│       └── test_summary_generator.py
├── .github/
│   └── workflows/
│       └── Push_to_hf.yaml
├── .gitignore
├── LICENSE
├── README.md
├── Docs
├── pyproject.toml
├── pytest.ini
└── setup_env.py
```

## 🧪 Testing

Run all tests
```
python -m pytest
```
_The tests include unit tests for core functionality and integration tests for end-to-end workflows._

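For orientation, here is what a test in the same `unittest` style might look like. It exercises the FAISS round-trip that the vector store relies on rather than CRAWLGPT's own classes, so treat it as an illustration of the suite's conventions, not a copy of an existing test file.

```python
# Illustrative test in the unittest style the suite uses (not from CRAWLGPT).
import unittest

import faiss
import numpy as np


class TestVectorRoundTrip(unittest.TestCase):
    def test_nearest_neighbor_is_itself(self):
        # Build a tiny in-memory index, the same structure the vector store uses.
        vectors = np.random.rand(8, 16).astype("float32")
        index = faiss.IndexFlatL2(16)
        index.add(vectors)

        # Searching with a stored vector should return that vector as the top hit.
        _, ids = index.search(vectors[:1], 1)
        self.assertEqual(int(ids[0][0]), 0)


if __name__ == "__main__":
    unittest.main()
```
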
## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- [Bug Tracker](https://github.com/Jatin-Mehra119/crawlgpt/issues)
- [Documentation](https://github.com/Jatin-Mehra119/crawlgpt/wiki)
- [Source Code](https://github.com/Jatin-Mehra119/crawlgpt)

## 🧡 Acknowledgments

- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.

## 👨‍💻 Author

- Jatin Mehra ([email protected])

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.

1. Fork the Project.
2. Create your Feature Branch:
   ```
   git checkout -b feature/AmazingFeature
   ```
3. Commit your Changes:
   ```
   git commit -m 'Add some AmazingFeature'
   ```
4. Push to the Branch:
   ```
   git push origin feature/AmazingFeature
   ```
5. Open a Pull Request.