update
README.md
CHANGED
@@ -1,66 +1,12 @@
-
-
-
-
-
-
-
-
-
-
-
-
-- **Language Model**: Uses `Viet-Mistral/Vistral-7B-Chat`, a language model based on Mistral, with continued pretraining on Vietnamese for better generation performance.
-
-## Installation
-1. Clone the repository:
-```sh
-git clone https://github.com/quoctata2911/RAG-based-ChatBot-System.git
-```
-
-2. Navigate to the project directory:
-```sh
-cd RAG-Based-Chatbot-System
-```
-
-3. Install the required dependencies:
-```sh
-pip install -r requirements.txt
-```
-
-## Usage
-Upload your Word .docx documents into the data folder. Ensure that each document has been chunked using a special chunk marker separator as specified in the config.yaml file.
-
-1. Configure the chunk marker:
-- Open the `config.yaml` file located in the project directory.
-- Locate the parameter defining the chunk marker and adjust it as needed for your document segmentation requirements.
-
-2. Prepare the data:
-```sh
-python prepare_data.py
-```
-3. Run the chatbot:
-```sh
-python chat.py
-```
-
-## Project Structure
-- **prepare_data.py**: Script to preprocess and chunk documents, converting tables into HTML and segmenting them with chunk markers.
-- **chat.py**: Main script to run the chatbot system.
-
-## Models
-- **Embedding Model**: We use the `intfloat/multilingual-e5-small` model for generating embeddings. This model is particularly effective for Vietnamese text, outperforming other models in our benchmarks.
-
-- **Language Model**: The language model used is Vistral, a variant of the Mistral model that has been further pre-trained on Vietnamese text for improved performance in language generation tasks.
-
-## Benchmarking and Performance
-Through extensive benchmarking, the `intfloat/multilingual-e5-small` model has proven to be the best choice for Vietnamese embeddings, offering a balance of efficiency and performance. The Vistral model enhances language generation capabilities, ensuring the chatbot responds accurately and naturally in Vietnamese.
-
-## Contributions
-We welcome contributions to improve the RAG-ChatBot. Please fork the repository and create a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
-
-## License
-This project is licensed under the MIT License. See the LICENSE file for more details.
-
-## Contact
-For any questions or suggestions, please contact me at [email protected]
+---
+title: GPT2 Vietnamese
+emoji: 🚀
+colorFrom: gray
+colorTo: green
+sdk: gradio
+sdk_version: 4.21.0
+app_file: app.py
+pinned: false
+---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
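
For readers unfamiliar with the front matter that this commit adds to README.md, the following is a minimal sketch of how such a block can be separated from the README body and read as key/value pairs. The function name `parse_front_matter` is hypothetical and is not part of any Hugging Face tooling; the sketch uses only the standard library and treats every value as a plain string, whereas a real YAML parser would interpret `pinned: false` as a boolean.

```python
def parse_front_matter(text):
    """Split a README into (metadata dict, body) using '---' delimiters.

    Hypothetical helper for illustration only: it handles flat
    `key: value` lines and keeps all values as strings.
    """
    lines = text.splitlines()
    # Front matter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    # No closing delimiter: treat the whole text as body.
    return {}, text


# The README content added by this commit.
README = """---
title: GPT2 Vietnamese
emoji: 🚀
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 4.21.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
"""

meta, body = parse_front_matter(README)
print(meta["sdk"], meta["sdk_version"], meta["app_file"])
```

Here `sdk`, `sdk_version`, and `app_file` are the fields that name the runtime and the entry point, which is why the sketch prints them.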