DACMini-IT / README.md

Update README.md

4bc3ac7 verified about 1 month ago

9.14 kB

	---
	license: mit
	datasets:
	- Mattimax/DATA-AI_Conversation_ITA
	language:
	- it
	base_model:
	- Mattimax/DACMini
	library_name: transformers
	tags:
	- DAC
	- DATA-AI
	- data-ai
	---

	[![HuggingFace](https://img.shields.io/badge/HuggingFace-Mattimax-brightgreen)](https://huggingface.co/Mattimax)
	[![M.INC](https://img.shields.io/badge/M.INC-Labs-blue)](https://huggingface.co/MINC01)

	# Mattimax/DACMini-IT

	![Logo di DACMini](https://huggingface.co/Mattimax/DACMini/resolve/main/DACMini_Logo/DACMini_Logo.png)

	* Autore: [Mattimax](https://huggingface.co/Mattimax)
	* Organizzazione: [M.INC](https://huggingface.co/MINC01)
	* Licenza: MIT

	---

	## Descrizione

	DACMini-IT è un modello di linguaggio compatto e instruction tuned per chat e dialogo in lingua italiana.
	Basato sull’architettura GPT-2 Small (italian adaptation), è progettato per essere rapido, leggero e facilmente distribuibile su dispositivi con risorse limitate.

	Rispetto a DACMini “base”, DACMini-IT è addestrato su dataset italiani conversazionali strutturati in formato user-assistant, ottimizzando la capacità di seguire istruzioni e gestire conversazioni multi-turno naturali.

	---

	## Dimensioni e caratteristiche tecniche

	* Parametri: 109M
	* Architettura: GPT-2 Small (italian adaptation)
	* Lunghezza massima del contesto: 512 token
	* Numero di strati: 12
	* Numero di teste di attenzione: 12
	* Dimensione embedding: 768
	* Vocabolario: ~50.000 token
	* Quantizzazione: supportata (8-bit / 4-bit opzionale con `bitsandbytes`)

	---

	## Dataset di addestramento

	Addestrato su [Mattimax/DATA-AI_Conversation_ITA](https://huggingface.co/datasets/Mattimax/DATA-AI_Conversation_ITA), un dataset italiano di dialoghi instruction tuned, contenente coppie prompt-response strutturate per favorire risposte coerenti, naturali e grammaticalmente corrette.

	---

	## Obiettivi

	* Chatbot in lingua italiana con capacità di seguire istruzioni.
	* Risposte concise, chiare e naturali in contesti multi-turno.
	* Applicazioni leggere o offline dove la dimensione del modello è un vincolo.

	---

	## Avvertenze e limitazioni

	* Modello sperimentale: può produrre errori logici o risposte non pertinenti.
	* Non addestrato su temi sensibili o contenuti specialistici.
	* Prestazioni limitate su conversazioni molto lunghe o prompt complessi.
	* Non destinato ad usi commerciali senza ulteriore validazione.

	---

	## Uso consigliato

	* Applicazioni chatbot leggere o offline in italiano.
	* Prototipazione e test di pipeline NLP italiane.
	* Generazione di risposte sintetiche e dataset per training o valutazione.

	---

	## Codice per inferenza di esempio

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# 1. Carica modello e tokenizer addestrati
	model_path = "Mattimax/DACMini-IT"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForCausalLM.from_pretrained(model_path)
	model.eval()

	# 2. Funzione di generazione
	def chat_inference(prompt, max_new_tokens=150, temperature=0.7, top_p=0.9):
	# Costruisci input nel formato usato in training
	formatted_prompt = f"<\|user\|> {prompt.strip()} <\|assistant\|>"

	# Tokenizza
	inputs = tokenizer(formatted_prompt, return_tensors="pt")

	# Genera risposta
	with torch.no_grad():
	output = model.generate(
	**inputs,
	max_new_tokens=max_new_tokens,
	temperature=temperature,
	top_p=top_p,
	do_sample=True,
	pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id
	)

	# Decodifica e rimuovi prompt iniziale
	generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
	response = generated_text.split("<\|assistant\|>")[-1].strip()
	return response

	# 3. Esempio d’uso
	if __name__ == "__main__":
	while True:
	user_input = input("👤 Utente: ")
	if user_input.lower() in ["exit", "quit"]:
	break
	response = chat_inference(user_input)
	print(f"🤖 Assistant: {response}\n")
	````

	## Referenze

	* Dataset: [Mattimax/DATA-AI_Conversation_ITA](https://huggingface.co/datasets/Mattimax/DATA-AI_Conversation_ITA)
	* Modello di base: [DACMini](https://huggingface.co/Mattimax/DACMini)
	* Organizzazione: [M.INC](https://huggingface.co/MINC01)
	* Collezione: [Little_DAC Collection](https://huggingface.co/collections/Mattimax/little-dac-collection-68e11d19a5949d08e672b312)

	## Citazione

	Se utilizzi Mattimax/DACMini-IT in un progetto, un articolo o qualsiasi lavoro, ti chiediamo gentilmente di citarlo usando il file `CITATION.bib` incluso nel repository:

	```bibtex
	@misc{mattimax2025dacminiit,
	title = {{Mattimax/DACMini-IT}: Un modello di linguaggio open source},
	author = {Mattimax},
	howpublished = {\url{https://huggingface.co/Mattimax/DACMini-IT}},
	year = {2025},
	note = {License: MIT. Se usi questo modello, per favore citane la fonte originale.}
	}
	```
	---

	# English version

	## Description

	DACMini-IT is a compact, instruction-tuned language model for Italian chat and dialogue.
	Based on the GPT-2 Small (Italian adaptation) architecture, it is designed to be fast, lightweight, and easily deployable on low-resource devices.

	Compared to the “base” DACMini, DACMini-IT is trained on Italian conversational datasets structured in user-assistant format, optimizing its ability to follow instructions and handle natural multi-turn conversations.

	---

	## Size and technical specs

	* Parameters: 109M
	* Architecture: GPT-2 Small (Italian adaptation)
	* Max context length: 512 tokens
	* Number of layers: 12
	* Number of attention heads: 12
	* Embedding size: 768
	* Vocabulary: ~50,000 tokens
	* Quantization: supported (optional 8-bit / 4-bit via `bitsandbytes`)

	---

	## Training dataset

	Trained on [Mattimax/DATA-AI_Conversation_ITA](https://huggingface.co/datasets/Mattimax/DATA-AI_Conversation_ITA), an Italian instruction-tuned conversational dataset containing structured prompt-response pairs designed to promote coherent, natural, and grammatically correct answers.

	---

	## Objectives

	* Italian-language chatbot with instruction-following capabilities.
	* Concise, clear, and natural responses in multi-turn contexts.
	* Lightweight or offline applications where model size is a constraint.

	---

	## Warnings and limitations

	* Experimental model: may produce logical errors or irrelevant answers.
	* Not trained on sensitive topics or specialized content.
	* Limited performance on very long conversations or complex prompts.
	* Not intended for commercial use without further validation.

	---

	## Recommended use

	* Lightweight or offline Italian chatbot applications.
	* Prototyping and testing of Italian NLP pipelines.
	* Synthetic response generation and datasets for training or evaluation.

	---

	## Example inference code

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# 1. Load trained model and tokenizer
	model_path = "Mattimax/DACMini-IT"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForCausalLM.from_pretrained(model_path)
	model.eval()

	# 2. Generation function
	def chat_inference(prompt, max_new_tokens=150, temperature=0.7, top_p=0.9):
	# Build input in the format used during training
	formatted_prompt = f"<\|user\|> {prompt.strip()} <\|assistant\|>"

	# Tokenize
	inputs = tokenizer(formatted_prompt, return_tensors="pt")

	# Generate response
	with torch.no_grad():
	output = model.generate(
	**inputs,
	max_new_tokens=max_new_tokens,
	temperature=temperature,
	top_p=top_p,
	do_sample=True,
	pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id
	)

	# Decode and remove initial prompt
	generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
	response = generated_text.split("<\|assistant\|>")[-1].strip()
	return response

	# 3. Usage example
	if __name__ == "__main__":
	while True:
	user_input = input("👤 User: ")
	if user_input.lower() in ["exit", "quit"]:
	break
	response = chat_inference(user_input)
	print(f"🤖 Assistant: {response}\n")
	```

	---

	## References

	* Dataset: [Mattimax/DATA-AI_Conversation_ITA](https://huggingface.co/datasets/Mattimax/DATA-AI_Conversation_ITA)
	* Base model: [DACMini](https://huggingface.co/Mattimax/DACMini)
	* Organization: [M.INC](https://huggingface.co/MINC01)
	* Collection: [Little_DAC Collection](https://huggingface.co/collections/Mattimax/little-dac-collection-68e11d19a5949d08e672b312)

	---

	## Citation

	If you use Mattimax/DACMini-IT in a project, paper, or any work, please cite it using the `CITATION.bib` file included in the repository:

	```bibtex
	@misc{mattimax2025dacminiit,
	title = {{Mattimax/DACMini-IT}: An open-source language model},
	author = {Mattimax},
	howpublished = {\url{https://huggingface.co/Mattimax/DACMini-IT}},
	year = {2025},
	note = {License: MIT. If you use this model, please cite the original source.}
	}
	```