Sarashina2-7B BitsAndBytes 4-bit Quantized

This is a 4-bit quantized version of sbintuitions/sarashina2-7b using BitsAndBytes (NF4).

Model Description

  • Base Model: sarashina2-7b (7B parameters)
  • Quantization Method: BitsAndBytes (bitsandbytes library)
  • Quantization Type: NF4 (Normal Float 4-bit)
  • Double Quantization: Enabled
  • Compute dtype: bfloat16
  • Original Size: ~14.6 GB
  • Quantized Size: ~4-5 GB
  • Memory Reduction: ~70-75%
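As a rough check on the sizes above, the sketch below estimates NF4 storage from the QLoRA layout. That layout has three parts: 4-bit codes, one absmax constant per 64-weight block, and, with double quantization, 8-bit absmax constants plus one FP32 scale per 256 blocks. This works out to about 4.127 bits per weight instead of 4.5. The split between quantized linear-layer parameters and tensors left in 16-bit (`quantized_params`, `fp16_params`) is a hypothetical assumption, not read from this checkpoint, and the helper names are illustrative.

```python
# Rough NF4 storage estimate following the QLoRA layout (a sketch, not a measurement).

BLOCK_SIZE = 64       # weights per first-level absmax block
DQ_BLOCK_SIZE = 256   # first-level constants per second-level block (double quant)


def nf4_bits_per_param(double_quant: bool = True) -> float:
    """Average storage cost of one quantized weight, in bits."""
    if double_quant:
        # 8-bit absmax per 64 weights, plus one FP32 scale per 64*256 weights
        return 4 + 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * DQ_BLOCK_SIZE)
    # One FP32 absmax per 64 weights
    return 4 + 32 / BLOCK_SIZE


def estimate_size_gb(quantized_params: float, fp16_params: float,
                     double_quant: bool = True) -> float:
    """Quantized linear weights plus tensors kept in 16-bit, in GB (1e9 bytes)."""
    quantized_bytes = quantized_params * nf4_bits_per_param(double_quant) / 8
    fp16_bytes = fp16_params * 2  # 2 bytes per 16-bit value
    return (quantized_bytes + fp16_bytes) / 1e9


# Hypothetical split of ~7.3B parameters: 6.5B in quantized linear layers and
# 0.8B (e.g. embeddings) left in 16-bit. These numbers are assumptions.
print(f"estimated size: {estimate_size_gb(6.5e9, 0.8e9):.2f} GB")
```

With the assumed 6.5B/0.8B split, the estimate lands at roughly 4.95 GB, inside the ~4-5 GB range listed above.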

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "ronantakizawa/sarashina2-7b-4bit-bnb"

# Configure BitsAndBytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate text
prompt = "おはようございます、今日の天気は"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
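The `temperature` and `top_p` arguments passed to `generate` above can be illustrated with a simplified, stand-alone sketch. It applies temperature scaling and then nucleus (top-p) filtering to one step's logits. This is not the Transformers implementation, which applies these as logits processors on tensors and handles edge cases differently. `sampling_distribution` is a hypothetical helper for illustration only.

```python
import math
from typing import Dict, List


def sampling_distribution(logits: List[float], temperature: float = 0.7,
                          top_p: float = 0.95) -> Dict[int, float]:
    """Temperature scaling followed by nucleus (top-p) filtering.

    Returns {token_index: probability} for the surviving tokens, renormalized to sum to 1.
    """
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens; sampling then draws from this distribution.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}
```

A lower temperature concentrates probability on the top tokens before the top-p cut. So `temperature=0.7, top_p=0.95` trims the low-probability tail while still sampling.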

Installation

pip install transformers accelerate bitsandbytes
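After installing, you can confirm which versions ended up in the environment with the standard-library `importlib.metadata` module (Python 3.8+). `torch` is included in the list because the usage example imports it; it is normally pulled in as a dependency of `accelerate`. `installed_versions` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distributions the usage example relies on
REQUIRED = ("transformers", "accelerate", "bitsandbytes", "torch")


def installed_versions(packages: Iterable[str] = REQUIRED) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:>13}: {ver or 'NOT INSTALLED'}")
```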

Requirements

  • CUDA GPU: BitsAndBytes requires CUDA (not compatible with CPU)
  • GPU Memory: ~5-6 GB VRAM recommended
  • Python: 3.8+
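You can check the requirements above before downloading the weights with a small preflight sketch. It assumes PyTorch is the CUDA backend and inspects only the first GPU (device 0). The 5 GB threshold is the lower end of the VRAM recommendation. `preflight` is a hypothetical helper.

```python
import sys
from typing import Dict, Optional, Union

MIN_PYTHON = (3, 8)
MIN_VRAM_GB = 5.0  # lower end of the ~5-6 GB recommendation


def preflight() -> Dict[str, Optional[Union[bool, float]]]:
    """Check the Python version and, if PyTorch is importable, the first CUDA GPU."""
    report: Dict[str, Optional[Union[bool, float]]] = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "cuda_available": False,
        "vram_gb": None,
        "vram_ok": False,
    }
    try:
        import torch
    except ImportError:
        return report  # without PyTorch there is no CUDA check to run
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)  # first GPU only
        report["cuda_available"] = True
        report["vram_gb"] = props.total_memory / 1e9  # bytes -> GB
        report["vram_ok"] = report["vram_gb"] >= MIN_VRAM_GB
    return report


if __name__ == "__main__":
    print(preflight())
```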

Performance

  • Memory Usage: Reduced by ~70-75% compared to FP16
  • Inference Speed: Comparable to FP16 on modern GPUs
  • Quality: Minimal accuracy loss with NF4 quantization
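You can test the speed claim on your own hardware by timing generation for both this model and an FP16 load of the base model. A minimal sketch follows. `tokens_per_second` is a hypothetical helper that times any callable returning the number of newly generated tokens. The commented usage assumes the `model`, `inputs`, and `torch` objects from the Usage section.

```python
import time
from typing import Callable


def tokens_per_second(generate_once: Callable[[], int], repeats: int = 3,
                      warmup: int = 1) -> float:
    """Average throughput of `generate_once`, which returns its number of new tokens."""
    for _ in range(warmup):
        generate_once()  # the first call can include one-off CUDA setup costs
    total_tokens, total_time = 0, 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens += generate_once()
        total_time += time.perf_counter() - start
    return total_tokens / total_time


# Example usage (assumes `model`, `inputs`, and `torch` from the Usage section):
# def run():
#     out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
#     torch.cuda.synchronize()  # ensure GPU work has finished before timing stops
#     return out.shape[-1] - inputs["input_ids"].shape[-1]
# print(f"{tokens_per_second(run):.1f} tokens/s")
```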

Advantages of BitsAndBytes

  • No calibration required - quantizes on model load
  • Easy to use - single configuration parameter
  • Widely compatible - works with most Hugging Face models
  • Double quantization - additional memory savings
  • NF4 quantization - optimized for neural network weights
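To make the NF4 point concrete, here is a simplified pure-Python sketch of blockwise absmax quantization onto the 16 NF4 levels published with QLoRA. It is not the bitsandbytes kernel. It keeps one full-precision scale per block (no double quantization of those scales) and does not pack two 4-bit codes into a byte. It also finds the nearest level by brute force. The function names are illustrative.

```python
from typing import List, Tuple

# The 16 NF4 levels from QLoRA: normal-distribution quantiles rescaled to [-1, 1],
# with an exact zero.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_nf4(weights: List[float], block_size: int = 64) -> Tuple[List[int], List[float]]:
    """Blockwise absmax NF4: returns 4-bit codes (0-15) and one scale per block."""
    codes: List[int] = []
    absmaxes: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # avoid 0/0 for an all-zero block
        absmaxes.append(absmax)
        for w in block:
            x = w / absmax  # normalized into [-1, 1]
            codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return codes, absmaxes


def dequantize_nf4(codes: List[int], absmaxes: List[float], block_size: int = 64) -> List[float]:
    """Map each code back to its NF4 level and rescale by its block's absmax."""
    return [NF4_LEVELS[c] * absmaxes[i // block_size] for i, c in enumerate(codes)]
```

Each block is scaled by its own absmax, so the largest-magnitude weight in a block is reconstructed exactly, and so is zero. Every other weight is rounded to the nearest of the 16 levels.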

Limitations

  • Requires CUDA GPU (no CPU support)
  • May have slight quality degradation compared to full precision
  • Cannot export to ONNX or other formats

License

MIT License (inherited from base model)

Citation

@misc{sarashina2-7b-bnb,
  author = {Ronan Takizawa},
  title = {Sarashina2-7B BitsAndBytes 4-bit Quantized},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ronantakizawa/sarashina2-7b-4bit-bnb}}
}

Base Model Citation

Please refer to the original model card for the base model citation.

Safetensors · Model size: 7B params · Tensor types: F16, F32, U8

Model tree for ronantakizawa/sarashina2-7b-4bit-bnb: Quantized (6), including this model