How to Set Up BIELIK AI with vLLM and GGUF (Easy Guide)

February 9, 2025

BIELIK AI

The full version of BIELIK AI, speakleash/Bielik-11B-v2.3-Instruct, requires a minimum of 24 GB of VRAM.


 

In my case, I have a server with an NVIDIA RTX 4000 Ada Generation card (20GB VRAM).

 

So, I had to use the quantized version.

 

What does ‘quantized model’ mean?

- It means that the model's weights are stored at lower numerical precision, so the model requires far fewer hardware resources while trying to keep accuracy as close as possible to the full version.

 

- Analogy: it is like compressing a high-resolution photo to JPEG – you lose some detail but keep the essential visual information (see the short sketch below).
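To make this concrete, below is a minimal, purely illustrative sketch of symmetric 8-bit quantization in NumPy. It is not the exact block-wise scheme that Q8_0 uses in GGUF; it only shows how weights can be stored in fewer bits and approximately recovered.

import numpy as np

# A toy "weight matrix" in full precision (float32)
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric 8-bit quantization: map floats to int8 using a single scale factor
# (Q8_0 in GGUF works block-wise, with one scale per block, but the idea is the same)
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per value
dequantized = quantized.astype(np.float32) * scale      # approximate reconstruction

print("max absolute error:", np.abs(weights - dequantized).max())
print("size in bytes:", weights.nbytes, "->", quantized.nbytes)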

 

After a recommendation from Remigiusz Kinas, I decided to go with Q8_0.

 

Q8_0 requires approximately 12 GB of VRAM, which cuts the requirement almost in half.
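For context, here is the rough back-of-the-envelope arithmetic behind those numbers (weights only; the KV cache and runtime overhead add extra gigabytes on top):

# Rough VRAM estimate for the 11B-parameter weights alone
params = 11e9

fp16_gb = params * 2 / 1024**3   # ~2 bytes per weight at 16-bit precision
q8_gb = params * 1 / 1024**3     # ~1 byte per weight at Q8_0 (plus small per-block scales)

print(f"~{fp16_gb:.1f} GB at 16-bit vs ~{q8_gb:.1f} GB at 8-bit")
# -> roughly 20.5 GB vs 10.2 GB, consistent with the ~50% reduction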

 

Q8_0 also shows negligible accuracy loss, performing almost indistinguishably from the full version of the model in most scenarios (brilliant!).

 

According to the documentation, the context window of the Q8_0 version is the same as that of the full version, which is great.

 

I tried both vLLM and Ollama, and I will probably stick with vLLM. Why?

 

Ollama had issues allocating VRAM on my server; with vLLM that was not a problem.

 

Additionally, vLLM is said to be better optimized for serving.

 

Below is a list of steps for downloading and running the quantized version of BIELIK AI on your server.

 

1. I used Python 3.11.0.


2. I downloaded the .gguf file from Hugging Face.
 

wget https://huggingface.co/speakleash/Bielik-11B-v2.3-Instruct-GGUF/resolve/main/Bielik-11B-v2.3-Instruct.Q8_0.gguf

 

What is GGUF?

- It is a file format for quantized language models (originating from the llama.cpp ecosystem), allowing a significant reduction in model size while retaining most of the model's capability (see the sketch below).
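If you are curious what such a file contains, you can peek at its metadata from Python with the gguf package (pip install gguf). This is just a quick sketch, assuming the package's GGUFReader interface; it is not required for any of the steps below.

# pip install gguf
from gguf import GGUFReader

reader = GGUFReader("Bielik-11B-v2.3-Instruct.Q8_0.gguf")

# Print a few metadata keys (architecture, context length, quantization info, ...)
for name in list(reader.fields.keys())[:10]:
	print(name)

print("tensor count:", len(reader.tensors))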

 

3. Next, we install the vLLM package and Transformers.

 

pip install vllm transformers -U


The -U option in the pip install command stands for --upgrade, which means updating the package to the latest available version.
 

4. Our quantized version is in the .gguf format. To use vLLM with .gguf, we need to install Rust.

 

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

 

5. Add the Rust environment to your PATH.

 

source "$HOME/.cargo/env"

 

6. Start the vLLM server with our BIELIK GGUF file.

 

vllm serve ./Bielik-11B-v2.3-Instruct.Q8_0.gguf \
	--load-format gguf \
	--max-model-len 4096 \
	--served-model-name bielik-11b-Q8_0


What do the parameters mean?

  • --load-format gguf – tells vLLM that the weights are stored in the GGUF format

  • --max-model-len – caps the maximum context length in tokens, which also limits memory usage

  • --served-model-name – the name under which the model is exposed via the API (needed later in the Python code)


7. Test if our server is running using cURL.

 

curl -X POST http://localhost:8000/v1/chat/completions \
	-H "Content-Type: application/json" \
	-d '{
  	"model": "bielik-11b-Q8_0",
  	"messages": [{"role": "user", "content": "Opisz zasady zdrowego odżywiania"}]
	}'
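Besides the chat completion call above, you can also confirm that the model was registered under the expected name by querying the OpenAI-compatible /v1/models endpoint. A small sketch using only the Python standard library:

import json
import urllib.request

# Lists the models the vLLM server exposes; the "id" field should match
# the value passed via --served-model-name ("bielik-11b-Q8_0")
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
	data = json.load(resp)

for model in data["data"]:
	print(model["id"])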

 

8. Next, you can test it with a Python file.

- If you don't have the OpenAI Python package installed yet, run:

 

pip install openai


An example Python file where we connect to our local BIELIK AI:

 

from openai import OpenAI

# vLLM does not validate the API key, but the OpenAI client requires one to be set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
	model="bielik-11b-Q8_0",  # the name passed via --served-model-name
	messages=[{"role": "user", "content": "Opisz zasady zdrowego odżywiania"}]
)

print(response.choices[0].message.content)


- base_url is the address vLLM prints when it starts the server (http://localhost:8000/v1 by default). If needed, you can set the host and port explicitly by adding these flags to the vllm serve command from step 6:

--host 0.0.0.0 --port 8000


- model="bielik-11b-Q8_0", - is the name you provided when running vLLM

 

For future use, you can also set the temperature in the Python file:

 

response = client.chat.completions.create(
	model="bielik-11b-Q8_0",
	messages=[{"role": "user", "content": "Opisz zasady zdrowego odżywiania"}],
	temperature=0.7,  # Creativity control
)
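If you want token-by-token output (for example in a chat UI), the same OpenAI-compatible endpoint also supports streaming. A minimal sketch, reusing the same connection details as above:

from openai import OpenAI

# vLLM ignores the API key, but the OpenAI client requires one to be set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# stream=True returns chunks as they are generated instead of one final response
stream = client.chat.completions.create(
	model="bielik-11b-Q8_0",
	messages=[{"role": "user", "content": "Opisz zasady zdrowego odżywiania"}],
	temperature=0.7,
	stream=True,
)

for chunk in stream:
	delta = chunk.choices[0].delta.content
	if delta:  # some chunks carry no text (e.g., the final stop chunk)
		print(delta, end="", flush=True)
print()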

 

9. Alternatively, you can use Ollama instead of vLLM, but as mentioned above, it did not work for me because of the VRAM allocation issues.

 

ollama run SpeakLeash/bielik-11b-v2.3-instruct-imatrix:Q8_0

 

Ollama and BIELIK AI

 

Additionally, when you run ollama run SpeakLeash/bielik-11b-v2.3-instruct, it downloads the Q4_K_M quantization by default.


- BIELIK AI's GGUF models work with the OpenAI client class because the server exposes an OpenAI-compatible API, which makes coding in Python quite convenient.

 

Resources

 

- You can find more info here: Bielik AI
- Shoutout to Maciej Krystian Szymanski for introducing me to the BIELIK AI Discord community, which helped me a lot. I encourage you to join it too!

- Thank you once more, Remigiusz Kinas.

 

https://huggingface.co/speakleash/Bielik-11B-v2.3-Instruct

https://huggingface.co/speakleash/Bielik-11B-v2.3-Instruct-GGUF
 

I will test the model in business use cases in the following days.

 

If you have any questions, reach out to me! You can find me on LinkedIn!

 

Want to stay updated? Join my newsletter and get a weekly report on the most exciting industry news! 🚀