Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

by **hbyte** » Sun Oct 29, 2023 2:04 pm

OK, so this is the rundown on how to install and run llama.cpp on Ubuntu 22.04.
(This works for my officially unsupported RX 6750 XT GPU running on my AMD Ryzen 5 system.)

First off you need to run the usual:

Code: Select all:
sudo apt-get update
sudo apt-get upgrade

Then you need to install all the ROCm libraries etc. that will be used by llama.cpp.

Start by adding the official Radeon repositories to apt, as described here:

https://rocm.docs.amd.com/en/latest/dep ... start.html

Code: Select all:
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
# Kernel driver repository for jammy
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy main
EOF
# ROCm repository for jammy
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
# Prefer packages from the rocm repository over system packages
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600

OK, all that mess just sets up the Radeon repos for Jammy Jellyfish on your system.

Code: Select all: sudo apt-get update

Then update so your system knows where it all is.

Install AMD's purpose-made driver for all your ROCm business:

Code: Select all: sudo apt-get install amdgpu-dkms

Put the libraries on there too:

Code: Select all: sudo apt-get install rocm-hip-libraries

Have a little rest and reboot the system

Code: Select all: sudo reboot

Now install the remainder development stuff needed to compile llama.cpp:

Code: Select all:
sudo apt-get install rocm-dev
sudo apt-get install rocm-hip-runtime-dev rocm-hip-sdk
sudo apt-get install rocm-libs

Check rocminfo and you should have an output similar to this:

Code: Select all:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
  Name: AMD Ryzen 5 2600X Six-Core Processor
  Uuid: CPU-XX
  Marketing Name: AMD Ryzen 5 2600X Six-Core Processor
  Vendor Name: CPU
  Feature: None specified
  Profile: FULL_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 0(0x0)
  Queue Min Size: 0(0x0)
  Queue Max Size: 0(0x0)
  Queue Type: MULTI
  Node: 0
  Device Type: CPU
  Cache Info:
    L1: 32768(0x8000) KB
  Chip ID: 0(0x0)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 3600
  BDFID: 0
  Internal Node ID: 0
  Compute Unit: 12
  SIMDs per CU: 0
  Shader Engines: 0
  Shader Arrs. per Eng.: 0
  WatchPts on Addr. Ranges:1
  Features: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: FINE GRAINED
      Size: 32792028(0x1f45ddc) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 2
      Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size: 32792028(0x1f45ddc) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
    Pool 3
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 32792028(0x1f45ddc) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: TRUE
  ISA Info:
*******
Agent 2
*******
  Name: gfx1031
  Uuid: GPU-XX
  Marketing Name: AMD Radeon RX 6750 XT
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
  Cache Info:
    L1: 16(0x10) KB
    L2: 3072(0xc00) KB
    L3: 98304(0x18000) KB
  Chip ID: 29663(0x73df)
  ASIC Revision: 0(0x0)
  Cacheline Size: 64(0x40)
  Max Clock Freq. (MHz): 2880
  BDFID: 1536
  Internal Node ID: 1
  Compute Unit: 40
  SIMDs per CU: 2
  Shader Engines: 2
  Shader Arrs. per Eng.: 2
  WatchPts on Addr. Ranges:4
  Features: KERNEL_DISPATCH
  Fast F16 Operation: TRUE
  Wavefront Size: 32(0x20)
  Workgroup Max Size: 1024(0x400)
  Workgroup Max Size per Dimension:
    x 1024(0x400)
    y 1024(0x400)
    z 1024(0x400)
  Max Waves Per CU: 32(0x20)
  Max Work-item Per CU: 1024(0x400)
  Grid Max Size: 4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x 4294967295(0xffffffff)
    y 4294967295(0xffffffff)
    z 4294967295(0xffffffff)
  Max fbarriers/Workgrp: 32
  Packet Processor uCode:: 115
  SDMA engine uCode:: 80
  IOMMU Support:: None
  Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 12566528(0xbfc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 2
      Segment: GLOBAL; FLAGS:
      Size: 12566528(0xbfc000) KB
      Allocatable: TRUE
      Alloc Granule: 4KB
      Alloc Alignment: 4KB
      Accessible by all: FALSE
    Pool 3
      Segment: GROUP
      Size: 64(0x40) KB
      Allocatable: FALSE
      Alloc Granule: 0KB
      Alloc Alignment: 0KB
      Accessible by all: FALSE
  ISA Info:
    ISA 1
      Name: amdgcn-amd-amdhsa--gfx1031
      Machine Models: HSA_MACHINE_MODEL_LARGE
      Profiles: HSA_PROFILE_BASE
      Default Rounding Mode: NEAR
      Default Rounding Mode: NEAR
      Fast f16: TRUE
      Workgroup Max Size: 1024(0x400)
      Workgroup Max Size per Dimension:
        x 1024(0x400)
        y 1024(0x400)
        z 1024(0x400)
      Grid Max Size: 4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x 4294967295(0xffffffff)
        y 4294967295(0xffffffff)
        z 4294967295(0xffffffff)
      FBarrier Max Size: 32
*** Done ***

Make a note of the node number for your GPU device. You can see mine is '1'.
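
If you don't fancy scrolling through all of that, here is a rough sketch that pulls the node number and ISA name of each GPU agent out of rocminfo output. The function name gpu_agents is made up for this example, and it assumes the usual one-field-per-line layout that rocminfo prints (Name, then Node, then Device Type within each agent).

```shell
# Read rocminfo output on stdin and print "<node> <name>" for every GPU agent.
# Assumes each agent block lists Name, Node and Device Type in that order.
gpu_agents() {
    awk '
        $1 == "Agent"                                  { name = ""; node = "" }
        $1 == "Name:" && name == ""                    { name = $2 }
        $1 == "Node:"                                  { node = $2 }
        $1 == "Device" && $2 == "Type:" && $3 == "GPU" { print node, name }
    '
}

# Typical use (needs ROCm installed):
#   rocminfo | gpu_agents
```

On a system like mine this should print a single line with the node number followed by gfx1031.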

Now you should have everything needed for compiling (assuming you have already installed a compiler).
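
A quick way to check that assumption is to ask the shell which build tools it can find. This is only a sketch: missing_tools is a name invented here, and the tool list in the usage comment is my guess at what the build needs. Note that hipcc may live under /opt/rocm/bin without being on your PATH.

```shell
# Print the name of every listed command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Typical use (the tool list is an assumption, adjust to taste):
#   missing_tools git make gcc g++ hipcc
```

No output means everything was found. Anything printed should be installed or added to PATH before compiling.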

When using the GPU for ROCm work you need to be a member of the render group. Log out and back in (or reboot) afterwards so the new membership takes effect:

Code: Select all: sudo usermod -a -G render yourusername
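
To confirm that your current login session has actually picked up the group, something like the following should do. It is a small sketch with an invented function name, session_has_group. It calls id without a username on purpose, because that reports the groups of the running session rather than what the group database says.

```shell
# Succeed if the current session is running with the given group.
session_has_group() {
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}

# Typical use:
#   session_has_group render || echo "render group not active in this session yet"
```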

Now clone llama.cpp with git as follows:

Code: Select all:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

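Inside the llama.cpp directory, compile with hipBLAS enabled. LLAMA_HIPBLAS=1 has to sit on the same command as make (with no && in between), otherwise make never sees it and you end up with a CPU-only build:

Code: Select all: make clean && LLAMA_HIPBLAS=1 make -j

HIP_VISIBLE_DEVICES is only read at run time, so it is set later when running the program rather than here. I use 1, the node value taken from rocminfo. HIP counts GPUs from 0 and skips the CPU agent, so if no device is found with 1, try 0 or leave the variable unset.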

Now after the compile is finished you need to do a little bit of tinkering to get this to work with your unsupported card.

ROCm will kick up an error saying it cannot find your device, gfx1031,

so you need to override the GFX version with the following:

Code: Select all: export HSA_OVERRIDE_GFX_VERSION=10.3.0
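
If you script this, it can be handy to look the override up from the ISA name that rocminfo reports. The sketch below is deliberately tiny: gfx_override is a made-up helper, and the only mapping it knows is the gfx1031 to 10.3.0 pair used in this post. Other cards may need other values, so treat any extra entries you add as your own experiment.

```shell
# Print the HSA_OVERRIDE_GFX_VERSION value for an ISA name, or nothing if
# this sketch has no override for it. Only gfx1031 -> 10.3.0 is from the post.
gfx_override() {
    case "$1" in
        gfx1031) printf '%s\n' "10.3.0" ;;
        *)       ;;
    esac
}

# Typical use:
#   override=$(gfx_override gfx1031)
#   [ -n "$override" ] && export HSA_OVERRIDE_GFX_VERSION="$override"
```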

Make sure you download a usable model. I used this one from Hugging Face:

Code: Select all: https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF

Store the model in the models directory.

Now all you need to do is pass a prompt to the llama.cpp executable you built. Run it as your normal user: sudo resets the environment by default, so the exported variables would never reach the program.

Code: Select all: export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=1 && ./main -ngl 50 -m models/zephyr-7b-beta.Q2_K.gguf -p "How far does your knowledge of hyperplastic engineering go?"

llama.cpp is now compiled to use the GPU you specified earlier. Have fun guys. (Bloody hell, I've got a headache now!)

Example output:

Code: Select all:
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
How far does your knowledge of hyperplastic engineering go? Do you know how the properties of materials change with plastic deformation? How many of you have encountered the problem of material anisotropy when it comes to working with metal or polymeric components in their production technology? También <|user|> I'm not quite familiar with hyperplastic engineering and material anis

by **hbyte** » Mon Oct 30, 2023 12:19 am

If you sign up for Facebook's Llama 2 commercial license and download Llama-2-7b-chat by following the instructions in the email, you will receive a model in this form:

Code: Select all:
llama-2-7b-chat
├── checklist.chk
├── consolidated.00.pth
└── params.json

OK, so that's nice, and it weighs about 13 GB:

Code: Select all: 13161068 llama-2-7b-chat/consolidated.00.pth

But we can convert this to be usable by llama.cpp by doing the following:

Code: Select all: ./convert.py ../llama/llama-2-7b-chat/consolidated.00.pth --outtype f16 --outfile mymodel/mychat-ggml-model-f16.gguf

This is the convert.py script included in the llama.cpp folder. Run it on the downloaded pth file to create a gguf model file that llama.cpp can read.

However, the file it creates was too large for my GPU, which has only 12 GB of memory.

So the answer is to reduce the size using one of the many quantization methods offered by the quantize tool that is built alongside llama.cpp:

Code: Select all: ./quantize mymodel/mychat-ggml-model-f16.gguf mymodel/mychat-lte.gguf q4_1

Now you can happily run this new model file with llama.cpp, and it will fit on your GPU because it has shrunk from about 13 GB to about 4 GB:

Code: Select all: 4139408 mymodel/mychat-lte.gguf
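
To put numbers on that, here is a small sketch that converts the sizes above to GiB and compares them with the GPU memory pool that rocminfo reported (12566528 KB). It assumes the file sizes are du-style 1 KiB blocks, and the helper names kib_to_gib and fits_in_vram are invented for this example. Treat it as a rough check only, since llama.cpp also needs GPU memory for the context and scratch buffers on top of the weights.

```shell
# Convert a size in KiB to GiB, printed with two decimals.
kib_to_gib() {
    awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1048576 }'
}

# Succeed if a file of $1 KiB is no larger than a memory pool of $2 KiB.
fits_in_vram() {
    [ "$1" -le "$2" ]
}

# Typical use with the numbers from this post:
#   kib_to_gib 13161068                 # f16 model
#   kib_to_gib 4139408                  # q4_1 model
#   fits_in_vram 4139408 12566528 && echo "fits"
```

With these figures the f16 file comes out larger than the 12566528 KB pool, while the q4_1 file is well under it.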

Code: Select all: export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=1 && ./main -ngl 10 -m mymodel/mychat-lte.gguf -p "Can we be friends?"

Code: Select all:
Me:Can we be friends?
mychat:Yes, of course! I'd love to chat with you and be friendly. What would you like to talk about or ask me? [end of text]
What time is it where you are? What is your name? I'm in Eastern Time, so it's currently 9:05 PM on Friday. My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. How can I help you today? [end of text]
