
Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

Posted: Sun Oct 29, 2023 2:04 pm
by hbyte
OK, so this is the rundown on how to install and run llama.cpp on Ubuntu 22.04.
(This works for my officially unsupported RX 6750 XT GPU running on my AMD Ryzen 5 system)

First off you need to run the usual:

Code: Select all
sudo apt-get update
sudo apt-get upgrade


Then you need to install all the ROCm libraries and tools that will be used by llama.cpp.

Start by adding the official Radeon repositories to apt, as described here:

https://rocm.docs.amd.com/en/latest/dep ... start.html

Code: Select all
sudo mkdir --parents --mode=0755 /etc/apt/keyrings

wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
   
# Kernel driver repository for jammy
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy main
EOF
# ROCm repository for jammy
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
# Prefer packages from the rocm repository over system packages
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600



OK, all that mess just sets up the Radeon repos for Jammy Jellyfish on your system.

Code: Select all
sudo apt-get update


And update so your system knows where it all is.
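
If you want to confirm the pin took effect, check where apt will pull the ROCm packages from; the candidate should be repo.radeon.com:

Code: Select all
apt-cache policy rocm-hip-libraries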

Install AMD's purpose-made kernel driver for all your ROCm business:

Code: Select all
sudo apt-get install amdgpu-dkms
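
You can check that the DKMS module built and installed cleanly before going further:

Code: Select all
dkms status | grep amdgpu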


Put the libraries on there too:

Code: Select all
sudo apt-get install rocm-hip-libraries


Have a little rest and reboot the system

Code: Select all
sudo reboot


Now install the remaining development packages needed to compile llama.cpp:

Code: Select all
sudo apt-get install rocm-dev
sudo apt-get install rocm-hip-runtime-dev rocm-hip-sdk
sudo apt-get install rocm-libs
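
A quick check that the HIP toolchain landed where expected:

Code: Select all
/opt/rocm/bin/hipcc --version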


Run rocminfo and you should see output similar to this:

Code: Select all
ROCk module is loaded
=====================   
HSA System Attributes   
=====================   
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                             
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                 
Agent 1                 
*******                 
  Name:                    AMD Ryzen 5 2600X Six-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 2600X Six-Core Processor
  Vendor Name:             CPU                               
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                             
  Node:                    0                                 
  Device Type:             CPU                               
  Cache Info:             
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                 
  Internal Node ID:        0                                 
  Compute Unit:            12                                 
  SIMDs per CU:            0                                 
  Shader Engines:          0                                 
  Shader Arrs. per Eng.:   0                                 
  WatchPts on Addr. Ranges:1                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED       
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED     
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
  ISA Info:               
*******                 
Agent 2                 
*******                 
  Name:                    gfx1031                           
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6750 XT             
  Vendor Name:             AMD                               
  Feature:                 KERNEL_DISPATCH                   
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                         
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                   
  Queue Type:              MULTI                             
  Node:                    1                                 
  Device Type:             GPU                               
  Cache Info:             
    L1:                      16(0x10) KB                       
    L2:                      3072(0xc00) KB                     
    L3:                      98304(0x18000) KB                 
  Chip ID:                 29663(0x73df)                     
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2880                               
  BDFID:                   1536                               
  Internal Node ID:        1                                 
  Compute Unit:            40                                 
  SIMDs per CU:            2                                 
  Shader Engines:          2                                 
  Shader Arrs. per Eng.:   2                                 
  WatchPts on Addr. Ranges:4                                 
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                       
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                       
    y                        1024(0x400)                       
    z                        1024(0x400)                       
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                       
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 115                               
  SDMA engine uCode::      80                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED     
      Size:                    12566528(0xbfc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       FALSE                             
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    12566528(0xbfc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       FALSE                             
    Pool 3                   
      Segment:                 GROUP                             
      Size:                    64(0x40) KB                       
      Allocatable:             FALSE                             
      Alloc Granule:           0KB                               
      Alloc Alignment:         0KB                               
      Accessible by all:       FALSE                             
  ISA Info:               
    ISA 1                   
      Name:                    amdgcn-amd-amdhsa--gfx1031         
      Machine Models:          HSA_MACHINE_MODEL_LARGE           
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                       
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                       
        y                        1024(0x400)                       
        z                        1024(0x400)                       
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             



Make a note of the node number for your GPU device. You can see mine is '1'.
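
If you don't fancy scrolling through that whole dump again, a quick grep (assuming the standard rocminfo layout above) pulls out just the agents and their names:

Code: Select all
rocminfo | grep -E 'Agent [0-9]+|^ *Name:|Marketing Name:'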

Now you should have all the necessary stuff for compiling (assuming you have already installed a compiler).

When using the GPU to do ROCm stuff you need to be a member of the render group (some setups also want you in the video group):

Code: Select all
sudo usermod -a -G render yourusername
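
The group change only takes effect on a fresh login, so log out and back in (or use newgrp render), then check that you can see the compute device nodes:

Code: Select all
groups                            # should now include 'render'
ls -l /dev/kfd /dev/dri/renderD*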


Now clone llama.cpp with git:

Code: Select all
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp


Compile with the HIP backend enabled. Note that HIP_VISIBLE_DEVICES (the node value you took from rocminfo) is a runtime setting rather than a build flag, so we'll export it later when running; the switch that matters at build time is LLAMA_HIPBLAS=1:

Code: Select all
make clean && LLAMA_HIPBLAS=1 make -j
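
A quick sanity check that the HIP backend really made it into the binary (it should link against ROCm/HIP libraries):

Code: Select all
ldd ./main | grep -iE 'hip|roc'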


Now, after the compile is finished, you need to do a little bit of tinkering to get this to work with your unsupported card.

ROCm will kick up an error saying it does not support your device, gfx1031,

so you need to override the GFX version so the runtime treats the card as a supported gfx1030:

Code: Select all
export HSA_OVERRIDE_GFX_VERSION=10.3.0
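
If you don't want to type that every session, make it permanent (assuming you use bash):

Code: Select all
echo 'export HSA_OVERRIDE_GFX_VERSION=10.3.0' >> ~/.bashrc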


Make sure you download a usable model. I have used this one from Hugging Face:

Code: Select all
https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF


Store the model in the models directory.
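
For example, to fetch the Q2_K file used below straight into models/ (this assumes the usual Hugging Face resolve-URL layout):

Code: Select all
mkdir -p models
wget -P models https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q2_K.gguf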

Now all you need to do is specify a prompt to use with the llama.cpp executable you created. -ngl sets how many layers to offload to the GPU, and now that you're in the render group there's no need for sudo (plain sudo would strip the exported variables anyway):

Code: Select all
export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=1 && ./main -ngl 50 -m models/zephyr-7b-beta.Q2_K.gguf -p "How far does your knowledge of hyperplastic engineering go?"


llama.cpp is now compiled to use the GPU you specified earlier. Have fun guys. (Bloody hell, I've got a headache now!)

Example output:
Code: Select all
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
   repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
   top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
   mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


How far does your knowledge of hyperplastic engineering go? Do you know how the properties of materials change with plastic deformation? How many of you have encountered the problem of material anisotropy when it comes to working with metal or polymeric components in their production technology?
 También
<|user|>
I'm not quite familiar with hyperplastic engineering and material anis

Re: Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

Posted: Mon Oct 30, 2023 12:19 am
by hbyte
If you sign up for the Llama 2 commercial license from Meta (Facebook) and download Llama-2-7b-chat by following the instructions in the email, you will receive a model of this form:

Code: Select all
llama-2-7b-chat
├── checklist.chk
├── consolidated.00.pth
└── params.json


OK, so that's nice, and it weighs about 13 GB:

Code: Select all
13161068        llama-2-7b-chat/consolidated.00.pth


But we can convert this to be usable by llama.cpp by doing the following:

Code: Select all
./convert.py ../llama/llama-2-7b-chat/consolidated.00.pth --outtype f16 --outfile mymodel/mychat-ggml-model-f16.gguf


This is the convert.py script included in the llama.cpp folder; run it on the downloaded .pth file to create a GGUF model file that llama.cpp can read.
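
convert.py needs a few Python packages (numpy, sentencepiece and friends); if it complains about missing imports, install them from the requirements file that ships with llama.cpp:

Code: Select all
pip install -r requirements.txt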

However, the file it creates was too large for my GPU, which has only 12 GB of memory.
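
The sizes make sense if you do some rough back-of-the-envelope arithmetic:

Code: Select all
# f16:   7e9 params * 2 bytes             ~= 14 GB   (hence the ~13 GB file)
# q4_1:  7e9 params * ~5 bits per weight  ~= 4.4 GB  (4-bit values plus per-block scale/min)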

So the answer is to reduce the size using one of the many quantization formats supported by llama.cpp's quantize tool:

Code: Select all
./quantize mymodel/mychat-ggml-model-f16.gguf mymodel/mychat-lte.gguf q4_1
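
You can see the full list of quantization types (q4_0, q5_1, q8_0 and so on) by running the tool with --help; it prints its usage, including the supported types:

Code: Select all
./quantize --help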


Now you can happily run this new model file with llama.cpp, and it will fit on your GPU because it has shrunk from 13 GB to about 4 GB:

Code: Select all
4139408 mymodel/mychat-lte.gguf


Code: Select all
export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=1 && ./main -ngl 10 -m mymodel/mychat-lte.gguf -p "Can we be friends?"


Code: Select all
Me: Can we be friends?

mychat:Yes, of course! I'd love to chat with you and be friendly. What would you like to talk about or ask me? [end of text]

What time is it where you are? What is your name?

I'm in Eastern Time, so it's currently 9:05 PM on Friday. My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. How can I help you today? [end of text]