Running LLaMA 7B on an 8 GB RAM LattePanda Alpha
Recently, Meta's LLaMA model has become a hot topic of discussion, successfully attracting widespread attention in an era dominated by ChatGPT.
Some tech enthusiasts have already run the model on a Raspberry Pi. This raises the question: can it also run on the x86-based LattePanda Alpha? After some investigation, I have put together a step-by-step tutorial, in the hope that everyone can experience an offline large language model for themselves.
To begin with, let's review the specifications of the LattePanda Alpha. It is a powerful single-board computer equipped with an Intel Core m3-8100Y processor, 8 GB of LPDDR3 RAM, and 64 GB of eMMC storage. Its x86-64 architecture provides compatibility with a wide range of operating systems and software.
Now, let's dive into the tutorial for running the LLaMA 7B model on the LattePanda Alpha:
All experiments are conducted on Ubuntu 20.04.
Clone the git repository:
git clone http://github.com/ggerganov/llama.cpp
After cloning, the llama.cpp folder should contain the project's source files.
Change into the folder and run make to compile the C++ code:
cd llama.cpp
make
Create a models/ folder inside your llama.cpp directory and place the 7B folder, together with its sibling files and folders from the LLaMA download, directly in it.
If you don't have the LLaMA models yet, you can request access from Facebook through this form, or you can grab them via BitTorrent from the link in this pull request. Once you have the model files, your folder structure should look like this:
Now, install the dependencies needed by the Python conversion script.
pip install torch numpy sentencepiece
Note: We will convert and quantize the model on our local machine, since doing so on the LattePanda Alpha would simply take too long. We will then transfer the quantized model to the LattePanda Alpha for inference.
Before running the conversion scripts, models/7B/consolidated.00.pth should be a 13GB file.
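A truncated download is easy to miss, so it can be worth checking the file size before starting the conversion. The following is a minimal sketch, not part of llama.cpp: the helper name looks_complete and the 12 GB threshold are my own assumptions, chosen so that the expected 13GB file passes comfortably.

```python
import os

# Path taken from the tutorial. The threshold is an assumption: it only needs
# to sit safely below the expected ~13 GB size of the original weights.
MODEL_PATH = "models/7B/consolidated.00.pth"
MIN_BYTES = 12 * 10**9


def looks_complete(path, min_bytes=MIN_BYTES):
    """Return True if path is a regular file of at least min_bytes bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


if looks_complete(MODEL_PATH):
    print(f"{MODEL_PATH} looks complete")
else:
    print(f"{MODEL_PATH} is missing or smaller than expected")
```

Run it from the llama.cpp directory so that the relative path resolves.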
The script convert-pth-to-ggml.py converts the model into "ggml FP16 format":
python convert-pth-to-ggml.py models/7B/ 1
This should produce models/7B/ggml-model-f16.bin - another 13GB file.
Next, the second script "quantizes the model to 4-bits":
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
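To see why the 4-bit file is the one that can fit in 8 GB of RAM, here is a back-of-the-envelope estimate. The parameter count (about 6.7 billion) and the figure of roughly 5 bits per weight for the quantized format (4-bit values plus per-block scale factors) are my assumptions rather than numbers from the llama.cpp documentation, so treat the output as approximate.

```python
# ASSUMPTIONS (not from the original tutorial):
#   - LLaMA 7B has roughly 6.7e9 parameters
#   - the 4-bit format costs about 5 bits per weight once scale factors count
PARAMS = 6.7e9
RAM_GIB = 8  # LattePanda Alpha memory, from the specifications above


def estimate_gib(params, bits_per_weight):
    """Approximate size in GiB of `params` weights at the given precision."""
    return params * bits_per_weight / 8 / 2**30


fp16_gib = estimate_gib(PARAMS, 16)
q4_gib = estimate_gib(PARAMS, 5)

print(f"FP16 weights : ~{fp16_gib:.1f} GiB (fits in RAM: {fp16_gib < RAM_GIB})")
print(f"4-bit weights: ~{q4_gib:.1f} GiB (fits in RAM: {q4_gib < RAM_GIB})")
```

Under these assumptions the FP16 file is well beyond the board's memory, while the quantized file leaves a few gigabytes for the operating system and the inference context.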
Having created the ggml-model-q4_0.bin file, we can now copy it to the LattePanda Alpha and run the model there.
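Since the quantized model is a multi-gigabyte file, it may be worth confirming that the copy arrived intact. One way, sketched below with a hypothetical helper named sha256_of, is to compute a SHA-256 checksum on both machines and compare the two values by eye. It uses only Python's standard hashlib module.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks
    so that the whole model never has to sit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (run on both machines, then compare the printed values):
# print(sha256_of("models/7B/ggml-model-q4_0.bin"))
```

If the two digests differ, the transfer was incomplete or corrupted and the file should be copied again.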
Here's how to run it and pass a prompt:
./main -m ./models/7B/ggml-model-q4_0.bin \
  -t 8 \
  -n 128 \
  -p 'The meaning of life is'
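The thread count passed with -t may not match the number of hardware threads on your board. As a rough sketch, the command can be assembled from Python with the thread count taken from os.cpu_count(). The helper build_command is hypothetical, and it only uses the -m, -t, -n and -p flags shown above.

```python
import os
import subprocess


def build_command(model_path, prompt, n_tokens=128, threads=None):
    """Build the ./main argument list; default to the CPU's logical core count."""
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        "./main",
        "-m", model_path,
        "-t", str(threads),
        "-n", str(n_tokens),
        "-p", prompt,
    ]


def run_prompt(model_path, prompt):
    """Run llama.cpp from inside the llama.cpp directory (not called here)."""
    subprocess.run(build_command(model_path, prompt), check=True)


cmd = build_command("./models/7B/ggml-model-q4_0.bin", "The meaning of life is")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes. Whether the logical core count is actually the fastest setting is something to measure on your own board.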
This tutorial is based on Simon Willison's blog post. Please refer to it for details: https://til.simonwillison.net/llms/llama-7b-m2