I have always wanted to run a Large Language Model (LLM) on my local machine. Benefits? More control, less censorship, no privacy concerns, and no API costs. Also, it may come in handy when internet access is limited or unavailable.

My go-to device is a MacBook Air M3 with 16 GB of RAM, so I expected it to handle a small model with ease. After some research, I decided to go with Llama3.1 8B, but there are other options such as Mistral 7B, Gemma2 9B, and Qwen2.5 7B, which I may try later. The full list can be found in the Ollama model library (https://ollama.com/library).

Now let's see how we can get an LLM up and running. For the installation process, I won't waste your time with unnecessary details: just follow the steps below.

Installation & setup

  1. Install Ollama (https://ollama.com)
  2. In your terminal, run
    • ollama pull llama3.1
    • ollama pull nomic-embed-text
  3. Install the Page Assist extension (only available for Firefox and Chromium-based browsers)
  4. Customize the extension
    • General Settings
      • Perform Simple Internet Search: ON
    • RAG Settings
      • Embedding Model: nomic-embed-text:latest
      • Chat with website using vector embeddings: ON

Note: Ollama is all you need to chat with an LLM (via a command-line interface), so you can stop at step 2. But you'll want Page Assist for a web interface and extra features.
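To sanity-check the setup, list the models you pulled; the server should also answer on its default port 11434:

ollama list
curl http://localhost:11434/api/version

If everything went well, ollama list shows both llama3.1 and nomic-embed-text.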

Usage

Chat with an LLM

In your terminal, run

ollama run llama3.1

You can then enter your questions and get answers interactively.

Under the hood, the CLI makes requests to a localhost server. This means you can write scripts that call the API endpoints directly, or have other apps call them. By default, the Ollama server starts when you log into your machine.
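For example, here is a minimal request to the generate endpoint; setting stream to false returns one JSON response instead of a stream of tokens:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'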

The CLI is fine, but if you prefer a GUI, open your browser and activate the Page Assist extension by clicking its icon or using the shortcut cmd + shift + L.

I find chatting via the CLI faster than the web interface, though.

Do web searches

Page Assist can help you find relevant information in documents (including web pages) thanks to a technique called Retrieval-Augmented Generation (RAG). I will talk about the technical details later; for now, let's learn how to make use of it.

To configure RAG in Page Assist, go to Settings > RAG Settings. We already chose nomic-embed-text:latest as the embedding model; you can leave the other settings at their defaults.

After that, you can toggle on Search Internet (next to the globe icon) to allow web searches during chats.

To tweak the settings for Web Search, go to Settings > General Settings > Manage Web Search.

Manage knowledge

First, go to the Manage Knowledge tab and click Add New Knowledge to upload your documents.

After the files are processed, open a new chat and select the new knowledge via the Knowledge button. You can then ask the LLM about the specific topics or keywords in the documents that you're interested in.

Chat with web pages

In the same way you can chat with files, you can also chat with web pages.

Hit cmd + shift + P to open a side panel that allows you to ask the LLM about the current web page you are on. Remember to check the Chat with current page checkbox first.

Digging deeper

RAG & embeddings

As mentioned before, RAG is used to enable interaction with documents and web pages. For RAG to effectively retrieve relevant information from a large corpus based on a user's query, it requires a method to convert textual information into a numerical representation. Vector embeddings provide exactly that capability. As an example, the sentence "Dogs are playing in the park." can be encoded as a vector like [0.32, -0.47, 0.19, 0.85, -0.22, 0.03, ...].
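You can produce such an embedding yourself with the model we pulled earlier. Here is a quick sketch using Ollama's embeddings endpoint (the numbers above are made up; the actual values will differ):

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Dogs are playing in the park."
}'

The response is a JSON object whose embedding field contains the vector.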

The embeddings and other data are stored in Page Assist's local storage, which can be inspected by running await chrome.storage.local.get() in Chrome's Console.

The embedding data is stored in the vector:pa_knowledge_* fields.

For large documents, creating embeddings may take quite some time. Adding the Python 3.13 documentation to my knowledge base, for instance, triggered a long series of embedding requests to the Ollama server.

I haven't tried it yet, but it's possible to reduce the processing time by adjusting the RAG settings.

Anyway, loading such a large number of embeddings into memory can cause performance issues. In fact, the whole Python documentation was so extensive that I had to remove some files during embedding creation for the extension to run smoothly 🫤.

There is a discussion on supporting custom databases besides chrome.storage to store and retrieve knowledge more efficiently, but it hasn't been implemented yet. You can take a look at https://github.com/n4ze3m/page-assist/issues/135 for more information.

Conclusion

That's it! The installation process is straightforward, and the results are quite decent. The local LLM is perfect for tasks like grammar-checking, answering quick questions, or even coding queries.

One thing I haven't explored much is integrating Ollama with other apps. There are a lot of possibilities here, such as hooking it up to a code editor for AI code completion.
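As a starting point, Ollama also exposes an OpenAI-compatible endpoint, so many tools that speak the OpenAI API can simply be pointed at it. A minimal sketch (the prompt is just an illustration):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Complete this function: def add(a, b):"}]
  }'

I'd also like to check out other models to see how they compare in terms of quality and performance. More posts to come!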


Notes

  • My machine's specs: Apple M3 chip with 8-core CPU, 10-core GPU, and 16 GB of RAM.
  • Running inference and creating embeddings can consume considerable resources. For example, when Llama3.1 is generating a response, I noticed a surge of about 4 GB in RAM usage.
  • Another issue I noticed is that the LLM runs slower with longer context windows.
  • If you want a vision LLM, try llama3.2-vision, an 11B-parameter model from Meta (see the example after this list).
    • It is much slower than the text-only models such as llama3.1 and llama3.2.
    • It outperforms llava 7b on many types of images, especially text-heavy screenshots.
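Here is a minimal sketch of trying it out; the image path is a placeholder, and the Ollama CLI picks up local image files referenced in the prompt:

ollama pull llama3.2-vision
ollama run llama3.2-vision "Describe this image: /path/to/screenshot.png"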