Imagine your development teams using AI to review pull requests, generate design documents, and answer architecture questions, all without a single byte of code leaving your infrastructure. No surprise cloud bills, no vendor lock-in, and full control over data and performance. That is exactly what you get when you host open source Large Language Models (LLMs) on your own VM or laptop using tools like Ollama.
If you are a director or manager responsible for software delivery, this guide will walk you through:
- What open source LLMs are and why they matter for your business
- Why hosting LLMs on a VM or laptop is often a smart move
- How to install and use Ollama step by step
- How to configure key parameters and fine-tune behavior
- How to measure and improve performance on local hardware
- How to choose from 10 popular open source LLMs for real work
- Typical challenges and how teams are solving them in practice
The goal is simple: give you enough clarity so that you can confidently say to your team, “Yes, let us run our own LLMs. Here is how and why.”
What Are Open Source LLMs and Why Do We Need Them?
An LLM is a Large Language Model: a neural network trained on massive amounts of text so that it can generate and understand human-like language. Models like GPT, Llama, Mistral, and DeepSeek can write code, summarize documents, explain APIs, and even generate test cases.
Open source LLMs are models whose weights and licenses are openly available. Examples include:
- Llama 3 (Meta)
- Mistral 7B / Mixtral (Mistral AI)
- DeepSeek (DeepSeek AI)
- Gemma (Google)
- Phi-3 (Microsoft)
- Qwen (Alibaba)
- StarCoder2 (Hugging Face and partners)
Compared to closed SaaS models, open source LLMs give you:
- Control: You decide where it runs, how it is configured, and what it is allowed to see.
- Cost transparency: No per-token pricing; you pay only for compute and storage.
- Customization: You can shape the model’s behavior via system prompts, parameters, and in some cases custom fine-tuning.
- Compliance and data residency: Data can stay on-prem or within your preferred region.
Real-world example: Code review automation on local LLMs
A mid-size embedded software company in Canada set up a DeepSeek-based model on their internal Kubernetes cluster to assist with code reviews. Engineers trigger the model from their CI pipeline; the LLM runs inside their VPC and never sends code outside. They reduced manual review time by roughly 30 percent and cut external AI API spend by more than half. You can read a similar performance-tuning story in the Ollama Performance Tuning Guide (linked in the references below).
Why Host LLMs on a VM or Laptop?
You might ask: with so many cloud AI services around, why bother running a model locally on a VM or laptop? In many organizations, these reasons are strong enough:
- Data privacy and IP protection Source code, design docs, customer data, and logs stay inside your network. This matters for regulated industries, export control, and strict customer contracts.
- Cost control Instead of paying per prompt or per million tokens, you reuse existing hardware. For teams that heavily use AI for code, documentation, and support, this can be a big saving.
- Latency and reliability Local models avoid internet round-trips and external outages. For internal tools like chatbots, IDE assistants, or CI helpers, this can make the experience feel instant and reliable.
- Experimentation freedom Your teams can compare different models, tweak parameters, and prototype internal agents without waiting for approvals from external vendors.
In practice, many teams start on a powerful developer laptop, then move to a shared VM or on-prem cluster when adoption grows.
Tools for Running Local LLMs: Why Ollama Stands Out
There are several ways to run open source LLMs locally. The most common tools today include:
- Ollama - Simple CLI and API for Mac, Linux, and Windows.
- LM Studio - Desktop UI focused on ease of use and GPU acceleration.
- LocalAI - Open source API compatible with OpenAI, good for containerized deployments.
- vLLM - High-throughput server engine, great for production backends.
For day-to-day use on laptops and developer VMs, Ollama is a great starting point because:
- Installation is easy on all major operating systems.
- It supports GPU acceleration where available.
- It exposes a simple REST API that mimics the style of popular cloud LLM APIs.
- Models are managed with simple commands like `pull`, `run`, and `list`.
Installing Ollama: Step-by-Step on macOS, Linux, and Windows
1. Install Ollama on Linux (for example Ubuntu 22.04 or 24.04)
- Open a terminal.
- Run the official install script: `curl -fsSL https://ollama.com/install.sh | sh`
- Verify the installation: `ollama --version`
- If needed, start the Ollama service: `ollama serve`
2. Install Ollama on macOS
- Go to https://ollama.com.
- Download the macOS installer (.pkg) and run it.
- After installation, open the Terminal app and verify: `ollama --version`
- If the background service is not running, start it: `ollama serve`
3. Install Ollama on Windows
- Visit https://ollama.com.
- Download the Windows installer (.exe) and run it.
- Open Command Prompt or PowerShell and verify: `ollama --version`
Once installed, the experience is the same on all platforms. You can now pull and run models.
First Steps with Ollama: Pulling and Running a Model
Let us start with a popular and reasonably light model such as llama3.
- Pull the model: `ollama pull llama3`
- Start an interactive chat session: `ollama run llama3`
- Type your prompt, for example: `Explain DevOps to a non-technical manager in 3 short bullet points.`
- Press Enter and observe the result.
You now have a working LLM running on your own hardware. For many leaders, seeing this running locally is a turning point because it shows how close AI is to their existing infrastructure and teams.
Understanding Key Ollama Parameters (The “Basics” You Will Actually Use)
Ollama lets you control how the model behaves through parameters. You can set them:
- In a Modelfile (to create your own model flavor)
- Interactively via `/set` when using `ollama run`
- Through the REST API in the `options` field
Here are the most commonly used parameters:
| Parameter | What it controls | Typical range | When to adjust |
|---|---|---|---|
| `num_ctx` | Maximum context window (how much text the model can “see” at once) | 4096 to 16384 | Increase for long documents or multi-file code, decrease to save memory |
| `temperature` | Creativity and randomness of responses | 0.1 to 1.0 | Lower for deterministic answers, higher for brainstorming or creative writing |
| `top_p` | Nucleus sampling; how much of the probability mass is considered | 0.7 to 1.0 | Lower values make answers more focused, higher values more diverse |
| `top_k` | Top-K tokens considered at each step | 20 to 100 | Reduce to make output more stable; increase for more variety |
| `num_predict` | Maximum number of tokens the model should generate | e.g. 256, 512, 1024 | Increase for longer answers, decrease to avoid runaway responses |
| `repeat_penalty` | How strongly the model is discouraged from repeating itself | 1.0 to 1.2 | Increase if you see repeated sentences, decrease if the model forgets context too fast |
| `seed` | Random seed for reproducibility | Any integer | Set a fixed value to get repeatable outputs for testing and comparison |
| `num_gpu` | How many layers of the model are offloaded to GPU | 0 (CPU only) or “max” | Increase if you have enough VRAM to speed up generation significantly |
Using Modelfiles: Fine-Tuning Behavior Without Heavy Training
Strictly speaking, “fine-tuning” in the research sense means updating the model weights with new training data. Ollama focuses more on configuration-level fine-tuning:
- Fixing a system role or personality
- Preloading instructions or domain context
- Setting defaults for parameters like temperature, num_ctx, and so on
This is often enough for many internal use cases such as:
- DevOps assistant that always answers in your tools and terminology
- Code review helper that follows your coding standards
- Support assistant that responds in your company’s tone of voice
Example Modelfile for a DevOps Assistant
FROM llama3
# Set default behavior and parameters
PARAMETER temperature 0.4
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
SYSTEM """
You are a senior DevOps and platform engineering expert.
You answer with clear, practical steps.
You use examples based on Linux, Kubernetes, and CI/CD pipelines.
"""
TEMPLATE """
User question:
{{ .Prompt }}
Your answer:
"""
Save this as Modelfile, then build and run it:
ollama create devops-assistant -f Modelfile
ollama run devops-assistant
From this point on, anyone who runs devops-assistant gets consistently focused responses tuned for DevOps use cases.
No complex GPU training is involved, just smart use of instructions and parameters.
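Once the custom model is built, it is addressable like any other model over Ollama's HTTP API. The sketch below is a minimal Python client, assuming Ollama is serving on its default `localhost:11434` and that the `devops-assistant` model from the Modelfile above has already been created; the helper names (`build_chat_request`, `extract_reply`, `ask`) are our own, not part of any library.

```python
"""Query a custom Ollama model through the /api/chat endpoint."""
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, question: str) -> dict:
    # stream=False asks for one complete JSON reply instead of chunks.
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }


def extract_reply(response: dict) -> str:
    # A non-streaming /api/chat response carries the answer
    # under message.content.
    return response["message"]["content"]


def ask(question: str, model: str = "devops-assistant") -> str:
    """Send one question to the local Ollama server and return the answer."""
    body = json.dumps(build_chat_request(model, question)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `ask("How do I roll back a failed Kubernetes deploy?")` returns the assistant's text; swap the model name to target any other local model.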
How to Add and Adjust Parameters in Practice
1. In a Modelfile (recommended for reusable models)
Use the PARAMETER directive:
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
After editing the Modelfile, rebuild the model:
ollama create my-custom-model -f Modelfile
2. Interactively with /set
- Run your model: `ollama run llama3`
- Inside the interactive session, type:
  `/set parameter temperature 0.2`
  `/set parameter num_ctx 4096`
- Continue the conversation; the new settings apply immediately.
3. Via the REST API
When calling the API, you can include an options object to override parameters:
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"prompt": "Summarize this deployment pipeline:",
"options": {
"temperature": 0.3,
"num_ctx": 8192,
"top_p": 0.9
}
}'
This is useful when you want different parts of your system (for example, a chatbot and a summarizer service) to use different parameter settings against the same base model.
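One way to keep those per-service settings manageable is to layer service-specific overrides on shared defaults. The sketch below shows that pattern in Python; the profile names and values are illustrative assumptions, not recommendations.

```python
"""Per-service option profiles for one shared Ollama base model."""

# Shared defaults for the whole platform.
DEFAULT_OPTIONS = {"temperature": 0.3, "num_ctx": 8192, "top_p": 0.9}

# Per-service overrides layered on top of the defaults.
SERVICE_PROFILES = {
    "chatbot":    {"temperature": 0.7, "num_predict": 512},
    "summarizer": {"temperature": 0.2, "num_predict": 256},
}


def options_for(service: str) -> dict:
    """Merge platform defaults with a service's overrides."""
    return {**DEFAULT_OPTIONS, **SERVICE_PROFILES.get(service, {})}


def generate_payload(service: str, prompt: str, model: str = "llama3") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": options_for(service),
    }
```

Each service POSTs its own payload to `/api/generate`, so the chatbot stays conversational while the summarizer stays terse, all against the same loaded model.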
How to Test and Benchmark Local LLM Performance
Once your teams start using local LLMs, the next questions arrive quickly:
- How fast is this model on our hardware?
- How many concurrent users can we support?
- What is the impact of model size or quantization on latency?
You do not need a complex benchmarking framework to get useful signals. Start with a simple approach:
Step 1: Use a Fixed Prompt
Pick a realistic prompt that represents your workload. For example:
Review this Kubernetes deployment YAML and list 5 risks.
Use the same prompt across all tests so that comparisons are fair.
Step 2: Measure Response Time
On Linux or macOS, you can use the time command:
time ollama run llama3 "Review this Kubernetes deployment YAML and list 5 risks."
Run it a few times and take the average. If you change parameters such as num_ctx or temperature,
repeat the same measurement.
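Beyond wall-clock time, a non-streaming `/api/generate` response includes Ollama's own timing fields: `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), so you can compute throughput directly. A small helper, assuming those two fields from the API response:

```python
"""Tokens-per-second from Ollama's eval_count / eval_duration fields."""


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's nanosecond generation timing into tokens/sec."""
    return eval_count / (eval_duration_ns / 1_000_000_000)


# Example: 120 tokens generated in 3 seconds of eval time.
print(tokens_per_second(120, 3_000_000_000))  # → 40.0
```

Tokens/sec is often a fairer comparison than total response time, because it is independent of how long each answer happens to be.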
Step 3: Scripted Benchmark Loop
For a slightly more systematic test, you can use a small shell script:
#!/bin/bash
MODEL="llama3"
PROMPT="Explain blue-green deployment in 150 words."
for i in {1..5}; do
echo "Run $i"
time ollama run "$MODEL" "$PROMPT" > /dev/null
echo
done
This gives you a feel for average response time under consistent conditions. If you want to check CPU and GPU utilization while doing this, monitor:
- `top` or `htop` for CPU
- `nvidia-smi` for NVIDIA GPUs
Step 4: Compare Models and Settings
Use the same script but vary:
- Model (for example `llama3` vs `mistral`)
- Quantization level (for example full precision vs 4-bit quantized variants)
- Context size (`num_ctx`)
Capture results in a simple spreadsheet so your team can see the trade-offs: smaller models are faster and lighter, bigger models are smarter but need more resources.
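The capture step can be automated too. The sketch below turns per-run wall-clock times (as collected by the shell loop above) into one CSV row per configuration, ready to paste into a spreadsheet; the configuration labels and timings are placeholders.

```python
"""Aggregate benchmark timings into one comparable CSV row per config."""
import csv
import statistics
import sys


def summarize(config: str, seconds: list[float]) -> dict:
    """Mean/min/max response time for one model + settings combination."""
    return {
        "config": config,
        "runs": len(seconds),
        "mean_s": round(statistics.mean(seconds), 2),
        "min_s": round(min(seconds), 2),
        "max_s": round(max(seconds), 2),
    }


# Placeholder timings; replace with your measured runs.
rows = [
    summarize("llama3 q4 num_ctx=4096", [6.1, 5.8, 6.4, 6.0, 5.9]),
    summarize("mistral q4 num_ctx=4096", [4.9, 5.1, 4.8, 5.0, 5.2]),
]

writer = csv.DictWriter(sys.stdout, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
```

Keeping min and max alongside the mean also exposes warm-up effects: the first run after a model loads is usually much slower than the rest.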
Checklist of Fine-Tuning and Performance Parameters
Here is a compact checklist you can share with your team when configuring models in Ollama, LM Studio, or similar tools.
| Parameter / Setting | Purpose | Typical Starting Value | Notes |
|---|---|---|---|
| `num_ctx` | Max context length | 4096 for normal use, 8192+ for large docs | Higher values require more RAM or VRAM |
| `temperature` | Creativity level | 0.2 for deterministic, 0.7 for creative | Lower for code and factual answers |
| `top_p` | Controls diversity of token sampling | 0.9 | Reduce if outputs feel too random |
| `top_k` | Number of candidate tokens per step | 40 | Lower for stability, higher for variety |
| `num_predict` | Maximum length of generated output | 256–512 | Increase for long answers, decrease to improve latency |
| `repeat_penalty` | Discourages repetition | 1.1 | Increase if you see loops in text |
| `num_gpu` | GPU offload level | Automatic or maximum | Use GPU if you have enough VRAM for a big speed-up |
| Model quantization | Reduces size and memory usage | 4-bit where possible | Small quality loss, big performance gain for many tasks |
| Hardware threads | Parallel CPU usage | Number of physical cores | More threads can improve speed up to a point |
How to Make Locally Hosted LLMs Run Faster
Assuming your model works correctly but feels slow, here are the most effective levers to pull.
1. Choose the Right Model Size
- On a laptop, start with 7B or 8B parameter models (for example Llama 3 8B, Mistral 7B, Phi-3 Mini).
- On a strong VM with a good GPU, you can move to 14B or even 30B level models.
- Very large models (70B and above) are usually better suited to servers with high-end GPUs.
2. Use Quantized Models
Quantization compresses the model weights into smaller data types, such as 4-bit. This:
- Reduces memory and VRAM usage
- Often improves token generation speed
- Introduces only modest quality loss for many practical tasks
In practice, a quantized 7B model can feel very responsive on a modern laptop while still giving high quality answers.
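A rough rule of thumb makes the memory impact concrete: weight storage is approximately parameter count times bits per weight. The estimate below deliberately ignores KV cache and runtime overhead, which add more on top, so treat it as a lower bound rather than a vendor figure.

```python
"""Back-of-envelope memory estimate for quantized model weights."""


def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (using 1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# An 8B model: 16-bit weights need about 16 GB, 4-bit about 4 GB,
# which is why a quantized 7B-8B model fits on an ordinary laptop.
print(weight_memory_gb(8, 16))  # → 16.0
print(weight_memory_gb(8, 4))   # → 4.0
```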
3. Leverage GPU Acceleration
If your laptop or VM has a GPU (NVIDIA, AMD, or Apple Silicon), enable GPU offloading. This can reduce latency from seconds
to fractions of a second, especially for larger models. In Ollama this is usually automatic, but you can adjust
num_gpu or similar options based on your VRAM.
4. Tune Context and Output Length
- Do not use a huge context window if you are only asking short questions.
- Limit output length with `num_predict` unless you truly need multi-page answers.
- Structured prompts (clear instructions, bullet points) often allow shorter but more useful outputs.
5. Optimize Your VM or Host
- Allocate enough RAM to the VM (16 GB or more for comfort with modern models).
- Ensure CPU virtualization features are enabled in the BIOS or hypervisor.
- For GPU passthrough, follow your hypervisor’s documentation carefully (for example VMware, Proxmox, or Hyper-V).
In consulting projects, a very common pattern is this: teams start with a general-purpose model on default settings, then see dramatic improvements simply by moving to a smaller quantized model with GPU enabled and tighter context/output limits.
Popular Open Source LLMs and Where They Shine
The “best” model depends on your use case and hardware. Here is a high-level comparison for ten well-known open source models. Sizes and use cases are simplified to help with decision-making.
| Model | Approx. Size (parameters) | Best For | Typical Hardware |
|---|---|---|---|
| Llama 3 8B | Around 8B | General chat, coding help, documentation | Modern laptop or VM with 16 GB+ RAM |
| Llama 3 70B | Around 70B | High quality reasoning and complex tasks | Server-class GPU, large memory footprint |
| Mistral 7B | Around 7B | Fast, strong general-purpose assistant | Laptop or VM, good performance even without GPU |
| Mixtral 8x7B | Mixture-of-Experts, behaves like larger model | Better reasoning than simple 7B models | Desktop or server with capable GPU |
| Phi-3 Mini | Small to medium | Education, explanation, light coding | Entry-level laptops and compact VMs |
| Gemma 2 | Variants such as 9B | General-purpose tasks, summarization | Laptop with enough RAM or mid-range GPU |
| DeepSeek (general) | Medium to large | Strong reasoning, technical explanations | VM or server with good GPU for best results |
| DeepSeek Coder | Medium to large | Code generation and review | Developer laptops, CI servers |
| Qwen 2 7B | Around 7B | Multilingual and general tasks | Laptop or VM, runs well with quantization |
| StarCoder2 7B | Around 7B | Code completion and analysis | Developer workstations, coding assistants |
For many software organizations, a practical strategy is:
- Start with a 7B or 8B model like Llama 3 8B or Mistral 7B for general internal use.
- Add a code-focused model such as DeepSeek Coder or StarCoder2 for development workflows.
- Evaluate a larger model (for example Llama 3 70B) on a central server if you need more advanced reasoning.
Challenges Teams Face and How to Overcome Them
1. Hardware Constraints
Not every team has a spare GPU server. This is why right-sizing models and using quantization is so important. Teams often discover that a well-tuned 7B model is “good enough” for most internal tasks and runs comfortably on existing laptops or mid-range VMs.
2. Model Selection Overload
There are many models and versions, which can be confusing. A simple rule of thumb is:
- One general-purpose model for chat and documentation.
- One code-focused model for engineering teams.
- Optional domain-specific model if you have a niche language or industry.
3. Governance and Safety
Even when models run locally, you still need rules: what kind of data can be given to the model, which outputs are acceptable, and how to log and review usage. Start small with internal tools and gradually expand usage once you are comfortable with the behavior and controls.
4. Integration into Existing Workflows
A local LLM is most valuable when it sits inside existing tools: IDEs, Slack, Jira, CI/CD pipelines. Ollama’s simple HTTP API makes it relatively straightforward for your teams to plug the model into these systems.
Future Outlook: Local LLMs and Agentic Workflows
Over the next few years, expect more of your tools to quietly embed local or hybrid LLMs:
- IDE extensions that ship with local models for offline coding help.
- Monitoring and SRE tools that use on-host LLMs to explain incidents.
- Agent-like systems that orchestrate multiple tools (for example Git, CI/CD, ticketing) with a local LLM at the core.
As hardware continues to improve and models become more efficient, the line between “cloud AI” and “local AI” will blur. Many organizations will run a mix: small, fast models locally for everyday tasks, and larger models in the cloud for heavy analysis when needed.
Key Takeaways for Software Leaders
- Open source LLMs give you control, privacy, and cost transparency.
- Hosting models on a VM or laptop is realistic with today’s tooling and hardware.
- Ollama provides a clean path from experiment to internal API with simple commands.
- Most of the “fine-tuning” you need is smart configuration, not heavy training.
- Performance comes from the right model size, quantization, GPU usage, and sensible parameters.
- Start small with clear use cases such as internal chat, code review, and documentation support.
If you guide your teams with the right guardrails, local LLMs can become a powerful part of your software delivery platform, not just a toy demo.
Ready to Explore This for Your Organization?
If you are thinking, “This sounds promising, but my teams are busy and I do not want a science experiment,” you are not alone. Many leaders feel the same. The good news is that a small, focused pilot is often enough to prove value and build confidence.
If you want help designing and implementing a practical, 90-day roadmap for running LLMs on your own laptops and VMs, especially in DevOps and platform engineering environments, reach out here: https://stonetusker.com/contact-us/
References and Further Reading
- Ollama Documentation https://ollama.com
- Ollama Model Library https://ollama.com/library
- Open LLM Leaderboard (Hugging Face) https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- Top Open-Source LLMs 2026 - DataCamp https://www.datacamp.com/blog/top-open-source-llms
- Ollama Performance Tuning Guide https://dasroot.net/posts/2026/01/ollama-performance-tuning-gpu-acceleration-model-quantization/
- Denis Rothman, Transformers for Natural Language Processing, Packt Publishing
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O’Reilly Media



