<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Local LLMs on AI VOID</title><link>https://ai-blog.noorshomelab.dev/tags/local-llms/</link><description>Recent content in Local LLMs on AI VOID</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 06 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ai-blog.noorshomelab.dev/tags/local-llms/index.xml" rel="self" type="application/rss+xml"/><item><title>Local LLMs with any-llm (Ollama Integration)</title><link>https://ai-blog.noorshomelab.dev/any-llm-guide-2025/local-llms-ollama/</link><pubDate>Tue, 30 Dec 2025 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/any-llm-guide-2025/local-llms-ollama/</guid><description>&lt;h2 id="introduction-bringing-llms-home"&gt;Introduction: Bringing LLMs Home&lt;/h2&gt;
&lt;p&gt;Welcome back, future AI architect! So far in our &lt;code&gt;any-llm&lt;/code&gt; journey, we&amp;rsquo;ve largely focused on interacting with powerful cloud-based LLMs from providers like OpenAI, Anthropic, or Mistral. These services are incredible for their scale and performance, but what if you need more privacy, lower latency, or simply want to experiment without incurring API costs?&lt;/p&gt;
&lt;p&gt;This chapter is all about bringing the power of Large Language Models directly to your machine. We&amp;rsquo;ll dive into the exciting world of &lt;strong&gt;Local LLMs&lt;/strong&gt; and learn how to run them efficiently using a fantastic tool called &lt;strong&gt;Ollama&lt;/strong&gt;. Best of all, we&amp;rsquo;ll see how &lt;code&gt;any-llm&lt;/code&gt; seamlessly integrates with Ollama, allowing you to switch between local and cloud models with minimal code changes. Pretty neat, right?&lt;/p&gt;</description></item><item><title>AI Coding Tools 2026: The Developer&amp;#39;s Definitive Comparison</title><link>https://ai-blog.noorshomelab.dev/comparisons/ai-coding-tools-comparison-2026/</link><pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/comparisons/ai-coding-tools-comparison-2026/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The landscape of software development in 2026 is profoundly shaped by Artificial Intelligence. Developers are no longer just writing code; they are orchestrating intelligent agents, leveraging sophisticated models, and navigating an ecosystem where AI is deeply embedded in every stage of the development lifecycle. This rapid evolution presents both immense opportunities for productivity gains and significant challenges, particularly around data privacy, reliability, and integration into existing workflows.&lt;/p&gt;
&lt;p&gt;This comprehensive comparison aims to cut through the hype and provide an objective, data-driven analysis of the leading AI coding tools, IDE integrations, and underlying models available today. We will dissect their capabilities, evaluate their real-world impact on productivity, scrutinize their cost and performance characteristics, and, critically, examine their stance on code privacy and enterprise compliance.&lt;/p&gt;</description></item><item><title>LLM Quantization: Making Models Lean for Local Deployment</title><link>https://ai-blog.noorshomelab.dev/ai/llm-quantization-mastery/</link><pubDate>Fri, 22 Aug 2025 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/ai/llm-quantization-mastery/</guid><description>&lt;h1 id="llm-quantization-making-models-lean-for-local-deployment"&gt;LLM Quantization: Making Models Lean for Local Deployment&lt;/h1&gt;
&lt;h2 id="table-of-contents"&gt;Table of Contents&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="#introduction-the-need-for-lean-llms"&gt;Introduction: The Need for Lean LLMs&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#what-are-llms-and-why-are-they-so-large"&gt;What are LLMs and Why Are They So Large?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-challenge-of-local-deployment"&gt;The Challenge of Local Deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#enter-quantization-a-solution-for-resource-constrained-environments"&gt;Enter Quantization: A Solution for Resource-Constrained Environments&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#understanding-the-basics-what-is-quantization"&gt;Understanding the Basics: What is Quantization?&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#floating-point-numbers-fp32-in-llms"&gt;Floating-Point Numbers (FP32) in LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-concept-of-reduced-precision"&gt;The Concept of Reduced Precision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#analogy-from-high-definition-to-standard-definition"&gt;Analogy: From High-Definition to Standard-Definition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#benefits-of-quantization-size-speed-and-energy-efficiency"&gt;Benefits of Quantization: Size, Speed, and Energy Efficiency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-trade-off-accuracy-vs-efficiency"&gt;The Trade-Off: Accuracy vs. Efficiency&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#quantization-techniques-a-deep-dive"&gt;Quantization Techniques: A Deep Dive&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#post-training-quantization-ptq-vs-quantization-aware-training-qat"&gt;Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#symmetric-vs-asymmetric-quantization"&gt;Symmetric vs. Asymmetric Quantization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#per-tensor-vs-per-channel-quantization"&gt;Per-Tensor vs. Per-Channel Quantization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#common-quantization-bit-widths"&gt;Common Quantization Bit-Widths&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#8-bit-quantization-int8"&gt;8-bit Quantization (INT8)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-bit-quantization-int4"&gt;4-bit Quantization (INT4)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#other-bit-widths-eg-2-bit-3-bit-5-bit"&gt;Other Bit-Widths (e.g., 2-bit, 3-bit, 5-bit)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#specific-quantization-algorithms-and-formats"&gt;Specific Quantization Algorithms and Formats&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#gptq-general-purpose-parameter-quantization"&gt;GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#awq-activation-aware-weight-quantization"&gt;AWQ (Activation-aware Weight Quantization)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gguf-gpt-generated-unified-format-a-key-for-llamacpp-and-ollama"&gt;GGUF (GPT-Generated Unified Format): A Key for &lt;code&gt;llama.cpp&lt;/code&gt; and Ollama&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#gguf-quantization-types-q2_k-q3_k-q4_k-q5_k-q6_k-q8_0"&gt;GGUF Quantization Types (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#practical-implementation-quantizing-llms"&gt;Practical Implementation: Quantizing LLMs&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#using-bitsandbytes-for-quantization-aware-training-and-inference-pytorch"&gt;Using &lt;code&gt;bitsandbytes&lt;/code&gt; for Quantized Inference and Fine-tuning (PyTorch)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#installation"&gt;Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#loading-8-bit-models"&gt;Loading 8-bit Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#loading-4-bit-models-nf4"&gt;Loading 4-bit Models (NF4)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#integrating-with-hugging-face-transformers"&gt;Integrating with Hugging Face Transformers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fine-tuning-4-bit-models-qlora"&gt;Fine-tuning 4-bit Models (QLoRA)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#leveraging-llamacpp-and-gguf-for-cpu-friendly-inference"&gt;Leveraging &lt;code&gt;llama.cpp&lt;/code&gt; and GGUF for CPU-friendly Inference&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#introduction-to-llamacpp"&gt;Introduction to &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#building-llamacpp"&gt;Building &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#converting-models-to-gguf-format"&gt;Converting Models to GGUF Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#quantizing-gguf-models-with-llamacpps-quantize-tool"&gt;Quantizing GGUF Models with &lt;code&gt;llama.cpp&lt;/code&gt;&amp;rsquo;s &lt;code&gt;quantize&lt;/code&gt; tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#running-gguf-models-with-llamacpp"&gt;Running GGUF Models with &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#ollama-simplified-local-llm-deployment"&gt;Ollama: Simplified Local LLM Deployment&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#how-ollama-utilizes-gguf"&gt;How Ollama Utilizes GGUF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#downloading-and-running-quantized-models-with-ollama"&gt;Downloading and Running Quantized Models with Ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#creating-custom-modelfiles-for-quantized-models"&gt;Creating Custom Modelfiles for Quantized Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#evaluating-quantization-trade-offs"&gt;Evaluating Quantization Trade-offs&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#model-size-reduction"&gt;Model Size Reduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#inference-speed-latency"&gt;Inference Speed (Latency)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#accuracy-metrics-and-evaluation"&gt;Accuracy Metrics and Evaluation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#perplexity"&gt;Perplexity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#benchmark-tasks-eg-helm-mmlu"&gt;Benchmark Tasks (e.g., HELM, MMLU)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#qualitative-evaluation"&gt;Qualitative Evaluation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#hardware-considerations-cpu-vs-gpu"&gt;Hardware Considerations (CPU vs. GPU)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#choosing-the-right-quantization-scheme-for-your-use-case"&gt;Choosing the Right Quantization Scheme for Your Use Case&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#advanced-topics-and-future-directions"&gt;Advanced Topics and Future Directions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#dynamic-vs-static-quantization"&gt;Dynamic vs. Static Quantization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#mixed-precision-training-and-inference"&gt;Mixed-Precision Training and Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fine-grained-quantization-techniques"&gt;Fine-grained Quantization Techniques&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#emerging-quantization-research"&gt;Emerging Quantization Research&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion"&gt;Conclusion&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#recap-of-key-concepts"&gt;Recap of Key Concepts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-future-of-lean-llms"&gt;The Future of Lean LLMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-learning-resources"&gt;Further Learning Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id="1-introduction-the-need-for-lean-llms"&gt;1. Introduction: The Need for Lean LLMs&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) have reshaped fields ranging from natural language processing to creative content generation. Models like GPT-3, LLaMA, Mistral, and many others have demonstrated unprecedented capabilities in understanding and generating human-like text. This power comes at a significant cost, however: models with billions of parameters demand large amounts of memory and compute.&lt;/p&gt;</description></item><item><title>Local LLM Deployment: Mastering Ollama for Custom Fine-tuned Models</title><link>https://ai-blog.noorshomelab.dev/ai/llm-deployment-serving/</link><pubDate>Fri, 22 Aug 2025 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/ai/llm-deployment-serving/</guid><description>&lt;h1 id="llm-deployment-and-serving-local-mastering-ollama-for-custom-models"&gt;LLM Deployment and Serving (Local): Mastering Ollama for Custom Models&lt;/h1&gt;
&lt;hr&gt;
&lt;h2 id="1-introduction-the-power-of-local-llms"&gt;1. Introduction: The Power of Local LLMs&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) power a new generation of intelligent applications, from advanced chatbots to sophisticated code assistants. Many are accessed through cloud-based APIs, which raises concerns about data privacy, recurring costs, and internet dependency. This guide makes the case for deploying and serving LLMs locally, and covers how to understand, implement, and optimize local LLM inference. It places particular emphasis on &lt;strong&gt;Ollama&lt;/strong&gt;, a framework that simplifies this process for both pre-packaged and custom fine-tuned models.&lt;/p&gt;</description></item></channel></rss>