<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLM Inference on AI VOID</title><link>https://ai-blog.noorshomelab.dev/tags/llm-inference/</link><description>Recent content in LLM Inference on AI VOID</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 30 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ai-blog.noorshomelab.dev/tags/llm-inference/index.xml" rel="self" type="application/rss+xml"/><item><title>Crafting Robust LLM Inference Pipelines</title><link>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/crafting-llm-inference-pipelines/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/crafting-llm-inference-pipelines/</guid><description>&lt;h2 id="introduction-from-training-to-production-ready-llms"&gt;Introduction: From Training to Production-Ready LLMs&lt;/h2&gt;
&lt;p&gt;Welcome back, future MLOps architect! In our previous chapters, we laid the groundwork for understanding LLMOps and the unique challenges of working with Large Language Models. We&amp;rsquo;ve seen how crucial it is to manage the lifecycle of these powerful models. Now, it&amp;rsquo;s time to shift our focus from &lt;em&gt;training&lt;/em&gt; these behemoths to &lt;em&gt;serving&lt;/em&gt; them efficiently and reliably in a production environment.&lt;/p&gt;
&lt;p&gt;Deploying LLMs for inference comes with its own set of fascinating challenges. Unlike traditional machine learning models, LLMs are often massive, requiring significant computational resources (especially GPUs) and memory. They also generate output token by token, which demands careful handling of latency and throughput. This chapter is your guide to building robust, scalable, and cost-efficient LLM inference pipelines. We&amp;rsquo;ll break down the journey a user&amp;rsquo;s prompt takes, from initial input to final response, exploring each critical stage and how to optimize it.&lt;/p&gt;</description></item><item><title>Smart Caching Strategies for Cost-Efficient LLM Inference</title><link>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/caching-strategies-llm-inference/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/caching-strategies-llm-inference/</guid><description>&lt;h2 id="smart-caching-strategies-for-cost-efficient-llm-inference"&gt;Smart Caching Strategies for Cost-Efficient LLM Inference&lt;/h2&gt;
&lt;p&gt;Welcome back, fellow MLOps enthusiasts! In our previous chapters, we&amp;rsquo;ve explored the foundations of LLMOps, set up robust inference pipelines, and learned how to dynamically route requests to different models. Now, it&amp;rsquo;s time to tackle one of the biggest challenges in production LLM systems: managing the high computational cost and latency associated with large language models.&lt;/p&gt;
&lt;p&gt;This chapter is all about &lt;strong&gt;caching&lt;/strong&gt;. You&amp;rsquo;ll discover how implementing smart caching strategies can dramatically reduce your GPU usage, lower inference costs, and significantly improve the responsiveness of your LLM applications. We&amp;rsquo;ll dive deep into different types of caches, understand &lt;em&gt;why&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; they work, and explore their practical applications in real-world scenarios. Get ready to supercharge your LLM deployments!&lt;/p&gt;</description></item><item><title>Dynamic Model Routing and A/B Testing for LLMs</title><link>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/dynamic-model-routing-ab-testing/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/dynamic-model-routing-ab-testing/</guid><description>&lt;h2 id="introduction-navigating-the-llm-model-maze"&gt;Introduction: Navigating the LLM Model Maze&lt;/h2&gt;
&lt;p&gt;Welcome back, MLOps engineers, data scientists, and developers! In our previous chapters, we&amp;rsquo;ve explored the foundational concepts of LLMOps and started to build robust inference pipelines. We learned that getting an LLM to production is only the first step; managing it effectively is where the real challenge lies.&lt;/p&gt;
&lt;p&gt;Large Language Models are not static entities. They evolve rapidly, with new versions, architectures, and fine-tunes emerging constantly. How do we introduce these new models to users without risking system stability or user experience? How do we compare the performance, cost-efficiency, and quality of different models in a real-world setting? This is where &lt;strong&gt;dynamic model routing&lt;/strong&gt; and &lt;strong&gt;A/B testing&lt;/strong&gt; come into play.&lt;/p&gt;</description></item><item><title>Monitoring and Observability for Production LLMs</title><link>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/monitoring-observability-production-llms/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/monitoring-observability-production-llms/</guid><description>&lt;h2 id="monitoring-and-observability-for-production-llms"&gt;Monitoring and Observability for Production LLMs&lt;/h2&gt;
&lt;p&gt;Welcome back, fellow MLOps engineers and data scientists! In our previous chapters, we&amp;rsquo;ve explored the exciting world of building robust LLM inference pipelines, optimizing them for GPU usage, implementing smart caching strategies, and designing for scalability. We&amp;rsquo;ve laid a strong foundation, but there&amp;rsquo;s a crucial piece missing: How do we &lt;em&gt;know&lt;/em&gt; if our systems are actually performing as expected in the wild? How do we catch issues before our users do?&lt;/p&gt;</description></item><item><title>TurboQuant vs. GGUF &amp;amp; INT8/INT4 Quantization: Complete Comparison 2026</title><link>https://ai-blog.noorshomelab.dev/comparisons/turboquant-gguf-int8-int4-quantization-comparison-2026/</link><pubDate>Mon, 30 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/comparisons/turboquant-gguf-int8-int4-quantization-comparison-2026/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The rapid growth of Large Language Models (LLMs) has brought unprecedented capabilities but also significant computational demands, particularly in memory footprint and inference latency. Quantization, which stores model weights (and sometimes activations) at lower numeric precision, has emerged as a critical technique to address these challenges, allowing LLMs to run more efficiently on a wider range of hardware, from powerful data center GPUs to consumer-grade CPUs.&lt;/p&gt;
&lt;p&gt;This comprehensive guide provides an objective, side-by-side comparison of the latest advancements in LLM quantization as of March 30, 2026:&lt;/p&gt;</description></item></channel></rss>