<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>VLLM on AI VOID</title><link>https://ai-blog.noorshomelab.dev/tags/vllm/</link><description>Recent content in VLLM on AI VOID</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 19 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ai-blog.noorshomelab.dev/tags/vllm/index.xml" rel="self" type="application/rss+xml"/><item><title>Crafting Robust LLM Inference Pipelines</title><link>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/crafting-llm-inference-pipelines/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/crafting-llm-inference-pipelines/</guid><description>&lt;h2 id="introduction-from-training-to-production-ready-llms"&gt;Introduction: From Training to Production-Ready LLMs&lt;/h2&gt;
&lt;p&gt;Welcome back, future MLOps architect! In our previous chapters, we laid the groundwork for understanding LLMOps and the unique challenges of working with Large Language Models. We&amp;rsquo;ve seen how crucial it is to manage the lifecycle of these powerful models. Now, it&amp;rsquo;s time to shift our focus from &lt;em&gt;training&lt;/em&gt; these behemoths to &lt;em&gt;serving&lt;/em&gt; them efficiently and reliably in a production environment.&lt;/p&gt;
&lt;p&gt;Deploying LLMs for inference comes with its own set of fascinating challenges. Unlike traditional machine learning models, LLMs are often massive, requiring significant computational resources (especially GPUs) and memory. They also generate output token by token, which demands careful handling for latency and throughput. This chapter is your guide to building robust, scalable, and cost-efficient LLM inference pipelines. We&amp;rsquo;ll break down the journey a user&amp;rsquo;s prompt takes, from initial input to final response, exploring each critical stage and how to optimize it.&lt;/p&gt;</description></item><item><title>Supercharging GPUs: Optimization Techniques for LLMs</title><link>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/gpu-optimization-for-llms/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/llmops-ai-infra-guide-2026/gpu-optimization-for-llms/</guid><description>&lt;h2 id="supercharging-gpus-optimization-techniques-for-llms"&gt;Supercharging GPUs: Optimization Techniques for LLMs&lt;/h2&gt;
&lt;p&gt;Welcome back, future LLMOps maestros! In our previous chapters, we laid the groundwork for understanding LLM inference pipelines and how to set them up. We&amp;rsquo;ve seen that serving Large Language Models in production is a whole different ball game compared to traditional machine learning models. One of the biggest challenges? The sheer computational power and memory these models demand, especially from GPUs.&lt;/p&gt;
&lt;p&gt;In this chapter, we&amp;rsquo;re diving deep into the exciting world of GPU optimization for LLMs. Our goal isn&amp;rsquo;t just to make models run, but to make them &lt;em&gt;fly&lt;/em&gt; – faster, more efficiently, and at a lower cost. We&amp;rsquo;ll explore cutting-edge techniques that can dramatically reduce latency and boost throughput, turning your GPU infrastructure into a lean, mean inference machine.&lt;/p&gt;</description></item><item><title>Run MTP LLMs with llama.cpp &amp;amp; vLLM</title><link>https://ai-blog.noorshomelab.dev/tutorials/run-mtp-llms-llama-cpp-vllm/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/tutorials/run-mtp-llms-llama-cpp-vllm/</guid><description>&lt;p&gt;&lt;strong&gt;What you&amp;rsquo;ll build:&lt;/strong&gt; By the end of this tutorial, you will be able to set up and run LLMs capable of Multi-Token Prediction (MTP) locally using &lt;code&gt;llama.cpp&lt;/code&gt; and &lt;code&gt;vLLM&lt;/code&gt;, compare their performance against standard generation, and understand fallback options.
&lt;strong&gt;Time needed:&lt;/strong&gt; ~90 minutes
&lt;strong&gt;Prerequisites:&lt;/strong&gt; Basic command-line interface (CLI) familiarity; Git; a C++ compiler (GCC/Clang on Linux/macOS, MSVC on Windows); CMake; Python 3.9+; an NVIDIA GPU with CUDA (11.8+ recommended), an AMD GPU with ROCm, or Apple Silicon (Metal); sufficient RAM (16GB+ recommended) and VRAM (8GB+ recommended)
&lt;strong&gt;Versions used:&lt;/strong&gt; llama.cpp main branch (post-MTP merge); vLLM latest stable or developer preview with MTP support&lt;/p&gt;</description></item></channel></rss>