<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Production Engineering on AI VOID</title><link>https://ai-blog.noorshomelab.dev/categories/production-engineering/</link><description>Recent content in Production Engineering on AI VOID</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 26 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ai-blog.noorshomelab.dev/categories/production-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>Signal Impacted by Twilio Social Engineering Attack</title><link>https://ai-blog.noorshomelab.dev/postmortems/signal-twilio-social-engineering-attack/</link><pubDate>Tue, 26 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/signal-twilio-social-engineering-attack/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; Signal Impacted by Twilio Social Engineering Attack
&lt;strong&gt;Date:&lt;/strong&gt; 2022-08-08 | &lt;strong&gt;Duration:&lt;/strong&gt; unknown | &lt;strong&gt;Severity:&lt;/strong&gt; P1-high
&lt;strong&gt;Affected:&lt;/strong&gt; A small number of Signal users | &lt;strong&gt;Systems:&lt;/strong&gt; Twilio&amp;rsquo;s phone number verification services, Signal user registration/verification process
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; Twilio employees fell victim to a sophisticated phishing attack, leading to the compromise of their credentials and unauthorized access to Twilio&amp;rsquo;s internal support systems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt;
&lt;em&gt;Timeline not available from public sources.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id="incident-summary"&gt;Incident Summary&lt;/h2&gt;
&lt;p&gt;On August 8, 2022, Signal was notified by its third-party phone number verification provider, Twilio, about a security incident. Twilio had experienced a sophisticated social engineering attack, where malicious actors successfully phished several of its employees. This compromise granted the attackers unauthorized access to Twilio&amp;rsquo;s internal support systems, which included access to customer data for a limited number of Twilio clients.&lt;/p&gt;</description></item><item><title>LLM Guardrail Failure in Production: The Discrepancy Between Test and Reality</title><link>https://ai-blog.noorshomelab.dev/postmortems/llm-guardrail-failure-production-test-reality-discrepancy/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/llm-guardrail-failure-production-test-reality-discrepancy/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; LLM Guardrail Failure in Production: The Discrepancy Between Test and Reality
&lt;strong&gt;Date:&lt;/strong&gt; unknown | &lt;strong&gt;Duration:&lt;/strong&gt; ~6 hours | &lt;strong&gt;Severity:&lt;/strong&gt; P1-high
&lt;strong&gt;Affected:&lt;/strong&gt; unknown, potentially thousands over time | &lt;strong&gt;Systems:&lt;/strong&gt; LLM Inference Service, Guardrail Enforcement Layer, User-Facing Application
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; LLM guardrails, which performed adequately in pre-production testing, failed to prevent undesirable outputs when exposed to the full spectrum of real-world user inputs and sustained production load.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="incident-summary"&gt;Incident Summary&lt;/h2&gt;
&lt;p&gt;On an unknown date, our AI-powered service experienced a critical incident in which the Large Language Model (LLM) guardrails, designed to filter and prevent undesirable outputs, failed in our production environment. The failure caused inappropriate or harmful content to be generated and delivered to users through our primary user-facing application. The incident lasted approximately 6 hours and was classified as P1-high because of its direct impact on user experience and brand reputation.&lt;/p&gt;</description></item><item><title>OpenAI macOS App Supply Chain Attack via TanStack</title><link>https://ai-blog.noorshomelab.dev/postmortems/openai-macos-app-tanstack-supply-chain-attack/</link><pubDate>Sat, 23 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/openai-macos-app-tanstack-supply-chain-attack/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; OpenAI macOS App Supply Chain Attack via TanStack
&lt;strong&gt;Date:&lt;/strong&gt; 2026-05-21 | &lt;strong&gt;Duration:&lt;/strong&gt; unknown | &lt;strong&gt;Severity:&lt;/strong&gt; P0-critical
&lt;strong&gt;Affected:&lt;/strong&gt; All macOS app users (potential for future compromise) | &lt;strong&gt;Systems:&lt;/strong&gt; OpenAI macOS app, OpenAI iOS app (potential), OpenAI Windows app (potential)
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; The compromise of two OpenAI employee devices via a malicious TanStack npm package, which was part of the broader Shai-Hulud supply chain attack, led to the exfiltration of private code signing certificates for macOS, iOS, and Windows.&lt;/p&gt;</description></item><item><title>RubyGems Malicious Package Upload Security Incident</title><link>https://ai-blog.noorshomelab.dev/postmortems/rubygems-malicious-package-upload-security-incident/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/rubygems-malicious-package-upload-security-incident/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; RubyGems Malicious Package Upload Security Incident
&lt;strong&gt;Date:&lt;/strong&gt; 2025-09-10 | &lt;strong&gt;Duration:&lt;/strong&gt; ~192 hours | &lt;strong&gt;Severity:&lt;/strong&gt; P0-critical
&lt;strong&gt;Affected:&lt;/strong&gt; RubyGems users (specific number unknown) | &lt;strong&gt;Systems:&lt;/strong&gt; RubyGems.org package registry, RubyGems.org user signup system
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; The incident was caused by the improper use or compromise of administrative credentials, allowing unauthorized uploads of hundreds of malicious packages to the RubyGems.org registry.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="timeline-of-events"&gt;Timeline of Events&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;September 10-18, 2025&lt;/td&gt;
&lt;td&gt;Period during which hundreds of malicious packages were uploaded to RubyGems.org, leading to the suspension of new signups.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;September 18, 2025&lt;/td&gt;
&lt;td&gt;Ruby Central communicates termination to a former RubyGems.org operator, as part of the incident response.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;October 10, 2025&lt;/td&gt;
&lt;td&gt;Ruby Central releases a comprehensive security incident report addressing the events.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="incident-summary"&gt;Incident Summary&lt;/h2&gt;
&lt;p&gt;On September 10, 2025, RubyGems.org, the primary package registry for the Ruby programming language, experienced a severe security breach involving the unauthorized upload of hundreds of malicious packages. This critical incident, which spanned approximately eight days, severely impacted the integrity of the RubyGems ecosystem and necessitated the suspension of new user signups to contain the threat.&lt;/p&gt;</description></item><item><title>DENIC .de TLD DNSSEC Outage</title><link>https://ai-blog.noorshomelab.dev/postmortems/denic-de-tld-dnssec-outage-may-2026/</link><pubDate>Thu, 21 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/denic-de-tld-dnssec-outage-may-2026/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; DENIC .de TLD DNSSEC Outage
&lt;strong&gt;Date:&lt;/strong&gt; 2026-05-05 | &lt;strong&gt;Duration:&lt;/strong&gt; unknown | &lt;strong&gt;Severity:&lt;/strong&gt; P0-critical
&lt;strong&gt;Affected:&lt;/strong&gt; Millions of domains unreachable | &lt;strong&gt;Systems:&lt;/strong&gt; .de TLD DNSSEC validation, DNS resolvers globally
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; DENIC, the registry operator for the .de TLD, published incorrect DNSSEC signatures for the .de zone.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="incident-summary"&gt;Incident Summary&lt;/h3&gt;
&lt;p&gt;On May 5, 2026, the internet experienced a significant disruption affecting millions of domains under the &lt;code&gt;.de&lt;/code&gt; country-code top-level domain (ccTLD). This outage was triggered when DENIC, the authoritative registry operator for the &lt;code&gt;.de&lt;/code&gt; TLD, began publishing incorrect DNSSEC signatures for its zone.&lt;/p&gt;</description></item><item><title>Mini Shai-Hulud Supply Chain Attack on TanStack npm Packages</title><link>https://ai-blog.noorshomelab.dev/postmortems/mini-shai-hulud-tanstack-supply-chain-attack/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/mini-shai-hulud-tanstack-supply-chain-attack/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; Mini Shai-Hulud Supply Chain Attack on TanStack npm Packages
&lt;strong&gt;Date:&lt;/strong&gt; 2026-05-17 | &lt;strong&gt;Duration:&lt;/strong&gt; ~2 hours | &lt;strong&gt;Severity:&lt;/strong&gt; P0-critical
&lt;strong&gt;Affected:&lt;/strong&gt; unknown (potentially millions of downstream users) | &lt;strong&gt;Systems:&lt;/strong&gt; TanStack npm packages, npm registry, developer build systems
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; Malicious versions of TanStack npm packages were published to the npm registry, containing the self-propagating &amp;lsquo;Mini Shai-Hulud&amp;rsquo; worm, indicating a compromise of TanStack&amp;rsquo;s publishing credentials or build process.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Timeline (if available):&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>Node-IPC Supply Chain Attack: Protestware Incident</title><link>https://ai-blog.noorshomelab.dev/postmortems/node-ipc-supply-chain-attack-protestware-incident/</link><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/node-ipc-supply-chain-attack-protestware-incident/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; Node-IPC Supply Chain Attack: Protestware Incident
&lt;strong&gt;Date:&lt;/strong&gt; 2022-03-08 | &lt;strong&gt;Duration:&lt;/strong&gt; Malicious versions available from early March 2022 until mitigated later that month | &lt;strong&gt;Severity:&lt;/strong&gt; P0-critical
&lt;strong&gt;Affected:&lt;/strong&gt; unknown, potentially widespread across the JavaScript ecosystem | &lt;strong&gt;Systems:&lt;/strong&gt; Node.js applications using node-ipc, Any system with a dependency on node-ipc (direct or transitive)
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; The maintainer of the &amp;lsquo;node-ipc&amp;rsquo; package published malicious versions (e.g., 9.2.x, 10.1.x) to npm, containing &amp;lsquo;protestware&amp;rsquo; designed to wipe files on systems located in specific geographic regions.&lt;/p&gt;</description></item><item><title>QUIC Congestion Window Stalling Due to Linux Kernel Idle Optimization Misport: Engineering Postmortem</title><link>https://ai-blog.noorshomelab.dev/postmortems/quic-congestion-window-stalling-linux-kernel-idle-optimization-misport/</link><pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/postmortems/quic-congestion-window-stalling-linux-kernel-idle-optimization-misport/</guid><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; QUIC Congestion Window Stalling Due to Linux Kernel Idle Optimization Misport
&lt;strong&gt;Date:&lt;/strong&gt; 2023-08-15 (Discovered) | &lt;strong&gt;Duration:&lt;/strong&gt; Latent for years, ~6 hours (diagnosis &amp;amp; fix deployment) | &lt;strong&gt;Severity:&lt;/strong&gt; P1-high
&lt;strong&gt;Affected:&lt;/strong&gt; All Cloudflare QUIC connections utilizing the &lt;code&gt;quiche&lt;/code&gt; library, impacting global user experience, especially after packet loss.
&lt;strong&gt;Systems:&lt;/strong&gt; Cloudflare &lt;code&gt;quiche&lt;/code&gt; QUIC implementation, Linux kernel CUBIC porting layer, QUIC-enabled services.
&lt;strong&gt;Root cause (summary):&lt;/strong&gt; Incorrect calculation of &amp;ldquo;idle&amp;rdquo; periods in &lt;code&gt;quiche&lt;/code&gt;&amp;rsquo;s CUBIC congestion control port, preventing congestion window recovery after packet loss by perpetually resetting the idle timer.&lt;/p&gt;</description></item></channel></rss>