<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Large-Scale Data on AI VOID</title><link>https://ai-blog.noorshomelab.dev/tags/large-scale-data/</link><description>Recent content in Large-Scale Data on AI VOID</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 28 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://ai-blog.noorshomelab.dev/tags/large-scale-data/index.xml" rel="self" type="application/rss+xml"/><item><title>Distributed Data Processing with MetaDataFlow</title><link>https://ai-blog.noorshomelab.dev/metadataflow-guide-2026/10-distributed-processing/</link><pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate><guid>https://ai-blog.noorshomelab.dev/metadataflow-guide-2026/10-distributed-processing/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Welcome back, aspiring data wizard! In our journey through MetaDataFlow, we&amp;rsquo;ve explored how to define, manage, and transform datasets locally. But what happens when your datasets grow beyond the memory capacity of a single machine? What if you&amp;rsquo;re dealing with terabytes or even petabytes of data, a common scenario in modern AI development? That&amp;rsquo;s where distributed data processing comes in, and it&amp;rsquo;s the focus of this exciting chapter!&lt;/p&gt;
&lt;p&gt;Here, we&amp;rsquo;ll dive deep into how MetaDataFlow lets you scale your data operations across multiple machines by building on distributed computing frameworks. We&amp;rsquo;ll cover the core concepts behind processing massive datasets, learn how MetaDataFlow integrates with popular tools like Apache Spark (via PySpark) and Dask, and put these ideas into practice with hands-on examples. Get ready to unlock the true potential of MetaDataFlow for large-scale machine learning!&lt;/p&gt;</description></item></channel></rss>