{"componentChunkName":"component---src-templates-blog-post-js","path":"/crypto-data-pipeline/","result":{"data":{"site":{"siteMetadata":{"title":"/dev/yukarinoki"}},"markdownRemark":{"id":"8aec1336-31d9-578c-9f9c-499716179023","excerpt":"日本語要約: Binance WebSocketから取得した1億5500万件超のオーダーブックレコードを効率的に処理するデータパイプラインの構築記録です。バイナリサーチローダー、Numba JITによる高速集計、LOBシミュレーターの設計に加え、AR-MLPやQuant-GANなどのML…","html":"<blockquote>\n<p><strong>日本語要約</strong>: Binance WebSocketから取得した1億5500万件超のオーダーブックレコードを効率的に処理するデータパイプラインの構築記録です。バイナリサーチローダー、Numba JITによる高速集計、LOBシミュレーターの設計に加え、AR-MLPやQuant-GANなどのML手法も試した結果と考察を共有します。</p>\n</blockquote>\n<p>I’ve been working on crypto microstructure analysis for a while now, and the hardest part isn’t the modeling — it’s wrangling the data. When you’re looking at tick-level BTC/USDT orderbook dynamics, you’re dealing with a firehose of information that will happily eat all your RAM and leave you staring at a frozen terminal.</p>\n<p>This post covers how I built a data pipeline that can efficiently load and process 155+ million orderbook records, run aggregation at sub-second intervals, and feed the results into ML models and LOB simulators.</p>\n<h2 id=\"the-problem\" style=\"position:relative;\"><a href=\"#the-problem\" aria-label=\"the problem permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>The Problem</h2>\n<p>Crypto microstructure research requires tick-level data. Not 1-minute candles, not even 1-second bars — actual individual bid/ask updates as they stream from the exchange. For BTC/USDT on Binance, that means:</p>\n<ul>\n<li><strong>155M+ bid/ask depth update records</strong> (~48GB raw CSV)</li>\n<li><strong>20M trade tick records</strong> (~6GB)</li>\n<li><strong>3.45M orderbook snapshot records</strong> (~12GB)</li>\n</ul>\n<p>All of this collected over several months from the Binance WebSocket feed. Each record has a timestamp, and we need to slice arbitrary time windows, compute derived features (mid price, VWAP, volume profiles), and feed them into models.</p>\n<p>The goal was simple: given any arbitrary time range, produce aggregated microstructure features at 0.5-second intervals in under 2 seconds. 
Sounds reasonable until you realize a naive approach takes 45+ minutes just to load the data.

## Why Naive Approaches Fail

My first attempt was embarrassingly straightforward:

```python
import pandas as pd

# Don't do this with 48GB of CSVs
df = pd.read_csv("binance_btcusdt_depth_updates.csv")
```

On my 64GB workstation, this OOM'd spectacularly. Even with `chunksize`, iterating through 155M rows to find a 30-minute window was painfully slow (~3 minutes per query). Parquet helped with compression, but the full scan was still the bottleneck.

The key insight: these files are already sorted by timestamp. We don't need to scan — we need to *seek*.

## Binary Search Loader

The idea is simple.
Since our CSV files are sorted by timestamp (they come from a WebSocket feed, after all), we can binary search for the starting position and only read what we need.

```python
import os
import bisect
import numpy as np
import pandas as pd
from io import BytesIO
from pathlib import Path
from typing import Optional


class BinarySearchCSVLoader:
    """
    Loads time-sliced data from large sorted CSV files using binary search.
    Builds a sparse index on first access, then seeks directly to the
    relevant byte offset for subsequent queries.
    """

    def __init__(self, csv_path: str, timestamp_col: str = "timestamp",
                 index_granularity: int = 100_000):
        self.csv_path = Path(csv_path)
        self.timestamp_col = timestamp_col
        self.granularity = index_granularity
        self._index: Optional[np.ndarray] = None
        self._offsets: Optional[np.ndarray] = None
        self._header: Optional[bytes] = None
        self._build_index()

    def _build_index(self):
        """Build sparse index: sample every N-th row's timestamp and byte offset."""
        timestamps = []
        offsets = []

        # Binary mode keeps tell()/seek() offsets in true bytes; text-mode
        # tell() returns opaque cookies you can't do arithmetic on.
        with open(self.csv_path, 'rb') as f:
            self._header = f.readline()
            row_count = 0

            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break

                if row_count % self.granularity == 0:
                    ts = int(line.split(b',')[0])  # timestamp is first column
                    timestamps.append(ts)
                    offsets.append(offset)

                row_count += 1

        self._index = np.array(timestamps, dtype=np.int64)
        self._offsets = np.array(offsets, dtype=np.int64)
        print(f"Index built: {len(self._index)} entries covering {row_count:,} rows")

    def load_range(self, start_ts: int, end_ts: int) -> pd.DataFrame:
        """Load only rows within [start_ts, end_ts] using binary search."""
        # Find the index entry just before our start timestamp
        idx_start = max(0, bisect.bisect_left(self._index, start_ts) - 1)
        idx_end = min(len(self._index) - 1,
                      bisect.bisect_right(self._index, end_ts))

        byte_start = self._offsets[idx_start]
        # Read one index block past the end to account for granularity
        if idx_end + 1 < len(self._offsets):
            byte_end = self._offsets[idx_end + 1]
        else:
            byte_end = os.path.getsize(self.csv_path)

        # Seek to the chunk and read only the relevant bytes
        with open(self.csv_path, 'rb') as f:
            f.seek(byte_start)
            chunk = f.read(byte_end - byte_start)

        df = pd.read_csv(BytesIO(self._header + chunk))
        # Filter to the exact range
        mask = (df[self.timestamp_col] >= start_ts) & \
               (df[self.timestamp_col] <= end_ts)
        return df[mask].reset_index(drop=True)
```
The index build takes about 90 seconds for the 155M-row file (a one-time cost), and after that, loading any 30-minute window takes **0.8-1.2 seconds** instead of 3+ minutes. Memory usage drops from "all of it" to just the slice you need — typically 50-200MB for a 30-minute window, depending on market activity.

### Performance Numbers

| Operation | Naive (pandas) | Binary Search Loader |
| --- | --- | --- |
| Load 30-min window | 185s | 1.1s |
| Load 2-hour window | 185s (same full scan) | 3.8s |
| Memory (30-min) | 48GB+ (OOM) | ~150MB |
| Index build (one-time) | N/A | 92s |

## Numba JIT for Hot-Path Aggregation

Once we have the raw tick data loaded, we need to aggregate it into features at regular intervals. Computing mid price, VWAP, and volume at 0.5-second intervals across millions of ticks is the hot path that runs on every query.

Pure pandas/numpy was taking 800ms+ for a 30-minute window.
With Numba JIT, this drops to about 45ms.

```python
import numpy as np
from numba import njit


@njit(cache=True)
def compute_aggregated_features(
    timestamps: np.ndarray,        # int64, microseconds
    bid_prices: np.ndarray,        # float64
    ask_prices: np.ndarray,        # float64
    bid_volumes: np.ndarray,       # float64
    ask_volumes: np.ndarray,       # float64
    trade_prices: np.ndarray,      # float64
    trade_volumes: np.ndarray,     # float64
    trade_timestamps: np.ndarray,  # int64
    interval_us: int = 500_000,    # 0.5 seconds in microseconds
) -> tuple:
    """
    Compute mid price, VWAP, and volume at fixed intervals.
    All arrays must be sorted by timestamp.
    """
    t_start = timestamps[0]
    t_end = timestamps[-1]
    n_intervals = int((t_end - t_start) / interval_us) + 1

    mid_prices = np.empty(n_intervals, dtype=np.float64)
    vwaps = np.empty(n_intervals, dtype=np.float64)
    volumes = np.empty(n_intervals, dtype=np.float64)
    interval_times = np.empty(n_intervals, dtype=np.int64)

    tick_idx = 0
    trade_idx = 0
    # Carry the last quote across intervals so an interval with no
    # ticks keeps the previous mid price instead of collapsing to zero
    last_bid = bid_prices[0]
    last_ask = ask_prices[0]

    for i in range(n_intervals):
        t_interval_start = t_start + i * interval_us
        t_interval_end = t_interval_start + interval_us
        interval_times[i] = t_interval_start

        # Advance tick pointer to the end of this interval
        while tick_idx < len(timestamps) and timestamps[tick_idx] < t_interval_end:
            last_bid = bid_prices[tick_idx]
            last_ask = ask_prices[tick_idx]
            tick_idx += 1

        mid_prices[i] = (last_bid + last_ask) / 2.0

        # Compute VWAP and volume from trades in this interval
        vol_sum = 0.0
        pv_sum = 0.0
        while trade_idx < len(trade_timestamps) and \
              trade_timestamps[trade_idx] < t_interval_end:
            v = trade_volumes[trade_idx]
            p = trade_prices[trade_idx]
            vol_sum += v
            pv_sum += p * v
            trade_idx += 1

        volumes[i] = vol_sum
        vwaps[i] = pv_sum / vol_sum if vol_sum > 0 else mid_prices[i]

    return interval_times, mid_prices, vwaps, volumes
```
0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p>The first call takes ~2s due to JIT compilation, but subsequent calls with the same type signature hit the cache and run in 40-50ms. For interactive exploration this is a huge win — you get near-real-time iteration on feature engineering.</p>\n<h2 id=\"lob-simulator\" style=\"position:relative;\"><a href=\"#lob-simulator\" aria-label=\"lob simulator permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>LOB Simulator</h2>\n<p>With the pipeline working, I built a Limit Order Book simulator to generate synthetic orderbook dynamics. 
The idea was to test whether ML models could learn microstructure patterns from simulated data before deploying on real data.

![Price Simulator Dashboard](/static/de23a0c171791ae5f10501d0f1ef72b3/248b0/price_sim_top.png)

The simulator supports multiple price process models:

- **Geometric Brownian Motion** (baseline)
- **Jump Diffusion** (Merton model)
- **Self-Exciting Jump Diffusion** (Hawkes process intensity)
- **AR-MLP** (autoregressive neural network)
- **Quant-GAN** (generative adversarial network)

Each model generates price paths, and the LOB simulator wraps these with realistic order flow dynamics — queue sizes, arrival rates, and cancellation patterns calibrated from the real Binance data.
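For a taste of the non-ML end of that list, here is a minimal Merton jump-diffusion path generator. The parameters are illustrative only, not the values calibrated from the Binance data:

```python
import numpy as np

def merton_jump_path(s0, mu, sigma, lam, jump_mu, jump_sigma,
                     dt, n_steps, rng):
    """One Merton jump-diffusion path: GBM increments plus
    Poisson-arriving Gaussian jumps in log-price."""
    log_s = np.empty(n_steps + 1)
    log_s[0] = np.log(s0)
    for t in range(n_steps):
        drift = (mu - 0.5 * sigma ** 2) * dt
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal()
        n_jumps = rng.poisson(lam * dt)  # jump count this step
        jumps = rng.normal(jump_mu, jump_sigma, n_jumps).sum()
        log_s[t + 1] = log_s[t] + drift + diffusion + jumps
    return np.exp(log_s)

# Illustrative parameters only (0.5s steps, one 30-minute path)
rng = np.random.default_rng(0)
path = merton_jump_path(s0=60_000.0, mu=0.0, sigma=1e-4, lam=0.02,
                        jump_mu=0.0, jump_sigma=5e-4,
                        dt=0.5, n_steps=3600, rng=rng)
```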
## ML Experiments

Here's where things got interesting (and humbling).

### AR-MLP: Autoregressive MLP

The AR-MLP takes a window of past mid-price returns and orderbook imbalance features, and predicts the next 0.5s return. The architecture is straightforward: 3 hidden layers (256, 128, 64), batch norm, dropout 0.3, trained on 2 months of data.
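A minimal PyTorch sketch matching that description (the input width is an assumption; the real feature set isn't shown here):

```python
import torch
import torch.nn as nn

class ARMLP(nn.Module):
    """3 hidden layers (256, 128, 64), batch norm, dropout 0.3,
    mapping a feature window to the predicted next-0.5s return."""

    def __init__(self, n_features: int = 64, dropout: float = 0.3):
        super().__init__()
        layers, in_dim = [], n_features
        for width in (256, 128, 64):
            layers += [nn.Linear(in_dim, width),
                       nn.BatchNorm1d(width),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
            in_dim = width
        layers.append(nn.Linear(in_dim, 1))  # next-interval return
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)
```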
![AR-MLP Simulation](/static/756156fe56fd5f56964308efe51c3033/9516f/price_sim_armlp.png)

The problem is immediately visible: **systematic drift**. The model learns a slight directional bias from the training set and compounds it over time. Even with careful normalization and detrending, the generated paths diverge from realistic price behavior after about 5 minutes of simulation.

I tried several fixes:

- Mean-centering predictions (helps, but introduces weird mean-reversion artifacts)
- Predicting log returns instead of price levels (same drift, just in log space)
- Adding an explicit mean-reversion penalty to the loss (improves it, but kills realistic momentum)

None fully solved it. The model captures short-term autocorrelation well but fails at longer horizons.

### Quant-GAN

The Quant-GAN approach (based on the Wiese et al. paper) uses temporal convolutional networks as both generator and discriminator, operating on sequences of log returns.

![Quant-GAN Simulation](/static/d012a5abee93df755d543ec8b7a8512c/fcda8/price_sim_ml_quantgan.png)
This produced more visually interesting results. The distribution of log returns from the GAN actually matches the empirical distribution reasonably well — you can see the heavy tails are captured. But in practice:

- Training was unstable (classic GAN problems, even with spectral normalization and gradient penalty)
- Generated paths lacked the *temporal* structure of real prices — the autocorrelation of absolute returns was wrong
- Most importantly, for downstream tasks (strategy backtesting), it didn't outperform calibrated jump-diffusion

### Why Traditional Models Won

After weeks of ML experiments, I ended up using a **self-exciting jump diffusion** model for the LOB simulator. The reasons:

1. **Calibration is fast and stable** — fit parameters via MLE on historical data in seconds
2. **Interpretable** — you can directly see how the jump intensity decays and what the baseline volatility is
3. **Hawkes-driven clustering** — captures the volatility clustering and trade arrival patterns that matter for microstructure (see the sketch after this list)
4. **No drift problem** — the martingale property is built in by construction
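To make point 3 concrete, here is a sketch of simulating an exponential-kernel Hawkes process by Ogata thinning, where the intensity is lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)). The parameters below are illustrative, not the MLE fits:

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Hawkes event times on [0, horizon] via Ogata thinning.

    Each event bumps the intensity, so events (jumps) arrive in
    clusters. The process is stationary when alpha / beta < 1.
    """
    events = []
    t = 0.0
    while True:
        # The intensity just after t upper-bounds lambda until the
        # next event, because the exponential kernel only decays
        lam_bar = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= horizon:
            break
        lam_t = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in events)
        if rng.uniform() <= lam_t / lam_bar:  # accept with prob lam_t / lam_bar
            events.append(t)
    return np.array(events)

# Illustrative parameters only: 10 minutes of clustered jump times
rng = np.random.default_rng(42)
jump_times = simulate_hawkes(mu=0.2, alpha=0.8, beta=1.5, horizon=600.0, rng=rng)
```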
The ML models are interesting academically, and I think with more data and better architectures (transformers, perhaps) they could eventually win. But for a practical LOB simulator where you need reliable synthetic data for strategy testing, a well-calibrated stochastic model is hard to beat.

## Pipeline Architecture Summary

The full pipeline looks like:

```text
Binance WebSocket --> Raw CSVs (sorted by timestamp)
                         |
                    Binary Search Index (sparse, in-memory)
                         |
                    Time-Slice Loader (seek + read)
                         |
                    Numba Aggregator (0.5s intervals)
                         |
               +--------------------+
               |                    |
         Feature Store         LOB Simulator
               |                    |
         ML Models            Strategy Backtester
```

Total latency from "give me features for this time range" to having aggregated data ready: **~1.5 seconds** for a typical 30-minute window. This includes loading from disk, so if you're iterating on features interactively, it feels snappy enough.

## Lessons Learned

**Binary search on sorted files is underrated.** Everyone reaches for databases or Parquet partitioning, but if your data is already sorted and you need arbitrary range queries, a simple sparse index + seek gets you 90% of the way there with zero dependencies.

**Numba is magic for numerical hot paths.** The constraint is that you need to write "numpy-style" code inside the JIT function (no pandas, no Python objects), but for aggregation loops it's a perfect fit.
The <code class=\"language-text\">cache=True</code> flag is essential — without it you pay compilation cost on every restart.</p>\n<p><strong>ML models for price simulation are harder than they look.</strong> The drift problem in AR models and mode collapse in GANs are well-known, but experiencing them firsthand on real financial data is educational. The gap between “interesting paper results” and “useful tool for my workflow” is wider than I expected.</p>\n<p><strong>Start with the data pipeline, not the model.</strong> I spent the first two weeks just on the loader and aggregator, and it paid off massively. Every model experiment after that was fast to iterate on because the data was always ready in seconds.</p>\n<p>The full pipeline code is roughly 2,500 lines of Python. Not a massive codebase, but it handles the core problem well: turning a firehose of raw WebSocket data into something you can actually work with for research.</p>","tableOfContents":"<ul>\n<li><a href=\"/crypto-data-pipeline/#the-problem\">The Problem</a></li>\n<li><a href=\"/crypto-data-pipeline/#why-naive-approaches-fail\">Why Naive Approaches Fail</a></li>\n<li>\n<p><a href=\"/crypto-data-pipeline/#binary-search-loader\">Binary Search Loader</a></p>\n<ul>\n<li><a href=\"/crypto-data-pipeline/#performance-numbers\">Performance Numbers</a></li>\n</ul>\n</li>\n<li><a href=\"/crypto-data-pipeline/#numba-jit-for-hot-path-aggregation\">Numba JIT for Hot-Path Aggregation</a></li>\n<li><a href=\"/crypto-data-pipeline/#lob-simulator\">LOB Simulator</a></li>\n<li>\n<p><a href=\"/crypto-data-pipeline/#ml-experiments\">ML Experiments</a></p>\n<ul>\n<li><a href=\"/crypto-data-pipeline/#ar-mlp-autoregressive-mlp\">AR-MLP: Autoregressive MLP</a></li>\n<li><a href=\"/crypto-data-pipeline/#quant-gan\">Quant-GAN</a></li>\n<li><a href=\"/crypto-data-pipeline/#why-traditional-models-won\">Why Traditional Models Won</a></li>\n</ul>\n</li>\n<li><a href=\"/crypto-data-pipeline/#pipeline-architecture-summary\">Pipeline Architecture Summary</a></li>\n<li><a href=\"/crypto-data-pipeline/#lessons-learned\">Lessons Learned</a></li>\n</ul>","frontmatter":{"title":"Processing 155 Million Orderbook Records: Building a Fast Data Pipeline for Crypto Microstructure Analysis","date":"February 20, 2026","description":"How I built a data pipeline to process 155M+ Binance orderbook records for crypto microstructure research, using binary search, Numba JIT, and LOB simulation."}}},"pageContext":{"slug":"/crypto-data-pipeline/","previous":{"fields":{"slug":"/self-exciting-jump-diffusion/"},"frontmatter":{"title":"Self-Exciting Jump-Diffusion for Crypto: Why Vanilla Models Miss Momentum and Volatility Clustering"}},"next":{"fields":{"slug":"/polymarket-orderbook-analysis/"},"frontmatter":{"title":"Analyzing Bid-Ask Spreads and Liquidity Shifts in Polymarket's 5-Minute Binary Option Orderbooks"}}}}}