TLDR Data 2025-10-09
Deep Dives
Apache Iceberg vs Delta Lake vs Apache Hudi - Feature Comparison Deep Dive (15 minute read)
All three formats (Hudi, Delta Lake, and Iceberg) support ACID transactions, Copy-On-Write, schema evolution, and time travel. Hudi excels with Merge-On-Read, advanced indexing, partial updates, non-blocking concurrency, and automated compaction/clustering. Delta shines in Databricks integration and Z-order clustering but leans on experimental features and proprietary elements, while Iceberg leads in partition evolution and breadth of engine read/write support, yet demands manual table maintenance, has slower metadata handling, and lacks CDC and primary keys.
Building a Resilient Event Publisher with Dual Failure Capture (9 minute read)
Klaviyo revamped its event publishing system to eliminate data loss during network hiccups, Kafka timeouts, or serialization errors while processing up to 170,000 events/sec at peak. The solution implements a dual failure capture strategy: events that exhaust automatic retries are written to a self-hosted Kafka DLQ (retained for 7 days), while persistent failures and serialization bugs route events to S3 for indefinite retention and manual recovery.
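A minimal sketch of the dual failure capture pattern described above, with hypothetical function names and stubbed transports (the article does not publish Klaviyo's code): events that exhaust Kafka retries land in a DLQ topic, while events that cannot even be serialized are captured to S3 for manual recovery.

```python
import json

MAX_RETRIES = 3  # illustrative; the real retry policy is not specified

def publish_with_dual_capture(event, kafka_send, dlq_send, s3_put):
    """Return where the event ended up: 'kafka', 'dlq', or 's3'."""
    try:
        # Serialization bugs are caught first and routed straight to S3,
        # since an unserializable event can never succeed on retry.
        payload = json.dumps(event).encode("utf-8")
    except (TypeError, ValueError):
        s3_put(repr(event))
        return "s3"
    for _ in range(MAX_RETRIES):
        try:
            kafka_send(payload)
            return "kafka"
        except ConnectionError:
            continue  # transient broker/network failure: retry
    dlq_send(payload)  # retries exhausted: capture in the Kafka DLQ
    return "dlq"
```

The key design point is that the two capture paths cover disjoint failure modes: the DLQ handles transient delivery failures that are replayable as-is, while S3 holds events that need human inspection.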
How OpenAI Uses Kubernetes And Apache Kafka for GenAI (15 minute read)
OpenAI’s engineering team built a stream processing platform on PyFlink and Kubernetes, with Apache Kafka as the event streaming backbone, to handle massive data volumes for AI systems. The shift from batch to real-time processing delivers fresher data, with Kafka acting as the multi-primary event backbone for logs, training data, and experiments.
Opinions & Advice
Engineering Growth: The Data Layers Powering Modern GTM (12 minute read)
Growth no longer rewards the widest net. Modern Go-To-Market (GTM) teams win with precision, not volume, building revenue on infrastructure like pipelines, warehouses, and customer data platforms that turn signals into action. However, not all data is created equal: these insights draw on five distinct data sources, each with its own engineering challenges, governance requirements, and strategic value.
The Single Node Rebellion (6 minute read)
Tools like DuckDB and Polars are challenging distributed platforms (e.g., Spark and Databricks) for most workloads, since datasets are rarely true “Big Data” and a single node (often just one large EC2 instance) offers cost savings and simplicity amid rising cloud expenses.
7 Questions Every Data Team Should Ask the Business (5 minute read)
Data teams often face vague or misaligned project requests from business partners. Instead of asking “What data do you need?”, they should use targeted questions to uncover pain points, decision gaps, and opportunities. For example, ask what recent win they want to scale (to build rapport and amplify successes), and when a lack of data led to a bad decision (to reveal gaps and value perceptions).
Launches & Tools
You can gauge open source momentum by tracking GitHub activity, especially pull requests, which signal innovation speed and community engagement. Comparing trends across areas like analytics engines, event streaming, orchestration, and lakehouse formats can reveal where the ecosystem is moving.
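One hedged way to operationalize this tracking is GitHub's issue search API, whose `type:pr` and `created:` qualifiers let you count pull requests per repository per period; the helper below only builds the request URL (the repo name and date are illustrative), and the response's `total_count` field is the number to trend over time.

```python
from urllib.parse import urlencode

def pr_search_url(repo: str, since: str) -> str:
    """Build a GitHub search-API URL counting PRs opened in `repo` since `since`."""
    q = f"repo:{repo} type:pr created:>{since}"
    # per_page=1 because only the total_count in the response matters here.
    return "https://api.github.com/search/issues?" + urlencode({"q": q, "per_page": 1})

# Illustrative: compare PR velocity across lakehouse-format repos.
url = pr_search_url("apache/iceberg", "2025-01-01")
```

Fetching the same query for each project and plotting `total_count` month over month gives a rough proxy for the innovation speed the item describes.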
Introducing OpenZL: An Open Source Format-Aware Compression Framework (8 minute read)
Meta has open-sourced OpenZL, a format-aware, lossless compression framework that outperforms generic compressors by leveraging explicit data structure descriptions. Tailored for structured data like database tables, timeseries, and ML tensors, OpenZL achieves higher compression ratios and speed while maintaining a single universal decompressor, reducing operational complexity. The offline trainer generates data-specific compression configs, enabling rapid adaptation to schema changes without re-deployment.
Versionless Spark decouples clients from servers via a stable Spark Connect API, enabling automatic upgrades, and uses environment versioning with base images that pin Spark Connect and Python dependencies. Its AI-powered Release Stability System (RSS) guides upgrades via workload fingerprints, historical metadata, ML-driven error triage, and anomaly detection, achieving a 99.99% success rate across 2 billion jobs transitioned from DBR 14 to 17 (including Spark 4), with features like collation and Bloom filters unlocked.
Arc Core is a high-performance time-series data warehouse designed for rapid ingestion, achieving 1.89 million records per second in native deployment. Built on DuckDB, Parquet, and MinIO, it suits workloads that need efficient storage and querying of time-series data.
Miscellaneous
Locality, and Temporal-Spatial Hypothesis (8 minute read)
The “temporal-spatial locality hypothesis” states that data written around the same time is likely to be read around the same time, justifying storing it close together for efficiency. The classic illustration is forward scans, which are fast thanks to read-ahead caching, versus backward scans, which are slow because the database must block on I/O to fetch each previous page. The hypothesis holds trivially in time-ordered systems like streaming and time-series data, but hash-based stores like DynamoDB deliberately reject it, randomizing keys to avoid write hotspots and trading read locality for write scalability.
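The forward-versus-backward asymmetry can be shown with a toy simulation (page counts and the prefetch window are illustrative, not from the article): a read-ahead cache prefetches the next few pages on every miss, so a forward scan mostly hits the cache while a backward scan misses on every single page.

```python
PREFETCH = 8  # pages fetched per blocking read, including the requested one

def scan(pages, order):
    """Count cache misses (i.e., blocking I/O operations) for a scan order."""
    cache, misses = set(), 0
    for p in order:
        if p not in cache:
            misses += 1  # blocking I/O to fetch the page...
            cache.update(range(p, min(p + PREFETCH, pages)))  # ...plus read-ahead
    return misses

pages = 64
forward = scan(pages, range(pages))             # read-ahead amortizes the I/O
backward = scan(pages, reversed(range(pages)))  # read-ahead never helps
```

With these numbers the forward scan blocks 8 times (64 pages / 8-page prefetch) while the backward scan blocks 64 times, since prefetching always loads pages the scan has already passed.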
Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF (7 minute read)
IBM and NVIDIA have integrated cuDF with the Velox execution engine to enable GPU-native SQL query execution in systems like Presto and Apache Spark. Velox rewrites plans to use GPU operators (joins, scans, and aggregations) and supports UCX-based exchange for multi-GPU data routing. In benchmarks, single-node Presto showed an order-of-magnitude performance improvement over CPU.
Quick Links
How Not to Partition Data in S3 (And What to Do Instead) (5 minute read)
Partitioning S3 data lakes by year/month/day seems logical, but often degrades performance, creating many small files that increase scan costs and slow queries.
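The small-file problem is easy to quantify with a toy model (the key scheme and counts are illustrative): 100 sources each writing one file per day under year/month/day partitioning produce tens of thousands of tiny objects, every one of which a full scan must list and open, whereas daily compaction would leave one file per day.

```python
from collections import Counter

def partitioned_keys(sources, days):
    """Simulate S3 keys for per-source daily writes under date partitioning."""
    return [
        f"year=2024/month={d // 30 + 1:02d}/day={d % 30 + 1:02d}/src={s}.parquet"
        for d in range(days)  # 12 synthetic 30-day months
        for s in range(sources)
    ]

keys = partitioned_keys(sources=100, days=360)
# Files per day-level prefix: the object count a scan must list per partition.
files_per_prefix = Counter(k.rsplit("/", 1)[0] for k in keys)
```

Here 36,000 small objects sit behind 360 prefixes, 100 per day; compacting each day's writes into one file cuts the object count by 100x, which is the kind of scan-cost reduction the article is driving at.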
Building a Better Lakehouse: From Airflow to Dagster (7 minute read)
Replacing Airflow with Dagster enabled smarter partitioning, event-driven monitoring, and pure SQL data loading, significantly improving lakehouse efficiency and capabilities.