TLDR Data 2026-04-02
Deep Dives
Inside Meta’s Home Grown AI Analytics Agent (12 minute read)
Meta built an internal AI Analytics Agent that autonomously handles routine data analysis tasks. It layers knowledge into “Cookbooks” (domain expertise), “Recipes” (step-by-step workflows with validations), and “Ingredients” (semantic models, documentation, and query history), gathers rich context from a user’s past queries, and runs an iterative reasoning loop.
The Power of Data Sketches: A Comprehensive Guide (18 minute read)
Data sketches are compact, probabilistic data structures that create small summaries of massive datasets in a single pass, trading a tiny, mathematically bounded error for huge gains in speed and memory efficiency, making them ideal for big data analytics in Spark, Druid, Pinot, BigQuery, and Presto/Trino.
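A minimal sketch of the idea using the datasketch Python library (a separate project from the Apache DataSketches implementations those engines embed): a HyperLogLog estimates distinct counts in fixed memory and merges losslessly across partitions.

```python
from datasketch import HyperLogLog

# p controls precision: 2^p registers, roughly 1.04/sqrt(2^p)
# standard error, so p=12 gives about +/-1.6%.
hll = HyperLogLog(p=12)

# Single pass over a large stream; memory stays fixed at 2^p registers.
stream = (f"user-{i % 10_000}".encode("utf8") for i in range(100_000))
for item in stream:
    hll.update(item)

print(f"estimated distinct: {hll.count():,.0f} (true: 10,000)")

# Sketches built on separate partitions merge without losing accuracy,
# which is what makes them practical in distributed engines.
other = HyperLogLog(p=12)
other.update(b"user-99999")
hll.merge(other)
```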
Faire deployed a sparse neural retrieval model to solve vocabulary mismatch in marketplace search while keeping Elasticsearch compatibility and interpretability. By expanding queries and documents with semantically related terms, the system improved long-tail candidate quality by over 30%, lifted search-page order value by 4.27%, and increased global marketplace order value. Key engineering choices included domain-specific BERT pretraining, WordPiece tokenization, max pooling, asymmetric sparsity penalties, and moving Product Quality Score blending to index time to preserve latency.
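The summary doesn’t spell out Faire’s schema, but the general pattern looks roughly like this sketch, assuming expansion terms and weights are written into an Elasticsearch rank_features field at index time; the index name, field names, and weights are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster

# Index time: store model-produced expansion terms and weights in a
# rank_features field alongside the original document text.
es.indices.create(index="products", mappings={
    "properties": {
        "title": {"type": "text"},
        "expansion": {"type": "rank_features"},  # sparse term -> weight map
    },
})
es.index(index="products", document={
    "title": "floral midi dress",
    # Hypothetical weights from a sparse encoder.
    "expansion": {"dress": 2.1, "gown": 1.3, "summer": 0.8, "floral": 1.7},
})

# Query time: expand the query the same way, then score each expansion
# term with a rank_feature clause, so matches stay interpretable.
resp = es.search(index="products", query={
    "bool": {"should": [
        {"rank_feature": {"field": "expansion.dress", "boost": 2.1}},
        {"rank_feature": {"field": "expansion.gown", "boost": 1.3}},
    ]},
})
```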
Opinions & Advice
Agent responsibly (5 minute read)
AI coding agents can produce convincing, production-ready code that passes tests but still fails in real-world systems, creating false confidence and risk. The solution is to leverage agents (not rely on them) by maintaining human ownership and building strong infrastructure guardrails that make safe deployment the default.
The Revenge of the Data Scientist (9 minute read)
Data scientists are not becoming obsolete despite the rise of powerful LLMs and easy-to-use AI APIs. Instead, their core skills in experimentation, evaluation design, observability, metric creation, and “always looking at the data” are more critical than ever, forming the essential “harness” that makes AI agents and systems reliable, debuggable, and effective in production.
Nobody Is Making Decisions With Your Dashboards (6 minute read)
Dashboard requests are often proxy asks for visibility theater, data ownership, anxiety reduction, or raw data export, but not true BI needs. Treating data teams as a “Human SQL API” creates technical debt, orphaned pipelines, and noisy, untrusted environments, especially when dashboards lack clear owners or decommissioning processes. Stakeholders must define the decision, the action, and accountability before any dashboard is built.
Change Data Capture: Stop Copying 50M Rows to Move 5K Changes (7 minute read)
Change Data Capture (CDC) is a technique for efficiently tracking and streaming only the changes from a source database instead of repeatedly copying entire tables. Popular tools include Debezium, Kafka, Fivetran, and Striim. Start simple with timestamp-based polling for prototyping, then move to log-based CDC for reliable, low-latency, scalable data synchronization in modern data pipelines.
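A minimal sketch of the timestamp-based starting point; the orders table, updated_at column, and connection string are hypothetical.

```python
import psycopg2

def sync_changes(conn, last_sync_ts):
    """Timestamp-based CDC: pull only rows changed since the last run.

    Easy to prototype, but it misses deletes and anything that skips
    updated_at; log-based CDC (e.g. Debezium reading the WAL) fixes that.
    """
    with conn.cursor() as cur:
        # Hypothetical table: orders(id, ..., updated_at timestamptz)
        cur.execute(
            "SELECT * FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (last_sync_ts,),
        )
        return cur.fetchall()  # ship only the 5K changed rows downstream

conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
changes = sync_changes(conn, last_sync_ts="2026-04-01T00:00:00Z")
```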
Launches & Tools
MotherDuck Now Speaks Postgres (4 minute read)
MotherDuck announced a new Postgres-compatible endpoint that lets users connect to and query their MotherDuck data warehouse using any standard PostgreSQL client, driver, or BI tool, allowing teams to keep Postgres for transactional workloads and offload fast analytical queries to MotherDuck’s serverless compute.
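A hedged sketch of what that compatibility buys: any standard Postgres driver, here psycopg2, can run analytical SQL against the endpoint. The host, database, and credentials below are placeholders, not MotherDuck’s real connection details.

```python
import psycopg2

# Placeholder connection details; consult MotherDuck's docs for the
# actual endpoint host and authentication scheme.
conn = psycopg2.connect(
    host="example.motherduck.com",
    dbname="my_db",
    user="me",
    password="<token>",
)
with conn.cursor() as cur:
    # The analytical query runs on MotherDuck's serverless compute,
    # while the application keeps speaking the plain Postgres wire protocol.
    cur.execute("SELECT region, count(*) FROM events GROUP BY region")
    for region, n in cur.fetchall():
        print(region, n)
```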
Writing Custom Table Providers in Apache DataFusion (9 minute read)
DataFusion table providers let custom sources expose data from files, APIs, or proprietary systems by separating planning from execution. TableProvider::scan() runs during planning and should stay lightweight, while ExecutionPlan::execute() creates per-partition streams and SendableRecordBatchStream does the actual data work. Correctly declaring partitioning, ordering, and filter pushdown can eliminate RepartitionExec, SortExec, and wasted I/O.
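A quick way to check those planning effects from DataFusion’s Python bindings (not the Rust TableProvider trait itself): EXPLAIN a query and look for RepartitionExec or SortExec nodes in the physical plan. The file and column names here are assumptions.

```python
from datafusion import SessionContext

ctx = SessionContext()
# Hypothetical file; the built-in CSV provider stands in for a custom one.
ctx.register_csv("events", "events.csv")

# EXPLAIN prints the logical and physical plans. A provider that correctly
# declares ordering and supports filter pushdown should make SortExec and
# full-scan filtering disappear from the physical plan.
ctx.sql(
    "EXPLAIN SELECT ts FROM events WHERE ts > '2026-01-01' ORDER BY ts"
).show()
```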
Qdrant Skills for AI Agents (8 minute read)
Qdrant introduced open-source “skills” that encode production vector-search expertise for agents, moving beyond basic RAG patterns like embed → retrieve top-k → prompt. The skills provide symptom-based decision trees for issues like memory pressure, latency regressions, tombstone buildup, and multitenancy, while qcloud-cli handles cluster operations in the terminal and in CI/CD. It shows how skills can shift agentic patterns from “read the doc” to diagnosis-aware guidance akin to that of a solutions architect.
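For contrast, the baseline pattern the skills move beyond is just a top-k lookup; a toy sketch with qdrant-client, where the collection name, vectors, and payload are made up:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # toy in-process instance

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4],
                        payload={"text": "tombstones and compaction"})],
)

# The naive agent loop stops here: embed -> retrieve top-k -> prompt.
# The skills layer diagnosis on top of this call, not a replacement for it.
hits = client.search(
    collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=3
)
```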
Miscellaneous
Engineering the Memory Layer For An AI Agent To Navigate Large-scale Event Data (12 minute read)
The MLOps Community built a memory layer for an AI agent using ApertureDB as a unified multimodal vector-graph database. A clean graph schema, Gemma embeddings over segmented transcript chunks, constrained semantic search, and ACID transactions let the agent handle complex natural-language queries with high accuracy and fewer hallucinations.
What is inference engineering? Deepdive (28 minute read)
LLM inference has become a core production concern as open models mature, making inference engineering relevant beyond frontier labs. The stack spans runtime, infrastructure, and tooling, with common optimizations like batching, caching, quantization, speculative decoding, tensor/expert parallelism, and disaggregated prefill/decode. At scale, these techniques can cut latency, improve uptime to 99.99%+ in dedicated deployments, and reduce cost by 80%+ versus closed-model APIs.
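Of those optimizations, batching is the easiest to see in action: submitting prompts together lets the runtime schedule them continuously instead of one at a time. A minimal sketch using vLLM as the example runtime (not necessarily the article’s stack; the model name is illustrative).

```python
from vllm import LLM, SamplingParams

# One popular open inference runtime; model choice here is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=64)

# Submitting prompts together lets the engine batch them continuously,
# one of the throughput optimizations the article covers.
prompts = ["Summarize CDC in one line.", "What is a data sketch?"]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```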
Quick Links
The Fed Chair Just Said What AI Leaders Won’t: The Models Don’t Work (11 minute read)
Reliable agentic platforms need hybrid architectures combining causal AI, knowledge graphs, simulations, and physics-informed models such as PINNs and digital twins to handle real-world operational complexity.
AlloyDB AI extends PostgreSQL with built-in vector embeddings, ScaNN-based vector search, natural-language SQL, and direct model calls via simple SQL.