Run dlt pipelines and Temporal workflows to sync data from multiple sources (Everflow, Redtrack, S3, PostHog, Mautic, Google Sheets). Supports scheduled execution, real-time triggering, status monitoring, and debugging for data ingestion workflows.
Comprehensive skill for creating, modifying, and optimizing BigQuery tables with partitioning, clustering, and security features. Covers time-based and range partitioning, clustering strategies, schema management, external tables, snapshots, and access control including row-level and column-level security.
Build high-performance data pipelines with stream processing, controlled concurrency, backpressure handling, and resource-safe transformations. Supports batching, resilience patterns, and constant memory processing for large datasets.
Build modern data lakehouses using DuckDB as an ephemeral compute engine, Dagster for orchestration, dbt for SQL transforms, and Apache Iceberg/Polaris for catalog-managed storage. Provides patterns for configuring dbt-duckdb with the Polaris plugin, reading/writing Iceberg tables, and designing a catalog-first lakehouse architecture.
A comprehensive skill for developing BigQuery Dataform transformations with an enforced TDD workflow, safety practices, and proper architecture patterns. Ensures use of ${ref()} syntax, comprehensive documentation, and dev testing with --schema-suffix, and prevents technical debt under time pressure.
Integrate SQL tables into data stack layers (ODS/DIM/DWD/ADS) with Airflow + DuckDB + MinIO architecture. Supports configuring data sources, table dependencies, and creating migration files with proper naming conventions (_full/_inc/_zip).
Process vehicle insurance Excel data using Pandas: file handling, data cleaning, merging, and validation. Handles Excel/CSV imports, implements business rules (negative premiums, zero commissions), and optimizes Pandas performance for data pipelines.
Expert guidance for the Apache NiFi data integration platform, covering flow design, processors, controller services, NiFi Registry, cluster configuration, and real-time data pipeline orchestration.
A comprehensive toolkit for populating Supabase databases with seed data, supporting lookup tables, test data generation, bulk loading with COPY, and large file management with DVC.
Comprehensive Supabase and Postgres database operations for SignalRoom, including database queries, schema inspection, connection management, and data validation. Supports both direct and pooler connections with MCP integration.
Execute SQL queries against a Snowflake data warehouse with multiple authentication methods (password, key-pair, SSO/OAuth) and flexible output formats (JSON, table, CSV). Supports ad-hoc queries, data extraction, and schema exploration.
Lightning-fast in-memory DataFrame library built on Apache Arrow with lazy evaluation and parallel execution. Best for 1-100GB datasets, ETL pipelines, and high-performance data processing as a faster pandas replacement.
Fundamental NumPy operations for efficient multidimensional array manipulation, including ndarray creation, dtype management, shape transformation, and memory alignment optimization for high-performance numerical computing.
Analyzes dbt model dependencies, traces upstream sources and downstream consumers, identifies circular dependencies, and visualizes data lineage for impact analysis and data flow understanding.
An interactive data dashboard built with the Evidence framework that lets users write SQL queries in markdown to create visualizations from happiness score datasets using DuckDB.
Load data into Google BigQuery from various file formats (CSV, JSON, Avro, Parquet) and sources (Cloud Storage, local files). Supports schema detection, partitioning, incremental loading, and error handling for efficient data ingestion.
Export BigQuery tables and query results to Google Cloud Storage in multiple formats (CSV, JSON, Avro, Parquet) using the bq extract command or EXPORT DATA statements, with support for compression, partitioning, and large-scale data transfers.
Query remote Parquet files over HTTP/HTTPS without downloading the entire file using DuckDB's httpfs extension. Leverage column pruning, row filtering, and HTTP range requests for efficient bandwidth usage in crypto/trading data distribution and analytics.
Perform SQL joins and multi-table analysis on Danmarks Statistik (DST) data in DuckDB. Provides patterns for combining tables on common dimensions like time, region, and demographics to enable complex statistical analysis and correlation studies.
Fetch statistical data from Danmarks Statistik API and store it in DuckDB for analysis. Handles unlimited data streaming, complex filters, and automatic format selection to overcome API limitations.
A comprehensive guide for using dlt (data load tool) to build ETL pipelines in SignalRoom, covering source creation, incremental loading, schema evolution, write dispositions, and debugging pipeline failures.
Automatically seed databases with realistic fake data for development, testing, and staging environments. Supports PostgreSQL, MySQL, SQLite, and MongoDB with ORM-based seeding and Faker library integration for generating test fixtures and sample data.
Execute SQL queries against Databricks using the DBSQL MCP server. Provides access to Unity Catalog tables and SQL warehouses, and supports query execution, data exploration, analytics, and ETL operations with comprehensive error handling.
Expert guidance for designing, optimizing, and maintaining database schemas for SQL and NoSQL systems. Covers normalization, indexing, data types, relationships, performance optimization, security policies, GDPR compliance, and migration management with comprehensive validation tools.
Expert database performance optimization for MongoDB, SQL Server, and PostgreSQL. Resolves N+1 queries, optimizes indexes, implements efficient query patterns, and improves data access performance through eager loading, parallel queries, and bulk operations.
Transform and export data using DuckDB SQL. Supports reading from CSV/Parquet/JSON/Excel/databases, applying SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and writing results to various formats. Ideal for data cleaning, format conversion, multi-source joins, and creating partitioned datasets.
Replace demo data with custom data sources by connecting to PostgreSQL databases or importing CSV files. Automatically discovers schema, updates application configuration, and generates sample queries for the new data.
Profile datasets to understand schema, quality, and characteristics. Analyzes CSV, JSON, and Parquet files to extract schema information, statistical summaries, data quality metrics, distributions, uniqueness characteristics, and pattern detection at basic and intermediate levels.
Comprehensive backend development guide for a Next.js 14/tRPC/Express/TypeScript monorepo. Covers tRPC routers, public API endpoints, BullMQ queue processors, layered architecture, dual database system (PostgreSQL + ClickHouse), multi-tenant isolation, OpenTelemetry observability, and testing strategies.
Plan and execute database migrations, data transformations, and system migrations safely with rollback strategies and data integrity validation. Supports schema changes, zero-downtime deployments, and safe migration between database systems.