ClickHouse Endpoint Documentation

Overview

ClickHouse endpoints in Eden provide a high-performance interface for interacting with ClickHouse, an open-source column-oriented database management system that excels at real-time analytics on large datasets. These endpoints are specifically optimized for the analytical workloads and unique performance characteristics of ClickHouse.

Key Features

  • High-Performance Analytics: Optimized for analytical queries on large datasets

  • Column-Oriented Storage: Takes advantage of ClickHouse's column storage format

  • Vectorized Query Execution: Support for ClickHouse's vectorized query processing

  • Real-Time Data Processing: Enables low-latency analytics on fresh data

  • Advanced Aggregation: Efficient implementation of complex analytical functions

  • Distributed Query Processing: Support for ClickHouse's sharded architecture

API Request Types

ClickHouse endpoints currently support a single operation type:

Read Operation

Used for executing SQL queries against ClickHouse to retrieve data.

Required Properties:

  • kind: Must be "clickhouse"

  • type: Must be "read"

  • body: A SQL query string to be executed against ClickHouse

Optional Properties:

  • settings: Object containing ClickHouse-specific query settings

  • parameters: Named parameters to be substituted in the query

  • format: Desired output format (default is "JSON")

Example Request:

{
  "kind": "clickhouse",
  "type": "read",
  "body": "SELECT date, count() AS hits, uniq(user_id) AS unique_users FROM web_traffic WHERE site_id = {site_id:UInt32} AND date >= {start_date:Date} GROUP BY date ORDER BY date DESC LIMIT 10",
  "parameters": {
    "site_id": 42,
    "start_date": "2025-04-01"
  },
  "settings": {
    "max_execution_time": 30,
    "max_threads": 4
  },
  "format": "JSON"
}

Response:

{
  "data": [
    {
      "date": "2025-05-10",
      "hits": 145289,
      "unique_users": 28376
    },
    {
      "date": "2025-05-09",
      "hits": 152843,
      "unique_users": 30128
    },
    // ...more rows
  ],
  "metadata": {
    "rows": 10,
    "bytes": 1240,
    "executionTime": 0.042,
    "statistics": {
      "rows_read": 24563789,
      "bytes_read": 982543560,
      "elapsed_seconds": 0.038
    }
  }
}
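Because the request is plain JSON, it can be assembled with a small helper before being sent to the endpoint. A minimal sketch in Python; the build_read_request helper is illustrative, not part of Eden's published client API:

```python
import json


def build_read_request(body, parameters=None, settings=None, fmt="JSON"):
    """Assemble a ClickHouse read request payload for an Eden endpoint."""
    request = {"kind": "clickhouse", "type": "read", "body": body}
    if parameters:
        request["parameters"] = parameters
    if settings:
        request["settings"] = settings
    request["format"] = fmt
    return request


payload = build_read_request(
    "SELECT count() FROM web_traffic WHERE site_id = {site_id:UInt32}",
    parameters={"site_id": 42},
    settings={"max_execution_time": 30},
)
print(json.dumps(payload, indent=2))
```

The helper only fills optional keys when they are provided, matching the required/optional split described above.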

Use Cases:

  • High-performance analytical queries

  • Time-series data analysis

  • Real-time dashboards and reporting

  • Complex aggregations on large datasets

  • User behavior analytics

  • Log analysis and monitoring

Endpoint Configuration

Basic Configuration Example

{
  "name": "AnalyticsClickhouse",
  "type": "clickhouse",
  "description": "Analytics data warehouse",
  "connection": {
    "host": "clickhouse.example.com",
    "port": 8123,
    "database": "analytics",
    "protocol": "http"
  },
  "authentication": {
    "type": "username_password",
    "username": "${ENV_CLICKHOUSE_USER}",
    "password": "${ENV_CLICKHOUSE_PASSWORD}"
  },
  "settings": {
    "maxExecutionTime": 60,
    "compressionEnabled": true,
    "connectionTimeout": 30,
    "socketTimeout": 300
  }
}
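The ${ENV_CLICKHOUSE_USER} and ${ENV_CLICKHOUSE_PASSWORD} placeholders are resolved from environment variables when the configuration is loaded. A rough sketch of that substitution, assuming simple ${VAR} syntax (the expand_env helper is illustrative, not Eden's actual loader):

```python
import os
import re


def expand_env(value):
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(
        r"\$\{([A-Z0-9_]+)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )


os.environ["ENV_CLICKHOUSE_USER"] = "analyst"
print(expand_env("${ENV_CLICKHOUSE_USER}"))  # -> analyst
```

Keeping credentials in environment variables keeps them out of the configuration file itself.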

Advanced Configuration Example

{
  "name": "EnterpriseClickhouse",
  "type": "clickhouse",
  "description": "Enterprise data analytics platform",
  "connection": {
    "host": "clickhouse-cluster.example.com",
    "port": 8123,
    "database": "enterprise_analytics",
    "protocol": "https",
    "secure": true,
    "clusterName": "analytics_cluster"
  },
  "authentication": {
    "type": "username_password",
    "username": "${ENV_CLICKHOUSE_USER}",
    "password": "${ENV_CLICKHOUSE_PASSWORD}"
  },
  "settings": {
    "maxExecutionTime": 300,
    "compressionEnabled": true,
    "connectionTimeout": 30,
    "socketTimeout": 600
  },
  "advanced": {
    "querySettings": {
      "maxThreads": 8,
      "maxMemoryUsage": "20G",
      "maxBytesBeforeExternal": "10G",
      "joinAlgorithm": "auto",
      "maxInsertBlockSize": 1000000,
      "prefer_column_name_to_alias": 0
    },
    "connection": {
      "tcpKeepAlive": true,
      "loadBalancing": "round_robin",
      "maxParallelReplicas": 2
    },
    "retries": 3,
    "retryBackoffMs": 1000,
    "caching": {
      "enabled": true,
      "ttlSeconds": 300,
      "maxCacheSize": "1G"
    },
    "metrics": {
      "enabled": true,
      "collectQueryStatistics": true
    }
  }
}
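The retries and retryBackoffMs options above imply client-side retry logic around failed queries. A minimal sketch, assuming the delay doubles on each attempt (the doubling curve is an assumption; only the initial delay is configured here):

```python
import time


def with_retries(fn, retries=3, backoff_ms=1000, sleep=time.sleep):
    """Call fn, retrying on exception with a doubling delay between attempts."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of attempts, surface the error
            sleep(backoff_ms * (2 ** attempt) / 1000.0)


attempts = {"n": 0}


def flaky_query():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "rows"


recorded = []
result = with_retries(flaky_query, retries=3, backoff_ms=1000, sleep=recorded.append)
print(result, recorded)  # rows [1.0, 2.0]
```

Injecting the sleep function keeps the retry loop testable without real delays.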

Configuration Options Reference

Connection Options

| Option | Type | Description |
| --- | --- | --- |
| host | String | ClickHouse server hostname or IP |
| port | Number | Port to connect to (default: 8123 for HTTP, 9000 for native) |
| database | String | Default database to use |
| protocol | String | "http", "https", or "native" |
| secure | Boolean | Whether to use TLS/SSL for connections |
| clusterName | String | ClickHouse cluster name for distributed operations |

Authentication Options

| Option | Type | Description |
| --- | --- | --- |
| type | String | Authentication type: "username_password", "none", or "ldap" |
| username | String | Username for authentication |
| password | String | Password for authentication |
| ldapServer | String | LDAP server (for LDAP authentication) |

Settings Options

| Option | Type | Description |
| --- | --- | --- |
| maxExecutionTime | Number | Maximum query execution time in seconds |
| compressionEnabled | Boolean | Whether to use compression for data transfer |
| connectionTimeout | Number | Connection timeout in seconds |
| socketTimeout | Number | Socket operation timeout in seconds |

Advanced Options

| Option | Type | Description |
| --- | --- | --- |
| querySettings | Object | ClickHouse-specific query settings |
| connection | Object | Advanced connection parameters |
| retries | Number | Number of retry attempts for failed queries |
| retryBackoffMs | Number | Backoff time between retries in milliseconds |
| caching | Object | Result caching configuration |
| metrics | Object | Metrics collection configuration |

Query Features

Parameter Binding

ClickHouse endpoints support named parameter binding with type hints:

SELECT * FROM users WHERE id = {user_id:UInt64} AND active = {status:UInt8}

The following types are supported:

  • UInt8, UInt16, UInt32, UInt64

  • Int8, Int16, Int32, Int64

  • Float32, Float64

  • String, FixedString

  • Date, DateTime, DateTime64

  • UUID, IPv4, IPv6

  • Array(T) where T is one of the above types
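Under ClickHouse's HTTP interface, named parameters travel as param_<name> fields in the query string, matching the {name:Type} placeholders in the SQL. A sketch of that translation; to_http_params is an illustrative helper, and how Eden performs this mapping internally is an assumption:

```python
def to_http_params(parameters):
    """Map a parameters object to ClickHouse HTTP 'param_<name>' fields."""
    return {f"param_{name}": str(value) for name, value in parameters.items()}


params = to_http_params({"user_id": 9001, "status": 1})
print(params)  # {'param_user_id': '9001', 'param_status': '1'}
```

Server-side binding like this avoids string-concatenating values into SQL, which protects against injection.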

Output Formats

The endpoint supports multiple output formats:

  • JSON (default): Returns data as JSON objects

  • JSONEachRow: JSON format with one row per line

  • CSVWithNames: CSV format with column headers

  • TabSeparated: Tab-separated values

  • Pretty: Human-readable text format
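JSONEachRow responses carry one JSON object per line, which makes them easy to stream and parse incrementally. A short sketch:

```python
import json

# Two rows as they would arrive in JSONEachRow format, one object per line.
raw = '{"date":"2025-05-10","hits":145289}\n{"date":"2025-05-09","hits":152843}\n'

rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(rows[0]["hits"])  # 145289
```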

Query Settings

Key ClickHouse query settings that can be specified:

  • max_execution_time: Maximum query execution time in seconds

  • max_threads: Maximum number of threads to use for query execution

  • max_memory_usage: Maximum memory usage for query execution

  • priority: Query priority (higher values get more resources)

  • distributed_product_mode: Mode for distributed joins
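Settings like max_execution_time can appear both in the endpoint configuration and on individual requests. A sketch of one plausible precedence rule, assuming request-level settings override endpoint defaults (the actual precedence is not specified here):

```python
def effective_settings(endpoint_defaults, request_settings):
    """Merge settings: request-level values win over endpoint defaults."""
    merged = dict(endpoint_defaults)
    merged.update(request_settings or {})
    return merged


print(effective_settings(
    {"max_execution_time": 60, "max_threads": 8},
    {"max_execution_time": 30},
))
# {'max_execution_time': 30, 'max_threads': 8}
```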

Best Practices

Query Optimization

  1. Schema Design:

    • Design tables with the MergeTree engine family

    • Choose appropriate sorting keys based on query patterns

    • Use materialized columns for frequently computed values

    • Consider data skipping indexes for frequently filtered columns

  2. Query Writing:

    • Filter early and on indexed columns

    • Use PREWHERE for preliminary filtering

    • Leverage FINAL keyword judiciously (it can be expensive)

    • Use subqueries with WITH for readability and optimization

  3. Aggregation Optimization:

    • Use approximate functions (uniqHLL12, quantileTDigest) for large datasets

    • Consider two-level aggregation for distributed tables

    • Use SimpleAggregateFunction when applicable

    • Pre-aggregate data with materialized views when possible

Performance Tuning

  1. Resource Management:

    • Set appropriate max_threads based on query complexity

    • Set reasonable memory limits with max_memory_usage

    • Use max_execution_time to prevent runaway queries

    • Consider priority settings for important queries

  2. Data Loading:

    • Use bulk inserts with large batches

    • Take advantage of ClickHouse's parallel insertion capabilities

    • Consider using the Buffer engine for write-heavy workloads

    • Use asynchronous inserts for non-critical data

  3. Caching Strategy:

    • Enable result caching for frequently executed queries

    • Set appropriate TTL values based on data freshness requirements

    • Monitor cache hit rates and adjust accordingly
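A result cache like the one configured earlier pairs each cached value with a timestamp and evicts entries older than the TTL. A toy sketch with an injectable clock for testing; Eden's actual cache implementation is not specified here:

```python
import time


class ResultCache:
    """Tiny TTL cache keyed by (query, params); clock is injectable for tests."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None:
            return None
        value, stored_at = hit
        if self.clock() - stored_at > self.ttl:
            del self.entries[key]  # stale: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self.entries[key] = (value, self.clock())


now = [0.0]
cache = ResultCache(ttl_seconds=300, clock=lambda: now[0])
cache.put(("SELECT 1", ()), [[1]])
print(cache.get(("SELECT 1", ())))  # [[1]]
now[0] = 301.0
print(cache.get(("SELECT 1", ())))  # None
```

Keying on both the query text and its bound parameters prevents results from one parameter set being served for another.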

Operational Considerations

  1. Monitoring:

    • Track query performance metrics

    • Monitor system tables for resource usage

    • Set up alerts for slow queries

    • Track disk space usage

  2. High Availability:

    • Configure multiple contact points for ClickHouse clusters

    • Implement retry logic for transient failures

    • Use load balancing across replicas

    • Consider read/write splitting for heavy workloads

Troubleshooting

Common Issues and Solutions

  1. Memory Issues:

    • Adjust max_memory_usage in query settings

    • Optimize queries to reduce memory consumption

    • Consider breaking large queries into smaller chunks

    • Use streaming for large result sets

  2. Performance Problems:

    • Check system.query_log for slow queries

    • Verify proper index usage with EXPLAIN

    • Look for distributed query bottlenecks

    • Check for proper sorting key usage

  3. Connection Failures:

    • Verify network connectivity

    • Check authentication credentials

    • Ensure database exists

    • Verify server resource availability

Advanced Diagnostics

  1. Query Analysis:

    • Use EXPLAIN to understand query execution

    • Check system.query_log for detailed query metrics

    • Analyze system.metrics and system.asynchronous_metrics

    • Review system.processes for currently running queries

  2. Profile Queries:

    EXPLAIN PIPELINE
    SELECT count() FROM your_table WHERE condition
    
  3. Identify Bottlenecks:

    • Check CPU usage on ClickHouse nodes

    • Monitor network throughput

    • Check disk I/O performance

    • Review memory usage patterns

Example Queries

Time-Series Analytics

SELECT
    toStartOfHour(timestamp) AS hour,
    count() AS events,
    uniq(user_id) AS unique_users,
    avg(duration_ms) AS avg_duration
FROM user_events
WHERE date >= {start_date:Date} AND date <= {end_date:Date}
GROUP BY hour
ORDER BY hour DESC

Funnel Analysis

SELECT
    date,
    countIf(event_type = 'view') AS views,
    countIf(event_type = 'click') AS clicks,
    countIf(event_type = 'purchase') AS purchases,
    round(countIf(event_type = 'click') / countIf(event_type = 'view'), 3) AS ctr,
    round(countIf(event_type = 'purchase') / countIf(event_type = 'click'), 3) AS conversion
FROM events
WHERE date BETWEEN {start_date:Date} AND {end_date:Date}
GROUP BY date
ORDER BY date

Cohort Analysis

WITH
    first_purchase AS (
        SELECT
            user_id,
            min(toStartOfMonth(order_date)) AS cohort_month
        FROM orders
        GROUP BY user_id
    )
SELECT
    first_purchase.cohort_month AS cohort_month,
    toRelativeMonthNum(order_date) - toRelativeMonthNum(first_purchase.cohort_month) AS month_number,
    count(DISTINCT orders.user_id) AS active_users
FROM orders
JOIN first_purchase ON orders.user_id = first_purchase.user_id
GROUP BY cohort_month, month_number
ORDER BY cohort_month, month_number