Skip to main content

Python Integration with ClickHouse Connect

Introduction

ClickHouse Connect is a suite of Python packages providing interoperability with a wide range of Python applications. The three primary components are:

  • A low level driver in the package clickhouse_connect.driver. This package includes a basic client that handles all ClickHouse Connect requests to the ClickHouse Server, as well assorted helper classes and utility functions.
  • A limited SQLAlchemy dialect in the package clickhouse_connect.cc_sqlalchemy. This package focuses implements query/cursor functionality, and does not generally support SQLAlchemy DDL and ORM operations. (SQLAlchemy is targeted toward OLTP databases, and we recommend more specialized tools and frameworks to manage the ClickHouse OLAP database.)
  • An Apache Superset EngineSpec in the clickhouse_connect.cc_superset. Note that this sub-package is deprecated and will be removed in a future release. The functionality has been moved to the main Superset project (starting with version 2.1.0), and is included here for backward compatibility with older Superset versions.

This documentation is current as of the beta release 0.5.24.

note

The official ClickHouse Connect Python driver uses HTTP protocol for communication with the ClickHouse server. It has some advantages (like better flexibility, HTTP-balancers support, better compatibility with JDBC-based tools, etc) and disadvantages (like slightly lower compression and performance, and a lack of support for some complex features of the native TCP-based protocol). For some use cases, you may consider using one of the Community Python drivers that uses native TCP-based protocol.

Requirements and Compatibility

PythonPlatform¹ClickHouseSQLAlchemy²Apache Superset
2.x, <3.7Linux (x86)<22.8³🟡<1.3<1.4
3.7.xLinux (Aarch64)22.8.x1.3.x1.4.x
3.8.xmacOS (x86)22.9-23.1³🟡1.4.x1.5.x
3.9.xmacOS (ARM)23.2.x>=2.x2.0.x
3.10.xWindows23.3.x ³2.1.x
3.11.x23.4.x

¹ClickHouse Connect has been explicitly tested against the listed platforms. In addition, untested binary wheels (with C optimization) are built for all architectures supported by the excellent cibuildwheel project. Finally, because ClickHouse Connect can also run as pure Python, the source installation should work on any recent Python installation.

²Again SQLAlchemy support is limited primarily to query functionality. The full SQLAlchemy API is not supported.

³ClickHouse Connect has been tested against all currently supported ClickHouse versions. Because it uses the HTTP protocol, it should also work correctly for most other versions of ClickHouse, although there may be some incompatibilities with certain advanced data types.

Installation

Install ClickHouse Connect from PyPI via pip:

pip install clickhouse-connect

ClickHouse Connect can also be installed from source:

  • git clone the GitHub repository.
  • (Optional) run pip install cython to build and enable the C/Cython optimizations
  • cd to the project root directory and run pip install .

Support Policy

ClickHouse Connect is currently in beta and only the current beta release is actively supported. Please update to the latest version before reported any issues. Issues should be filed in the GitHub project. Future releases of ClickHouse Connect are guaranteed to be compatible with actively supported ClickHouse versions at the time of release (generally the three most recent stable and two most recent lts releases).

Basic Usage

Gather your connection details

To connect to ClickHouse with HTTP(S) you need this information:

  • The HOST and PORT: typically, the port is 8443 when using TLS or 8123 when not using TLS.

  • The DATABASE NAME: out of the box, there is a database named default, use the name of the database that you want to connect to.

  • The USERNAME and PASSWORD: out of the box, the username is default. Use the username appropriate for your use case.

The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select the service that you will connect to and click Connect:

ClickHouse Cloud service connect button

Choose HTTPS, and the details are available in an example curl command.

ClickHouse Cloud HTTPS connection details

If you are using self-managed ClickHouse, the connection details are set by your ClickHouse administrator.

Establish a connection

There are two examples shown for connecting to ClickHouse:

  • Connecting to a ClickHouse server on localhost.
  • Connecting to a ClickHouse Cloud service.
Use a ClickHouse Connect client instance to connect to a ClickHouse server on localhost:
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='password')
Use a ClickHouse Connect client instance to connect to a ClickHouse Cloud service:
tip

Use the connection details gathered earlier. ClickHouse Cloud services require TLS, so use port 8443.

import clickhouse_connect

client = clickhouse_connect.get_client(host='HOSTNAME.clickhouse.cloud', port=8443, username='default', password='your password')

Interact with your database

To run a ClickHouse SQL command, use the client command method:

client.command('CREATE TABLE new_table (key UInt32, value String, metric Float64) ENGINE MergeTree ORDER BY key')

To insert batch data, use the client insert method with a two-dimensional array of rows and values:

row1 = [1000, 'String Value 1000', 5.233]
row2 = [2000, 'String Value 2000', -107.04]
data = [row1, row2]
client.insert('new_table', data, column_names=['key', 'value', 'metric'])

To retrieve data using ClickHouse SQL, use the client query method:

result = client.query('SELECT max(key), avg(metric) FROM new_table')
result.result_rows
Out[13]: [(2000, -50.9035)]

ClickHouse Connect Driver API

Note: Passing keyword arguments is recommended for most api methods given the number of possible arguments, most of which are optional.

Client Initialization

The clickhouse_connect.driver.client class provides the primary interface between a Python application and the ClickHouse database server. Use the clickhouse_connect.get_client function to obtain a Client instance, which accepts the following arguments:

Connection Arguments

ParameterTypeDefaultDescription
interfacestrhttpMust be http or https.
hoststrNoneThe hostname or IP address of the ClickHouse server. If not set, localhost will be used.
portint8123 or 8443The ClickHouse HTTP or HTTPS port. If not set will default to 8123, or to 8443 if secure=True or interface=https.
usernamestrNoneThe ClickHouse user name. If not set, the default ClickHouse user will be used.
passwordstr<empty string>The password for username.
databasestrNoneThe default database for the connection. If not set, ClickHouse Connect will use the default database for username.
secureboolFalseUse https/TLS. This overrides inferred values from the interface or port arguments.
dsnstrNoneA string in standard DSN (Data Source Name) format. Other connection values (such as host or user) will be extracted from this string if not set otherwise.
compressbool or strTrueEnable compression for ClickHouse HTTP inserts and query results. See Additional Options (Compression)
query_limitint0 (unlimited)Maximum number of rows to return for any query response. Set this to zero to return unlimited rows. Note that large query limits may result in out of memory exceptions if results are not streamed, as all results are loaded into memory at once.
query_retriesint2Maximum number of retries for a query request. Only "retryable" HTTP responses will be retried. command or insert requests are not automatically retried by the driver to prevent unintended duplicate requests.
connect_timeoutint10HTTP connection timeout in seconds.
send_receive_timeoutint300Send/receive timeout for the HTTP connection in seconds.
client_namestrNoneclient_name prepended to the HTTP User Agent header. Set this to track client queries in the ClickHouse system.query_log.
pool_mgrobj<default PoolManager>The urllib3 library PoolManager to use. For advanced use cases requiring multiple connection pools to different hosts.
http_proxystrNoneHTTP proxy address (equivalent to setting the HTTP_PROXY environment variable).
https_proxystrNoneHTTPS proxy address (equivalent to setting the HTTPS_PROXY environment variable).
apply_server_timezoneboolTrueUse server timezone for timezone aware query results. See [Timezone Precedence])(#time-zones)

HTTPS/TLS Arguments

ParameterTypeDefaultDescription
verifyboolTrueValidate the ClickHouse server TLS/SSL certificate (hostname, expiration, etc.) if using HTTPS/TLS.
ca_certstrNoneIf verify=True, the file path to Certificate Authority root to validate ClickHouse server certificate, in .pem format. Ignored if verify is False. This is not necessary if the ClickHouse server certificate is a globally trusted root as verified by the operating system.
client_certstrNoneFile path to a TLS Client certificate in .pem format (for mutual TLS authentication). The file should contain a full certificate chain, including any intermediate certificates.
client_cert_keystrNoneFile path to the private key for the Client Certificate. Required if the private key is not included the Client Certificate key file.
server_host_namestrNoneThe ClickHouse server hostname as identified by the CN or SNI of its TLS certificate. Set this to avoid SSL errors when connecting through a proxy or tunnel with a different hostname

Settings Argument

Finally, the settings argument to get_client is used to pass additional ClickHouse settings to the server for each client request. Note that in most cases, users with readonly=1 access cannot alter settings sent with a query, so ClickHouse Connect will drop such settings in the final request and log a warning. The following settings apply only to HTTP queries/sessions used by ClickHouse Connect, and are not documented as general ClickHouse settings.

SettingDescription
buffer_sizeBuffer size (in bytes) used by ClickHouse Server before writing to the HTTP channel.
session_idA unique session id to associate related queries on the server. Required for temporary tables.
compressWhether the ClickHouse server should compress the POST response data. This setting should only be used for "raw" queries.
decompressWhether the data sent to ClickHouse server must be decompressed. This setting is should only be used for "raw" inserts.
quota_keyThe quota key associated with this requests. See the ClickHouse server documentation on quotas.
session_checkUsed to check the session status.
session_timeoutNumber of seconds of inactivity before the identified by the session id will timeout and no longer be considered valid. Defaults to 60 seconds.
wait_end_of_queryBuffers the entire response on the ClickHouse server. This setting is required to return summary information, and is set for automatically on non-streaming queries.

For other ClickHouse settings that can be sent with each query, see the ClickHouse documentation.

Client Creation Examples

  • Without any parameters, a ClickHouse Connect client will connect to the default HTTP port on localhost with the default user and no password:
import clickhouse_connect

client = clickhouse_connect.get_client()
client.server_version
Out[2]: '22.10.1.98'
  • Connecting to a secure (https) external ClickHouse server
import clickhouse_connect

client = clickhouse_connect.get_client(host='play.clickhouse.com', secure=True, port=443, user='play', password='clickhouse')
client.command('SELECT timezone()')
Out[2]: 'Etc/UTC'
  • Connecting with a session id and other custom connection parameters and ClickHouse settings.
import clickhouse_connect

client = clickhouse_connect.get_client(host='play.clickhouse.com',
user='play',
password='clickhouse',
port=443,
session_id='example_session_1',
connect_timeout=15,
database='github',
settings={'distributed_ddl_task_timeout':300)
client.database
Out[2]: 'github'

Common Method Arguments

Several client methods use one or both of the common parameters and settings arguments. These keyword arguments are described below.

Parameters Argument

ClickHouse Connect Client query* and command methods accept an optional parameters keyword argument used for binding Python expressions to a ClickHouse value expression. Two sorts of binding are available.

Server Side Binding

ClickHouse supports server side binding for most query values, where the bound value is sent separate from the query as an HTTP query parameter. ClickHouse Connect will add the appropriate query parameters if it detects a binding expression of the form {<name>:<datatype>}. For server side binding, the parameters argument should be a Python dictionary.

  • Server Side Binding with Python Dictionary, DateTime value and string value
import datetime

my_date = datetime.datetime(2022, 10, 01, 15, 20, 5)

parameters = {'table': 'my_table', 'v1': my_date, 'v2': "a string with a single quote'"}
client.query('SELECT * FROM {table:Identifier} WHERE date >= {v1:DateTime} AND string ILIKE {v2:String}', parameters=parameters)

# Generates the following query on the server
# SELECT * FROM my_table WHERE date >= '2022-10-01 15:20:05' AND string ILIKE 'a string with a single quote\''
Client Side Binding

ClickHouse Connect also supports client side parameter binding which can allow more flexibility in generating templated SQL queries. For client side binding, the parameters argument should be a dictionary or a sequence. Client side binding uses the Python "printf" style string formatting for parameter substitution.

Note that unlike server side binding, client side binding doesn't work for database identifiers such as database, table, or column names, since Python style formatting can't distinguish between the different types of strings, and they need to be formatted differently (backticks or double quotes for database identifiers, single quotes for data values).

  • Example with Python Dictionary, DateTime value and string escaping
import datetime

my_date = datetime.datetime(2022, 10, 01, 15, 20, 5)

parameters = {'v1': my_date, 'v2': "a string with a single quote'"}
client.query('SELECT * FROM some_table WHERE date >= %(v1)s AND string ILIKE %(v2)s', parameters=parameters)

# Generates the following query:
# SELECT * FROM some_table WHERE date >= '2022-10-01 15:20:05' AND string ILIKE 'a string with a single quote\''
  • Example with Python Sequence (Tuple), Float64, and IPv4Address
import ipaddress

parameters = (35200.44, ipaddress.IPv4Address(0x443d04fe))
client.query('SELECT * FROM some_table WHERE metric >= %s AND ip_address = %s', parameters=parameters)

# Generates the following query:
# SELECT * FROM some_table WHERE metric >= 35200.44 AND ip_address = '68.61.4.254''

Settings Argument

All the core SQL Client ClickHouse database server accept an optional settings keyword argument used for passing ClickHouse server user settings for the included SQL statement. The settings argument should be a dictionary. Each item should be a ClickHouse setting name and its associated value. Note that values will be converted to strings when sent to the server as query parameters.

As with client level settings, ClickHouse Connect will drop any settings that the server marks as readonly=1, with an associated log message. Settings that apply only to queries via the ClickHouse HTTP interface are always valid. Those settings are described under the get_client API.

Example of using ClickHouse settings:

settings = {'merge_tree_min_rows_for_concurrent_read': 65535,
'session_id': 'session_1234',
'use_skip_indexes': False}
client.query("SELECT event_type, sum(timeout) FROM event_errors WHERE event_time > '2022-08-01'", settings=settings)

Client command Method

Use the Client.command method to send SQL queries to the ClickHouse Server that do not normally return data or return a simple single value rather than a full dataset. This method takes the following parameters:

ParameterTypeDefaultDescription
cmdstrRequiredA ClickHouse SQL statement that returns a single value or a single row of values.
parametersdict or iterableNoneSee parameters description.
datastr or bytesNoneOptional data to include with the command as the POST body.
settingsdictNoneSee settings description.
use_databaseboolTrueUse the client database (specified when creating the client). False means the command will use the default ClickHouse Server database for the connected user.
external_dataExternalDataNoneAn ExternalData object containing file or binary data to use with the query. See Advanced Queries (External Data)
  • command can be used for DDL statements
client.command('CREATE TABLE test_command (col_1 String, col_2 DateTime) Engine MergeTree ORDER BY tuple()')
client.command('SHOW CREATE TABLE test_command')
Out[6]: 'CREATE TABLE default.test_command\\n(\\n `col_1` String,\\n `col_2` DateTime\\n)\\nENGINE = MergeTree\\nORDER BY tuple()\\nSETTINGS index_granularity = 8192'
  • command can also be used for simple queries that return only a single row
result = client.command('SELECT count() FROM system.tables')
result
Out[7]: 110

Client query Method

The Client.query method is the primary way to retrieve a single "batch" dataset from the ClickHouse Server. It utilizes the Native ClickHouse format over HTTP to transmit large datasets (up to approximately one million rows) efficiently. This method takes the following parameters.

ParameterTypeDefaultDescription
querystrRequiredThe ClickHouse SQL SELECT or DESCRIBE query.
parametersdict or iterableNoneSee parameters description.
settingsdictNoneSee settings description.
query_formatsdictNoneDatatype formatting specification for result values. See Advanced Usage (Read Formats)
column_formatsdictNoneDatatype formatting per column. See Advanced Usage (Read Formats)
encodingstrNoneEncoding used to encode ClickHouse String columns into Python strings. Python defaults to UTF-8 if not set.
use_noneboolTrueUse Python None type for ClickHouse nulls. If False, use a datatype default (such as 0) for ClickHouse nulls. Note - defaults to False for numpy/Pandas for performance reasons.
column_orientedboolFalseReturn the results as a sequence of columns rather than a sequence of rows. Helpful for transforming Python data to other column oriented data formats.
query_tzstrNoneA timezone name from the zoneinfo database. This timezone will be applied to all datetime or Pandas Timestamp objects returned by the query.
column_tzsdictNoneA dictionary of column name to timezone name. Like query_tz, but allows specifying different timezones for different columns.
use_extended_dtypesboolTrueUse Pandas extended dtypes (like StringArray), and pandas.NA and pandas.NaT for ClickHouse NULL values. Apples only to query_df and query_df_stream methods.
external_dataExternalDataNoneAn ExternalData object containing file or binary data to use with the query. See Advanced Queries (External Data)
contextQueryContextNoneA reusable QueryContext object can be used to encapsulate the above method arguments. See Advanced Queries (QueryContexts)

The QueryResult Object

The base query method returns a QueryResult object with the following public properties:

  • result_rows -- A matrix of the data returned in the form of a Sequence of rows, with each row element being a sequence of column values.
  • result_columns -- A matrix of the data returned in the form of a Sequence of columns, with each column element being a sequence of the row values for that column
  • column_names -- A tuple of strings representing the column names in the result_set
  • column_types -- A tuple of ClickHouseType instances representing the ClickHouse data type for each column in the result_columns
  • query_id -- The ClickHouse query_id (useful for examining the query in the system.query_log table)
  • summary -- Any data returned by the X-ClickHouse-Summary HTTP response header
  • first_item -- A convenience property for retrieving the first row of the response as a dictionary (keys are column names)
  • first_row -- A convenience property to return the first row of the result
  • column_block_stream -- A generator of query results in column oriented format. This property should not be referenced directly (see below).
  • row_block_stream -- A generator of query results in row oriented format. This property should not be referenced directly (see below).
  • rows_stream -- A generator of query results that yields a single row per invocation. This property should not be referenced directly (see below).

The *_stream properties return a Python Context that can be used as an iterator for the returned data. They should only be accessed indirectly using the Client *_stream methods. In a future release, the QueryResult object returned by the main Client query method will have consumed the stream and contain the entire populated result_set to provide a clean separation between completed, "batch" results retrieved via the Client query method and streaming results retrieved via the Client query_*_stream methods.

The complete details of streaming query results (using StreamContext objects) are outlined in Advanced Queries (Streaming Queries).

Consuming query results with Numpy, Pandas or Arrow

There are three specialized versions of the main query method:

  • query_np -- This version returns a Numpy Array instead a ClickHouse Connect QueryResult.
  • query_df -- This version returns a Pandas Dataframe instead of a ClickHouse Connect QueryResult.
  • query_arrow -- This version returns a PyArrow Table. It utilizes the ClickHouse Arrow format directly, so it only accepts three arguments in common with the main query method: query, parameters, and settings. In addition there is additional argument use_strings which determines whether the Arrow Table will render ClickHouse String types as strings (if True) or bytes (if False).

Client Streaming Query Methods

The ClickHouse Connect Client provides multiple methods for retrieving data as a stream (implemented as a Python generator):

  • query_column_block_stream -- Returns query data in blocks as a sequence of columns using native Python object
  • query_row_block_stream -- Returns query data as a block of rows using native Python object
  • query_rows_stream -- Returns query data as a sequence of rows using native Python object
  • query_np_stream -- Returns each ClickHouse block of query data as a Numpy array
  • query_df_stream -- Returns each ClickHouse Block of query data as a Pandas Dataframe

Each of these methods returns a ContextStream object that must be opened via a with statement to start consuming the stream. See Advanced Queries (Streaming Queries) for details and examples.

Client insert Method

For the common use case of inserting multiple records into ClickHouse, there is the Client.insert method. It takes the following parameters:

ParameterTypeDefaultDescription
tablestrRequiredThe ClickHouse table to insert into. The full table name (including database) is permitted.
dataSequence of SequencesRequiredThe matrix of data to insert, either a Sequence of rows, each of which is a sequence of column values, or a Sequence of columns, each of which is a sequence of row values.
column_namesSequence of str, or str'*'A list of column_names for the data matrix. If '*' is used instead, ClickHouse Connect will execute a "pre-query" to retrieve all of the column names for the table.
databasestr''The target database of the insert. If not specified, the database for the client will be assumed.
column_typesSequence of ClickHouseTypeNoneA list of ClickHouseType instances. If neither column_types or column_type_names is specified, ClickHouse Connect will execute a "pre-query" to retrieve all the column types for the table.
column_type_namesSequence of ClickHouse type namesNoneA list of ClickHouse datatype names. If neither column_types or column_type_names is specified, ClickHouse Connect will execute a "pre-query" to retrieve all the column types for the table.
column_orientedboolFalseIf True, the data argument is assume to be a Sequence of columns (and no "pivot" will be necessary to insert the data). Otherwise data is interpreted as a Sequence of rows.
settingsdictNoneSee settings description.
insert_contextInsertContextNoneA reusable InsertContext object can be used to encapsulate the above method arguments. See Advanced Inserts (InsertContexts)

This method does not return a value. An exception will be raised if the insert fails for any reason.

There are two specialized versions of the main insert method:

  • insert_df -- Instead of Python Sequence of Sequences data argument, the second parameter of this method requires a df argument that must be a Pandas Dataframe instance. ClickHouse Connect automatically processes the Dataframe as a column oriented datasource, so the column_oriented parameter is not required or available.
  • insert_arrow -- Instead of a Python Sequence of Sequences data argument, this method requires an arrow_table. ClickHouse Connect passes the Arrow table unmodified to the ClickHouse server for processing, so only the database and settings arguments are available in addition to table and arrow_table.

Note: A Numpy array is a valid Sequence of Sequences and can be used as the data argument to the main insert method, so a specialized method is not required.

File Inserts

The clickhouse_connect.driver.tools includes the insert_file method that allows inserting data directly from the file system into an existing ClickHouse table. Parsing is delegated to the ClickHouse server. insert_file accepts the following parameters:

ParameterTypeDefaultDescription
clientClientRequiredThe driver.Client used to perform the insert
tablestrRequiredThe ClickHouse table to insert into. The full table name (including database) is permitted.
file_pathstrRequiredThe native file system path to the data file
fmtstrCSV, CSVWithNamesThe ClickHouse Input Format of the file. CSVWithNames is assumed if column_names is not provided
column_namesSequence of strNoneA list of column_names in the data file. Not required for formats that include column names
databasestrNoneDatabase of the table. Ignored if the table is fully qualified. If not specified, the insert will use the client database
settingsdictNoneSee settings description.

For files with inconsistent data or date/time values in an unusual format, settings that apply to data imports (such as input_format_allow_errors_num and input_format_allow_errors_num) are recognized for this method.

import clickhouse_connect
from clickhouse_connect.driver.tools import insert_file

client = clickhouse_connect.get_client()
insert_file(client, 'example_table', 'my_data.csv',
settings={'input_format_allow_errors_ratio': .2,
'input_format_allow_errors_num': 5})

Raw API

For use cases which do not require transformation between ClickHouse data and native or third party data types and structures, the ClickHouse Connect client provides two methods for direct usage of the ClickHouse connection.

Client raw_query Method

The Client.raw_query method allows direct usage of the ClickHouse HTTP query interface using the client connection. The return value is an unprocessed bytes object. It offers a convenient wrapper with parameter binding, error handling, retries, and settings management using a minimal interface:

ParameterTypeDefaultDescription
querystrRequiredAny valid ClickHouse query
parametersdict or iterableNoneSee parameters description.
settingsdictNoneSee settings description.
fmtstrNoneClickHouse Output Format for the resulting bytes. (ClickHouse uses TSV if not specified)
use_databaseboolTrueUse the clickhouse-connect Client assigned database for the query context
external_dataExternalDataNoneAn ExternalData object containing file or binary data to use with the query. See Advanced Queries (External Data)

It is the caller's responsibility to handle the resulting bytes object. Note that the Client.query_arrow is just a thin wrapper around this method using the ClickHouse Arrow output format.

Client raw_insert Method

The Client.raw_insert method allows direct inserts of bytes objects or bytes object generators using the client connection. Because it does no processing of the insert payload, it is highly performant. The method provides options to specify settings and insert format:

ParameterTypeDefaultDescription
tablestrRequiredEither the simple or database qualified table name
column_namesSequence[str]NoneColumn names for the insert block. Required if the fmt parameter does not include names
insert_blockstr, bytes, Generator[bytes], BinaryIORequiredData to insert. Strings will be encoding with the client encoding.
settingsdictNoneSee settings description.
fmtstrNoneClickHouse Input Format of the insert_block bytes. (ClickHouse uses TSV if not specified)

It is the caller's responsibility that the insert_block is in the specified format. ClickHouse Connect uses these raw inserts for file uploads and PyArrow Tables, delegating parsing to the ClickHouse server.

Multithreaded, Multiprocess, and Async/Event Driven Use Cases

ClickHouse Connect works well in multithreaded, multiprocess, and event loop driven/asynchronous applications. All query and insert processing occurs within a single thread, so operations are generally thread safe. (Parallel processing of some operations at a low level is a possible future enhancement to overcome the performance penalty of a single thread, but even in that case thread safety will be maintained).

Because each query or insert executes maintains state in its own QueryContext or InsertContext object, respectively, these helper objects are not thread safe, and they should not be shared between multiple processing streams. See the additional discussion about context objects in following sections.

Additionally, in an application that has two or more queries and/or inserts "in flight" at the same time, there are two further considerations to keep in mind. The first is the ClickHouse "session" associated with the query/insert, and the second is the HTTP connection pool used by ClickHouse Connect Client instances.

Managing ClickHouse Session Ids

Each ClickHouse query occurs within the context of a ClickHouse "session". Sessions are currently used for two purposes:

  • To associate specific ClickHouse settings with multiple queries (see the user settings). The ClickHouse SET command is used to change the settings for the scope of a user session.
  • To track temporary tables.

By default, each query executed with a ClickHouse Connect Client instance uses the same session id to enable this session functionality. That is, SET statements and temporary table work as expected when using a single ClickHouse client. However, by design the ClickHouse server does not allow concurrent queries within the same session. As a result, there are two options for a ClickHouse Connect application that will execute concurrent queries.

  • Create a separate Client instance for each thread of execution (thread, process, or event handler) that will have its own session id. This is generally the best approach, as it preserves the session state for each client.
  • Use a unique session id for each query. This avoids the concurrent session problem in circumstances where temporary tables or shared session settings are not required. (Shared settings can also be provided when creating the client, but these are sent with each request and not associated with a session). The unique session_id can be added to the settings dictionary for each request, or you can disable the autogenerate_session_id common setting:
from clickhouse_connect import common

common.set_setting('autogenerate_session_id', False) # This should always be set before creating a client
client = clickhouse_connect.get_client(host='somehost.com', user='dbuser', password=1234)

In this case ClickHouse Connect will not send any session id, and a random session id will be generated by the ClickHouse server. Again, temporary tables and session level settings will not be available.

Customizing the HTTP Connection Pool

ClickHouse Connect uses urllib3 connection pools to handle the underlying HTTP connection to the server. By default, all client instances share the same connection pool, which is sufficient for the majority of use cases. This default pool maintains up to 8 HTTP Keep Alive connections to each ClickHouse server used by the application.

For large multithreaded applications, separate connection pools may be appropriate. Customized connection pools can be provided as the pool_mgr keyword argument to the main clickhouse_connect.get_client function:

import clickhouse_connect
import clickhouse_connect.httputil

big_pool_mgr = httputil.get_pool_manager(maxsize=16, num_pools=12)

client1 = clickhouse_connect.get_client(pool_mgr=big_pool_mgr)
client2 = clickhouse_connect.get_client(pool_mgr=big_pool_mgr)

As demonstrated by the above example, clients can share a pool manager, or a separate pool manager can be created for each client. For more details on the options available when creating a PoolManager, see the urllib3 documentation.

Querying Data with ClickHouse Connect: Advanced Usage

QueryContexts

ClickHouse Connect executes standard queries within a QueryContext. The QueryContext contains the key structures that are used to build queries against the ClickHouse database, and the configuration used to process the result into a QueryResult or other response data structure. That includes the query itself, parameters, settings, read formats, and other properties.

A QueryContext can be acquired using the client create_query_context method. This method takes the same parameters as the core query method. This query context can then be passed to the query, query_df, or query_np methods as the context keyword argument instead of any or all of the other arguments to those methods. Note that additional arguments specified for the method call will override any properties of QueryContext.

The clearest use case for a QueryContext is to send the same query with different binding parameter values. All parameter values can be updated by calling the QueryContext.set_parameters method with a dictionary, or any single value can be updated by calling QueryContext.set_parameter with the desired key, value pair.

client.create_query_context(query='SELECT value1, value2 FROM data_table WHERE key = {k:Int32}',
parameters={'k': 2},
column_oriented=True)
result = client.query(context=qc)
assert result.result_set[1][0] == 'second_value2'
qc.set_parameter('k', 1)
result = test_client.query(context=qc)
assert result.result_set[1][0] == 'first_value2'

Note that QueryContexts are not thread safe, but a copy can be obtained in a multithreaded environment by calling the QueryContext.updated_copy method.

Streaming Queries

Data Blocks

ClickHouse Connect processes all data from the primary query method as a stream of blocks received from the ClickHouse server. These blocks are transmitted in the custom "Native" format to and from ClickHouse. A "block" is simply a sequence of columns of binary data, where each column contains an equal number of data values of the specified data type. (As a columnar database, ClickHouse stores this data in a similar form.) The size of a block returned from a query is governed by two user settings that can be set at several levels (user profile, user, session, or query). They are:

Regardless of the preferred_block_size_setting, each block will never be more than max_block_size rows. Depending on the type of query, the actual blocks returned can be of any size. For example, queries to a distributed table covering many shards may contain smaller blocks retrieved directly from each shard.

When using one of the Client query_*_stream methods, results are returned on a block by block basis. ClickHouse Connect only loads a single block at a time. This allows processing large amounts of data without the need to load all of a large result set into memory. Note the application should be prepared to process any number of blocks and the exact size of each block cannot be controlled.

StreamContexts

Each of the query_*_stream methods (like query_row_block_stream) returns a ClickHouse StreamContext object, which is a combined Python context/generator. This is the basic usage:

with client.query_row_block_stream('SELECT pickup, dropoff, pickup_longitude, pickup_latitude FROM taxi_trips') as stream:
for block in stream:
for row in block:
<do something with each row of Python trip data>

Note that trying to use a StreamContext without a with statement will raise an error. The use of a Python context ensures that the stream (in this case, a streaming HTTP response) will be properly closed even if not all the data is consumed and/or an exception is raised during processing. Also, StreamContexts can only be used once to consume the stream. Trying to use a StreamContext after it has exited will produce a StreamClosedError.

You can use the source property of the StreamContext to access the parent QueryResult object, which includes column names and types.

Stream Types

The query_column_block_stream method returns the block as a sequence of column data stored as native Python data types. Using the above taxi_trips queries, the data returned will be a list where each element of the list is another list (or tuple) containing all the data for the associated column. So block[0] would be a tuple containing nothing but strings. Column oriented formats are most used for doing aggregate operations for all the values in a column, like adding up total fairs.

The query_row_block_stream method returns the block as a sequence of rows like a traditional relational database. For taxi trips, the data returned will be a list where each element of the list is another list representing a row of data. So block[0] would contain all the fields (in order) for the first taxi trip , block[1] would contain a row for all the fields in the second taxi trip, and so on. Row oriented results are normally used for display or transformation processes.

The query_row_stream is a convenience method that automatically moves to the next block when iterating through the stream. Otherwise, it is identical to query_row_block_stream.

The query_np_stream method return each block as a two-dimensional Numpy Array. Internally Numpy arrays are (usually) stored as columns, so no distinct row or column methods are needed. The "shape" of the numpy array will be expressed as (columns, rows). The Numpy library provides many methods of manipulating numpy arrays. Note that if all columns in the query share the same Numpy dtype, the returned numpy array will only have one dtype as well, and can be reshaped/rotated without actually changing its internal structure.

Finally, the query_df_stream method returns each ClickHouse Block as a two-dimensional Pandas Dataframe. Here's an example which shows that the StreamContext object can be used as a context in a deferred fashion (but only once).

df_stream = client.query_df_stream('SELECT * FROM hits')
column_names = df_stream.source.column_names
with df_stream:
for df in df_stream:
<do something with the pandas DataFrame>

Read Formats

Read formats control the data types of values returned from the client query, query_np, and query_df methods. (The raw_query and query_arrow do not modify incoming data from ClickHouse, so format control does not apply.) For example, if the read format for a UUID is changed from the default native format to the alternative string format, a ClickHouse query of UUID column will be returned as string values (using the standard 8-4-4-4-12 RFC 1422 format) instead of Python UUID objects.

The "data type" argument for any formatting function can include wildcards. The format is a single lower case string.

Read formats can be set at several levels:

  • Globally, using the methods defined in the clickhouse_connect.datatypes.format package. This will control the format of the configured datatype for all queries.
from clickhouse_connect.datatypes.format import set_read_format

# Return both IPv6 and IPv4 values as strings
set_read_format('IPv*', 'string')

# Return all Date types as the underlying epoch second or epoch day
set_read_format('Date*', 'int')
  • For an entire query, using the optional query_formats dictionary argument. In that case any column (or subcolumn) of the specified data types(s) will use the configured format.
# Return any UUID column as a string
client.query('SELECT user_id, user_uuid, device_uuid from users', query_formats={'UUID': 'string'})
  • For the values in a specific column, using the optional column_formats dictionary argument. The key is the column named as return by ClickHouse, and format for the data column or a second level "format" dictionary of a ClickHouse type name and a value of query formats. This secondary dictionary can be used for nested column types such as Tuples or Maps.
# Return IPv6 values in the `dev_address` column as strings
client.query('SELECT device_id, dev_address, gw_address from devices', column_formats={'dev_address', 'string'})

Read Format Options (Python Types)

ClickHouse TypeNative Python TypeRead FormatsComments
Int[8-64], UInt[8-32]int-
UInt64intsignedSuperset does not currently handle large unsigned UInt64 values
[U]Int[128,256]intstringPandas and Numpy int values are 64 bits maximum, so these can be returned as strings
Float32float-All Python floats are 64 bits internally
Float64float-
Decimaldecimal.Decimal-
StringstringbytesClickHouse String columns have no inherent encoding, so they are also used for variable length binary data
FixedStringbytesstringFixedStrings are fixed size byte arrays, but sometimes are treated as Python strings
Enum[8,16]stringstring, intPython enums don't accept empty strings, so all enums are rendered as either strings or the underlying int value.
Datedatetime.dateintClickHouse stores Dates as days since 01/01/1970. This value is available as an int
Date32datetime.dateintSame as Date, but for a wider range of dates
DateTimedatetime.datetimeintClickHouse stores DateTime in epoch seconds. This value is available as an int
DateTime64datetime.datetimeintPython datetime.datetime is limited to microsecond precision. The raw 64 bit int value is available
IPv4ipaddress.IPv4AddressstringIP addresses can be read as strings and properly formatted strings can be inserted as IP addresses
IPv6ipaddress.IPv6AddressstringIP addresses can be read as strings and properly formatted can be inserted as IP addresses
Tupledict or tupletuple, jsonNamed tuples returned as dictionaries by default. Named tuples can also be returned as JSON strings
Mapdict-
NestedSequence[dict]-
UUIDuuid.UUIDstringUUIDs can be read as strings formatted as per RFC 4122

External Data

ClickHouse queries can accept external data in any ClickHouse format. This binary data is sent along with the query string to be used to process the data. Details of the External Data feature are here. The client query* methods accept an optional external_data parameter to take advantage of this feature. The value for the external_data parameter should be a clickhouse_connect.driver.external.ExternalData object. The constructor for that object accepts the follow arguments:

NameTypeDescription
file_pathstrPath to a file on the local system path to read the external data from. Either file_path or data is required
file_namestrThe name of the external data "file". If not provided, will be determined from the file_path (without extensions)
databytesThe external data in binary form (instead of being read from a file). Either data or file_path is required
fmtstrThe ClickHouse Input Format of the data. Defaults to TSV
typesstr or seq of strA list of column data types in the external data. If a string, types should be separated by commas. Either types or structure is required
structurestr or seq of strA list of column name + data type in the data (see examples). Either structure or types is required
mime_typestrOptional MIME type of the file data. Currently ClickHouse ignores this HTTP subheader

To send a query with an external CSV file containing "movie" data, and combine that data with an directors table already present on the ClickHouse server:

import clickhouse_connect
from clickhouse_connect.driver.external import ExternalData

client = clickhouse_connect.get_client()
ext_data = ExternalData(file_path='/data/movies.csv',
fmt='CSV',
structure=['movie String', 'year UInt16', 'rating Decimal32(3)', 'director String'])
result = client.query('SELECT name, avg(rating) FROM directors INNER JOIN movies ON directors.name = movies.director GROUP BY directors.name',
external_data=ext_data).result_rows

Additional external data files can be added to the initial ExternalData object using the add_file method, which takes the same parameters as the constructor. For HTTP, all external data is transmitted as part of a multi-part/form-data file upload.

Time Zones

There are multiple mechanisms for applying a time zone to ClickHouse DateTime and DateTime64 values. Internally, the ClickHouse server always stores any DateTime or DateTime64 object as a time zone naive number representing seconds since the epoch, 1970-01-01 00:00:00 UTC time. For DateTime64 values, the representation can be milliseconds, microseconds, or nanoseconds since the epoch, depending on precision. As a result, the application of any time zone information always occurs on the client side. Note that this involves meaningful extra calculation, so in performance critical applications it is recommended to treat DateTime types as epoch timestamps except for user display and conversion (Pandas Timestamps, for example, are always a 64-bit integer representing epoch nanoseconds to improve performance).

When using time zone aware data types in queries - in particular the Python datetime.datetime object -- clickhouse-connect applies a client side time zone using the following precedence rules:

  1. If the query method parameter client_tzs is specified for the query, the specific column time zone is applied
  2. If the ClickHouse column has timezone metadata (i.e., it is a type like DateTime64(3, 'America/Denver')), the ClickHouse column timezone is applied. (Note this timezone metadata is not available to clickhouse-connect for DateTime columns previous to ClickHouse version 23.2)
  3. If the query method parameter query_tz is specified for the query, the "query timezone" is applied.
  4. If a timezone setting is applied to the query or session, that timezone is applied. (This functionality is not yet released in the ClickHouse Server)
  5. Finally, if the client apply_server_timezone parameter has been set to True (the default), the ClickHouse server timezone is applied.

Note that if the applied timezone based on these rules is UTC, clickhouse-connect will always return a time zone naive Python datetime.datetime object. Additional timezone information can then be added to this timezone naive object by the application code if desired.

Inserting Data with ClickHouse Connect: Advanced Usage

InsertContexts

ClickHouse Connect executes all inserts within an InsertContext. The InsertContext includes all the values sent as arguments to the client insert method. In addition, when an InsertContext is originally constructed, ClickHouse Connect retrieves the data types for the insert columns required for efficient Native format inserts. By reusing the InsertContext for multiple inserts, this "pre-query" is avoided and inserts are executed more quickly and efficiently.

An InsertContext can be acquired using the client create_insert_context method. The method takes the same arguments as the insert function. Note that only the data property of InsertContexts should be modified for reuse. This is consistent with its intended purpose of providing a reusable object for repeated inserts of new data to the same table.

test_data = [[1, 'v1', 'v2'], [2, 'v3', 'v4']]
ic = test_client.create_insert_context(table='test_table', data='test_data')
client.insert(context=ic)
assert client.command('SELECT count() FROM test_table') == 2
new_data = [[3, 'v5', 'v6'], [4, 'v7', 'v8']]
ic.data = new_data
client.insert(context=ic)
qr = test_client.query('SELECT * FROM test_table ORDER BY key DESC')
assert qr.row_count == 4
assert qr[0][0] == 4

InsertContexts include mutable state that is updated during the insert process, so they are not thread safe.

Write Formats

Write formats are currently implemented for limited number of types. In most cases ClickHouse Connect will attempt to automatically determine the correct write format for a column by checking the type of the first (non-null) data value. For example, if inserting into a DateTime column, and the first insert value of the column is a Python integer, ClickHouse Connect will directly insert the integer value under the assumption that it's actually an epoch second.

In most cases, it is unnecessary to override the write format for a data type, but the associated methods in the clickhouse_connect.datatypes.format package can be used to do so at a global level.

Write Format Options

ClickHouse TypeNative Python TypeWrite FormatsComments
Int[8-64], UInt[8-32]int-
UInt64int
[U]Int[128,256]int
Float32float
Float64float
Decimaldecimal.Decimal
Stringstring
FixedStringbytesstringIf inserted as a string, additional bytes will be set to zeros
Enum[8,16]string
Datedatetime.dateintClickHouse stores Dates as days since 01/01/1970. int types will be assumed to be this "epoch date" value
Date32datetime.dateintSame as Date, but for a wider range of dates
DateTimedatetime.datetimeintClickHouse stores DateTime in epoch seconds. int types will be assumed to be this "epoch second" value
DateTime64datetime.datetimeintPython datetime.datetime is limited to microsecond precision. The raw 64 bit int value is available
IPv4ipaddress.IPv4AddressstringProperly formatted strings can be inserted as IPv4 addresses
IPv6ipaddress.IPv6AddressstringProperly formatted strings can be inserted as IPv6 addresses
Tupledict or tuple
Mapdict
NestedSequence[dict]
JSON/Object('json')dictstringEither dictionaries or JSON strings can be inserted into JSON Columns.
UUIDuuid.UUIDstringProperly formatted strings can be inserted as ClickHouse UUIDs

Additional Options

ClickHouse Connect provides a number of additional options for advanced use cases

Global Settings

There are a small number of settings that control ClickHouse Connect behavior globally. They are accessed from the top level common package:

from clickhouse_connect import common

common.set_setting('autogenerate_session_id', False)
common.get_setting('invalid_setting_action')
'drop'
note

These common settings autogenerate_session_id, product_name, and readonly should always be modified before creating a client with the clickhouse_connect.get_client method. Changing these settings after client creation does not affect the behavior of existing clients.

Eight global settings are currently defined:

Setting NameDefaultOptionsDescription
autogenerate_session_idTrueTrue, FalseAutogenerate a new UUID(1) session id (if not provided) for each client session. If no session id is provided (either at the client or query level, ClickHouse will generate random internal id for each query
invalid_setting_action'error''drop', 'send', 'error'Action to take when an invalid or readonly setting is provided (either for the client session or query). If drop, the setting will be ignored, if send, the setting will be sent to ClickHouse, if error a client side ProgrammingError will be raised
dict_parameter_format'json''json', 'map'This controls whether parameterized queries convert a Python dictionary to JSON or ClickHouse Map syntax. json should be used for inserts into JSON columns, map for ClickHouse Map columns
product_nameA string that is passed with the query to clickhouse for tracking the app using ClickHouse Connect. Should be in the form <product name;&gl/<product version>
max_connection_age600Maximum seconds that an HTTP Keep Alive connection will be kept open/reused. This prevents bunching of connections against a single ClickHouse node behind a load balancer/proxy. Defaults to 10 minutes.
readonly00, 1Implied "read_only" ClickHouse settings for versions prior to 19.17. Can be set to match the ClickHouse "read_only" value for settings to allow operation with very old ClickHouse versions
use_protocol_versionTrueTrue, FalseUse the client protocol version. This is needed for DateTime timezone columns but breaks with the current version of chproxy
max_error_size1024Maximum number of characters that will be returned in a client error messages. Use 0 for this setting to get the full ClickHouse error message. Defaults to 1024 characters.

Compression

ClickHouse Connect supports lz4, zstd, brotli, and gzip compression for both query results and inserts. Always keep in mind that using compression usually involves a tradeoff between network bandwidth/transfer speed against CPU usage (both on the client and the server.)

To receive compressed data, the ClickHouse server enable_http_compression must be set to 1, or the user must have permission to change the setting on a "per query" basis.

Compression is controlled by the compress parameter when calling the clickhouse_connect.get_client factory method. By default, compress is set to True, which will trigger the default compression settings. For queries executed with the query, query_np, and query_df client methods, ClickHouse Connect will add the Accept-Encoding header with the lz4, zstd, br (brotli, if the brotli library is installed), gzip, and deflate encodings to queries executed with the query client method (and indirectly, query_np and query_df. (For the majority of requests the ClickHouse server will return with a zstd compressed payload.) For inserts, by default ClickHouse Connect will compress insert blocks with lz4 compression, and send the Content-Encoding: lz4 HTTP header.

The get_client compress parameter can also be set to a specific compression method, one of lz4, zstd, br, or gzip. That method will then be used for both inserts and query results (if supported by the ClickHouse server.) The required zstd and lz4 compression libraries are now installed by default with ClickHouse Connect. If br/brotli is specified, the brotli library must be installed separately.

Note that the raw* client methods don't use the compression specified by the client configuration.

We also recommend against using gzip compression, as it is significantly slower than the alternatives for both compressing and decompressing data.

HTTP Proxy Support

ClickHouse Connect adds basic HTTP proxy support using the urllib3 library. It recognizes the standard HTTP_PROXY and HTTPS_PROXY environment variables. Note that using these environment variables will apply to any client created with the clickhouse_connect.get_client method. Alternatively, to configure per client, you can use the http_proxy or https_proxy arguments to the get_client method. For details on the implementation of HTTP Proxy support, see the urllib3 documentation.

To use a Socks proxy, you can send a urllib3 SOCKSProxyManager as the pool_mgr argument to get_client. Note that this will require installing the PySocks library either directly or using the [socks] option for the urllib3 dependency.