This library allows you to perform fast queries against QuestDB, a high-performance time-series database, over HTTP(S) using the CSV format. Query results are obtained as either Pandas DataFrames or dicts of NumPy arrays.
The library can be installed using the following command:
```bash
python3 -m pip install -U git+https://github.com/questdb/py-questdb-query.git#questdb_query
```
To uninstall the library, you can use the command:
```bash
python3 -m pip uninstall questdb_query
```
Once installed, you can use the library to query a QuestDB database. Here's an example that queries CPU utilization data against a database running on localhost on the default HTTP port (9000):
```python
from questdb_query import pandas_query

df = pandas_query('select * from cpu limit 1000')
```
The result is a regular Pandas DataFrame, so you can, for example, aggregate the results further:
```python
>>> df = df[['region', 'usage_user', 'usage_nice']].groupby('region').mean()
>>> df
                usage_user  usage_nice
region
ap-northeast-1    8.163766    6.492334
ap-southeast-1    6.511215    7.341863
ap-southeast-2    6.788770    6.257839
eu-central-1      7.392642    6.416479
eu-west-1         7.213417    7.185956
sa-east-1         7.143568    5.925026
us-east-1         7.620643    7.243553
us-west-1         6.286770    6.531977
us-west-2         6.228692    6.439672
```
You can then switch over to NumPy with a simple and fast conversion:
```python
>>> from questdb_query import pandas_to_numpy
>>> np_arrs = pandas_to_numpy(df)
>>> np_arrs
{'usage_user': array([8.16376556, 6.51121543, 6.78876964, 7.3926419 , 7.21341716,
        7.14356839, 7.62064304, 6.28677006, 6.22869169]), 'usage_nice': array([6.49233392, 7.34186348, 6.25783903, 6.41647863, 7.18595643,
        5.92502642, 7.24355328, 6.53197733, 6.43967247]), 'region': array(['ap-northeast-1', 'ap-southeast-1', 'ap-southeast-2',
        'eu-central-1', 'eu-west-1', 'sa-east-1', 'us-east-1', 'us-west-1',
        'us-west-2'], dtype=object)}
```
If your database is running on a remote host, specify an endpoint:
```python
from questdb_query import numpy_query, Endpoint

endpoint = Endpoint(host='your.hostname.com', port=22453, https=True, username='user', password='pass')
np_arrs = numpy_query('select * from cpu limit 10', endpoint)
```
Note how the example above enables HTTPS and specifies a username and password for authentication.
The port is optional and defaults to 9000 for HTTP and 443 for HTTPS.
Alternatively, if the server is set up with token-based authentication, you can use the `token` parameter:

```python
endpoint = Endpoint(host='your.hostname.com', https=True, token='your_token')
```
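As with username/password authentication, the endpoint is then passed as the second argument to the query functions. A minimal sketch (`your_token` is a placeholder, not a real credential):

```python
from questdb_query import numpy_query, Endpoint

# Token-based auth sketch: 'your_token' stands in for a real REST API token.
endpoint = Endpoint(host='your.hostname.com', https=True, token='your_token')
np_arrs = numpy_query('select * from cpu limit 10', endpoint)
```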
You can sometimes improve performance by splitting up a large query into smaller ones, running them in parallel, and joining the results together. This is especially useful if you have multiple CPUs available.
The `numpy_query` function can do this automatically for you via the `chunks` parameter. The example below splits the query into 6 parallel chunks:
```python
from questdb_query import numpy_query

np_arrs = numpy_query('select * from cpu', chunks=6)
```
The speed-up from splitting a query into smaller ones is highly query-dependent, so we recommend you experiment and benchmark. Mostly due to Python library limitations, not all parts of the query can be parallelized: whilst you may see great benefits in going from 1 chunk (the default) to 8, the improvement going from 8 to 16 might be marginal.
Benchmarking is covered later in this README.
⚠️ Setting `chunks > 1` parallelizes the query. If the queried table(s) contain fast-moving data, the results may be inconsistent, as each chunk's query starts at a slightly different time. To avoid consistency issues, formulate the query so that it only touches data that is not changing. You can do this, for example, by specifying a `timestamp` range in the `WHERE` clause.
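For instance, a bounded query such as the following sketch (the timestamp range is made up for illustration) only touches immutable data and is therefore safe to split into chunks:

```python
from questdb_query import numpy_query

# The fixed (hypothetical) timestamp range means every chunk sees the same
# immutable slice of the table, keeping the combined result consistent.
np_arrs = numpy_query('''
    select * from cpu
    where timestamp >= '2016-01-01T00:00:00.000000Z'
      and timestamp <  '2016-02-01T00:00:00.000000Z'
    ''', chunks=6)
```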
You can also query directly into a dictionary of NumPy arrays. Note that NumPy's datatypes are more limited than Pandas's, specifically in the handling of null values. This is a simple shorthand for querying into Pandas and then converting to NumPy:
```python
def numpy_query(query: str, endpoint: Endpoint = None,
                chunks: int = 1, timeout: int = None) -> dict[str, np.array]:
    df = pandas_query(query, endpoint, chunks, timeout)
    return pandas_to_numpy(df)
```
To use it, pass the query string to the `numpy_query` function, along with the same optional parameters as the `pandas_query` function:
```python
from questdb_query import numpy_query

np_arrs = numpy_query('''
    select
        timestamp, hostname, datacenter, usage_user, usage_nice
    from
        cpu
    limit 10''')
```
The `np_arrs` object is a Python `dict` which holds one NumPy array per column, keyed by column name:
```python
>>> np_arrs
{'timestamp': array(['2016-01-01T00:00:00.000000000', '2016-01-01T00:00:10.000000000',
        '2016-01-01T00:00:20.000000000', '2016-01-01T00:00:30.000000000',
        '2016-01-01T00:00:40.000000000', '2016-01-01T00:00:50.000000000',
        '2016-01-01T00:01:00.000000000', '2016-01-01T00:01:10.000000000',
        '2016-01-01T00:01:20.000000000', '2016-01-01T00:01:30.000000000'],
       dtype='datetime64[ns]'), 'hostname': array(['host_0', 'host_1', 'host_2', 'host_3', 'host_4', 'host_5',
        'host_6', 'host_7', 'host_8', 'host_9'], dtype=object), 'datacenter': array(['ap-southeast-2b', 'eu-west-1b', 'us-west-1b', 'us-west-2c',
        'us-west-2b', 'eu-west-1b', 'eu-west-1b', 'us-west-1a',
        'ap-southeast-2a', 'us-east-1a'], dtype=object), 'usage_user': array([1.39169048, 0.33846369, 0.        , 1.81511203, 0.84273104,
        0.        , 0.        , 0.28085548, 0.        , 1.37192634]), 'usage_nice': array([0.30603088, 1.21496673, 0.        , 0.16688796, 0.        ,
        2.77319521, 0.40332488, 1.81585253, 1.92844804, 2.12841919])}
```
If we wanted to calculate a (rather non-sensical) dot product of `usage_user` and `usage_nice`, we can do this by accessing the NumPy columns:
```python
>>> np_arrs['usage_user'].dot(np_arrs['usage_nice'].T)
4.5700692045031985
```
Each query result also contains a `Stats` object with a performance summary, which you can print:
```python
>>> from questdb_query import pandas_query
>>> df = pandas_query('select * from cpu', chunks=8)
>>> print(df.query_stats)
Duration: 2.631s
Millions of lines: 5.000
Millions of lines/s: 1.901
MiB: 1332.144
MiB/s: 506.381
```
You can also extract individual fields:
```python
>>> df.query_stats
Stats(duration_s=2.630711865, line_count=5000000, byte_count=1396853875, throughput_mbs=506.3814407360216, throughput_mlps=1.900626239810569)
>>> df.query_stats.throughput_mlps
1.900626239810569
```
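These fields also make it easy to compare configurations programmatically. A minimal sketch that benchmarks a few chunk counts by re-running the same query:

```python
from questdb_query import pandas_query

# Sketch: re-run the same query with different chunk counts and compare
# throughput via the query_stats object attached to each result.
for chunks in (1, 2, 4, 8):
    df = pandas_query('select * from cpu', chunks=chunks)
    stats = df.query_stats
    print(f'chunks={chunks}: {stats.throughput_mlps:.3f} M lines/s, '
          f'{stats.throughput_mbs:.1f} MiB/s')
```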
To get the best performance it may be useful to try queries with different hardware setups, chunk counts etc.
You can run the benchmarking tool from the command line:
```bash
$ python3 -m questdb_query.tool --chunks 8 "select * from cpu"
hostname region datacenter rack os arch team service service_version service_environment usage_user usage_system usage_idle usage_nice usage_iowait usage_irq usage_softirq usage_steal usage_guest usage_guest_nice timestamp
0 host_0 ap-southeast-2 ap-southeast-2b 96 Ubuntu16.10 x86 CHI 11 0 test 1.391690 0.000000 2.644812 0.306031 1.194629 0.000000 0.000000 0.726996 0.000000 0.000000 2016-01-01 00:00:00
1 host_1 eu-west-1 eu-west-1b 52 Ubuntu16.04LTS x64 NYC 7 0 production 0.338464 1.951409 2.455378 1.214967 2.037935 0.000000 1.136997 1.022753 1.711183 0.000000 2016-01-01 00:00:10
2 host_2 us-west-1 us-west-1b 69 Ubuntu16.04LTS x64 LON 8 1 production 0.000000 2.800873 2.296324 0.000000 1.754139 1.531160 0.662572 0.000000 0.472402 0.312164 2016-01-01 00:00:20
3 host_3 us-west-2 us-west-2c 8 Ubuntu16.04LTS x86 LON 11 0 test 1.815112 4.412385 2.056344 0.166888 3.507148 3.276577 0.000000 0.000000 0.000000 1.496152 2016-01-01 00:00:30
4 host_4 us-west-2 us-west-2b 83 Ubuntu16.04LTS x64 NYC 6 0 test 0.842731 3.141248 2.199520 0.000000 2.943054 5.032342 0.391105 1.375450 0.000000 1.236811 2016-01-01 00:00:40
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
624995 host_3995 ap-southeast-2 ap-southeast-2a 30 Ubuntu16.04LTS x86 CHI 19 1 staging 33.238309 82.647341 17.272531 52.707720 71.718564 45.605728 100.000000 22.907723 78.130846 15.652954 2017-08-01 16:52:30
624996 host_3996 us-west-2 us-west-2a 67 Ubuntu15.10 x64 CHI 9 0 production 33.344070 81.922739 16.653731 52.107537 71.844945 45.880606 99.835977 23.045458 76.468930 17.091646 2017-08-01 16:52:40
624997 host_3997 us-west-2 us-west-2b 63 Ubuntu15.10 x86 SF 8 0 production 32.932095 80.662915 14.708377 53.354277 72.265215 44.803275 99.013038 20.375169 78.043473 17.870002 2017-08-01 16:52:50
624998 host_3998 eu-west-1 eu-west-1b 53 Ubuntu16.04LTS x86 CHI 11 1 staging 31.199818 80.994859 15.051577 51.923123 74.169828 46.453950 99.107213 21.004499 78.341154 18.880808 2017-08-01 16:53:00
624999 host_3999 us-east-1 us-east-1c 87 Ubuntu16.10 x64 SF 8 1 production 30.310735 81.727637 15.413537 51.417897 74.973555 44.882255 98.821672 19.055040 78.094993 19.263652 2017-08-01 16:53:10

[5000000 rows x 21 columns]
Duration: 2.547s
Millions of lines: 5.000
Millions of lines/s: 1.963
MiB: 1332.144
MiB/s: 522.962
```
These are the complete command line arguments:
```bash
$ python3 -m questdb_query.tool --help
usage: tool.py [-h] [--host HOST] [--port PORT] [--https] [--username USERNAME] [--password PASSWORD] [--chunks CHUNKS] query

positional arguments:
  query

optional arguments:
  -h, --help           show this help message and exit
  --host HOST
  --port PORT
  --https
  --username USERNAME
  --password PASSWORD
  --chunks CHUNKS
```
The `numpy_query` and `pandas_query` functions are actually wrappers around `async` variants. If your application is already using `async`, call those variants directly: this allows other parts of your application to perform work in parallel during the data download. The functions take the same arguments as their synchronous counterparts.
```python
import asyncio

from questdb_query import Endpoint
from questdb_query.asynchronous import numpy_query


async def main():
    endpoint = Endpoint(host='your.hostname.com', https=True, username='user', password='pass')
    np_arrs = await numpy_query('select * from cpu limit 10', endpoint)
    print(np_arrs)


if __name__ == '__main__':
    asyncio.run(main())
```
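Since the downloads share the event loop, several queries can also run concurrently. A sketch using `asyncio.gather` (the second table name, `mem`, is hypothetical):

```python
import asyncio

from questdb_query import Endpoint
from questdb_query.asynchronous import numpy_query


async def main():
    endpoint = Endpoint(host='your.hostname.com', https=True, username='user', password='pass')
    # Start both downloads concurrently; 'mem' is a hypothetical second table.
    cpu, mem = await asyncio.gather(
        numpy_query('select * from cpu limit 1000', endpoint),
        numpy_query('select * from mem limit 1000', endpoint))
    print(cpu['usage_user'].mean())
    print(sorted(mem))  # column names of the second result


if __name__ == '__main__':
    asyncio.run(main())
```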