OLAP workflows
DuckDB exposes analytics-friendly table functions like read_parquet and read_csv_auto. The helpers in duckdb_sqlalchemy.olap make these easy to use with SQLAlchemy.
from sqlalchemy import select
from duckdb_sqlalchemy import read_parquet, read_csv_auto

# Parquet: the columns listed become accessible on .c
parquet = read_parquet("data/events.parquet", columns=["event_id", "ts"])
stmt = select(parquet.c.event_id, parquet.c.ts)

# CSV with auto-detected parsing options
csv = read_csv_auto("data/events.csv", columns=["event_id", "ts"])
stmt = select(csv.c.event_id, csv.c.ts)
Explicit CSV settings
Use read_csv when you need to control parsing options:
from duckdb_sqlalchemy import read_csv
csv = read_csv(
    "data/events.csv",
    columns=["event_id", "ts"],
    header=True,
    delim="|",
)
stmt = select(csv.c.event_id, csv.c.ts)
Other table functions
Use table_function for any DuckDB table function that does not have a helper:
from duckdb_sqlalchemy import table_function
parquet = table_function(
    "read_parquet",
    "data/partitioned/events/*.parquet",
    columns=["event_id", "ts"],
    hive_partitioning=True,
)
stmt = select(parquet.c.event_id, parquet.c.ts)
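The same pattern covers any other reader. A minimal sketch using DuckDB's built-in read_json_auto table function (the path and column names here are illustrative):

from sqlalchemy import select
from duckdb_sqlalchemy import table_function

# read_json_auto has no dedicated helper, so go through table_function.
events = table_function(
    "read_json_auto",
    "data/events.json",  # illustrative path
    columns=["event_id", "ts"],
)
stmt = select(events.c.event_id, events.c.ts)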
Arrow results
For large reads, you can request Arrow tables directly:
from pyarrow import Table as ArrowTable
from sqlalchemy import select

with engine.connect().execution_options(duckdb_arrow=True) as conn:
    result = conn.execute(select(parquet.c.event_id, parquet.c.ts))
    table = result.arrow  # or result.all() for plain rows
    assert isinstance(table, ArrowTable)
Notes:
- Arrow results consume the cursor; fetch rows or Arrow, not both.
- Requires pyarrow in your environment.
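The result is an ordinary pyarrow.Table, so the usual Arrow tooling applies downstream; for example, handing it off to pandas:

# Convert the Arrow table to a pandas DataFrame for further analysis.
df = table.to_pandas()
print(df.head())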
Streaming reads
For large result sets, combine stream_results with a larger arraysize:
with engine.connect().execution_options(stream_results=True, duckdb_arraysize=10_000) as conn:
    result = conn.execute(select(parquet.c.event_id, parquet.c.ts))
    for row in result:
        ...
duckdb_arraysize maps to the DBAPI cursor's arraysize, which fetchmany() uses as its batch size.
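If you prefer batch-at-a-time processing, SQLAlchemy's Result.partitions() pairs naturally with streaming; a minimal sketch (the batch size and handler are illustrative):

with engine.connect().execution_options(stream_results=True, duckdb_arraysize=10_000) as conn:
    result = conn.execute(select(parquet.c.event_id, parquet.c.ts))
    # partitions() yields lists of rows rather than one row at a time.
    for batch in result.partitions(10_000):
        handle_batch(batch)  # hypothetical per-batch handler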
Bulk writes
For large INSERT executemany workloads, the dialect can register a pandas/Arrow
object and run INSERT INTO ... SELECT ... internally. Control the threshold
with duckdb_copy_threshold:
rows = [{"event_id": 1, "ts": "2024-01-01"}, {"event_id": 2, "ts": "2024-01-02"}]
with engine.connect().execution_options(duckdb_copy_threshold=10000) as conn:
    conn.execute(events.insert(), rows)
    conn.commit()  # required on SQLAlchemy 2.x "commit as you go" connections
If pyarrow/pandas are unavailable, the dialect falls back to regular
executemany. The bulk-register path is skipped when RETURNING or
ON CONFLICT is in use.
On SQLAlchemy 2.x you can also tune multi-row INSERT batching with
duckdb_insertmanyvalues_page_size (defaults to 1000).
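For example, a sketch that raises the page size for a large batch (assuming the option is accepted per connection via execution_options, like the other duckdb_ options above):

with engine.connect().execution_options(duckdb_insertmanyvalues_page_size=5_000) as conn:
    conn.execute(events.insert(), rows)
    conn.commit()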
COPY helpers
Use COPY to load files directly into DuckDB without row-wise inserts:
from duckdb_sqlalchemy import copy_from_parquet, copy_from_csv
with engine.begin() as conn:
    copy_from_parquet(conn, "events", "data/events.parquet")
    copy_from_csv(conn, "events", "data/events.csv", header=True)
For safety, string table names, column names, and COPY option keys must be
identifiers. Dotted paths like schema.events are supported, but SQL
fragments are rejected.
If you need quoted or mixed-case identifiers, pass a SQLAlchemy Table object
instead of a plain string so SQLAlchemy handles quoting.
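A minimal sketch of the Table-object form (the table definition here is illustrative):

from sqlalchemy import Column, Integer, MetaData, Table, Text
from duckdb_sqlalchemy import copy_from_parquet

metadata = MetaData()
# Mixed-case name: SQLAlchemy quotes it as "Events" when the COPY target is rendered.
events_table = Table("Events", metadata, Column("event_id", Integer), Column("ts", Text))

with engine.begin() as conn:
    copy_from_parquet(conn, events_table, "data/events.parquet")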
For row iterables, you can stream to a temporary CSV in chunks:
from duckdb_sqlalchemy import copy_from_rows
rows = ({"id": i, "name": f"user-{i}"} for i in range(1_000_000))
with engine.begin() as conn:
    copy_from_rows(conn, "users", rows, columns=["id", "name"], chunk_size=100_000)
ATTACH for multi-database analytics
DuckDB can query across multiple databases in a single session:
from sqlalchemy import create_engine, text

engine = create_engine("duckdb:///local.duckdb")
with engine.connect() as conn:
    conn.execute(text("ATTACH 'analytics.duckdb' AS analytics"))
    rows = conn.execute(text("SELECT * FROM analytics.events LIMIT 10")).fetchall()
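Once attached, a single statement can join across both databases. A sketch with hypothetical tables (users in local.duckdb, events in the attached database):

with engine.connect() as conn:
    conn.execute(text("ATTACH 'analytics.duckdb' AS analytics"))
    result = conn.execute(text(
        """
        SELECT u.name, count(*) AS n_events
        FROM users AS u                              -- hypothetical local table
        JOIN analytics.events AS e ON e.user_id = u.id
        GROUP BY u.name
        """
    ))
    for name, n_events in result:
        print(name, n_events)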
Notes
- Column naming for table functions requires SQLAlchemy >= 1.4 (uses
table_valued).