Dask
What is Dask? A Bried Introduction | Dask YouTube What is Dask? And Who Uses It? Use Case Examples
Introduction
- a dynamic task scheduler
- it decides how to allocate resources to tasks
- scales PyData libraries with parallel computing
- pandas, numpy, scikit-learn, xgboost, etc.
- allows you to work with more data in less time
- Analogy: Pandas is a race car, Dask is a big engine
- same, familiar interface, but Dask is faster

Dask Tutorial | Intro to Dask | The Power of Parallel Computing | Module One
Why/When we need parallel computing
- high data size
- high processing time

# Create a 665 million row dataset
ddf = dask.datasets.timeseries(
start="2000-01-01",
end="2020-12-31",
freq="1s",
partition_freq="7d",
seed=42,
)
# Write it to a parquet file
ddf.to_parquet("big_file.parquet")
# 💀 Pandas can't handle this big dataset (on a machine with 16GB RAM)
# df = pd.read_parquet("big_file.parquet") # This will give us a memory error if uncommented
# 🎉 Dask can handle it
ddf = dd.read_parquet("big_file.parquet")
%%time # This is a Jupyter Notebook magic command to measure the execution time of the cell
ddf["id"].nunique().compute()
Why Dask is faster than Pandas
- Pandas is single-threaded
- it can only use one CPU core
- ie. if you have 4 cores, pandas will only use 1/4 of your CPU (the other 3/4 will be idle)
- Dask is multi-threaded
- it can use all CPU cores
- ie. if you have 4 cores, Dask will use all 8 of your CPU

- it can also use multiple machines (distributed processing)
- ie. if you have 8 cores on 2 machines, Dask will use all 16 of your CPU

Data Structures
- Examples
- pandas + dask = Dask DataFrame
- numpy + dask = Dask Array
- scikit-learn + dask = Dask ML
- xgboost + dask = Dask XGBoost
