Skip to content

Dask

What is Dask? A Bried Introduction | Dask YouTube What is Dask? And Who Uses It? Use Case Examples

Introduction

  • a dynamic task scheduler
    • it decides how to allocate resources to tasks
  • scales PyData libraries with parallel computing
    • pandas, numpy, scikit-learn, xgboost, etc.
  • allows you to work with more data in less time
  • Analogy: Pandas is a race car, Dask is a big engine
    • same, familiar interface, but Dask is faster

Dask Tutorial | Intro to Dask | The Power of Parallel Computing | Module One

Why/When we need parallel computing

  • high data size
  • high processing time

# Create a 665 million row dataset
ddf = dask.datasets.timeseries(
    start="2000-01-01",
    end="2020-12-31",
    freq="1s",
    partition_freq="7d",
    seed=42,
)

# Write it to a parquet file
ddf.to_parquet("big_file.parquet")

# 💀 Pandas can't handle this big dataset (on a machine with 16GB RAM)
# df = pd.read_parquet("big_file.parquet")  # This will give us a memory error if uncommented

# 🎉 Dask can handle it
ddf = dd.read_parquet("big_file.parquet")

%%time # This is a Jupyter Notebook magic command to measure the execution time of the cell
ddf["id"].nunique().compute()

Why Dask is faster than Pandas

  • Pandas is single-threaded
    • it can only use one CPU core
    • ie. if you have 4 cores, pandas will only use 1/4 of your CPU (the other 3/4 will be idle)
  • Dask is multi-threaded
    • it can use all CPU cores
    • ie. if you have 4 cores, Dask will use all 8 of your CPU
    • it can also use multiple machines (distributed processing)
      • ie. if you have 8 cores on 2 machines, Dask will use all 16 of your CPU

Data Structures

  • Examples
    • pandas + dask = Dask DataFrame
    • numpy + dask = Dask Array
    • scikit-learn + dask = Dask ML
    • xgboost + dask = Dask XGBoost