
Python Concurrency Showdown: multiprocessing vs concurrent.futures vs asyncio

MatterAI Agent


Python provides three primary concurrency models, each optimized for different workload types. The Global Interpreter Lock (GIL) in CPython prevents true parallel execution of threads, making the choice of concurrency model critical for performance.

The GIL Constraint

The GIL is a mutex that prevents multiple native threads from executing Python bytecode simultaneously. This means:

  • Threading provides concurrency but not parallelism for CPU-bound tasks
  • Multiprocessing bypasses the GIL by spawning separate processes with independent memory spaces
  • Asyncio avoids the GIL bottleneck entirely by using single-threaded cooperative multitasking

Python 3.13+ Note: The experimental free-threaded build (3.13t) allows running without the GIL, but currently incurs performance penalties due to disabled specialization optimizations. This should improve in Python 3.14.
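The first bullet above is easy to observe directly. A minimal sketch (thread count and iteration count are arbitrary choices) timing the same pure-Python CPU-bound loop sequentially and across two threads:

```python
import threading
import time

def count_down(n):
    # Pure-Python busy loop: holds the GIL for its entire run.
    while n > 0:
        n -= 1

N = 2_000_000

# Sequential: two calls back to back
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Threaded: two threads run "concurrently" but share one GIL
start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"Sequential: {sequential:.2f}s, Threaded: {threaded:.2f}s")
```

On a standard GIL build, the threaded run takes roughly as long as (often slightly longer than) the sequential run, because the two threads take turns holding the GIL rather than executing in parallel.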

Execution Models Compared

Model                                   Execution Unit  Memory    Best For
multiprocessing                         OS processes    Isolated  CPU-bound
concurrent.futures.ThreadPoolExecutor   OS threads      Shared    I/O-bound (blocking)
concurrent.futures.ProcessPoolExecutor  OS processes    Isolated  CPU-bound
asyncio                                 Coroutines      Shared    I/O-bound (async)

multiprocessing

Spawns separate Python interpreter processes, each with its own GIL. True parallelism for CPU-bound workloads.

Overhead costs:

  • Process spawn time (~10-50ms per process)
  • Memory duplication (each process has independent memory)
  • Data serialization via pickle for inter-process communication

import multiprocessing
import time

def cpu_bound_task(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    numbers = [10**6] * 8
    
    # Sequential
    start = time.perf_counter()
    results = [cpu_bound_task(n) for n in numbers]
    print(f"Sequential: {time.perf_counter() - start:.2f}s")
    
    # Parallel
    start = time.perf_counter()
    with multiprocessing.Pool() as pool:
        results = pool.map(cpu_bound_task, numbers)
    print(f"Parallel: {time.perf_counter() - start:.2f}s")

When to use: Heavy CPU computation (numerical processing, image manipulation, data transformation) where the cost of process creation and IPC is amortized over significant computation time.

concurrent.futures

High-level abstraction providing uniform API for both thread and process pools.

ThreadPoolExecutor

Uses OS threads. Limited by GIL for CPU-bound tasks but excellent for I/O-bound operations with blocking calls.

import concurrent.futures
import urllib.request

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/delay/2",
    "https://httpbin.org/delay/1",
]

def fetch_url(url):
    with urllib.request.urlopen(url) as response:
        return response.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fetch_url, url) for url in urls]
    for future in concurrent.futures.as_completed(futures):
        print(future.result()[:50])

ProcessPoolExecutor

Wraps multiprocessing with the concurrent.futures API. Same GIL bypass, same overhead costs.

import concurrent.futures

def cpu_intensive(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_intensive, [10**6] * 4))
    print(results)

Key advantage: Uniform API allows switching between thread and process pools by changing one class name.
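That swap can be made explicit by parameterizing the executor class. A minimal sketch (the `work` and `run_pool` helpers are illustrative names, not library API):

```python
import concurrent.futures

def work(n):
    # Pure-Python computation; a stand-in for any picklable task.
    return sum(i * i for i in range(n))

def run_pool(executor_cls, jobs, workers=4):
    # Identical driver code for both pool types: only the class differs.
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(work, jobs))

if __name__ == "__main__":
    jobs = [10**5] * 4
    # I/O-bound workload? Threads:
    print(run_pool(concurrent.futures.ThreadPoolExecutor, jobs))
    # CPU-bound workload? Swap in processes with one name change:
    print(run_pool(concurrent.futures.ProcessPoolExecutor, jobs))
```

The `__main__` guard is still required for the process-backed variant, since worker processes re-import the main module.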

asyncio

Single-threaded event loop using cooperative multitasking. Coroutines yield control explicitly at await points.

Performance characteristics:

  • Near-zero context-switch overhead
  • Scales to thousands of concurrent connections
  • Requires async-compatible libraries; blocking calls stall the event loop unless offloaded (e.g. via asyncio.to_thread())

import asyncio
import urllib.request

def _fetch_blocking(url):
    """Blocking fetch - must complete read within thread context."""
    with urllib.request.urlopen(url) as response:
        return response.read()

async def fetch_url(url):
    return await asyncio.to_thread(_fetch_blocking, url)

async def main():
    urls = [
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/2",
        "https://httpbin.org/delay/1",
    ]
    
    # Python 3.11+ TaskGroup (recommended for better error handling)
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(fetch_url(url)) for url in urls]
    
    for task in tasks:
        print(task.result()[:50])

asyncio.run(main())

Critical: When using to_thread() with context managers (like urlopen), all operations on the resource must complete inside the threaded function. The context manager closes before the result returns to the async context.

Python 3.11+'s TaskGroup provides superior error handling via ExceptionGroup, which captures every task failure; by contrast, gather() with return_exceptions=False propagates only the first exception and discards the rest.

When to use: High-concurrency network operations, web servers, API clients, real-time data streaming.

Performance Decision Matrix

CPU-Bound Workloads

Task duration < 100ms    →  Run sequentially (overhead exceeds benefit)
Task duration 100ms-1s   →  ProcessPoolExecutor or multiprocessing.Pool
Task duration > 1s       →  ProcessPoolExecutor or multiprocessing.Pool
Large data transfer      →  Avoid multiprocessing (pickling overhead)
Memory-constrained       →  Limit worker count; consider shared_memory
NUMA systems             →  Pin processes to NUMA nodes; avoid cross-node memory access

Memory and NUMA Considerations:

  • Each process in a pool duplicates memory footprint. An application using 500MB per worker with 16 workers consumes 8GB+.
  • On NUMA systems (common in server hardware), processes accessing memory on a remote NUMA node incur 30-50% latency penalty. Use taskset (Linux) or psutil process affinity to pin workers to local nodes.
  • For large datasets, consider multiprocessing.shared_memory (Python 3.8+) or numpy memmap to avoid per-process copies.
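The shared_memory approach from the last bullet can be sketched as follows (block size and contents are arbitrary; for brevity the second handle attaches in the same process, whereas a worker process would attach by receiving shm.name over IPC):

```python
from multiprocessing import shared_memory

# Create a 1 MB shared block; the OS assigns it a unique name.
shm = shared_memory.SharedMemory(create=True, size=1024 * 1024)
try:
    # Write through the writable memoryview.
    shm.buf[:5] = b"hello"

    # Any process that knows shm.name can attach without copying:
    attached = shared_memory.SharedMemory(name=shm.name)
    data = bytes(attached.buf[:5])
    attached.close()
finally:
    shm.close()
    shm.unlink()  # free the OS-level segment exactly once

print(data)  # b'hello'
```

Exactly one process should call unlink(); every process that attached should call close() when done.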

I/O-Bound Workloads

Blocking libraries (requests, stdlib)  →  ThreadPoolExecutor
Async libraries (aiohttp, asyncpg)     →  asyncio
Mixed blocking/async                   →  asyncio with to_thread() or run_in_executor()
High connection count (>100)           →  asyncio

Benchmark Comparison

Note: These ratios are representative examples from typical workloads. Actual performance varies significantly based on hardware (CPU cores, memory bandwidth, disk I/O), OS scheduler behavior, network conditions, and specific workload characteristics. Always benchmark on your target deployment environment.

Workload Type                      Sequential  ThreadPool  ProcessPool  asyncio
CPU-bound (compute)                1.0x        1.0x (GIL)  7-8x         1.0x
I/O-bound (network, low latency)   1.0x        3-5x        2-4x         5-10x
I/O-bound (network, high latency)  1.0x        8-15x       6-12x        15-50x
I/O-bound (disk)                   1.0x        2-3x        1-2x         1.0-2.0x (via to_thread)

Common Pitfalls

  1. Using threads for CPU-bound work: GIL serialization negates parallelism benefits
  2. Large data with multiprocessing: Pickle serialization can dominate runtime
  3. Blocking calls in asyncio: Blocks entire event loop, killing concurrency
  4. Shared state with processes: Requires explicit IPC (Queues, Managers, shared_memory)
  5. Over-spawning processes: More workers than CPU cores causes context-switch overhead
  6. Context manager scope with to_thread(): Resources close before async context can use them
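Pitfall 3 is the most common in practice. A minimal sketch of the fix, using time.sleep as a stand-in for any blocking library call:

```python
import asyncio
import time

def blocking_work():
    time.sleep(0.2)  # stands in for requests.get, file I/O, etc.
    return "done"

async def main():
    start = time.perf_counter()
    # Wrong: calling time.sleep(0.2) here would freeze the entire
    # event loop for 0.2s per call.
    # Right: offload each blocking call to a worker thread and await it.
    results = await asyncio.gather(
        asyncio.to_thread(blocking_work),
        asyncio.to_thread(blocking_work),
        asyncio.to_thread(blocking_work),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
# The three 0.2s calls overlap in threads, so the total is far
# closer to 0.2s than to the 0.6s a blocked loop would take.
print(f"{results} in {elapsed:.2f}s")
```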

Quick Selection Guide

CPU-bound tasks: Use process-based parallelism (ProcessPoolExecutor or multiprocessing.Pool). Choose based on API preference: ProcessPoolExecutor for the simpler high-level interface, multiprocessing.Pool for advanced features such as imap, imap_unordered, and starmap.

I/O-bound tasks with blocking libraries: Use ThreadPoolExecutor for moderate concurrency (10-100 operations).

I/O-bound tasks with async libraries: Use asyncio for high concurrency (100+ operations).

Mixed blocking/async code: Use asyncio with to_thread() or run_in_executor() to integrate blocking calls.
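A minimal sketch of the run_in_executor() route (blocking_square is an illustrative stand-in for a real blocking call); unlike to_thread(), run_in_executor() lets you supply your own pool rather than the loop's default one:

```python
import asyncio
import concurrent.futures

def blocking_square(n):
    return n * n  # stands in for a blocking library call

async def main():
    loop = asyncio.get_running_loop()
    # Cap concurrency by sizing the pool: at most 2 blocking
    # calls run at once, regardless of how many are awaited.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, blocking_square, n) for n in range(4))
        )
    return results

results = asyncio.run(main())
print(results)  # [0, 1, 4, 9]
```

Passing a ProcessPoolExecutor instead of a ThreadPoolExecutor offloads CPU-bound work from the event loop the same way.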

Getting Started

  1. Profile first: Identify whether your bottleneck is CPU or I/O using cProfile or py-spy
  2. Start simple: Try concurrent.futures for straightforward parallelization
  3. Migrate to asyncio: When you need >100 concurrent I/O operations
  4. Use multiprocessing: For CPU-intensive data processing with minimal inter-process communication
  5. Monitor memory: Process pools multiply memory usage by worker count
  6. Consider NUMA: On multi-socket systems, pin processes to avoid cross-node memory latency
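Step 1 can be sketched with the stdlib's cProfile (the cpu_heavy function is a placeholder for your real workload):

```python
import cProfile
import io
import pstats

def cpu_heavy():
    # Placeholder workload; in real profiling this would be your code.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
cpu_heavy()
profiler.disable()

# Render the hottest functions, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

If the cumulative time concentrates in your own computation, reach for processes; if it concentrates in socket or file waits, reach for threads or asyncio.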
