Qualcomm AI Engine Direct - calibration thread auto-tuning by abhinaykukkadapu · Pull Request #18184 · pytorch/executorch

abhinaykukkadapu · 2026-03-14T19:54:49Z

TL;DR

Total calibration time is down to roughly 20-25 minutes, from 6.5-10 hours previously, across the models tested. The gains are the stacked result of several commits. The only remaining bottleneck is the QNN SDK compile step, which is opaque to us.

Thread tuning

AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (os.cpu_count()) causes massive OpenMP synchronization overhead on many-core hosts. This PR adds runtime auto-tuning that sweeps candidate thread counts with a quick microbenchmark and picks the fastest. The choice can be overridden on the command line with --calibration_num_threads.
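A minimal sketch of the sweep described above, not the code in this PR. The function name `autotune_num_threads`, its parameters, and the best-of-N timing policy are all assumptions made for illustration; the thread setter and the workload are injected so the sketch stays independent of any particular backend.

```python
import time
from typing import Callable, Iterable, Optional


def autotune_num_threads(
    candidates: Iterable[int],
    set_num_threads: Callable[[int], None],
    workload: Callable[[], None],
    warmup: int = 1,
    iters: int = 3,
    timer: Callable[[], float] = time.perf_counter,
) -> int:
    """Return the candidate whose best-of-`iters` run time is lowest.

    Hypothetical helper: the names and timing policy are illustrative only.
    """
    best_n: Optional[int] = None
    best_time = float("inf")
    for n in candidates:
        set_num_threads(n)
        # Warm-up runs let thread pools and caches settle before timing.
        for _ in range(warmup):
            workload()
        elapsed = float("inf")
        for _ in range(iters):
            start = timer()
            workload()
            elapsed = min(elapsed, timer() - start)
        # Strict "<" keeps the earlier (smaller) candidate on ties.
        if elapsed < best_time:
            best_n, best_time = n, elapsed
    if best_n is None:
        raise ValueError("no candidate thread counts were given")
    return best_n
```

With PyTorch, one would presumably pass `torch.set_num_threads` as the setter and a closure that runs a short matrix-vector product as the workload, so that the microbenchmark resembles the SGEMV-heavy decode step.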

On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.
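The --calibration_num_threads override could be wired up roughly as follows. Only the flag name comes from this PR; the `int` type, the `None` default meaning "auto-tune", and the helper names `build_parser` and `resolve_num_threads` are assumptions for illustration.

```python
import argparse
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Flag name is from the PR description; type, default and help text are assumed.
    parser.add_argument(
        "--calibration_num_threads",
        type=int,
        default=None,
        help="Fixed thread count for calibration; omit to auto-tune.",
    )
    return parser


def resolve_num_threads(cli_value: Optional[int], autotune: Callable[[], int]) -> int:
    """Prefer an explicit CLI value; otherwise fall back to auto-tuning."""
    if cli_value is not None:
        if cli_value < 1:
            raise ValueError("--calibration_num_threads must be >= 1")
        return cli_value
    return autotune()
```

Passing the auto-tuner as a callable means the microbenchmark only runs when no explicit value was given.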

| Host | cpu_count | Physical cores | Candidate thread counts |
|---|---|---|---|
| 4-core laptop | 8 | 4 | [1, 1, 2, 3, 4, 6, 8] → deduped: [1, 2, 3, 4, 6, 8] |
| 8-core workstation | 16 | 8 | [1, 2, 4, 6, 8, 12, 16] |
| 36-core VM | 72 | 36 | [4, 9, 18, 27, 36, 54, 72] |
| 64-core | 128 | 64 | [8, 16, 32, 48, 64, 96, 128] |
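The candidate lists in the table above are consistent with scaling the physical core count by 1/8, 1/4, 1/2, 3/4, 1, 3/2 and 2, rounding down, clamping to at least 1, and dropping duplicates. That rule is inferred from the table rows, not taken from the PR diff, so the sketch below (with the hypothetical name `candidate_thread_counts`) is one possible reconstruction.

```python
from typing import List

# Multipliers inferred from the table rows; an assumption, not the PR's code.
_FRACTIONS = (0.125, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0)


def candidate_thread_counts(physical_cores: int) -> List[int]:
    """Build an ordered, de-duplicated list of thread counts to benchmark."""
    seen = set()
    candidates: List[int] = []
    for fraction in _FRACTIONS:
        # Round down, but never go below one thread.
        n = max(1, int(physical_cores * fraction))
        if n not in seen:
            seen.add(n)
            candidates.append(n)
    return candidates


print(candidate_thread_counts(4))   # [1, 2, 3, 4, 6, 8]
print(candidate_thread_counts(36))  # [4, 9, 18, 27, 36, 54, 72]
```

In every row of the table the largest candidate equals cpu_count, which matches 2x the physical cores on hosts with 2-way SMT; how other topologies are handled is not stated here.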

Llama3.2-1B (hybrid, max_seq_len=1024)

| Phase | Baseline | + SeqMSE + No PREFILL | + Thread Auto-Tune | Speedup |
|---|---|---|---|---|
| DECODE calibration | 13,225s (3h40m) | 6,639s (1h51m) | 1,523.6s (25.4m) | 8.7x |
| PREFILL calibration | 10,317s (2h52m) | skipped | skipped | - |
| Total (calib only) | ~6h32m | ~1h51m | ~25 min | 15.7x |
| Compile (QNN SDK) | ~7,700s (2h8m) | ~7,700s (2h8m) | 7,324s (2h2m) | - |
| Total (with compile) | ~8h53m | ~4h10m | ~2h28m | 3.6x |
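As a sanity check, the Speedup column can be recomputed as baseline time divided by the fully optimized time. The snippet below uses the figures exactly as printed in the table, including the rounded totals; the helper names are illustrative.

```python
def hms(hours: int = 0, minutes: int = 0, seconds: float = 0.0) -> float:
    """Convert hours/minutes/seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Baseline over optimized, rounded to one decimal place."""
    return round(baseline_s / optimized_s, 1)


decode = speedup(13225, 1523.6)                    # DECODE calibration
calib_total = speedup(hms(6, 32), hms(0, 25))      # Total (calib only)
end_to_end = speedup(hms(8, 53), hms(2, 28))       # Total (with compile)

print(decode, calib_total, end_to_end)  # 8.7 15.7 3.6
```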

Qwen3-0.6B (hybrid, max_seq_len=1024)

| Phase | Optimized (all 3) |
|---|---|
| DECODE calibration | 950.9s (15.8m) |
| PREFILL calibration | skipped |
| Total (calib only) | ~18 min |
| Compile (QNN SDK) | not measured (--skip_compile) |

cc @cccclai @cbilgin @digantdesai @tanvirislam-meta


pytorch-bot · 2026-03-14T19:54:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18184

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 64f1fb6 with merge base 8bec69b:

NEW FAILURE - The following job has failed:

pull / unittest-editable / macos / macos-job (gh)
export/tests/test_target_recipes.py::TestTargetRecipes::test_linear_model

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-14T19:55:32Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

abhinaykukkadapu added this to ExecuTorch Core Mar 14, 2026

abhinaykukkadapu added the module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/ label Mar 14, 2026

github-project-automation bot moved this to To triage in ExecuTorch Core Mar 14, 2026

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 14, 2026

abhinaykukkadapu linked an issue Mar 14, 2026 that may be closed by this pull request

Optimize decode loop in calibration #18065

Open

abhinaykukkadapu moved this from To triage to In progress in ExecuTorch Core Mar 14, 2026

abhinaykukkadapu marked this pull request as ready for review March 14, 2026 21:38

abhinaykukkadapu requested a review from cccclai as a code owner March 14, 2026 21:38

abhinaykukkadapu requested review from chenweng-quic, haowhsu-quic, shewu-quic and winskuo-quic March 14, 2026 21:38

abhinaykukkadapu mentioned this pull request Mar 14, 2026

Optimize decode loop in calibration #18065

Open

abhinaykukkadapu wants to merge 1 commit into pytorch:main from abhinaykukkadapu:calibration-thread-tuning
