
Qualcomm AI Engine Direct - calibration thread auto-tuning#18184

Open
abhinaykukkadapu wants to merge 1 commit into pytorch:main from abhinaykukkadapu:calibration-thread-tuning

Conversation


@abhinaykukkadapu abhinaykukkadapu commented Mar 14, 2026

TL;DR

Overall calibration time has been cut to roughly 20-25 minutes, down from the previous 6.5h-10h across various models. These optimizations are the stacked results of multiple commits. The only remaining bottleneck is the QNN SDK compile, which is opaque to us.

Thread tuning

AR1 decode calibration is SGEMV-dominated and memory-bandwidth-bound. The default thread count (os.cpu_count()) causes massive OpenMP sync overhead on multi-core hosts. Add runtime auto-tuning that sweeps candidate thread counts via a quick microbenchmark and picks the fastest. CLI override via --calibration_num_threads.

On a 72-vCPU host, auto-tune selects 18-36 threads, yielding 4.6x faster calibration (24 min vs 1h51m) with no PPL regression.
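The sweep described above can be sketched as follows. This is a minimal, hypothetical reconstruction (the function name `pick_num_threads` and its parameters are illustrative, not the PR's actual code), assuming a torch matrix-vector product as a proxy for the SGEMV-dominated decode workload:

```python
import time
import torch

def pick_num_threads(candidates, dim=2048, iters=10):
    """Time an SGEMV-style matrix-vector product at each candidate
    intra-op thread count and return the fastest one."""
    w = torch.randn(dim, dim)
    x = torch.randn(dim)
    best_n, best_t = candidates[0], float("inf")
    for n in candidates:
        torch.set_num_threads(n)
        torch.mv(w, x)  # warm-up so one-time allocation costs don't skew timing
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.mv(w, x)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best_n, best_t = n, dt
    return best_n
```

Passing --calibration_num_threads would simply pin the count and skip the sweep.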

| Host | cpu_count | Physical cores | Candidates |
|---|---|---|---|
| 4-core laptop | 8 | 4 | [1, 1, 2, 3, 4, 6, 8] → deduped: [1, 2, 3, 4, 6, 8] |
| 8-core workstation | 16 | 8 | [1, 2, 4, 6, 8, 12, 16] |
| 36-core VM | 72 | 36 | [4, 9, 18, 27, 36, 54, 72] |
| 64-core | 128 | 64 | [8, 16, 32, 48, 64, 96, 128] |
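The candidate ladders shown for each host follow a simple pattern: fractions and multiples of the physical core count, from 1/8x up to 2x (the logical CPU count with SMT), deduplicated on small hosts. A hypothetical reconstruction of that rule (not the PR's actual code):

```python
def candidate_thread_counts(physical_cores: int) -> list[int]:
    """Build a sweep of thread counts around the physical core count,
    spanning 1/8x..2x, with duplicates removed in order."""
    phys = physical_cores
    raw = [
        max(1, phys // 8),
        max(1, phys // 4),
        max(1, phys // 2),
        max(1, 3 * phys // 4),
        phys,
        3 * phys // 2,
        2 * phys,  # == logical CPUs on a 2-way SMT host
    ]
    seen, cands = set(), []
    for n in raw:
        if n not in seen:  # small hosts produce repeats like [1, 1, ...]
            seen.add(n)
            cands.append(n)
    return cands
```

With this rule, a 4-core laptop yields [1, 2, 3, 4, 6, 8] after deduplication, matching the first row of the table.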

Llama3.2-1B (hybrid, max_seq_len=1024)

| Phase | Baseline | + SeqMSE | + No PREFILL + Thread Auto-Tune | Speedup |
|---|---|---|---|---|
| DECODE calibration | 13,225s (3h40m) | 6,639s (1h51m) | 1,523.6s (25.4m) | 8.7x |
| PREFILL calibration | 10,317s (2h52m) | skipped | skipped | - |
| Total (calib only) | ~6h32m | ~1h51m | ~25 min | 15.7x |
| Compile (QNN SDK) | ~7,700s (2h8m) | ~7,700s (2h8m) | 7,324s (2h2m) | - |
| Total (with compile) | ~8h53m | ~4h10m | ~2h28m | 3.6x |

Qwen3-0.6B (hybrid, max_seq_len=1024)

| Phase | Optimized (all 3) |
|---|---|
| DECODE calibration | 950.9s (15.8m) |
| PREFILL calibration | skipped |
| Total (calib only) | ~18 min |
| Compile (QNN SDK) | not measured (--skip_compile) |

cc @cccclai @cbilgin @digantdesai @tanvirislam-meta

@abhinaykukkadapu abhinaykukkadapu added the module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/ label Mar 14, 2026
@github-project-automation github-project-automation bot moved this to To triage in ExecuTorch Core Mar 14, 2026

pytorch-bot bot commented Mar 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18184

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 64f1fb6 with merge base 8bec69b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 14, 2026
@abhinaykukkadapu abhinaykukkadapu linked an issue Mar 14, 2026 that may be closed by this pull request
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@abhinaykukkadapu abhinaykukkadapu moved this from To triage to In progress in ExecuTorch Core Mar 14, 2026
@abhinaykukkadapu abhinaykukkadapu marked this pull request as ready for review March 14, 2026 21:38

Labels

CLA Signed — This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
module: qnn — Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Optimize decode loop in calibration

1 participant