[ET-VK][matmul] Re-implement fp32/fp16 matmul and linear with tiled compute and blocked weight packing #18171
Conversation
Replace all existing matmul/linear operator implementations with new ones built from the ground up using a tiled compute approach. Delete all legacy implementations (MatMulLegacy.cpp, LinearLegacy.cpp, addmm_optimized.glsl, addmm_naive_*.glsl).

New matmul (mm/bmm/addmm):
- Single matmul.glsl shader handles mm, bmm, and addmm using the FPInputTile, FPWeightTile, FPOutTile infrastructure from SDPA
- Adaptive tile size selection (TILE_M=4/2/1) based on GPU occupancy
- When mat2 is a constant tensor, automatically routes through the linear path for blocked weight packing

New linear:
- Custom 4OC×4IC blocked weight prepacking via pack_fp_linear_weight.glsl for optimal cache line utilization during tiled matmul
- Supports both transposed [N,K] and non-transposed [K,N] weights, with batch dimension support
- Separate texture2d weight storage with automatic buffer fallback for large dimensions

Performance on Adreno 750 (fp16, vs legacy):
- Linear [4096,1024]x[256,1024]: 1.33x faster (texture)
- Linear [4096,64]x[128,64]: 2.67x faster (texture)
- BMM [1,4096,256]x[1,256,1024]: 1.63x faster (texture)

Differential Revision: [D96488384](https://our.internmc.facebook.com/intern/diff/D96488384/)
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18171

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 12 Unrelated Failures as of commit 855646e with merge base cc27e6b:
- NEW FAILURE: the following job has failed
- FLAKY: the following jobs failed, but likely due to flakiness present on trunk
- BROKEN TRUNK: the following jobs failed, but were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Merged 7a63aff into gh/SS-JIA/487/base
Stack from ghstack (oldest at bottom):
Replace all existing matmul/linear operator implementations with new ones built
from the ground up using a tiled compute approach. Delete all legacy
implementations (MatMulLegacy.cpp, LinearLegacy.cpp, addmm_optimized.glsl,
addmm_naive_*.glsl).
New matmul (mm/bmm/addmm):

- Single matmul.glsl shader handles mm, bmm, and addmm using the FPInputTile, FPWeightTile, FPOutTile infrastructure from SDPA
- Adaptive tile size selection (TILE_M=4/2/1) based on GPU occupancy
- When mat2 is a constant tensor, automatically routes through the linear path for blocked weight packing
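The tiled compute idea can be sketched in plain Python (an illustrative model only, not the actual matmul.glsl shader; the tile sizes, loop order, and divisibility assumption here are simplifications for clarity). Each "invocation" owns one TILE_M x TILE_N output tile and accumulates partial products across K in steps of TILE_K, which is what lets the shader reuse loaded input/weight tiles instead of refetching per output element:

```python
def tiled_matmul(A, B, tile_m=4, tile_n=4, tile_k=4):
    """A: M x K, B: K x N, returns M x N.
    Dimensions are assumed divisible by the tile sizes for simplicity."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):        # one "invocation" per output tile
        for n0 in range(0, N, tile_n):
            # accumulator tile, analogous to an FPOutTile held in registers
            acc = [[0.0] * tile_n for _ in range(tile_m)]
            for k0 in range(0, K, tile_k):
                # conceptually: load a tile_m x tile_k input tile and a
                # tile_k x tile_n weight tile, then accumulate their product
                for i in range(tile_m):
                    for j in range(tile_n):
                        for k in range(tile_k):
                            acc[i][j] += A[m0 + i][k0 + k] * B[k0 + k][n0 + j]
            for i in range(tile_m):
                C[m0 + i][n0:n0 + tile_n] = acc[i]
    return C
```

Shrinking `tile_m` from 4 to 2 to 1 (as the adaptive TILE_M selection does) trades per-invocation register pressure for more invocations, which is the occupancy knob the description refers to.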
New linear:

- Custom 4OC×4IC blocked weight prepacking via pack_fp_linear_weight.glsl for optimal cache line utilization during tiled matmul
- Supports both transposed [N,K] and non-transposed [K,N] weights, with batch dimension support
- Separate texture2d weight storage with automatic buffer fallback for large dimensions
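The 4OC×4IC blocking can be illustrated with a hypothetical Python packer (the real layout is defined by pack_fp_linear_weight.glsl and may order blocks differently; this only shows the general idea of making each 4-output-channel by 4-input-channel block contiguous, so one cache line or texel fetch serves a whole 4x4 weight tile during the tiled matmul):

```python
def pack_4oc_4ic(W):
    """W: [N, K] weight matrix (N output channels, K input channels),
    with N and K assumed divisible by 4. Returns a flat list in which
    every consecutive run of 16 values is one 4OC x 4IC block."""
    N, K = len(W), len(W[0])
    packed = []
    for n0 in range(0, N, 4):        # block of 4 output channels
        for k0 in range(0, K, 4):    # block of 4 input channels
            for n in range(n0, n0 + 4):
                packed.extend(W[n][k0:k0 + 4])
    return packed
```

In the row-major original, the 4 weights of a given input-channel group for output channel n+1 sit K elements away from those for output channel n; after packing they are adjacent, which is the cache-line-utilization win the description claims.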
Performance on Adreno 750 (fp16, vs legacy):

- Linear [4096,1024]x[256,1024]: 1.33x faster (texture)
- Linear [4096,64]x[128,64]: 2.67x faster (texture)
- BMM [1,4096,256]x[1,256,1024]: 1.63x faster (texture)
Differential Revision: D96488384