Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass (#18167)

Open
3l1 wants to merge 11 commits into pytorch:main from 3l1:export-D96432610

Conversation


3l1 commented Mar 13, 2026

Summary:

Two optimizations in ToTosaMemoryFormatPass to reduce TOSA TRANSPOSE nodes:

  1. NHWC-safe reshape detection: When a 4D→4D view_copy has monotonic
    shape_indices on the raw shapes and preserves the last dimension (NHWC
    channel), skip inserting input/output transposes. The view_copy can
    operate directly on NHWC data.

  2. Redundant permute_copy elimination: Model-level permute_copy ops whose
    permutation matches channels_last_order (NCHW→NHWC) or its inverse
    (NHWC→NCHW) are redundant with the tosa_dim_order annotation that
    already handles format conversion. Replace them with view_copy (identity
    reshape) to avoid generating TOSA TRANSPOSE nodes. Handles both 4D
    (rank>=4, sr>=2) and 3D (rank>=3, sr>=1) permutations.
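The two checks can be sketched roughly as follows. This is a hedged illustration, not the pass's actual code: `is_channels_last_perm` and `is_nhwc_safe_reshape` are hypothetical names, and the reshape check is simplified to the preserved-last-dimension condition (the actual pass additionally requires monotonic shape_indices on the raw shapes):

```python
def is_channels_last_perm(perm):
    # Hypothetical check: does perm match channels_last_order (NCHW->NHWC)
    # or its inverse (NHWC->NCHW)? Covers both the 4D and 3D cases.
    rank = len(perm)
    if rank < 3:
        return False
    nchw_to_nhwc = [0] + list(range(2, rank)) + [1]
    nhwc_to_nchw = [0, rank - 1] + list(range(1, rank - 1))
    return list(perm) in (nchw_to_nhwc, nhwc_to_nchw)


def is_nhwc_safe_reshape(in_shape, out_shape):
    # Simplified sketch: a 4D->4D reshape that preserves the last (channel)
    # dimension can operate directly on NHWC data, so no input/output
    # transposes need to be inserted around it.
    return (
        len(in_shape) == 4
        and len(out_shape) == 4
        and in_shape[-1] == out_shape[-1]
    )
```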

Reviewed By: digantdesai

Differential Revision: D96432610

3l1 added 10 commits March 4, 2026 10:27
Differential Revision: D92296625
…ranspose chains

Summary:
This pass fuses transpose -> reshape -> transpose patterns into a single transpose
followed by a reshape. This optimization is particularly useful when reordering
tensors between NCHW and NHWC memory formats.

The pass now handles both:
- torch.ops.aten.permute_copy.default (standard aten ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)

This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.

Example: Consider a reshape on an NCHW tensor that reshapes the batch and channel
dimensions into the channel dimension:
    (N, C, H, W) -> reshape -> (1, (N, C), H, W)

If both input and output tensors are reordered to NHWC:
    (N, H, W, C)
    -> transpose -> (N, C, H, W)
    -> reshape -> (1, (N, C), H, W)
    -> transpose -> (1, H, W, (N, C))

This is equivalent to:
    (N, H, W, C) -> transpose -> (H, W, N, C) -> reshape -> (1, H, W, (N, C))

Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transpose_reshape_transpose

Differential Revision: D95292796
…ear layers

Summary:
This pass fuses transpose -> reshape -> linear patterns by eliminating the
transpose operation. Instead of transposing at runtime, the pass applies the
inverse transpose to the linear layer's weights at compile time.

The pass now handles both:
- torch.ops.aten.permute_copy.default (standard aten ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)

This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.

Common artifact from NCHW -> NHWC reordering:
- Transpose -> Reshape -> Linear
where the Reshape flattens all dimensions except the batch dimension.

The pass validates:
1. Transpose does not modify the batch dimension (dims[0] == 0)
2. Reshape flattens to 2D with batch dim preserved
3. Followed by a linear/mm operation

Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transpose_reshape_fc
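The weight fold can be illustrated in numpy. The shapes and the index bookkeeping below are an assumption about the general mechanism (permuting the flattened feature order into the weight columns), not the pass's actual implementation:

```python
import numpy as np

N, A, B, OUT = 2, 3, 4, 5  # illustrative dims; batch dim N is untouched
rng = np.random.default_rng(0)
x = rng.standard_normal((N, A, B))
W = rng.standard_normal((OUT, A * B))  # linear weight: out_features x in_features

# Runtime chain: transpose (batch-preserving) -> flatten to 2D -> linear
y_chain = x.transpose(0, 2, 1).reshape(N, B * A) @ W.T

# Compile-time fold: the chain's flattened feature j reads original flat
# feature idx[j], so column j of W moves to column idx[j] of the folded weight.
idx = np.arange(A * B).reshape(A, B).transpose(1, 0).reshape(-1)
W_folded = np.empty_like(W)
W_folded[:, idx] = W
y_folded = x.reshape(N, A * B) @ W_folded.T  # no runtime transpose

assert np.allclose(y_chain, y_folded)
```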

Differential Revision: D95268043
Summary:
This pass identifies consecutive transpose/permute operations and either:
1. Removes both if they compose to identity (cancel out)
2. Fuses them into a single permute with combined dimensions

The pass now handles:
- exir_ops.edge.aten.permute.default (standard edge dialect ops)
- exir_ops.edge.aten.permute_copy.default (edge dialect copy ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)

This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.

The optimization reduces runtime overhead by eliminating redundant
memory movement operations.
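The composition and identity check can be sketched as follows (`compose` and `is_identity` are hypothetical helper names, using numpy's gather-style transpose semantics):

```python
import numpy as np

def compose(perm1, perm2):
    # Net permutation equivalent to x.transpose(perm1).transpose(perm2).
    return [perm1[p] for p in perm2]

def is_identity(perm):
    return list(perm) == list(range(len(perm)))

x = np.random.default_rng(0).standard_normal((2, 3, 4, 5))
p1, p2 = [0, 2, 3, 1], [0, 3, 1, 2]  # NCHW->NHWC and its inverse

assert compose(p1, p2) == [0, 1, 2, 3]  # they cancel out -> remove both
assert is_identity(compose(p1, p2))
assert np.array_equal(x.transpose(p1).transpose(p2), x.transpose(compose(p1, p2)))
```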

Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transposes

Differential Revision: D95293181
…riant ops

Summary:
This pass identifies and removes redundant transpose pairs that sandwich layout-invariant operations (e.g., elementwise ops like relu, add, mul).

Pattern targeted:
  T(perm1) → LayoutInvariantOp → T(perm2)

When perm1 and perm2 compose to identity (i.e., they cancel out), both transposes can be safely removed since the middle operation doesn't depend on data layout.

This pattern is common in TOSA graphs where ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes at operation boundaries, but some elementwise ops between them don't actually require the format conversion.

Example:
  Before: T([0,2,3,1]) → ReLU → T([0,3,1,2])
  After:  ReLU  (both transposes removed)

The pass handles:
- edge.aten.permute/permute_copy operations
- backend.tosa.TRANSPOSE operations
- All common elementwise unary ops (relu, sigmoid, tanh, clamp, etc.)
- Elementwise binary ops (add, mul, sub, div, etc.)
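A minimal numpy demonstration of the sandwich removal for ReLU, using the permutations from the example above:

```python
import numpy as np

x = np.random.default_rng(1).standard_normal((2, 3, 4, 5))
p1, p2 = (0, 2, 3, 1), (0, 3, 1, 2)  # compose to identity

before = np.maximum(x.transpose(p1), 0.0).transpose(p2)  # T -> ReLU -> T
after = np.maximum(x, 0.0)                               # both transposes removed

# ReLU is elementwise (layout-invariant), so the results match exactly.
assert np.array_equal(before, after)
```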
…TOSA Rescale operations

Summary:
This pass targets T1 → Rescale → T2 patterns where transposes surround a TOSA
Rescale operation. Since Rescale is elementwise (per-element scaling and
zero-point adjustment), it is layout-invariant and transposes can be
propagated through it.

Pattern targeted:
  Before: T(perm1) → Rescale → T(perm2)
  After:  Rescale → T_combined(compose(perm1, perm2))

If the composed permutation is identity, both transposes are eliminated:
  Before: T(perm1) → Rescale → T(inverse(perm1))
  After:  Rescale (both transposes removed)
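In numpy terms, with a toy per-tensor rescale standing in for the TOSA Rescale op (the scale/zero-point values are illustrative, and the composed permutation here is deliberately not identity):

```python
import numpy as np

def rescale(x, scale=0.05, zp=3):
    # Toy per-tensor rescale: elementwise, hence layout-invariant.
    return x * scale + zp

x = np.random.default_rng(3).standard_normal((2, 3, 4, 5))
p1, p2 = (0, 2, 3, 1), (0, 2, 3, 1)  # composed perm is NOT identity here

before = rescale(x.transpose(p1)).transpose(p2)   # T1 -> Rescale -> T2
combined = [p1[i] for i in p2]                    # compose(p1, p2), numpy semantics
after = rescale(x).transpose(combined)            # Rescale -> T_combined

assert np.array_equal(before, after)
```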

This optimization is particularly effective for TOSA graphs where:
- ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes
- Rescale operations are inserted for quantization/dequantization
- The pattern Transpose → Rescale → Conv → Rescale → Transpose is common

The pass handles:
- edge.aten.permute/permute_copy operations
- backend.tosa.TRANSPOSE operations
…ugh Concat operations

Summary:
This pass targets the pattern [T(perm), T(perm), ...] → Concat(dim=d) → T(inv_perm)
where all inputs to Concat have the same transpose permutation and the output
is transposed with the inverse permutation. The transposes can be eliminated
by adjusting the concat dimension.

This optimization is particularly useful for TOSA graphs where
ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes at graph boundaries.

Transformation:
- Before: [T(perm)(x1), T(perm)(x2)] → Concat(dim=d) → T(inv_perm) → y
- After: [x1, x2] → Concat(dim=inv_perm[d]) → y
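The transformation can be checked in numpy. One caveat on conventions: under numpy's gather-style transpose semantics the remapped axis is the *position* of d in inv_perm (which is what `inv_perm[d]` denotes under the opposite, scatter-style convention):

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.standard_normal((1, 2, 3, 4))  # NCHW
x2 = rng.standard_normal((1, 2, 3, 4))
perm, inv_perm = (0, 2, 3, 1), (0, 3, 1, 2)  # NCHW->NHWC and its inverse
d = 3  # concat along the channel axis of the transposed (NHWC) tensors

before = np.concatenate(
    [x1.transpose(perm), x2.transpose(perm)], axis=d
).transpose(inv_perm)

# After: drop all three transposes and remap the concat axis.
after = np.concatenate([x1, x2], axis=inv_perm.index(d))

assert np.array_equal(before, after)
```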
Summary:
This TODO documents the investigation needed to make PropagateTransposesThroughConcatPass
more effective for models like Control Ceres where:
- Input transposes have multiple users (not just Concat)
- Input transposes may have different permutations

Investigation scope:
- Option A: Allow input transposes with multiple users (duplicate for other users)
- Option B: Handle mixed permutations by propagating matching subsets
- Option C: Target a different pattern entirely
Summary:
The FuseTransposeReshapeTransposePass was only checking for torch.ops.aten.view_copy.default but not the Edge dialect exir_ops.edge.aten.view_copy.default. This prevented the pass from matching transpose-reshape-transpose patterns in edge graphs where view_copy uses the edge dialect.

Changes:
- Added exir_ops.edge.aten.view_copy.default to _RESHAPE_TARGETS
- Changed logging from debug to warning for better visibility of fusion failures
- Removed unused variable permute_output_shape
- Created ANALYSIS_expensive_transposes.md documenting findings

Note: The expensive transposes in Control Ceres model around reshape operations cannot be fused because the reshapes involve both dimension combining AND reordering (e.g., [1,2,14,72] → [1,1,72,28]). These transposes are mathematically necessary for the reshape to work correctly in NHWC layout.
…at compile time

Summary:
This pass identifies TOSA TRANSPOSE operations where the input is a static tensor (parameter, buffer, or lifted tensor constant) and folds the transpose at compile time by:

1. Actually permuting the tensor data
2. Creating a new constant placeholder with the permuted data
3. Removing the transpose node and rewiring users

This eliminates runtime transpose operations on static tensors like weights, which is especially important for Ethos-U55 where Vela implements transposes as expensive NPU_OP_POOL (1x1 AvgPool) sequences.

**Note:** Analysis on test_combined_control_ceres_u55 shows that in this particular model, all 40 transpose operations (19 TOSA TRANSPOSE + 21 aten.permute_copy) are on activation tensors, not constant tensors. The pass correctly identifies this and doesn't fold any transposes. The pass will be beneficial for future models that have transposes on constant tensors.
3l1 requested a review from digantdesai as a code owner March 13, 2026 19:59

pytorch-bot bot commented Mar 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18167

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 9 Cancelled Jobs, 3 Unrelated Failures

As of commit c019a17 with merge base 4f900b2:

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label Mar 13, 2026 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).

meta-codesync bot commented Mar 13, 2026

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96432610.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


digantdesai left a comment


Review automatically exported from Phabricator review in Meta.

3l1 added a commit to 3l1/executorch that referenced this pull request Mar 13, 2026
…ansposes in ToTosaMemoryFormatPass (pytorch#18167)

Summary:
Pull Request resolved: pytorch#18167

Two optimizations in ToTosaMemoryFormatPass to reduce TOSA TRANSPOSE nodes:

1. NHWC-safe reshape detection: When a 4D→4D view_copy has monotonic
   shape_indices on the raw shapes and preserves the last dimension (NHWC
   channel), skip inserting input/output transposes. The view_copy can
   operate directly on NHWC data.

2. Redundant permute_copy elimination: Model-level permute_copy ops whose
   permutation matches channels_last_order (NCHW→NHWC) or its inverse
   (NHWC→NCHW) are redundant with the tosa_dim_order annotation that
   already handles format conversion. Replace them with view_copy (identity
   reshape) to avoid generating TOSA TRANSPOSE nodes. Handles both 4D
   (rank>=4, sr>=2) and 3D (rank>=3, sr>=1) permutations.

This reduces Vela Transpose entries from 75→33 (-56%), Transpose op cycles from 33.4K→6.1K (-82%), and NPU operators from 367→329 (-38 ops).

Reviewed By: digantdesai

Differential Revision: D96432610
3l1 force-pushed the export-D96432610 branch from f9a57c8 to 6aac88a March 13, 2026 20:45
@meta-codesync meta-codesync bot changed the title Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass (#18167) Mar 13, 2026
3l1 added a commit to 3l1/executorch that referenced this pull request Mar 13, 2026
…ansposes in ToTosaMemoryFormatPass (pytorch#18167)

3l1 force-pushed the export-D96432610 branch from 6aac88a to 7f039b2 March 13, 2026 20:47
3l1 force-pushed the export-D96432610 branch from 7f039b2 to c019a17 March 13, 2026 21:21

Labels: ciflow/trunk, CLA Signed, fb-exported, meta-exported
