Eliminate redundant NCHW↔NHWC permute_copy and NHWC-safe view_copy transposes in ToTosaMemoryFormatPass (#18167)
3l1 wants to merge 11 commits into pytorch:main
Differential Revision: D92296625
…ranspose chains
Summary:
This pass fuses transpose -> reshape -> transpose patterns into a single transpose
followed by a reshape. This optimization is particularly useful when reordering
tensors between NCHW and NHWC memory formats.
The pass now handles both:
- torch.ops.aten.permute_copy.default (standard aten ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)
This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.
Example: Consider a reshape on an NCHW tensor that reshapes the batch and channel
dimensions into the channel dimension:
(N, C, H, W) -> reshape -> (1, (N, C), H, W)
If both input and output tensors are reordered to NHWC:
(N, H, W, C)
-> transpose -> (N, C, H, W)
-> reshape -> (1, (N, C), H, W)
-> transpose -> (1, H, W, (N, C))
This is equivalent to:
(N, H, W, C) -> transpose -> (H, W, N, C) -> reshape -> (1, H, W, (N, C))
Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transpose_reshape_transpose
Differential Revision: D95292796
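The equivalence claimed in the example above can be checked numerically. The following is a minimal NumPy sketch, not code from the pass itself; the concrete sizes and the variable names are chosen only for illustration.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the PR).
N, C, H, W = 2, 3, 4, 5
x_nhwc = np.arange(N * H * W * C).reshape(N, H, W, C)

# Original chain: NHWC -> NCHW, reshape in NCHW, then back to NHWC.
chain = (
    x_nhwc.transpose(0, 3, 1, 2)   # (N, C, H, W)
    .reshape(1, N * C, H, W)       # (1, N*C, H, W)
    .transpose(0, 2, 3, 1)         # (1, H, W, N*C)
)

# Fused form: a single transpose followed by a single reshape.
fused = x_nhwc.transpose(1, 2, 0, 3).reshape(1, H, W, N * C)

assert chain.shape == fused.shape == (1, H, W, N * C)
assert np.array_equal(chain, fused)
```

The fused form moves the data once instead of twice, which is the saving the pass is after.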
…ear layers
Summary:
This pass fuses transpose -> reshape -> linear patterns by eliminating the
transpose operation. Instead of transposing at runtime, the pass applies the
inverse transpose to the linear layer's weights at compile time.
The pass now handles both:
- torch.ops.aten.permute_copy.default (standard aten ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)
This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.
Common artifact from NCHW -> NHWC reordering:
- Transpose -> Reshape -> Linear where the Reshape flattens all dimensions
  except the batch dimension.
The pass validates:
1. Transpose does not modify the batch dimension (dims[0] == 0)
2. Reshape flattens to 2D with batch dim preserved
3. Followed by a linear/mm operation
Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transpose_reshape_fc
Differential Revision: D95268043
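To illustrate why the runtime transpose can be traded for a compile-time weight reordering, here is a minimal NumPy sketch. It assumes an NHWC activation that is transposed to NCHW and flattened before a linear layer; the sizes and names (`weight_folded`, `OUT`) are hypothetical and are not taken from the pass.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W, OUT = 2, 3, 4, 5, 7  # illustrative sizes
x_nhwc = rng.standard_normal((N, H, W, C))
# Weight columns are ordered for an NCHW-flattened input.
weight = rng.standard_normal((OUT, C * H * W))

# Runtime path: transpose NHWC -> NCHW, flatten, then apply the linear layer.
y_ref = x_nhwc.transpose(0, 3, 1, 2).reshape(N, -1) @ weight.T

# Compile-time path: view the weight columns as (C, H, W), reorder them to
# (H, W, C) with the inverse permutation, and flatten again.
weight_folded = (
    weight.reshape(OUT, C, H, W).transpose(0, 2, 3, 1).reshape(OUT, -1)
)
y_folded = x_nhwc.reshape(N, -1) @ weight_folded.T

# Same result up to floating-point summation order.
assert np.allclose(y_ref, y_folded)
```

Because the weight is static, the reordering costs nothing at inference time, while the activation transpose would run on every invocation.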
Summary:
This pass identifies consecutive transpose/permute operations and either:
1. Removes both if they compose to identity (cancel out)
2. Fuses them into a single permute with combined dimensions
The pass now handles:
- exir_ops.edge.aten.permute.default (standard edge dialect ops)
- exir_ops.edge.aten.permute_copy.default (edge dialect copy ops)
- exir_ops.backend.tosa.TRANSPOSE.default (TOSA backend ops)
This enables the pass to work with transposes inserted by ToTosaMemoryFormatPass.
The pass is placed AFTER ToTosaMemoryFormatPass in arm_pass_manager.py.
The optimization reduces runtime overhead by eliminating redundant memory
movement operations.
Inspired by bolt/nn/espresso/transforms/fuse_ops.py:fuse_transposes
Differential Revision: D95293181
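The two outcomes (cancel or fuse) both come down to composing permutations. A small pure-Python sketch follows, assuming the usual permute semantics where output axis i takes input axis perm[i]; `compose` and `is_identity` are illustrative helper names, not functions from the pass.

```python
def compose(perm1, perm2):
    """Return the single permutation equivalent to applying perm1, then perm2.

    Assumes permute semantics: output axis i takes input axis perm[i].
    """
    return [perm1[axis] for axis in perm2]


def is_identity(perm):
    """True if the permutation leaves every axis in place."""
    return list(perm) == list(range(len(perm)))


nchw_to_nhwc = [0, 2, 3, 1]
nhwc_to_nchw = [0, 3, 1, 2]

# Case 1: the pair cancels out, so both transposes can be dropped.
assert is_identity(compose(nchw_to_nhwc, nhwc_to_nchw))

# Case 2: the pair does not cancel, so it fuses into one permute.
assert compose(nchw_to_nhwc, nchw_to_nhwc) == [0, 3, 1, 2]
```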
…riant ops
Summary:
This pass identifies and removes redundant transpose pairs that sandwich
layout-invariant operations (e.g., elementwise ops like relu, add, mul).
Pattern targeted:
T(perm1) → LayoutInvariantOp → T(perm2)
When perm1 and perm2 compose to identity (i.e., they cancel out), both
transposes can be safely removed since the middle operation doesn't depend on
data layout.
This pattern is common in TOSA graphs where ToTosaMemoryFormatPass inserts
NCHW↔NHWC transposes at operation boundaries, but some elementwise ops between
them don't actually require the format conversion.
Example:
Before: T([0,2,3,1]) → ReLU → T([0,3,1,2])
After: ReLU (both transposes removed)
The pass handles:
- edge.aten.permute/permute_copy operations
- backend.tosa.TRANSPOSE operations
- All common elementwise unary ops (relu, sigmoid, tanh, clamp, etc.)
- Elementwise binary ops (add, mul, sub, div, etc.)
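The ReLU example above can be reproduced with a short NumPy check. This is only a sketch of the underlying identity, with illustrative shapes; the binary case shown assumes both operands go through the same permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 5))
y = rng.standard_normal((2, 3, 4, 5))


def relu(t):
    """Elementwise ReLU; the result does not depend on axis order."""
    return np.maximum(t, 0)


# Before: T([0,2,3,1]) -> ReLU -> T([0,3,1,2])
before_unary = relu(x.transpose(0, 2, 3, 1)).transpose(0, 3, 1, 2)
# After: ReLU only
after_unary = relu(x)
assert np.array_equal(before_unary, after_unary)

# Same idea for a binary elementwise op when both inputs share the permutation.
before_binary = (x.transpose(0, 2, 3, 1) + y.transpose(0, 2, 3, 1)).transpose(
    0, 3, 1, 2
)
after_binary = x + y
assert np.array_equal(before_binary, after_binary)
```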
…TOSA Rescale operations
Summary:
This pass targets T1 → Rescale → T2 patterns where transposes surround a TOSA
Rescale operation. Since Rescale is elementwise (per-element scaling and
zero-point adjustment), it is layout-invariant and transposes can be
propagated through it.
Pattern targeted:
Before: T(perm1) → Rescale → T(perm2)
After: Rescale → T_combined(compose(perm1, perm2))
If the composed permutation is identity, both transposes are eliminated:
Before: T(perm1) → Rescale → T(inverse(perm1))
After: Rescale (both transposes removed)
This optimization is particularly effective for TOSA graphs where:
- ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes
- Rescale operations are inserted for quantization/dequantization
- The pattern Transpose → Rescale → Conv → Rescale → Transpose is common
The pass handles:
- edge.aten.permute/permute_copy operations
- backend.tosa.TRANSPOSE operations
…ugh Concat operations
Summary:
This pass targets the pattern
[T(perm), T(perm), ...] → Concat(dim=d) → T(inv_perm)
where all inputs to Concat have the same transpose permutation and the output
is transposed with the inverse permutation. The transposes can be eliminated
by adjusting the concat dimension.
This optimization is particularly useful for TOSA graphs where
ToTosaMemoryFormatPass inserts NCHW↔NHWC transposes at graph boundaries.
Transformation:
- Before: [T(perm)(x1), T(perm)(x2)] → Concat(dim=d) → T(inv_perm) → y
- After: [x1, x2] → Concat(dim=inv_perm[d]) → y
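A NumPy sketch of the concat rewrite is given below. One caveat on indexing: under NumPy's transpose convention (output axis i takes input axis perm[i]), axis d of the transposed tensors corresponds to axis perm[d] of the originals, so the sketch uses perm[d]. The commit message writes inv_perm[d], which may reflect a different convention for which of the two permutations is called perm; the shapes and names here are illustrative only.

```python
import numpy as np

perm = (0, 2, 3, 1)  # NCHW -> NHWC
inv_perm = tuple(int(i) for i in np.argsort(perm))  # (0, 3, 1, 2)

# Two NCHW inputs that differ only in channel count (illustrative shapes).
x1 = np.arange(1 * 2 * 3 * 4).reshape(1, 2, 3, 4)
x2 = np.arange(1 * 5 * 3 * 4).reshape(1, 5, 3, 4) + 100

d = 3  # concat axis in the transposed (NHWC) space, i.e. channels

# Before: transpose every input, concat, transpose back.
before = np.concatenate(
    [x1.transpose(perm), x2.transpose(perm)], axis=d
).transpose(inv_perm)

# After: concat the original inputs along the remapped axis, no transposes.
after = np.concatenate([x1, x2], axis=perm[d])

assert before.shape == after.shape == (1, 7, 3, 4)
assert np.array_equal(before, after)
```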
Summary:
This TODO documents the investigation needed to make
PropagateTransposesThroughConcatPass more effective for models like Control
Ceres where:
- Input transposes have multiple users (not just Concat)
- Input transposes may have different permutations
Investigation scope:
- Option A: Allow input transposes with multiple users (duplicate for other users)
- Option B: Handle mixed permutations by propagating matching subsets
- Option C: Target a different pattern entirely
Summary:
The FuseTransposeReshapeTransposePass was only checking for
torch.ops.aten.view_copy.default but not the Edge dialect
exir_ops.edge.aten.view_copy.default. This prevented the pass from matching
transpose-reshape-transpose patterns in edge graphs where view_copy uses the
edge dialect.
Changes:
- Added exir_ops.edge.aten.view_copy.default to _RESHAPE_TARGETS
- Changed logging from debug to warning for better visibility of fusion failures
- Removed unused variable permute_output_shape
- Created ANALYSIS_expensive_transposes.md documenting findings
Note: The expensive transposes in Control Ceres model around reshape operations
cannot be fused because the reshapes involve both dimension combining AND
reordering (e.g., [1,2,14,72] → [1,1,72,28]). These transposes are
mathematically necessary for the reshape to work correctly in NHWC layout.
…at compile time
Summary:
This pass identifies TOSA TRANSPOSE operations where the input is a static
tensor (parameter, buffer, or lifted tensor constant) and folds the transpose
at compile time by:
1. Actually permuting the tensor data
2. Creating a new constant placeholder with the permuted data
3. Removing the transpose node and rewiring users
This eliminates runtime transpose operations on static tensors like weights,
which is especially important for Ethos-U55 where Vela implements transposes
as expensive NPU_OP_POOL (1x1 AvgPool) sequences.
**Note:** Analysis on test_combined_control_ceres_u55 shows that in this
particular model, all 40 transpose operations (19 TOSA TRANSPOSE + 21
aten.permute_copy) are on activation tensors, not constant tensors. The pass
correctly identifies this and doesn't fold any transposes. The pass will be
beneficial for future models that have transposes on constant tensors.
digantdesai left a comment:
Review automatically exported from Phabricator review in Meta.
…ansposes in ToTosaMemoryFormatPass (pytorch#18167)
Summary:
Pull Request resolved: pytorch#18167
Two optimizations in ToTosaMemoryFormatPass to reduce TOSA TRANSPOSE nodes:
1. NHWC-safe reshape detection: When a 4D→4D view_copy has monotonic
   shape_indices on the raw shapes and preserves the last dimension (NHWC
   channel), skip inserting input/output transposes. The view_copy can
   operate directly on NHWC data.
2. Redundant permute_copy elimination: Model-level permute_copy ops whose
   permutation matches channels_last_order (NCHW→NHWC) or its inverse
   (NHWC→NCHW) are redundant with the tosa_dim_order annotation that
   already handles format conversion. Replace them with view_copy (identity
   reshape) to avoid generating TOSA TRANSPOSE nodes. Handles both 4D
   (rank>=4, sr>=2) and 3D (rank>=3, sr>=1) permutations.
This reduces Vela Transpose entries from 75→33 (-56%), Transpose op cycles
from 33.4K→6.1K (-82%), and NPU operators from 367→329 (-38, about -10%).
Reviewed By: digantdesai
Differential Revision: D96432610
Summary:
Two optimizations in ToTosaMemoryFormatPass to reduce TOSA TRANSPOSE nodes:
1. NHWC-safe reshape detection: When a 4D→4D view_copy has monotonic
   shape_indices on the raw shapes and preserves the last dimension (NHWC
   channel), skip inserting input/output transposes. The view_copy can
   operate directly on NHWC data.
2. Redundant permute_copy elimination: Model-level permute_copy ops whose
   permutation matches channels_last_order (NCHW→NHWC) or its inverse
   (NHWC→NCHW) are redundant with the tosa_dim_order annotation that
   already handles format conversion. Replace them with view_copy (identity
   reshape) to avoid generating TOSA TRANSPOSE nodes. Handles both 4D
   (rank>=4, sr>=2) and 3D (rank>=3, sr>=1) permutations.
Reviewed By: digantdesai
Differential Revision: D96432610
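The redundant permute_copy check described in the summary can be sketched in a few lines of pure Python. This is a simplified illustration, not the pass's implementation: the `channels_last_order` helper below is hypothetical and only models the rank-3 and rank-4 cases by moving axis 1 to the end, whereas the real pass also takes the spatial rank (sr) into account.

```python
def channels_last_order(rank):
    """Hypothetical helper: move the channel axis (1) to the end.

    rank 3 -> (0, 2, 1), rank 4 -> (0, 2, 3, 1).
    """
    return tuple([0] + list(range(2, rank)) + [1])


def inverse(perm):
    """Return the permutation that undoes `perm`."""
    inv = [0] * len(perm)
    for out_axis, in_axis in enumerate(perm):
        inv[in_axis] = out_axis
    return tuple(inv)


def is_redundant_layout_permute(perm):
    """True if perm is the channels-last order or its inverse for its rank."""
    order = channels_last_order(len(perm))
    return tuple(perm) in (order, inverse(order))


assert is_redundant_layout_permute((0, 2, 3, 1))      # NCHW -> NHWC
assert is_redundant_layout_permute((0, 3, 1, 2))      # NHWC -> NCHW
assert is_redundant_layout_permute((0, 2, 1))         # 3D case
assert not is_redundant_layout_permute((0, 1, 3, 2))  # unrelated permute
```

A permute that passes this check carries no information beyond the layout annotation, which is why the summary describes replacing it with an identity view_copy instead of emitting a TOSA TRANSPOSE.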