
PULPOpen Training Harness: tiling-ready backward and optimizer deployment #174

Draft
runwangdl wants to merge 167 commits into pulp-platform:devel from runwangdl:TrainingPlatform

Conversation

@runwangdl
Contributor

Adds end-to-end on-device training graph deployment support for PULPOpen/Siracusa targets. This includes a full code-generation pipeline for training networks (forward + backward + optimizer), tiling support for gradient operators, and the necessary runtime harness changes to run SGD-based on-device learning.

Added

  • generateTrainingNetwork.py — CLI script to generate tiled training C code; supports --tiling, --l1, --l2, --doublebuffer, --defaultMemLevel
  • deeployTrainingRunner_siracusa.py — end-to-end training test driver for Siracusa
  • InPlaceAccumulatorV2 operator: parser, type checker, template, bindings, and SBTiler-based tile constraint (gradient accumulation buffer)
  • SoftmaxCrossEntropyLoss dual-output variant (loss scalar + log_prob): separate parser, checker, template, bindings, and MultiOutputMixin-based tile constraint
  • ConvGradX / ConvGradW / ConvGradB operators split from ConvGrad via SplitConvGradPass: individual parsers, templates, and bindings for each
  • MultiOutputTileConstraint framework (MultiOutputMixin, ScalarOutputAppender, FullTensorOutputAppender) — generic mechanism for wrapping multi-output tile constraints without per-operator boilerplate
  • deeploytraintest.c — C harness for running training steps on device, with mb % TRAINING_DATA_SIZE data cycling and post-init grad buffer memset
  • testinputs.h with TRAINING_DATA_SIZE, TRAINING_GRAD_BUF_START_IDX, TRAINING_NUM_GRAD_INPUTS macros
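The data-cycling scheme in the new C harness can be sketched as follows. This is a minimal illustration assuming the macro semantics described above (only `TRAINING_DATA_SIZE` unique samples stored, reused round-robin across training steps); the helper function name is hypothetical, not part of `deeploytraintest.c`.

```c
#include <stddef.h>

/* Assumed value for illustration; the real macro comes from testinputs.h. */
#define TRAINING_DATA_SIZE 4

/* Map training step mb onto one of the stored unique samples,
   cycling round-robin instead of storing all batches. */
static inline size_t training_sample_idx(size_t mb) {
    return mb % TRAINING_DATA_SIZE;
}
```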

Changed

  • inputs.npz / outputs.npz format — added meta_data_size (unique samples stored) and meta_n_batches (total training steps) keys; C harness cycles data via mb % TRAINING_DATA_SIZE instead of storing all batches

  • TilerExtension.py — _setupTensorDimensionProducts and _setupHeuristics now receive layerBinding as parameter; four hasattr(template, 'tileConstraint') guards added so non-tileable ops (e.g. ConvGradB) execute on their current memory level without blocking the tiler

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and targets devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

runwangdl and others added 25 commits February 13, 2026 23:55
# Conflicts:
#	.github/workflows/ci-platform-siracusa-tiled.yml
#	.github/workflows/ci-platform-siracusa.yml
#	Deeploy/Targets/Generic/Bindings.py
#	Deeploy/Targets/Generic/Layers.py
#	Deeploy/Targets/Generic/Parsers.py
#	Deeploy/Targets/Generic/Platform.py
#	Deeploy/Targets/Generic/TypeCheckers.py
#	Deeploy/Targets/PULPOpen/Bindings.py
#	Deeploy/Targets/PULPOpen/Parsers.py
#	Deeploy/Targets/PULPOpen/Platform.py
#	Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py
#	Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
#	Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
#	Deeploy/Targets/PULPOpen/Tiler.py
#	TargetLibraries/Generic/inc/kernel/Layernorm.h
#	TargetLibraries/Generic/src/Layernorm_fp32.c
…ng merge

The merge commit resolved the conflict by using TrainingPlatform's version
of Bindings.py, which was missing the GroupNorm checker and template imports
needed by the GroupNorm binding definitions (PULPGroupNormGrad*Binding,
PULPGroupNormalization*Binding) brought in from CNNTraining.
…lution

Adds missing ConvGrad, GroupNorm, AveragePool, and ReluGrad operators
that were dropped when TrainingPlatform took precedence during merge.

- Generic/Parsers.py: Add Conv2DGradXParser, Conv2DGradWParser,
  AveragePool2DParser, and 6 GroupNorm parser classes
- Generic/Layers.py: Add AveragePoolLayer/Grad, ConvGradX/WLayer,
  ReluGradLayer, and 6 GroupNorm layer classes
- PULPOpen/Parsers.py: Add 6 PULPConvGrad parser classes
- PULPOpen/Bindings.py: Add PULPAveragePool2DBindings,
  PULPAveragePoolGrad2DBindings, and fix GroupNorm/template imports
- PULPOpen/Tiler.py: Fix imports to include all CNNTraining bindings
  and GroupNorm/ConvGrad/AveragePool TileConstraints
- PULPOpen/Platform.py: Fix all imports and add GroupNorm entries
  to PULPMapping
…finitions

- Generic/TypeCheckers.py: Add AveragePoolChecker class
- Generic/Bindings.py: Add FloatAveragePoolTemplate import,
  AveragePoolChecker import, and BasicAveragePool2DBindings definition
- Generic/Platform.py: Add AveragePoolLayer/Mapper, fix imports for
  BasicAveragePool2DBindings and AveragePoolLayer
- Generic/Parsers.py: Remove duplicate LayerNormGradParser (4-input
  version from CNNTraining, keep 5-input TrainingPlatform version)
- PULPOpen/Bindings.py: Remove duplicate PULPLayernormGradBinding
  (4-input version from CNNTraining, keep 5-input version),
  remove duplicate FunctionCallClosure/ForkClosure definitions
The merge added a second LayerNormGradParser class (4-input, 1-output
version from CNNTraining) that overrode the correct 5-input, 3-output
version from TrainingPlatform. Remove the duplicate to restore the
correct implementation that matches the ONNX LayerNormalizationGrad
node format.
    -t /app/Onnx4Deeploy/onnx/model/simplemlp_train --n-accum=8 --plotMemAlloc"

MemoryScheduler (_calculateLifetimes):
- Pin is_input buffers to (0, maxStepIdx) instead of (0, last_use_step),
  preventing minimalloc from aliasing network weight inputs (e.g. fc1_weight)
  with intermediate tensors computed later in the same kernel call.

InPlaceAccumulatorV2:
- Add tiledReferenceTemplate (PULPOpen) that writes only to ${accum_buffer};
  data_out is excluded because its L2 address may alias other live buffers.
- InPlaceAccumulatorV2TileConstraint: remove data_out from addrNames; egress
  DMA targets accum_buffer's L2 address instead of data_out.
- PULPInPlaceAccumulatorV2TilingReadyBindings now uses TiledBindings.
- Generic template: write to both accum_buffer and data_out; use
  > (int8_t)(-128) for reset detection to handle sign-propagation path.
  Add int8_t binding variant for the sign-propagated lazy_reset_grad.
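The generic-template reset check described above can be illustrated with a small sketch. Everything here is an assumption for illustration: the function name, the buffer layout, and in particular which branch `lazy_reset_grad > (int8_t)(-128)` selects; the point is only that comparing against `(int8_t)(-128)` keeps the check valid when the flag has passed through the sign-propagation path.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of an in-place gradient accumulator with lazy reset.
   A lazy_reset_grad value above INT8_MIN is taken here to mean "reset"
   (overwrite), and INT8_MIN itself to mean "accumulate"; the real
   generated template also mirrors the result into data_out. */
static void in_place_accumulate_f32(float *accum, const float *grad,
                                    size_t n, int8_t lazy_reset_grad) {
    if (lazy_reset_grad > (int8_t)(-128)) {
        for (size_t i = 0; i < n; ++i)
            accum[i] = grad[i];   /* reset: overwrite the buffer */
    } else {
        for (size_t i = 0; i < n; ++i)
            accum[i] += grad[i];  /* accumulate into the buffer */
    }
}
```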

SoftmaxCrossEntropyLoss dual-output:
- Add SoftmaxCrossEntropyLossDualOutputTileConstraint with wrapTilingSolution
  override to bypass the single-output assertion.
- Platform: DualOutputMapper now uses TilingReadyBindings.

LayerNorm / LayerNormGrad tiling:
- LayernormTileConstraint: add mean/inv_std_dev output constraints and
  wrapTilingSolution to handle multi-output nodes.
- LayernormGradTileConstraint: replace bias with mean/inv_std_dev inputs;
  add weight_grad/bias_grad output constraints; addPolicyConstraint pins
  batch dims to full size; serializeTilingSolution updated accordingly.

FloatGEMMTileConstraint:
- Skip bias DMA when bias tensor is not present in the tiling solution
  (e.g. Constant bias kept entirely in L2), fixing KeyError on small biases.

Tiling code generation infrastructure:
- SingleBufferingTilingCodeGeneration: add direction suffix (_ref_in/_ref_out)
  to hoisted external buffer references to avoid collisions for in-place
  buffers appearing in both ingress and egress schedules; use combined
  tensorMemoryConstraints for egress lookup so input-side constraints
  (e.g. accum_buffer) are visible during output DMA generation.
- TilingCodeGeneration: add direction suffix to per-tile opRepr variable
  names to prevent ingress/egress collisions for the same tensor.
- TilingVariableReplacement: skip re-hoisting a reference already created
  on the input pass (in-place buffers); fix _updateReferenceTemplate to use
  pointer arithmetic (base + idx) instead of dereference.
- MemoryConstraintFlows: skip kill-set tracking for ConstantBuffers.
- TileConstraint: print diagnostic info before the single-output assertion.
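The `_updateReferenceTemplate` fix above (pointer arithmetic instead of dereference) can be shown with a hypothetical snippet; the names are invented, and the real code is a generated Deeploy template, not a hand-written function. The bug class is that a hoisted tile reference must be the address `base + idx`, not the value `*(base + idx)`.

```c
#include <stddef.h>

/* Correct form: a tile reference is an address offset from the base. */
static float *tile_ref(float *base, size_t idx) {
    return base + idx;
    /* A buggy dereference, *(base + idx), would yield the element's
       value where the DMA setup code expects an address. */
}
```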

Test infrastructure:
- execution.py: prepend /usr/bin to PATH so gapy resolves the correct Python;
  add tiled-training branch (config.training and config.tiling) that invokes
  testMVPTraining.py and generateOptimizerNetwork.py, then reads back
  training_meta.json for n_train_steps / n_accum_steps.
- deeploytraintest.c: remove debug scaffolding left from liveness investigation.
…ansformer/tinytransformer_train --n-accum 4"
Implements the backward pass of MaxPool2D (MaxPoolGrad) following the
same architecture as AveragePoolGrad. The gradient is scattered only
to the argmax position in each pooling window (re-computed from the
original forward input), unlike AveragePoolGrad which distributes
uniformly.
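The zero-init plus argmax-scatter idea can be sketched in one dimension for a single channel. This is a simplification with made-up names; the real kernel (`PULP_MaxPoolGrad2d_fp32_fp32_HWC`) operates on 2-D HWC tensors and is parallelized across cores. As described above, the argmax is re-computed from the original forward input and the incoming gradient is routed only to that position.

```c
#include <stddef.h>

/* 1-D MaxPool backward sketch: non-overlapping windows of size `window`
   over `n` inputs. grad_out has one entry per window. */
static void maxpool_grad_1d(const float *grad_out, const float *fwd_in,
                            float *grad_in, size_t n, size_t window) {
    for (size_t i = 0; i < n; ++i)
        grad_in[i] = 0.0f;                      /* zero-init */
    for (size_t o = 0; o * window < n; ++o) {
        size_t base = o * window, argmax = base;
        for (size_t k = 1; k < window && base + k < n; ++k)
            if (fwd_in[base + k] > fwd_in[argmax])
                argmax = base + k;              /* re-compute argmax */
        grad_in[argmax] += grad_out[o];         /* scatter to argmax only */
    }
}
```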

New files:
- TargetLibraries/PULPOpen/inc/kernel/MaxPool.h: declare PULP_MaxPoolGrad2d_fp32_fp32_HWC
- TargetLibraries/PULPOpen/src/MaxPool.c: implement MaxPoolGrad kernel
  (zero-init + argmax scatter, channel-parallel across cores)
- Deeploy/Targets/PULPOpen/TileConstraints/MaxPoolGradTileConstraint.py:
  MaxPoolGradCTileConstraint (channel-tiling for 3 tensors:
  grad_output, original_input, grad_input)
- Deeploy/Targets/PULPOpen/Templates/FloatMaxPoolTemplate.py: add
  referenceGradTemplate calling the new kernel

Modified files:
- Generic/Parsers.py: MaxPoolGradParser (2 inputs, 1 output, same attrs as MaxPool)
- Generic/Layers.py: MaxPoolGradLayer
- Generic/TypeCheckers.py: MaxPoolGradChecker (2 float32 in, 1 float32 out)
- PULPOpen/Bindings.py: PULPMaxPoolGrad2DBindings
- PULPOpen/Tiler.py: PULPMaxPoolGrad2DTilingReadyBindings
- PULPOpen/Platform.py: MaxPoolGrad2DMapper + 'MaxPoolGrad' in PULPMapping

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@runwangdl runwangdl marked this pull request as draft March 12, 2026 01:30
@runwangdl runwangdl self-assigned this Mar 12, 2026
Bump pulp-trainlib to 37f70e5 (CNNTiling):
- DW ConvGradW/X: padded kernels for non-zero padding or stride != 1
- im2col: fix early-return bug blocking stride > 1 weight gradients
- Conv2D bw param grads: pass actual padding to im2col (was zero)
