Yoshiki Fujiwara(藤原善基)@AWS Community Builder

Posted on May 12

Operational Hardening, CI-Grade Validation, and Pattern C → B Hybrid: FSx for ONTAP S3 Access Points, Phase 8

#aws #amazonfsxfornetappontap #serverless #devops

TL;DR

This is Phase 8 of the FSx for ONTAP S3 Access Points serverless pattern library. Building on Phase 7, Phase 8 delivers:

Operational hardening: a Python-based cleanup_generic_ucs.py that handles Athena WorkGroup, S3 versioned buckets, and VPC Endpoint SG inbound rules end-to-end, replacing the bash script that was tripping over three distinct failure modes in Phase 7 cleanup
OutputWriter.put_stream API for artifacts over 5 GB using S3 MultipartUpload with abort-on-failure — single-part S3 PutObject has a 5 GB limit, and Phase 8 adds multipart support to unlock future VFX render, raw FASTQ, and large GeoTIFF outputs
Pattern C → Pattern B migration completion: UC6/UC7/UC8 complete the handler-side Pattern B+C hybrid migration — Athena stays on STANDARD_S3 (AWS spec constraint), while AI/ML output handlers now use OutputWriter.from_env() and are ready for FSXN_S3AP routing; UC9/UC13/UC14 complete full Pattern B migration
VPC Endpoint SG automation via Custom Resource — no more manual inbound rule edits when new UCs deploy or tear down
17-UC DevSecOps validator suite with GitHub Actions CI: lint_all_templates.sh (cfn-lint), check_handler_names.py (pyflakes), check_conditional_refs.py (UC9-class bug detector), check_s3ap_iam_patterns.py (S3AP IAM alias-only detector added after Phase 8 Theme D surfaced 4 more instances), and check_python_quality.py — running on every push and PR via .github/workflows/phase8-validators.yml
Observability baseline reference implementation: UC1 now has an EventBridge failure notification rule, with an observability design doc and three operational runbooks (alarm-response, sfn-rerun, cost-monitoring) defining the rollout model for the remaining UCs
UI/UX screenshot coverage: 104 demo-guide variants updated in Phase 8 (13 UCs × 8 languages), bringing total coverage to all 17 UCs × 8 languages = 136 demo-guide files, masked with v7 OCR redaction (0 leaks)
Template management consolidation (template.yaml vs template-deploy.yaml duplication decision) and 75 unused imports cleaned up across the codebase

All deployable AWS runtime features remain opt-in via CloudFormation Conditions; the default deploy mode keeps legacy behavior bit-for-bit identical. Phase 8 is about making the library easier to operate, easier to validate, and easier to extend — not about adding new industry UCs.

In short: Phase 7 proved the pattern works across 17 industries. Phase 8 makes it safe to run in production without a human watching the console.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Why operational hardening was the right next phase

Phase 7 closed with three kinds of paper cuts that kept surfacing during real cleanup and redeployment:

Cleanup got stuck. Every time we tore down a UC stack, at least one of three things went wrong: an Athena WorkGroup was non-empty, an S3 bucket still held versioned objects, or a VPC Endpoint Security Group held references to Lambda SGs that were already gone. The bash cleanup script had to be patched reactively each time.
Silent IAM failures weren't caught by unit tests or cfn-lint. The Phase 7 cross-validation sweep found 9 UCs with alias-only S3 Access Point IAM statements. During Phase 8 Theme D screenshot capture, 4 more UCs surfaced the same bug class on a different Lambda (Discovery). The tooling to detect this at commit time did not exist.
Five-GB artifact ceiling. The S3 API limits single-part PutObject to 5 GB — FSx for NetApp ONTAP S3 Access Points follow the same boundary. Any UC that eventually generates a render frame sequence, a raw FASTQ, or a merged GeoTIFF mosaic above that size would break silently when migrated from STANDARD_S3 to FSXN_S3AP. Phase 7 used single-part put_bytes / put_json only.

These aren't new-feature problems. They're operational problems — the kind that decide whether a pattern library gets adopted or gets quietly abandoned after the first redeploy. Phase 8 is the sprint that resolves them.

Theme A: Python cleanup with three-failure-mode handling

The bash script in Phase 7 (scripts/cleanup_generic_ucs.sh) had three known blind spots. Phase 8 rewrote it as scripts/cleanup_generic_ucs.py (with the bash version kept as a thin wrapper for CI/CD compatibility):

Failure Mode	Phase 7 Behavior	Phase 8 Behavior
Athena WorkGroup non-empty	`aws athena delete-work-group` failed; user had to manually delete queries	`delete-work-group --recursive-delete-option` automatically removes queries, then deletes the WorkGroup
S3 bucket with versioned objects	`aws s3 rb` silently no-op'd; stack deletion stuck	`delete-objects` for versioned objects + delete markers in two batches, then `rb`
VPC Endpoint SG with lingering inbound rules	CloudFormation rollback with "dependent object" error; user had to manually `revoke-security-group-ingress`	Script queries the VPC Endpoint SG, identifies rules pointing at the UC's Lambda SG, and revokes only those
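
The versioned-bucket row can be sketched as follows. This is a minimal illustration, not the code in cleanup_generic_ucs.py: the function names are hypothetical, and a boto3 S3 client is assumed to be passed in. It relies on two documented S3 behaviors: list_object_versions returns Versions and DeleteMarkers as separate lists, and DeleteObjects accepts at most 1,000 keys per request.

```python
from typing import Dict, Iterable, Iterator, List

MAX_KEYS_PER_DELETE = 1000  # per-request limit of the S3 DeleteObjects API


def iter_delete_batches(
    entries: Iterable[Dict[str, str]], batch_size: int = MAX_KEYS_PER_DELETE
) -> Iterator[List[Dict[str, str]]]:
    """Yield DeleteObjects payload chunks containing only Key and VersionId."""
    batch: List[Dict[str, str]] = []
    for entry in entries:
        batch.append({"Key": entry["Key"], "VersionId": entry["VersionId"]})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def empty_versioned_bucket(s3_client, bucket: str, dry_run: bool = False) -> int:
    """Remove all object versions, then all delete markers. Returns the entry count.

    In dry-run mode nothing is deleted; the return value is the planned count.
    """
    handled = 0
    paginator = s3_client.get_paginator("list_object_versions")
    # Two passes: object versions first, delete markers second.
    for field in ("Versions", "DeleteMarkers"):
        for page in paginator.paginate(Bucket=bucket):
            for batch in iter_delete_batches(page.get(field, [])):
                if not dry_run:
                    s3_client.delete_objects(
                        Bucket=bucket, Delete={"Objects": batch, "Quiet": True}
                    )
                handled += len(batch)
    return handled
```

Only after this returns would the bucket itself be removed; a plain bucket delete is rejected while any version or delete marker remains.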

The Python rewrite also adds:

--dry-run flag that prints the delete plan without executing
Per-step failure isolation (a failure in step 3 doesn't abort steps 4-7)
A final summary listing any resources that couldn't be deleted, with recommended recovery steps

Tests: scripts/tests/test_cleanup_generic_ucs.py uses moto to mock the AWS clients and covers success/failure paths for every step, plus a dry-run snapshot test.
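
Per-step failure isolation and the dry-run plan can be pictured with a small runner like the one below. It is a sketch under assumptions: CleanupReport and run_cleanup are hypothetical names, and each step is modeled as a plain callable rather than the script's real step objects.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CleanupReport:
    succeeded: List[str] = field(default_factory=list)
    failed: List[Tuple[str, str]] = field(default_factory=list)  # (step, error text)
    planned: List[str] = field(default_factory=list)  # populated only in dry-run mode


def run_cleanup(
    steps: List[Tuple[str, Callable[[], None]]], dry_run: bool = False
) -> CleanupReport:
    """Run every step in order; one failing step never aborts the later ones."""
    report = CleanupReport()
    for name, action in steps:
        if dry_run:
            # Dry run: record what would happen, execute nothing.
            report.planned.append(name)
            continue
        try:
            action()
            report.succeeded.append(name)
        except Exception as exc:
            # Isolate the failure and keep going; it is reported in the summary.
            report.failed.append((name, f"{type(exc).__name__}: {exc}"))
    return report
```

The failed list is what a final summary would be built from, with one recovery hint per step name.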

Failure mode that surfaced only after the rewrite

Even after the Python rewrite, a fourth failure mode appeared during Phase 8 Theme D cleanup: some UCs had an fsxn-<uc>-demo-output-<ACCOUNT_ID> bucket where <ACCOUNT_ID> was a literal placeholder string (a leftover from an older redaction pass). The cleanup script constructed the bucket name from the template, found a non-existent bucket, and moved on — but the stack then failed to delete because CloudFormation's own AWS::S3::Bucket resource was still trying to delete the real bucket named with the actual account ID. This was not the same bug as the Phase 7 literal placeholder issue; it was a resource-discovery mismatch. The script was still trusting template-derived bucket names instead of CloudFormation's actual resource inventory. The fix: cross-reference the CFN stack's actual resources before constructing bucket names from templates.

This is documented in docs/operational-runbooks/cleanup-troubleshooting.md alongside the three original failure modes.
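
A minimal sketch of that cross-reference, assuming a boto3 CloudFormation client: instead of deriving bucket names from the template, read the physical IDs that CloudFormation reports for the stack. The helper names are hypothetical; list_stack_resources and its StackResourceSummaries fields are the standard CloudFormation API.

```python
from typing import Dict, Iterable, List


def buckets_from_stack_resources(resources: Iterable[Dict[str, str]]) -> List[str]:
    """Pick physical bucket names out of CloudFormation resource summaries."""
    return [
        r["PhysicalResourceId"]
        for r in resources
        if r.get("ResourceType") == "AWS::S3::Bucket"
        and r.get("PhysicalResourceId")
        and r.get("ResourceStatus") != "DELETE_COMPLETE"
    ]


def list_stack_buckets(cfn_client, stack_name: str) -> List[str]:
    """Return the buckets a stack actually owns, as CloudFormation sees them."""
    names: List[str] = []
    paginator = cfn_client.get_paginator("list_stack_resources")
    for page in paginator.paginate(StackName=stack_name):
        names.extend(buckets_from_stack_resources(page["StackResourceSummaries"]))
    return names
```

With this approach a stale placeholder in a template cannot send the cleanup script to a bucket that does not exist.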

Operational hardening is also FinOps hardening: orphaned multipart uploads, versioned S3 buckets, stale networking dependencies, and undeleted stacks all become cost or operational leakage if cleanup is not reliable. The --dry-run flag lets operators preview what will be deleted before committing.

Theme B: VPC Endpoint SG automation

The manual workaround in Phase 7 was: after a new UC deploys, go to the EC2 console, find the VPC Endpoint SG, add an inbound rule for port 443 from the UC's Lambda SG, and on cleanup, remember to revoke it. Everyone forgot at least once.

Phase 8 replaces this with a shared-infra CloudFormation Custom Resource:

VpcEndpointSgRule:
  Type: Custom::VpcEndpointSgInboundRule
  Properties:
    ServiceToken: !ImportValue SharedInfra-VpcEndpointSgManagerLambdaArn
    VpcEndpointSgId: !ImportValue SharedInfra-VpcEndpointSgId
    SourceSgId: !Ref LambdaSecurityGroup
    FromPort: 443
    ToPort: 443
    Protocol: tcp
    Description: !Sub "Allow ${AWS::StackName} Lambda SG to VPC Endpoints"

The custom resource Lambda (shared/vpc_endpoint_sg_manager/handler.py) handles three lifecycle events:

Create → authorize-security-group-ingress with the UC's Lambda SG as source
Update → revoke old rule, authorize new rule
Delete → revoke the rule (idempotent — silently succeeds if the SG or rule is already gone)
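
The three lifecycle events could be dispatched roughly as below. This is a sketch, not the contents of shared/vpc_endpoint_sg_manager/handler.py: handle_event and the ALREADY_GONE set are assumptions, the EC2 client is injected, and the step that reports success or failure back to CloudFormation's pre-signed ResponseURL is omitted for brevity.

```python
from typing import Any, Dict

# EC2 error codes treated as "already gone" on revoke (an assumed set; extend as needed).
ALREADY_GONE = {"InvalidPermission.NotFound", "InvalidGroup.NotFound"}


def _permission(props: Dict[str, Any], with_description: bool) -> Dict[str, Any]:
    """Translate custom-resource properties into an EC2 IpPermissions entry."""
    pair = {"GroupId": props["SourceSgId"]}
    if with_description and props.get("Description"):
        pair["Description"] = props["Description"]
    return {
        "IpProtocol": props["Protocol"],
        # CloudFormation passes custom-resource property values as strings.
        "FromPort": int(props["FromPort"]),
        "ToPort": int(props["ToPort"]),
        "UserIdGroupPairs": [pair],
    }


def _authorize(ec2, props: Dict[str, Any]) -> None:
    ec2.authorize_security_group_ingress(
        GroupId=props["VpcEndpointSgId"], IpPermissions=[_permission(props, True)]
    )


def _revoke(ec2, props: Dict[str, Any]) -> None:
    """Revoke the rule; succeed quietly if the rule or the SG no longer exists."""
    try:
        ec2.revoke_security_group_ingress(
            GroupId=props["VpcEndpointSgId"], IpPermissions=[_permission(props, False)]
        )
    except Exception as exc:
        # botocore's ClientError exposes the service error code under .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code not in ALREADY_GONE:
            raise


def handle_event(event: Dict[str, Any], ec2) -> str:
    """Apply one Create/Update/Delete request and return a physical resource ID."""
    props = event["ResourceProperties"]
    request_type = event["RequestType"]
    if request_type == "Create":
        _authorize(ec2, props)
    elif request_type == "Update":
        _revoke(ec2, event["OldResourceProperties"])
        _authorize(ec2, props)
    elif request_type == "Delete":
        _revoke(ec2, props)
    else:
        raise ValueError(f"unexpected RequestType: {request_type}")
    return f"{props['VpcEndpointSgId']}|{props['SourceSgId']}"
```

Making Delete tolerant of missing rules matters because a stack rollback can issue Delete for a rule that was never created.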

Why Option A over Option B

Two designs were compared:

Option A (selected): CloudFormation Custom Resource + Lambda
Option B: CDK Construct

Option A won because the entire codebase is CloudFormation YAML today, and a single Custom Resource integrates naturally without forcing a migration to CDK. The goal was not to introduce a new infrastructure framework, but to remove one manual step from the existing CloudFormation-based workflow. The tradeoffs are documented in docs/vpc-endpoint-sg-automation-design.md. A CDK migration is not on the Phase 8 roadmap — it's been parked as a Phase 9+ consideration if the YAML surface becomes unmanageable.

Integration into existing UCs is opt-in via an EnableVpcEndpointSgRule CloudFormation parameter that defaults to false. Phase 1-7 stacks are unaffected unless the user opts in.

Theme D: 17 UCs × 8 languages of UI/UX screenshots

Phase 7 shipped Step Functions Graph SUCCEEDED screenshots for every UC. Phase 8 Theme D captured the UI/UX screens end users actually see — S3 output buckets, DynamoDB tables, SNS notification topics, Bedrock Markdown reports — and embedded them into all 8 language variants of each UC's demo-guide.md.

The capture was done in four coordinated batches between the two parallel Kiro sessions (A: docs, B: AWS deploy + capture):

Batch	UCs	Outcome	Notes
1	UC15 / UC16 / UC17	All SUCCEEDED	11 screenshots total; UC17 Bedrock Markdown report landed on FSx ONTAP in FSXN_S3AP mode
2	UC2 / UC9	All SUCCEEDED	UC2 parallel Map processed 18+ documents in 16.4s; UC9 SkipInference Pass state worked as designed
3	UC3 / UC5 / UC7 / UC8	UC3/UC5 SUCCEEDED, UC7/UC8 FAILED	UC7/UC8 surfaced a new bug class: IAM alias-only on the Discovery Lambda role
4	UC4 / UC10 / UC12 / UC13	All SUCCEEDED	UC13 surfaced the same IAM bug on its Discovery role — caught and fixed inline

Total: 104 demo-guide variants newly updated in Phase 8 (13 UCs × 8 languages = 104, since UC1/6/11/14 already had embedded screenshots from Phase 1-5), bringing final UI/UX screenshot coverage to 136 demo-guide files (17 UCs × 8 languages). All screenshots masked with v7 OCR redaction (lang="eng+jpn"), verified by scripts/_check_sensitive_leaks.py at 0 leaks.

The new IAM bug class found during screenshot capture

UC7 (genomics-pipeline) and UC8 (energy-seismic) Discovery Lambdas both failed with AccessDenied on their first real invocation. The Discovery role had s3:PutObject granted against the S3 Access Point alias ARN only:

# BEFORE — alias-only (silently broken at runtime)
- Sid: S3AccessPointWrite
  Effect: Allow
  Action: s3:PutObject
  Resource: !Sub "arn:aws:s3:::${S3APAlias}/*"

The full Access Point ARN form (arn:aws:s3:<region>:<account>:accesspoint/<name>/object/*) was missing. AWS IAM evaluates S3 Access Point API requests against both forms, so alias-only statements fail at runtime but pass cfn-lint. This is the same bug class Phase 7 Theme Q caught on 9 UCs — but for a different Lambda role (Discovery vs downstream processors). The Phase 7 sweep focused on OutputWriter-using Lambdas; Discovery Lambdas write a manifest JSON at the start of the workflow and had the same alias-only pattern from an earlier template version.

The fix, applied to UC7/UC8/UC10/UC12 (and later UC13):

# AFTER — alias + full ARN with conditional
- Sid: S3AccessPointWrite
  Effect: Allow
  Action: s3:PutObject
  Resource:
    - !Sub "arn:aws:s3:::${S3APAlias}/*"
    - !If
      - HasS3AccessPointName
      - !Sub "arn:aws:s3:${AWS::Region}:${AWS::AccountId}:accesspoint/${S3AccessPointName}/object/*"
      - !Ref "AWS::NoValue"
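
The before/after difference is mechanical enough to check statically. The sketch below is not check_s3ap_iam_patterns.py itself; it is a hypothetical reduction of the idea. It assumes the template has already been parsed and that !Sub and !If values have been flattened into plain strings, and it treats S3APAlias as the only alias parameter name.

```python
from typing import Any, Dict, Iterable, List, Sequence


def find_alias_only_statements(
    statements: Iterable[Dict[str, Any]], alias_params: Sequence[str] = ("S3APAlias",)
) -> List[str]:
    """Return the Sids of statements that use the alias form without the full ARN."""
    findings: List[str] = []
    for stmt in statements:
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        strings = [r for r in resources if isinstance(r, str)]
        # Alias form: a bucket-style ARN built from the alias parameter.
        uses_alias = any(f"${{{p}}}" in r for r in strings for p in alias_params)
        # Full form: an Access Point ARN containing ":accesspoint/".
        has_full_arn = any(r.startswith("arn:") and ":accesspoint/" in r for r in strings)
        if uses_alias and not has_full_arn:
            findings.append(stmt.get("Sid", "<no Sid>"))
    return findings
```

Run against the BEFORE statement this reports S3AccessPointWrite; against the AFTER statement (with the conditional flattened) it reports nothing.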

Theme H: Pattern C → Pattern B migration — hybrid for Athena workloads

Phase 7 unified the OutputDestination API across 13 of 17 UCs. The remaining work in Phase 8 fell into two categories:

Athena-constrained Pattern C workloads: UC6/UC7/UC8, where Athena results must remain on STANDARD_S3 because Athena's OutputLocation does not accept S3 Access Point ARNs (tracked as FR-2 in docs/aws-feature-requests/).
Full Pattern B completion: UC9/UC13/UC14, where non-Athena AI/ML outputs could be migrated fully to OutputWriter routing.

Phase 8 Theme H completed the handler-side migration for these workloads. Athena-result-writing paths remain on STANDARD_S3, while AI/ML output handlers now use OutputWriter.from_env() and are ready for FSXN_S3AP routing. For UC6/UC7/UC8, the template-level OutputDestination parameters are intentionally deferred to Phase 9; Phase 8 proves the code path and removes direct put_object usage.

Athena-constrained hybrid UCs (UC6/UC7/UC8): Athena results remain on STANDARD_S3 (AWS spec constraint — Athena's OutputLocation does not accept S3 Access Point ARNs). AI/ML outputs use OutputWriter.from_env() and are ready for FSXN_S3AP routing once the template-level parameters ship in Phase 9.

Full Pattern B migrations (UC9/UC13/UC14): No Athena OutputLocation constraint. All AI/ML outputs move to OutputWriter routing with full OutputDestination parameter support.

UC	Athena-writing Lambdas (stay Pattern C)	AI-writing Lambdas (moved to Pattern B)
Athena-constrained hybrid (B+C)
UC6 semiconductor-eda	(Glue catalog via separate path)	`report_generation`, `metadata_extraction` ✅
UC7 genomics-pipeline	Variant aggregation to Glue table	`summary`, `qc`, `variant_aggregation` ✅
UC8 energy-seismic	SEG-Y metadata to Glue table	`compliance_report`, `anomaly_detection`, `seismic_metadata` ✅
Full Pattern B migration (no Athena)
UC9 autonomous-driving	(no Athena) — full Pattern B	`sagemaker_invoke` / annotation pipeline ✅
UC13 education-research	(no Athena) — full Pattern B	All 4 Lambdas: `ocr`, `classification`, `metadata`, `citation_analysis` ✅
UC14 insurance-claims	(no Athena) — full Pattern B	All migrated Lambdas verified ✅

UC9 was completed in Theme I because its OutputWriter migration depended on the Theme Q rework from Phase 7. The s3_client.put_object call in the sagemaker_invoke / annotation pipeline was replaced with OutputWriter.from_env().put_bytes(...), and all 104 UC9 unit tests pass.

Design doc: docs/design-pattern-c-to-b-hybrid.md (written at the start of Phase 8, approved 2026-05-12).

Pattern B+C hybrid trade-off

For UC6/UC7/UC8, the hybrid behavior is code-ready in Phase 8 and becomes operator-selectable once Phase 9 adds the template-level OutputDestination parameters. Even then, Athena query results remain on STANDARD_S3 by AWS service constraint. The demo-guide and output-destination-patterns.md now explicitly state: "Athena results bucket is always STANDARD_S3 regardless of OutputDestination". Users who need end-to-end FSx ONTAP residency should not use Athena — they should either disable the Athena-using step or migrate to a different query layer (e.g., direct DynamoDB).

For regulated deployments, the hybrid Pattern B+C model should be reviewed carefully: AI/ML artifacts can land on FSx ONTAP, but Athena query results remain in STANDARD_S3 by service constraint.
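
The hybrid rule fits in a few lines. The sketch below is illustrative only: resolve_bucket, the artifact-kind strings, and the argument names are hypothetical and are not the OutputWriter API. It relies on the fact that S3 API calls accept an Access Point alias wherever a bucket name is expected.

```python
from enum import Enum
from typing import Optional


class OutputDestination(str, Enum):
    STANDARD_S3 = "STANDARD_S3"
    FSXN_S3AP = "FSXN_S3AP"


def resolve_bucket(
    artifact_kind: str,
    destination: str,
    output_bucket: str,
    s3ap_alias: Optional[str] = None,
) -> str:
    """Return the value to pass as Bucket for one artifact.

    artifact_kind is a hypothetical label: "athena_result" or "ai_output".
    """
    if artifact_kind == "athena_result":
        # Athena results always stay on the standard S3 bucket.
        return output_bucket
    if OutputDestination(destination) is OutputDestination.FSXN_S3AP:
        if not s3ap_alias:
            raise ValueError("FSXN_S3AP selected but no Access Point alias configured")
        return s3ap_alias
    return output_bucket
```

The decision is made per artifact, not per UC, which is what lets one workflow write Athena results to S3 and AI/ML outputs to FSx ONTAP.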

Pattern classification summary (post-Phase 8)

Pattern	Source	AI/ML Output	Athena Output	Best for
A	FSx S3AP	FSx S3AP	N/A	Original S3AP-only form; mostly superseded by Pattern B after Phase 7/8
B	FSx S3AP	Switchable (STANDARD_S3 or FSXN_S3AP)	N/A	Unified AI/ML artifact routing (UC1-5 and UC9-17; UC6/7/8 use the B+C hybrid)
C	S3 / FSx	S3	S3	Legacy S3-only / Athena-constrained workloads
B+C Hybrid	FSx S3AP	Ready for FSXN_S3AP	S3 (always)	Athena + AI/ML mixed workloads (UC6/7/8)

Theme I: OutputWriter.put_stream for > 5 GB artifacts

Single-part S3 PutObject has a 5 GB object-size limit; FSx ONTAP S3 Access Points follow the same S3 API boundary for single-part uploads. Phase 7 OutputWriter used single-part uploads via put_bytes / put_json / put_text. That was fine for Bedrock Markdown reports, JSON manifests, and OCR results — none exceed 5 GB.

Phase 8 adds MultipartUpload support through OutputWriter.put_stream:

from shared.output_writer import OutputWriter

writer = OutputWriter.from_env()

# For small artifacts (< 100 MB): existing API unchanged
writer.put_bytes("reports/summary.pdf", pdf_bytes, content_type="application/pdf")

# For large artifacts (> 5 GB): new streaming API
with open("/tmp/final_composite.mp4", "rb") as fh:
    writer.put_stream(
        "renders/final_composite.mp4",
        fh,
        content_type="video/mp4",
        part_size_mb=100,  # default 5 MB, tune for throughput
    )

Implementation details:

Uses boto3's create_multipart_upload / upload_part / complete_multipart_upload
On any part-upload failure, calls abort_multipart_upload to prevent orphaned part storage (otherwise S3 keeps them and charges for them indefinitely)
Retry logic: 3 attempts per part with exponential backoff
Progress callback (optional): on_progress=lambda done, total: ... for long-running uploads
Works identically with OutputDestination=STANDARD_S3 and OutputDestination=FSXN_S3AP
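
The control flow behind those bullets can be sketched as follows. This is not the body of OutputWriter.put_stream: put_stream_sketch and its parameters are hypothetical, the S3 client is injected, and the progress callback is left out. The four client methods are the standard boto3 multipart calls named above.

```python
import time
from typing import BinaryIO, Callable, Dict, List


def put_stream_sketch(
    s3,
    bucket: str,
    key: str,
    stream: BinaryIO,
    part_size: int = 5 * 1024 * 1024,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload stream as a multipart object; abort the upload on any failure.

    Assumes stream.read(n) returns n bytes until EOF (true for regular files),
    so every part except the last has the full part_size.
    """
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts: List[Dict[str, object]] = []
    try:
        part_number = 1
        while True:
            chunk = stream.read(part_size)
            if not chunk:
                break
            for attempt in range(1, max_attempts + 1):
                try:
                    resp = s3.upload_part(
                        Bucket=bucket, Key=key, UploadId=upload_id,
                        PartNumber=part_number, Body=chunk,
                    )
                    break
                except Exception:
                    if attempt == max_attempts:
                        raise
                    sleep(2 ** (attempt - 1))  # exponential backoff: 1s, 2s, ...
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
        if not parts:
            raise ValueError("empty stream: nothing to upload")
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Without this, the already-uploaded parts would remain stored and billed.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return len(parts)
```

Injecting sleep keeps the retry path testable without real delays.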

Tests: shared/tests/test_output_writer_multipart.py uses moto S3 with synthetic stream generators and small part sizes to exercise the multipart logic. It represents 1 GB / 5 GB / 10 GB payload scenarios without materializing full payloads in memory, and verifies round-trip correctness and abort-on-failure behavior.

Primary use cases for the new API:

UC4 media-vfx: final_composite.mp4 rendered by Deadline Cloud workers
UC7 genomics-pipeline: raw FASTQ re-publish (Phase 9 candidate, not yet integrated)
UC17 smart-city-geospatial: merged GeoTIFF mosaics from multi-tile ChangeDetection

For Lambda-based producers, put_stream should be paired with timeout and /tmp sizing decisions. Lambda's /tmp defaults to 512 MB (configurable up to 10,240 MB), and the 15-minute maximum timeout caps how long an upload can run, so both constrain how large an artifact can be assembled in place. For very large artifacts, producers should stream from worker-local files or pipe-like iterators rather than materializing the entire payload in memory.
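
Part size is one more sizing decision. The arithmetic below uses the standard S3 multipart limits (at most 10,000 parts, 5 MiB minimum part size for all but the last part). Two assumptions are made: that FSx for ONTAP S3 Access Points apply the same limits, and that part_size_mb is counted in MiB. Neither is stated in the Phase 8 docs, and the helper names are hypothetical.

```python
MIB = 1024 ** 2
MAX_PARTS = 10_000      # S3 multipart upload part-count limit
MIN_PART_SIZE_MIB = 5   # S3 minimum size for every part except the last


def max_object_bytes(part_size_mib: int) -> int:
    """Largest object that fits in MAX_PARTS parts of the given size."""
    return part_size_mib * MIB * MAX_PARTS


def min_part_size_mib(object_bytes: int) -> int:
    """Smallest whole-MiB part size that keeps the upload within MAX_PARTS."""
    bytes_per_part = -(-object_bytes // MAX_PARTS)   # integer ceiling division
    mib_per_part = -(-bytes_per_part // MIB)
    return max(MIN_PART_SIZE_MIB, mib_per_part)
```

Under these assumptions a 5 MiB part size tops out at about 48.8 GiB per object, so artifacts beyond that need a larger part size.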

Theme K: Template management duplication resolved

Phase 7 ended with two parallel templates per UC:

template.yaml — SAM source (used by sam build / sam deploy for local dev)
template-deploy.yaml — raw CloudFormation (used by deploy_generic_ucs.sh for production)

They drifted. During Phase 7 Extended Round, we'd fix a bug in one and forget the other. The fix would appear in production but not in local SAM dev, or vice versa.

Phase 8 Theme K made a deliberate decision: template-deploy.yaml is the source of truth. template.yaml is now auto-generated from template-deploy.yaml by scripts/create_deploy_template.py (extended to understand the OutputDestination conditional logic, which was the main reason template.yaml had been hand-edited). A deprecation header was added to every template.yaml directing contributors to edit template-deploy.yaml.

The decision is documented in docs/template-management-decision.md. The choice to keep template.yaml at all — rather than removing it — preserves the SAM local development workflow that early users relied on. The generated template.yaml is still committed to preserve discoverability and SAM workflows, but it is no longer hand-edited.

Contributor rule: if you contribute a new UC or modify an existing template, edit template-deploy.yaml, run scripts/create_deploy_template.py to regenerate template.yaml, then run the validator suite before opening a PR.
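
One way to enforce the contributor rule is a drift check that compares the committed template.yaml with a freshly regenerated copy. The sketch below is hypothetical: it is not part of the validator suite, and it assumes the generator's output is available as text. How scripts/create_deploy_template.py emits its result is not described here.

```python
import difflib
from typing import List


def template_drift(committed: str, regenerated: str) -> List[str]:
    """Return unified-diff lines; an empty list means the committed file is current."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="template.yaml (committed)",
            tofile="template.yaml (regenerated)",
            lineterm="",
        )
    )
```

A CI job could fail whenever the returned list is non-empty and print the diff as the error message.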

Theme L: 75 unused imports

Nothing interesting happens in this section. autoflake removed 75 unused imports across the codebase. All tests still pass. cfn-lint, pyflakes, and the new check_python_quality.py are happy. The repo is 75 lines shorter.

This matters because a noisy pyflakes output teaches contributors to ignore pyflakes output, which is exactly how UC2's missing import os slipped in during Phase 7. A clean baseline makes the next real warning actionable.

During Theme I, three UC9 test_point_cloud_qc.py tests that had been silently broken by the Theme L import pruning were caught and fixed — another small reminder that "static check is clean" and "tests still pass" are two independently verifiable claims, and both need to hold.

Theme M: DevSecOps validator suite (CI-integrated)

Phase 7 introduced three reusable validators:

scripts/lint_all_templates.sh — parallel cfn-lint across all 17 UC templates
scripts/check_handler_names.py — pyflakes undefined-name sweep across 197 Python files
scripts/check_conditional_refs.py — UC9-class bug detector (Condition ref in Sub)

Phase 8 adds two more:

scripts/check_python_quality.py — broader pyflakes sweep (undefined names, unused imports, unused variables)
scripts/check_s3ap_iam_patterns.py — S3AP IAM alias-only bug detector, added after Phase 8 Theme D surfaced the Discovery Lambda variant of the Phase 7 sweep bug

check_handler_names.py remains focused on Lambda entrypoint safety, while check_python_quality.py runs the broader repository-wide pyflakes quality sweep.

The validator suite runs locally in under 10 seconds for all 17 UCs. Phase 8 Theme M shipped a GitHub Actions workflow (.github/workflows/phase8-validators.yml) that runs the full suite on every push to main and every PR:

s3ap-iam-patterns job: S3AP IAM alias+ARN pattern check
handler-name-check job: pyflakes undefined-name sweep
conditional-refs job: UC9-class Condition ref detector
cfn-lint job: parallel CloudFormation template validation

The workflow runs on ubuntu-latest with Python 3.13, requires no AWS credentials (all checks are static analysis), and completes in under 60 seconds. The gitignored _sensitive_strings.py leak check is excluded from CI (it requires local secrets) — contributors run it manually before committing screenshots.

Current status: 17/17 templates clean, 0 pyflakes critical, 0 conditional-ref issues at HEAD.

Phase 8 validators catch known recurring patterns (alias-only IAM, undefined names, conditional ref bugs), but they are not a replacement for full IAM reasoning. IAM Access Analyzer integration is planned for Phase 9 to complement the current static pattern checks with automated least-privilege validation.

Theme N: Observability baseline and operational runbooks

Phase 7 had no monitoring beyond "look at the Step Functions console." Phase 8 Theme N adds an observability layer, with UC1 as the reference implementation. Alarms are not rolled out to all 17 UCs yet; the design doc and runbooks define the rollout path for Phase 9.

Design

docs/observability-design.md defines the monitoring targets, alarm thresholds, and notification routing for the pattern library. The design covers:

Step Functions: ExecutionsFailed, ExecutionsTimedOut, ExecutionsAborted alarms
Lambda: Errors, Throttles, Duration (P99) alarms per UC
DynamoDB: ReadThrottleEvents, WriteThrottleEvents alarms
Notification routing: SNS Topic → email (dev) or PagerDuty/Slack (production)
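
As an illustration of the Step Functions alarms, the helper below builds the arguments for CloudWatch's put_metric_alarm. The metric names and the AWS/States namespace are the standard Step Functions metrics; the threshold, period, alarm naming, and the function itself are assumptions, not values taken from docs/observability-design.md.

```python
from typing import Any, Dict


def sfn_alarm_kwargs(
    uc_name: str,
    state_machine_arn: str,
    topic_arn: str,
    metric: str = "ExecutionsFailed",
) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for one Step Functions metric."""
    return {
        "AlarmName": f"{uc_name}-{metric}",
        "Namespace": "AWS/States",
        "MetricName": metric,
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,                 # assumed: 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 1,                # assumed: alarm on the first failure
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no executions is not a failure
        "AlarmActions": [topic_arn],
    }
```

Usage would be cloudwatch.put_metric_alarm(**sfn_alarm_kwargs(...)) with a boto3 CloudWatch client, called once per metric (ExecutionsFailed, ExecutionsTimedOut, ExecutionsAborted).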

UC1 EventBridge failure notification (reference implementation)

UC1 (legal-compliance) now has an EventBridge rule that fires when a Step Functions execution ends in FAILED, TIMED_OUT, or ABORTED status and routes the event to an SNS topic. This is the reference implementation for other UCs to adopt:

StepFunctionsFailureRule:
  Type: AWS::Events::Rule
  Properties:
    EventPattern:
      source: ["aws.states"]
      detail-type: ["Step Functions Execution Status Change"]
      detail:
        status: ["FAILED", "TIMED_OUT", "ABORTED"]
        stateMachineArn: [!Ref StateMachine]
    Targets:
      - Arn: !Ref AlertTopic
        Id: sfn-failure-alert
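
To see which events this pattern selects, here is a small matcher covering only the subset of EventBridge pattern syntax used above: nested objects whose leaves are lists of allowed scalar values. It is an illustrative sketch, not EventBridge's full matching logic (no prefix, numeric, or anything-but operators), and the ARNs are placeholders.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Exact-value matching: every pattern key must be present and allowed."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the listed values.
            return False
    return True


STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:uc1-demo"

PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {
        "status": ["FAILED", "TIMED_OUT", "ABORTED"],
        "stateMachineArn": [STATE_MACHINE_ARN],
    },
}


def sample_event(status: str, arn: str = STATE_MACHINE_ARN) -> Dict[str, Any]:
    """Build a trimmed-down Step Functions status-change event."""
    return {
        "source": "aws.states",
        "detail-type": "Step Functions Execution Status Change",
        "detail": {"status": status, "stateMachineArn": arn},
    }
```

Scoping the pattern to stateMachineArn keeps one UC's rule from firing on another UC's failures in the same account.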

Operational runbooks

Three runbooks shipped in docs/operational-runbooks/:

alarm-response.md — triage flowchart for each alarm type, escalation paths, and "is this a real problem or a transient spike?" decision tree
sfn-rerun.md — safe re-execution procedure for failed Step Functions workflows (idempotency considerations, input reconstruction, partial-state recovery)
cost-monitoring.md — per-UC cost breakdown methodology, Bedrock token cost estimation, NAT Gateway cost awareness, and "when to tear down vs keep warm" decision framework

Phase 8 architecture

flowchart TD
  subgraph PreReqs["Shared infrastructure (deployed once)"]
    SharedInfra["vpc-endpoint-sg-manager (Custom Resource Lambda)"]
  end

  subgraph PerUC["Per-UC stack (any of 17 UCs)"]
    UCTemplate["template-deploy.yaml"]
    UCTemplate --> OptIn_VPC["EnableVpcEndpointSgRule=true"]
    OptIn_VPC -.opt-in.-> SharedInfra
    UCTemplate --> OutputDest["OutputDestination param"]
    OutputDest -->|STANDARD_S3| PatternA["S3 bucket"]
    OutputDest -->|FSXN_S3AP| PatternB["FSx ONTAP S3 AP"]
    OutputDest -->|Pattern C/Hybrid| PatternC["Athena bucket + AI→FSxN"]
  end

  subgraph Lifecycle["Operational tooling"]
    Deploy["deploy_generic_ucs.sh"]
    Cleanup["cleanup_generic_ucs.py"]
    Cleanup --> Athena["Athena WorkGroup delete"]
    Cleanup --> Versioned["S3 versioned delete"]
    Cleanup --> SGRevoke["VPC Endpoint SG revoke"]
  end

  subgraph Validators["Pre-commit validators (under 10s total)"]
    V1["lint_all_templates.sh"]
    V2["check_handler_names.py"]
    V3["check_conditional_refs.py"]
    V4["check_python_quality.py"]
    V5["check_s3ap_iam_patterns.py"]
  end

  Deploy --> UCTemplate
  Cleanup --> UCTemplate
  V1 & V2 & V3 & V4 & V5 -.gate.-> Deploy

Parallel Kiro session coordination

Phase 8 was built by two Kiro sessions running in parallel (A: documentation, localization, coordination; B: AWS deploy, verification, implementation) against a shared main branch, using the protocol documented in Phase 7's docs/dual-kiro-coordination.md. The protocol was exercised more intensely in Phase 8 than in Phase 7 because Theme D (screenshot capture) required strict turn-taking on AWS stacks and on the Chrome DevTools MCP browser.

Key coordination patterns used in Phase 8:

Batched push / ack cycles: B deploys UC set → captures screenshots → commits PNGs → push → notifies A. A pulls → embeds ![...]() references into demo-guides → commits → push. Each cycle touched distinct files (B: docs/screenshots/masked/, A: <uc>/docs/demo-guide*.md), so no merge conflicts across 4 batches.
Exclusive region declarations: B owned docs/screenshots/masked/, scripts/check_*.py, and all template-deploy.yaml edits. A owned demo-guide*.md, docs/article-phase*-en.md, and docs/screenshot-capture-checklist.md. Neither session wrote into the other's region without a chat-lock request.
Inline fix-forward: When B surfaced the Discovery Lambda IAM bug during Batch 3 / Batch 4, the fix + new validator (check_s3ap_iam_patterns.py) was applied inline before proceeding, rather than deferred to a follow-up phase. This is the behavior Phase 7 Lesson 15 codified ("deploy and actually run it" surfaces bugs no static check can), applied as a standing rule.

A v2 revision of the coordination protocol (Theme F) is now complete — Appendix D added to docs/dual-kiro-coordination.md with 5 new rules (D-1 through D-5) covering pre-deployment validator gates, shared VPC rules, screenshot lifecycle, test regression after bulk changes, and context transfer format.

Scope clarifications: what was deferred vs. closed in Phase 8

Phase 8 originally had three scope-risk items. By the end of the phase, two were closed and one remains intentionally deferred due to AWS feature availability:

Event-driven trigger E2E — deferred until AWS ships native S3 event support for FSx ONTAP S3 Access Points (tracked as FR-2 in docs/aws-feature-requests/). UC1 has the EventBridge rule + DynamoDB idempotency table deployed and verified (commit 4dbf36b). The rule will activate automatically when FR-2 ships. Manual start-execution remains the primary invocation path.
UC7/UC8 SUCCEEDED re-capture — ✅ Completed (commit 2b958db). IAM fix verified, both UCs SUCCEEDED (UC7 3:03, UC8 2:59), screenshots replaced in all 8 language demo-guides.
UC4 Deadline Cloud full verification — ✅ Completed (commit 5c8283c). Deployed with Deadline Cloud farm/queue, SUCCEEDED in 1:06 (Discovery → RenderAssets Map → SubmitJob → WaitForCompletion → QualityCheck → NotifyCompletion). OutputWriter migration applied to quality_check/handler.py. Deadline Cloud console screenshot captured.

Full stats

Code

New Python modules: shared/output_writer.py::put_stream, shared/vpc_endpoint_sg_manager/handler.py, scripts/cleanup_generic_ucs.py (replacing bash script with thin wrapper)
Unit tests: 982 total, all passing at HEAD. This includes new tests for put_stream multipart round-trip, cleanup_generic_ucs mocked AWS paths, and vpc_endpoint_sg_manager lifecycle, plus per-UC regression tests for Theme I/O migrations (UC4 24/24, UC6 43/43, UC7 54/54, UC8 45/45, UC9 104/104, UC13 20/20, UC14 13/13, shared/ 362/362)
CloudFormation: Custom Resource added to opt-in UCs; template-deploy.yaml is the source of truth; template.yaml is auto-generated with a deprecation header
Validation scripts: 5 total — lint_all_templates.sh, check_handler_names.py, check_conditional_refs.py, check_python_quality.py, check_s3ap_iam_patterns.py; all pass 17/17 at HEAD
Cleanup: 75 unused imports removed; all direct s3_client.put_object calls in non-discovery AI/ML output handlers eliminated
OutputWriter unification: every non-discovery output handler in all 17 UCs now uses OutputWriter.from_env(); Discovery manifest writers use a separate s3ap_output.put_object helper and are covered by the S3AP IAM validator

Screenshots

Batch 1-4 captured: 4 coordinated batches between A/B sessions + UC1/UC4 Phase 8 verification captures
Demo-guide updates: 104 files newly updated in Phase 8 (13 UCs × 8 languages); final coverage is 136 files (17 UCs × 8 languages) — UC1/6/11/14 already had embedded screenshots from Phase 1-5
Multi-language sync: 119 demo-guide files re-translated from JP source (17 UCs × 7 target languages; Japanese originals excluded)
v7 OCR mask: applied to all Phase 8 screenshots; _check_sensitive_leaks.py 0 leaks across 160 images
UI/UX coverage: every UC now has at least one UI/UX screenshot in its demo-guide (not just the Step Functions Graph view)

Documentation

Cross-cutting docs added or updated: docs/vpc-endpoint-sg-automation-design.md, docs/design-output-writer-multipart.md, docs/design-pattern-c-to-b-hybrid.md, docs/template-management-decision.md, docs/event-driven/architecture-design.md, docs/operational-runbooks/cleanup-troubleshooting.md, docs/operational-runbooks/deployment-troubleshooting.md, docs/verification-results-phase8-uc15-17.md
Phase 8 architecture: consolidated in docs/phase8-architecture.md (referenced from the main README)

AWS verification

Phase 8 deployments: All 17 UCs deployed and verified across multiple batches; UC1/UC4/UC7/UC8 re-verified after code changes
Step Functions SUCCEEDED: UC1 (2:38:20, 549 files), UC4 (1:06, Deadline Cloud), UC7 (3:03), UC8 (2:59), UC2/3/5/9/10/12/13/15/16/17 all green
IAM issues caught and fixed: 5 UCs (UC7/UC8/UC10/UC12/UC13) received the Discovery Lambda IAM dual-format fix; validator prevents recurrence
Sensitive leaks: 0 across 160 masked images (UC4 pre-existing leak resolved by removing Phase 7 screenshot)

Looking Forward to Phase 9

Phase 8 closed all 15 themes — operational hardening, CI/CD, observability, OutputWriter unification, event-driven triggers, sample data generators, Deadline Cloud verification, and multi-language documentation sync. Phase 9 is not carrying Phase 8 leftovers; it scales the Phase 8 baseline outward into production-scale rollouts and AWS-feature-dependent E2E validations:

Event-driven trigger E2E verification: When AWS ships FR-2 (native S3 events on FSx ONTAP S3 APs), UC1's EventBridge rule activates automatically. Phase 9 will verify the full S3 PutObject → EventBridge → Step Functions path end-to-end.
Observability rollout to remaining UCs: UC1 is the reference implementation; Phase 9 adds EnableObservability to all 17 UC templates with per-UC alarm thresholds.
Template-level OutputDestination parameters for UC6/7/8: Handler-side OutputWriter migration is complete; Phase 9 adds the CloudFormation parameters so operators can switch output destination at deploy time without editing Lambda environment variables directly.

And two process improvements:

IAM Access Analyzer integration: automated least-privilege validation beyond the current pattern-matching approach
CDK migration evaluation: if the YAML surface becomes unmanageable, Phase 9 will prototype a CDK equivalent for one UC

Who should care about Phase 8?

Platform teams get repeatable cleanup, CI-grade validators, and template source-of-truth rules.
Security teams get S3AP IAM pattern detection, screenshot leak prevention, and clearer operational boundaries.
Data teams get a realistic Pattern B+C hybrid for Athena-constrained workloads without sacrificing AI/ML artifact routing.
AI/ML teams get a unified OutputWriter path and multipart streaming for large artifacts.
FinOps teams get reliable cleanup that prevents orphaned-resource cost leakage.
AWS partners and SIs get a reusable delivery baseline: deploy, validate, demonstrate, clean up, and hand over with runbooks.

Conclusion

Phase 7 proved a single pattern — serverless AI/ML with S3-compatible access to FSx for NetApp ONTAP as the system of record — applies across 17 industries and 8 languages. Phase 8 turned that pattern into an operationally hardened baseline: cleanup that handles real teardown failures, multipart streaming for artifacts beyond 5 GB, hybrid Pattern C→B migration for Athena-constrained workloads, full Pattern B migration for UC9/UC13/UC14, VPC Endpoint SG automation, a five-script validator suite, and UI/UX screenshot coverage across every UC.

The most impactful decisions were:

Python rewrite of the cleanup script: the three failure modes that kept surfacing in Phase 7 cleanup are now handled end-to-end in a single script, with a dry-run flag and per-step failure isolation
put_stream as a streaming API, not a size parameter on put_bytes: makes large-file handling an explicit, reviewable code choice rather than a runtime size branch
Pattern C → Pattern B hybrid at the Lambda level, not the UC level: keeps Athena on S3 (AWS constraint) while moving AI outputs to FSx ONTAP, without forcing whole-UC re-architecture
check_s3ap_iam_patterns.py added mid-phase, not deferred: when Theme D screenshot capture surfaced a new Discovery Lambda IAM bug, the validator was added immediately so the same bug cannot recur silently
Template-deploy.yaml as single source of truth: the template.yaml / template-deploy.yaml duplication was a maintenance tax every Phase 7 round paid; Phase 8 ended it by auto-generating the SAM template and marking it deprecated-for-edits
Paired CI + observability delivery: shipping CI without observability means failed validation with nowhere to notify; the two were treated as one deliverable and shipped together in Phase 8 — GitHub Actions for static validation, EventBridge + SNS for runtime failure notification, and three operational runbooks for human response

Phase 9 will take the event-driven trigger from "deployed and waiting for AWS FR-2" to "end-to-end verified," and roll out the observability baseline to all 17 UCs. Phase 8 closed every item in its committed scope — 15 themes and 224 task items. The remaining items listed for Phase 9 are production-scale rollouts or AWS-feature-dependent E2E validations, not Phase 8 deferrals. CI validates on every push, UC1 establishes the EventBridge failure-notification reference path, OutputWriter standardizes the AI/ML output paths across the library, and the runbooks tell operators what to do next.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns
Previous phases: Phase 1 · Phase 6A/6B · Phase 7
Phase 8 artifacts (all in the GitHub repo):
