3035 - Blog Editorial AI and Voice Program

Goals

Build an AI-assisted editorial system that can source ideas, manage a content calendar, generate briefs/outlines/drafts, and support blog operations without flattening the brand voice.
Use WhatsApp exports as a high-signal research corpus for community language, concerns, provider/product recommendations, and visual culture while protecting member privacy.
Add review/authorship controls so sensitive or personal-story posts always require human signoff and can selectively use Maggie's named byline.

Why This Program Exists

The current blog platform is already live and strong on publishing UX, feeds, SEO, and admin editing, but it is still a single-author Markdown workflow without:
- source-ingestion pipelines
- editorial-intelligence tooling
- selective byline controls
- signoff gates for sensitive content
- structured voice/tone governance
The next body of work is large enough to justify its own umbrella program rather than being squeezed into the remaining 3001 stabilization backlog.

Dependency-Ordered Project Sequence

3036_editorial-platform-architecture-and-repo-boundaries.md (completed)
3037_corpus-governance-redaction-and-privacy-policy.md (completed)
3038_whatsapp-intake-and-multimodal-normalization.md (active)
3039_editorial-intelligence-and-retrieval-layer.md (active)
3040_content-calendar-brief-and-draft-pipeline.md (completed)
3041_review-signoff-and-byline-controls.md (active)
3042_voice-calibration-and-model-routing.md (backlog)

Why This Order

3036 comes first because we need stable whole-repo system boundaries, usable repo organization, and editorial-platform architecture before choosing jobs, storage, schemas, or approval surfaces.
3037 comes second because corpus-governance and privacy rules need to constrain ingestion design before large-scale data processing begins.
3038 then builds the normalized corpus pipeline once the boundaries and redaction rules are explicit.
3039 depends on the normalized corpus and turns it into retrieval-ready editorial intelligence.
3040 depends on retrieval and annotations so the content calendar, briefs, and drafts are grounded rather than generic.
3041 then integrates approval, byline, and publish-safety controls into the existing blog workflow.
3042 comes last so model-tiering and any fine-tune decision are informed by the real tasks, corpus shape, and approval workflow instead of guesswork.
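
The ordering rationale above can be read as a small dependency graph. The sketch below is one way to check that a proposed sequence respects it. The edges are an interpretation of the rationale, not a declared schema, and `respectsDependencies` is a hypothetical helper name.

```typescript
// Project IDs come from the sequence above. The edges are one reading of the
// "Why This Order" rationale (an assumption), not an authoritative list.
const dependsOn: Record<string, string[]> = {
  "3036": [],
  "3037": ["3036"],
  "3038": ["3036", "3037"],
  "3039": ["3038"],
  "3040": ["3039"],
  "3041": ["3040"],
  "3042": ["3039", "3040", "3041"],
};

// Returns true when every project appears after all of its prerequisites.
function respectsDependencies(order: string[]): boolean {
  const seen = new Set<string>();
  for (const id of order) {
    const prereqs = dependsOn[id] ?? [];
    if (!prereqs.every((p) => seen.has(p))) return false;
    seen.add(id);
  }
  return true;
}

const plannedOrder = ["3036", "3037", "3038", "3039", "3040", "3041", "3042"];
console.log(respectsDependencies(plannedOrder));
```

A check like this could run in CI against the sequence list if the ordering is ever edited, though whether that is worth automating is a judgment call.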

Execution Status

Current state: active
Phase: umbrella execution; 3036 is completed, having closed out the repo-wide architecture, organization, governance-definition, and safe-reorganization lane
Current downstream state:
- 3037 is completed after collaborative walkthrough/signoff locked the governing corpus/privacy boundary for downstream work
- 3038 remains active, but only as a conditional source-layer follow-up lane now that the phase-1 corpus contract and persisted pilot import are stable
- 3039 remains active and is back on the critical path: weak-signal WhatsApp topic retrieval is currently the main blocker on useful brief/draft quality, even though live model-backed brief synthesis is now verified
- 3040 is now completed after collaborative walkthrough/signoff approved the phase-1 candidate -> calendar -> brief -> draft delivery, live run-key closeout reporting, and verified ready_for_3041_handoff evidence on the persisted pilot run
- 3041 remains active, but the highest-value work right now is upstream quality improvement rather than further admin-workflow expansion
Blocked by: none

Planning Inputs

Existing repo baseline:
- DOCS/features/blog.md
- completed blog streams 3031, 3032, and 3033
External planning input reviewed at kickoff:
- March 21, 2026 exported ChatGPT planning transcript on WhatsApp chat AI training
Core planning direction carried forward from that intake:
- retrieval first
- redaction before downstream AI use
- multimodal corpus support (text + media + screenshots + voice notes)
- optional fine-tuning only after runtime tasks and output shape are clear

Non-Negotiable Operating Decisions

Keep the raw WhatsApp exports in a restricted raw vault and do all downstream AI work from a redacted derivative corpus.
Preserve provider, clinic, hospital, product, medication, and service recommendations as knowledge-layer data unless the surrounding context would re-identify a member.
Treat this as AI-assisted redaction plus human review for high-risk items, not as fully autonomous anonymization.
Default blog authorship remains generic/organizational unless a post is explicitly approved for Maggie's named byline.
Any post with strong personal-story framing, intimate health detail, direct first-person lived-experience claims, or near-direct source quotation requires human signoff before publishing.
Do not start with raw-corpus fine-tuning; first reach a working retrieval-grounded editorial pipeline and only then decide whether a curated fine-tune is worth the extra cost/maintenance.

Scope

In Scope

WhatsApp export intake architecture for large text + media corpora.
AI-assisted redaction, pseudonymization, OCR, transcription, media annotation, and sensitivity scoring.
Structured editorial research corpus design:
- thread/chunk records
- topic and pain-point labels
- phrase banks
- quote candidates
- provider/product recommendation extraction
- visual motif and media tags
Retrieval-powered editorial workflows:
- article-idea sourcing
- content-calendar planning
- brief generation
- outline generation
- draft generation with internal source traceability
Voice/tone governance using redacted source material and human-approved exemplars.
Review, approval, authorship, and signoff workflow for blog publishing.
Stack/model evaluation for cost-quality tradeoffs.

Out of Scope

Direct training on raw, unredacted personal chat logs.
Fully autonomous publishing with no human review.
Impersonating named community members, or preserving identifiable private speech as a style target.
Replacing the current blog platform before extending it.

Success Criteria

A single approved architecture exists for raw-vault storage, redaction pipeline, structured corpus, retrieval layer, and editorial workflow orchestration.
One pilot WhatsApp corpus can be ingested end to end into a redacted, searchable editorial dataset.
The system can reliably generate:
- grounded blog ideas
- content calendar candidates
- article briefs
- first-draft posts with internal source traceability
Sensitive posts are always routed through explicit human review/signoff.
Byline policy is implemented at the data/workflow level:
- generic byline default
- Maggie byline only on approved posts
The program reaches a clear model-routing recommendation instead of defaulting to the most expensive model for every step.

Recommended Architecture Direction

Storage And Data Layers

Raw vault:
- original WhatsApp exports and media stored outside normal app flows with restricted access
Redacted working corpus:
- structured text/media records derived from the raw vault
Editorial intelligence layer:
- thread summaries
- topic clusters
- recommendation entities
- quote candidates
- voice markers
- content-angle suggestions
Retrieval layer:
- searchable chunks plus editorial annotations for grounded drafting
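
To make the layering above concrete, here is a minimal sketch of how the redacted corpus, editorial annotations, and retrieval units might relate as TypeScript types. Every type and field name is an illustrative assumption, not the repo's actual schema. The raw vault is deliberately absent because downstream code should never see it.

```typescript
// All type and field names here are illustrative assumptions.
type Sensitivity = "low" | "medium" | "high";

// A record in the redacted working corpus, derived from the raw vault.
interface RedactedChunk {
  chunkId: string;
  threadId: string;
  kind: "text" | "image" | "screenshot" | "voice_note";
  redactedText: string; // message text, OCR output, or transcript after redaction
  sensitivity: Sensitivity;
}

// Editorial-intelligence annotations layered on top of a chunk.
interface EditorialAnnotation {
  chunkId: string;
  topics: string[];
  quoteCandidate: boolean;
  recommendationEntities: string[]; // providers, clinics, products, services
  voiceMarkers: string[];
}

// Retrieval unit: a searchable chunk plus its annotations.
interface RetrievalDocument {
  chunk: RedactedChunk;
  annotation: EditorialAnnotation;
}

// Joins chunks with their annotations; chunks without annotations are skipped.
function buildRetrievalDocuments(
  chunks: RedactedChunk[],
  annotations: EditorialAnnotation[],
): RetrievalDocument[] {
  const byChunk = new Map<string, EditorialAnnotation>();
  for (const a of annotations) byChunk.set(a.chunkId, a);
  const docs: RetrievalDocument[] = [];
  for (const chunk of chunks) {
    const annotation = byChunk.get(chunk.chunkId);
    if (annotation) docs.push({ chunk, annotation });
  }
  return docs;
}
```

Skipping unannotated chunks is one possible choice; it keeps the retrieval layer limited to material that has passed through the editorial-intelligence step.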

Stack Recommendation

Start repo-native instead of jumping straight to n8n.
Recommended first stack:
- TypeScript batch scripts and server-side app actions in this repo
- Neon/Postgres for metadata, workflow state, approvals, and editorial objects
- private object storage for raw and redacted media artifacts
- retrieval index added only after the normalized corpus schema is stable
n8n can still be useful later for notifications, inbox routing, calendar syncing, or cross-tool approvals, but it should not be the first core processing engine for parsing/redaction/governance logic.
If orchestration complexity grows beyond simple repo-native jobs, evaluate a job/workflow layer after schema + review rules are stable rather than before.

Model Strategy Recommendation

Use a tiered model mix, not one model everywhere.
Lower-cost models are appropriate for:
- first-pass parsing
- candidate extraction
- metadata normalization
- coarse classification
Higher-end models are appropriate for:
- multimodal understanding on difficult media
- subtle voice/tone analysis
- sensitive-subject classification
- high-quality brief and draft generation
- approval-ready synthesis
Do not assume expensive models are required for every ingestion step; reserve them for stages where quality materially changes editorial output.
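
The tiering above could be expressed as a static routing table. This is a sketch only: the task names mirror the two lists above, while the `ModelTier` labels and the `routeModel` helper are hypothetical, and no specific model or vendor is implied.

```typescript
// Tier labels and task identifiers are illustrative; they mirror the lists above.
type ModelTier = "low_cost" | "high_end";

type PipelineTask =
  | "first_pass_parsing"
  | "candidate_extraction"
  | "metadata_normalization"
  | "coarse_classification"
  | "difficult_multimodal"
  | "voice_tone_analysis"
  | "sensitive_subject_classification"
  | "brief_and_draft_generation"
  | "approval_ready_synthesis";

// Record<PipelineTask, ...> makes the compiler reject a table that omits a task.
const tierByTask: Record<PipelineTask, ModelTier> = {
  first_pass_parsing: "low_cost",
  candidate_extraction: "low_cost",
  metadata_normalization: "low_cost",
  coarse_classification: "low_cost",
  difficult_multimodal: "high_end",
  voice_tone_analysis: "high_end",
  sensitive_subject_classification: "high_end",
  brief_and_draft_generation: "high_end",
  approval_ready_synthesis: "high_end",
};

function routeModel(task: PipelineTask): ModelTier {
  return tierByTask[task];
}
```

Keeping the table exhaustive at the type level means adding a new pipeline task forces an explicit tier decision rather than a silent default to the expensive tier.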

Workflow Guardrails

Default Authorship Policy

Most posts stay on the generic site/organization byline.
Named Maggie byline is reserved for posts explicitly approved as personal/editorial voice pieces.
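
A minimal sketch of how this default could be enforced at the data level, assuming a hypothetical `PostAuthorship` record with an explicit approval flag. The field names are illustrative and not taken from the existing blog schema.

```typescript
type Byline = "organization" | "maggie";

// Hypothetical authorship fields on a post record.
interface PostAuthorship {
  requestedByline: Byline;
  maggieBylineApproved: boolean; // assumed to be set only by an explicit human approval
}

// The named byline is used only when it was both requested and approved;
// every other combination falls back to the generic byline.
function resolveByline(post: PostAuthorship): Byline {
  if (post.requestedByline === "maggie" && post.maggieBylineApproved) {
    return "maggie";
  }
  return "organization";
}
```

Resolving the byline at render or publish time, rather than trusting the requested value, keeps an unapproved request from ever reaching a published post.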

Mandatory Signoff Categories

Personal-story or memoir-style posts
Posts that read as Maggie's own lived experience
Posts covering fertility, pregnancy loss, mental health, relationship trauma, or other intimate/identity-heavy topics in a first-person or advisory voice
Posts using direct or near-direct quotes from source conversations
Posts whose recommendation strength or emotional framing could be interpreted as personal endorsement
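
The categories above can be sketched as a gate that lists why a draft needs signoff and blocks it until a human has granted it. The flag names and the `passesSignoffGate` helper are hypothetical. How each flag gets set (classifier, editor checkbox, or both) is left open here.

```typescript
// One flag per mandatory signoff category listed above; names are illustrative.
interface DraftFlags {
  personalStoryOrMemoir: boolean;
  readsAsMaggieLivedExperience: boolean;
  intimateTopicInFirstPersonOrAdvisoryVoice: boolean;
  usesDirectOrNearDirectQuotes: boolean;
  couldReadAsPersonalEndorsement: boolean;
}

// Returns the list of categories that apply; empty means no mandatory signoff.
function signoffReasons(flags: DraftFlags): string[] {
  const reasons: string[] = [];
  if (flags.personalStoryOrMemoir) reasons.push("personal-story or memoir-style post");
  if (flags.readsAsMaggieLivedExperience) reasons.push("reads as Maggie's own lived experience");
  if (flags.intimateTopicInFirstPersonOrAdvisoryVoice) reasons.push("intimate topic in first-person or advisory voice");
  if (flags.usesDirectOrNearDirectQuotes) reasons.push("direct or near-direct source quotes");
  if (flags.couldReadAsPersonalEndorsement) reasons.push("could be read as personal endorsement");
  return reasons;
}

// This covers only the mandatory-signoff gate, not any other publish checks.
function passesSignoffGate(flags: DraftFlags, humanSignoffGranted: boolean): boolean {
  return signoffReasons(flags).length === 0 || humanSignoffGranted;
}
```

Returning the reasons rather than a bare boolean gives the reviewer the context for why a draft was routed to them.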

Source-Material Policy

Provider/product/clinic recommendations may be preserved when identity-safe.
Direct identifiers and quasi-identifiers for community members must be redacted, generalized, or withheld.
High-risk media must receive human review before it can inform editorial outputs.
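
As an illustration, the three rules above map onto a small decision function over a discriminated union. The item kinds, the `Handling` labels, and `decideHandling` are assumptions for the sketch. Deciding whether a recommendation is identity-safe, or whether media is high-risk, is the hard part and is not modeled here.

```typescript
// Hypothetical source-item shapes, one per rule in the policy above.
type SourceItem =
  | { kind: "recommendation"; identitySafe: boolean }
  | { kind: "member_identifier" }
  | { kind: "high_risk_media"; humanReviewed: boolean };

type Handling =
  | "preserve"
  | "redact_generalize_or_withhold"
  | "blocked_pending_human_review"
  | "cleared_for_editorial_use";

function decideHandling(item: SourceItem): Handling {
  switch (item.kind) {
    case "recommendation":
      // Recommendations survive only when the context cannot re-identify a member.
      return item.identitySafe ? "preserve" : "redact_generalize_or_withhold";
    case "member_identifier":
      // Direct and quasi-identifiers are never passed through as-is.
      return "redact_generalize_or_withhold";
    case "high_risk_media":
      // High-risk media cannot inform editorial outputs before human review.
      return item.humanReviewed ? "cleared_for_editorial_use" : "blocked_pending_human_review";
  }
}
```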

Proposed Workstreams

Corpus Governance And Privacy
- data handling rules
- redaction standards
- approval boundaries
- source-permission posture
WhatsApp Intake And Normalization
- parsing exports
- media manifests
- OCR/transcription pipeline
- structured record schema
Editorial Intelligence Layer
- topic extraction
- quote bank
- provider/product recommendation maps
- voice and phrase bank
Content Operations Layer
- idea backlog
- article brief generation
- calendar planning
- freshness/reuse controls
Drafting, Review, And Authorship Controls
- sensitive-content routing
- signoff states
- named-byline policy
- publish-safe approval flow
Model And Automation Optimization
- cost/quality benchmarks
- model routing
- optional fine-tune decision
- optional automation tooling expansion
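
For the "parsing exports" item in the intake workstream, a first-pass text parser might look like the sketch below. WhatsApp export line formats vary by platform and locale, so the bracketed `[date, time] Sender: text` shape assumed here is only one variant and would need checking against the real pilot export. `parseExport` and `ParsedMessage` are hypothetical names. Output still contains raw sender names, so code like this would run inside the raw-vault boundary, before redaction.

```typescript
interface ParsedMessage {
  timestamp: string; // kept as the raw date/time string; locale-aware parsing is out of scope
  sender: string;    // unredacted at this stage
  text: string;
}

// Assumed header shape: "[21/03/2026, 14:05:32] Sender: message text"
const LINE_PATTERN =
  /^\[(\d{1,2}\/\d{1,2}\/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?)\] ([^:]+): (.*)$/;

function parseExport(raw: string): ParsedMessage[] {
  const messages: ParsedMessage[] = [];
  let current: ParsedMessage | undefined;
  for (const line of raw.split(/\r?\n/)) {
    const match = LINE_PATTERN.exec(line);
    if (match) {
      current = {
        timestamp: match[1] ?? "",
        sender: match[2] ?? "",
        text: match[3] ?? "",
      };
      messages.push(current);
    } else if (current && line.length > 0) {
      // Lines without a header are treated as continuations of the previous message.
      current.text += "\n" + line;
    }
  }
  return messages;
}
```

System notices, media placeholders, and other line shapes are not handled; a real intake job would need explicit cases for them rather than folding them into the previous message.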

Milestones

Milestone 1: Program architecture, governance rules, and stack decision locked.
Milestone 2: Pilot WhatsApp corpus ingested into a redacted multimodal research dataset.
Milestone 3: Retrieval-grounded editorial tools produce blog ideas, briefs, and calendar candidates.
Milestone 4: Draft-generation and signoff workflow integrated with the blog publishing system.
Milestone 5: Byline controls, sensitive-content gates, and model-routing recommendation finalized.

Dependencies

Builds on the existing blog platform documented in DOCS/features/blog.md.
Needs a small carry-forward maintenance lane from 3008/3013 so governance and walkthrough debt do not accumulate while this program becomes the lead focus.

Risks

Privacy and trust risk if redaction is treated as fully solved by AI without human review.
Voice quality risk if the system overfits to private conversati

...[truncated for intake]