Validation

Required Checks

Run the focused validation harness from the project root:

python3 scripts/validate_wiki.py

The script currently checks:

  1. Every data/**/*.json file parses as JSON.
  2. Every data/**/*.jsonl line parses as a JSON object.
  3. IDs are unique within data/nodes.jsonl, data/edges.jsonl (when edge IDs exist), data/source_catalog.jsonl, and the curriculum topic registries.
  4. Graph edge triples in data/edges.jsonl are unique.
  5. Graph edge from and to references resolve to known node IDs.
  6. Source catalog local paths stay inside the repo root. Use --require-source-files for private archive audits that require every local source file to exist.
  7. Topic page paths exist for data/curriculum_map.json and any future data/curricula/**/*.json files.
  8. Curriculum topic references to form, source, competence, and hub nodes resolve when those fields are present.
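
Checks 2 through 5 can be sketched in a few lines. This is a minimal illustration, not the code in scripts/validate_wiki.py: every helper name is hypothetical, blank JSONL lines are assumed to be skipped, and the "type" key used as the middle element of an edge triple is an assumption, since only from and to are named above.

```python
import json
from collections import Counter


def parse_jsonl(text):
    """Parse JSONL text; every non-blank line must be a JSON object."""
    records, errors = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: {exc.msg}")
            continue
        if not isinstance(obj, dict):
            errors.append(f"line {lineno}: not a JSON object")
            continue
        records.append(obj)
    return records, errors


def duplicate_ids(records, key="id"):
    """Return IDs that occur more than once, in first-seen order."""
    counts = Counter(r[key] for r in records if key in r)
    return [value for value, n in counts.items() if n > 1]


def duplicate_edge_triples(edges, relation_key="type"):
    """Return (from, relation, to) triples that occur more than once.

    relation_key is a hypothetical field name; adjust it to the real schema.
    """
    counts = Counter(
        (e.get("from"), e.get(relation_key), e.get("to")) for e in edges
    )
    return [triple for triple, n in counts.items() if n > 1]


def dangling_edges(edges, node_ids):
    """Return edges whose from or to value is not a known node ID."""
    return [
        e for e in edges
        if e.get("from") not in node_ids or e.get("to") not in node_ids
    ]
```

Keeping each check as a function that returns a list of findings makes it straightforward to collect every problem in one run before deciding on the exit status.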

The command exits with status 1 if any errors are found. It exits with status 0 when the run is clean or reports only warnings.
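
Because the exit status is the only machine-readable signal described here, a wrapper (for example in a pre-commit hook) only needs to compare it with 0. The sketch below is an assumption-laden illustration: run_validation and DEFAULT_COMMAND are hypothetical names, and the wrapper cannot tell a clean run from a warnings-only run, since both exit with 0.

```python
import subprocess
import sys

# Assumption: the wrapper is started from the project root, as the
# instructions above require for the harness itself.
DEFAULT_COMMAND = [sys.executable, "scripts/validate_wiki.py"]


def run_validation(command=None):
    """Run a validation command; return True when it exits with status 0."""
    completed = subprocess.run(command or DEFAULT_COMMAND, check=False)
    return completed.returncode == 0

# Usage from the project root:
#     if not run_validation():
#         raise SystemExit(1)
```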

The raw source archive is intentionally not version-tracked. Use scripts/source_provenance.py --require-local when you need to verify that the local private archive is present and hashable.
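
The containment rule for source catalog paths, and the stricter "file must exist" mode used for private archive audits, can be sketched as follows. This is not taken from either script: check_source_path and its require_file parameter are hypothetical, and require_file only mirrors the intent of --require-source-files.

```python
from pathlib import Path


def check_source_path(repo_root, local_path, require_file=False):
    """Return a list of problems for one source-catalog local path."""
    root = Path(repo_root).resolve()
    # Joining with an absolute local_path discards root, so absolute
    # paths outside the repo are caught by the containment test below.
    target = (root / local_path).resolve()
    problems = []
    if target != root and root not in target.parents:
        problems.append(f"{local_path}: escapes the repo root")
    elif require_file and not target.is_file():
        problems.append(f"{local_path}: local source file is missing")
    return problems
```

Resolving both paths before comparing them means that ../ segments and symbolic links cannot be used to step outside the repo root unnoticed.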

Run these checks after Milestone 1 or 2 changes:

  1. data/curriculum_map.json parses as JSON.
  2. Every line in data/nodes.jsonl and data/edges.jsonl parses as JSON.
  3. Every topic registry entry has id, slug, form, competence_id, source_id, review_status, page_path, and hub_category.
  4. Every topic registry page_path exists.
  5. Every topic node appears in data/nodes.jsonl.
  6. Every topic has edges for subject, form, source, competence, and hub connections.
  7. No raw files under raw/ are modified during wiki maintenance.
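
Checks 3 and 5 in this list lend themselves to a short script. The sketch below assumes that registry entries are plain JSON objects, that a field counts as present whenever its key exists, and that a topic's node ID equals its registry id; none of these assumptions is confirmed by this document, and the function names are hypothetical.

```python
REQUIRED_TOPIC_FIELDS = (
    "id", "slug", "form", "competence_id", "source_id",
    "review_status", "page_path", "hub_category",
)


def missing_topic_fields(entry):
    """Return the required fields that are absent from one registry entry."""
    return [field for field in REQUIRED_TOPIC_FIELDS if field not in entry]


def audit_registry(entries, node_ids):
    """Return problems for a list of topic registry entries."""
    problems = []
    for index, entry in enumerate(entries):
        label = entry.get("id", f"entry #{index}")
        for field in missing_topic_fields(entry):
            problems.append(f"{label}: missing {field}")
        # Assumption: the topic node ID is the same string as the topic id.
        if "id" in entry and entry["id"] not in node_ids:
            problems.append(f"{label}: no node in data/nodes.jsonl")
    return problems
```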

Run these checks after question-mapping changes:

  1. Every line in data/question_map_2021_2025.jsonl parses as JSON.
  2. Every line in data/review_queue_question_mapping.jsonl parses as JSON.
  3. data/topic_frequency_2021_2025.json parses as JSON.
  4. Every expected source file is used, and no unexpected source file is used.
  5. Every question_id is unique.
  6. Every mapped primary_topic_id exists in data/curriculum_map.json.topic_registry.
  7. Every mapped secondary_topic_ids entry exists in data/curriculum_map.json.topic_registry.
  8. Every mapped exam_format_group_id exists in data/exam_format_topic_crosswalk_2022.jsonl.
  9. Frequency totals match the question-map record count.
  10. Low-confidence, figure-dependent, and table-dependent records appear in the review queue when present.
  11. Wiki links resolve after adding or renaming pages.
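
Checks 5 through 7 can be combined into one pass over the question-map records. This is a minimal sketch under stated assumptions: unmapped records are assumed to carry a null or absent primary_topic_id, secondary_topic_ids is assumed to be a list, and audit_question_map is a hypothetical name.

```python
from collections import Counter


def audit_question_map(records, known_topic_ids):
    """Return problems found in question-map records."""
    problems = []
    id_counts = Counter(r.get("question_id") for r in records)
    for question_id, n in id_counts.items():
        if n > 1:
            problems.append(f"question_id {question_id!r} appears {n} times")
    for record in records:
        question_id = record.get("question_id")
        primary = record.get("primary_topic_id")
        # Assumption: unmapped records have no primary topic and are skipped.
        if primary is not None and primary not in known_topic_ids:
            problems.append(
                f"{question_id}: unknown primary_topic_id {primary!r}"
            )
        for secondary in record.get("secondary_topic_ids") or []:
            if secondary not in known_topic_ids:
                problems.append(
                    f"{question_id}: unknown secondary_topic_id {secondary!r}"
                )
    return problems
```

Here known_topic_ids would be the set of IDs read from the topic_registry in data/curriculum_map.json.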

Run these additional checks after 2018-2025 legacy Basic Mathematics mapping changes:

  1. Every mapped legacy primary_topic_id resolves to data/curricula/csee/basic-mathematics/2005.json.
  2. Every mapped legacy secondary_topic_ids entry resolves to data/curricula/csee/basic-mathematics/2005.json.
  3. Any 2023 Mathematics target is explicitly crosswalk-derived from data/curricula/crosswalks/csee-basic-mathematics-2005-to-mathematics-2023.json.
  4. Legacy-only, partial-overlap, missing-text, figure-dependent, table-dependent, and low-confidence records remain reviewable instead of being forced into the 2023 topic registry.
  5. Topic-frequency summaries state whether they are legacy-2005 counts, current-2023 counts, or crosswalk-derived counts.
  6. Learner-facing pages label these records as unreviewed assessment signals unless the original paper and marking guidance have been manually checked.
  7. No generated solution, answer, marking scheme, or official emphasis claim is inferred from the unreviewed mapping layer.

Run these checks after learner-facing topic expansion:

  1. Each topic page follows the chapter shape defined in docs/rulebook.md.
  2. Learner-facing mathematics uses $...$ for inline math and $$...$$ for display math.
  3. Code formatting is reserved for IDs, paths, literal source strings, or extraction artifacts.
  4. Official syllabus content, unreviewed exam signals, open enrichment, and textbook notes remain clearly separated.
  5. Wikipedia and textbook wording is not copied into learner prose.
  6. Practice tasks progress from direct understanding to application or edge cases.
  7. Renderer limitations are noted when math may not display in a given Markdown surface.
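
Check 2 can be partly automated by flagging lines that use other LaTeX delimiters. The sketch below is a heuristic only: it assumes that the likely violations are \( ... \) and \[ ... \] pairs, it does not skip code spans, and nonstandard_math_lines is a hypothetical name.

```python
import re

# Matches \( \) \[ \], the delimiters this sketch treats as nonstandard.
NONSTANDARD_MATH = re.compile(r"\\\(|\\\)|\\\[|\\\]")


def nonstandard_math_lines(text):
    """Return 1-based numbers of lines that contain \\( \\) \\[ or \\]."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if NONSTANDARD_MATH.search(line)
    ]
```

A flagged line still needs a human look, because a backslash-bracket sequence can legitimately appear inside a literal source string or an extraction artifact.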

Run these checks after form/class page changes:

  1. Each form/class page links to every official topic listed for that form in data/curriculum_map.json.form_topics.
  2. Aliases are recorded without changing the official syllabus form name.
  3. Any readiness topic IDs resolve to existing topic_registry entries.
  4. Each form/class page distinguishes navigation status from curriculum authority.
  5. Future tutor notes do not imply that a personalized tutor has already been built.

Current Baseline

  • Curriculum topics: 44.
  • Syllabus topic pages: 44.
  • Topic hubs: 6.
  • Graph nodes: 72.
  • Graph edges: 284.
  • Imported Basic Mathematics exam JSON files: 35.
  • Five-year Basic Mathematics question-map records: 191.
  • Five-year mapped records: 147.
  • Five-year unmapped records: 44.
  • Five-year question-mapping review records: 125.
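
Two relationships among these figures can be cross-checked arithmetically: the mapped and unmapped records should sum to the question-map total (147 + 44 = 191), and the topic and page counts should agree. The sketch below encodes only the numbers listed above; the dictionary keys and the function name are hypothetical, and the topic/page equality assumes one page per curriculum topic.

```python
BASELINE = {
    "curriculum_topics": 44,
    "syllabus_topic_pages": 44,
    "question_map_records": 191,
    "mapped_records": 147,
    "unmapped_records": 44,
}


def baseline_inconsistencies(baseline):
    """Return descriptions of baseline figures that contradict each other."""
    problems = []
    # Assumption: every curriculum topic has exactly one syllabus topic page.
    if baseline["curriculum_topics"] != baseline["syllabus_topic_pages"]:
        problems.append("topic count differs from topic page count")
    total = baseline["mapped_records"] + baseline["unmapped_records"]
    if total != baseline["question_map_records"]:
        problems.append("mapped + unmapped differs from question-map total")
    return problems
```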

Milestone Boundary

Milestones 1 and 2 stop at curriculum registry, page scaffolding, navigation, and graph connectivity. Milestone 3 adds unreviewed question-to-topic signals. Worked solutions and marking schemes remain outside the current milestone boundary.