Ingestion Workflow

Source Intake

  1. Place source files under the correct raw/ subfolder.
  2. Add a record to data/source_catalog.jsonl.
  3. Assign an authority level and review status.
  4. Run the provenance metadata script before extraction:

```bash
python3 scripts/source_provenance.py --catalog data/source_catalog.jsonl --summary-only
```

The script resolves local catalog path values relative to the project root and computes sha256 plus file_size_bytes without modifying raw files.

  5. Never modify the raw source after intake.
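
As a rough illustration of the read-only checksum step described above, the following sketch computes the two fields for a single file. It is not the code of scripts/source_provenance.py; the function name `describe_source` and its arguments are hypothetical.

```python
import hashlib
from pathlib import Path


def describe_source(project_root: Path, catalog_path: str) -> dict:
    """Return checksum and size for one catalog entry, reading the file only.

    Hypothetical helper: the real script may structure this differently.
    """
    # Resolve the catalog's relative path against the project root.
    source = (project_root / catalog_path).resolve()
    digest = hashlib.sha256()
    # Open read-only in binary mode and hash in 1 MiB chunks so that
    # large PDFs do not have to be loaded into memory at once.
    with source.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "sha256": digest.hexdigest(),
        "file_size_bytes": source.stat().st_size,
    }
```

Because the file is only ever opened in `"rb"` mode, a check like this cannot alter the raw source.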

Provenance Metadata

Use scripts/source_provenance.py to verify that local source records still match the private archive and to create public-safe provenance exports.

Common checks:

```bash
python3 scripts/source_provenance.py --catalog data/source_catalog.jsonl
python3 scripts/source_provenance.py --only-source-type syllabus --require-local
```

Public-safe export:

```bash
python3 scripts/source_provenance.py \
  --catalog data/source_catalog.jsonl \
  --write-public data/source_catalog_public.jsonl \
  --summary-only
```

The public export keeps source identity, citation metadata, review/extraction status, checksums, and file sizes, but omits private raw/ paths. Local archive records are marked with archive_policy: private_archive_not_published. Do not publish raw PDFs, OCR dumps, or full extracted page text unless explicit permission exists.
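
The shape of a public-safe record can be sketched as below. This is an illustration only: it assumes the private location is stored under a field named `path`, which may not match the real catalog schema, and `to_public_record` is a hypothetical name rather than a function from the script.

```python
import json

# Assumed name of the private field; the real catalog schema may differ.
PRIVATE_FIELDS = {"path"}


def to_public_record(record: dict) -> dict:
    """Copy a catalog record, dropping private path fields."""
    public = {key: value for key, value in record.items() if key not in PRIVATE_FIELDS}
    # Mark records that pointed at a local file so readers know the
    # underlying archive exists but is not published.
    if any(key in record for key in PRIVATE_FIELDS):
        public["archive_policy"] = "private_archive_not_published"
    return public


def to_public_jsonl(lines: list[str]) -> list[str]:
    """Convert JSONL catalog lines into public-safe JSONL lines."""
    return [
        json.dumps(to_public_record(json.loads(line)), sort_keys=True)
        for line in lines
        if line.strip()
    ]
```

The input record is copied rather than mutated, so the private catalog stays unchanged while the export is built.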

Syllabus Ingestion

  1. Extract metadata: title, publisher, year, level, subject, language, citation.
  2. Extract forms/classes, competences, learning activities, assessment notes, and periods.
  3. Write or update the versioned curriculum file at data/curricula/{level}/{subject_slug}/{year}.json.
  4. Update data/curricula/index.json with current pointers, aliases, language notes, and source IDs.
  5. Add each official syllabus topic to that curriculum version's topic_registry with a version-scoped ID, slug, form, competence, source, review status, page path, and hub category.
  6. Create one leaf topic page per syllabus topic and keep hub pages as navigation pages.
  7. Update data/nodes.jsonl and data/edges.jsonl so each topic is connected to subject, form, source, competence, and hub.
  8. Put ambiguous topics, duplicate source questions, or extraction issues into data/review_queue.jsonl.
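
Step 7 can be pictured with a small sketch that emits one edge per required connection and refuses to continue when a connection is missing. The topic and edge field names (`id`, `from`, `to`, `type`) are assumptions made for illustration, not the project's actual node and edge schema.

```python
# Connections named in step 7. The field names used below are assumptions.
REQUIRED_LINKS = ("subject", "form", "source", "competence", "hub")


def edges_for_topic(topic: dict) -> list[dict]:
    """Build one edge per required link, failing loudly if any link is missing."""
    missing = [name for name in REQUIRED_LINKS if not topic.get(name)]
    if missing:
        raise ValueError(f"topic {topic.get('id')!r} is missing links: {missing}")
    return [
        {"from": topic["id"], "to": topic[name], "type": f"has_{name}"}
        for name in REQUIRED_LINKS
    ]
```

A topic that raises here is a natural candidate for data/review_queue.jsonl rather than a partial entry in data/edges.jsonl.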

data/curriculum_map.json remains a Mathematics compatibility view while older question-mapping artifacts are migrated. Do not grow it into the multi-subject authority.

Exam JSON Ingestion

  1. Treat extracted JSON as unverified.
  2. Check that the file parses and that sections and questions are present.
  3. Preserve extraction log information.
  4. Add one line per paper to data/question_inventory.jsonl.
  5. Do not generate final solutions or marking schemes from unreviewed JSON.
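
A minimal sketch of the check in step 2 follows. It assumes the extracted paper is a JSON object with a `sections` list whose items each carry a `questions` list; those key names are guesses and should be adjusted to the actual extraction format.

```python
import json


def check_extracted_paper(text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        paper = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"does not parse: {error.msg} at line {error.lineno}"]
    if not isinstance(paper, dict):
        return ["top-level value is not an object"]
    sections = paper.get("sections")  # assumed key name
    if not isinstance(sections, list) or not sections:
        return ["no sections found"]
    problems = []
    for index, section in enumerate(sections):
        # "questions" is also an assumed key name.
        questions = section.get("questions") if isinstance(section, dict) else None
        if not isinstance(questions, list) or not questions:
            problems.append(f"section {index} has no questions")
    return problems
```

Passing these checks says only that the file is structurally usable. The content remains unverified, as step 1 requires.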

Exam Question Mapping

  1. Choose a tightly scoped paper set before mapping, such as one subject and a fixed year range.
  2. Flatten only answerable leaf nodes. Parent stems provide context but should not become separate question records.
  3. Use stable question IDs, for example csee_041_2024_p1_q14_b_ii.
  4. Map only to existing topic IDs from the topic_registry in data/curriculum_map.json.
  5. Attach an exam-format group from data/exam_format_topic_crosswalk_2022.jsonl when the match is clear.
  6. Store confidence, mapping notes, and review status.
  7. Put low-confidence, multi-topic, figure-dependent, table-dependent, missing-mark, missing-text, and unmapped records into data/review_queue_question_mapping.jsonl.
  8. Generate aggregate topic-frequency signals only after all individual records parse and all mapped IDs resolve.
  9. Do not generate solutions, answers, or marking schemes during question mapping.
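
Steps 2, 3, and 8 can be sketched together. The nested question shape (`label`, `text`, `parts`) and the mapping field `topic_id` are assumptions made for illustration; only the ID pattern comes from the example in step 3.

```python
from collections import Counter


def flatten_leaves(node: dict, prefix: str, context: tuple = ()) -> list[dict]:
    """Return records for leaf nodes only; parent stems are carried as context."""
    node_id = f"{prefix}_{node['label']}"
    parts = node.get("parts") or []
    if not parts:
        return [{"question_id": node_id, "text": node.get("text", ""), "context": list(context)}]
    stem = node.get("text")
    child_context = context + ((stem,) if stem else ())
    records = []
    for part in parts:
        records.extend(flatten_leaves(part, node_id, child_context))
    return records


def topic_frequency(mappings: list[dict], known_topic_ids: set[str]) -> Counter:
    """Count mapped topics, but only if every mapped ID resolves."""
    unresolved = sorted({m["topic_id"] for m in mappings} - known_topic_ids)
    if unresolved:
        raise ValueError(f"unresolved topic IDs: {unresolved}")
    return Counter(m["topic_id"] for m in mappings)


# Placeholder question: the texts are stand-ins, not real exam content.
question = {
    "label": "q14",
    "text": "stem",
    "parts": [
        {"label": "a", "text": "part a"},
        {"label": "b", "text": "part b stem", "parts": [
            {"label": "i", "text": "part b i"},
            {"label": "ii", "text": "part b ii"},
        ]},
    ],
}
records = flatten_leaves(question, "csee_041_2024_p1")
print([r["question_id"] for r in records])
# ['csee_041_2024_p1_q14_a', 'csee_041_2024_p1_q14_b_i', 'csee_041_2024_p1_q14_b_ii']
```

The parent nodes `q14` and `q14_b` produce no records of their own; their text appears only in the `context` of their leaves.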

Legacy Basic Mathematics Exam Mapping

Use this layer for CSEE Basic Mathematics Paper 1 records from the legacy exam period, including the 2018-2025 working set.

  1. Treat the 2005 Basic Mathematics syllabus at data/curricula/csee/basic-mathematics/2005.json as the first mapping authority for legacy Basic Mathematics exam records.
  2. Do not map 2018-2025 Basic Mathematics questions directly to the 2023 Mathematics spine unless a reviewed crosswalk path exists.
  3. When a current-syllabus relationship is useful, record it as a crosswalk-derived target from data/curricula/crosswalks/csee-basic-mathematics-2005-to-mathematics-2023.json, preserving confidence, relationship type, and review status.
  4. Keep legacy-only topics, partial overlaps, and ambiguous current-topic matches in the relevant review queue instead of forcing a 2023 topic.
  5. Keep aggregate topic-frequency files explicitly tied to the mapping authority used. A legacy-first aggregate should not be described as direct 2023 Mathematics coverage.
  6. If a learner page cites a legacy-derived signal, label it as an unreviewed assessment signal until the original paper and marking guidance have been checked.
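
One way to picture rules 2 to 4 is a lookup that returns a current-syllabus target only when exactly one reviewed crosswalk row exists, and otherwise signals that the record belongs in a review queue. The crosswalk row fields used here (`legacy_topic_id`, `current_topic_id`, `relationship`, `confidence`, `review_status`) and the status value `"reviewed"` are assumptions; the real crosswalk file may be organised differently.

```python
from typing import Optional


def crosswalk_target(legacy_topic_id: str, crosswalk: list[dict]) -> Optional[dict]:
    """Return a crosswalk-derived target, or None if the record needs review."""
    matches = [row for row in crosswalk if row["legacy_topic_id"] == legacy_topic_id]
    reviewed = [row for row in matches if row["review_status"] == "reviewed"]
    if len(reviewed) != 1:
        # Legacy-only, ambiguous, or unreviewed: do not force a 2023 topic.
        return None
    row = reviewed[0]
    return {
        "current_topic_id": row["current_topic_id"],
        "derived_via": "crosswalk",
        # Preserve the crosswalk's own qualifiers instead of recomputing them.
        "relationship": row["relationship"],
        "confidence": row["confidence"],
        "review_status": row["review_status"],
    }
```

The legacy topic ID stays the primary mapping in every case; the returned value is an additional, clearly labelled derived target.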

Assessment Format Ingestion

  1. Treat official NECTA examination-format documents as assessment guidance.
  2. Do not replace the current syllabus spine with older examination-format topic wording.
  3. Extract subject format, question structure, content topics, and table-of-specification groups.
  4. Crosswalk exam-format groups to current syllabus topic IDs where possible.
  5. Queue unmatched or ambiguous groups for manual review.

Wiki Update

  1. Update wiki/index.md.
  2. Append to wiki/log.md.
  3. Confirm citations use source paths.
  4. Confirm pages follow the required page format.
  5. Run the checks in docs/validation.md.

Curriculum Versioning Check

Before ingesting any old or future syllabus, read docs/curriculum-versioning-framework.md and decide whether the source creates a new curriculum version, corrects metadata for an existing version, or only supplies assessment context.

New official syllabus versions should be added at data/curricula/{level}/{subject_slug}/{year}.json. Do not overwrite an older official syllabus version. When a syllabus replaces an earlier version, add crosswalk records only after comparing official topic wording, form placement, and scope; queue uncertain relationships for manual review.
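
The no-overwrite rule for new versions can be sketched with Python's exclusive-create file mode. The helper names are hypothetical, and this covers only the case of adding a new version; deliberate metadata corrections to an existing version would need a separate path.

```python
import json
from pathlib import Path


def curriculum_path(root: Path, level: str, subject_slug: str, year: int) -> Path:
    """Build data/curricula/{level}/{subject_slug}/{year}.json under the project root."""
    return root / "data" / "curricula" / level / subject_slug / f"{year}.json"


def write_new_version(root: Path, level: str, subject_slug: str, year: int, curriculum: dict) -> Path:
    """Write a new curriculum version, refusing to replace an existing file."""
    target = curriculum_path(root, level, subject_slug, year)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError instead of overwriting an existing version.
    with target.open("x", encoding="utf-8") as handle:
        json.dump(curriculum, handle, ensure_ascii=False, indent=2)
    return target
```

Calling `write_new_version` twice for the same level, subject, and year fails on the second call, which makes an accidental overwrite visible instead of silent.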