Ingestion Workflow
Source Intake
- Place source files under the correct raw/ subfolder.
- Add a record to data/source_catalog.jsonl.
- Assign an authority level and review status.
- Run provenance metadata before extraction:

  ```bash
  python3 scripts/source_provenance.py --catalog data/source_catalog.jsonl --summary-only
  ```

  The script resolves local catalog path values relative to the project root and computes sha256 plus file_size_bytes without modifying raw files.
- Never modify the raw source after intake.
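
The checksum step described above can be illustrated with a minimal sketch. It is not the code of scripts/source_provenance.py; the function name `describe_source` and the chunk size are assumptions made for illustration. It only shows how sha256 and file_size_bytes can be computed while opening the raw file read-only.

```python
import hashlib
from pathlib import Path


def describe_source(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return sha256 and file_size_bytes for one file (hypothetical helper)."""
    digest = hashlib.sha256()
    size = 0
    # Mode "rb" is read-only, so the raw source is never written to.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"sha256": digest.hexdigest(), "file_size_bytes": size}
```

Reading in chunks keeps memory use flat for large scanned PDFs.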
Provenance Metadata
Use scripts/source_provenance.py to verify that local source records still match the private archive and to create public-safe provenance exports.
Common checks:
```bash
python3 scripts/source_provenance.py --catalog data/source_catalog.jsonl
python3 scripts/source_provenance.py --only-source-type syllabus --require-local
```
Public-safe export:
```bash
python3 scripts/source_provenance.py \
  --catalog data/source_catalog.jsonl \
  --write-public data/source_catalog_public.jsonl \
  --summary-only
```
The public export keeps source identity, citation metadata, review/extraction status, checksums, and file sizes, but omits private raw/ paths. Local archive records are marked with archive_policy: private_archive_not_published. Do not publish raw PDFs, OCR dumps, or full extracted page text unless explicit permission exists.
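
As a rough illustration of the export rule above, the sketch below strips a private path field from one catalog record and marks the record with the archive policy. The field name `local_path` and the helper `to_public_record` are assumptions; the real catalog schema and script logic may differ.

```python
# Hypothetical name for the field that holds a private raw/ path.
PRIVATE_FIELDS = {"local_path"}


def to_public_record(record: dict) -> dict:
    """Return a copy of a catalog record that is safe to publish (sketch)."""
    public = {key: value for key, value in record.items() if key not in PRIVATE_FIELDS}
    if "local_path" in record:
        # Records backed by the private archive are flagged, not exposed.
        public["archive_policy"] = "private_archive_not_published"
    return public
```

Identity, citation, status, checksum, and size fields pass through unchanged because only the listed private fields are dropped.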
Syllabus Ingestion
- Extract metadata: title, publisher, year, level, subject, language, citation.
- Extract forms/classes, competences, learning activities, assessment notes, and periods.
- Write or update the versioned curriculum file at data/curricula/{level}/{subject_slug}/{year}.json.
- Update data/curricula/index.json with current pointers, aliases, language notes, and source IDs.
- Add each official syllabus topic to that curriculum version's topic_registry with a version-scoped ID, slug, form, competence, source, review status, page path, and hub category.
- Create one leaf topic page per syllabus topic and keep hub pages as navigation pages.
- Update data/nodes.jsonl and data/edges.jsonl so each topic is connected to subject, form, source, competence, and hub.
- Put ambiguous topics, duplicate source questions, or extraction issues into data/review_queue.jsonl.
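
The connectivity requirement in the list above can be checked with a small sketch like the following. The edge keys `from` and `to_type` are assumptions, since the actual schema of data/edges.jsonl is not described here.

```python
# The five link targets every topic needs, per the workflow above.
REQUIRED_LINK_TYPES = {"subject", "form", "source", "competence", "hub"}


def missing_links(topic_id: str, edges: list) -> set:
    """Return the required link types that a topic does not yet have.

    Each edge is a dict with hypothetical keys "from" and "to_type".
    """
    found = {edge["to_type"] for edge in edges if edge["from"] == topic_id}
    return REQUIRED_LINK_TYPES - found
```

A non-empty result is a natural candidate for data/review_queue.jsonl rather than a silent pass.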
data/curriculum_map.json remains a Mathematics compatibility view while older question-mapping artifacts are migrated. Do not grow it into the multi-subject authority.
Exam JSON Ingestion
- Treat extracted JSON as unverified.
- Validate whether the file parses and whether sections/questions are present.
- Preserve extraction log information.
- Add one line per paper to data/question_inventory.jsonl.
- Do not generate final solutions or marking schemes from unreviewed JSON.
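
A minimal sketch of the parse-and-presence check described above follows. The keys `sections` and `questions` are assumed names for the extracted structure, and `check_extracted_paper` is a hypothetical helper, not an existing script.

```python
import json


def check_extracted_paper(text: str) -> list:
    """Return a list of problems found in one extracted exam JSON document."""
    try:
        paper = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"does not parse: {exc.msg}"]
    if not isinstance(paper, dict):
        return ["top level is not an object"]
    sections = paper.get("sections")  # assumed key
    if not isinstance(sections, list) or not sections:
        return ["no sections"]
    problems = []
    for index, section in enumerate(sections):
        questions = section.get("questions") if isinstance(section, dict) else None  # assumed key
        if not isinstance(questions, list) or not questions:
            problems.append(f"section {index} has no questions")
    return problems
```

An empty list means only that the file is structurally usable; the content remains unverified.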
Exam Question Mapping
- Choose a tightly scoped paper set before mapping, such as one subject and a fixed year range.
- Flatten only answerable leaf nodes. Parent stems provide context but should not become separate question records.
- Use stable question IDs, for example csee_041_2024_p1_q14_b_ii.
- Map only to existing topic IDs from data/curriculum_map.json.topic_registry.
- Attach an exam-format group from data/exam_format_topic_crosswalk_2022.jsonl when the match is clear.
- Store confidence, mapping notes, and review status.
- Put low-confidence, multi-topic, figure-dependent, table-dependent, missing-mark, missing-text, and unmapped records into data/review_queue_question_mapping.jsonl.
- Generate aggregate topic-frequency signals only after all individual records parse and all mapped IDs resolve.
- Do not generate solutions, answers, or marking schemes during question mapping.
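
Two of the rules above, leaf-only flattening and resolve-before-aggregate, can be sketched as follows. The node keys `label`, `text`, and `parts`, the record key `topic_id`, and both function names are assumptions for illustration.

```python
from collections import Counter


def flatten_leaves(node: dict, prefix: str):
    """Yield (question_id, text) for answerable leaf nodes only.

    Nodes use hypothetical keys: "label", "text", and optional "parts".
    """
    question_id = f"{prefix}_{node['label']}"
    parts = node.get("parts") or []
    if not parts:
        yield question_id, node.get("text", "")
        return
    # A parent stem contributes its label to child IDs but is not a record.
    for part in parts:
        yield from flatten_leaves(part, question_id)


def topic_frequencies(records: list, known_topic_ids: set) -> Counter:
    """Count mapped topics, refusing to aggregate if any ID is unresolved."""
    unresolved = sorted({record["topic_id"] for record in records} - set(known_topic_ids))
    if unresolved:
        raise ValueError(f"unresolved topic IDs: {unresolved}")
    return Counter(record["topic_id"] for record in records)
```

With a paper prefix of csee_041_2024_p1, a question labelled q14 whose part b has a sub-part ii yields the ID csee_041_2024_p1_q14_b_ii, matching the example format above.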
Legacy Basic Mathematics Exam Mapping
Use this layer for CSEE Basic Mathematics Paper 1 records from the legacy exam period, including the 2018-2025 working set.
- Treat the Mathematics syllabus at data/curricula/csee/basic-mathematics/2005.json as the first mapping authority for legacy Basic Mathematics exam records.
- Do not map 2018-2025 Basic Mathematics questions directly to the 2023 Mathematics spine unless a reviewed crosswalk path exists.
- When a current-syllabus relationship is useful, record it as a crosswalk-derived target from data/curricula/crosswalks/csee-basic-mathematics-2005-to-mathematics-2023.json, preserving confidence, relationship type, and review status.
- Keep legacy-only topics, partial overlaps, and ambiguous current-topic matches in the relevant review queue instead of forcing a 2023 topic.
- Keep aggregate topic-frequency files explicitly tied to the mapping authority used. A legacy-first aggregate should not be described as direct 2023 Mathematics coverage.
- If a learner page cites a legacy-derived signal, label it as an unreviewed assessment signal until the original paper and marking guidance have been checked.
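
The crosswalk rule above might look like the sketch below. The entry keys (`legacy_topic_id`, `current_topic_id`, `confidence`, `relationship_type`, `review_status`) and the status value `reviewed` are assumptions; the actual crosswalk file may be shaped differently.

```python
def crosswalk_target(legacy_topic_id: str, crosswalk: list):
    """Return a crosswalk-derived 2023 target, or None if no reviewed path exists.

    All entry keys and the "reviewed" status value are hypothetical.
    """
    for entry in crosswalk:
        if entry["legacy_topic_id"] != legacy_topic_id:
            continue
        if entry["review_status"] != "reviewed":
            continue
        return {
            "target_topic_id": entry["current_topic_id"],
            "derived_via": "crosswalk",
            # Carried over unchanged so the provenance of the link survives.
            "confidence": entry["confidence"],
            "relationship_type": entry["relationship_type"],
            "review_status": entry["review_status"],
        }
    return None
```

A `None` result means the record stays mapped to the 2005 authority only and the open relationship goes to the review queue.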
Assessment Format Ingestion
- Treat official NECTA examination-format documents as assessment guidance.
- Do not replace the current syllabus spine with older examination-format topic wording.
- Extract subject format, question structure, content topics, and table-of-specification groups.
- Crosswalk exam-format groups to current syllabus topic IDs where possible.
- Queue unmatched or ambiguous groups for manual review.
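
One conservative way to sketch the crosswalk and queue steps above is exact matching on normalized titles, with everything else sent to manual review. This matching rule is an assumption chosen for illustration, not the project's method, and `crosswalk_groups` is a hypothetical name.

```python
def crosswalk_groups(groups: list, topics: dict):
    """Match exam-format group names to topic IDs by normalized title.

    `topics` maps topic ID to title. Returns (matched, queued), where
    unmatched or ambiguous groups are queued with their candidate IDs.
    """
    by_title = {}
    for topic_id, title in topics.items():
        by_title.setdefault(title.strip().lower(), []).append(topic_id)
    matched, queued = {}, []
    for group in groups:
        candidates = by_title.get(group.strip().lower(), [])
        if len(candidates) == 1:
            matched[group] = candidates[0]
        else:
            # Zero candidates is unmatched; more than one is ambiguous.
            queued.append({"group": group, "candidates": candidates})
    return matched, queued
```

Because only unique exact matches are accepted, older examination-format wording never silently overrides the current syllabus spine.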
Wiki Update
- Update wiki/index.md.
- Append to wiki/log.md.
- Confirm citations use source paths.
- Confirm pages follow the required page format.
- Run the checks in docs/validation.md.
Curriculum Versioning Check
Before ingesting any old or future syllabus, read docs/curriculum-versioning-framework.md and decide whether the source creates a new curriculum version, corrects metadata for an existing version, or only supplies assessment context.
New official syllabus versions should be added at data/curricula/{level}/{subject_slug}/{year}.json. Do not overwrite an older official syllabus version. When a syllabus replaces an earlier version, add crosswalk records only after comparing official topic wording, form placement, and scope; queue uncertain relationships for manual review.
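
The path convention and the no-overwrite rule above can be sketched as follows. Both helper names are hypothetical; the sketch relies on Python's exclusive-create file mode to refuse an existing version file.

```python
from pathlib import Path


def curriculum_version_path(root, level: str, subject_slug: str, year: int) -> Path:
    """Build data/curricula/{level}/{subject_slug}/{year}.json under a project root."""
    return Path(root) / "data" / "curricula" / level / subject_slug / f"{year}.json"


def write_new_version(path: Path, payload_text: str) -> None:
    """Create a new version file, failing if one already exists."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError instead of overwriting an older version.
    with path.open("x", encoding="utf-8") as handle:
        handle.write(payload_text)
```

Corrections to an existing version's metadata would need a separate, deliberate code path, which keeps accidental overwrites from passing as new versions.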