Data Model

Core Node Types

  • source
  • subject
  • level
  • curriculum
  • form
  • competence
  • topic_hub
  • topic
  • exam_format
  • concept
  • exam_paper
  • exam_question
  • marking_scheme
  • solution

Versioned Curriculum Files

The long-term curriculum source of truth should be versioned by level, subject, and syllabus year:

  • data/curricula/{level}/{subject_slug}/{year}.json: one official curriculum version for one subject.
  • data/curricula/index.json: aggregate registry for discoverability, current pointers, historical versions, aliases, and review status.
  • data/curriculum_map.json: current Mathematics compatibility view. Do not grow this file into the multi-subject authority.

Example paths:

  • data/curricula/csee/mathematics/2023.json
  • data/curricula/csee/physics/2023.json
  • data/curricula/csee/kiswahili/2023.json
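
The path convention above can be sketched as a small helper. This is a minimal illustration, not part of the project: the function name curriculum_path and the constant CURRICULA_ROOT are assumptions introduced here.

```python
from pathlib import PurePosixPath

# Root folder taken from the paths listed above.
CURRICULA_ROOT = PurePosixPath("data/curricula")


def curriculum_path(level: str, subject_slug: str, year: int) -> PurePosixPath:
    """Build the path of one curriculum version file (hypothetical helper)."""
    return CURRICULA_ROOT / level / subject_slug / f"{year}.json"


print(curriculum_path("csee", "physics", 2023))
# data/curricula/csee/physics/2023.json
```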

Curriculum versions are append-only. When a new syllabus arrives, add a new {year}.json record and update the current pointer in index.json; do not overwrite the older version. This keeps old exam mappings, archived wiki pages, and syllabus-change comparisons historically honest.
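
The append-only rule could be enforced roughly as follows. This is a sketch under stated assumptions: the index is held as a dict, its curricula value is a list of objects that each carry an id, and register_version is a hypothetical helper name.

```python
def register_version(index: dict, subject_id: str, version: dict) -> dict:
    """Return a new index with `version` appended and the current pointer moved.

    Assumes index["curricula"] is a list of {"id": ...} records and
    index["current_curricula"] maps subject IDs to version IDs.
    """
    known_ids = {entry["id"] for entry in index["curricula"]}
    if version["id"] in known_ids:
        # Never overwrite an existing version record.
        raise ValueError(f"{version['id']} already exists; versions are append-only")
    return {
        **index,
        "curricula": [*index["curricula"], version],
        "current_curricula": {**index["current_curricula"], subject_id: version["id"]},
    }
```

The function returns a new dict and leaves the input untouched, so the older version record stays in the curricula list while only the pointer changes.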

Curriculum Version Shape

Each data/curricula/{level}/{subject_slug}/{year}.json file should include:

  • id: Stable version ID, for example curriculum-csee-physics-2023.
  • level: Level slug, for example csee.
  • subject_id and subject_slug: Canonical subject identity.
  • title: Official curriculum title.
  • year: Official syllabus year as a number.
  • language: Primary language used by the syllabus record.
  • source_ids: Source records that define the curriculum.
  • review_status and extraction_status: Values from the Review Status Values and Extraction Status Values lists.
  • forms: Version-scoped forms, with IDs like form-csee-physics-2023-form-i.
  • competences: Version-scoped competence records.
  • topic_hubs: Navigation groupings for the subject version.
  • topic_registry: Official syllabus topic records for that subject version.

Version-scoped IDs should include level, subject, and year unless a node is deliberately timeless. This prevents collisions across subjects and protects older curriculum meaning when new syllabuses arrive.
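
A minimal sketch of how version-scoped IDs and the required top-level keys might be checked. The helper names scoped_id and missing_version_keys are assumptions; the ID pattern is inferred from the examples in this document.

```python
# Top-level keys listed in the curriculum version shape above.
REQUIRED_VERSION_KEYS = {
    "id", "level", "subject_id", "subject_slug", "title", "year", "language",
    "source_ids", "review_status", "extraction_status",
    "forms", "competences", "topic_hubs", "topic_registry",
}


def scoped_id(node_type, level, subject_slug, year, local_slug=None):
    """Join node type, level, subject, and year (plus an optional local slug)."""
    parts = [node_type, level, subject_slug, str(year)]
    if local_slug:
        parts.append(local_slug)
    return "-".join(parts)


def missing_version_keys(record: dict) -> list:
    """Return the required keys that a curriculum version record lacks."""
    return sorted(REQUIRED_VERSION_KEYS - set(record))


print(scoped_id("curriculum", "csee", "physics", 2023))
# curriculum-csee-physics-2023
print(scoped_id("form", "csee", "physics", 2023, "form-i"))
# form-csee-physics-2023-form-i
```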

Curriculum Index Shape

data/curricula/index.json should include:

  • schema_version: Version of the index schema itself.
  • current_curricula: Pointers from canonical subject IDs to the active curriculum version, for example subject-physics -> curriculum-csee-physics-2023.
  • subjects: Canonical subject records, aliases, language notes, and historical labels.
  • curricula: Known curriculum versions with path, year, source IDs, review status, and public/private availability policy.
  • legacy_aliases: Search and migration aliases that should not rewrite old records.

Subject aliases support discovery and continuity, not identity replacement. For example, Civics, Civic Education, and Uraia may route learners toward the current subject-historia-ya-tanzania-na-maadili page, while older records that were originally Civics should remain historically labeled.
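
Alias lookup can be read-only, as in the sketch below. It assumes subjects is a list of records with an id and an aliases list; that layout and the name resolve_subject are illustrative assumptions, not the confirmed index schema.

```python
from typing import Optional


def resolve_subject(index: dict, label: str) -> Optional[str]:
    """Map a search label to a canonical subject ID without changing any record."""
    wanted = label.strip().casefold()
    for subject in index["subjects"]:
        names = [subject["id"], *subject.get("aliases", [])]
        if any(wanted == name.casefold() for name in names):
            return subject["id"]
    return None


index = {
    "subjects": [
        {
            "id": "subject-historia-ya-tanzania-na-maadili",
            "aliases": ["Civics", "Civic Education", "Uraia"],
        }
    ]
}
print(resolve_subject(index, "uraia"))
# subject-historia-ya-tanzania-na-maadili
```

Because the lookup only returns an ID, older records that were originally labeled Civics keep their historical label.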

Topic And Concept Identity

Use separate node types for official syllabus topics and durable learning concepts:

  • topic: A curriculum-specific syllabus item in one subject and year, for example topic-csee-physics-2023-measurement.
  • concept: A reusable learning idea that can span subjects, years, languages, and exams, for example concept-measurement.

A topic answers: "Where does the official syllabus place this item?" A concept answers: "What learning idea is this about?" Exam questions may eventually map to both layers:

  • question_assesses_topic for a specific syllabus-version target.
  • question_assesses_concept for a reusable learning idea.

Do not use unversioned topic-* IDs for new multi-subject records. Existing Mathematics IDs may remain through compatibility aliases or migration mappings until downstream question maps are upgraded.
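
One way to flag unversioned topic IDs in new records is a pattern check like the one below. The regular expression is an assumption inferred from the example topic-csee-physics-2023-measurement: it expects a level slug without hyphens, a subject slug, a four-digit year, and a topic slug. The ID topic-measurement is a hypothetical unversioned ID used only for illustration.

```python
import re

# topic-{level}-{subject_slug}-{year}-{topic_slug}; the subject slug may contain hyphens.
VERSIONED_TOPIC_ID = re.compile(
    r"topic-(?P<level>[a-z0-9]+)"
    r"-(?P<subject>[a-z0-9]+(?:-[a-z0-9]+)*?)"
    r"-(?P<year>\d{4})"
    r"-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)"
)


def is_version_scoped_topic_id(topic_id: str) -> bool:
    """Return True when the ID carries level, subject, and year segments."""
    return VERSIONED_TOPIC_ID.fullmatch(topic_id) is not None


print(is_version_scoped_topic_id("topic-csee-physics-2023-measurement"))  # True
print(is_version_scoped_topic_id("topic-measurement"))  # False
```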

Legacy Curriculum Topic Registry

data/curriculum_map.json.topic_registry is the current Mathematics compatibility registry. New multi-subject curriculum records should use data/curricula/{level}/{subject_slug}/{year}.json.topic_registry with version-scoped IDs.

Each topic-registry entry must include:

  • id: Stable topic ID, prefixed with topic-.
  • slug: Stable page slug, lowercase with hyphens.
  • title: Official syllabus topic title.
  • form and form_id: Human and graph identifiers for the form.
  • competence and competence_id: Syllabus competence mapping.
  • source and source_id: Official source path and graph ID.
  • review_status: Conservative review status. Official syllabus topics use official.
  • page_path: Markdown page path under wiki/topics/.
  • hub_category and hub_page_path: Hub grouping used for navigation.
  • sequence: Stable ordering within the curriculum spine.
  • summary: Short scope note for the page.
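
The field rules above could be checked with a small validator. This is a sketch: registry_entry_problems is a hypothetical name, and allowing digits in slugs is an assumption beyond "lowercase with hyphens".

```python
import re

# Fields every topic-registry entry must include, per the list above.
REQUIRED_TOPIC_FIELDS = (
    "id", "slug", "title", "form", "form_id", "competence", "competence_id",
    "source", "source_id", "review_status", "page_path",
    "hub_category", "hub_page_path", "sequence", "summary",
)
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def registry_entry_problems(entry: dict) -> list:
    """Return human-readable problems; an empty list means the entry passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_TOPIC_FIELDS if name not in entry]
    if not str(entry.get("id", "")).startswith("topic-"):
        problems.append("id must be prefixed with topic-")
    if not SLUG_PATTERN.fullmatch(str(entry.get("slug", ""))):
        problems.append("slug must be lowercase with hyphens")
    if not str(entry.get("page_path", "")).startswith("wiki/topics/"):
        problems.append("page_path must be under wiki/topics/")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a registry audit report every issue in one pass.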

The canonical learning node for a syllabus topic is its wiki/topics/ page. Form pages, subject pages, and hub pages are navigation maps: they organize the official sequence, readiness signals, and recommended next pages, but they should point back to canonical topic pages rather than duplicating topic content.

Form/Class Metadata

Entries in data/curriculum_map.json.form_topics may include learner-navigation metadata in addition to the official form and topic list:

  • aliases: Common names that refer to the same official form/class level.
  • page_path: Human-readable form/class learning-map page.
  • readiness.chapter_ready_topics: Topic IDs with chapter-level learner pages.
  • readiness.recommended_next_topics: Topic IDs that naturally follow current chapter-ready pages.
  • readiness.exam_signal_topics: Topic IDs with useful but still reviewable exam signals.

Official syllabus naming remains authoritative. Aliases support search, retrieval, and future tutor personalization.

Core Edge Types

  • source_supports_page
  • source_supports_curriculum
  • source_supports_topic
  • subject_at_level
  • subject_has_form
  • subject_has_competence
  • competence_has_specific_competence
  • subject_has_topic_hub
  • subject_has_topic
  • form_has_topic
  • topic_in_hub
  • topic_supports_competence
  • topic_has_concept
  • topic_has_learning_activity
  • form_has_alias
  • form_has_readiness_signal
  • source_defines_exam_format
  • exam_format_applies_to_subject
  • exam_format_maps_to_topic
  • exam_format_partially_maps_to_topic
  • exam_paper_has_question
  • question_assesses_topic
  • question_assesses_concept
  • solution_answers_question
  • marking_scheme_validates_solution

Recommended future curriculum-version edge types:

  • curriculum_replaces_curriculum: a newer curriculum version is the active successor to an older version.
  • topic_supersedes_topic: a newer syllabus topic replaces an older syllabus topic.
  • topic_same_as: two curriculum-version topics represent the same official learning target.
  • topic_split_into: one older topic becomes multiple newer topics.
  • topic_merged_into: multiple older topics become one newer topic.
  • topic_moved_form: a topic remains recognizable but shifts form placement.
  • topic_equivalent_to_concept: a curriculum topic maps to a durable concept node.

Recommended future topic-to-topic edge types:

  • topic_prerequisite_of: one canonical topic supports later learning in another canonical topic.
  • topic_related_to: two canonical topics are useful siblings, extensions, or comparisons without a strict prerequisite relationship.
  • topic_cross_subject_bridge: a canonical topic in one subject helps explain, apply, or interpret a canonical topic in another subject.

Use these edges between canonical topic nodes. Do not model form pages as the learning target for prerequisite, related-topic, or cross-subject relationships.
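
The endpoint rule can be sketched as a check on each edge. The edge layout used here (a dict with type, from, and to keys) and the function name are assumptions for illustration only.

```python
TOPIC_TO_TOPIC_EDGE_TYPES = frozenset({
    "topic_prerequisite_of",
    "topic_related_to",
    "topic_cross_subject_bridge",
})


def topic_edge_problems(edge: dict) -> list:
    """Check that a topic-to-topic edge connects two canonical topic nodes."""
    problems = []
    if edge.get("type") not in TOPIC_TO_TOPIC_EDGE_TYPES:
        problems.append("not a topic-to-topic edge type")
    for end in ("from", "to"):
        # Form, subject, and hub pages are navigation maps, not learning targets.
        if not str(edge.get(end, "")).startswith("topic-"):
            problems.append(f"{end} must be a canonical topic node")
    return problems
```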

Review Status Values

  • official
  • unreviewed
  • ai_checked
  • human_reviewed
  • pilot_validated
  • needs_manual_review

Extraction Status Values

  • not_extracted
  • valid_extraction
  • schema_invalid
  • empty_sections
  • parse_error
  • runtime_error
  • needs_manual_review
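
Both value lists map naturally onto enumerations, which reject values outside the lists at load time. The class names below are assumptions; the member values are copied from the two lists above.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    OFFICIAL = "official"
    UNREVIEWED = "unreviewed"
    AI_CHECKED = "ai_checked"
    HUMAN_REVIEWED = "human_reviewed"
    PILOT_VALIDATED = "pilot_validated"
    NEEDS_MANUAL_REVIEW = "needs_manual_review"


class ExtractionStatus(str, Enum):
    NOT_EXTRACTED = "not_extracted"
    VALID_EXTRACTION = "valid_extraction"
    SCHEMA_INVALID = "schema_invalid"
    EMPTY_SECTIONS = "empty_sections"
    PARSE_ERROR = "parse_error"
    RUNTIME_ERROR = "runtime_error"
    NEEDS_MANUAL_REVIEW = "needs_manual_review"


# Looking up a known value succeeds; an unknown value raises ValueError.
status = ReviewStatus("official")
```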

Assessment Format Files

  • exam_format_map.json: machine-readable examination rubric.
  • basic_math_format_topics_2022.json: Basic Mathematics content topics and table-of-specification groups from the 2022 format booklet.
  • exam_format_topic_crosswalk_2022.jsonl: mappings from 2022 examination-format topic groups to current 2023 Mathematics topic IDs.
  • review_queue_exam_format_2022.jsonl: crosswalk terms that need manual review because they do not cleanly match the current syllabus topic registry.

Question Mapping Files

  • question_map_2021_2025.jsonl: one flattened answerable leaf-question record per 2021-2025 Basic Mathematics Paper 1 JSON item.
  • topic_frequency_2021_2025.json: aggregate topic, form, hub, and exam-format signals from the five-year mapping pilot.
  • exponents_exam_signal_audit_2021_2025.json: focused review artifact correcting the Exponents learner-page signal after excluding degree-notation false positives.
  • review_queue_question_mapping.jsonl: question mappings that need manual review because they are low-confidence, multi-topic, figure/table-dependent, missing marks, unmapped, or otherwise uncertain.

Each question-map record should include:

  • question_id: Stable question identifier, for example csee_041_2024_p1_q14_b_ii.
  • exam_year, level, subject, subject_code, and paper.
  • source_file and source_pdf.
  • section_title, section_name, question_number, part_path, and part_label.
  • top_marks and leaf_marks, when available.
  • stem_text, leaf_text, and full_text.
  • tables, figures, has_table, and has_figure.
  • primary_topic_id, primary_topic_label, secondary_topic_ids, form, and hub.
  • mapping_confidence, mapping_status, review_status, and mapping_notes.
  • exam_format_group_id and exam_format_group_label, where mappable.
  • review_reasons.
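
The example question_id suggests a composition from level, subject code, exam year, paper, question number, and part path. The sketch below encodes that reading; the segment order is inferred from the single example csee_041_2024_p1_q14_b_ii and is an assumption, as is the name build_question_id.

```python
def build_question_id(level, subject_code, exam_year, paper, question_number, part_path=()):
    """Compose a question ID from its segments (inferred convention)."""
    segments = [
        level,
        subject_code,
        str(exam_year),
        f"p{paper}",
        f"q{question_number}",
        *part_path,  # e.g. ("b", "ii") for part (b)(ii)
    ]
    return "_".join(segments)


print(build_question_id("csee", "041", 2024, 1, 14, ("b", "ii")))
# csee_041_2024_p1_q14_b_ii
```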

Mapping Confidence Values

  • 0.90: direct wording match, such as logarithm, matrix, round off, or Venn diagram.
  • 0.75: clear mathematical skill match even when wording differs.
  • 0.55: likely mapping but context-dependent or multi-topic.
  • Below 0.55: leave the question unmapped or mark the mapping tentative, and send it to review.

Any figure-dependent or table-dependent question should appear in the review queue even when it has a plausible topic mapping.
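
The routing rules above can be sketched as a function that collects review reasons for one question-map record. The reason labels and the name review_reasons_for are hypothetical, and treating has_figure and has_table as the dependence signal is an assumption.

```python
REVIEW_THRESHOLD = 0.55  # mappings below this confidence go to review


def review_reasons_for(record: dict) -> list:
    """Return the reasons a question-map record belongs in the review queue."""
    reasons = []
    confidence = record.get("mapping_confidence")
    if record.get("primary_topic_id") is None:
        reasons.append("unmapped")
    elif confidence is None or confidence < REVIEW_THRESHOLD:
        reasons.append("low_confidence")
    # Figure- or table-dependent questions are queued even with a plausible mapping.
    if record.get("has_figure"):
        reasons.append("figure_dependent")
    if record.get("has_table"):
        reasons.append("table_dependent")
    return reasons
```

A record with confidence 0.90 and a figure still returns one reason, which matches the rule that figure-dependent questions always appear in the queue.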

Curriculum Crosswalk Records

Crosswalk data lives under data/curricula/crosswalks/ and compares curriculum-version topics without rewriting either side.

Each crosswalk should include the source curriculum ID, the target curriculum ID, source IDs, a review status, and a list of relationship records. Each relationship record carries from_topic_id, an optional to_topic_id, relationship_type, review_status, and mapping_note.

Allowed relationship types are documented in docs/curriculum-versioning-framework.md: same_or_near_same, partial_overlap, split_into, merged_into, moved_form, renamed_from, legacy_only, and new_only.
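
A relationship record could be checked against the allowed types as sketched below. The helper name is an assumption, and the check deliberately covers only what this section states: a required from_topic_id and a relationship_type from the allowed list.

```python
ALLOWED_RELATIONSHIP_TYPES = frozenset({
    "same_or_near_same", "partial_overlap", "split_into", "merged_into",
    "moved_form", "renamed_from", "legacy_only", "new_only",
})


def relationship_problems(relationship: dict) -> list:
    """Return problems for one crosswalk relationship record."""
    problems = []
    if not relationship.get("from_topic_id"):
        problems.append("from_topic_id is required")
    if relationship.get("relationship_type") not in ALLOWED_RELATIONSHIP_TYPES:
        problems.append("relationship_type is not an allowed value")
    # to_topic_id is optional, so its absence is not reported.
    return problems
```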