Parsing ANU course prerequisites into a graph
Parsing free-text prerequisite descriptions into a graph of courses and degree requirements.
A web app I built during my studies at the Australian National University to help students plan their degrees. The interesting part wasn't the React frontend — it was getting the data out of the university's curriculum website in the first place.
Live site · Scraper source · API source · Frontend source
Problem
ANU publishes thousands of courses, specialisations, and degree programs across two surfaces: API endpoints that expose search metadata, and the Programs and Courses website that contains the richer human-readable requirement text. Each course page contains free-text prerequisite descriptions like:
A pass in COMP1100 or COMP1130, and one of COMP1110, COMP1140, COMP2300.
Incompatible with COMP1730.

Students cross-reference these pages by hand when they plan their degree, checking prerequisites, eligibility, and which courses satisfy which requirements. There was no structured graph dataset, and this was pre-LLM, so there was no shortcut for pulling logical expressions out of natural language. I wanted to build one.
System Architecture
ANU curriculum site
↓
Scrapy spiders (seeded from ANU search API data)
↓
HTML extractors (BeautifulSoup + indentation/table parsing)
↓
NLP parser (spaCy entity ruler + POS/dependency hints → logical JSON)
↓
Neo4j (courses, prerequisites, program requirements)
↓
GraphQL API
↓
React app (degree planner + interactive graph)

The crawler walks every course and program page, the parser turns the prose into an AST of boolean expressions over course codes, and the result is loaded into Neo4j as a property graph: nodes are courses and programs, edges are prerequisites, co-requisites, and "satisfies" relationships into program requirements.
This project was really three repositories: a scraper/ETL repo, a graph API repo, and the React explorer. The scraper was the hard part.
Data Pipeline
Scraping source data
The scraper starts from ANU's own API snapshots for classes, programs, majors, minors, and specialisations. Those API records provide stable IDs like COMP1100, academic plan codes, and subplan codes. The Scrapy spiders then use those IDs to visit canonical Programs and Courses pages:
SpiderClass /course/{CourseCode}
SpiderProgram /program/{AcademicPlanCode}
SpiderSpecialisation /{major|minor|specialisation}/{SubPlanCode}

For courses, SpiderClass extracts the summary fields from the page chrome, including units, subject, college, academic career, convener, co-taught courses, intro text, and the raw requisite block. The raw requisite text is cleaned before parsing: course codes are space-separated, all-caps program abbreviations in parentheses are removed, & becomes and, parenthesis spacing is normalized, and special cases like R&D are rewritten so they do not confuse the parser.
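The cleaning pass is simple string surgery. A minimal sketch of those rules — the function name, the parenthesised example, and the exact R&D rewrite are illustrative, not the scraper's actual code:

import re

# A rough sketch of the requisite-text cleaning pass described above.
def clean_requisite_text(text: str) -> str:
    # Rewrite special cases first so the "&" rule below can't mangle them
    text = text.replace("R&D", "RnD")
    # Space-separate course codes glued to preceding words or punctuation
    text = re.sub(r"(?<=[a-z,.;:])(?=[A-Z]{4}[0-9]{4})", " ", text)
    # Drop all-caps program abbreviations in parentheses, e.g. "(BIT)"
    text = re.sub(r"\(\s*[A-Z]{2,}\s*\)", "", text)
    # "&" becomes "and"; normalize parenthesis spacing
    text = text.replace("&", " and ")
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    # Collapse the whitespace the rewrites leave behind
    return re.sub(r"\s{2,}", " ", text).strip()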
For programs and specialisations, the scraper has to recover hierarchy from HTML layout. ANU requirement pages use a mix of paragraphs, lists, tables, and visual indentation. The scraper converts the page into (text, indentation) pairs, then normalizes pixel padding and margins into indentation ranks. That rank becomes the tree structure for nested requirements like "24 units from list A, of which 6 units must come from list B."
Tables get flattened into course rows, unordered lists inherit one indentation level deeper than the previous element, and inconsistent indentation between adjacent course rows is corrected when both lines contain course-code entities.
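The rank-to-tree step is the core of the layout parsing. A sketch of the idea with a stack of open branches — the names and node shape here are illustrative, not the scraper's own types:

# Sketch: turn (text, indentation-rank) rows into a nested tree.
def build_tree(rows: list[tuple[str, int]]) -> list[dict]:
    root: list[dict] = []
    stack: list[tuple[int, list[dict]]] = [(-1, root)]  # (rank, children)
    for text, rank in rows:
        node = {"text": text, "children": []}
        # Close open branches until we find this row's parent
        while stack[-1][0] >= rank:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((rank, node["children"]))
    return root

tree = build_tree([
    ("24 units from list A", 0),
    ("6 units from list B", 1),
    ("COMP2300", 2),
])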
Tokenization and entities
The parser uses spaCy, but the key piece is a custom EntityRuler. Instead of relying on generic NER, the scraper injects deterministic entity patterns for the domain:
CLASS -> [A-Z]{4}[0-9]{4}
PROGRAM -> Bachelor/Master/Doctor/Graduate Certificate/Diploma of ...

That means sentences are first tokenized into normal spaCy tokens, then enriched with domain entities:
To enrol in this course, students must be studying a Master of Computing and have completed COMP6250 or COMP8260.
CLASS: COMP6250, COMP8260
PROGRAM: Master of Computing
VERBS: studying, completed

The parser also maps verbs into normalized edge conditions:
completed -> completed
studying -> studying
enrolled -> enrolled
incompatible -> incompatible
request -> permission
permission -> permission

That normalization is what lets different prose patterns become graph relationships later.
The named entity step was domain-specific. Generic NER would not reliably know that COMP6250 and COMP8260 are course codes or that Master of Computing is a degree program, so I added spaCy EntityRuler patterns before the built-in NER pipeline. Those patterns produced custom CLASS and PROGRAM spans that the parser could treat as graph-node references.
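In spaCy 3 terms, that looks roughly like the following. This is a sketch; the real pattern set was larger (Graduate Certificate, Diploma, and so on) and tuned over many pages:

import spacy

nlp = spacy.load("en_core_web_sm")
# Insert the domain EntityRuler before the built-in NER so its spans win
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # Course codes: four letters then four digits, e.g. COMP6250
    {"label": "CLASS",
     "pattern": [{"TEXT": {"REGEX": r"^[A-Z]{4}[0-9]{4}$"}}]},
    # Degree programs: "Master of Computing", "Bachelor of Science", ...
    {"label": "PROGRAM",
     "pattern": [{"TEXT": {"IN": ["Bachelor", "Master", "Doctor"]}},
                 {"LOWER": "of"},
                 {"IS_TITLE": True, "OP": "+"}]},
])

doc = nlp("Students studying a Master of Computing must have completed COMP6250.")
print([(ent.text, ent.label_) for ent in doc.ents])
# expected: [('Master of Computing', 'PROGRAM'), ('COMP6250', 'CLASS')]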
Relationship extraction
The parser's job was to extract relationships from sentence structure: which courses were prerequisites, which programs a student had to be enrolled in, and which courses were incompatible. spaCy provided the tokenization, part-of-speech tags, dependency parse, and entity spans used to make those decisions.
For development, I used spaCy's displaCy renderer to inspect dependency parses while tuning the relationship-extraction rules:
from spacy import displacy

# nlp is the pipeline configured above, with the domain entity patterns added
doc = nlp("To enrol in this course, students must be studying a Master of Computing and have completed COMP6250 or COMP8260.")
displacy.render(doc, style="dep")

That sentence has a program relationship and a nested course relationship: students must be studying Master of Computing, and must have completed either COMP6250 or COMP8260. Seeing the dependency structure made it easier to decide whether each conjunction should split the sentence into graph relationships or remain inside the course list.
The parser did not blindly trust the dependency tree. It used the parse as evidence for relationship-extraction heuristics:
- if both sides of a conjunction had a verb and a named entity, the conjunction was likely a real boolean split;
- if a conjunction appeared after a verb but before any entity, as in "completed or enrolled in COMP6710", the parser treated it as two conditions over the same entity;
- if an entity span covered punctuation, the parser avoided splitting inside the entity name;
- if spaCy produced a surprising attachment, the parser kept the raw requisite text so the UI could still show the original sentence.
The POS tags were just as important as the dependency arcs. The parser looked for verbs to infer conditions, coordinating conjunctions to find boolean structure, and the custom CLASS / PROGRAM entity spans to decide what graph node each condition should point to.
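As a concrete example, the "verb and entity on both sides" check falls straight out of spaCy's token and entity APIs. A simplified sketch — the real rules carried more state than this:

# Sketch: treat an "and"/"or" token as a boolean split only when each side
# carries its own verb and its own CLASS/PROGRAM entity.
def is_boolean_split(doc, cc_index: int) -> bool:
    def has_verb_and_entity(span) -> bool:
        has_verb = any(tok.pos_ == "VERB" for tok in span)
        has_entity = any(span.start <= ent.start and ent.end <= span.end
                         for ent in doc.ents)
        return has_verb and has_entity
    return (has_verb_and_entity(doc[:cc_index])
            and has_verb_and_entity(doc[cc_index + 1:]))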
Course requisite parser
The hard part was turning sentences like "A pass in COMP1100 or COMP1130, and one of COMP1110, COMP1140, COMP2300" into something a machine can reason about, e.g.:
AND(
OR(COMP1100, COMP1130),
OR(COMP1110, COMP1140, COMP2300)
)

The parser is recursive. Given a sentence span, parse_requisite_from_sent scans for split points, especially and and or, and decides whether the conjunction is a real boolean boundary or just part of a smaller phrase.
The most important split heuristics, sketched in code after this list, were:
- Punctuation boundaries — split on and/or when punctuation immediately before or after it suggests two clauses, unless that punctuation is part of a named entity.
- Entity + verb on both sides — split when the left side already has a recognized entity and verb, and the right side also has its own entity and verb.
- Semicolon priority — treat semicolon-separated clauses as stronger candidates for top-level boolean splits.
- Mirrored verb clauses — handle phrases like "completed or be currently enrolled in COMP6710" by reusing the left verb for the right-side entities.
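Putting the heuristics together, the recursion looks roughly like this. The sketch works over plain token lists, splits at the first separator it finds, and ignores quantifiers like "one of" (which the real parser turned into OR groups); parse_requisite_from_sent worked over spaCy spans with the richer checks above:

import re

COURSE = re.compile(r"[A-Z]{4}[0-9]{4}")

# Sketch of the recursive descent: try the strongest split first, recurse
# on both sides, and bottom out in a condition over the course codes found.
def parse_requisite(tokens: list[str]):
    for sep, op in ((";", "AND"), ("and", "AND"), ("or", "OR")):
        if sep in tokens:
            i = tokens.index(sep)
            return {op: [parse_requisite(tokens[:i]),
                         parse_requisite(tokens[i + 1:])]}
    return {"condition": "completed",
            "classes": [t for t in tokens if COURSE.fullmatch(t)]}

sent = "A pass in COMP1100 or COMP1130 , and one of COMP1110 , COMP1140 , COMP2300"
ast = parse_requisite(sent.split())
# -> AND of OR(COMP1100, COMP1130) and a leaf over COMP1110/COMP1140/COMP2300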
For example, the mirrored-verb case:

To enrol in this course you must have completed or be currently enrolled in COMP6710 OR COMP6730.

becomes an OR between two conditions over the same course set:
{
"operator": {
"OR": [
{
"condition": "completed",
"classes": ["COMP6710", "COMP6730"]
},
{
"condition": "enrolled",
"classes": ["COMP6710", "COMP6730"]
}
]
}
}

Incompatibility text is handled as its own condition. If a sentence lemma contains incompatible, the parser emits an incompatible condition rather than treating the mentioned courses as prerequisites.
Things that made this hard:
- Mixed quantifiers — "one of", "any two of", "all of", "a major in X" all imply different boolean structures.
- Implicit precedence — natural language doesn't bracket its conjunctions; "A or B and C" is ambiguous until you parse the comma and clause structure.
- Cross-references — "a major in Computer Science" points to another program, not a course; following these required a second pass after the courses had been loaded.
- Inconsistent formatting across faculties — every department wrote prerequisites in their own dialect.
The output is far from perfect, but it is useful on the common cases and always keeps the original text as prerequisites_raw when the structured interpretation is uncertain.
Program requirement parser
Program requirements are a different parsing problem from course prerequisites. They are less like sentences and more like nested lists:
144 units from completion of the following:
48 units from completion of compulsory courses
COMP1100
COMP1110
24 units from completion of courses from the following list
COMP2300
COMP2400

The scraper turns visual indentation into a recursive Requirement tree:
{
"description": "24 units from completion of courses from the following list",
"units": 24,
"items": [
{ "id": "COMP2300", "name": "..." },
{ "id": "COMP2400", "name": "..." }
]
}

The parser also recognizes majors, minors, and specialisations inside requirement lists. It cleans names, maps words like major, minor, and specialisation to ANU subplan types (MAJ, MIN, SPC), and resolves them back to subplan IDs from the API seed data.
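A sketch of that resolution step, assuming a name-to-ID lookup built from the API seed data; the function name, regex, and dictionary shape are illustrative:

import re

SUBPLAN_TYPES = {"major": "MAJ", "minor": "MIN", "specialisation": "SPC"}

# Sketch: map "a major in Computer Science" back to a subplan reference.
def resolve_subplan(text: str, subplan_ids: dict[str, str]):
    m = re.search(r"(major|minor|specialisation) in ([A-Z][\w &-]+)", text, re.I)
    if m is None:
        return None
    kind, name = m.group(1).lower(), m.group(2).strip().rstrip(".")
    return {"type": SUBPLAN_TYPES[kind],
            "name": name,
            "id": subplan_ids.get(name)}  # None if the seed data has no match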
Graph Model
Schema
What ends up in Neo4j after the crawler and parser are done:
type Course {
id: String! # e.g. "COMP1100"
name: String!
subject_code: String # "COMP"
course_number: String # "1100"
units: Int # credit units
description: String
subject: String
college: String
offered_by: String
academic_career: String # undergraduate | graduate
course_convener: String
prerequisites_raw: String # original plaintext, kept verbatim
prerequisites: [Course!]! # parsed forward edges
unlocks: [Course!]! # reverse edges (this course as a prereq)
co_taught: [Course!]! # cross-listed courses
}
type Specialisation {
id: String!
name: String!
type: String! # major | minor | specialisation
units: Int
classes: [Course!]!
requirements: [Requirement!]!
}
type Requirement {
units: Int # credit units this requirement satisfies
description: String
classes: [Course!]!
requirements: [Requirement!]! # nested sub-requirements
}

A few choices in the shape are deliberate:
- prerequisites_raw is kept alongside the parsed edges. The parser is imperfect, so the original sentence stays in the graph as ground truth. When the planner can't make sense of an edge, the UI falls back to displaying the raw string verbatim.
- prerequisites and unlocks are both materialised. They're the same edges viewed from opposite ends. Storing both saves a graph traversal in the common queries — "what does this course require?" and "what does this course let me into?".
- Requirement is recursive. ANU programs aren't flat lists of courses; they're trees of constraints ("24 units from category A, of which at most 12 units from sub-category A1, ..."). The recursive requirements field mirrors that structure directly rather than flattening it into a denormalised list.
Graph construction
The ETL step reads three scraped JSON files:
data/scraped/classes.json
data/scraped/programs.json
data/scraped/specialisations.json

and creates Neo4j nodes with uniqueness constraints on id:
Course
Program
Specialisation
Requirement

Course requisites become typed relationships:
completed -> (:Course)-[:Prerequisite]->(:Course)
incompatible -> (:Course)-[:Incompatible]->(:Course)
studying -> (:Course)-[:Enrolled]->(:Program)
enrolled -> (:Course)-[:Enrolled]->(:Program)
permission -> (:Course)-[:Unknown]->(...)

Program and specialisation requirements become Requirement nodes connected recursively:
(:Program)-[:Requirement]->(:Requirement)-[:Requirement]->(:Course)
(:Program)-[:Requirement]->(:Requirement)-[:Requirement]->(:Specialisation)

That graph shape preserves both machine-queryable edges and the human-readable descriptions attached to each requirement.
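With the official Neo4j Python driver, the load step looks roughly like this. The labels and relationship types match the mapping above; the connection details, helper names, and synthetic requirement IDs are illustrative:

import itertools
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # illustrative
_req_ids = itertools.count()

EDGE_TYPES = {"completed": "Prerequisite", "incompatible": "Incompatible",
              "studying": "Enrolled", "enrolled": "Enrolled"}

def write_requisite(session, src: str, dst: str, condition: str) -> None:
    # One parsed condition becomes one typed edge between nodes
    session.run(
        f"MERGE (a:Course {{id: $src}}) MERGE (b:Course {{id: $dst}}) "
        f"MERGE (a)-[:{EDGE_TYPES[condition]}]->(b)",
        src=src, dst=dst)

def write_requirement(session, parent_label: str, parent_id: str, req: dict) -> None:
    rid = f"req-{next(_req_ids)}"  # Requirement nodes get synthetic IDs
    session.run(
        f"MATCH (p:{parent_label} {{id: $pid}}) "
        "MERGE (r:Requirement {id: $rid}) "
        "SET r.description = $desc, r.units = $units "
        "MERGE (p)-[:Requirement]->(r)",
        pid=parent_id, rid=rid,
        desc=req.get("description"), units=req.get("units"))
    for course in req.get("items", []):
        session.run(
            "MATCH (r:Requirement {id: $rid}) MATCH (c:Course {id: $cid}) "
            "MERGE (r)-[:Requirement]->(c)",
            rid=rid, cid=course["id"])
    for child in req.get("requirements", []):  # recurse into sub-requirements
        write_requirement(session, "Requirement", rid, child)

with driver.session() as session:
    # Uniqueness constraint on id (Neo4j 4.4+ syntax)
    session.run("CREATE CONSTRAINT IF NOT EXISTS "
                "FOR (c:Course) REQUIRE c.id IS UNIQUE")
    write_requisite(session, "COMP2300", "COMP1100", "completed")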
Product Experience
Degree planner
Pick a program and the planner expands every requirement, every course inside each requirement, and every prerequisite chain. Tick the courses you've completed and the graph updates to show how much of the program is satisfied. The graph also highlights courses that count toward more than one requirement at once — a small thing that saves real time when you're trying to fit a major and a minor into the same degree.

Interactive course cartography
The full prerequisite graph for the university is dense and hard to read flat. To give it some structure I ran PageRank over the prerequisite edges and used the score as the node radius — so foundational courses that many later courses depend on appear visually larger.

It's a cheap trick (PageRank doesn't know anything about pedagogy), but it gives a surprisingly useful map of the curriculum: introductory courses bubble up, terminal electives shrink, and the spine of each major becomes legible at a glance.
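With the edges exported, the sizing itself is a few lines of networkx; the edge list and scaling constants below are illustrative:

import networkx as nx

# Sketch: edges point from a course to its prerequisite, so PageRank mass
# flows toward foundational courses; the radius scaling is arbitrary.
G = nx.DiGraph()
G.add_edges_from([("COMP2300", "COMP1100"),
                  ("COMP3600", "COMP2300"),
                  ("COMP3600", "COMP1110")])
scores = nx.pagerank(G)
radii = {course: 6 + 200 * score for course, score in scores.items()}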
Stack
- Python — crawler, parser, ETL into Neo4j
- Scrapy / BeautifulSoup — page crawling and HTML extraction
- spaCy — POS tagging and dependency parsing for the prerequisite extractor
- Neo4j — graph store; Cypher queries power the program planner
- GraphQL — API layer over the Neo4j graph
- React — frontend, deployed on Vercel
- PageRank — over the prerequisite edge set for node sizing
Reflection
With modern LLMs, the parsing layer collapses into a few prompts and a structured-output schema, and I'd skip POS tagging entirely. The interesting design questions move up a level — schema design for the graph, how to represent ambiguity, how to keep the dataset current as the university updates its catalog. The graph itself, and the planner built on top of it, would stay much the same.