Structure and data format

The Open Syllabus dataset consists of three primary entity types:

  • Syllabi (syllabi.json) - Each row represents one syllabus.

  • Catalog Records (catalog.json) - The individual bibliographic records for books and articles that we aggregate from sources such as the Library of Congress, VIAF, and the Open Textbook Library. These records are clustered into works by the work_id column. For example, if a book has 5 editions, this table has 5 rows for it, all with the same work_id.

  • Matches (matches.json) - The output of the citation matching process. Each row represents one instance of a work being assigned in a specific syllabus.
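
As a rough illustration of how catalog records cluster into works, the sketch below groups rows by work_id. Only the work_id column is taken from the description above; the "title" and "year" fields and the sample values are hypothetical and stand in for whatever bibliographic fields the real records carry.

```python
from collections import defaultdict


def group_by_work(catalog_records):
    """Group catalog records (dicts) into works keyed by work_id."""
    works = defaultdict(list)
    for record in catalog_records:
        works[record["work_id"]].append(record)
    return dict(works)


# Hypothetical rows: only "work_id" is named in the documentation;
# "title" and "year" are invented for illustration.
editions = [
    {"work_id": 1, "title": "Example Book", "year": 1999},
    {"work_id": 1, "title": "Example Book (2nd ed.)", "year": 2005},
    {"work_id": 2, "title": "Another Book", "year": 2010},
]

by_work = group_by_work(editions)
for work_id, records in by_work.items():
    print(work_id, len(records))
```

Two editions of the same book end up under one work_id, which is the level at which a match to a syllabus would be counted.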

Internally, Open Syllabus uses Apache Spark for ETL, model inference, and distribution packaging. Raw datasets are distributed as JSON Lines files (one JSON object per line) produced by the standard JSON dataframe writer in Spark. For full-size datasets, we use gzip compression and split each of the three dataframes into 100 partition files, each of which contains ~1% of the full data.
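
Because each partition is a gzip-compressed JSON Lines file, it can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the function names are our own, and the "*.json.gz" file pattern and the idea that all partitions of one dataframe sit in a single local directory are assumptions about how the files are laid out after download.

```python
import glob
import gzip
import json
import os


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_dataset(directory, pattern="*.json.gz"):
    """Yield records from every partition file in a directory.

    The default pattern is an assumption; adjust it to match the
    actual partition file names.
    """
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        yield from iter_records(path)
```

Streaming records this way avoids loading a whole partition into memory, so a single file can be inspected before committing to the full download.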

The partition files can be downloaded individually or in small batches for inspection and testing. The complete dataset can be downloaded in bulk using a tool like rclone.