.. image:: images/os-logo-green.png
  :width: 400px

|

Dataset Documentation
#####################

Welcome to version |version| of the `Open Syllabus <https://opensyllabus.org>`_
dataset! Open Syllabus collects and analyzes one of the largest collections of
college course syllabi in the world. At a glance:

- **7,533,836** syllabi (**22.6 billion** words of full-text data).

- Collected from **8,648** colleges and universities in **152** countries.
  Coverage is currently deepest in the US, UK, Australia, and Canada, which
  together account for about 75% of the corpus.

- Spanning roughly **2010 → present**. Our earliest documents date back to the
  late 1990s, but most of the data is concentrated in the last 10 years.

Syllabi are incredibly rich documents. They often contain long-form
descriptions of the course material; lists of books, articles, and web resources
assigned in the class; descriptions of learning objectives, grading criteria,
and assignments; and chronologically ordered sequences of readings and topics.

But there is very little standardization in how these elements are organized
and presented, which makes them difficult to analyze systematically at scale.
To help with this, Open Syllabus uses a suite of machine learning models to
extract structured metadata from the documents. As of version |version|, we
provide:

- **Institution** -- The college or university where the course was taught. We
  also draw on `IPEDS <https://nces.ed.gov/ipeds/>`_, `GRID
  <https://www.grid.ac/>`_, `Wikidata <https://www.wikidata.org/>`_, and
  `Carnegie <http://carnegieclassifications.iu.edu/>`_ for metadata about a wide
  range of institutional characteristics.

- **Course code** -- The identifier that appears in the institutional course
  catalog. Generally (but not always) a combination of a department code and
  course number. E.g., :code:`CS224n` or :code:`9.520`.

- **Course title** -- The name of the course. E.g., :code:`Natural Language
  Processing with Deep Learning`, :code:`Statistical Learning Theory and
  Applications`.

- **Year + semester** -- The calendar year and semester in which the course was
  taught. E.g., :code:`Spring 2017`, :code:`Fall 2019`.

- **Field** -- The department in which the course was taught. Because there are
  significant differences in how departments are structured at different
  institutions, we classify into a curated set of ~70 fields derived from the `CIP
  codes <https://nces.ed.gov/ipeds/cipcode/browse.aspx?y=55>`_ published by the
  U.S. Department of Education.

- **Course description** -- Many syllabi contain a description of the course
  content, typically one or two paragraphs long. We extract this from the
  document using a token-level sequence tagging model, trained on ~7,000
  hand-labeled documents. These paragraphs provide a clean full-text signal
  about the course material, free of boilerplate text and webpage menu content.

- **Book and article assignments** -- The set of books and articles assigned in
  the course. We use a fuzzy entity-linking process to identify references in
  the syllabi to a background database of ~150 million known books and articles,
  in a way that accommodates the significant differences in citation formats and
  practices across disciplines. In version |version|, this results in
  57 million individual book and article assignments in the syllabi, spread
  across 5.2 million unique titles that appear at least once in the corpus.
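
To give a rough intuition for the entity-linking step described in the last item
above, here is a toy sketch. It is not the Open Syllabus pipeline: it uses
:code:`difflib` from the Python standard library, and the :code:`catalog`
records, the :code:`link_citation` helper, and the 0.8 threshold are all
hypothetical choices made for illustration.

.. code-block:: python

   import difflib


   def normalize(text):
       """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
       cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
       return " ".join(sorted(cleaned.split()))


   def link_citation(citation, catalog, threshold=0.8):
       """Return the key of the best-matching catalog record, or None."""
       query = normalize(citation)
       best_key, best_score = None, 0.0
       for key, record in catalog.items():
           candidate = normalize(f"{record['author']} {record['title']}")
           score = difflib.SequenceMatcher(None, query, candidate).ratio()
           if score > best_score:
               best_key, best_score = key, score
       return best_key if best_score >= threshold else None


   # Hypothetical records; the real background database is far larger.
   catalog = {
       "rawls-justice": {"author": "John Rawls", "title": "A Theory of Justice"},
       "hobbes-leviathan": {"author": "Thomas Hobbes", "title": "Leviathan"},
   }

   match = link_citation("Rawls, John. A Theory of Justice.", catalog)

Sorting the tokens lets "Rawls, John" and "John Rawls" compare as equal, which
is one simple way to tolerate differences in citation format. A linear scan like
this one would probably not be practical against ~150 million records without
some candidate-retrieval step in front of it.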

If you're new to the project, check out our web-facing views onto the data:

Open Syllabus Explorer
----------------------

A comprehensive view of the most frequently assigned books and articles in the
corpus, sliced by author, institution, field, country, and publisher.

.. image:: images/explorer-screenshot.png
  :target: https://opensyllabus.org/

Open Syllabus Co-assignment Galaxy
----------------------------------

An interactive visualization of the underlying "co-assignment graph" -- the
network of relationships among books and articles formed by aggregating over all
pairs of titles that appear together in the same courses.

.. image:: images/galaxy-screenshot.png
  :target: https://galaxy.opensyllabus.org/
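
As a minimal sketch of the aggregation described above, the following counts
how often each pair of titles appears in the same course. The
:code:`co_assignment_counts` helper and the toy :code:`courses` data are
hypothetical, and the sketch makes no claim about how the Galaxy itself is
built.

.. code-block:: python

   from collections import Counter
   from itertools import combinations


   def co_assignment_counts(courses):
       """Count how often each unordered pair of titles shares a course."""
       counts = Counter()
       for titles in courses:
           # De-duplicate and sort so each pair has one canonical ordering.
           for pair in combinations(sorted(set(titles)), 2):
               counts[pair] += 1
       return counts


   # Hypothetical toy data: each inner list holds the titles assigned in one course.
   courses = [
       ["Republic", "Leviathan", "The Prince"],
       ["Republic", "Leviathan"],
       ["The Prince", "Leviathan"],
   ]

   edges = co_assignment_counts(courses)

Each key in :code:`edges` is a pair of titles and each value is the number of
courses that assign both, which can be read as a weighted edge in the graph.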

Contents
--------

.. toctree::
   :maxdepth: 1

   structure
   syllabi
   catalog
   matches
   disclaimer
   attributions
   changelog