.. image:: images/os-logo-green.png
   :width: 400px

|

Dataset Documentation
#####################

Welcome to version |version| of the `Open Syllabus `_ dataset! Open Syllabus collects and analyzes one of the largest collections of college course syllabi in the world. At a glance:

- **7,533,836** syllabi (**22.6 billion** words of full-text data).
- Collected from **8,648** colleges and universities in **152** countries. Coverage is currently deepest in the US, UK, Australia, and Canada, which together account for about 75% of the corpus.
- Spanning roughly **2010 → present**. Our earliest documents date back to the late 1990s, but most of the data is concentrated in the last 10 years.

Syllabi are incredibly rich documents. They often contain:

- long-format descriptions of the course material;
- lists of books, articles, and web resources assigned in the class;
- descriptions of learning objectives, grading criteria, and assignments; and
- chronologically ordered sequences of readings and topics.

But there's also very little standardization in how these elements are organized and presented, which makes it difficult to analyze them systematically at scale. To help with this, Open Syllabus uses a suite of machine learning models to extract structured metadata from the documents. As of version |version|, we provide:

- **Institution** -- The college or university where the course was taught. We also include additional information from `IPEDS `_, `Grid `_, `Wikidata `_ and `Carnegie `_ to provide metadata about a wide range of institutional characteristics.
- **Course code** -- The identifier that appears in the institutional course catalog. Generally (but not always) a combination of a department code and a course number. E.g., :code:`CS224n` or :code:`9.520`.
- **Course title** -- The name of the course. E.g., :code:`Natural Language Processing with Deep Learning`, :code:`Statistical Learning Theory and Applications`.
- **Year + semester** -- The calendar year and semester in which the course was taught. E.g., :code:`Spring 2017`, :code:`Fall 2019`.
- **Field** -- The department in which the course was taught. Because departments are structured very differently from one institution to the next, we classify courses into a curated set of ~70 fields derived from the `CIP codes `_ published by the U.S. Department of Education.
- **Course description** -- Many syllabi contain a one- or two-paragraph description of the course content. We extract this from the document using a token-level sequence tagging model, trained on ~7,000 hand-labeled documents. These paragraphs provide a clean full-text signal about the course material, free of boilerplate text and webpage menu content.
- **Book and article assignments** -- The set of books and articles assigned in the course. We use a fuzzy entity-linking process to match references in the syllabi against a background database of ~150 million known books and articles, in a way that accommodates the significant differences in citation formats and practices across disciplines. In version |version|, this yields 57 million individual book and article assignments, spread across 5.2 million unique titles that appear at least once in the corpus.

If you're new to the project, check out our web-facing views of the data:

Open Syllabus Explorer
----------------------

A comprehensive view of the most frequently assigned books and articles in the corpus, sliced by author, institution, field, country, and publisher.

.. image:: images/explorer-screenshot.png
   :target: https://opensyllabus.org/

Open Syllabus Co-assignment Galaxy
----------------------------------

An interactive visualization of the underlying "co-assignment graph" -- the network of relationships among books and articles formed by aggregating over all pairs of titles that appear together in the same courses.

.. image:: images/galaxy-screenshot.png
   :target: https://galaxy.opensyllabus.org/

Contents
--------

.. toctree::
   :maxdepth: 1

   structure
   syllabi
   catalog
   matches
   disclaimer
   attributions
   changelog