Changelog
Changes in 2.2
Syllabi data frame
Introduce ~1.8 million more syllabi to the underlying dataset.
Update the syllabus, field and date classifiers for performance improvements.
Reorganize the syllabi schema into a series of nested groups based on the matching and classification routines run over the syllabi. Fields directly related to the “raw” syllabus, like syllabus_probability, are available at the top level. Then there are the following groups:
The date group contains output from the OS date classifier.
The field group contains output from the OS academic field classifier.
The institution group contains output from the OS institution matcher.
The extracted_metadata group contains sub-groups representing output from several OS course classifiers. Current sub-groups are code, title, date and description.
Drop the field_score field. The OS academic field classifier no longer provides meaningful or reliable output for this field.
Update institutions:
Update all data sources to recent (2018 or 2019) versions.
Reorganize the institution matcher fields (now all nested under institution, per the above) for clarity and consistency with other fields:
All field names are now lower-case.
Fields that represent info copied over from another dataset are now prefixed with the name of that dataset. For example, the APPLCN data from IPEDS is now ipeds_applcn. Fields that are not prefixed with the name of a dataset are generally aggregated from information about multiple datasets.
Add institution.enrollment.
Add institution.term.
Add institution.wikidata_id.
Change policy on removing academic field classification. In previous versions of the OS dataset, certain academic fields considered poor quality (based on performance on a test set) were nulled after classification. With this version of the dataset, OS is no longer nulling academic field classifications. As a consequence, every syllabus is assigned an academic field, and several more academic fields are available. Users can decide which fields they trust; a new value, field.label_precision, has been added to help with that decision.
Change policy on manually marking documents as syllabi. OS occasionally marks certain groups of documents as syllabi even if they weren’t identified by the syllabus classifier as being syllabi. In previous versions of the dataset, documents that bypassed the syllabus classifier in this way were assigned a syllabus_probability of 1.0, overwriting whatever syllabus_probability was assigned to them by the syllabus classifier. In this version of the dataset, syllabus_probability values are never overwritten. This means that some syllabi in the syllabi data frame have a syllabus_probability of less than 0.5.
Drop language.
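Because academic field labels are no longer nulled and syllabus_probability values are no longer overwritten, quality filtering is now left to the user. The following is a minimal sketch of such a filter in Python, using plain dictionaries in place of data frame rows. The sample records, the id and field.name keys, and both cutoff values are hypothetical; only syllabus_probability and field.label_precision come from the schema described above.

```python
# Hypothetical rows shaped like the 2.2 syllabi schema. All values are invented,
# and the "id" and field "name" keys are assumptions made for illustration.
syllabi = [
    {"id": 1, "syllabus_probability": 0.97,
     "field": {"name": "Biology", "label_precision": 0.91}},
    {"id": 2, "syllabus_probability": 0.42,
     "field": {"name": "History", "label_precision": 0.88}},
    {"id": 3, "syllabus_probability": 0.99,
     "field": {"name": "Dance", "label_precision": 0.35}},
]

# Assumed cutoffs; choose values that suit your own analysis.
MIN_SYLLABUS_PROBABILITY = 0.5
MIN_LABEL_PRECISION = 0.8


def is_trusted(record):
    """Keep a row only if both the syllabus and field scores clear the cutoffs."""
    field = record.get("field") or {}
    precision = field.get("label_precision")
    return (
        record["syllabus_probability"] >= MIN_SYLLABUS_PROBABILITY
        and precision is not None
        and precision >= MIN_LABEL_PRECISION
    )


trusted = [r for r in syllabi if is_trusted(r)]
print([r["id"] for r in trusted])
```

With these invented rows, the second record is excluded by its syllabus_probability and the third by its field.label_precision.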
Matches data frame
Introduce ~25 million more matches to the underlying dataset.
Drop m1 and m2.
Catalog data frame
Reorganize the catalog schema:
Rename match_count to work_match_count.
Drop the title array. Each catalog record now contains only a single title and subtitle field.
Treat normalized citation data as the primary citation data. Drop un-normalized citation data and remove normalized_ from field names.
Rename normalized_title to title.
Rename normalized_subtitle to subtitle.
Rename normalized_publisher to publisher.
Rename normalized_authors to authors.
Remove position from author arrays.
Add the source and source_id fields, which describe catalog record provenance.
Drop matcher_pairs.
Re-introduce the publication_type column, with a broader set of possible values.
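Code written against the 2.1 catalog schema may need its field names updated. The sketch below applies the renames and drops listed above to a single record held as a Python dictionary. It is an illustration, not an official migration tool: the function name is hypothetical, and the assumption that un-normalized values were stored under the bare names (title, publisher, authors) is inferred from the rename list rather than confirmed by the schema documentation.

```python
# Renames listed in the 2.2 changelog for the catalog data frame.
RENAMES = {
    "match_count": "work_match_count",
    "normalized_title": "title",
    "normalized_subtitle": "subtitle",
    "normalized_publisher": "publisher",
    "normalized_authors": "authors",
}


def migrate_catalog_record(old):
    """Return a copy of a 2.1-style record using 2.2 field names (hypothetical helper)."""
    targets = set(RENAMES.values())
    new = {}
    for key, value in old.items():
        if key == "matcher_pairs":
            continue  # dropped in 2.2
        if key in targets:
            continue  # assumed un-normalized value, superseded by normalized_*
        new[RENAMES.get(key, key)] = value
    if "authors" in new:
        # 2.2 removes "position" from each entry in the author array.
        new["authors"] = [
            {k: v for k, v in author.items() if k != "position"}
            for author in new["authors"]
        ]
    return new


# Invented example record.
old_record = {
    "match_count": 3,
    "title": ["A Title", "A Title: With Subtitle"],
    "normalized_title": "A Title",
    "normalized_authors": [{"forenames": "Jane", "keyname": "Doe", "position": 0}],
    "matcher_pairs": [],
    "year": 1999,
}
new_record = migrate_catalog_record(old_record)
print(new_record)
```

The sketch does not populate source, source_id or publication_type, since those values cannot be derived from a 2.1 record.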
Changes in 2.1
Syllabi data frame
Introduce ~1 million more syllabi to the underlying dataset.
Matches data frame
Introduce ~8 million more matches to the underlying dataset.
Catalog data frame
Greatly expand the set of bibliographic datasets that are used as sources for the catalog. The underlying database of work expressions increased to ~150M, up from ~65M in v2.0.
Reorganize and simplify the catalog schema, to better accommodate the wider range of input sources. The details of the changes are best examined in the schema documentation, but as a summary of changes:
title is now a list of known titles and subtitles.
Content related to journal articles (or other content published in a “container”) is now nested in an article field.
Records contain a list of urls instead of a single string url.
Renamed:
publication_year to year
authors.given_name to authors.forenames
authors.surname to authors.keyname
journal_title to article.venue
first_page to article.page_start
last_page to article.page_end
Dropped:
language
original_language
medium
series
translator
journal_isbns (rolled into isbns)
journal_issns (rolled into issns)
edition_number
publication_type
matcher_pairs.logp
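The 2.1 renames move some flat fields into nested ones, which is easier to see in code. The following is a rough Python sketch that upgrades a 2.0-style catalog record, held as a dictionary, to the 2.1 names. The function name and the sample record are hypothetical, and the sketch assumes that authors is a list of dictionaries and that the ISBN and ISSN fields are lists. It does not convert title into a list or remove matcher_pairs.logp, because the changelog does not give enough detail to do so safely.

```python
# Flat 2.0 fields that move under the nested "article" field in 2.1.
ARTICLE_RENAMES = {
    "journal_title": "venue",
    "first_page": "page_start",
    "last_page": "page_end",
}
# Renames applied to each entry in the authors list.
AUTHOR_RENAMES = {"given_name": "forenames", "surname": "keyname"}
# Top-level fields dropped in 2.1.
DROPPED = {
    "language", "original_language", "medium", "series",
    "translator", "edition_number", "publication_type",
}


def upgrade_record(old):
    """Return a copy of a 2.0-style record using 2.1 field names (hypothetical helper)."""
    new = {k: v for k, v in old.items() if k not in DROPPED}
    if "publication_year" in new:
        new["year"] = new.pop("publication_year")
    if "url" in new:
        url = new.pop("url")
        new["urls"] = [url] if url else []
    # Collect journal-related fields into the nested article group.
    article = {ARTICLE_RENAMES[k]: new.pop(k) for k in list(new) if k in ARTICLE_RENAMES}
    if article:
        new["article"] = article
    if "authors" in new:
        new["authors"] = [
            {AUTHOR_RENAMES.get(k, k): v for k, v in author.items()}
            for author in new["authors"]
        ]
    # journal_isbns and journal_issns are rolled into isbns and issns.
    for src, dst in (("journal_isbns", "isbns"), ("journal_issns", "issns")):
        if src in new:
            new[dst] = list(new.get(dst, [])) + list(new.pop(src))
    return new


# Invented example record.
upgraded = upgrade_record({
    "publication_year": 2001,
    "url": "http://example.org/x",
    "journal_title": "Some Journal",
    "first_page": "1",
    "last_page": "9",
    "authors": [{"given_name": "Jane", "surname": "Doe"}],
    "isbns": ["a"],
    "journal_isbns": ["b"],
    "language": "en",
})
print(upgraded)
```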
Changes in 2.0
Syllabi data frame
Add a heuristic to the date classifier that nulls clearly incorrect year values.
Rename all cases of Timor-Leste in country_name to East Timor.
Add the Philippines to the global country blacklist. All syllabi identified as being from schools in the Philippines are no longer included in the dataset.
Improve coverage of the institution matcher. In version 1.9, ~1.4 million syllabi had no institution matched to them. In 2.0, that number is ~180 thousand.
Matches data frame
Add the pvalid column.
Improve quality of matches with a validation classifier over the document contexts around the raw keyword matches, trained on 12k hand-labeled examples.
Catalog data frame
Drop the display_priority field. This ranking was originally meant to, per version 1.9 documentation, represent the “‘quality’ or ‘completeness’ of the metadata on each record”, where the top ranked record was “considered by OS to be the best ‘representative’ record for the work cluster”. OS no longer uses such a ranking, and instead selects representative citation metadata for works – such as the data displayed on the Open Syllabus Explorer – based on aggregations across work clusters.
Improve the quality of the normalized_title, normalized_subtitle and normalized_authors fields.
Greatly improve the quality of the normalized_publisher field.