syllabi DataFrame¶
Each row in the syllabi data frame corresponds to a syllabus, as identified by our syllabus classifier. Each syllabus is enriched with data from our academic field classifier, our academic year classifier, and our institution classifier.
T.StructType([
    T.StructField('id', T.LongType()),
    T.StructField('syllabus_probability', T.FloatType()),
    T.StructField('date', T.StructType([
        T.StructField('year', T.IntegerType()),
        T.StructField('term', T.StringType()),
    ])),
    T.StructField('field', T.StructType([
        T.StructField('code', T.StringType()),
        T.StructField('name', T.StringType()),
        T.StructField('label_precision', T.FloatType()),
    ])),
    T.StructField('institution', T.StructType([
        T.StructField('id', T.LongType()),
        T.StructField('grid_id', T.StringType()),
        T.StructField('wikidata_id', T.StringType()),
        T.StructField('unitid', T.StringType()),
        T.StructField('name', T.StringType()),
        T.StructField('url', T.StringType()),
        T.StructField('lat', T.FloatType()),
        T.StructField('lng', T.FloatType()),
        T.StructField('country_code', T.StringType()),
        T.StructField('country_name', T.StringType()),
        T.StructField('state_code', T.StringType()),
        T.StructField('state_name', T.StringType()),
        T.StructField('city', T.StringType()),
        T.StructField('ipeds_applcn', T.IntegerType()),
        T.StructField('carnegie_control', T.ShortType()),
        T.StructField('carnegie_basic2018', T.ShortType()),
        T.StructField('carnegie_ugprofile2018', T.ShortType()),
        T.StructField('carnegie_hbcu', T.BooleanType()),
        T.StructField('carnegie_tribal', T.BooleanType()),
        T.StructField('carnegie_womens', T.BooleanType()),
        T.StructField('two_year', T.BooleanType()),
        T.StructField('four_year', T.BooleanType()),
        T.StructField('graduate', T.BooleanType()),
        T.StructField('research', T.BooleanType()),
        T.StructField('enrollment', T.IntegerType()),
    ])),
    T.StructField('extracted_metadata', T.StructType([
        T.StructField('code', T.StructType([
            T.StructField('text', T.StringType()),
            T.StructField('clean_text', T.StringType()),
            T.StructField('mean_p', T.FloatType()),
        ])),
        T.StructField('title', T.StructType([
            T.StructField('text', T.StringType()),
            T.StructField('clean_text', T.StringType()),
            T.StructField('mean_p', T.FloatType()),
        ])),
        T.StructField('date', T.StructType([
            T.StructField('text', T.StringType()),
            T.StructField('clean_text', T.StringType()),
            T.StructField('mean_p', T.FloatType()),
        ])),
        T.StructField('description', T.ArrayType(T.StructType([
            T.StructField('text', T.StringType()),
            T.StructField('clean_text', T.StringType()),
            T.StructField('mean_p', T.FloatType()),
        ]))),
    ])),
    T.StructField('text_md5', T.StringType()),
    T.StructField('mime_type', T.StringType()),
    T.StructField('text', T.StringType()),
    T.StructField('anonymized_text', T.StringType()),
])
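Several of the structs in this schema (for example date and institution) can be null, so reading a nested value takes a little care. The following is a minimal sketch, assuming each row has already been converted to a plain Python dict (for instance with PySpark's `Row.asDict(recursive=True)`). The helper name `get_nested` and the sample row are hypothetical and not part of the dataset tooling.

```python
from typing import Any, Optional


def get_nested(row: dict, path: str, default: Optional[Any] = None) -> Any:
    """Follow a dotted path such as 'institution.country_code' through nested dicts.

    Returns `default` as soon as any step along the path is missing or null.
    """
    current: Any = row
    for key in path.split("."):
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current


# Hypothetical row: the institution struct is null because no match was found.
sample_row = {
    "id": 1,
    "syllabus_probability": 0.93,
    "date": {"year": 2019, "term": "fall"},
    "institution": None,
}

year = get_nested(sample_row, "date.year")
country = get_nested(sample_row, "institution.country_code", default="unknown")
```

Here `year` is read from the populated date struct, while `country` falls back to the supplied default because the institution struct is null.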
id¶
The OS-assigned unique identifier of the syllabus.
Note
This unique identifier is not guaranteed to be consistent across versions of the dataset.
syllabus_probability¶
A number in the range [0.0, 1.0] representing the certainty that the document is a syllabus.
Every document analyzed by OS is assigned a score in the range [0.0, 1.0] by our syllabus classifier; the closer the score is to 1.0, the greater the certainty that the document is a syllabus. The classifier is trained and tested around a 0.5 threshold: every document assigned a score above 0.5 is considered a syllabus. The majority of documents in the syllabi data frame have a score greater than 0.5, but occasionally OS will manually identify certain groups of documents as syllabi, regardless of the output of the syllabus classifier.
Filtering the syllabi data frame with a threshold greater than 0.5 will return a set of documents with higher precision, at the cost of recall. Note that, as described in Table 1, a threshold of 0.5 is already very accurate, according to our tests.
Is Syllabus? | Precision | Recall | F1 score |
---|---|---|---|
Yes | 0.97 | 0.97 | 0.97 |
No | 0.97 | 0.97 | 0.97 |
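The precision/recall trade-off described above can be explored by filtering on a stricter threshold. This is a minimal sketch over plain Python dicts rather than a Spark data frame; the function name `filter_by_probability`, the 0.9 threshold, and the sample rows are illustrative assumptions.

```python
def filter_by_probability(rows, threshold=0.5):
    """Keep rows whose syllabus_probability is strictly above the threshold."""
    return [row for row in rows if row["syllabus_probability"] > threshold]


# Hypothetical rows, reduced to the two fields needed for the example.
sample_rows = [
    {"id": 1, "syllabus_probability": 0.55},
    {"id": 2, "syllabus_probability": 0.91},
    {"id": 3, "syllabus_probability": 0.99},
]

# A stricter threshold trades recall for precision.
strict = filter_by_probability(sample_rows, threshold=0.9)
strict_ids = [row["id"] for row in strict]
```

With a Spark data frame bound to a variable named `syllabi` (an assumed name), a comparable filter would be `syllabi.filter(F.col('syllabus_probability') > 0.9)`, where `F` is `pyspark.sql.functions`. Any threshold filter may also drop documents that OS identified manually and whose classifier score is low.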
date¶
Date information about when the syllabus is taught. This data is parsed from the content of extracted_metadata.date.clean_text. Null if unknown.
year¶
The academic year the syllabus was or will be taught. Null if unknown.
OS only considers years valid if they fall in the range 1990-2022.
term¶
The academic term of the syllabus.
Possible values are ‘winter’, ‘spring’, ‘summer’ and ‘fall’. Null if unknown.
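The constraints on year and term can be expressed as a small check. This is a sketch only, assuming the date struct has been converted to a Python dict (or None when the date is unknown); the names `VALID_TERMS`, `MIN_YEAR`, `MAX_YEAR` and `is_valid_date` are hypothetical.

```python
VALID_TERMS = {"winter", "spring", "summer", "fall"}
MIN_YEAR, MAX_YEAR = 1990, 2022  # range of years OS considers valid


def is_valid_date(date):
    """Check a date struct (dict or None) against the documented constraints."""
    if date is None:
        return True  # a null date simply means "unknown"
    year, term = date.get("year"), date.get("term")
    year_ok = year is None or MIN_YEAR <= year <= MAX_YEAR
    term_ok = term is None or term in VALID_TERMS
    return year_ok and term_ok
```

A check like this can serve as a sanity test after loading the data, since values outside these sets would not be expected in the date struct.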
field¶
Academic field information, as determined by the OS academic field classifier.
Table 2 reports the precision, recall and F1 score of the field classifier for each field, as measured on a test set. Fields are ordered alphabetically by name.
Academic Field | Precision | Recall | F1 score |
---|---|---|---|
Accounting | 0.96 | 0.90 | 0.93 |
Agriculture | 0.83 | 0.67 | 0.74 |
Anthropology | 0.93 | 0.86 | 0.89 |
Architecture | 0.81 | 0.75 | 0.78 |
Astronomy | 0.82 | 0.60 | 0.69 |
Atmospheric Sciences | 0.88 | 1.00 | 0.94 |
Basic Computer Skills | 0.75 | 0.76 | 0.76 |
Basic Skills | 0.77 | 0.62 | 0.69 |
Biology | 0.85 | 0.92 | 0.88 |
Business | 0.77 | 0.85 | 0.80 |
Career Skills | 0.47 | 0.32 | 0.38 |
Chemistry | 0.94 | 0.89 | 0.92 |
Chinese | 0.92 | 0.97 | 0.94 |
Classics | 0.67 | 0.67 | 0.67 |
Computer Science | 0.74 | 0.82 | 0.78 |
Construction | 0.69 | 0.53 | 0.60 |
Cosmetology | 0.98 | 0.98 | 0.98 |
Criminal Justice | 0.74 | 0.71 | 0.73 |
Criminology | 0.60 | 0.50 | 0.55 |
Culinary Arts | 0.81 | 0.97 | 0.88 |
Dance | 0.90 | 0.98 | 0.94 |
Dentistry | 1.00 | 0.97 | 0.99 |
Earth Sciences | 0.91 | 0.80 | 0.85 |
Economics | 0.93 | 0.92 | 0.92 |
Education | 0.87 | 0.87 | 0.87 |
Engineering | 0.66 | 0.70 | 0.68 |
Engineering Technician | 0.58 | 0.57 | 0.57 |
English Literature | 0.87 | 0.91 | 0.89 |
Film and Photography | 0.82 | 0.72 | 0.77 |
Fine Arts | 0.84 | 0.86 | 0.85 |
Fitness and Leisure | 0.83 | 0.75 | 0.79 |
French | 0.97 | 0.98 | 0.97 |
Geography | 0.87 | 0.89 | 0.88 |
German | 0.90 | 0.95 | 0.93 |
Health Technician | 0.71 | 0.70 | 0.71 |
Hebrew | 0.87 | 0.93 | 0.90 |
History | 0.90 | 0.91 | 0.90 |
Japanese | 0.95 | 1.00 | 0.98 |
Journalism | 0.93 | 0.99 | 0.96 |
Law | 0.82 | 0.82 | 0.82 |
Liberal Arts | 0.78 | 0.65 | 0.71 |
Library Science | 0.93 | 0.90 | 0.92 |
Linguistics | 0.97 | 0.87 | 0.92 |
Marketing | 0.82 | 0.82 | 0.82 |
Mathematics | 0.92 | 0.97 | 0.95 |
Mechanic / Repair Tech | 0.81 | 0.69 | 0.74 |
Media / Communications | 0.83 | 0.77 | 0.80 |
Medicine | 0.74 | 0.74 | 0.74 |
Military Science | 0.91 | 0.91 | 0.91 |
Music | 0.78 | 0.89 | 0.83 |
Natural Resource Management | 0.70 | 0.57 | 0.63 |
Nursing | 0.93 | 0.86 | 0.89 |
Nutrition | 0.88 | 0.81 | 0.85 |
Philosophy | 0.83 | 0.83 | 0.83 |
Physics | 0.83 | 0.89 | 0.86 |
Political Science | 0.87 | 0.93 | 0.90 |
Psychology | 0.86 | 0.92 | 0.89 |
Public Administration | 0.83 | 0.33 | 0.48 |
Public Safety | 0.71 | 0.76 | 0.74 |
Religion | 0.85 | 0.80 | 0.82 |
Sign Language | 0.98 | 1.00 | 0.99 |
Social Work | 0.90 | 0.86 | 0.88 |
Sociology | 0.92 | 0.74 | 0.82 |
Spanish | 0.94 | 0.94 | 0.94 |
Theatre Arts | 0.86 | 0.81 | 0.84 |
Theology | 0.94 | 0.81 | 0.87 |
Transportation | 0.69 | 0.53 | 0.60 |
Veterinary Medicine | 0.72 | 0.81 | 0.76 |
Women’s Studies | 0.80 | 0.91 | 0.85 |
code¶
A string containing one or more IPEDS CIP codes that together define the field.
OS’s field classifier draws heavily from the IPEDS 2010 CIP taxonomy in order to determine the academic field best associated with each syllabus. CIP codes come in two-, four- and six-digit lengths, where two-digit codes represent a discipline, four-digit codes a subdivision of that discipline, and six-digit codes a further subdivision of that subdivision. For example, the two-digit CIP code ‘01’ is the code for all ‘Agriculture, Agriculture Operations, and Related Sciences’ courses; within that, the four-digit CIP code ‘01.01’ is the subdivision for all ‘Agricultural Business and Management’ courses, and within that, ‘01.0103’ is the subdivision for all ‘Agricultural Economics’ courses.
Our field classifier is trained and tested on a subset of the CIP taxonomy that we find most useful for describing syllabi. In some cases, we have combined codes, though we have generally done so only within the same two-digit branch of the taxonomy. In those cases, the codes are separated by a forward-slash (‘/’). For example, the code ‘45.09/45.10’ is a combination of ‘International Relations and National Security Studies’ and ‘Political Science and Government’, which are both subdivisions of code ‘45’, ‘Social Sciences’; we combine them into a field that we call “Political Science” (see name).
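Since combined codes are joined with a forward slash, a consumer of this field will usually want to split them apart. The following is a minimal sketch; the function names `split_cip_codes` and `cip_families` are hypothetical, and the code assumes only the format described above.

```python
def split_cip_codes(code):
    """Split a field code such as '45.09/45.10' into its individual CIP codes."""
    return [part.strip() for part in code.split("/") if part.strip()]


def cip_families(code):
    """Return the distinct two-digit CIP disciplines covered by a field code."""
    families = []
    for cip in split_cip_codes(code):
        family = cip.split(".")[0]  # text before the first dot is the two-digit code
        if family not in families:
            families.append(family)
    return families


codes = split_cip_codes("45.09/45.10")
families = cip_families("45.09/45.10")
```

For the “Political Science” example, `codes` holds the two four-digit codes and `families` holds the single shared two-digit discipline, ‘45’.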
name¶
The name representing the field, as chosen by OS.
label_precision¶
The precision score of the field, as measured against a test set.
institution¶
The college or university where the syllabus was taught, as determined by the OS institution matcher. Null if no institution was matched.
id¶
The OS-assigned unique identifier for the institution matched to this syllabus.
Note
This unique identifier is not guaranteed to be consistent across versions of the dataset.
unitid¶
The IPEDS unique identifier of the institution. Null if unknown. Only defined for institutions within the United States.
name¶
The name of the institution.
url¶
The URL to the home webpage of the institution.
country_code¶
The ISO 3166-1 alpha-2 code of the institution country.
country_name¶
The full English name of the country corresponding to country_code.
state_code¶
The ISO 3166-2 region code of the region (state, parish, district, etc.) the syllabus was taught in. Null if unknown.
state_name¶
The English name of the state corresponding to state_code. Null if unknown.
city¶
The city the syllabus was taught in. Null if unknown.
ipeds_applcn¶
The number of applicants to the school who applied, were admitted and enrolled, per IPEDS.
This data is for the 2018-2019 school year.
Null if unknown. Only defined for institutions in the United States.
carnegie_control¶
The ‘Control’ of the institution, per Carnegie.
‘Control’ is
A classification of whether an institution is operated by publicly elected or appointed officials (public control) or by privately elected or appointed officials and derives its major source of funds from private sources (private control).
(Source: https://surveys.nces.ed.gov/ipeds/VisGlossaryAll.aspx.)
This field is equal to
1, if a public institution;
2, if a private not-for-profit institution;
3, if a private for-profit institution.
Null if unknown. Only defined for institutions in the United States.
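For display purposes, the numeric values can be mapped to labels. This is a sketch based solely on the three values listed above; the names `CARNEGIE_CONTROL_LABELS` and `control_label`, and the "unrecognized" fallback, are assumptions made for illustration.

```python
CARNEGIE_CONTROL_LABELS = {
    1: "public",
    2: "private not-for-profit",
    3: "private for-profit",
}


def control_label(value):
    """Translate a carnegie_control value into a readable label.

    Returns None for a null value (unknown, or an institution outside the
    United States) and 'unrecognized' for any value not listed above.
    """
    if value is None:
        return None
    return CARNEGIE_CONTROL_LABELS.get(value, "unrecognized")
```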
carnegie_basic2018¶
The Carnegie Basic Classification for 2018.
Possible values for this field are described in Table 3.
Null if unknown. Only defined for institutions in the United States.
Value | Meaning |
---|---|
0 | (Not classified) |
1 | Associate’s Colleges: High Transfer-High Traditional |
2 | Associate’s Colleges: High Transfer-Mixed Traditional/Nontraditional |
3 | Associate’s Colleges: High Transfer-High Nontraditional |
4 | Associate’s Colleges: Mixed Transfer/Career & Technical-High Traditional |
5 | Associate’s Colleges: Mixed Transfer/Career & Technical-Mixed Traditional/Nontraditional |
6 | Associate’s Colleges: Mixed Transfer/Career & Technical-High Nontraditional |
7 | Associate’s Colleges: High Career & Technical-High Traditional |
8 | Associate’s Colleges: High Career & Technical-Mixed Traditional/Nontraditional |
9 | Associate’s Colleges: High Career & Technical-High Nontraditional |
10 | Special Focus Two-Year: Health Professions |
11 | Special Focus Two-Year: Technical Professions |
12 | Special Focus Two-Year: Arts & Design |
13 | Special Focus Two-Year: Other Fields |
14 | Baccalaureate/Associate’s Colleges: Associate’s Dominant |
15 | Doctoral Universities: Very High Research Activity |
16 | Doctoral Universities: High Research Activity |
17 | Doctoral/Professional Universities |
18 | Master’s Colleges & Universities: Larger Programs |
19 | Master’s Colleges & Universities: Medium Programs |
20 | Master’s Colleges & Universities: Small Programs |
21 | Baccalaureate Colleges: Arts & Sciences Focus |
22 | Baccalaureate Colleges: Diverse Fields |
23 | Baccalaureate/Associate’s Colleges: Mixed Baccalaureate/Associate’s |
24 | Special Focus Four-Year: Faith-Related Institutions |
25 | Special Focus Four-Year: Medical Schools & Centers |
26 | Special Focus Four-Year: Other Health Professions Schools |
27 | Special Focus Four-Year: Engineering Schools |
28 | Special Focus Four-Year: Other Technology-Related Schools |
29 | Special Focus Four-Year: Business & Management Schools |
30 | Special Focus Four-Year: Arts, Music & Design Schools |
31 | Special Focus Four-Year: Law Schools |
32 | Special Focus Four-Year: Other Special Focus Institutions |
33 | Tribal Colleges |
carnegie_ugprofile2018¶
The Carnegie Undergraduate Profile Classification for 2018.
Possible values for this field are described in Table 4.
Null if unknown. Only defined for institutions in the United States.
Value | Meaning |
---|---|
0 | Not classified (Exclusively Graduate) |
1 | Two-year, higher part-time |
2 | Two-year, mixed part/full-time |
3 | Two-year, medium full-time |
4 | Two-year, higher full-time |
5 | Four-year, higher part-time |
6 | Four-year, medium full-time, inclusive, lower transfer-in |
7 | Four-year, medium full-time, inclusive, higher transfer-in |
8 | Four-year, medium full-time, selective, lower transfer-in |
9 | Four-year, medium full-time, selective, higher transfer-in |
10 | Four-year, full-time, inclusive, lower transfer-in |
11 | Four-year, full-time, inclusive, higher transfer-in |
12 | Four-year, full-time, selective, lower transfer-in |
13 | Four-year, full-time, selective, higher transfer-in |
14 | Four-year, full-time, more selective, lower transfer-in |
15 | Four-year, full-time, more selective, higher transfer-in |
carnegie_hbcu¶
Whether or not the institution is a historically black college or university, per Carnegie. Null if unknown. Only defined for institutions in the United States.
carnegie_tribal¶
Whether or not the institution is a tribal college or university, per Carnegie. Null if unknown. Only defined for institutions in the United States.
carnegie_womens¶
Whether or not the institution is a women’s college, per Carnegie. Null if unknown. Only defined for institutions in the United States.
two_year¶
Whether or not the institution is primarily a two-year institution.
This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [1, 14].
four_year¶
Whether or not the institution is a four-year institution.
This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 32].
graduate¶
Whether or not the institution is a graduate institution.
This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 20].
research¶
Whether or not the institution is an R1 or R2 research institution.
This value is derived from the carnegie_basic2018 classification. It is true when carnegie_basic2018 is in the range [15, 16].
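The four derived flags (two_year, four_year, graduate and research) follow directly from the documented ranges of carnegie_basic2018. The sketch below restates those ranges in code; the function name `derive_flags` is hypothetical, and the handling of a null classification is an assumption, since the documentation does not say what the flags are in that case.

```python
def derive_flags(basic2018):
    """Derive the four institution flags from a carnegie_basic2018 value."""
    if basic2018 is None:
        # Assumption: the flags are left null when the classification is unknown.
        return dict.fromkeys(("two_year", "four_year", "graduate", "research"))
    return {
        "two_year": 1 <= basic2018 <= 14,
        "four_year": 15 <= basic2018 <= 32,
        "graduate": 15 <= basic2018 <= 20,
        "research": 15 <= basic2018 <= 16,
    }


r1_flags = derive_flags(15)      # Doctoral Universities: Very High Research Activity
tribal_flags = derive_flags(33)  # Tribal Colleges fall outside every range
```

Note that, under these ranges, values 0 (not classified) and 33 (Tribal Colleges) produce false for all four flags.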
enrollment¶
The total number of students enrolled at the institution, as aggregated from several data sources. This data is the most recent available, usually the 2018-2019 school year. Null if unknown.
extracted_metadata¶
Structured text fields extracted from the syllabus by a token-level sequence tagging model. (We use DistilBERT via the transformers library, fine-tuned on a manually annotated corpus of ~7,000 training documents.)
code¶
The identifier that appears in the institutional course catalog. Generally (but not always) a combination of a department code and course number. E.g., CS224n or 9.520.
text¶
The raw text span extracted from the document.
clean_text¶
A minimally cleaned version of the raw text, often more suitable for display. We normalize encoding and remove extraneous whitespace characters.
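As a rough illustration of this kind of cleaning, the sketch below applies Unicode normalization and collapses runs of whitespace. It is not the OS implementation: the choice of NFKC normalization, the whitespace rule, and the function name `clean_span` are all assumptions.

```python
import re
import unicodedata


def clean_span(text):
    """Normalize a raw text span: unify Unicode forms and collapse whitespace."""
    # Assumption: NFKC stands in for "normalize encoding".
    normalized = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace (including newlines) into a single space.
    return re.sub(r"\s+", " ", normalized).strip()


cleaned = clean_span("  CS\u00a0224n \n")
```

In this example the non-breaking space and the trailing newline are both treated as ordinary whitespace, leaving a single-spaced string suitable for display.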
mean_p¶
The average probability mass assigned to the predicted tag for each token in the match. This can be used in a comparative sense as a rough indication of the “confidence” of the model on the prediction.
Note
This is experimental, and may be removed in a future release.
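The averaging described for mean_p can be written out as follows. This is a sketch under the assumption that the per-token probabilities of the predicted tag are available as a list of floats; the function name `mean_probability` and the sample values are hypothetical.

```python
def mean_probability(token_probs):
    """Average the probability assigned to the predicted tag across a span's tokens."""
    if not token_probs:
        raise ValueError("a span must contain at least one token")
    return sum(token_probs) / len(token_probs)


# Hypothetical probabilities for a three-token span.
span_probs = [0.98, 0.91, 0.87]
mean_p = mean_probability(span_probs)
```

Because the value is an average, a single low-probability token in a long span has limited effect, which is one reason to treat mean_p as a comparative signal rather than a calibrated confidence.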
title¶
The name of the course. E.g., Natural Language Processing with Deep Learning, Statistical Learning Theory and Applications.
text¶
(See above.)
clean_text¶
(See above.)
mean_p¶
(See above.)
text_md5¶
The MD5 checksum of the text.
Note
This field is only available in full-text versions of the dataset.
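A checksum like this is typically used to detect documents with identical text. The sketch below shows how such a value could be computed with Python's standard library. The UTF-8 encoding is an assumption (the documentation does not state how the text is encoded before hashing), so the result is not guaranteed to match the stored text_md5; the function name `compute_text_md5` is hypothetical.

```python
import hashlib


def compute_text_md5(text):
    """Return the hexadecimal MD5 digest of a string."""
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two documents with identical text produce the same digest.
digest_a = compute_text_md5("Course syllabus: Introduction to Statistics")
digest_b = compute_text_md5("Course syllabus: Introduction to Statistics")
is_duplicate = digest_a == digest_b
```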
mime_type¶
The MIME type of the document the text was extracted from.
We extract text from HTML, PDF, DOC, DOCX and RTF files.
Note
This field is only available in full-text versions of the dataset.
text¶
The extracted text of the syllabus.
Note
This field is only available in full-text versions of the dataset.
anonymized_text¶
The extracted text of the syllabus, anonymized to remove person names, email addresses and phone numbers.
Note
This field is only available in full-text versions of the dataset.