catalog DataFrame

Each row in the catalog DataFrame corresponds to a bibliographic record about a work: its title, authorship, and publication details. In addition, as part of our citation matching process, we cluster together catalog records that represent the same work (see work_id).

from pyspark.sql import types as T

T.StructType([
    T.StructField('id', T.LongType()),
    T.StructField('source', T.StringType()),
    T.StructField('source_id', T.StringType()),
    T.StructField('work_id', T.LongType()),
    T.StructField('title', T.StringType()),
    T.StructField('subtitle', T.StringType()),
    T.StructField('publisher', T.StringType()),
    T.StructField('authors', T.ArrayType(T.StructType([
        T.StructField('forenames', T.StringType()),
        T.StructField('keyname', T.StringType()),
    ]))),
    T.StructField('year', T.IntegerType()),
    T.StructField('isbns', T.ArrayType(T.StringType())),
    T.StructField('issns', T.ArrayType(T.StringType())),
    T.StructField('doi', T.StringType()),
    T.StructField('urls', T.ArrayType(T.StringType())),
    T.StructField('open_access', T.BooleanType()),
    T.StructField('publication_type', T.StringType()),
    T.StructField('article', T.StructType([
        T.StructField('venue', T.StringType()),
        T.StructField('volume', T.StringType()),
        T.StructField('issue', T.StringType()),
        T.StructField('page_start', T.StringType()),
        T.StructField('page_end', T.StringType()),
        T.StructField('abstract', T.StringType()),
    ])),
    T.StructField('work_match_count', T.IntegerType()),
    T.StructField('work_isbns', T.ArrayType(T.StringType())),
    T.StructField('work_issns', T.ArrayType(T.StringType())),
    T.StructField('work_dois', T.ArrayType(T.StringType())),
])

id

The OS-assigned unique identifier of the catalog record.

Note

This unique identifier is not guaranteed to be consistent across versions of the dataset.

source

A slug that identifies the source bibliographic database that the record was extracted from. For example: viaf, doab, gutenberg.

source_id

The original identifier of the bibliographic record in the source catalog. For example, War and Peace is book 2600 in the Project Gutenberg catalog, so the value here is 2600.
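Because source_id is a string in the schema, a lookup for the example above would use the string "2600" rather than the integer 2600. The following is a minimal sketch in plain Python, assuming catalog rows have already been collected into dictionaries. The helper name index_by_source and the sample rows are hypothetical and not part of the dataset.

```python
def index_by_source(records):
    """Map (source, source_id) pairs to their catalog record."""
    return {(r["source"], r["source_id"]): r for r in records}


# Hypothetical rows; only the fields used here are shown.
records = [
    {"id": 1, "source": "gutenberg", "source_id": "2600", "title": "War and Peace"},
    {"id": 2, "source": "doab", "source_id": "example-id", "title": "An Example Monograph"},
]

by_source = index_by_source(records)

# The key uses the string "2600"; the integer 2600 would not match.
record = by_source[("gutenberg", "2600")]
print(record["title"])
```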

work_id

An identifier that clusters records that represent different copies or editions of the same work, as identified by OS.

For example, if we have 100 different editions of The Iliad in the catalog, all 100 records are assigned a common work_id, making it possible to operate on the group of records as a unit. If we only have a single bibliographic record for a given publication (generally the case for resources published in the last ~10 years), the work_id will be unique to that record.

To operate on unique works, select a single record for each distinct work_id in the catalog.
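One way to select a single record per work is sketched below in plain Python, assuming the rows have been collected into dictionaries. The function name one_record_per_work and the sample rows are hypothetical. In PySpark, df.dropDuplicates(["work_id"]) gives a similar result, although which record it keeps for each work is not deterministic.

```python
def one_record_per_work(records):
    """Keep the first record encountered for each distinct work_id."""
    representatives = {}
    for record in records:
        # setdefault only stores the record if the work_id is not yet present.
        representatives.setdefault(record["work_id"], record)
    return list(representatives.values())


# Hypothetical rows: two editions of one work and one standalone record.
records = [
    {"id": 1, "work_id": 10, "title": "The Iliad"},
    {"id": 2, "work_id": 10, "title": "The Iliad"},
    {"id": 3, "work_id": 11, "title": "War and Peace"},
]

unique_works = one_record_per_work(records)
print([r["id"] for r in unique_works])
```

Keeping the first record is an arbitrary choice. If a particular edition should represent the work (for example the one with the most complete metadata), sort the records accordingly before deduplicating.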

title

The title of the work.

subtitle

The subtitle of the work. Null if unknown.

publisher

The publisher of the work. Null if unknown.

authors

A list of authors of the work.

forenames

The given names of the author (both first and middle names).

keyname

The surname of the author (if a person) or an organization name. Used by the citation matcher as the minimal lexical representation of the author.
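As an illustration of how the two author fields fit together, the sketch below formats each author as "keyname, forenames". It assumes that forenames may be null or empty when the author is an organization, which the schema allows but this page does not state. The function name format_author and the sample record are hypothetical.

```python
def format_author(author):
    """Render an author struct as 'keyname, forenames', skipping missing parts."""
    parts = [author.get("keyname"), author.get("forenames")]
    return ", ".join(part for part in parts if part)


# Hypothetical record with one person and one organization as authors.
record = {
    "title": "An Example Report",
    "authors": [
        {"forenames": "Leo", "keyname": "Tolstoy"},
        {"forenames": None, "keyname": "Example Research Institute"},
    ],
}

formatted = [format_author(a) for a in record["authors"]]
print("; ".join(formatted))
```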

year

The year the work was published. Null if unknown.

isbns

A list of any ISBNs associated with the work.

issns

A list of any ISSNs associated with the work.

doi

The DOI of the work. Null if unknown.

urls

A list of URLs associated with the work: either links to the work itself or links to information about it.

open_access

Whether the work is open access. Null if unknown.

publication_type

The type of publication. One of ‘book’, ‘book-chapter’, ‘article’, or ‘report’. Null if unknown.

article

Metadata specific to journal articles, book chapters, or other resources that are published as part of a larger “container”. Null if the work is a standalone publication.

venue

The name of the “container” in which the article was published. Generally a journal name, conference name, or the title of an edited volume.

volume

A journal volume number. Null if unknown or not applicable.

issue

A journal issue number. Null if unknown or not applicable.

page_start

The first page of the article. Null if unknown or not applicable.

page_end

The last page of the article. Null if unknown or not applicable.

abstract

The full-text abstract. Null if unknown or unavailable.
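Since the article struct and most of its fields can be null, code that reads them needs to guard each access. The sketch below builds a short locator string from whatever article fields are present and returns None for standalone publications. The function name describe_article, the output format, and the sample values are hypothetical.

```python
def describe_article(record):
    """Return a 'venue volume(issue), pp. start-end' string, or None if standalone."""
    article = record.get("article")
    if article is None:
        return None  # standalone publication: no container metadata

    text = article.get("venue") or ""
    if article.get("volume"):
        text += f" {article['volume']}"
    if article.get("issue"):
        text += f"({article['issue']})"

    start, end = article.get("page_start"), article.get("page_end")
    if start and end:
        text += f", pp. {start}-{end}"
    elif start:
        text += f", p. {start}"
    return text.strip()


# Hypothetical records: one journal article and one standalone book.
journal_article = {
    "title": "An Example Article",
    "article": {
        "venue": "Journal of Examples",
        "volume": "12",
        "issue": "3",
        "page_start": "45",
        "page_end": "67",
        "abstract": None,
    },
}
standalone_book = {"title": "An Example Book", "article": None}

print(describe_article(journal_article))
print(describe_article(standalone_book))
```

Volume, issue, and page values are strings in the schema, so they are concatenated as-is rather than parsed as numbers.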

work_cluster_size

The number of other records in the catalog that share a work_id with this record.
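To make the definition concrete, the sketch below recomputes a cluster size from work_id in plain Python. It reads "other records" literally, so a record that is alone in its cluster gets 0. Whether the published values follow exactly this convention is an assumption, and the function name add_work_cluster_size and the sample rows are hypothetical.

```python
from collections import Counter


def add_work_cluster_size(records):
    """Attach the number of *other* records that share each record's work_id."""
    counts = Counter(r["work_id"] for r in records)
    return [
        {**r, "work_cluster_size": counts[r["work_id"]] - 1}
        for r in records
    ]


# Hypothetical rows: three records in one cluster and one singleton.
records = [
    {"id": 1, "work_id": 10},
    {"id": 2, "work_id": 10},
    {"id": 3, "work_id": 10},
    {"id": 4, "work_id": 11},
]

sized = add_work_cluster_size(records)
print([(r["id"], r["work_cluster_size"]) for r in sized])
```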

work_match_count

The total number of times that the record’s work cluster (identified by work_id) appeared in the syllabus corpus.

Note

This number is an aggregate count based on OS’s internal corpus, which includes syllabi not represented in this dataset.

work_isbns

The set of ISBNs associated with all records that share a work_id with this record.

work_issns

The set of ISSNs associated with all records that share a work_id with this record.

work_dois

The set of DOIs associated with all records that share a work_id with this record.