.. _catalog:

:code:`catalog` DataFrame
=========================

Each row in the Catalog data frame corresponds to a bibliographic record describing a work's title, authorship, and publication details. In addition, as part of our citation matching process, we cluster together catalog records that represent the same work (see `work_id`_).

.. code-block:: python

    # Assumes the conventional alias for the PySpark types module.
    from pyspark.sql import types as T

    T.StructType([
        T.StructField('id', T.LongType()),
        T.StructField('source', T.StringType()),
        T.StructField('source_id', T.StringType()),
        T.StructField('work_id', T.LongType()),
        T.StructField('title', T.StringType()),
        T.StructField('subtitle', T.StringType()),
        T.StructField('publisher', T.StringType()),
        T.StructField('authors', T.ArrayType(T.StructType([
            T.StructField('forenames', T.StringType()),
            T.StructField('keyname', T.StringType()),
        ]))),
        T.StructField('year', T.IntegerType()),
        T.StructField('isbns', T.ArrayType(T.StringType())),
        T.StructField('issns', T.ArrayType(T.StringType())),
        T.StructField('doi', T.StringType()),
        T.StructField('urls', T.ArrayType(T.StringType())),
        T.StructField('open_access', T.BooleanType()),
        T.StructField('publication_type', T.StringType()),
        T.StructField('article', T.StructType([
            T.StructField('venue', T.StringType()),
            T.StructField('volume', T.StringType()),
            T.StructField('issue', T.StringType()),
            T.StructField('page_start', T.StringType()),
            T.StructField('page_end', T.StringType()),
            T.StructField('abstract', T.StringType()),
        ])),
        T.StructField('work_match_count', T.IntegerType()),
        T.StructField('work_isbns', T.ArrayType(T.StringType())),
        T.StructField('work_issns', T.ArrayType(T.StringType())),
        T.StructField('work_dois', T.ArrayType(T.StringType())),
    ])

id
**

The OS-assigned unique identifier of the catalog record.

.. note::

    This unique identifier is not guaranteed to be consistent across versions of the dataset.

source
******

A slug that represents the source bibliographic database that the record was extracted from. For example: :code:`viaf`, :code:`doab`, :code:`gutenberg`.

source_id
*********

The original identifier of the bibliographic record in the source catalog. For example, *War and Peace* is book 2600 in the Project Gutenberg catalog, so the value here is :code:`2600`.

work_id
*******

An identifier that clusters records that represent different copies or editions of the same work, as identified by OS. For example, if we have 100 different editions of *The Iliad* in the catalog, each of these 100 records is assigned a common `work_id`_, making it possible to operate on the group of records as a unit. If we only have a single bibliographic record for a given publication (generally the case for resources published in the last ~10 years), the `work_id`_ will be unique to that record.

To operate on unique works, select a single record for each distinct `work_id`_ in the catalog.

title
*****

The title of the work.

subtitle
********

The subtitle of the work. Null if unknown.

publisher
*********

The publisher of the work. Null if unknown.

authors
*******

A list of authors of the work.

forenames
---------

The given names of the author (both first and middle names).

keyname
-------

The surname of the author (if a person) or an organization name. Used by the citation matcher as the minimal lexical representation of the author.

year
****

The year the work was published. Null if unknown.

isbns
*****

A list of any ISBNs associated with the work.

issns
*****

A list of any ISSNs associated with the work.

doi
***

The DOI of the work. Null if unknown.

urls
****

A list of URLs associated with the work: either a link to the work itself or a link to information about it.

open_access
***********

Whether or not the work is an open access work. Null if unknown.

publication_type
****************

The type of publication. Possible values are :code:`book`, :code:`book-chapter`, :code:`article`, or :code:`report`. Null if unknown.

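The advice in the `work_id`_ section, to select a single record for each distinct :code:`work_id`, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: rows are represented as dicts (for instance, a small sample collected from the data frame), the helper name :code:`one_record_per_work` is hypothetical, and keeping the record with the smallest :code:`id` is an arbitrary choice that the documentation does not prescribe.

```python
def one_record_per_work(records):
    """Return one representative record for each distinct work_id.

    Assumption: the representative is the record with the smallest `id`;
    any other deterministic rule would work equally well.
    """
    representatives = {}
    for record in records:
        current = representatives.get(record["work_id"])
        if current is None or record["id"] < current["id"]:
            representatives[record["work_id"]] = record
    # Sort by work_id so the output order is deterministic.
    return [representatives[work_id] for work_id in sorted(representatives)]


# Made-up sample rows, reduced to the fields used here.
sample = [
    {"id": 1, "work_id": 10, "title": "The Iliad"},
    {"id": 2, "work_id": 10, "title": "The Iliad"},
    {"id": 3, "work_id": 11, "title": "War and Peace"},
]

unique_works = one_record_per_work(sample)
print([record["id"] for record in unique_works])  # prints [1, 3]
```

On the full data frame the same selection would normally be done in Spark rather than in driver-side Python, for example with :code:`dropDuplicates(['work_id'])` when any record per work is acceptable.
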
article
*******

Metadata specific to journal articles, book chapters, or other resources that are published as part of a larger "container". Null if the work is a standalone publication.

venue
-----

The name of the "container" in which the article was published. Generally a journal name, conference name, or the title of an edited volume.

volume
------

A journal volume number. Null if unknown or not applicable.

issue
-----

A journal issue number. Null if unknown or not applicable.

page_start
----------

The first page of the article. Null if unknown or not applicable.

page_end
--------

The last page of the article. Null if unknown or not applicable.

abstract
--------

The full-text abstract. Null if unknown or unavailable.

work_cluster_size
*****************

The number of other records in the catalog that share a `work_id`_ with this record.

work_match_count
****************

The total number of times that the record’s work cluster (identified by `work_id`_) appeared in the syllabus corpus.

.. note::

    This number is an aggregate count based on OS's internal corpus, which includes syllabi not represented in this dataset.

work_isbns
**********

The set of ISBNs associated with all records that share a `work_id`_ with this record.

work_issns
**********

The set of ISSNs associated with all records that share a `work_id`_ with this record.

work_dois
*********

The set of DOIs associated with all records that share a `work_id`_ with this record.
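
To make the relationship between record-level and cluster-level fields concrete, the following plain-Python sketch groups rows by :code:`work_id` and unions their :code:`isbns`, which is how a field such as :code:`work_isbns` is described above. The dataset already ships these fields, and the actual pipeline may compute them differently; the helper name :code:`summarize_work_clusters`, the :code:`record_count` key, and the placeholder ISBN strings are all hypothetical. Note that :code:`record_count` includes every record in the cluster, whereas :code:`work_cluster_size` is described as counting the *other* records, so under that reading the two would differ by one.

```python
from collections import defaultdict


def summarize_work_clusters(records):
    """Group records by work_id and collect cluster-level ISBNs."""
    isbns_by_work = defaultdict(set)
    record_counts = defaultdict(int)
    for record in records:
        work_id = record["work_id"]
        record_counts[work_id] += 1
        # `isbns` may be missing, None, or an empty list for a record.
        isbns_by_work[work_id].update(record.get("isbns") or [])
    return {
        work_id: {
            "record_count": record_counts[work_id],
            "isbns": sorted(isbns_by_work[work_id]),
        }
        for work_id in record_counts
    }


# Made-up sample rows; the ISBN values are placeholders, not real ISBNs.
sample_rows = [
    {"id": 1, "work_id": 10, "isbns": ["isbn-a"]},
    {"id": 2, "work_id": 10, "isbns": ["isbn-a", "isbn-b"]},
    {"id": 3, "work_id": 11, "isbns": None},
]

summary = summarize_work_clusters(sample_rows)
print(summary[10])  # prints {'record_count': 2, 'isbns': ['isbn-a', 'isbn-b']}
```

The same pattern applies to :code:`issns` and :code:`work_issns`; for DOIs the record-level field :code:`doi` is a single string rather than a list, so it would be added to the set directly when it is not null.
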