openbiblio.social is one of the many independent Mastodon servers you can use to participate in the fediverse.

Jakob Voß

Dump of the subject indexing data in raw PICA+ format (September 2022): doi.org/10.5281/zenodo.7321969 (~68 million records)

The cleaned subset of the subject indexing data is already available at doi.org/10.5281/zenodo.7307966 (~24 million records)

Zenodo: Subject indexing data of K10plus library union catalog

This dataset contains an extract of the K10plus library union catalog with its subject indexing data:

    kxp-subjects_2022-09-30_??of10.dat: the full data (68,051,434 records), split into files of up to 5,000,000 records each

K10plus is a union catalog of German libraries, run by the library service centers BSZ and VZG since 2019. The catalog contains bibliographic data of the majority of academic libraries in Germany. The core data of K10plus is made available as open data via APIs and in the form of database dumps. More information can be found here:

    K10plus homepage (in German)
    K10plus Open Data page (in German)
    Traditional search interface (OPAC)

Data format

The data is provided in its raw internal format, called PICA+, so that no information is lost during conversion. In particular, the data is given in PICA Normalized Format with one record per line. Each record consists of a list of fields, and each field consists of a list of subfields. The data is best processed with the command line tools pica-rs or picadata. A detailed description of the PICA format and its processing is given in the German textbook Einführung in die Verarbeitung von PICA-Daten.

For visual inspection, PICA Normalized Format is best converted into PICA Plain Format (pica-rs command pica print). The following example record contains nine fields:

    003@ $0010003231
    013D $9104450460$VTsvz$3209786884$7gnd/4151278-9$aEinführung
    044K $9106080474$VTsv1$7gnd/4077343-7$3209204761$aSekte
    044N $aReligionsgemeinschaft
    045E $a12
    045F $a291
    045Q/01 $9181570408$VTkv$a11.97$jNeue religiöse Bewegungen$jSekten
    045R $91270641751$VTkv$7rvk/11410:$3200641751$aBG 9600$jAllgemeines$NB$JTheologie und Religionswissenschaften$NBG$JFundamentaltheologie$NBG 9020-BG 9790$JKirche und Kirchen$NBG 9600-BG 9720$JFreikirchen und Sekten
    045V $a1

Each K10plus record is uniquely identified by its record identifier PPN, given in field 003@ subfield $0. The PPN can be used:

    to link into the K10plus catalog, e.g. https://opac.k10plus.de/DB=2.299/PPNSET?PPN=010003231
    to retrieve the record in other formats via API, e.g. https://unapi.k10plus.de/?id=opac-de-627:ppn:010003231&format=marcxml (MARC/XML format) and https://ws.gbv.de/suggest/csl/?query=pica.ppn=010003231&citationstyle=ieee&language=de (citation format)

Scope of the data

The data is limited to records having at least one holding by a library participating in K10plus. Records are provided with "offline expansion" (some subfields have been added automatically to facilitate re-use of the data) and limited to the following fields:

    003@  internal record identifier "PPN" in subfield $0
    010@  language
    013D  type of content
    013E  musical type of document
    013F  target audience
    013H  additional type of document
    041A  keywords
    044.  all subject indexing fields starting with 044
    045.  all subject indexing fields starting with 045
    144Z  local library keywords
    145S  local library classification
    145Z  local library classification

The following fields may also be of interest but are not included:

    017G and 017H  URL for catalog enrichment (e.g. table of contents)
    047I           abstract

Documentation of the fields can be found at https://format.k10plus.de/k10plushelp.pl?cmd=pplist&katalog=Standard#titel

Processing examples

Extract a CSV file of PPN and RVK notation:

    pica filter '045R?' kxp-subjects_2022-06-30.dat | pica select '003@$0,045Ra'

Get a list of the PPNs of records having RVK but not BK:

    pica filter '045R? & !045Q/01' kxp-subjects_2022-06-30.dat | pica select '003@$0'

See https://github.com/gbv/k10plus-subjects#readme for additional examples of data analysis.

Automatic download

Given the Zenodo record ID (e.g. 6810556), a list of all files can be generated with curl and jq:

    curl -sL https://zenodo.org/api/records/$ID | jq -r '.files|map([.key,.links.self]|@tsv)[]'

Changes

    2022-09-30: update with additional fields 010@, 013E, 013H, 014A (68,051,434 records)
    2022-06-30: update with additional fields 013D and 013F (47,686,064 records)
    2021-06-30: first published dump (41,786,820 records)

License

https://creativecommons.org/publicdomain/zero/1.0/
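The field and subfield structure described in the card can be explored without pica-rs by reading PICA Plain lines directly. The following is only a rough sketch: the helper name pica_plain_subfield is hypothetical, and it assumes one field per line, a single blank after the tag, and values that never contain a literal "$". For real work, pica-rs or picadata are the tools the dataset description recommends.

```shell
#!/bin/sh
# pica_plain_subfield TAG CODE
# Reads PICA Plain lines on stdin and prints the value of every subfield CODE
# found in fields named TAG. Hypothetical helper, not part of pica-rs.
# Assumption: subfield values contain no literal "$".
pica_plain_subfield() {
  awk -v tag="$1" -v code="$2" '
    ($1 "") == (tag "") {
      # Drop the tag and the blank after it; the rest is "$<code><value>..."
      subfields = substr($0, length($1) + 2)
      n = split(subfields, parts, /[$]/)
      # parts[1] is empty because the string starts with "$"
      for (i = 2; i <= n; i++)
        if (substr(parts[i], 1, 1) == (code ""))
          print substr(parts[i], 2)
    }'
}

# Two fields of the example record from the dataset description:
# print the PPN (003@ $0).
pica_plain_subfield '003@' 0 <<'EOF'
003@ $0010003231
045R $91270641751$VTkv$7rvk/11410:$3200641751$aBG 9600$jAllgemeines$NB$JTheologie und Religionswissenschaften$NBG$JFundamentaltheologie$NBG 9020-BG 9790$JKirche und Kirchen$NBG 9600-BG 9720$JFreikirchen und Sekten
EOF
```

Calling the same helper with the arguments 045R and a on that input would print the RVK notation of the record.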

The data can also be used to answer questions. Example: from which languages are publications most often translated into German?

cat *.dat | pica filter "010@{a=='ger'}" | pica select "010@.c" | sort | uniq -c | sort -n
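The counting idiom at the end of that pipeline (sort | uniq -c | sort -n) can be tried without pica-rs or the dump. This is a minimal sketch: the function name count_values is hypothetical, and the language codes are a made-up sample, not figures from the dataset.

```shell
#!/bin/sh
# count_values: read one value per line on stdin and print "count<TAB>value",
# least frequent first. Hypothetical helper name, not part of pica-rs.
count_values() {
  sort | uniq -c | sort -n | awk '{ print $1 "\t" $2 }'
}

# Made-up sample standing in for a column of language codes.
printf '%s\n' eng fre eng rus eng fre | count_values
```

With pica-rs installed, the output of the pica select step in the command above could be piped into count_values in place of the final three commands.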

@nichtich Oh wow, this is an enormous collection. Could be a super valuable testbed for text classification methods. Are these all manual annotations or does the dataset include automated ones?

CC #NLProc

@lpag By far the largest share is manual. Automatically assigned codes are tagged with subfield code $k/$v, and some subjects are inferred by mappings (tagged with "coli-conc" $A), but the fraction is low. You'd better go with the normalized set doi.org/10.5281/zenodo.7016625 anyway.

Zenodo: Normalized subject indexing data of K10plus library union catalog

This dataset contains normalized subject indexing data of the K10plus library union catalog. It includes links between bibliographic records in K10plus and concepts (subjects or classes) from controlled vocabularies:

    kxp-subjects_2022-09-30.tsv.gz: TSV format
    kxp-subjects_2022-09-30.nt.gz: RDF format (in the form of NTriples)
    vocabularies.json: information about vocabularies

K10plus

K10plus is a union catalog of German libraries, run by the library service centers BSZ and VZG since 2019. The catalog contains bibliographic data of the majority of academic libraries in Germany. Bibliographic records in K10plus are uniquely identified by a PPN identifier. Several APIs exist to retrieve more data for a record via its PPN, e.g.

    Link into the K10plus OPAC: https://opac.k10plus.de/PPNSET?PPN={PPN}
    Retrieve the full record in MARC/XML format: https://unapi.k10plus.de/?format=marcxml&id=opac-de-627:ppn:{PPN}
    Get a formatted citation for display: https://ws.gbv.de/suggest/csl2?citationstyle=ieee&language=en&database=opac-de-627&query=pica.ppn=${PPN}

APIs to look up more data from a notation or identifier of a vocabulary can be found at https://bartoc.org/. For instance, BK class 58.55 can be retrieved via the DANTE API: https://api.dante.gbv.de/data?uri=http%3A%2F%2Furi.gbv.de%2Fterminology%2Fbk%2F58.55

See vocabularies.json for the mapping of vocabulary symbols to BARTOC URIs and additional information.

Statistics

The TSV dataset contains 24,367,895 records and 84,408,705 links to concepts.

Number of concepts per vocabulary:

    asb          5337
    stw        105054
    nlm        134271
    ssd        155548
    kab        161737
    sfb        441508
    sdnb      4637639
    lcc       5466762
    ddc       9483999
    rvk      10305961
    bk       13613274
    gnd      39897615

Number of RDF triples: 84,408,705

TSV

The .tsv file contains three tab-separated columns:

    Bibliographic record identifier (PPN)
    Vocabulary symbol
    Notation or identifier in the vocabulary

An example:

    010000011  bk   58.55
    010000011  gnd  4036582-7

Record 010000011 is indexed with class 58.55 from the Basic Classification and with authority record 4036582-7 from the Integrated Authority File.

RDF

The NTriples file contains the same information as the TSV file, but identifiers are mapped to URIs. An example:

    <http://uri.gbv.de/document/opac-de-627:ppn:010000011> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4036582-7> .
    <http://uri.gbv.de/document/opac-de-627:ppn:010000011> <http://purl.org/dc/terms/subject> <http://uri.gbv.de/terminology/bk/58.55> .

Changelog

    2022-09-11: Fixed PPN URIs and broken UTF-8 encoding
    2022-08-24: Fixed GND URIs, added LCC and KAB (https://doi.org/10.5281/zenodo.7018350)
    2022-08-24: First version (https://doi.org/10.5281/zenodo.7016626)

License and provenance

All data is public domain, but references are welcome. See https://coli-conc.gbv.de/ for related projects and documentation. The data has been derived from a larger dataset of all subject indexing data, published at https://doi.org/10.5281/zenodo.6817455. This dataset has been created with public scripts from the git repository https://github.com/gbv/k10plus-subjects.
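The relation between the TSV and the NTriples form can be illustrated with a few lines of awk. This is a sketch under stated assumptions, not the scripts used to build the dataset: the function name tsv_to_ntriples is hypothetical, and only the gnd and bk URI prefixes are covered because those are the two patterns visible in the examples in the card. Rows from other vocabularies are skipped, since their URI patterns would have to be looked up via vocabularies.json.

```shell
#!/bin/sh
# tsv_to_ntriples: convert "PPN<TAB>vocabulary<TAB>notation" rows on stdin
# into NTriples lines. Hypothetical helper; handles gnd and bk only and
# silently skips every other vocabulary.
tsv_to_ntriples() {
  awk -F '\t' '
    BEGIN {
      # URI prefixes copied from the example triples in the dataset description
      prefix["gnd"] = "http://d-nb.info/gnd/"
      prefix["bk"]  = "http://uri.gbv.de/terminology/bk/"
    }
    ($2 in prefix) {
      printf "<http://uri.gbv.de/document/opac-de-627:ppn:%s> <http://purl.org/dc/terms/subject> <%s%s> .\n", $1, prefix[$2], $3
    }'
}

# The two example rows from the TSV description.
printf '010000011\tbk\t58.55\n010000011\tgnd\t4036582-7\n' | tsv_to_ntriples
```

The triples come out in input order, so the bk triple is printed before the gnd triple here.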