JuVer - Hochschule Hannover

Project scope

Tasks (excerpt)

The fundamental problems addressed in the JuVer project include:

segmenting longer texts into the individual sections (Paragraphen) and subsections (Absätze) of a statute. Layout, signal words, and semantic coherence can be used for this.
recognizing the thematic and semantic relationship between two sections. To this end, word embeddings are used to learn synonyms for the (legal) terms in the texts, so that relationships can be detected independently of the specific words used.
recognizing the role of text passages or sections within the overall document, as well as the links between sections within a document.
identifying references to other texts, including the type and function of each reference.
developing a basic set of relations between texts, sections, and entities with which all relationships found can be recorded formally and knowledge graphs can be built.
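
The last two tasks, finding references and recording them as formal relations, can be illustrated with a minimal sketch. It is not the project's method: the citation pattern is a deliberately simplified assumption that only covers forms such as "§ 823 Abs. 1 BGB", and the relation label `refers_to` and the `Relation` type are hypothetical names chosen for illustration.

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    source: str    # identifier of the text that contains the reference
    relation: str  # hypothetical relation label
    target: str    # normalized identifier of the cited provision


# Simplified assumption: "§ <number> [Abs. <number>] <statute abbreviation>".
# Real citations are far more varied (ranges, "ff.", "Satz", "Nr.", ...).
CITATION = re.compile(
    r"§\s*(\d+[a-z]?)(?:\s+Abs\.\s*(\d+))?\s+([A-ZÄÖÜ][A-Za-zÄÖÜäöü]+)"
)


def extract_relations(source_id: str, text: str) -> list[Relation]:
    """Return one (source, refers_to, target) triple per citation found."""
    relations = []
    for match in CITATION.finditer(text):
        section, subsection, statute = match.groups()
        target = f"{statute} § {section}"
        if subsection:
            target += f" Abs. {subsection}"
        relations.append(Relation(source_id, "refers_to", target))
    return relations


if __name__ == "__main__":
    sample = "Der Anspruch folgt aus § 823 Abs. 1 BGB in Verbindung mit § 242 BGB."
    for triple in extract_relations("doc-1", sample):
        print(triple)
```

Triples of this kind can be loaded directly into a graph structure. Determining the type and function of a reference, as the task list requires, would need more than a pattern match.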

Project participants

Project lead:
Prof. Dr. Christian Wartena

Responsible for law:
Prof. Dr. Fabian Schmieder

Research staff:
Frieda Josi M.A. and Jean Charbonnier M.Sc.

Data

Duration: 1 October 2019 to 30 September 2022
Listed in: Volkswagenstiftung portal

Publications

Conferences

Frieda Josi, Christian Wartena, Ulrich Heid: Identifizierung von häufig vorkommenden Textabschnitten in juristischen Korpora. Manuscript accepted for publication (56th Linguistics Colloquium, Nov 2020).
Frieda Josi, Christian Wartena, Ulrich Heid (2021): Representing Standard Text Formulations as Directed Graphs. Link to the original publication
Conference paper - authors' version
Frieda Josi, Christian Wartena, Ulrich Heid: Preparing Legal Documents for NLP Analysis: Improving the Classification of Text Elements by Using Page Features. Accepted for publication (8th International Conference on Natural Language Processing (NATP 2022), Zürich).
Frieda Josi, Christian Wartena, Ulrich Heid: Generalisierung von formelhaften Textbestandteilen in juristischen Korpora: Einsatz- und Entwicklungspotential. Manuscript accepted for publication (IRI§22 Internationales Rechtsinformatik Symposion IRIS 2022, Salzburg).

See ResearchGate

Datasets - Legal Corpora (German)

1. Case Law Corpus

Short description:
	The corpus contains anonymized German court decisions in criminal proceedings from 2015 to early 2020. The decisions were published by the Federal Court of Justice (BGH). They were crawled directly from the website of the case law database and are available in HTML format for further work. Source of the documents in the case law corpus: hrr-strafrecht.de.
Original documents 1.1.0.
Description:
	Since the criminal law court decisions were crawled from a web page, the original files are in HTML (PHP) format.
Download:
	Case Law Corpus 1.1.0 (zip, 33,8 MB)
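
Anyone working with the HTML originals will first need to pull the visible text out of each page. The following is a minimal sketch using only the Python standard library. It is not the extraction used to build the corpus, and it makes no assumptions about the specific page layout of the source website; the set of block-level tags is an illustrative choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}
    # Illustrative choice of tags after which a line break is inserted.
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Separating the decision text from metadata and navigation elements would additionally require knowledge of the page structure, which this sketch does not attempt.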
Only case law text 1.2.0.
Description:
	This corpus contains only the case law texts, without metadata, split into sentences. Because legal texts often use abbreviations, many sentence boundaries were initially detected incorrectly. Segmentation was therefore done in two steps: first, the sentence tokenizer from the NLTK package (nltk.org/api/nltk.tokenize.html) was trained on legal texts and applied; then, a manually created list of legal abbreviations was used to recombine sentences that had been separated incorrectly.
Download:
	Case Law Corpus - Sentences 1.2.0 (zip, 18,5 MB)
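
The recombination step described for corpus 1.2.0 can be sketched as follows. This is an illustration of the idea only, not the project's code: the function name is hypothetical, the abbreviation set is a small made-up example rather than the project's manually created list, and the preceding NLTK tokenizer step is not reproduced here.

```python
def recombine(sentences, abbreviations):
    """Merge a fragment with its successor if it ends in a known abbreviation."""
    merged = []
    for sentence in sentences:
        last_words = merged[-1].split() if merged else []
        if last_words and last_words[-1] in abbreviations:
            # The previous "sentence" was cut off after an abbreviation.
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


if __name__ == "__main__":
    # Illustrative abbreviations only; the project used its own list.
    legal_abbreviations = {"Abs.", "Nr.", "Art.", "vgl."}
    fragments = [
        "Die Revision ist nach § 349 Abs.",
        "2 StPO unbegründet.",
        "Der Senat verweist auf die Antragsschrift.",
    ]
    for sentence in recombine(fragments, legal_abbreviations):
        print(sentence)
```

A rule this simple also merges genuine sentence ends that happen to finish with an abbreviation, so its results would need spot checks.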
Tokenized case law sentences 1.3.0.
Description:
	This corpus contains the tokenized sentences from corpus 1.2.0. In addition to the tokens, lemmas and part-of-speech tags are included. Tokenization was performed using the word tokenizer from the NLTK package (nltk.org/api/nltk.tokenize.html). Lemmatization and one part-of-speech tagging were done with the HanTa tagger (github.com/wartaal/HanTa, C. Wartena); an additional part-of-speech tagging was done with the TreeTagger (cis.uni-muenchen.de/~schmid/tools/TreeTagger/, H. Schmid).
Download:
	Case Law Corpus - Tokens 1.3.0 (zip, 39,4 MB)
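
The annotation layers described for corpus 1.3.0 (token, lemma, and two part-of-speech tags) can be pictured as one record per token. The sketch below is an assumption for illustration: the field names and the tab-separated layout are hypothetical and do not describe the actual file format of the download, and the tag values in the example are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str            # surface form from the word tokenizer
    lemma: str           # lemma (HanTa in the corpus description)
    pos_hanta: str       # part-of-speech tag from HanTa
    pos_treetagger: str  # part-of-speech tag from TreeTagger


def to_tsv(sentence):
    """Serialize a sentence as one tab-separated line per token."""
    return "\n".join(
        "\t".join((t.form, t.lemma, t.pos_hanta, t.pos_treetagger))
        for t in sentence
    )


def from_tsv(block):
    """Parse the hypothetical tab-separated layout back into Token records."""
    return [Token(*line.split("\t")) for line in block.splitlines() if line]


if __name__ == "__main__":
    # Placeholder annotations for illustration only.
    sentence = [Token("Die", "die", "ART", "ART"), Token("Revision", "Revision", "NN", "NN")]
    print(to_tsv(sentence))
```

Keeping both tag columns side by side makes it easy to compare the two taggers token by token.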
Application area:
	Linguistic Resources, Legal Analyses, POS Tagging, Text Mining, Information Extraction, ...

2. Contract Corpus

Short description:
	This corpus contains contracts of the Hamburg City Administration and the Bremen City Administration, along with some cooperation agreements between universities and several contracts that universities have concluded with external service providers. All contracts are in PDF format, date from 2014 to 2019, and are publicly available under the Data License Germany Attribution 2.0 or the Data License Germany Zero Version 2.0. Sources for the contract corpus: City administration of Hamburg; City administration of Bremen (keyword: "Vertrag"); cooperation contracts between universities and between universities and service providers, for which we searched university websites specifically for contract files.
Original contracts 2.1.0.
Description:
	The documents in this corpus are in PDF format. Text recognition was performed on the files with Adobe Acrobat Reader, using OCR technology from Image Recognition Integrated Systems S.A.
Download:
	Contract Corpus 2.1.0 (7z, 5,44 GB external link)
Cleaned and corrected contracts 2.2.0.
Description:
	The contract texts had to be extracted, cleaned, and prepared for further processing. The scan quality of the contracts from the city administrations is poor: some documents have a low scanning resolution, pages were scanned at different angles, and so on. In addition, all personal data has been blacked out. As a result, the extracted texts contained many OCR errors, which were corrected, and words hyphenated at line ends were rejoined. Sentence segmentation was then performed. Because legal texts often use abbreviations, many sentence boundaries were initially detected incorrectly. Segmentation was therefore done in two steps: first, the sentence tokenizer from the NLTK package (nltk.org/api/nltk.tokenize.html) was trained on legal texts and applied; then, a manually created list of legal abbreviations was used to recombine sentences that had been separated incorrectly.
Download:
	Contract Corpus - Sentences 2.2.0 (Rar, 23,2 MB)
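
Rejoining words that were hyphenated at a line end, one of the cleaning steps described for corpus 2.2.0, can be sketched with a single regular expression. This is an assumed heuristic for illustration, not the project's implementation: it only joins when the next line starts with a lowercase letter, so that hyphenated names split across lines are left alone.

```python
import re

# A word character, a hyphen at the line end, then a lowercase continuation.
_LINE_END_HYPHEN = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-zäöüß])")


def join_hyphenated(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    print(join_hyphenated("Der Auftrag-\nnehmer verpflichtet sich."))
```

The heuristic has known gaps: a German suspended hyphen such as "Ein-" followed by "und Ausgang" on the next line would be joined incorrectly, so a dictionary check would be a sensible addition.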
Tokenized contract sentences 2.3.0.
Description:
	This corpus contains the tokenized sentences from corpus 2.2.0. In addition to the tokens, lemmas and part-of-speech tags are included. Tokenization was performed using the word tokenizer from the NLTK package (nltk.org/api/nltk.tokenize.html). Lemmatization and one part-of-speech tagging were done with the HanTa tagger (github.com/wartaal/HanTa, C. Wartena); an additional part-of-speech tagging was done with the TreeTagger (cis.uni-muenchen.de/~schmid/tools/TreeTagger/, H. Schmid).
Download:
	Contract Corpus - Tokens 2.3.0 (Rar, 44,3 MB)
Application area:
	Linguistic Resources, Legal Analyses, POS Tagging, Text Mining, Information Extraction, ...

Code

In progress

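Project funding

Lower Saxony Ministry of Science and Culture (Niedersächsisches Ministerium für Wissenschaft und Kultur)
Funding initiative: Niedersächsisches Vorab
Funding line: "Geistes- und Kulturwissenschaften – digital: Forschungschancen, Methodenentwicklung und Reflexionspotenziale" (Humanities and Cultural Studies, digital: research opportunities, method development, and potential for reflection)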

Poster