├── LICENSE └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Haitao Li 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Awesome-LegalAI-Resources 3 | 4 | This repository aims to collect and curate resources related to Legal AI, including datasets, websites, and other useful links. Whether you are a researcher, developer, or simply interested in the intersection of law and artificial intelligence, we hope this repository provides valuable information and references. 5 | 6 | ## About 7 | The rapid advancements in AI technologies have significantly impacted various domains, including the legal industry. 
The purpose of this repository is to collect and organize resources that cover a wide range of topics related to Legal AI, including but not limited to: 8 | 9 | - Natural Language Processing (NLP) for legal text analysis 10 | - AI-powered legal research tools 11 | - Automated contract analysis and generation 12 | - Predictive analytics in legal decision-making 13 | - Legal document classification and summarization 14 | - Ethical considerations in Legal AI 15 | - Legal implications of AI adoption in the legal profession 16 | 17 | This repository serves as a centralized hub for researchers, practitioners, and enthusiasts to discover, share, and collaborate on Legal AI resources. Whether you're looking for tutorials, datasets, websites or open-source projects, you'll find valuable material here. 18 | 19 | ## General Corpus 20 | 21 | - **MultiLegalPile**: A 689GB corpus in 24 languages from 17 jurisdictions. 22 | [Paper](https://arxiv.org/abs/2306.02069v2) [Link](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) 23 | 24 | **Language**: multilingual **Country**: multinational 25 | 26 | - **MC4_legal**: This dataset contains large text resources (~106GB in total) from mc4 filtered for legal data that can be used for pretraining language models. 27 | [Link](https://huggingface.co/datasets/joelito/legal-mc4) 28 | 29 | **Language**: multilingual **Country**: multinational 30 | 31 | - **EurlexResources**: This dataset contains large text resources (~179GB in total) from EURLEX that can be used for pretraining language models. 32 | [Link](https://huggingface.co/datasets/joelito/eurlex_resources) 33 | 34 | **Language**: multilingual **Country**: multinational 35 | 36 | - **LeXFile**: The LeXFiles is a new diverse English multinational legal corpus that we created including 11 distinct sub-corpora that cover legislation and case law from 6 primarily English-speaking legal systems (EU, CoE, Canada, US, UK, India). The corpus contains approx. 19 billion tokens. 
37 | [Paper](https://arxiv.org/abs/2305.07507) [Link](https://huggingface.co/datasets/lexlms/lex_files) 38 | 39 | **Language**: English **Country**: multinational 40 | 41 | - **Pile of Law**: A 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. 42 | [Paper](https://arxiv.org/abs/2207.00220) [Link](https://github.com/Breakend/PileOfLaw) 43 | 44 | **Language**: English **Country**: Unknown 45 | 46 | - **Spanish Legal Domain Corpora**: The corpora comprise multiple digital resources totaling 8.9GB of textual data. 47 | [Paper](https://arxiv.org/abs/2110.12201) [Link](https://github.com/PlanTL-GOB-ES/lm-legal-es) 48 | 49 | **Language**: Spanish **Country**: Spain 50 | 51 | - **GeLeCo**: GeLeCo is a large German Legal Corpus for research, teaching and translation purposes. It includes the complete collection of federal laws, administrative regulations and court decisions published on three online databases by the German Federal Ministry of Justice and Consumer Protection and the Federal Office of Justice. 52 | [Link](https://github.com/antcont/GeLeCo) 53 | 54 | **Language**: German **Country**: Germany 55 | 56 | - **CourtListener**: The original Court Listener dataset is a collection of every court opinion published by every court in the United States. It covers 406 jurisdictions (out of 423), with opinions from the year 1754 up to now. It is constantly updated with newly filed opinions and digitized archives. 57 | [Link](https://www.courtlistener.com/help/api/bulk-data/) 58 | 59 | **Language**: English **Country**: America 60 | 61 | ## Evaluation Benchmark 62 | 63 | ### Multi Legal Task 64 | 65 | - **LegalLAMA**: LegalLAMA is a diverse probing benchmark suite comprising 8 sub-tasks that aims to assess how much legal knowledge PLMs acquired in pre-training.
66 | [Paper](https://arxiv.org/abs/2305.07507) [Link](https://huggingface.co/datasets/lexlms/legal_lama) 67 | 68 | **Language**: English **Country**: multinational 69 | 70 | - **LexGLUE**: LexGLUE comprises seven datasets: ECtHR Task A and B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, and CaseHOLD that are available for re-use and re-share with appropriate attribution. 71 | [Paper](https://arxiv.org/abs/2110.00976) [Link](https://github.com/coastalcph/lex-glue) 72 | 73 | **Language**: English **Country**: multinational 74 | 75 | - **LEXTREME**: The dataset consists of 11 diverse multilingual legal NLU datasets. 6 datasets have one single configuration and 5 datasets have two or three configurations. This leads to a total of 18 tasks. 76 | [Paper](https://arxiv.org/abs/2301.13126) [Link](https://huggingface.co/datasets/joelito/lextreme) 77 | 78 | **Language**: multilingual **Country**: multinational 79 | 80 | - **LegalBench**: LegalBench is a collaborative benchmark intended to evaluate English large language models on legal reasoning and legal text-based tasks. LegalBench currently consists of more than 90 tasks. 81 | [Paper](https://arxiv.org/abs/2209.06120) [Link](https://github.com/HazyResearch/legalbench) 82 | 83 | **Language**: English **Country**: multinational 84 | 85 | - **LBOX OPEN**: LBOX OPEN is the first large-scale benchmark of Korean legal AI datasets. It consists of one legal corpus, two classification tasks, two legal judgement prediction (LJP) tasks, and one summarization task. 86 | [Paper](https://arxiv.org/abs/2206.05224) [Link](https://github.com/lbox-kr/lbox-open) 87 | 88 | **Language**: Korean **Country**: Korea 89 | 90 | - **GENTLE**: We present GENTLE, a new mixed-genre English challenge corpus totaling 17K tokens and consisting of 8 unusual text types for out-of-domain evaluation: dictionary entries, esports commentaries, legal documents, medical notes, poetry, mathematical proofs, syllabuses, and threat letters.
91 | [Paper](https://arxiv.org/abs/2306.01966) [Link](https://github.com/gucorpling/gentle) 92 | 93 | **Language**: English **Country**: Unknown 94 | 95 | - **SCALE**: In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document to document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). 96 | [Paper](https://arxiv.org/abs/2306.09237) [Link](https://huggingface.co/rcds) 97 | 98 | **Language**: multilingual **Country**: Switzerland 99 | 100 | ### Legal Case Retrieval 101 | 102 | - **LeCaRD**: LeCaRD consists of 107 query cases and 10,700 candidate cases selected from a corpus 103 | of over 43,000 Chinese criminal judgements. 104 | [Paper](https://dl.acm.org/doi/10.1145/3404835.3463250) [Link](https://github.com/myx666/LeCaRD) 105 | 106 | **Language**: Chinese **Country**: China 107 | 108 | - **LeCaRDv2**: LeCaRDv2 is one of the largest Chinese legal case retrieval datasets with the widest coverage of criminal charges. The dataset comprises 800 query cases and 55,192 candidate cases extracted from 4.3 million criminal case documents. 109 | [Link](https://github.com/THUIR/LeCaRDv2) 110 | 111 | **Language**: Chinese **Country**: China 112 | 113 | - **COLIEE**: The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual international competition whose aim is to achieve state-of-the-art methods for legal text processing. Task 1 is the legal case retrieval task. Task 3 is the statute law retrieval task.
114 | [Paper](https://sites.ualberta.ca/~rabelo/COLIEE2023/COLIEE2022_summary.pdf) [Link](https://sites.ualberta.ca/~rabelo/COLIEE2023/) 115 | 116 | **Language**: English/Japanese **Country**: Canada/Japan 117 | 118 | - **document-similarity**: The task here is to calculate a similarity score (in the range 0-1) between two case documents. The dataset comprises 53,210 publicly available case documents from the Supreme Court of India and 12,814 Acts from the Indian judiciary. 119 | [Paper](https://arxiv.org/abs/2209.12474) [Link](https://github.com/Law-AI/document-similarity) 120 | 121 | **Language**: English **Country**: India 122 | 123 | ### Question Answering 124 | 125 | - **JEC-QA**: The largest question answering dataset in the legal domain, collected from the National Judicial Examination of China. There are 26,365 questions in JEC-QA. 126 | [Paper](https://arxiv.org/abs/1911.12011) [Link](https://jecqa.thunlp.org/) 127 | 128 | **Language**: Chinese **Country**: China 129 | 130 | - **CaseHOLD**: This CaseHOLD dataset provides 53,000+ multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, that could be cited. 131 | [Paper](https://dl.acm.org/doi/10.1145/3462757.3466088) [Link](https://github.com/reglab/casehold) 132 | 133 | **Language**: English **Country**: America 134 | 135 | - **SARA**: A novel dataset based on US tax law, together with test cases. 136 | [Paper](https://ceur-ws.org/Vol-2645/paper5.pdf) [Link](https://nlp.jhu.edu/law/) 137 | 138 | **Language**: English **Country**: America 139 | 140 | - **PrivacyQA**: PrivacyQA is a corpus consisting of 1750 questions about the contents of privacy policies, paired with expert annotations.
141 | [Paper](https://arxiv.org/abs/1911.00841) [Link](https://github.com/AbhilashaRavichander/PrivacyQA_EMNLP) 142 | 143 | **Language**: English **Country**: America 144 | 145 | 146 | ### Legal Case Entailment 147 | 148 | - **COLIEE**: The Competition on Legal Information Extraction/Entailment (COLIEE) is an annual international competition whose aim is to achieve state-of-the-art methods for legal text processing. Task 2 is the legal case entailment task. Task 4 is the legal textual entailment data corpus. 149 | [Paper](https://sites.ualberta.ca/~rabelo/COLIEE2023/COLIEE2022_summary.pdf) [Link](https://sites.ualberta.ca/~rabelo/COLIEE2023/) 150 | 151 | **Language**: English/Japanese **Country**: Canada/Japan 152 | 153 | - **Legal Linking**: This paper describes a dataset and baseline systems for linking paragraphs from court cases to clauses or amendments in the US Constitution. 154 | [Paper](https://aclanthology.org/W19-2205.pdf) [Link](https://github.com/mayhewsw/legal-linking) 155 | 156 | **Language**: English **Country**: America 157 | 158 | ### Document Classification 159 | 160 | - **CAIL2018**: CAIL2018 contains more than 2.6 million criminal cases published by the Supreme People’s Court of China. It consists of applicable law articles, charges, and prison terms, which are expected to be inferred according to the fact descriptions of cases. 161 | [Paper](https://arxiv.org/abs/1807.02478) [Link](https://github.com/thunlp/CAIL2018) 162 | 163 | **Language**: Chinese **Country**: China 164 | 165 | - **ECHR**: This paper contributes a new publicly available English legal judgment prediction dataset of cases from the European Court of Human Rights (~11.5k cases). 
166 | [Paper](https://aclanthology.org/P19-1424/) [Link](https://archive.org/details/ECHR-ACL2019) 167 | 168 | **Language**: English **Country**: European 169 | 170 | - **Swiss-Judgment-Prediction**: A publicly released multilingual (German, French, and Italian), diachronic (2000-2020) corpus of 85K cases from the Federal Supreme Court of Switzerland (FSCS). 171 | [Paper](https://aclanthology.org/2021.nllp-1.3/) [Link](https://github.com/joelniklaus/swissjudgementprediction) 172 | 173 | **Language**: multilingual **Country**: Switzerland 174 | 175 | - **German Legal Decision Corpora**: To meet the need for publicly available German legal text corpora, this paper presents two German legal text corpora. The first corpus contains 32,748 decisions from 131 German courts, enriched with metadata. The second one is a subset of the first corpus and consists of 200 randomly chosen judgements. 176 | [Paper](https://www.scitepress.org/PublishedPapers/2021/101873/101873.pdf) [Link](https://zenodo.org/record/3936490#.X1ed7ovgomK) 177 | 178 | **Language**: German **Country**: Germany 179 | 180 | - **EURLEX57K**: We release a new dataset of 57k legislative documents from EUR-LEX, the European Union’s public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. 181 | [Paper](https://aclanthology.org/W19-2209/) [Link](http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/) 182 | 183 | **Language**: English **Country**: European 184 | 185 | - **German rental agreements**: 601 sentences from the tenancy law of the German Civil Code and 312 sentences, classified according to a semantic type system consisting of 9 different classes, from German rental agreements.
186 | [Paper](https://www.researchgate.net/publication/332171940_Classifying_Semantic_Types_of_Legal_Sentences_Portability_of_Machine_Learning_Models) [Link](https://github.com/sebischair/Legal-Sentence-Classification-Datasets-and-Models) 187 | 188 | **Language**: English **Country**: Germany 189 | 190 | 191 | ### Summarization 192 | 193 | - **BillSum**: We introduce the BillSum dataset, which contains a primary corpus of 22,218 US Congressional bills and reference summaries split into a train and a test set. 194 | [Paper](https://aclanthology.org/D19-5406/) [Link](https://huggingface.co/datasets/billsum) 195 | 196 | **Language**: English **Country**: America 197 | 198 | - **EUR-Lex-Sum**: We obtain up to 1,500 document/summary pairs per language, including a subset of 375 crosslingually aligned legal acts with texts available in all 24 languages. 199 | [Paper](https://arxiv.org/abs/2210.13448) [Link](https://huggingface.co/datasets/dennlinger/eur-lex-sum) 200 | 201 | **Language**: multilingual **Country**: European 202 | 203 | - **Plain English Summarization of Contracts**: The dataset we propose contains 446 sets of parallel text. 204 | [Paper](https://www.aclweb.org/anthology/W19-2201) [Link](https://github.com/lauramanor/legal_summarization#plain-english-summarization-of-contracts) 205 | 206 | **Language**: English **Country**: America 207 | 208 | - **Summarization-of-Privacy-Policies**: This dataset was extracted from the text of privacy policy, terms of service, and cookie policy of 151 companies. The Points and Plain English Summaries are extracted from tosdr.org. 209 | [Paper](https://ceur-ws.org/Vol-2645/paper3.pdf) [Link](https://github.com/senjed/Summarization-of-Privacy-Policies) 210 | 211 | **Language**: English **Country**: Unknown 212 | 213 | - **Multi-LexSum**: We introduce Multi-LexSum, a collection of 9,280 expert-authored summaries drawn from ongoing CRLC writing.
214 | [Paper](https://arxiv.org/abs/2206.10883) [Link](https://multilexsum.github.io/) 215 | 216 | **Language**: English **Country**: Unknown 217 | 218 | 219 | 220 | ### Entity Extraction 221 | 222 | - **CDJUR-BR**: We describe the development of the Golden Collection of the Brazilian Judiciary (CDJUR-BR) contemplating a set of fine-grained named entities that have been annotated by experts in legal documents. This contains 44,526 annotations for 21 entities. 223 | [Paper](https://arxiv.org/abs/2305.18315) [Link](https://huggingface.co/datasets/dennlinger/eur-lex-sum) 224 | 225 | **Language**: Portuguese **Country**: Brazil 226 | 227 | - **Extracting Contract Elements**: The paper describes and is accompanied by a new benchmark dataset of approximately 3,500 English contracts with gold contract element annotations. 228 | [Paper](http://nlp.cs.aueb.gr/pubs/icail2017.pdf) 229 | 230 | **Language**: English **Country**: England 231 | 232 | - **LEVEN**: LEVEN contains 108 event types in total, including 64 charge-oriented events and 44 general events. 233 | [Paper](https://aclanthology.org/2022.findings-acl.17.pdf) [Link](https://github.com/thunlp/LEVEN) 234 | 235 | **Language**: Chinese **Country**: China 236 | 237 | ### Others 238 | 239 | - **MAUD**: To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association’s 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. 240 | [Paper](https://arxiv.org/abs/2301.00876) [Link](https://drive.google.com/drive/folders/1RujOK2FZKdFSCJ15tqdyd42g8WLsYagj) 241 | 242 | **Language**: English **Country**: America 243 | 244 | - **VerbCL**: This paper presents a new dataset that consists of the citation graph of court opinions, which cite previously published court opinions in support of their arguments.
245 | [Paper](https://arxiv.org/abs/2108.10120) [Link](https://uvaauas.figshare.com/articles/dataset/VerbCL_Dataset/14798878/1) 246 | 247 | **Language**: English **Country**: America 248 | 249 | - **MultiLegalSBD**: Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. We curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. 250 | [Paper](https://arxiv.org/abs/2305.01211) [Link](https://huggingface.co/datasets/rcds/MultiLegalSBD) 251 | 252 | **Language**: multilingual **Country**: multinational 253 | 254 | - **FairLex**: Our benchmarks cover four jurisdictions (European Council, USA, Switzerland, and China), five languages (English, German, French, Italian and Chinese) and fairness across five attributes (gender, age, region, language, and legal area). 255 | [Paper](https://arxiv.org/abs/2203.07228) [Link](https://huggingface.co/datasets/coastalcph/fairlex) 256 | 257 | **Language**: multilingual **Country**: multinational 258 | 259 | - **ContractNLI**: In this work, we propose document-level natural language inference (NLI) for contracts, a novel, real-world application of NLI that addresses such problems. We annotate and release the largest corpus to date, consisting of 607 annotated contracts. 260 | [Paper](https://arxiv.org/abs/2110.01799) [Link](https://stanfordnlp.github.io/contract-nli/) 261 | 262 | **Language**: English **Country**: America 263 | 264 | - **Demosthenes**: A novel corpus for argument mining in legal documents, composed of 40 decisions of the Court of Justice of the European Union on matters of fiscal state aid.
265 | [Paper](https://aclanthology.org/2022.argmining-1.14.pdf) [Link](https://github.com/adele-project/demosthenes) 266 | 267 | **Language**: English **Country**: European 268 | 269 | ## Websites 270 | - https://flk.npc.gov.cn/ all Chinese laws and regulations. 271 | 272 | - https://wenshu.court.gov.cn/ judicial documents in China. 273 | 274 | - https://www.westlaw.com/: a well-known legal research platform that provides access to legal documents, cases, statutes, commentaries, and legal news from around the world. 275 | 276 | - https://www.lexisnexis.com/: another widely used legal research tool that offers global legal documents, cases, statutes, news, and commentaries. 277 | 278 | - https://home.heinonline.org/: a specialized legal and law-related research database that includes legal documents, journals, statutes, and more from the United States and other countries. 279 | 280 | - https://case.law/ all official, book-published United States case law. 281 | 282 | - https://www.legifrance.gouv.fr/ a French legal publisher providing access to law codes and legal decisions. 283 | 284 | - http://scdb.wustl.edu/ information about every case decided by the US Supreme Court between 1791 and today. 285 | 286 | - https://www.statmt.org/europarl/ parallel text of the proceedings of the European Parliament, collected in 11 languages. 287 | 288 | - https://uscode.house.gov/download/download.shtml downloadable version of the US Code in XML format. 289 | 290 | - https://www.uspto.gov/ip-policy/economic-research/research-datasets/patent-litigation-docket-reports-data detailed patent litigation data on over 80k unique district court cases. 291 | 292 | - https://curia.europa.eu/jcms/jcms/j_6/: the official website of the European Court of Justice, offering access to European Union legal documents and cases. 293 | 294 | - https://www.justia.com/: provides access to a wide range of legal information, including cases, statutes, regulations, and legal articles. Covers both U.S.
federal and state laws. 295 | 296 | - https://www.findlaw.com/: offers legal resources and information, including cases, statutes, regulations, and legal news. Covers U.S. federal and state laws. 297 | 298 | - https://www.courtlistener.com/: a free legal research platform that provides access to U.S. federal and state court cases, along with other legal documents and opinions. 299 | 300 | - https://www.pacer.gov/: the Public Access to Court Electronic Records (PACER) system provides access to U.S. federal court documents, including case filings, docket information, and court opinions. Registration and fees may apply. 301 | 302 | - https://www.law.cornell.edu/: operated by Cornell Law School, the LII offers access to U.S. federal and state laws, regulations, and court cases, along with legal articles and resources. 303 | 304 | - https://www.bailii.org/: provides access to legal materials from the United Kingdom and Ireland, including cases, legislation, and legal journals. 305 | 306 | - https://www.austlii.edu.au/: offers access to legal materials from Australia and neighboring countries, including cases, legislation, treaties, and law reform reports. 307 | 308 | - https://www.canlii.org/: provides access to Canadian legal documents, including cases, statutes, regulations, and court rules. 309 | 310 | - http://www.worldlii.org/: a free and independent global legal research resource, aggregating legal materials from various countries and regions. 311 | 312 | 313 | ## Contact 314 | 315 | If you believe any resources are missing, or if you have questions, suggestions, or concerns, please feel free to open an issue on the repository or contact us by email at liht22@mails.tsinghua.edu.cn. 316 | 317 | 318 | ## License 319 | This repository is licensed under the [MIT](LICENSE) License. You are free to use, modify, and distribute the content of this repository for both commercial and non-commercial purposes.
However, we kindly request that you provide attribution by linking back to this repository. 320 | 321 | --- 322 | 323 | We hope you find the Awesome-LegalAI-Resources repository valuable and discover new insights and tools for your Legal AI journey. If you have any questions, suggestions, or concerns, please don't hesitate to open an issue. Happy exploring! --------------------------------------------------------------------------------