├── README.md
├── figs
│   ├── carlini_extraction.png
│   ├── copyright_hobbit.png
│   ├── felps_unlearning.png
│   ├── google_fl.png
│   └── tindall_mia.png
└── scraping
    └── scrape.py
/README.md:
--------------------------------------------------------------------------------
1 | # Privacy Issues in Large Language Models
2 |
3 | This repository is a collection of links to papers and code repositories relevant to implementing LLMs with reduced privacy risks.
4 | These correspond to the papers discussed in our survey, available at: https://arxiv.org/abs/2312.06717
5 |
6 | This repository will be updated periodically with relevant papers scraped from arXiv. The survey paper itself will be
7 | updated somewhat less frequently. **Papers that have been added to this repository but not yet to the survey are
8 | marked with asterisks.**
9 |
10 | If you have a paper relevant to LLM privacy, [please nominate it for inclusion](https://forms.gle/jCZgXaEfb8R4CnDb7).
11 |
12 | *Repo last updated 5/30/2024*
13 |
14 | *Paper last updated 5/30/2024*
15 |
16 | ## Table of Contents
17 |
18 | - Privacy Risks of LLMs
19 |   - [Memorization](#memorization)
20 |   - [Privacy Attacks](#privacy-attacks)
21 | - [Private LLMs](#private-llms)
22 | - [Unlearning](#unlearning)
23 | - [Copyright](#copyright)
24 | - [Additional Related Surveys](#additional-related-surveys)
25 | - [Contact Info](#contact-info)
26 |
27 |
28 | ## Citation
29 |
30 | ```bibtex
31 | @misc{neel2023privacy,
32 |       title={Privacy Issues in Large Language Models: A Survey},
33 |       author={Seth Neel and Peter Chang},
34 |       year={2023},
35 |       eprint={2312.06717},
36 |       archivePrefix={arXiv},
37 |       primaryClass={cs.AI}
38 | }
39 | ```
40 |
41 | # Memorization
42 | 
43 |
44 | Image from [Carlini 2020](https://arxiv.org/abs/2012.07805)
45 |
46 | | **Paper Title** | **Year** | **Author** | **Code** |
47 | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|-------------------|:----------------------------------------------------------------------:|
48 | | [Emergent and Predictable Memorization in Large Language Models](https://arxiv.org/abs/2304.11158) | 2023 | Biderman et al. | [[Code]](https://github.com/EleutherAI/pythia) |
49 | | [The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks](https://arxiv.org/abs/1802.08232) | 2019 | Carlini et al. | |
50 | | [Quantifying Memorization Across Neural Language Models](https://arxiv.org/abs/2202.07646) | 2023 | Carlini et al. | |
51 | | [Do Localization Methods Actually Localize Memorized Data in LLMs?](https://arxiv.org/abs/2311.09060) | 2023 | Chang et al. | [[Code]](https://github.com/terarachang/MemData) |
52 | | [Does Learning Require Memorization? A Short Tale about a Long Tail](https://arxiv.org/abs/1906.05271) | 2020 | Feldman et al. | |
53 | | [Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy](https://arxiv.org/abs/2210.17546) | 2023 | Ippolito et al. | |
54 | | [Measuring Forgetting of Memorized Training Examples](https://arxiv.org/abs/2207.00099) | 2023 | Jagielski et al. | |
55 | | [Deduplicating Training Data Mitigates Privacy Risks in Language Models](https://arxiv.org/abs/2202.06539) | 2022 | Kandpal et al. | |
56 | | [How BPE Affects Memorization in Transformers](https://arxiv.org/abs/2110.02782) | 2021 | Kharitonov et al. | |
57 | | [Deduplicating Training Data Makes Language Models Better](https://arxiv.org/abs/2107.06499) | 2022 | Lee et al. | [[Code]](https://github.com/google-research/deduplicate-text-datasets) |
58 | | [How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN](https://arxiv.org/abs/2111.09509) | 2021 | McCoy et al. | [[Code]](https://github.com/tommccoy1/raven) |
59 | | [Training Production Language Models without Memorizing User Data](https://arxiv.org/abs/2009.10031) | 2020 | Ramaswamy et al. | |
60 | | [Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks](https://arxiv.org/abs/2210.12929) | 2022 | Raunak et al. | |
61 | | [Understanding Unintended Memorization in Federated Learning](https://arxiv.org/abs/2006.07490) | 2020 | Thakkar et al. | |
62 | | [Investigating the Impact of Pre-trained Word Embeddings on Memorization in Neural Networks](https://www.researchgate.net/publication/344105641_Investigating_the_Impact_of_Pre-trained_Word_Embeddings_on_Memorization_in_Neural_Networks) | 2020 | Thomas et al. | |
63 | | [Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models](https://arxiv.org/abs/2205.10770) | 2022 | Tirumala et al. | |
64 | | [Counterfactual Memorization in Neural Language Models](https://arxiv.org/abs/2112.12938) | 2021 | Zhang et al. | |
65 | | [Provably Confidential Language Modelling](https://arxiv.org/abs/2205.01863) | 2022 | Zhao et al. | [[Code]](https://github.com/XuandongZhao/CRT) |
66 | | [Quantifying and Analyzing Entity-level Memorization in Large Language Models](https://paperswithcode.com/paper/quantifying-and-analyzing-entity-level) | 2023 | Zhou et al. | |
67 |
68 |
71 |
72 | # Privacy Attacks
73 | 
74 |
75 | Image from [Tindall](https://gab41.lab41.org/membership-inference-attacks-on-neural-networks-c9dee3db67da)
76 |
77 | | **Paper Title** | **Year** | **Author** | **Code** |
78 | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|-------------------------|:---------------------------------------------------------------------------------:|
80 | | [TMI! Finetuned Models Leak Private Information from their Pretraining Data](https://arxiv.org/abs/2306.01181) | 2023 | Abascal et al. | |
81 | | [Extracting Training Data from Large Language Models](https://arxiv.org/abs/2012.07805) | 2020 | Carlini et al. | [[Code]](https://github.com/ftramer/LM_Memorization) |
82 | | [Membership Inference Attacks From First Principles](https://arxiv.org/abs/2112.03570) | 2022 | Carlini et al. | [[Code]](https://github.com/tensorflow/privacy/tree/master/research/mi_lira_2021) |
83 | | [Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration](https://arxiv.org/abs/2311.06062) | 2023 | Fu et al. | |
84 | | [Membership Inference Attacks on Sequence-to-Sequence Models: Is My Data In Your Machine Translation System?](https://arxiv.org/abs/1904.05506) | 2020 | Hisamoto et al. | [[Code]](https://github.com/sorami/tacl-membership) |
85 | | [Membership Inference Attacks on Machine Learning: A Survey](https://arxiv.org/abs/2103.07853) | 2021 | Hu et al. | |
86 | | [Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?](https://arxiv.org/abs/2104.07762) | 2021 | Lehman et al. | [[Code]](https://github.com/elehman16/exposing_patient_data_release) |
87 | | [MoPe: Model Perturbation-based Privacy Attacks on Language Models](https://paperswithcode.com/paper/mope-model-perturbation-based-privacy-attacks) | 2023 | Li et al. | |
88 | | [When Machine Learning Meets Privacy: A Survey and Outlook](https://dl.acm.org/doi/abs/10.1145/3436755) | 2021 | Liu et al. | |
89 | | [Data Portraits: Recording Foundation Model Training Data](https://arxiv.org/abs/2303.03919) | 2023 | Marone et al. | [[Code]](https://dataportraits.org/) |
90 | | [Membership Inference Attacks against Language Models via Neighbourhood Comparison](https://aclanthology.org/2023.findings-acl.719/) | 2023 | Mattern et al. | |
91 | | [Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models](https://arxiv.org/abs/2310.15007) | 2023 | Meeus et al. | |
92 | | [Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks](https://arxiv.org/abs/2203.03929) | 2022 | Mireshghallah et al. | Contact fmireshg@eng.ucsd.edu |
93 | | [Scalable Extraction of Training Data from Production Language Models](https://arxiv.org/abs/2311.17035) | 2023 | Nasr et al. | |
94 | | [Detecting Pretraining Data from Large Language Models](https://arxiv.org/abs/2310.16789) | 2023 | Shi et al. | [[Code]](https://swj0419.github.io/detect-pretrain.github.io/) |
95 | | [Membership Inference Attacks against Machine Learning Models](https://arxiv.org/abs/1610.05820) | 2017 | Shokri et al. | |
96 | | [Information Leakage in Embedding Models](https://arxiv.org/abs/2004.00053) | 2020 | Song and Raghunathan | |
97 | | [Auditing Data Provenance in Text-Generation Models](https://arxiv.org/abs/1811.00513) | 2019 | Song and Shmatikov | |
98 | | [Beyond Memorization: Violating Privacy Via Inference with Large Language Models](https://arxiv.org/abs/2310.07298) | 2023 | Staab et al. | [[Code]](https://github.com/eth-sri/llmprivacy) |
99 | | [Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting](https://arxiv.org/abs/1709.01604) | 2018 | Yeom et al. | |
100 | | [Bag of Tricks for Training Data Extraction from Language Models](https://arxiv.org/abs/2302.04460) | 2023 | Yu et al. | [[Code]](https://github.com/weichen-yu/LM-Extraction) |
101 | | [Analyzing Information Leakage of Updates to Natural Language Models](https://arxiv.org/abs/1912.07942) | 2020 | Zanella-Béguelin et al. | |
102 | | [Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation](https://arxiv.org/abs/2307.04401) | 2023 | Zhang et al. | [[Code]](https://github.com/thu-coai/Targeted-Data-Extraction) |
103 |
104 | Furthermore, see [[Google Training Data Extraction Challenge]](https://github.com/google-research/lm-extraction-benchmark)
105 |
106 |
109 |
110 | # Private LLMs
111 | 
112 |
113 | Image from [Google AI Blog](https://blog.research.google/2017/04/federated-learning-collaborative.html)
114 |
115 | | **Paper Title** | **Year** | **Author** | **Code** |
116 | |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|--------------------|:----------------------------------------------------------------------------------------------------:|
117 | | [Deep Learning with Differential Privacy](https://arxiv.org/abs/1607.00133) | 2016 | Abadi et al. | |
118 | | [Large-Scale Differentially Private BERT](https://arxiv.org/abs/2108.01624) | 2021 | Anil et al. | |
119 | | [Sanitizing Sentence Embeddings (and Labels) for Local Differential Privacy](https://dl.acm.org/doi/abs/10.1145/3543507.3583512) | 2023 | Du et al. | [[Code]](https://github.com/xiangyue9607/Sentence-LDP) |
120 | | [DP-Forward: Fine-tuning and Inference on Language Models with Differential Privacy in Forward Pass](https://arxiv.org/abs/2309.06746) | 2023 | Du et al. | [[Code]](https://github.com/xiangyue9607/DP-Forward) |
121 | | [An Efficient DP-SGD Mechanism for Large Scale NLP Models](https://arxiv.org/abs/2107.14586) | 2022 | Dupuy et al. | |
122 | | [The Algorithmic Foundations of Differential Privacy](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) | 2014 | Dwork and Roth | |
123 | | [Submix: Practical Private Prediction for Large-Scale Language Models](https://arxiv.org/abs/2201.00971) | 2022 | Ginart et al. | |
124 | | [Federated Learning for Mobile Keyboard Prediction](https://arxiv.org/abs/1811.03604) | 2019 | Hard et al. | |
125 | | [Learning and Evaluating a Differentially Private Pre-trained Language Model](https://aclanthology.org/2021.findings-emnlp.102/) | 2021 | Hoory et al. | |
126 | | [Knowledge Sanitization of Large Language Models](https://arxiv.org/abs/2309.11852) | 2023 | Ishibashi and Shimodaira | |
127 | | [Differentially Private Language Models Benefit from Public Pre-training](https://arxiv.org/abs/2009.05886) | 2020 | Kerrigan et al. | [[Code]](https://github.com/dylan-slack/Finetuning-DP-Language-Models) |
128 | | [Large Language Models Can Be Strong Differentially Private Learners](https://arxiv.org/abs/2110.05679) | 2022 | Li et al. | |
129 | | [Differentially Private Decoding in Large Language Models](https://arxiv.org/abs/2205.13621) | 2022 | Majmudar et al. | |
130 | | [Communication-Efficient Learning of Deep Networks from Decentralized Data](https://arxiv.org/abs/1602.05629) | 2016 | McMahan et al. | |
131 | | [Learning Differentially Private Recurrent Language Models](https://arxiv.org/abs/1710.06963) | 2018 | McMahan et al. | |
132 | | [Selective Differential Privacy for Language Modeling](https://arxiv.org/abs/2108.12944) | 2022 | Shi et al. | [[Code]](https://github.com/wyshi/lm_privacy) |
133 | | [Training Production Language Models without Memorizing User Data](https://arxiv.org/abs/2009.10031) | 2020 | Ramaswamy et al. | |
134 | | [Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation](https://arxiv.org/abs/2309.11765v1) | 2023 | Tang et al. | [[Code]](https://github.com/microsoft/dp-few-shot-generation) |
135 | | [Understanding Unintended Memorization in Federated Learning](https://arxiv.org/abs/2006.07490) | 2020 | Thakkar et al. | |
136 | | [Differentially Private Fine-tuning of Language Models](https://arxiv.org/abs/2110.06500) | 2022 | Yu et al. | [[Code]](https://github.com/huseyinatahaninan/Differentially-Private-Fine-tuning-of-Language-Models) |
137 | | [Provably Confidential Language Modelling](https://arxiv.org/abs/2205.01863) | 2022 | Zhao et al. | [[Code]](https://github.com/XuandongZhao/CRT) |
138 | | [***Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory](https://paperswithcode.com/paper/can-llms-keep-a-secret-testing-privacy) | 2023 | Mireshghallah et al. | [[Code]](https://confaide.github.io/) |
139 | | [***Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models](https://arxiv.org/abs/2312.09669) | 2023 | Zhou et al. | |
140 |
141 |
144 |
145 | # Unlearning
146 | 
147 |
148 | Image from [Felps 2020](https://www.researchgate.net/publication/346879997_Class_Clown_Data_Redaction_in_Machine_Unlearning_at_Enterprise_Scale)
149 |
150 | | **Paper Title** | **Year** | **Author** | **Code** |
151 | |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|------------------|:------------------------------------------------------------------------:|
152 | | [Machine Unlearning](https://arxiv.org/abs/1912.03817) | 2020 | Bourtoule et al. | [[Code]](https://github.com/cleverhans-lab/machine-unlearning) |
153 | | [Unlearn What You Want to Forget: Efficient Unlearning for LLMs](https://arxiv.org/abs/2310.20150) | 2023 | Chen et al. | |
154 | | [Who's Harry Potter? Approximate Unlearning in LLMs](https://arxiv.org/abs/2310.02238) | 2023 | Eldan et al. | |
155 | | [Amnesiac Machine Learning](https://arxiv.org/abs/2010.10981) | 2020 | Graves et al. | |
156 | | [Adaptive Machine Unlearning](https://arxiv.org/abs/2106.04378) | 2021 | Gupta et al. | [[Code]](https://github.com/ChrisWaites/adaptive-machine-unlearning) |
157 | | [Inexact Unlearning Needs More Careful Evaluations to Avoid a False Sense of Privacy](https://arxiv.org/abs/2403.01218) | 2024 | Hayes et al. | |
158 | | [Preserving Privacy Through DeMemorization: An Unlearning Technique For Mitigating Memorization Risks In Language Models](https://aclanthology.org/2023.emnlp-main.265.pdf) | 2023 | Kassem et al. | |
159 | | [Privacy Adhering Machine Un-learning in NLP](https://arxiv.org/abs/2212.09573) | 2022 | Kumar et al. | |
160 | | [Knowledge Unlearning for Mitigating Privacy Risks in Language Models](https://arxiv.org/pdf/2210.01504) | 2022 | Jang et al. | [[Code]](https://github.com/joeljang/knowledge-unlearning) |
161 | | [Unlearnable Algorithms for In-context Learning](https://arxiv.org/abs/2402.00751) | 2024 | Muresanu et al. | |
162 | | [Descent-to-Delete: Gradient-Based Methods for Machine Unlearning](https://arxiv.org/abs/2007.02923) | 2020 | Neel et al. | |
163 | | [A Survey of Machine Unlearning](https://arxiv.org/abs/2209.02299) | 2022 | Nguyen et al. | |
164 | | [Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks](https://openreview.net/forum?id=7erlRDoaV8) | 2023 | Patil et al. | [[Code]](https://github.com/Vaidehi99/InfoDeletionAttacks) |
165 | | [In-Context Unlearning: Language Models as Few Shot Unlearners](https://arxiv.org/abs/2310.07579) | 2023 | Pawelczyk et al. | |
166 | | [Large Language Model Unlearning](https://arxiv.org/abs/2310.10683) | 2023 | Yao et al. | [[Code]](https://github.com/kevinyaobytedance/llm_unlearn) |
167 |
168 |
172 |
173 | # Copyright
174 | 
175 |
176 | Custom Image
177 |
178 | | **Paper Title** | **Year** | **Author** | **Code** |
179 | |---------------------------------------------------------------------------------------------------------------------------------------|:--------:|----------------------|:-------------:|
180 | | [Can Copyright be Reduced to Privacy?](https://arxiv.org/abs/2305.14822) | 2023 | Elkin-Koren et al. | |
181 | | [DeepCreativity: Measuring Creativity with Deep Learning Techniques](https://arxiv.org/abs/2201.06118) | 2022 | Franceschelli et al. | |
182 | | [Foundation Models and Fair Use](https://arxiv.org/abs/2303.15715) | 2023 | Henderson et al. | |
183 | | [Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy](https://arxiv.org/abs/2210.17546) | 2023 | Ippolito et al. | |
184 | | [Copyright Violations and Large Language Models](https://paperswithcode.com/paper/copyright-violations-and-large-language) | 2023 | Karamolegkou et al. | [[Code]](https://github.com/coastalcph/CopyrightLLMs) |
185 | | [Formalizing Human Ingenuity: A Quantitative Framework for Copyright Law's Substantial Similarity](https://arxiv.org/abs/2206.01230) | 2022 | Scheffler et al. | |
186 | | [On Provable Copyright Protection for Generative Models](https://arxiv.org/abs/2302.10870) | 2023 | Vyas et al. | |
187 |
188 |
191 |
192 | # Additional Related Surveys
193 |
194 | | **Paper Title** | **Year** | **Author** | **Code** |
195 | |------------------------------------------------------------------------------------------------------|:--------:|---------------|:-------------:|
196 | | [Membership Inference Attacks on Machine Learning: A Survey](https://dl.acm.org/doi/10.1145/3523273) | 2022 | Hu et al. | |
197 | | [When Machine Learning Meets Privacy: A Survey and Outlook](https://dl.acm.org/doi/10.1145/3436755) | 2021 | Liu et al. | |
198 | | [Rethinking Machine Unlearning for Large Language Models](https://arxiv.org/abs/2402.08787) | 2024 | Liu et al. | |
199 | | [A Survey of Machine Unlearning](https://arxiv.org/abs/2209.02299) | 2022 | Nguyen et al. | |
200 | | [***A Survey of Large Language Models](https://arxiv.org/abs/2303.18223) | 2023 | Zhao et al. | |
201 |
202 |
203 | # Contact Info
204 |
205 | Repository maintained by Peter Chang (pchang@hbs.edu)
206 |
--------------------------------------------------------------------------------
/figs/carlini_extraction.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/safr-ai-lab/survey-llm/bc373c9f76dcdf04741fe25c489b783645804620/figs/carlini_extraction.png
--------------------------------------------------------------------------------
/figs/copyright_hobbit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/safr-ai-lab/survey-llm/bc373c9f76dcdf04741fe25c489b783645804620/figs/copyright_hobbit.png
--------------------------------------------------------------------------------
/figs/felps_unlearning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/safr-ai-lab/survey-llm/bc373c9f76dcdf04741fe25c489b783645804620/figs/felps_unlearning.png
--------------------------------------------------------------------------------
/figs/google_fl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/safr-ai-lab/survey-llm/bc373c9f76dcdf04741fe25c489b783645804620/figs/google_fl.png
--------------------------------------------------------------------------------
/figs/tindall_mia.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/safr-ai-lab/survey-llm/bc373c9f76dcdf04741fe25c489b783645804620/figs/tindall_mia.png
--------------------------------------------------------------------------------
/scraping/scrape.py:
--------------------------------------------------------------------------------
import os
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def clean_soup(string):
    """Lower-case an HTML fragment, strip tags, and collapse repeated whitespace."""
    string = re.sub(r"<.*?>", "", string.lower()).replace('\n', '')
    string = re.sub(r' +', ' ', string)
    return string


# Only consider papers first submitted on or after this date.
from_date = '2023-11-01'

results_df = pd.DataFrame()

# Search terms per category; the keyword lists are used to score how relevant an abstract is.
keywords = {'memorization': ['llm', 'language model', 'memorization', 'memorized', 'privacy', 'eidetic', 'extractibility', 'forgetting'],
            'privacy attack': ['llm', 'language model', 'privacy', 'attack', 'membership inference', 'shadow model', 'threshold', 'extraction'],
            'inference attack': ['llm', 'language model', 'privacy', 'attack', 'membership inference', 'shadow model', 'threshold', 'extraction'],
            'extraction attack': ['llm', 'language model', 'privacy', 'attack', 'membership inference', 'shadow model', 'threshold', 'extraction'],
            'private model': ['llm', 'language model', 'privacy preserving', 'private', 'differential privacy', 'federated learning', 'federated averaging'],
            'privacy preserving': ['llm', 'language model', 'privacy preserving', 'private', 'differential privacy', 'federated learning', 'federated averaging'],
            'unlearning': ['llm', 'language model', 'unlearning', 'leave one out', 'sisa'],
            'copyright': ['llm', 'language model', 'copyright', 'fair use', 'near access', 'unlearning', 'novelty']}

for category in keywords:
    print(f'searching {category}')
    # arXiv advanced search: abstracts containing "language model" AND the category term,
    # restricted to CS papers first submitted after `from_date`.
    URL = f"https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=language+model&terms-0-field=abstract&terms-1-operator=AND&terms-1-term={category}&terms-1-field=abstract&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-year=&date-filter_by=date_range&date-from_date={from_date}&date-to_date=&date-date_type=submitted_date_first&abstracts=show&size=200&order=-announced_date_first"
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find_all('li', attrs={'class': 'arxiv-result'})
    print(f'{len(results)} results found')
    for result in results:
        title = clean_soup(str(result.find('p', attrs={'class': 'title is-5 mathjax'})))
        authors = list(clean_soup(str(result.find('p', attrs={'class': 'authors'}))).replace('authors:', '').split(', '))
        abstract = clean_soup(str(result.find('span', attrs={'class': 'abstract-full has-text-grey-dark mathjax'})))
        # Relevance score: total number of keyword occurrences in the abstract.
        relevance = 0
        for k in keywords[category]:
            relevance += abstract.count(k)
        # Collapse the attack-related and privacy-preserving searches into single categories.
        if category in ['inference attack', 'extraction attack', 'privacy attack']:
            cat = 'attack'
        elif category in ['private model', 'privacy preserving']:
            cat = 'private'
        else:
            cat = category
        data = pd.DataFrame({"title": [title],
                             "authors": [authors[0]],  # keep only the first author
                             "category": [cat],
                             "relevance": [relevance],
                             "abstract": [abstract]})
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
        results_df = pd.concat([results_df, data], ignore_index=True)

# Keep only sufficiently relevant results, rank by relevance, and drop duplicate titles.
results_df = results_df[results_df.relevance > 4]
results_df = results_df.sort_values('relevance', ascending=False).drop_duplicates(subset='title', keep='first')
os.makedirs('results', exist_ok=True)  # ensure the output directory exists
results_df.to_csv(f'results/results_{from_date}.csv', index=False)
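
# --- Optional review helper: a minimal sketch, not part of the original pipeline ---
# Prints the highest-relevance candidates per category from the CSV written above so
# they can be reviewed for inclusion in README.md. The function name and the `top_n`
# parameter are assumptions introduced here for illustration.
def print_top_candidates(csv_path, top_n=5):
    df = pd.read_csv(csv_path)
    for cat_name, group in df.groupby('category'):
        print(f'== {cat_name} ==')
        # nlargest keeps the top_n rows by relevance within this category.
        for _, row in group.nlargest(top_n, 'relevance').iterrows():
            print(f"  [{row['relevance']}] {row['title']}")

# Example usage (uncomment to run after the scrape above has finished):
# print_top_candidates(f'results/results_{from_date}.csv')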
--------------------------------------------------------------------------------