├── README.md
├── app.py
├── fatma.txt
├── image.png
├── john.txt
├── juma.txt
├── pictures
│   └── become_a_patron_button.png
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
# Plagiarism-checker-Python

This repo contains the source code of a Python script that detects plagiarism in textual documents using **cosine similarity**.

[![Become a patron](pictures/become_a_patron_button.png)](https://www.patreon.com/kalebujordan)

## How is it Done?

You might be wondering how plagiarism detection on textual data is done; it is not as complicated as you might think.

Computers are good with numbers, so to compute the similarity between two text documents, the raw text is first transformed into vectors, that is, arrays of numbers. From there, basic vector math is used to compute the similarity between them.

This repo contains a basic example of how to do that; a minimal sketch of the idea is also included at the end of this README.

## Getting Started

To get started with the code in this repo, either *clone* or *download* it to your machine as shown below:

```bash
git clone https://github.com/Kalebu/Plagiarism-checker-Python
```

## Dependencies

Before you start playing with the source code, install the dependencies as shown below:

```bash
pip3 install -r requirements.txt
```

## Running the App

To run this code, place the text documents you want to check in the project directory with the **.txt** extension. When you run the script, it automatically loads every document with that extension and computes the similarities between them, as shown below:

```bash
$ cd Plagiarism-checker-Python
$ python3 app.py
('john.txt', 'juma.txt', 0.5465972177348937)
('fatma.txt', 'john.txt', 0.14806887549598566)
('fatma.txt', 'juma.txt', 0.18643448370323362)
```

## A Python Library?

Would you rather use a Python library that compares strings and documents for you, without spending time writing the vectorizers yourself? Take a look at [Pysimilar](https://github.com/Kalebu/pysimilar).

## Explore it

Explore it and adapt it to your own use case. If you have any questions, feel free to reach me directly at *isaackeinstein@gmail.com*.

## Issues

If you run into any difficulties or issues while trying to run the script, please raise an issue.

## Pull Requests

If you have something to add, I welcome pull requests; helpful contributions will be merged as soon as possible.

## Give it a Star

If you find this repo useful, give it a star so that more people can discover it.

## Credits

All the credit goes to [kalebu](https://github.com/kalebu).
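## A Minimal Sketch

The snippet below is a minimal, self-contained sketch of the idea described above: two short pieces of text are turned into TF-IDF vectors with scikit-learn and compared with cosine similarity. It is not part of `app.py`; the variable names are made up for illustration, and the two example strings are simply the first lines of `john.txt` and `juma.txt` from this repo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy documents (first lines of john.txt and juma.txt).
doc_a = "Life is all about finding money and spending on luxury stuffs"
doc_b = "Life to me is about finding money and use it on things that makes you happy"

# Transform the raw text into TF-IDF vectors: one row per document.
vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])

# cosine_similarity returns a 2x2 matrix; the off-diagonal entry is the
# similarity between the two documents (1.0 means identical direction).
score = cosine_similarity(vectors)[0, 1]
print(f"similarity: {score:.3f}")
```

The same steps, applied to every pair of `.txt` files in the directory, are exactly what `app.py` does below.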
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Collect every .txt document in the current directory and read its contents.
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')]
student_notes = []
for _file in student_files:
    with open(_file, encoding='utf-8') as handle:
        student_notes.append(handle.read())


def vectorize(texts):
    # Turn the raw documents into TF-IDF vectors (one row per document).
    return TfidfVectorizer().fit_transform(texts).toarray()


def similarity(doc1, doc2):
    # Cosine similarity between two document vectors; returns a 2x2 matrix.
    return cosine_similarity([doc1, doc2])


vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))
plagiarism_results = set()


def check_plagiarism():
    # Compare every document against every other document.
    for student_a, text_vector_a in s_vectors:
        # Work on a copy with the current document removed, so a document
        # is never compared against itself.
        new_vectors = s_vectors.copy()
        current_index = new_vectors.index((student_a, text_vector_a))
        del new_vectors[current_index]
        for student_b, text_vector_b in new_vectors:
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            # Sort the pair so (a, b) and (b, a) collapse into one result.
            student_pair = sorted((student_a, student_b))
            score = (student_pair[0], student_pair[1], sim_score)
            plagiarism_results.add(score)
    return plagiarism_results


for data in check_plagiarism():
    print(data)

--------------------------------------------------------------------------------
/fatma.txt:
--------------------------------------------------------------------------------
Life is all about doing your best in trying to
find what works out for you and taking most time in
trying to pursue those skills
--------------------------------------------------------------------------------
/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalebu/Plagiarism-checker-Python/3fb55b90c6cc5ede661e40a20147d94727f29da1/image.png
--------------------------------------------------------------------------------
/john.txt:
--------------------------------------------------------------------------------
Life is all about finding money and spending on luxury stuffs
Coz this life is kinda short , trust
--------------------------------------------------------------------------------
/juma.txt:
--------------------------------------------------------------------------------
Life to me is about finding money and use it on things that makes you happy
coz this life is kinda short
--------------------------------------------------------------------------------
/pictures/become_a_patron_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalebu/Plagiarism-checker-Python/3fb55b90c6cc5ede661e40a20147d94727f29da1/pictures/become_a_patron_button.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
scikit_learn==0.24.2
--------------------------------------------------------------------------------
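A closing note on the pairwise loop in `app.py`: the script vectorizes all documents once but still compares them two at a time. `cosine_similarity` can also take the whole TF-IDF matrix and return every pairwise score in a single call. The sketch below illustrates that alternative; it is not part of this repo's script, and the shortened document strings are stand-ins for the real `.txt` files above.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for the .txt files that app.py loads from the directory
# (shortened versions of the sample files in this repo).
files = ['fatma.txt', 'john.txt', 'juma.txt']
notes = [
    "Life is all about doing your best in trying to find what works out for you",
    "Life is all about finding money and spending on luxury stuffs",
    "Life to me is about finding money and use it on things that makes you happy",
]

# One TF-IDF row per document, then the full pairwise similarity matrix.
matrix = cosine_similarity(TfidfVectorizer().fit_transform(notes))

# Read each unique pair straight off the matrix.
for i, j in combinations(range(len(files)), 2):
    print((files[i], files[j], matrix[i, j]))
```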