├── README.md
├── app.py
├── fatma.txt
├── image.png
├── john.txt
├── juma.txt
├── pictures
│   └── become_a_patron_button.png
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
# Plagiarism-checker-Python

This repo contains the source code of a Python script that detects plagiarism in textual documents using **cosine similarity**.

[![Become a patron](pictures/become_a_patron_button.png)](https://www.patreon.com/kalebujordan)

## How is it Done?

You might be wondering how plagiarism detection on textual data is done; it is not as complicated as you might think.

Computers are good with numbers, so to compute the similarity between two text documents, the raw text is first transformed into vectors, that is, arrays of numbers. From there, basic vector math is used to compute the similarity between them.

This repo contains a basic example of how to do that; a minimal sketch of the idea is also included at the end of this README.

## Getting Started

To get started with the code in this repo, either *clone* or *download* it to your machine as shown below:

```bash
git clone https://github.com/Kalebu/Plagiarism-checker-Python
```

## Dependencies

Before you start playing with the source code, install the dependencies as shown below:

```bash
pip3 install -r requirements.txt
```

## Running the App

To run this code, place the text documents you want to check in the project directory with the **.txt** extension. When you run the script, it automatically loads every document with that extension and computes the similarities between them, as shown below:

```bash
$ cd Plagiarism-checker-Python
$ python3 app.py
('john.txt', 'juma.txt', 0.5465972177348937)
('fatma.txt', 'john.txt', 0.14806887549598566)
('fatma.txt', 'juma.txt', 0.18643448370323362)
```

## A Python Library?

Would you rather use a Python library that compares strings and documents for you, without spending time writing the vectorizers yourself? Take a look at [Pysimilar](https://github.com/Kalebu/pysimilar).

## Explore it

Explore it and adapt it to your own use case. If you have any questions, feel free to reach me directly at *isaackeinstein@gmail.com*.

## Issues

If you run into any difficulties or issues while trying to run the script, please raise an issue.

## Pull Requests

If you have something to add, I welcome pull requests; helpful contributions will be merged as soon as possible.

## Give it a Star

If you find this repo useful, give it a star so that more people can discover it.

## Credits

All the credit goes to [kalebu](https://github.com/kalebu).
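## A Minimal Sketch

The snippet below is a minimal, self-contained sketch of the idea described above: two short pieces of text are turned into TF-IDF vectors with scikit-learn and compared with cosine similarity. It is not part of `app.py`; the variable names are made up for illustration, and the two example strings are simply the first lines of `john.txt` and `juma.txt` from this repo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy documents (first lines of john.txt and juma.txt).
doc_a = "Life is all about finding money and spending on luxury stuffs"
doc_b = "Life to me is about finding money and use it on things that makes you happy"

# Transform the raw text into TF-IDF vectors: one row per document.
vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])

# cosine_similarity returns a 2x2 matrix; the off-diagonal entry is the
# similarity between the two documents (1.0 means identical direction).
score = cosine_similarity(vectors)[0, 1]
print(f"similarity: {score:.3f}")
```

The same steps, applied to every pair of `.txt` files in the directory, are exactly what `app.py` does below.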
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Collect every .txt document in the current directory and read its contents.
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')]
student_notes = []
for _file in student_files:
    with open(_file, encoding='utf-8') as handle:
        student_notes.append(handle.read())


def vectorize(texts):
    # Turn the raw documents into TF-IDF vectors (one row per document).
    return TfidfVectorizer().fit_transform(texts).toarray()


def similarity(doc1, doc2):
    # Cosine similarity between two document vectors; returns a 2x2 matrix.
    return cosine_similarity([doc1, doc2])


vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))
plagiarism_results = set()


def check_plagiarism():
    # Compare every document against every other document.
    for student_a, text_vector_a in s_vectors:
        # Work on a copy with the current document removed, so a document
        # is never compared against itself.
        new_vectors = s_vectors.copy()
        current_index = new_vectors.index((student_a, text_vector_a))
        del new_vectors[current_index]
        for student_b, text_vector_b in new_vectors:
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            # Sort the pair so (a, b) and (b, a) collapse into one result.
            student_pair = sorted((student_a, student_b))
            score = (student_pair[0], student_pair[1], sim_score)
            plagiarism_results.add(score)
    return plagiarism_results


for data in check_plagiarism():
    print(data)

--------------------------------------------------------------------------------
/fatma.txt:
--------------------------------------------------------------------------------
Life is all about doing your best in trying to
find what works out for you and taking most time in
trying to pursue those skills
--------------------------------------------------------------------------------
/image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalebu/Plagiarism-checker-Python/3fb55b90c6cc5ede661e40a20147d94727f29da1/image.png
--------------------------------------------------------------------------------
/john.txt:
--------------------------------------------------------------------------------
Life is all about finding money and spending on luxury stuffs
Coz this life is kinda short , trust
--------------------------------------------------------------------------------
/juma.txt:
--------------------------------------------------------------------------------
Life to me is about finding money and use it on things that makes you happy
coz this life is kinda short
--------------------------------------------------------------------------------
/pictures/become_a_patron_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalebu/Plagiarism-checker-Python/3fb55b90c6cc5ede661e40a20147d94727f29da1/pictures/become_a_patron_button.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
scikit_learn==0.24.2
--------------------------------------------------------------------------------
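A closing note on the pairwise loop in `app.py`: the script vectorizes all documents once but still compares them two at a time. `cosine_similarity` can also take the whole TF-IDF matrix and return every pairwise score in a single call. The sketch below illustrates that alternative; it is not part of this repo's script, and the shortened document strings are stand-ins for the real `.txt` files above.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for the .txt files that app.py loads from the directory
# (shortened versions of the sample files in this repo).
files = ['fatma.txt', 'john.txt', 'juma.txt']
notes = [
    "Life is all about doing your best in trying to find what works out for you",
    "Life is all about finding money and spending on luxury stuffs",
    "Life to me is about finding money and use it on things that makes you happy",
]

# One TF-IDF row per document, then the full pairwise similarity matrix.
matrix = cosine_similarity(TfidfVectorizer().fit_transform(notes))

# Read each unique pair straight off the matrix.
for i, j in combinations(range(len(files)), 2):
    print((files[i], files[j], matrix[i, j]))
```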