├── requirements.txt
├── README.md
├── .gitignore
└── wikipedia_reference_scraper.py

/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4==4.11.1
certifi==2022.6.15
chardet==5.0.0
fire==0.4.0
idna==3.3
requests==2.28.1
six==1.16.0
urllib3==1.26.5
wi==0.0.2

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# wikipedia-reference-scraper
Wikipedia API wrapper for references

# Motivation

I just graduated from the Physiology department, University of Ibadan. I started typing my final year project a few days before the
submission deadline. I used Wikipedia for my literature review because each page is supported by plenty of references. The next task
was to copy and paste those references. This was a lot of work, considering that some pages have over 200 references and I wasn't
working with just one page. I looked at several Wikipedia API wrappers, but none of the ones I checked did what I needed, so
I decided to write a simple script that scrapes a Wikipedia page.

# Usage

## Python Interactive Shell

```
>>> from wikipedia_reference_scraper import WikipediaReferenceScraper as wrs

>>> wrs().write_to_document('https://en.wikipedia.org/wiki/Blood_pressure#References', 'filename.docx')
```

## Through the Command Line (with python-fire)

```
$ python wikipedia_reference_scraper.py write_to_document https://en.wikipedia.org/wiki/Blood_pressure#References filename.docx
```

It pulls the references from a Wikipedia page and saves them in a file.

# Tools I Used

Requests

BeautifulSoup

python-fire

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/
pip-selfcheck.json
pyvenv.cfg
lib64
share/
bin/
lib/
__pycache__/

--------------------------------------------------------------------------------
/wikipedia_reference_scraper.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup
import fire


class WikipediaReferenceScraper(object):

    def get_url_page(self, url):
        '''
        This function fetches the page at the given url and returns its HTML.

        :param url: The url of the page that is to be scraped
        :return: The HTML text of the response
        '''
        page = requests.get(url)
        if page.status_code == 200:
            return page.text
        # A non-200 status usually means a mistyped url or a missing page;
        # raising here avoids passing None to BeautifulSoup further down.
        raise ValueError('Check the url entered: status code {}'.format(page.status_code))

    def get_references(self, url):
        '''
        This function calls get_url_page() and parses the response as HTML.

        :param url: The url of the page that is to be scraped
        :return: A list of the <cite> tags found on the page
        '''
        page_text = self.get_url_page(url)
        soup = BeautifulSoup(page_text, 'html.parser')
        citations = soup.find_all('cite')
        return citations

    def write_to_document(self, url, document_name):
        '''
        This is the main function. It calls the functions above and writes the
        scraped references to the named document.

        :param url: The url of the page that is to be scraped
        :param document_name: The name of the document the scraped references will be written to
        :return: There is no return object
        '''
        with open(document_name, 'w', encoding='utf-8') as file:
            for citation in self.get_references(url):
                file.write('\n\n')
                # The citation text is spread across several strings inside the <cite> tag.
                for string in citation.strings:
                    file.write(string)


if __name__ == '__main__':
    fire.Fire(WikipediaReferenceScraper)

--------------------------------------------------------------------------------
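
A minimal sketch (not part of the repository) of how the scraper can be exercised from Python before anything is written to disk. It only relies on the `get_references` method defined above and reuses the example URL from the README:

```
# Hypothetical preview snippet: list a few citations before saving them.
from wikipedia_reference_scraper import WikipediaReferenceScraper

scraper = WikipediaReferenceScraper()
url = 'https://en.wikipedia.org/wiki/Blood_pressure#References'

# get_references() returns the <cite> tags found on the page.
citations = scraper.get_references(url)
print('Found {} citations'.format(len(citations)))

# Each citation's text is recovered by joining its strings, the same way
# write_to_document() does before writing them to the output file.
for citation in citations[:3]:
    print(''.join(citation.strings))
```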