├── requirements.txt
├── README.md
├── .gitignore
└── wikipedia_reference_scraper.py

/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4==4.11.1
certifi==2022.6.15
chardet==5.0.0
fire==0.4.0
idna==3.3
requests==2.28.1
six==1.16.0
urllib3==1.26.5
wi==0.0.2

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# wikipedia-reference-scraper
Wikipedia API wrapper for references

# Motivation

I just graduated from the Physiology department, University of Ibadan. I started typing my final year project a few days before the
submission deadline. I used Wikipedia for my literature review because each page is supported by plenty of references. The next task
was to copy and paste those references. This was a lot of work, considering that some pages have over 200 references and I wasn't
working with just one page. I looked at several Wikipedia API wrappers, but none of the ones I checked did what I needed, so
I decided to write a simple script that scrapes a Wikipedia page.

# Usage

## Python Interactive Shell

```
>>> from wikipedia_reference_scraper import WikipediaReferenceScraper as wrs

>>> wrs().write_to_document('https://en.wikipedia.org/wiki/Blood_pressure#References', 'filename.docx')
```

## Through the Command Line (with python-fire)

```
$ python wikipedia_reference_scraper.py write_to_document https://en.wikipedia.org/wiki/Blood_pressure#References filename.docx
```

It pulls the references from a Wikipedia page and saves them in a file.

# Tools I Used

Requests

BeautifulSoup

python-fire

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/
pip-selfcheck.json
pyvenv.cfg
lib64
share/
bin/
lib/
__pycache__/

--------------------------------------------------------------------------------
/wikipedia_reference_scraper.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup
import fire


class WikipediaReferenceScraper(object):

    def get_url_page(self, url):
        '''
        This function fetches the page at the given url and returns its HTML.

        :param url: The url of the page that is to be scraped
        :return: The HTML text of the response
        '''
        page = requests.get(url)
        if page.status_code == 200:
            return page.text
        # A non-200 status usually means a mistyped url or a missing page;
        # raising here avoids passing None to BeautifulSoup further down.
        raise ValueError('Check the url entered: status code {}'.format(page.status_code))

    def get_references(self, url):
        '''
        This function calls get_url_page() and parses the response as HTML.

        :param url: The url of the page that is to be scraped
        :return: A list of the <cite> tags found on the page
        '''
        page_text = self.get_url_page(url)
        soup = BeautifulSoup(page_text, 'html.parser')
        citations = soup.find_all('cite')
        return citations

    def write_to_document(self, url, document_name):
        '''
        This is the main function. It calls the functions above and writes the
        scraped references to the named document.

        :param url: The url of the page that is to be scraped
        :param document_name: The name of the document the scraped references will be written to
        :return: There is no return object
        '''
        with open(document_name, 'w', encoding='utf-8') as file:
            for citation in self.get_references(url):
                file.write('\n\n')
                # The citation text is spread across several strings inside the <cite> tag.
                for string in citation.strings:
                    file.write(string)


if __name__ == '__main__':
    fire.Fire(WikipediaReferenceScraper)

--------------------------------------------------------------------------------
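
A minimal sketch (not part of the repository) of how the scraper can be exercised from Python before anything is written to disk. It only relies on the `get_references` method defined above and reuses the example URL from the README:

```
# Hypothetical preview snippet: list a few citations before saving them.
from wikipedia_reference_scraper import WikipediaReferenceScraper

scraper = WikipediaReferenceScraper()
url = 'https://en.wikipedia.org/wiki/Blood_pressure#References'

# get_references() returns the <cite> tags found on the page.
citations = scraper.get_references(url)
print('Found {} citations'.format(len(citations)))

# Each citation's text is recovered by joining its strings, the same way
# write_to_document() does before writing them to the output file.
for citation in citations[:3]:
    print(''.join(citation.strings))
```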