├── requirements.txt
├── README.md
└── noderank.py

/requirements.txt:
--------------------------------------------------------------------------------
requests
bs4
numpy
nltk
networkx
lxml

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# NodeRank
Web content extraction using the PageRank algorithm to find the element containing the best content.

This project was started primarily as a way to extract quality content from web pages with reasonable accuracy, omitting page chrome and other superfluous markup, on serverless infrastructure such as AWS Lambda. There are still a few hurdles on Lambda, such as making NLTK's Punkt tokenizer data available and fitting within the size limits on deployed code, but it is doable; with Dragnet and Boilerpipe, dependency and size issues make this a very difficult, perhaps impossible, task.

## Running
```python
# Install the requirements: pip install -r requirements.txt
# Put the noderank.py file in your working directory
import noderank

url = "https://salience.co.uk/insight/magazine/basic-log-file-analysis/"
content = noderank.extract_content_noderank(url=url, timeout=10)
```

## Idea
Since there is a structural relationship between the nodes in the DOM of a web page, can we use that structure to identify the parent element most likely to contain the quality content on the page?

This algorithm does the following (a rough sketch of the graph-and-PageRank step appears after the list).
1. The `body` element of the page is extracted using BeautifulSoup.
1. Unnecessary elements are stripped out of the `body` node, e.g. `
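Below is a minimal sketch of the graph-and-PageRank idea, assuming `requests`, BeautifulSoup, and `networkx` as listed in requirements.txt. It is illustrative rather than the project's actual implementation: the helper name `rank_dom_nodes` is hypothetical, and `noderank.py` presumably layers text-aware scoring on top (the `nltk` dependency suggests sentence tokenization with Punkt).

```python
# Illustrative sketch only, not the noderank.py implementation:
# model parent/child links in the DOM as a graph, run PageRank over it
# with networkx, and return the highest-ranked element.
import networkx as nx
import requests
from bs4 import BeautifulSoup


def rank_dom_nodes(url, timeout=10):
    html = requests.get(url, timeout=timeout).text
    body = BeautifulSoup(html, "lxml").body

    # Key graph nodes by id(): bs4 tags with identical markup compare
    # equal, which would otherwise merge distinct DOM nodes together.
    tags = {id(body): body}
    graph = nx.Graph()
    for node in body.find_all(True):  # True matches every tag
        tags[id(node)] = node
        graph.add_edge(id(node), id(node.parent))

    # On this undirected tree, PageRank concentrates on well-connected
    # elements, e.g. a <div> holding many <p> children.
    scores = nx.pagerank(graph)
    return tags[max(scores, key=scores.get)]
```

In this naive form the ranking rewards structural hubs regardless of what they contain, so a real implementation would combine it with text-density signals so that prose-heavy containers beat menus and footers.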