├── .env ├── LICENSE ├── README.md ├── docker-compose.yml ├── docker └── webscraper │ └── Dockerfile ├── images_for_readme ├── overview.png ├── overview.svg ├── results_1.png └── results_2.png ├── requirements ├── requirements.in └── requirements.txt ├── scrapy.cfg └── webscraper_for_sophie ├── __init__.py ├── database_manager.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders ├── __init__.py └── willhaben_spider.py /.env: -------------------------------------------------------------------------------- 1 | # Name of the database 2 | MYSQL_DATABASE=webscraper_database 3 | 4 | # Name of the table 5 | MYSQL_TABLENAME=condos 6 | 7 | # Password for the root user 8 | MYSQL_ROOT_PASSWORD=kfj3gjlf&4MGdf596 9 | 10 | # Credentials for the webscraper user 11 | MYSQL_USER=webscraper 12 | MYSQL_PASSWORD=ackj8636&fkf 13 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Michael Haar 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | > Note: This code is quite old and I don't recommend using it. Please check out the [Immobilien Suchmaschine](https://apify.com/michaelhaar/immobilien-suchmaschine) 2 | 3 | # Webscraper for Sophie :ant: 4 | 5 | This is a small webscraper application, which I wrote to do market research 6 | on condo prices in my hometown :house: (Graz, Styria, Austria). 7 | 8 | The webscraper crawls the [Willhaben](https://www.willhaben.at) website, extracts the 9 | relevant data and stores it in a local database as shown in the following figure: 10 | 11 | ![website data is extracted and stored in DB](images_for_readme/overview.png) 12 | 13 | The data can be used in various ways. I did a small market analysis in 14 | Tableau, which is not included in this project, but I'd like to provide a 15 | teaser of the results: :chart_with_upwards_trend: 16 | 17 | ![results](images_for_readme/results_1.png) 18 | ![results](images_for_readme/results_2.png) 19 | 20 | ## Prerequisites 21 | 22 | Make sure **Docker-Compose** is installed on your machine. See 23 | [Install Docker Compose](https://docs.docker.com/compose/install/) 24 | 25 | Note for Ubuntu users: 26 | 27 | 1. 
Install Docker as described [here](https://docs.docker.com/engine/install/ubuntu/) 28 | 2. Do the [post install steps](https://docs.docker.com/engine/install/linux-postinstall/) 29 | 3. Install Docker-Compose as described [here](https://docs.docker.com/compose/install/) 30 | 31 | ## :rocket: Usage 32 | 33 | The `docker-compose.yml` file will start two Docker containers 34 | (one for the webscraper and one for the MySQL database) and can be used like 35 | this: 36 | 37 | 1. Open a new terminal window and type the following commands: 38 | 39 | ```bash 40 | git clone https://github.com/michaelhaar/webscraper_sophie.git 41 | 42 | # cd into the repository 43 | cd webscraper_sophie 44 | ``` 45 | 46 | 2. **Build the containers**. The following command will download 47 | the base images (Python + MySQL) from the internet and install the required 48 | Python packages inside the webscraper container. 49 | 50 | ```bash 51 | docker-compose build # may take a few minutes 52 | ``` 53 | 54 | 3. We can **start the containers** and immediately run the webscraper 55 | application with 56 | 57 | ```bash 58 | docker-compose up # starts the docker containers 59 | ``` 60 | 61 | 4. Verify that the webscraper is working properly by viewing the output of the 62 | webscraper in the terminal window. It should print out the extracted items. 63 | 64 | Note: It's very likely that the webscraper application isn't working anymore 65 | because Willhaben changes the HTML of their website from time to time. 66 | Please feel free to contact me or create a new issue if you need something. 67 | 68 | 5. Access the results, which were extracted and then inserted into the database. 69 | You can use your favorite MySQL client for this step. (I'm using 70 | [DataGrip](https://www.jetbrains.com/de-de/datagrip/)). 71 | 72 | - host=`localhost` 73 | - port=`3306` 74 | - The login credentials can be found in the `.env` file. 75 | 76 | ## Hints 77 | 78 | The following commands might be useful: 79 | 80 | ```bash 81 | # Build the docker containers 82 | docker-compose build 83 | 84 | # Start the docker containers 85 | docker-compose up 86 | 87 | # Bash into the webscraper container 88 | docker-compose run --rm webscraper bash 89 | 90 | # Manually start the webscraper application, 91 | # when bashed into the webscraper container 92 | scrapy crawl willhaben 93 | 94 | # Re-build PIP requirements 95 | docker-compose run --rm webscraper pip-compile requirements/requirements.in 96 | 97 | ``` 98 | 99 | ## Author 100 | 101 | 👤 **Michael Haar** 102 | 103 | - LinkedIn: [@michaelhaar](https://www.linkedin.com/in/michaelhaar/) 104 | - GitHub: [@michaelhaar](https://github.com/michaelhaar) 105 | - Email: michael.haar@gmx.at 106 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: "3.4" 2 | 3 | services: 4 | db: 5 | image: mysql:5.7 6 | ports: 7 | - "3306:3306" 8 | volumes: 9 | - db_data:/var/lib/mysql 10 | environment: 11 | MYSQL_DATABASE: ${MYSQL_DATABASE} # from env file 12 | MYSQL_USER: "${MYSQL_USER}" # from env file 13 | MYSQL_PASSWORD: ${MYSQL_PASSWORD} # from env file 14 | MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD} # from env file 15 | webscraper: 16 | build: 17 | context: . 
18 | dockerfile: ./docker/webscraper/Dockerfile 19 | command: scrapy crawl willhaben 20 | environment: 21 | MYSQL_DATABASE: ${MYSQL_DATABASE} # from env file 22 | MYSQL_TABLENAME: ${MYSQL_TABLENAME} # from env file 23 | MYSQL_USER: "${MYSQL_USER}" # from env file 24 | MYSQL_PASSWORD: ${MYSQL_PASSWORD} # from env file 25 | depends_on: 26 | - db 27 | volumes: 28 | - .:/src 29 | 30 | volumes: 31 | db_data: {} # make db data persistent 32 | -------------------------------------------------------------------------------- /docker/webscraper/Dockerfile: -------------------------------------------------------------------------------- 1 | # Start a new Python 3.8 parent image (see https://hub.docker.com/r/library/python) 2 | FROM python:3.8 3 | 4 | # Avoid .pyc files 5 | ENV PYTHONDONTWRITEBYTECODE 1 6 | # Ensure that Python output is sent straight to terminal and that we can see the output of our application 7 | ENV PYTHONUNBUFFERED 1 8 | 9 | #install system dependencies 10 | RUN apt-get update \ 11 | && apt-get install -y python3 python3-dev python3-pip libxml2-dev \ 12 | libxslt1-dev zlib1g-dev libffi-dev libssl-dev 13 | 14 | # install the required python packages 15 | COPY requirements/requirements.txt /tmp/requirements.txt 16 | 17 | RUN set -ex \ 18 | && pip install --upgrade pip \ 19 | && pip install pip-tools \ 20 | && pip install -r /tmp/requirements.txt \ 21 | && rm -rf /root/.cache/ 22 | 23 | # copy in the rest of your app’s source code from your host to your image filesystem. 24 | COPY . /src/ 25 | 26 | WORKDIR /src/ -------------------------------------------------------------------------------- /images_for_readme/overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/images_for_readme/overview.png -------------------------------------------------------------------------------- /images_for_readme/results_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/images_for_readme/results_1.png -------------------------------------------------------------------------------- /images_for_readme/results_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/images_for_readme/results_2.png -------------------------------------------------------------------------------- /requirements/requirements.in: -------------------------------------------------------------------------------- 1 | scrapy==2.4.0 2 | beautifulsoup4==4.9.3 3 | environs==9.0.0 4 | mysql-connector-python==8.0.22 -------------------------------------------------------------------------------- /requirements/requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile 3 | # To update, run: 4 | # 5 | # pip-compile requirements/requirements.in 6 | # 7 | attrs==20.2.0 # via automat, service-identity, twisted 8 | automat==20.2.0 # via twisted 9 | beautifulsoup4==4.9.3 # via -r requirements/requirements.in 10 | cffi==1.14.3 # via cryptography 11 | constantly==15.1.0 # via twisted 12 | cryptography==3.2.1 # via pyopenssl, scrapy, service-identity 13 | cssselect==1.1.0 # via parsel, scrapy 14 | environs==9.0.0 # via -r 
requirements/requirements.in 15 | hyperlink==20.0.1 # via twisted 16 | idna==2.10 # via hyperlink 17 | incremental==17.5.0 # via twisted 18 | itemadapter==0.1.1 # via itemloaders, scrapy 19 | itemloaders==1.0.3 # via scrapy 20 | jmespath==0.10.0 # via itemloaders 21 | lxml==4.6.1 # via parsel, scrapy 22 | marshmallow==3.9.0 # via environs 23 | mysql-connector-python==8.0.22 # via -r requirements/requirements.in 24 | parsel==1.6.0 # via itemloaders, scrapy 25 | protego==0.1.16 # via scrapy 26 | protobuf==3.13.0 # via mysql-connector-python 27 | pyasn1-modules==0.2.8 # via service-identity 28 | pyasn1==0.4.8 # via pyasn1-modules, service-identity 29 | pycparser==2.20 # via cffi 30 | pydispatcher==2.0.5 # via scrapy 31 | pyhamcrest==2.0.2 # via twisted 32 | pyopenssl==19.1.0 # via scrapy 33 | python-dotenv==0.15.0 # via environs 34 | queuelib==1.5.0 # via scrapy 35 | scrapy==2.4.0 # via -r requirements/requirements.in 36 | service-identity==18.1.0 # via scrapy 37 | six==1.15.0 # via automat, cryptography, parsel, protego, protobuf, pyopenssl, w3lib 38 | soupsieve==2.0.1 # via beautifulsoup4 39 | twisted==20.3.0 # via scrapy 40 | w3lib==1.22.0 # via itemloaders, parsel, scrapy 41 | zope.interface==5.1.2 # via scrapy, twisted 42 | 43 | # The following packages are considered to be unsafe in a requirements file: 44 | # setuptools 45 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = webscraper_for_sophie.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = webscraper_for_sophie 12 | -------------------------------------------------------------------------------- /webscraper_for_sophie/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/webscraper_for_sophie/__init__.py -------------------------------------------------------------------------------- /webscraper_for_sophie/database_manager.py: -------------------------------------------------------------------------------- 1 | # import default packages 2 | import logging 3 | import time 4 | # import installed packages 5 | import environs 6 | import mysql.connector 7 | from mysql.connector import errorcode 8 | # import project modules 9 | from webscraper_for_sophie.items import CondoItem 10 | 11 | 12 | env = environs.Env() 13 | USER = env("MYSQL_USER") 14 | PASSWORD = env("MYSQL_PASSWORD") 15 | DATABASE = env("MYSQL_DATABASE") 16 | TABLENAME = env("MYSQL_TABLENAME") 17 | HOST = 'db' # name of the docker container 18 | 19 | # Settings for connection error handling 20 | NUM_ATTEMPTS = 30 21 | DELAY_BTW_ATTEMPTS = 1 # in seconds 22 | RETRY_MSG = ("Waiting for the MySQL container to become ready " + 23 | "(attempt {} of {} failed)") 24 | 25 | 26 | class DatabaseManager(): 27 | """ 28 | Simplifies our database operations 29 | """ 30 | 31 | def connect(self): 32 | """ Connect to the database """ 33 | for attempt_no in range(1, NUM_ATTEMPTS+1): 34 | try: 35 | self.connection = mysql.connector.connect(host=HOST, 36 | database=DATABASE, 37 | user=USER, 38 | password=PASSWORD) 39 | self.cursor = self.connection.cursor() 40 | logging.debug("Database connection 
opened") 41 | return 42 | except mysql.connector.Error as err: 43 | logging.debug(RETRY_MSG.format(attempt_no, NUM_ATTEMPTS)) 44 | if attempt_no < NUM_ATTEMPTS: 45 | time.sleep(DELAY_BTW_ATTEMPTS) 46 | else: 47 | if err.errno == errorcode.ER_ACCESS_DENIED_ERROR: 48 | logging.error( 49 | "Something is wrong with your user name or password") 50 | elif err.errno == errorcode.ER_BAD_DB_ERROR: 51 | logging.error("Database does not exist") 52 | else: 53 | logging.error(err) 54 | 55 | def close(self): 56 | """ Close the database connection """ 57 | self.connection.close() 58 | logging.debug("Database connection closed") 59 | 60 | def is_connected(self): 61 | """ 62 | Returns: 63 | bool: True if connected. False otherwise 64 | """ 65 | self.connection.is_connected() 66 | 67 | def prep_table(self): 68 | """ create a new table if the provided table name does not exist. """ 69 | sql_command = "SHOW TABLES LIKE '{0}'".format(TABLENAME) 70 | self.cursor.execute(sql_command) 71 | result = self.cursor.fetchone() # fetch will return a python tuple 72 | if result is None: 73 | logging.debug("Database table does not exist") 74 | 75 | # create table 76 | sql_command = """ 77 | CREATE TABLE {0} ( 78 | id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, 79 | willhaben_code VARCHAR(10) COLLATE utf8_bin, 80 | postal_code VARCHAR(10) COLLATE utf8_bin, 81 | district VARCHAR(100) COLLATE utf8_bin, 82 | price INTEGER, 83 | commission_fee FLOAT, 84 | size INTEGER, 85 | room_count INTEGER, 86 | price_per_m2 FLOAT, 87 | discovery_date DATE, 88 | title TEXT COLLATE utf8_bin, 89 | url TEXT COLLATE utf8_bin, 90 | edit_date VARCHAR(100) COLLATE utf8_bin, 91 | address VARCHAR(100) COLLATE utf8_bin);""".format(TABLENAME) 92 | self.cursor.execute(sql_command) 93 | self.connection.commit() 94 | logging.debug("New database table has been created") 95 | 96 | def store_item(self, item): 97 | """ 98 | Store a new item in the database 99 | 100 | Args: 101 | item: the CondoItem that should be inserted in the database. 
102 | """ 103 | 104 | # fill table of database with data 105 | sql_command = """INSERT INTO {0} 106 | (id, willhaben_code, postal_code, district, price, 107 | commission_fee, size, room_count, price_per_m2, 108 | discovery_date, title, url, edit_date, address) 109 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, 110 | %s, %s, %s, %s); 111 | """.format(TABLENAME) 112 | 113 | insert_tuple = (None, item['willhaben_code'], item['postal_code'], 114 | item['district'], item['price'], item['commission_fee'], 115 | item['size'], item['room_count'], item['price_per_m2'], 116 | item['discovery_date'], item['title'], item['url'], 117 | item['edit_date'], item['address']) 118 | # use parameterized input to avoid SQL injection 119 | self.cursor.execute(sql_command, insert_tuple) 120 | # never forget this, if you want the changes to be saved: 121 | self.connection.commit() 122 | -------------------------------------------------------------------------------- /webscraper_for_sophie/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import re 7 | import logging 8 | import scrapy 9 | 10 | 11 | class CondoItem(scrapy.Item): 12 | # define the fields for your item here like: 13 | # name = scrapy.Field() 14 | url = scrapy.Field() 15 | title = scrapy.Field() 16 | price = scrapy.Field() 17 | size = scrapy.Field() 18 | room_count = scrapy.Field() 19 | postal_code = scrapy.Field() 20 | district = scrapy.Field() 21 | discovery_date = scrapy.Field() 22 | edit_date = scrapy.Field() 23 | description = scrapy.Field() 24 | address = scrapy.Field() 25 | willhaben_code = scrapy.Field() 26 | commission_fee = scrapy.Field() 27 | price_per_m2 = scrapy.Field() 28 | 29 | DEFAULT_VALUE_STRING = '' 30 | DEFAULT_VALUE_INT = 0 31 | DEFAULT_VAULE_BOOL = False 32 | 33 | MIN_PRICE = 1000 34 | MAX_PRICE = 1500000 35 | MIN_SIZE = 10 36 | MAX_SIZE = 250 37 | 38 | def set_default_values(self): 39 | # init fields if needed 40 | # no init value needed for self['url'] and self['discovery_date'] 41 | self['title'] = self.DEFAULT_VALUE_STRING 42 | self['price'] = self.DEFAULT_VALUE_INT 43 | self['size'] = self.DEFAULT_VALUE_INT 44 | self['room_count'] = self.DEFAULT_VALUE_INT 45 | self['postal_code'] = self.DEFAULT_VALUE_STRING 46 | self['district'] = self.DEFAULT_VALUE_STRING 47 | self['edit_date'] = self.DEFAULT_VALUE_STRING 48 | self['description'] = self.DEFAULT_VALUE_STRING 49 | self['address'] = self.DEFAULT_VALUE_STRING 50 | self['willhaben_code'] = self.DEFAULT_VALUE_STRING 51 | self['commission_fee'] = self.DEFAULT_VALUE_STRING 52 | self['price_per_m2'] = self.DEFAULT_VALUE_INT 53 | 54 | def calc_price_per_m2(self): 55 | """ Calculate the price per square meter. """ 56 | if self['size']: 57 | self['price_per_m2'] = self['price'] / self['size'] 58 | 59 | def parse_price(self, price_text): 60 | """ Parses the price from the input text. 
61 | 62 | Args: 63 | price_text (string): something like "€ 99.750" 64 | """ 65 | cleaned_price_text = price_text.replace('.', '') 66 | match = re.search(r'\d+', cleaned_price_text) # search for numbers 67 | if match: 68 | price_string = match[0] # get entire match 69 | try: 70 | price_int = int(price_string) # convert to int 71 | except ValueError: 72 | logging.error("Could not convert price to int at page " + 73 | self['url']) 74 | else: 75 | # realistic value check 76 | if price_int > self.MIN_PRICE and price_int < self.MAX_PRICE: 77 | self['price'] = price_int 78 | else: 79 | logging.error("Unrealistic price at page " + self['url']) 80 | 81 | def parse_size(self, size_text): 82 | """ Parses the size from the input text. 83 | 84 | Args: 85 | size_text (string): something like " 42m²" 86 | """ 87 | match = re.search(r'\d+', size_text) # search for numbers 88 | if match: 89 | size_string = match[0] # The entire match 90 | try: 91 | size_int = int(size_string) # convert to int 92 | except ValueError: 93 | logging.error("Could not convert size to int at page " + 94 | self['url']) 95 | else: 96 | # realistic value check 97 | if size_int > self.MIN_SIZE and size_int < self.MAX_SIZE: 98 | self['size'] = size_int 99 | else: 100 | logging.warning("Unrealistic size at page " + self['url']) 101 | else: 102 | logging.error("size parsing failed on page " + self['url']) 103 | 104 | def parse_size_2(self, size_text): 105 | """ Parses the size from the input text if it contains a keyword 106 | 107 | Keyword is `Nutzfläche` 108 | 109 | Args: 110 | size_text (string): something like "Nutzfläche: 73m2" 111 | """ 112 | keyword_match = re.search(r'Nutzfläche', size_text) 113 | if keyword_match: 114 | match = re.search(r'\d+', size_text) # search for numbers 115 | if match: 116 | size_string = match[0] # The entire match 117 | try: 118 | size_int = int(size_string) # convert to int 119 | except ValueError: 120 | logging.error("Could not convert size to int at page " + 121 | self['url']) 122 | else: 123 | # realistic value check 124 | if size_int > self.MIN_SIZE and size_int < self.MAX_SIZE: 125 | self['size'] = size_int 126 | else: 127 | logging.error( 128 | "Unrealistic size at page " + self['url']) 129 | else: 130 | logging.error("secondary size parsing failed on page " + 131 | self['url']) 132 | 133 | def parse_room_count(self, room_count_text): 134 | """ Parses the room_count from the input text. 
135 | 136 | Args: 137 | room_count_text (string): something like " 3 Zimmer" 138 | """ 139 | match = re.search(r'\d', room_count_text) # search for a single number 140 | if match: 141 | room_count_string = match[0] # The entire match 142 | try: 143 | self['room_count'] = int(room_count_string) 144 | except ValueError: 145 | logging.error("Could not convert room_count to int at page " + 146 | self['url']) 147 | else: 148 | logging.warning( 149 | "room_count parsing failed on page " + self['url']) 150 | 151 | def parse_room_count_2(self, room_count_text): 152 | """ Parses the room_count from the input text if it contains a keyword 153 | 154 | Keyword is `Zimmer` 155 | 156 | Args: 157 | room_count_text (string): something like "Zimmer: 3" 158 | """ 159 | keyword_match = re.search(r'Zimmer', room_count_text) 160 | if keyword_match: 161 | match = re.search(r'\d+', room_count_text) # search for numbers 162 | if match: 163 | room_count_string = match[0] # The entire match 164 | try: 165 | self['room_count'] = int( 166 | room_count_string) # convert to int 167 | except ValueError: 168 | logging.error("Could not convert room_count to int at page " 169 | + self['url']) 170 | else: 171 | logging.error("secondary room_count parsing failed on page " + 172 | self['url']) 173 | -------------------------------------------------------------------------------- /webscraper_for_sophie/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class WebscraperForSophieSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 
52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class WebscraperForSophieDownloaderMiddleware: 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /webscraper_for_sophie/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | # useful for handling different item types with a single interface 8 | from itemadapter import ItemAdapter 9 | 10 | from webscraper_for_sophie.database_manager import DatabaseManager 11 | 12 | 13 | class WebscraperForSophiePipeline: 14 | 15 | def open_spider(self, spider): 16 | """ This method is called when the spider is opened. """ 17 | self.db_manager = DatabaseManager() 18 | self.db_manager.connect() 19 | self.db_manager.prep_table() 20 | 21 | def close_spider(self, spider): 22 | """ This method is called when the spider is closed. """ 23 | self.db_manager.close() 24 | 25 | def process_item(self, item, spider): 26 | """ This method is called for every item pipeline component. """ 27 | self.db_manager.store_item(item) 28 | return item 29 | -------------------------------------------------------------------------------- /webscraper_for_sophie/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for webscraper_for_sophie project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. 
You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = 'webscraper_for_sophie' 11 | 12 | SPIDER_MODULES = ['webscraper_for_sophie.spiders'] 13 | NEWSPIDER_MODULE = 'webscraper_for_sophie.spiders' 14 | 15 | 16 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 17 | #USER_AGENT = 'webscraper_for_sophie (+http://www.yourdomain.com)' 18 | 19 | # Obey robots.txt rules 20 | ROBOTSTXT_OBEY = False 21 | 22 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 23 | #CONCURRENT_REQUESTS = 32 24 | 25 | # Configure a delay for requests for the same website (default: 0) 26 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 27 | # See also autothrottle settings and docs 28 | DOWNLOAD_DELAY = 2 29 | # The download delay setting will honor only one of: 30 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 31 | #CONCURRENT_REQUESTS_PER_IP = 16 32 | 33 | # Disable cookies (enabled by default) 34 | COOKIES_ENABLED = False 35 | 36 | # Disable Telnet Console (enabled by default) 37 | #TELNETCONSOLE_ENABLED = False 38 | 39 | # Override the default request headers: 40 | # DEFAULT_REQUEST_HEADERS = { 41 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 42 | # 'Accept-Language': 'en', 43 | # } 44 | 45 | # Enable or disable spider middlewares 46 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 47 | # SPIDER_MIDDLEWARES = { 48 | # 'webscraper_for_sophie.middlewares.WebscraperForSophieSpiderMiddleware': 543, 49 | # } 50 | 51 | # Enable or disable downloader middlewares 52 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 53 | # DOWNLOADER_MIDDLEWARES = { 54 | # 'webscraper_for_sophie.middlewares.WebscraperForSophieDownloaderMiddleware': 543, 55 | # } 56 | 57 | # Enable or disable extensions 58 | # See https://docs.scrapy.org/en/latest/topics/extensions.html 59 | # EXTENSIONS = { 60 | # 'scrapy.extensions.telnet.TelnetConsole': None, 61 | # } 62 | 63 | # Configure item pipelines 64 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 65 | ITEM_PIPELINES = { 66 | 'webscraper_for_sophie.pipelines.WebscraperForSophiePipeline': 300, 67 | } 68 | 69 | # Enable and configure the AutoThrottle extension (disabled by default) 70 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 71 | #AUTOTHROTTLE_ENABLED = True 72 | # The initial download delay 73 | #AUTOTHROTTLE_START_DELAY = 5 74 | # The maximum download delay to be set in case of high latencies 75 | #AUTOTHROTTLE_MAX_DELAY = 60 76 | # The average number of requests Scrapy should be sending in parallel to 77 | # each remote server 78 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 79 | # Enable showing throttling stats for every response received: 80 | #AUTOTHROTTLE_DEBUG = False 81 | 82 | # Enable and configure HTTP caching (disabled by default) 83 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 84 | #HTTPCACHE_ENABLED = True 85 | #HTTPCACHE_EXPIRATION_SECS = 0 86 | #HTTPCACHE_DIR = 'httpcache' 87 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 88 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 89 | -------------------------------------------------------------------------------- 
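The fixed `DOWNLOAD_DELAY = 2` in `settings.py` above keeps the crawl polite by waiting two seconds between requests to willhaben.at. If the site starts throttling anyway, the AutoThrottle extension (already hinted at in the commented-out block above) adapts the delay to the observed response times. A minimal sketch of that variant; the values are illustrative and not tuned for Willhaben:

```python
# settings.py (sketch): let AutoThrottle adapt the request rate instead of
# relying on the fixed DOWNLOAD_DELAY. These are standard Scrapy settings;
# the concrete values below are assumptions, not tested against Willhaben.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # back off up to 60 s on high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for ~1 request in flight per server
AUTOTHROTTLE_DEBUG = False              # set to True to log each throttling decision
```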
/webscraper_for_sophie/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /webscraper_for_sophie/spiders/willhaben_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | This module defines how the willhaben website will be crawled. 6 | """ 7 | 8 | # default python packages 9 | import datetime 10 | import re 11 | import logging 12 | # installed packages 13 | import scrapy 14 | from scrapy.spiders import CrawlSpider, Rule 15 | from scrapy.linkextractors import LinkExtractor 16 | from bs4 import BeautifulSoup 17 | # project modules 18 | from webscraper_for_sophie.items import CondoItem 19 | 20 | 21 | class WillhabenSpider(scrapy.Spider): 22 | """ 23 | Spiders are classes which define how a certain site (or a group of sites) 24 | will be scraped, including how to perform the crawl (i.e. follow links) and 25 | how to extract structured data from their pages (i.e. scraping items). 26 | Here is a summary of the most important spider attributes. Detailed 27 | documentation can be found in the official Scrapy documentation. 28 | """ 29 | # Graz 30 | START_URL = 'https://www.willhaben.at/iad/immobilien/eigentumswohnung/steiermark/graz/' 31 | ITEM_URL_REGEX = r"\"url\":\"(\/iad\/immobilien\/d\/eigentumswohnung\/steiermark\/graz\/[a-z,A-Z,0-9,-]+\/)\"" 32 | # # Graz-Umgebung 33 | # START_URL = 'https://www.willhaben.at/iad/immobilien/eigentumswohnung/steiermark/graz-umgebung/' 34 | # ITEM_URL_REGEX = r"\"url\":\"(\/iad\/immobilien\/d\/eigentumswohnung\/steiermark\/graz-umgebung\/[a-z,A-Z,0-9,-]+\/)\"" 35 | 36 | ITEM_IMG_REGEX = r'"referenceImageUrl":"(https:\/\/cache.willhaben.at[-a-zA-Z0-9@:%._\+~#=/]+)"' 37 | BASE_URL = "https://www.willhaben.at" 38 | name = 'willhaben' 39 | allowed_domains = ['willhaben.at'] 40 | start_urls = [ 41 | START_URL 42 | ] 43 | 44 | def parse(self, response): 45 | """ 46 | This is the default callback used by Scrapy to process downloaded 47 | responses, when their requests don't specify a callback (e.g. the 48 | requests generated from `start_urls`). 49 | """ 50 | 51 | # get item urls and yield a request for each item 52 | relative_item_urls = re.findall(self.ITEM_URL_REGEX, response.text) 53 | item_count = len(relative_item_urls) 54 | if item_count == 25: 55 | logging.info("Found {} items on page {}".format( 56 | item_count, response.url)) 57 | elif item_count >= 20: 58 | logging.warning("Found only {} items on page {}".format( 59 | item_count, response.url)) 60 | else: 61 | logging.error("Found only {} items on page {}".format( 62 | item_count, response.url)) 63 | 64 | for relative_item_url in relative_item_urls: 65 | full_item_url = self.BASE_URL + relative_item_url 66 | yield scrapy.Request(full_item_url, self.parse_item) 67 | 68 | # get the next page of the list 69 | soup = BeautifulSoup(response.text, 'lxml') 70 | pagination_btn = soup.find( 71 | 'a', attrs={"data-testid": "pagination-top-next-button"}) 72 | if pagination_btn: # the last result page has no next button 73 | next_page_url = self.BASE_URL + pagination_btn['href'] 74 | yield scrapy.Request(next_page_url, self.parse) 75 | 76 | def parse_item(self, response): 77 | """returns/yields a :py:class:`CondoItem`. 
78 | 79 | This is the callback used by Scrapy to parse downloaded item pages. 80 | """ 81 | item = CondoItem() 82 | item.set_default_values() 83 | item['url'] = response.url 84 | item['discovery_date'] = datetime.datetime.now().strftime("%Y-%m-%d") 85 | # time could also be added if needed: "%Y-%m-%d %H:%M:%S" 86 | 87 | soup = BeautifulSoup(response.text, 'lxml') 88 | # empty all script tags so they don't pollute the extracted text 89 | for s in soup('script'): 90 | s.clear() 91 | 92 | # title 93 | title_tag = soup.find('h1') 94 | if title_tag: 95 | item['title'] = title_tag.get_text() 96 | else: 97 | logging.error("title element not found on page " + item['url']) 98 | 99 | # price 100 | price_tag = soup.find( 101 | 'span', attrs={"data-testid": "contact-box-price-box-price-value"}) 102 | if price_tag: 103 | visible_price_text = price_tag.get_text() 104 | item.parse_price(visible_price_text) 105 | else: 106 | logging.error("price element not found on page " + item['url']) 107 | 108 | # size 109 | size_tag = soup.find( 110 | 'div', attrs={"data-testid": "ad-detail-teaser-attribute-0"}) 111 | if size_tag: 112 | visible_size_text = size_tag.get_text() 113 | item.parse_size(visible_size_text) 114 | else: 115 | logging.error("size element not found on page " + item['url']) 116 | 117 | # room_count 118 | room_count_tag = soup.find( 119 | 'div', attrs={"data-testid": "ad-detail-teaser-attribute-1"}) 120 | if room_count_tag: 121 | room_count_text = room_count_tag.get_text() 122 | item.parse_room_count(room_count_text) 123 | else: 124 | logging.error( 125 | "room_count element not found on page " + item['url']) 126 | 127 | # alternative size and room count parsing (from attributes) 128 | attribute_tags = soup.findAll( 129 | 'li', attrs={"data-testid": "attribute-item"}) 130 | if attribute_tags: 131 | for attribute_tag in attribute_tags: 132 | attribute_text = attribute_tag.get_text() 133 | # parse size again if zero 134 | if item['size'] == 0: 135 | item.parse_size_2(attribute_text) 136 | # parse room_count again if zero 137 | if item['room_count'] == 0: 138 | item.parse_room_count_2(attribute_text) 139 | else: 140 | logging.error( 141 | "attribute elements not found on page " + item['url']) 142 | 143 | # address, postal_code and district 144 | location_address_tag = soup.find( 145 | 'div', attrs={"data-testid": "object-location-address"}) 146 | if location_address_tag: 147 | location_address_text = location_address_tag.get_text() 148 | # parse address 149 | item['address'] = location_address_text 150 | # parse postal_code 151 | match = re.search(r'8\d\d\d', location_address_text) 152 | if match: 153 | item['postal_code'] = match[0] # The entire match 154 | else: 155 | logging.error( 156 | "postal_code parsing failed on page " + item['url']) 157 | # parse district 158 | match = re.search(r'8\d\d\d ([^,]+)', location_address_text) 159 | if match: 160 | item['district'] = match[1] # The first group 161 | else: 162 | logging.error( 163 | "district parsing failed on page " + item['url']) 164 | else: 165 | logging.error("element for address, postal_code and district " + 166 | "not found on page " + item['url']) 167 | 168 | # willhaben_code 169 | willhaben_code_tag = soup.find( 170 | 'span', attrs={"data-testid": "ad-detail-ad-id"}) 171 | if willhaben_code_tag: 172 | willhaben_code_text = willhaben_code_tag.get_text() 173 | match = re.search(r'\d+', willhaben_code_text) 174 | if match: 175 | item['willhaben_code'] = match[0] # The entire match 176 | else: 177 | logging.error( 178 | "willhaben_code parsing failed on page " + 
item['url']) 179 | else: 180 | logging.error( 181 | "willhaben_code element not found on page " + item['url']) 182 | 183 | # edit_date 184 | edit_date_tag = soup.find( 185 | 'span', attrs={"data-testid": "ad-detail-ad-edit-date"}) 186 | if edit_date_tag: 187 | item['edit_date'] = edit_date_tag.get_text() 188 | else: 189 | logging.error("edit_date element not found on page " + item['url']) 190 | 191 | # commission_fee 192 | body_tag = soup.find('article') 193 | if body_tag: 194 | body_text = body_tag.get_text() 195 | if 'provisionsfrei' in body_text.lower(): 196 | item['commission_fee'] = 0 197 | else: 198 | item['commission_fee'] = 3.6 199 | else: 200 | logging.error( 201 | "commission_fee element not found on page " + item['url']) 202 | 203 | # price_per_m2 204 | item.calc_price_per_m2() 205 | 206 | # further item processing is done in the item pipeline 207 | yield item 208 | --------------------------------------------------------------------------------
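The parsing helpers in `items.py` can be exercised without running a full crawl, which is handy when Willhaben changes its HTML and the extraction logic needs to be re-checked. A minimal sketch (not part of the repository) that could be run from a Python shell inside the webscraper container, e.g. via `docker-compose run --rm webscraper python`; the sample strings are invented and only mimic what the spider would extract from an ad page:

```python
# Quick manual check of the CondoItem parsing helpers (sketch, assumed usage).
from webscraper_for_sophie.items import CondoItem

item = CondoItem()
item.set_default_values()
item['url'] = 'https://www.willhaben.at/iad/example'  # only used in log messages

item.parse_price('€ 199.750')       # -> item['price'] == 199750
item.parse_size(' 62m²')            # -> item['size'] == 62
item.parse_room_count(' 3 Zimmer')  # -> item['room_count'] == 3
item.calc_price_per_m2()            # -> item['price_per_m2'] ≈ 3221.8

print(dict(item))                   # all extracted fields as a plain dict
```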