├── .env ├── LICENSE ├── README.md ├── docker-compose.yml ├── docker └── webscraper │ └── Dockerfile ├── images_for_readme ├── overview.png ├── overview.svg ├── results_1.png └── results_2.png ├── requirements ├── requirements.in └── requirements.txt ├── scrapy.cfg └── webscraper_for_sophie ├── __init__.py ├── database_manager.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders ├── __init__.py └── willhaben_spider.py /.env: -------------------------------------------------------------------------------- 1 | # Name of the database 2 | MYSQL_DATABASE=webscraper_database 3 | 4 | # Name of the table 5 | MYSQL_TABLENAME=condos 6 | 7 | # Password for the root user 8 | MYSQL_ROOT_PASSWORD=kfj3gjlf&4MGdf596 9 | 10 | # Credentials for the webscraper user 11 | MYSQL_USER=webscraper 12 | MYSQL_PASSWORD=ackj8636&fkf 13 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Michael Haar 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | > Note: This code is quite old and I don't recommend using it. Please check out the [Immobilien Suchmaschine](https://apify.com/michaelhaar/immobilien-suchmaschine) 2 | 3 | # Webscraper for Sophie :ant: 4 | 5 | This is a small webscraper application, which I wrote to do market research 6 | on condo prices in my hometown :house: (Graz, Styria, Austria). 7 | 8 | The webscraper crawls the [Willhaben](https://www.willhaben.at) website, extracts the 9 | relevant data and stores it in a local database as shown in the following figure: 10 | 11 | ![website data is extracted and stored in DB](images_for_readme/overview.png) 12 | 13 | The data can be used in various ways. I did a small market analysis in 14 | Tableau, which is not included in this project, but I'd like to provide a 15 | teaser of the results: :chart_with_upwards_trend: 16 | 17 | ![results](images_for_readme/results_1.png) 18 | ![results](images_for_readme/results_2.png) 19 | 20 | ## Prerequisites 21 | 22 | Make sure **Docker-Compose** is installed on your machine. See 23 | [Install Docker Compose](https://docs.docker.com/compose/install/) 24 | 25 | Note for Ubuntu users: 26 | 27 | 1. 
Install Docker as described [here](https://docs.docker.com/engine/install/ubuntu/) 28 | 2. Do the [post install steps](https://docs.docker.com/engine/install/linux-postinstall/) 29 | 3. Install Docker-Compose as described [here](https://docs.docker.com/compose/install/) 30 | 31 | ## :rocket: Usage 32 | 33 | The `docker-compose.yml` file will start two Docker containers 34 | (one for the webscraper and one for the MySQL database) and can be used like 35 | this: 36 | 37 | 1. Open a new terminal window and type the following commands: 38 | 39 | ```bash 40 | git clone https://github.com/michaelhaar/webscraper_sophie.git 41 | 42 | # cd into the repository 43 | cd webscraper_sophie 44 | ``` 45 | 46 | 2. **Build the containers**. The following command will download 47 | the base images (Python + MySQL) from the internet and install the required 48 | Python packages inside the webscraper container. 49 | 50 | ```bash 51 | docker-compose build # may take a few minutes 52 | ``` 53 | 54 | 3. We can **start the containers** and immediately run the webscraper 55 | application with 56 | 57 | ```bash 58 | docker-compose up # starts the docker containers 59 | ``` 60 | 61 | 4. Verify that the webscraper is working properly by viewing the output of the 62 | webscraper in the terminal window. It should print out the extracted items. 63 | 64 | Note: It's very likely that the webscraper application isn't working anymore 65 | because Willhaben changes the HTML of their website from time to time. 66 | Please feel free to contact me or create a new issue if you need something. 67 | 68 | 5. Access the results, which were extracted and then inserted into the database. 69 | You can use your favorite MySQL client for this step. (I'm using 70 | [DataGrip](https://www.jetbrains.com/de-de/datagrip/)). 71 | 72 | - host=`localhost` 73 | - port=`3306` 74 | - The login credentials can be found in the `.env` file. 75 | 76 | ## Hints 77 | 78 | The following commands might be useful: 79 | 80 | ```bash 81 | # Build the docker containers 82 | docker-compose build 83 | 84 | # Start the docker containers 85 | docker-compose up 86 | 87 | # Bash into the webscraper container 88 | docker-compose run --rm webscraper bash 89 | 90 | # Manually start the webscraper application, 91 | # when bashed into the webscraper container 92 | scrapy crawl willhaben 93 | 94 | # Re-build PIP requirements 95 | docker-compose run --rm webscraper pip-compile requirements/requirements.in 96 | 97 | ``` 98 | 99 | ## Author 100 | 101 | 👤 **Michael Haar** 102 | 103 | - LinkedIn: [@michaelhaar](https://www.linkedin.com/in/michaelhaar/) 104 | - GitHub: [@michaelhaar](https://github.com/michaelhaar) 105 | - Email: michael.haar@gmx.at 106 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: "3.4" 2 | 3 | services: 4 | db: 5 | image: mysql:5.7 6 | ports: 7 | - "3306:3306" 8 | volumes: 9 | - db_data:/var/lib/mysql 10 | environment: 11 | MYSQL_DATABASE: ${MYSQL_DATABASE} # from env file 12 | MYSQL_USER: "${MYSQL_USER}" # from env file 13 | MYSQL_PASSWORD: ${MYSQL_PASSWORD} # from env file 14 | MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD} # from env file 15 | webscraper: 16 | build: 17 | context: . 
18 | dockerfile: ./docker/webscraper/Dockerfile 19 | command: scrapy crawl willhaben 20 | environment: 21 | MYSQL_DATABASE: ${MYSQL_DATABASE} # from env file 22 | MYSQL_TABLENAME: ${MYSQL_TABLENAME} # from env file 23 | MYSQL_USER: "${MYSQL_USER}" # from env file 24 | MYSQL_PASSWORD: ${MYSQL_PASSWORD} # from env file 25 | depends_on: 26 | - db 27 | volumes: 28 | - .:/src 29 | 30 | volumes: 31 | db_data: {} # make db data persistent 32 | -------------------------------------------------------------------------------- /docker/webscraper/Dockerfile: -------------------------------------------------------------------------------- 1 | # Start a new Python 3.8 parent image (see https://hub.docker.com/r/library/python) 2 | FROM python:3.8 3 | 4 | # Avoid .pyc files 5 | ENV PYTHONDONTWRITEBYTECODE 1 6 | # Ensure that Python output is sent straight to terminal and that we can see the output of our application 7 | ENV PYTHONUNBUFFERED 1 8 | 9 | #install system dependencies 10 | RUN apt-get update \ 11 | && apt-get install -y python3 python3-dev python3-pip libxml2-dev \ 12 | libxslt1-dev zlib1g-dev libffi-dev libssl-dev 13 | 14 | # install the required python packages 15 | COPY requirements/requirements.txt /tmp/requirements.txt 16 | 17 | RUN set -ex \ 18 | && pip install --upgrade pip \ 19 | && pip install pip-tools \ 20 | && pip install -r /tmp/requirements.txt \ 21 | && rm -rf /root/.cache/ 22 | 23 | # copy in the rest of your app’s source code from your host to your image filesystem. 24 | COPY . /src/ 25 | 26 | WORKDIR /src/ -------------------------------------------------------------------------------- /images_for_readme/overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/images_for_readme/overview.png -------------------------------------------------------------------------------- /images_for_readme/results_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/images_for_readme/results_1.png -------------------------------------------------------------------------------- /images_for_readme/results_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/images_for_readme/results_2.png -------------------------------------------------------------------------------- /requirements/requirements.in: -------------------------------------------------------------------------------- 1 | scrapy==2.4.0 2 | beautifulsoup4==4.9.3 3 | environs==9.0.0 4 | mysql-connector-python==8.0.22 -------------------------------------------------------------------------------- /requirements/requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile 3 | # To update, run: 4 | # 5 | # pip-compile requirements/requirements.in 6 | # 7 | attrs==20.2.0 # via automat, service-identity, twisted 8 | automat==20.2.0 # via twisted 9 | beautifulsoup4==4.9.3 # via -r requirements/requirements.in 10 | cffi==1.14.3 # via cryptography 11 | constantly==15.1.0 # via twisted 12 | cryptography==3.2.1 # via pyopenssl, scrapy, service-identity 13 | cssselect==1.1.0 # via parsel, scrapy 14 | environs==9.0.0 # via -r 
requirements/requirements.in 15 | hyperlink==20.0.1 # via twisted 16 | idna==2.10 # via hyperlink 17 | incremental==17.5.0 # via twisted 18 | itemadapter==0.1.1 # via itemloaders, scrapy 19 | itemloaders==1.0.3 # via scrapy 20 | jmespath==0.10.0 # via itemloaders 21 | lxml==4.6.1 # via parsel, scrapy 22 | marshmallow==3.9.0 # via environs 23 | mysql-connector-python==8.0.22 # via -r requirements/requirements.in 24 | parsel==1.6.0 # via itemloaders, scrapy 25 | protego==0.1.16 # via scrapy 26 | protobuf==3.13.0 # via mysql-connector-python 27 | pyasn1-modules==0.2.8 # via service-identity 28 | pyasn1==0.4.8 # via pyasn1-modules, service-identity 29 | pycparser==2.20 # via cffi 30 | pydispatcher==2.0.5 # via scrapy 31 | pyhamcrest==2.0.2 # via twisted 32 | pyopenssl==19.1.0 # via scrapy 33 | python-dotenv==0.15.0 # via environs 34 | queuelib==1.5.0 # via scrapy 35 | scrapy==2.4.0 # via -r requirements/requirements.in 36 | service-identity==18.1.0 # via scrapy 37 | six==1.15.0 # via automat, cryptography, parsel, protego, protobuf, pyopenssl, w3lib 38 | soupsieve==2.0.1 # via beautifulsoup4 39 | twisted==20.3.0 # via scrapy 40 | w3lib==1.22.0 # via itemloaders, parsel, scrapy 41 | zope.interface==5.1.2 # via scrapy, twisted 42 | 43 | # The following packages are considered to be unsafe in a requirements file: 44 | # setuptools 45 | -------------------------------------------------------------------------------- /scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = webscraper_for_sophie.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = webscraper_for_sophie 12 | -------------------------------------------------------------------------------- /webscraper_for_sophie/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/michaelhaar/webscraper_sophie/c66a1c08376dc32b8cc752e792fa60025636e7bc/webscraper_for_sophie/__init__.py -------------------------------------------------------------------------------- /webscraper_for_sophie/database_manager.py: -------------------------------------------------------------------------------- 1 | # import default packages 2 | import logging 3 | import time 4 | # import installed packages 5 | import environs 6 | import mysql.connector 7 | from mysql.connector import errorcode 8 | # import project modules 9 | from webscraper_for_sophie.items import CondoItem 10 | 11 | 12 | env = environs.Env() 13 | USER = env("MYSQL_USER") 14 | PASSWORD = env("MYSQL_PASSWORD") 15 | DATABASE = env("MYSQL_DATABASE") 16 | TABLENAME = env("MYSQL_TABLENAME") 17 | HOST = 'db' # name of the docker container 18 | 19 | # Settings for connection error handling 20 | NUM_ATTEMPTS = 30 21 | DELAY_BTW_ATTEMPTS = 1 # in seconds 22 | RETRY_MSG = ("Waiting for the MySQL container to become ready " + 23 | "(attempt {} of {} failed)") 24 | 25 | 26 | class DatabaseManager(): 27 | """ 28 | Simplifies our database operations 29 | """ 30 | 31 | def connect(self): 32 | """ Connect to the database """ 33 | for attempt_no in range(1, NUM_ATTEMPTS+1): 34 | try: 35 | self.connection = mysql.connector.connect(host=HOST, 36 | database=DATABASE, 37 | user=USER, 38 | password=PASSWORD) 39 | self.cursor = self.connection.cursor() 40 | logging.debug("Database connection 
opened") 41 | return 42 | except mysql.connector.Error as err: 43 | logging.debug(RETRY_MSG.format(attempt_no, NUM_ATTEMPTS)) 44 | if attempt_no < NUM_ATTEMPTS: 45 | time.sleep(DELAY_BTW_ATTEMPTS) 46 | else: 47 | if err.errno == errorcode.ER_ACCESS_DENIED_ERROR: 48 | logging.error( 49 | "Something is wrong with your user name or password") 50 | elif err.errno == errorcode.ER_BAD_DB_ERROR: 51 | logging.error("Database does not exist") 52 | else: 53 | logging.error(err) 54 | 55 | def close(self): 56 | """ Close the database connection """ 57 | self.connection.close() 58 | logging.debug("Database connection closed") 59 | 60 | def is_connected(self): 61 | """ 62 | Returns: 63 | bool: True if connected. False otherwise 64 | """ 65 | self.connection.is_connected() 66 | 67 | def prep_table(self): 68 | """ create a new table if the provided table name does not exist. """ 69 | sql_command = "SHOW TABLES LIKE '{0}'".format(TABLENAME) 70 | self.cursor.execute(sql_command) 71 | result = self.cursor.fetchone() # fetch will return a python tuple 72 | if result is None: 73 | logging.debug("Database table does not exist") 74 | 75 | # create table 76 | sql_command = """ 77 | CREATE TABLE {0} ( 78 | id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, 79 | willhaben_code VARCHAR(10) COLLATE utf8_bin, 80 | postal_code VARCHAR(10) COLLATE utf8_bin, 81 | district VARCHAR(100) COLLATE utf8_bin, 82 | price INTEGER, 83 | commission_fee FLOAT, 84 | size INTEGER, 85 | room_count INTEGER, 86 | price_per_m2 FLOAT, 87 | discovery_date DATE, 88 | title TEXT COLLATE utf8_bin, 89 | url TEXT COLLATE utf8_bin, 90 | edit_date VARCHAR(100) COLLATE utf8_bin, 91 | address VARCHAR(100) COLLATE utf8_bin);""".format(TABLENAME) 92 | self.cursor.execute(sql_command) 93 | self.connection.commit() 94 | logging.debug("New database table has been created") 95 | 96 | def store_item(self, item): 97 | """ 98 | Store a new item in the database 99 | 100 | Args: 101 | item: the CondoItem that should be inserted in the database. 
102 | """ 103 | 104 | # fill table of database with data 105 | sql_command = """INSERT INTO {0} 106 | (id, willhaben_code, postal_code, district, price, 107 | commission_fee, size, room_count, price_per_m2, 108 | discovery_date, title, url, edit_date, address) 109 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, 110 | %s, %s, %s, %s); 111 | """.format(TABLENAME) 112 | 113 | insert_tuple = (None, item['willhaben_code'], item['postal_code'], 114 | item['district'], item['price'], item['commission_fee'], 115 | item['size'], item['room_count'], item['price_per_m2'], 116 | item['discovery_date'], item['title'], item['url'], 117 | item['edit_date'], item['address']) 118 | # use parameterized input to avoid SQL injection 119 | self.cursor.execute(sql_command, insert_tuple) 120 | # never forget this, if you want the changes to be saved: 121 | self.connection.commit() 122 | -------------------------------------------------------------------------------- /webscraper_for_sophie/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import re 7 | import logging 8 | import scrapy 9 | 10 | 11 | class CondoItem(scrapy.Item): 12 | # define the fields for your item here like: 13 | # name = scrapy.Field() 14 | url = scrapy.Field() 15 | title = scrapy.Field() 16 | price = scrapy.Field() 17 | size = scrapy.Field() 18 | room_count = scrapy.Field() 19 | postal_code = scrapy.Field() 20 | district = scrapy.Field() 21 | discovery_date = scrapy.Field() 22 | edit_date = scrapy.Field() 23 | description = scrapy.Field() 24 | address = scrapy.Field() 25 | willhaben_code = scrapy.Field() 26 | commission_fee = scrapy.Field() 27 | price_per_m2 = scrapy.Field() 28 | 29 | DEFAULT_VALUE_STRING = '' 30 | DEFAULT_VALUE_INT = 0 31 | DEFAULT_VAULE_BOOL = False 32 | 33 | MIN_PRICE = 1000 34 | MAX_PRICE = 1500000 35 | MIN_SIZE = 10 36 | MAX_SIZE = 250 37 | 38 | def set_default_values(self): 39 | # init fields if needed 40 | # no init value needed for self['url'] and self['discovery_date'] 41 | self['title'] = self.DEFAULT_VALUE_STRING 42 | self['price'] = self.DEFAULT_VALUE_INT 43 | self['size'] = self.DEFAULT_VALUE_INT 44 | self['room_count'] = self.DEFAULT_VALUE_INT 45 | self['postal_code'] = self.DEFAULT_VALUE_STRING 46 | self['district'] = self.DEFAULT_VALUE_STRING 47 | self['edit_date'] = self.DEFAULT_VALUE_STRING 48 | self['description'] = self.DEFAULT_VALUE_STRING 49 | self['address'] = self.DEFAULT_VALUE_STRING 50 | self['willhaben_code'] = self.DEFAULT_VALUE_STRING 51 | self['commission_fee'] = self.DEFAULT_VALUE_STRING 52 | self['price_per_m2'] = self.DEFAULT_VALUE_INT 53 | 54 | def calc_price_per_m2(self): 55 | """ Calculate the price per square meter. """ 56 | if self['size']: 57 | self['price_per_m2'] = self['price'] / self['size'] 58 | 59 | def parse_price(self, price_text): 60 | """ Parses the price from the input text. 
61 | 62 | Args: 63 | price_text (string): something like "€ 99.750" 64 | """ 65 | cleaned_price_text = price_text.replace('.', '') 66 | match = re.search(r'\d+', cleaned_price_text) # search for numbers 67 | if match: 68 | price_string = match[0] # get entire match 69 | try: 70 | price_int = int(price_string) # convert to int 71 | except ValueError: 72 | logging.error("Could not convert price to int at page " + 73 | self['url']) 74 | else: 75 | # realistic value check 76 | if price_int > self.MIN_PRICE and price_int < self.MAX_PRICE: 77 | self['price'] = price_int 78 | else: 79 | logging.error("Unrealistic price at page " + self['url']) 80 | 81 | def parse_size(self, size_text): 82 | """ Parses the size from the input text. 83 | 84 | Args: 85 | size_text (string): something like " 42m²" 86 | """ 87 | match = re.search(r'\d+', size_text) # search for numbers 88 | if match: 89 | size_string = match[0] # The entire match 90 | try: 91 | size_int = int(size_string) # convert to int 92 | except ValueError: 93 | logging.error("Could not convert size to int at page " + 94 | self['url']) 95 | else: 96 | # realistic value check 97 | if size_int > self.MIN_SIZE and size_int < self.MAX_SIZE: 98 | self['size'] = size_int 99 | else: 100 | logging.warning("Unrealistic size at page " + self['url']) 101 | else: 102 | logging.error("size parsing failed on page " + self['url']) 103 | 104 | def parse_size_2(self, size_text): 105 | """ Parses the size from the input text if it contains a keyword 106 | 107 | Keyword is `Nutzfläche` 108 | 109 | Args: 110 | size_text (string): something like "Nutzfläche: 73m2" 111 | """ 112 | keyword_match = re.search(r'Nutzfläche', size_text) 113 | if keyword_match: 114 | match = re.search(r'\d+', size_text) # search for numbers 115 | if match: 116 | size_string = match[0] # The entire match 117 | try: 118 | size_int = int(size_string) # convert to int 119 | except ValueError: 120 | logging.error("Could not convert size to int at page " + 121 | self['url']) 122 | else: 123 | # realistic value check 124 | if size_int > self.MIN_SIZE and size_int < self.MAX_SIZE: 125 | self['size'] = size_int 126 | else: 127 | logging.error( 128 | "Unrealistic size at page " + self['url']) 129 | else: 130 | logging.error("secondary size parsing failed on page " + 131 | self['url']) 132 | 133 | def parse_room_count(self, room_count_text): 134 | """ Parses the room_count from the input text. 
135 | 136 | Args: 137 | room_count_text (string): something like " 3 Zimmer" 138 | """ 139 | match = re.search(r'\d', room_count_text) # search for a single number 140 | if match: 141 | room_count_string = match[0] # The entire match 142 | try: 143 | self['room_count'] = int(room_count_string) 144 | except ValueError: 145 | logging.error("Could not convert room_count to int at page " + 146 | self['url']) 147 | else: 148 | logging.warning( 149 | "room_count parsing failed on page " + self['url']) 150 | 151 | def parse_room_count_2(self, room_count_text): 152 | """ Parses the room_count from the input text if it contains a keyword 153 | 154 | Keyword is `Zimmer` 155 | 156 | Args: 157 | room_count_text (string): something like "Zimmer: 3" 158 | """ 159 | keyword_match = re.search(r'Zimmer', room_count_text) 160 | if keyword_match: 161 | match = re.search(r'\d+', room_count_text) # search for numbers 162 | if match: 163 | room_count_string = match[0] # The entire match 164 | try: 165 | self['room_count'] = int( 166 | room_count_string) # convert to int 167 | except ValueError: 168 | logging.error("Could not convert room_count to int at page " 169 | + self['url']) 170 | else: 171 | logging.error("secondary room_count parsing failed on page " + 172 | self['url']) 173 | -------------------------------------------------------------------------------- /webscraper_for_sophie/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class WebscraperForSophieSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 
52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info('Spider opened: %s' % spider.name) 57 | 58 | 59 | class WebscraperForSophieDownloaderMiddleware: 60 | # Not all methods need to be defined. If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info('Spider opened: %s' % spider.name) 104 | -------------------------------------------------------------------------------- /webscraper_for_sophie/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | # useful for handling different item types with a single interface 8 | from itemadapter import ItemAdapter 9 | 10 | from webscraper_for_sophie.database_manager import DatabaseManager 11 | 12 | 13 | class WebscraperForSophiePipeline: 14 | 15 | def open_spider(self, spider): 16 | """ This method is called when the spider is opened. """ 17 | self.db_manager = DatabaseManager() 18 | self.db_manager.connect() 19 | self.db_manager.prep_table() 20 | 21 | def close_spider(self, spider): 22 | """ This method is called when the spider is closed. """ 23 | self.db_manager.close() 24 | 25 | def process_item(self, item, spider): 26 | """ This method is called for every item pipeline component. """ 27 | self.db_manager.store_item(item) 28 | return item 29 | -------------------------------------------------------------------------------- /webscraper_for_sophie/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for webscraper_for_sophie project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. 
You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = 'webscraper_for_sophie' 11 | 12 | SPIDER_MODULES = ['webscraper_for_sophie.spiders'] 13 | NEWSPIDER_MODULE = 'webscraper_for_sophie.spiders' 14 | 15 | 16 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 17 | #USER_AGENT = 'webscraper_for_sophie (+http://www.yourdomain.com)' 18 | 19 | # Obey robots.txt rules 20 | ROBOTSTXT_OBEY = False 21 | 22 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 23 | #CONCURRENT_REQUESTS = 32 24 | 25 | # Configure a delay for requests for the same website (default: 0) 26 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 27 | # See also autothrottle settings and docs 28 | DOWNLOAD_DELAY = 2 29 | # The download delay setting will honor only one of: 30 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 31 | #CONCURRENT_REQUESTS_PER_IP = 16 32 | 33 | # Disable cookies (enabled by default) 34 | COOKIES_ENABLED = False 35 | 36 | # Disable Telnet Console (enabled by default) 37 | #TELNETCONSOLE_ENABLED = False 38 | 39 | # Override the default request headers: 40 | # DEFAULT_REQUEST_HEADERS = { 41 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 42 | # 'Accept-Language': 'en', 43 | # } 44 | 45 | # Enable or disable spider middlewares 46 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 47 | # SPIDER_MIDDLEWARES = { 48 | # 'webscraper_for_sophie.middlewares.WebscraperForSophieSpiderMiddleware': 543, 49 | # } 50 | 51 | # Enable or disable downloader middlewares 52 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 53 | # DOWNLOADER_MIDDLEWARES = { 54 | # 'webscraper_for_sophie.middlewares.WebscraperForSophieDownloaderMiddleware': 543, 55 | # } 56 | 57 | # Enable or disable extensions 58 | # See https://docs.scrapy.org/en/latest/topics/extensions.html 59 | # EXTENSIONS = { 60 | # 'scrapy.extensions.telnet.TelnetConsole': None, 61 | # } 62 | 63 | # Configure item pipelines 64 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 65 | ITEM_PIPELINES = { 66 | 'webscraper_for_sophie.pipelines.WebscraperForSophiePipeline': 300, 67 | } 68 | 69 | # Enable and configure the AutoThrottle extension (disabled by default) 70 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 71 | #AUTOTHROTTLE_ENABLED = True 72 | # The initial download delay 73 | #AUTOTHROTTLE_START_DELAY = 5 74 | # The maximum download delay to be set in case of high latencies 75 | #AUTOTHROTTLE_MAX_DELAY = 60 76 | # The average number of requests Scrapy should be sending in parallel to 77 | # each remote server 78 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 79 | # Enable showing throttling stats for every response received: 80 | #AUTOTHROTTLE_DEBUG = False 81 | 82 | # Enable and configure HTTP caching (disabled by default) 83 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 84 | #HTTPCACHE_ENABLED = True 85 | #HTTPCACHE_EXPIRATION_SECS = 0 86 | #HTTPCACHE_DIR = 'httpcache' 87 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 88 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 89 | -------------------------------------------------------------------------------- 
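The fixed `DOWNLOAD_DELAY = 2` in `settings.py` above keeps the crawl polite by waiting two seconds between requests to willhaben.at. If the site starts throttling anyway, the AutoThrottle extension (already hinted at in the commented-out block above) adapts the delay to the observed response times. A minimal sketch of that variant; the values are illustrative and not tuned for Willhaben:

```python
# settings.py (sketch): let AutoThrottle adapt the request rate instead of
# relying on the fixed DOWNLOAD_DELAY. These are standard Scrapy settings;
# the concrete values below are assumptions, not tested against Willhaben.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # back off up to 60 s on high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for ~1 request in flight per server
AUTOTHROTTLE_DEBUG = False              # set to True to log each throttling decision
```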
/webscraper_for_sophie/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /webscraper_for_sophie/spiders/willhaben_spider.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | This module defines how the willhaben website will be crawled. 6 | """ 7 | 8 | # default python packages 9 | import datetime 10 | import re 11 | import logging 12 | # installed packages 13 | import scrapy 14 | from scrapy.spiders import CrawlSpider, Rule 15 | from scrapy.linkextractors import LinkExtractor 16 | from bs4 import BeautifulSoup 17 | # project modules 18 | from webscraper_for_sophie.items import CondoItem 19 | 20 | 21 | class WillhabenSpider(scrapy.Spider): 22 | """ 23 | Spiders are classes which define how a certain site (or a group of sites) 24 | will be scraped, including how to perform the crawl (i.e. follow links) and 25 | how to extract structured data from their pages (i.e. scraping items). 26 | Here is a summary of the most important spider attributes. Detailed 27 | documentation can be found in the official Scrapy documentation. 28 | """ 29 | # Graz 30 | START_URL = 'https://www.willhaben.at/iad/immobilien/eigentumswohnung/steiermark/graz/' 31 | ITEM_URL_REGEX = r"\"url\":\"(\/iad\/immobilien\/d\/eigentumswohnung\/steiermark\/graz\/[a-z,A-Z,0-9,-]+\/)\"" 32 | # # Graz-Umgebung 33 | # START_URL = 'https://www.willhaben.at/iad/immobilien/eigentumswohnung/steiermark/graz-umgebung/' 34 | # ITEM_URL_REGEX = r"\"url\":\"(\/iad\/immobilien\/d\/eigentumswohnung\/steiermark\/graz-umgebung\/[a-z,A-Z,0-9,-]+\/)\"" 35 | 36 | ITEM_IMG_REGEX = r'"referenceImageUrl":"(https:\/\/cache.willhaben.at[-a-zA-Z0-9@:%._\+~#=/]+)"' 37 | BASE_URL = "https://www.willhaben.at" 38 | name = 'willhaben' 39 | allowed_domains = ['willhaben.at'] 40 | start_urls = [ 41 | START_URL 42 | ] 43 | 44 | def parse(self, response): 45 | """ 46 | This is the default callback used by Scrapy to process downloaded 47 | responses, when their requests don't specify a callback (e.g. the 48 | requests generated from `start_urls`). 49 | """ 50 | 51 | # get item urls and yield a request for each item 52 | relative_item_urls = re.findall(self.ITEM_URL_REGEX, response.text) 53 | item_count = len(relative_item_urls) 54 | if item_count == 25: 55 | logging.info("Found {} items on page {}".format( 56 | item_count, response.url)) 57 | elif item_count >= 20: 58 | logging.warning("Found only {} items on page {}".format( 59 | item_count, response.url)) 60 | else: 61 | logging.error("Found only {} items on page {}".format( 62 | item_count, response.url)) 63 | 64 | for relative_item_url in relative_item_urls: 65 | full_item_url = self.BASE_URL + relative_item_url 66 | yield scrapy.Request(full_item_url, self.parse_item) 67 | 68 | # get the next page of the list 69 | soup = BeautifulSoup(response.text, 'lxml') 70 | pagination_btn = soup.find( 71 | 'a', attrs={"data-testid": "pagination-top-next-button"}) 72 | if pagination_btn: # the last result page has no next button 73 | next_page_url = self.BASE_URL + pagination_btn['href'] 74 | yield scrapy.Request(next_page_url, self.parse) 75 | 76 | def parse_item(self, response): 77 | """returns/yields a :py:class:`CondoItem`. 
78 | 79 | This is the callback used by Scrapy to parse downloaded item pages. 80 | """ 81 | item = CondoItem() 82 | item.set_default_values() 83 | item['url'] = response.url 84 | item['discovery_date'] = datetime.datetime.now().strftime("%Y-%m-%d") 85 | # time could also be added if needed: "%Y-%m-%d %H:%M:%S" 86 | 87 | soup = BeautifulSoup(response.text, 'lxml') 88 | # empty all script tags so they don't pollute the extracted text 89 | for s in soup('script'): 90 | s.clear() 91 | 92 | # title 93 | title_tag = soup.find('h1') 94 | if title_tag: 95 | item['title'] = title_tag.get_text() 96 | else: 97 | logging.error("title element not found on page " + item['url']) 98 | 99 | # price 100 | price_tag = soup.find( 101 | 'span', attrs={"data-testid": "contact-box-price-box-price-value"}) 102 | if price_tag: 103 | visible_price_text = price_tag.get_text() 104 | item.parse_price(visible_price_text) 105 | else: 106 | logging.error("price element not found on page " + item['url']) 107 | 108 | # size 109 | size_tag = soup.find( 110 | 'div', attrs={"data-testid": "ad-detail-teaser-attribute-0"}) 111 | if size_tag: 112 | visible_size_text = size_tag.get_text() 113 | item.parse_size(visible_size_text) 114 | else: 115 | logging.error("size element not found on page " + item['url']) 116 | 117 | # room_count 118 | room_count_tag = soup.find( 119 | 'div', attrs={"data-testid": "ad-detail-teaser-attribute-1"}) 120 | if room_count_tag: 121 | room_count_text = room_count_tag.get_text() 122 | item.parse_room_count(room_count_text) 123 | else: 124 | logging.error( 125 | "room_count element not found on page " + item['url']) 126 | 127 | # alternative size and room count parsing (from attributes) 128 | attribute_tags = soup.findAll( 129 | 'li', attrs={"data-testid": "attribute-item"}) 130 | if attribute_tags: 131 | for attribute_tag in attribute_tags: 132 | attribute_text = attribute_tag.get_text() 133 | # parse size again if zero 134 | if item['size'] == 0: 135 | item.parse_size_2(attribute_text) 136 | # parse room_count again if zero 137 | if item['room_count'] == 0: 138 | item.parse_room_count_2(attribute_text) 139 | else: 140 | logging.error( 141 | "attribute elements not found on page " + item['url']) 142 | 143 | # address, postal_code and district 144 | location_address_tag = soup.find( 145 | 'div', attrs={"data-testid": "object-location-address"}) 146 | if location_address_tag: 147 | location_address_text = location_address_tag.get_text() 148 | # parse address 149 | item['address'] = location_address_text 150 | # parse postal_code 151 | match = re.search(r'8\d\d\d', location_address_text) 152 | if match: 153 | item['postal_code'] = match[0] # The entire match 154 | else: 155 | logging.error( 156 | "postal_code parsing failed on page " + item['url']) 157 | # parse district 158 | match = re.search(r'8\d\d\d ([^,]+)', location_address_text) 159 | if match: 160 | item['district'] = match[1] # The first group 161 | else: 162 | logging.error( 163 | "district parsing failed on page " + item['url']) 164 | else: 165 | logging.error("element for address, postal_code and district " + 166 | "not found on page " + item['url']) 167 | 168 | # willhaben_code 169 | willhaben_code_tag = soup.find( 170 | 'span', attrs={"data-testid": "ad-detail-ad-id"}) 171 | if willhaben_code_tag: 172 | willhaben_code_text = willhaben_code_tag.get_text() 173 | match = re.search(r'\d+', willhaben_code_text) 174 | if match: 175 | item['willhaben_code'] = match[0] # The entire match 176 | else: 177 | logging.error( 178 | "willhaben_code parsing failed on page " + 
item['url']) 179 | else: 180 | logging.error( 181 | "willhaben_code element not found on page " + item['url']) 182 | 183 | # edit_date 184 | edit_date_tag = soup.find( 185 | 'span', attrs={"data-testid": "ad-detail-ad-edit-date"}) 186 | if edit_date_tag: 187 | item['edit_date'] = edit_date_tag.get_text() 188 | else: 189 | logging.error("edit_date element not found on page " + item['url']) 190 | 191 | # commission_fee 192 | body_tag = soup.find('article') 193 | if body_tag: 194 | body_text = body_tag.get_text() 195 | if 'provisionsfrei' in body_text.lower(): 196 | item['commission_fee'] = 0 197 | else: 198 | item['commission_fee'] = 3.6 199 | else: 200 | logging.error( 201 | "commission_fee element not found on page " + item['url']) 202 | 203 | # price_per_m2 204 | item.calc_price_per_m2() 205 | 206 | # further item processing is done in the item pipeline 207 | yield item 208 | --------------------------------------------------------------------------------
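The parsing helpers in `items.py` can be exercised without running a full crawl, which is handy when Willhaben changes its HTML and the extraction logic needs to be re-checked. A minimal sketch (not part of the repository) that could be run from a Python shell inside the webscraper container, e.g. via `docker-compose run --rm webscraper python`; the sample strings are invented and only mimic what the spider would extract from an ad page:

```python
# Quick manual check of the CondoItem parsing helpers (sketch, assumed usage).
from webscraper_for_sophie.items import CondoItem

item = CondoItem()
item.set_default_values()
item['url'] = 'https://www.willhaben.at/iad/example'  # only used in log messages

item.parse_price('€ 199.750')       # -> item['price'] == 199750
item.parse_size(' 62m²')            # -> item['size'] == 62
item.parse_room_count(' 3 Zimmer')  # -> item['room_count'] == 3
item.calc_price_per_m2()            # -> item['price_per_m2'] ≈ 3221.8

print(dict(item))                   # all extracted fields as a plain dict
```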