PythonSpiderMan/wb_spider_wozwaardeloket

Spider_wozwaardeloket

This is a scraper for www.wozwaardeloket.nl, a real estate information website.

target data with some examples

| search_locatie | street | housenumber | housenumber_ext | postcode | plaatsnaam | identificatie | price_2016 | price_2015 | bouwjaar | gebruiksdoel | oppervlakte | pdf_link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7523 VT | Jekerstraat | 102 | none | 7523VS | Enschede | 015300024745 | 115.000 | 115.000 | 1969 | woonfunctie | 92 | https://xxxx |

This is a good example of a large-scale spider

  • the target website uses cookies to protect itself from scraping.
  • we have a very large number of postcodes to scrape, so we need to raise the throughput to at least 200 pages/second.
  • this script is built on top of the Scrapy framework and uses Docker containers to speed things up.

how the website works

This is the website's entry point. Before we can initiate any other request, we need to send a GET request to the following link: https://www.wozwaardeloket.nl/index.jsp?a=1&accept=true&
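
The entry-point step could be sketched as follows, using only the standard library. The helper names (`new_session`, `prime_session`, `cookie_header`) are hypothetical and not part of this repository; the sketch assumes the server sets its session cookie (the `JSESSIONID` mentioned further down) in response to this GET.

```python
import urllib.request
from http.cookiejar import CookieJar

# Entry-point URL taken from the README text above.
ENTRY_URL = "https://www.wozwaardeloket.nl/index.jsp?a=1&accept=true&"


def new_session():
    """Return an (opener, jar) pair that starts with an empty cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def prime_session(opener, timeout=10.0):
    """GET the entry point so the server can store its cookies in the jar.

    This performs a real network request; it is not called at import time.
    """
    with opener.open(ENTRY_URL, timeout=timeout) as response:
        return response.status


def cookie_header(jar):
    """Render the jar as a Cookie header value such as 'NAME=value; OTHER=value'."""
    return "; ".join(f"{cookie.name}={cookie.value}" for cookie in jar)
```

Calling `new_session()` once per worker, or once per request, gives each one its own cookie jar, so every session starts with no cookies until `prime_session` is called.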

  • the website requires a list of postcodes in the Netherlands; we have a copy of that here.

all postcodes of the Netherlands (csv file)

  • here we choose the postcode 7523VT for demonstration. When you type the code into the text field, the website starts a new request and gives you the following result: spider_stage_1

  • click on Jekerstraat 102 on the right side; the website then starts a new request with the following response: spider_stage_2
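
Before feeding postcodes to the spider, it can help to normalize them to the compact form typed into the search field (`7523VT`). The sketch below is one way to do that. The layout of the CSV file is not described here, so reading the postcode from the first column is an assumption, and `normalize_postcode` and `read_postcodes` are hypothetical helper names.

```python
import csv
import io
import re

# Dutch postcodes are four digits (not starting with 0) followed by two letters.
POSTCODE_RE = re.compile(r"^[1-9][0-9]{3}[A-Z]{2}$")


def normalize_postcode(raw):
    """Return the postcode as 'NNNNLL' (no spaces, upper case), or None if malformed."""
    compact = re.sub(r"\s+", "", raw).upper()
    return compact if POSTCODE_RE.match(compact) else None


def read_postcodes(csv_file, column=0):
    """Yield unique, normalized postcodes from one column of an open CSV file.

    The column index is an assumption about the file layout.
    """
    seen = set()
    for row in csv.reader(csv_file):
        if len(row) <= column:
            continue  # skip empty or short rows
        code = normalize_postcode(row[column])
        if code and code not in seen:
            seen.add(code)
            yield code


# In-memory stand-in for the real CSV file; the header row is skipped
# because it does not look like a postcode.
sample = io.StringIO("postcode\n7523 VT\n7523vt\n7523VS\n")
print(list(read_postcodes(sample)))  # ['7523VT', '7523VS']
```

Deduplicating up front avoids spending requests on the same postcode twice.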

What is happening in the background

  • when we send the initial request to the website, it sets cookies in the browser, which the website uses to detect web bots

  • when we enter a postcode, say 7523VT, the site looks the postcode up in its database.

    • this is the corresponding ajax request (GET):

    • this ajax request gives us the following response:

      • stage1_ajax.PNG
    • another ajax request (GET) starts immediately after this request completes

      This request takes the id from the previous response.

    • another ajax request (POST) starts immediately after that request completes

      In my tests, the request still works normally if we remove every header from the POST except the Cookie value. Using a different JSESSIONID for each request we send appears to keep us from getting banned. this request takes the

      • https://www.wozwaardeloket.nl/woz-proxy/wozloket
      • POST headers
        • stage3_ajax_headers.PNG
      • POST body
        <wfs:GetFeature
            xmlns:wfs="http://www.opengis.net/wfs" service="WFS" version="1.1.0" xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992"
                xmlns:WozViewer="http://WozViewer.geonovum.nl"
                xmlns:ogc="http://www.opengis.net/ogc">
                <ogc:Filter
                    xmlns:ogc="http://www.opengis.net/ogc">
                    <ogc:And>
                        <ogc:Contains>
                            <ogc:PropertyName>wobj_geometrie</ogc:PropertyName>
                            <gml:Point
                                xmlns:gml="http://www.opengis.net/gml">
                                <gml:pos>257727.8 473442.88</gml:pos>
                            </gml:Point>
                        </ogc:Contains>
                        <ogc:BBOX>
                            <ogc:PropertyName>wobj_geometrie</ogc:PropertyName>
                            <gml:Envelope
                                xmlns:gml="http://www.opengis.net/gml">
                                <gml:lowerCorner>257726.8 473441.88</gml:lowerCorner>
                                <gml:upperCorner>257728.8 473443.88</gml:upperCorner>
                            </gml:Envelope>
                        </ogc:BBOX>
                    </ogc:And>
                </ogc:Filter>
            </wfs:Query>
        </wfs:GetFeature>
        
      • which gives us the following response
        • stage3_ajax_response.PNG
  • after we click Jekerstraat 102 on the right side, the browser starts several new requests; we only need the critical one, which returns the price of the property

    • this is the URL of the request, the same as for the previous request: https://www.wozwaardeloket.nl/woz-proxy/wozloket
    • with the same headers as the previous request.
    • but with a different POST body
      <wfs:GetFeature
          xmlns:wfs="http://www.opengis.net/wfs" service="WFS" version="1.1.0" xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
          <wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992"
              xmlns:WozViewer="http://WozViewer.geonovum.nl"
              xmlns:ogc="http://www.opengis.net/ogc">
              <ogc:Filter
                  xmlns:ogc="http://www.opengis.net/ogc">
                  <ogc:PropertyIsEqualTo matchCase="true">
                      <ogc:PropertyName>wobj_obj_id</ogc:PropertyName>
                      <ogc:Literal>015300024745</ogc:Literal>
                  </ogc:PropertyIsEqualTo>
              </ogc:Filter>
          </wfs:Query>
      </wfs:GetFeature>
      
    • which gives us the following response
      • stage4_response.PNG
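
The two POST bodies above differ only in their filter, so they can be generated from one template. The following is a minimal sketch, not the code used by this spider: the function names are hypothetical, the namespace declarations are hoisted to the root element (equivalent XML), and the 1-unit margin of the bounding box is read off the example coordinates above.

```python
import urllib.request
import xml.etree.ElementTree as ET
from decimal import Decimal
from xml.sax.saxutils import escape

WFS_URL = "https://www.wozwaardeloket.nl/woz-proxy/wozloket"

_HEAD = (
    '<wfs:GetFeature xmlns:wfs="http://www.opengis.net/wfs" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:WozViewer="http://WozViewer.geonovum.nl" '
    'xmlns:ogc="http://www.opengis.net/ogc" '
    'xmlns:gml="http://www.opengis.net/gml" '
    'service="WFS" version="1.1.0" '
    'xsi:schemaLocation="http://www.opengis.net/wfs '
    'http://schemas.opengis.net/wfs/1.1.0/wfs.xsd">'
    '<wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992">'
    "<ogc:Filter>"
)
_TAIL = "</ogc:Filter></wfs:Query></wfs:GetFeature>"


def body_for_point(x, y, margin="1"):
    """Body that finds the object containing point (x, y); pass coordinates as strings."""
    x, y, m = Decimal(x), Decimal(y), Decimal(margin)  # Decimal keeps "257726.8" exact
    return (
        _HEAD
        + "<ogc:And><ogc:Contains>"
        + "<ogc:PropertyName>wobj_geometrie</ogc:PropertyName>"
        + f"<gml:Point><gml:pos>{x} {y}</gml:pos></gml:Point>"
        + "</ogc:Contains><ogc:BBOX>"
        + "<ogc:PropertyName>wobj_geometrie</ogc:PropertyName>"
        + f"<gml:Envelope><gml:lowerCorner>{x - m} {y - m}</gml:lowerCorner>"
        + f"<gml:upperCorner>{x + m} {y + m}</gml:upperCorner></gml:Envelope>"
        + "</ogc:BBOX></ogc:And>"
        + _TAIL
    )


def body_for_object(obj_id):
    """Body that fetches one object by its wobj_obj_id."""
    return (
        _HEAD
        + '<ogc:PropertyIsEqualTo matchCase="true">'
        + "<ogc:PropertyName>wobj_obj_id</ogc:PropertyName>"
        + f"<ogc:Literal>{escape(obj_id)}</ogc:Literal>"
        + "</ogc:PropertyIsEqualTo>"
        + _TAIL
    )


def build_request(body, cookie):
    """Build (but do not send) the POST, setting only the Cookie header explicitly."""
    return urllib.request.Request(
        WFS_URL, data=body.encode("utf-8"), headers={"Cookie": cookie}, method="POST"
    )


# "JSESSIONID=PLACEHOLDER" is a made-up value; use a cookie obtained from the site.
request = build_request(body_for_object("015300024745"), "JSESSIONID=PLACEHOLDER")
ET.fromstring(request.data)  # raises ParseError if the body is not well-formed XML
```

Nothing is sent until the request is passed to `urllib.request.urlopen` or an opener. When it is sent, `urllib` adds a few default headers of its own on top of the explicit Cookie header.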

how we can scrape the site

  • are you still considering the traditional ways of scraping this website? Well, we don't need to do that. Look at the request body: we can change the filters to dump its database directly. Amazing, so let's do it.

The intelligent way (Hacking the OpenLayers 3 database)

  • All we need is a list of wobj_obj_id values. We need to find the range of wobj_obj_id so that we can dump the database in batches.
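
Batching over an id range might look like the sketch below. It rests on several assumptions: ids are treated as 12-digit zero-padded numbers (the width of the example `015300024745`), the helper names are hypothetical, and combining several `ogc:PropertyIsEqualTo` clauses with `ogc:Or` is a standard OGC filter construct that has not been verified against this server.

```python
from xml.sax.saxutils import escape

ID_WIDTH = 12  # assumed from the example id "015300024745"


def id_batches(first, last, size):
    """Yield lists of zero-padded ids covering first..last (inclusive), `size` per list."""
    for start in range(first, last + 1, size):
        stop = min(start + size, last + 1)
        yield [str(n).zfill(ID_WIDTH) for n in range(start, stop)]


def or_filter(ids):
    """Return the inner part of an ogc:Filter that matches any of the given ids.

    The result is a fragment meant to be placed inside <ogc:Filter>...</ogc:Filter>.
    """
    clauses = "".join(
        '<ogc:PropertyIsEqualTo matchCase="true">'
        "<ogc:PropertyName>wobj_obj_id</ogc:PropertyName>"
        f"<ogc:Literal>{escape(obj_id)}</ogc:Literal>"
        "</ogc:PropertyIsEqualTo>"
        for obj_id in ids
    )
    # A single id needs no ogc:Or wrapper.
    return f"<ogc:Or>{clauses}</ogc:Or>" if len(ids) > 1 else clauses


batches = list(id_batches(15300024745, 15300024749, 2))
print(batches)
# [['015300024745', '015300024746'], ['015300024747', '015300024748'], ['015300024749']]
```

The batch size trades the number of requests against response size; a suitable value would have to be found by experiment.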
