This is a scraper for www.wozwaardeloket.nl, a real estate information website.
search_locatie | street | housenumber | housenumber_ext | postcode | plaatsnaam | identificatie | price_2016 | price_2015 | bouwjaar | gebruiksdoel | oppervlakte | pdf_link |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7523 VT | Jekerstraat | 102 | none | 7523VS | Enschede | 015300024745 | 115.000 | 115.000 | 1969 | woonfunctie | 92 | https://xxxx |
- The target website uses cookies to protect itself from scraping.
- We have a large number of postcodes to scrape, so we need to raise the speed to at least 200 pages per second.
- This script is built on top of the Scrapy framework and uses Docker containers to speed things up.
This is the website's entry point: before we can send any other request, we need to send a GET request to the following link: https://www.wozwaardeloket.nl/index.jsp?a=1&accept=true&
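A minimal sketch of this first step, written with the Python standard library rather than the project's Scrapy code. The helper names are hypothetical, and it assumes the session cookie arrives in a `Set-Cookie` header of the final response under the name `JSESSIONID`; if the site sets it on a redirect instead, this would need adjusting.

```python
from http.cookies import SimpleCookie
from typing import Optional
from urllib.request import Request, urlopen

ENTRY_URL = "https://www.wozwaardeloket.nl/index.jsp?a=1&accept=true&"


def extract_jsessionid(set_cookie_header: str) -> Optional[str]:
    """Return the JSESSIONID value from one Set-Cookie header, or None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get("JSESSIONID")
    return morsel.value if morsel is not None else None


def fetch_session_id(timeout: float = 10.0) -> Optional[str]:
    """Send the entry-point GET and return the session id the site sets.

    Assumption: the cookie is named JSESSIONID and is set on this response.
    """
    with urlopen(Request(ENTRY_URL), timeout=timeout) as response:
        for header in response.headers.get_all("Set-Cookie") or []:
            session_id = extract_jsessionid(header)
            if session_id:
                return session_id
    return None
```

Calling `fetch_session_id()` once per worker, or once per request, would give each later request its own session id.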
- The website requires a list of postcodes in the Netherlands; we keep a copy here:
  all postcodes of the Netherlands (CSV file)
- Here we choose the postcode `7523VT` for demonstration. When you type the code into the text field, the website starts a new request and gives you the following result:
- Click on `Jekerstraat 102` on the right side; the website then starts a new request with the following response:
- When we send the entry-point request to the website, it sets cookies in the browser, which the website uses to detect web bots.
- When we enter a postcode, say `7523VT`, the site searches for the postcode in its database.
- This is the corresponding AJAX request (GET):
- This request gives us the following response:
- Another AJAX request (GET) starts immediately after this request completes. It takes the `id` from the previous response.
  - https://www.wozwaardeloket.nl/api/geocoder/v2/lookup?id=pcd-55b2e73c50b5f35b9adcadf40768de91
- It gives us the following response:
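As a small illustration, the lookup URL can be assembled from the `id` value like this. The function name is hypothetical; the endpoint and the sample `id` are the ones shown above.

```python
from urllib.parse import urlencode

LOOKUP_URL = "https://www.wozwaardeloket.nl/api/geocoder/v2/lookup"


def build_lookup_url(suggestion_id: str) -> str:
    """Build the geocoder lookup URL for an id taken from the previous response."""
    return f"{LOOKUP_URL}?{urlencode({'id': suggestion_id})}"
```

Using `urlencode` keeps the URL valid even if an `id` ever contains characters that need escaping.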
- Another AJAX request (POST) starts immediately after this request completes.
  I have tested that if we remove all the headers from the POST except the `Cookie` value, the request still works normally. If we use a different `JSESSIONID` on each request we send, we never get banned. The request takes the following:
  - POST headers
  - POST body
<wfs:GetFeature xmlns:wfs="http://www.opengis.net/wfs" service="WFS" version="1.1.0" xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992" xmlns:WozViewer="http://WozViewer.geonovum.nl" xmlns:ogc="http://www.opengis.net/ogc"> <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc"> <ogc:And> <ogc:Contains> <ogc:PropertyName>wobj_geometrie</ogc:PropertyName> <gml:Point xmlns:gml="http://www.opengis.net/gml"> <gml:pos>257727.8 473442.88</gml:pos> </gml:Point> </ogc:Contains> <ogc:BBOX> <ogc:PropertyName>wobj_geometrie</ogc:PropertyName> <gml:Envelope xmlns:gml="http://www.opengis.net/gml"> <gml:lowerCorner>257726.8 473441.88</gml:lowerCorner> <gml:upperCorner>257728.8 473443.88</gml:upperCorner> </gml:Envelope> </ogc:BBOX> </ogc:And> </ogc:Filter> </wfs:Query> </wfs:GetFeature>
- It gives us the following response:
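The point-and-bounding-box POST body above can be generated from a coordinate pair. The sketch below is one way to do it, under these assumptions: the function name and the `margin` parameter are hypothetical, the margin of 1 is read off the sample body, and the server is assumed not to care that whitespace between elements is dropped. `Decimal` is used so the corner values come out exactly as in the sample.

```python
from decimal import Decimal

# Same structure as the sample body, with placeholders for the coordinates.
POINT_QUERY_TEMPLATE = (
    '<wfs:GetFeature xmlns:wfs="http://www.opengis.net/wfs" service="WFS" '
    'version="1.1.0" xsi:schemaLocation="http://www.opengis.net/wfs '
    'http://schemas.opengis.net/wfs/1.1.0/wfs.xsd" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992" '
    'xmlns:WozViewer="http://WozViewer.geonovum.nl" '
    'xmlns:ogc="http://www.opengis.net/ogc">'
    '<ogc:Filter xmlns:ogc="http://www.opengis.net/ogc"><ogc:And>'
    '<ogc:Contains><ogc:PropertyName>wobj_geometrie</ogc:PropertyName>'
    '<gml:Point xmlns:gml="http://www.opengis.net/gml">'
    '<gml:pos>{x} {y}</gml:pos></gml:Point></ogc:Contains>'
    '<ogc:BBOX><ogc:PropertyName>wobj_geometrie</ogc:PropertyName>'
    '<gml:Envelope xmlns:gml="http://www.opengis.net/gml">'
    '<gml:lowerCorner>{min_x} {min_y}</gml:lowerCorner>'
    '<gml:upperCorner>{max_x} {max_y}</gml:upperCorner>'
    '</gml:Envelope></ogc:BBOX></ogc:And></ogc:Filter>'
    '</wfs:Query></wfs:GetFeature>'
)


def build_point_query(x: str, y: str, margin: str = "1") -> str:
    """Return the WFS body for a point, with a bounding box of +/- margin."""
    px, py, m = Decimal(x), Decimal(y), Decimal(margin)
    return POINT_QUERY_TEMPLATE.format(
        x=px, y=py,
        min_x=px - m, min_y=py - m,
        max_x=px + m, max_y=py + m,
    )
```

For example, `build_point_query("257727.8", "473442.88")` reproduces the coordinates of the sample body.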
- After we click `Jekerstraat 102` on the right side, the browser starts several new requests. We only need the critical request that brings us the price of the property.
  - The URL of the request is the same as for the previous request: https://www.wozwaardeloket.nl/woz-proxy/wozloket
  - It uses the same headers as the previous request.
  - But it has a different POST body:
<wfs:GetFeature xmlns:wfs="http://www.opengis.net/wfs" service="WFS" version="1.1.0" xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992" xmlns:WozViewer="http://WozViewer.geonovum.nl" xmlns:ogc="http://www.opengis.net/ogc"> <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc"> <ogc:PropertyIsEqualTo matchCase="true"> <ogc:PropertyName>wobj_obj_id</ogc:PropertyName> <ogc:Literal>015300024745</ogc:Literal> </ogc:PropertyIsEqualTo> </ogc:Filter> </wfs:Query> </wfs:GetFeature>
- It gives us the following response:
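The price request can be sketched as a POST that carries only a `Cookie` header, in line with the observation that the other headers are not required. This uses the standard library instead of Scrapy. The function names are hypothetical, and the cookie is assumed to have the form `JSESSIONID=<value>`.

```python
from urllib.request import Request
from xml.sax.saxutils import escape

WOZ_PROXY_URL = "https://www.wozwaardeloket.nl/woz-proxy/wozloket"

# Same structure as the sample body, with a placeholder for the object id.
OBJECT_QUERY_TEMPLATE = (
    '<wfs:GetFeature xmlns:wfs="http://www.opengis.net/wfs" service="WFS" '
    'version="1.1.0" xsi:schemaLocation="http://www.opengis.net/wfs '
    'http://schemas.opengis.net/wfs/1.1.0/wfs.xsd" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<wfs:Query typeName="wozloket:woz_woz_object" srsName="EPSG:28992" '
    'xmlns:WozViewer="http://WozViewer.geonovum.nl" '
    'xmlns:ogc="http://www.opengis.net/ogc">'
    '<ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">'
    '<ogc:PropertyIsEqualTo matchCase="true">'
    '<ogc:PropertyName>wobj_obj_id</ogc:PropertyName>'
    '<ogc:Literal>{obj_id}</ogc:Literal>'
    '</ogc:PropertyIsEqualTo></ogc:Filter>'
    '</wfs:Query></wfs:GetFeature>'
)


def build_object_query(wobj_obj_id: str) -> str:
    """Return the WFS body that filters on a single wobj_obj_id."""
    return OBJECT_QUERY_TEMPLATE.format(obj_id=escape(wobj_obj_id))


def build_object_request(wobj_obj_id: str, session_id: str) -> Request:
    """Build (but do not send) the POST request for one object.

    Only the Cookie header is set explicitly; urllib adds its own default
    headers, such as Content-Type, when the request is actually sent.
    """
    return Request(
        WOZ_PROXY_URL,
        data=build_object_query(wobj_obj_id).encode("utf-8"),
        headers={"Cookie": f"JSESSIONID={session_id}"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; in the Scrapy project the same body and cookie would go into a `scrapy.Request` instead.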
- Are you still considering the traditional way of scraping this website? Well, we don't need to do that. Look at the request body: we can change the filter to dump the site's database directly. Amazing, so let's do it.
- All we need is a list of `wobj_obj_id` values. We need to find the range of `wobj_obj_id`, so that we can dump the database in batches.
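Once a range is known, splitting it into batches could look like the sketch below. It rests on assumptions drawn from the single sample id `015300024745`: that ids are numeric and zero-padded to 12 digits. The real id space may be sparse or structured differently, and the names here are hypothetical.

```python
from typing import Iterator, List

ID_WIDTH = 12  # assumption: inferred from the sample id 015300024745


def format_obj_id(number: int) -> str:
    """Zero-pad a numeric id to the assumed wobj_obj_id width."""
    return str(number).zfill(ID_WIDTH)


def id_batches(start: int, stop: int, batch_size: int) -> Iterator[List[str]]:
    """Yield lists of formatted ids covering [start, stop) in fixed-size batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for low in range(start, stop, batch_size):
        high = min(low + batch_size, stop)
        yield [format_obj_id(n) for n in range(low, high)]
```

Each batch can then be handed to a separate worker or container, which keeps the work easy to distribute and to resume.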