California police and law enforcement agencies that upload SB 1421 data on their websites
- Since many agencies use nextrequest.com, and each agency's page is a subdomain of nextrequest.com, I set out to find the number of nextrequest subdomains.
- I found all the subdomains of nextrequest.com using Knockpy, a Python subdomain-scanning script.
- Here’s how I installed Knock.
- In Terminal, I installed and ran Knock with the following commands (cloning the repository first):
git clone https://github.com/guelfoweb/knock.git
cd knock
pip3 install -r requirements.txt
python3 knockpy.py nextrequest.com
- A JSON file containing all the subdomains of nextrequest.com is created.
- I converted the large JSON file to CSV using the website:
https://www.convertcsv.com/json-to-csv.htm
- On the website, I used the Choose File option to upload the JSON file and clicked the Convert JSON to CSV button.
- I downloaded the converted CSV file and imported it into the main spreadsheet as a new sheet titled nextrequest.com.
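The JSON-to-CSV conversion can also be done locally instead of through a website. Here is a hedged Python sketch using only the standard library; Knockpy's exact JSON layout varies by version, so the sample record below is an assumption, not the tool's actual output:

```python
import csv
import io
import json

def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text.

    Assumes every record shares the keys of the first one, which is
    typically true for scan-tool output.
    """
    records = json.loads(json_text)
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

# Hypothetical sample shaped like a subdomain-scan result
sample = '[{"subdomain": "data.nextrequest.com", "ip": "192.0.2.1"}]'
print(json_records_to_csv(sample))
```

The resulting CSV text can be written to a file and imported into the spreadsheet the same way.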
- The spreadsheet now held more than 4,000 URLs of possible nextrequest subdomains, but most of them were actually 302 redirects.
- I wanted to discard all the 302 redirects and keep only the 200s. To do so, I first needed to identify the status codes of all 4,000-plus URLs.
- I used a tip offered by a blogger.
- From the spreadsheet’s Extensions menu, I clicked Apps Script.
- On the resulting page, I removed all the existing code and pasted the following:
function getStatusCode(url) {
  var url_trimmed = url.trim();
  // Check if the script cache has a cached status code for the given URL
  var cache = CacheService.getScriptCache();
  var result = cache.get(url_trimmed);
  // If the value is not in the cache (or the cache has expired), fetch the URL
  if (!result) {
    var options = {
      'muteHttpExceptions': true,
      'followRedirects': false
    };
    var response = UrlFetchApp.fetch(url_trimmed, options);
    var responseCode = response.getResponseCode();
    // Store the response code in the script cache for subsequent retrievals
    cache.put(url_trimmed, responseCode, 21600); // 6 hours, the cache's maximum duration
    result = responseCode;
  }
  return result;
}
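The same check, fetching a status code without following redirects, can be sketched outside of Sheets in Python using only the standard library. This is a hedged equivalent, not part of the original workflow; the helper name and the sample data are mine:

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Suppress redirect following so 302s are reported, not chased."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def get_status_code(url, timeout=10):
    """Return the HTTP status code for url without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        return opener.open(url, timeout=timeout).status
    except urllib.error.HTTPError as err:
        # With redirects suppressed, 3xx (and 4xx/5xx) arrive as HTTPError
        return err.code

# Keeping only the 200s from already-fetched (url, status) pairs:
checked = [("https://a.example", 200), ("https://b.example", 302)]
live = [url for url, status in checked if status == 200]
print(live)
```

Looping get_status_code over a list of URLs would reproduce the spreadsheet's Status Code column.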
- I created a new column next to the URLs titled Status Code. In the first cell, I entered the following function:
=getStatusCode(B2)
where B2 refers to the cell that contains the URL.
- All the URLs came back as 302. That, I realized, was because all the URLs used the http protocol instead of https.
- I copied all the URL text, which had no http prefix, and pasted it into a new column as values only. I then created another column and filled every cell with https://.
- I combined the two columns, the https:// prefix and the bare URL text, using this formula:
=A2&B2
where A2 holds the https:// prefix and B2 holds the URL text.
- Every URL was now a working hyperlink with the https protocol.
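The same http-to-https normalization can be expressed as a small Python helper; this is a minimal sketch under my own naming, not the spreadsheet method itself:

```python
def force_https(url):
    """Normalize a URL or bare hostname to the https scheme."""
    if url.startswith("https://"):
        return url
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return "https://" + url

print(force_https("http://oakland.nextrequest.com"))  # https://oakland.nextrequest.com
print(force_https("sanjose.nextrequest.com"))         # https://sanjose.nextrequest.com
```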
- I then re-applied the
=getStatusCode(B2)
formula.
- It now correctly reported the status codes of all the URLs.
- I found only 48 URLs with a valid 200 code and only a handful of error/404 codes, which I checked and removed manually. The remaining URLs were 302 redirects.
- I filtered out the 48 valid URLs, copied them, and pasted them into a new tab/sheet titled valid_nextrequest.
- On the new valid_nextrequest sheet, I created a new column next to Status Code.
- In the new column, I used the formula
=IMPORTXML(C2,"//h1")
where C2 refers to the cell that contains a website’s hyperlink, to fetch each website’s h1 text.
- I found only a handful of errors; the errors came from websites that no longer use the service.
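Outside of Sheets, the same h1 extraction can be sketched with Python's standard-library HTML parser. Fetching the page itself is omitted here, and the sample markup is hypothetical:

```python
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    """Capture the text of the first <h1> element in a page."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.h1 = None

    def handle_starttag(self, tag, attrs):
        # Only start capturing for the first h1 encountered
        if tag == "h1" and self.h1 is None:
            self.in_h1 = True
            self.h1 = ""

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.h1 += data

def first_h1(html):
    parser = H1Extractor()
    parser.feed(html)
    return parser.h1

print(first_h1("<html><body><h1>City of Oakland</h1></body></html>"))  # City of Oakland
```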
- After removing the errors, I was left with 42 valid nextrequest subdomains.
- Since the number of valid subdomains was low, I manually identified the URLs of agencies in California.