Project Excalibur: Serverless, Chromium-powered HKUST SIS Scraper

The ultimate weapon against ever-changing, hard-to-reverse-engineer scraping targets

This personal project is an experiment to see whether headless Chrome and a serverless architecture could be deployed at USThing to reduce the workload of reverse engineering and long-term maintenance. The product, if successful, could replace our current PHP-based ~~spaghetti-ish~~ SIS scraper.

TODO

Usage

Local Development / Trial

npm i
npm run dev

Deploy to AWS

Deploys to 512 MB RAM Lambda instances in ap-southeast-1 (Singapore)
You will need to create an AWS account (free tier?) and set up the serverless CLI beforehand

serverless deploy

REST API

GET /:scopes

list of requested data scopes
separated by +, e.g. /grades+program_info
valid scopes: all, grades, program_info, schedule (more to come)
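
As an illustration of the scope syntax above, here is a minimal client-side sketch that builds the /:scopes path segment. The helper name buildScopesPath is hypothetical and not part of Excalibur; the scope names are the ones listed in this README, and the server may accept others in the future.

```javascript
// Scopes listed in this README; the server may accept more over time.
const KNOWN_SCOPES = ['all', 'grades', 'program_info', 'schedule']

// Hypothetical helper: joins scope names into the /:scopes path segment.
function buildScopesPath (scopes) {
  if (!Array.isArray(scopes) || scopes.length === 0) {
    throw new TypeError('scopes must be a non-empty array')
  }
  const unknown = scopes.filter(scope => !KNOWN_SCOPES.includes(scope))
  if (unknown.length > 0) {
    throw new RangeError(`unknown scope(s): ${unknown.join(', ')}`)
  }
  // De-duplicate while keeping the caller's order, then join with '+'.
  return '/' + [...new Set(scopes)].join('+')
}

console.log(buildScopesPath(['grades', 'program_info'])) // /grades+program_info
```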

Parameters

course_status

filter for schedule scope
possible values: enrolled, dropped, waitlisted
separated by +, e.g. /schedule?course_status=enrolled+waitlisted
default: enrolled
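
A small sketch of how a client might build the schedule request with this filter. buildSchedulePath is a hypothetical name. The query string is assembled by hand because URLSearchParams would percent-encode the + separator as %2B, and this README does not say whether the server accepts that form.

```javascript
// Status values listed in this README.
const COURSE_STATUSES = ['enrolled', 'dropped', 'waitlisted']

// Hypothetical helper: builds the /schedule path with a course_status filter.
function buildSchedulePath (statuses = []) {
  const invalid = statuses.filter(status => !COURSE_STATUSES.includes(status))
  if (invalid.length > 0) {
    throw new RangeError(`invalid course_status value(s): ${invalid.join(', ')}`)
  }
  // With no filter given, rely on the server-side default (enrolled).
  if (statuses.length === 0) return '/schedule'
  // Joined by hand so the '+' separator is sent literally.
  return `/schedule?course_status=${statuses.join('+')}`
}

console.log(buildSchedulePath(['enrolled', 'waitlisted']))
// /schedule?course_status=enrolled+waitlisted
```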

GET /login

this endpoint is unnecessary for vanilla usage; request /:scopes directly with auth headers set
special endpoint for AWS Lambda usage, where the timeout is limited to 29s (just enough for the login/2FA step alone)
returns a response immediately after login (with forwarded cookies to preserve login state)
actual data requests can be made in subsequent requests
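
The two-step flow could look roughly like the sketch below. It assumes the forwarded cookies arrive as Set-Cookie response headers and are accepted back in a Cookie request header, and that responses are JSON; none of this is confirmed by this README. toCookieHeader and loginThenFetch are hypothetical names, and the auth headers are resent on the second request only because the README does not say whether cookies alone are enough.

```javascript
// Turns Set-Cookie header values into a single Cookie request header value.
// Only the leading name=value pair of each entry is kept; attributes such as
// Path or Expires are dropped.
function toCookieHeader (setCookieValues) {
  return setCookieValues
    .map(value => value.split(';')[0].trim())
    .filter(pair => pair.includes('='))
    .join('; ')
}

// Hypothetical flow: log in first, then fetch data in a second request.
// fetchImpl defaults to the global fetch (Node.js 20 and later).
async function loginThenFetch (baseUrl, scopesPath, authHeaders, fetchImpl = fetch) {
  const login = await fetchImpl(`${baseUrl}/login`, { headers: authHeaders })
  if (!login.ok) {
    throw new Error(`login failed with status ${login.status}`)
  }
  // Headers.getSetCookie() returns every Set-Cookie value as an array.
  const cookie = toCookieHeader(login.headers.getSetCookie())
  const data = await fetchImpl(baseUrl + scopesPath, {
    headers: { ...authHeaders, Cookie: cookie }
  })
  if (!data.ok) {
    throw new Error(`data request failed with status ${data.status}`)
  }
  return data.json()
}
```

Splitting the work this way keeps each Lambda invocation short: the first call only waits for login and 2FA, and the second call starts from an already authenticated session.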

Authentication

pass ITSC username and password via the X-Excalibur-Username and X-Excalibur-Password headers
2FA approval in the Duo app is required during the first request
cookies received from source sites are forwarded so login state is retained
for the program_info scope, cookie forwarding does not work because the CAS cookies are short-lived; a fresh login is required
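
A minimal sketch of a single authenticated request using the two headers above. buildAuthHeaders and requestScopes are hypothetical helpers, and the JSON response body is an assumption that this README does not confirm.

```javascript
// Header names are taken from this README.
function buildAuthHeaders (username, password) {
  if (!username || !password) {
    throw new TypeError('both username and password are required')
  }
  return {
    'X-Excalibur-Username': username,
    'X-Excalibur-Password': password
  }
}

// Hypothetical wrapper around one authenticated request.
// Assumes the response body is JSON, which this README does not confirm.
async function requestScopes (baseUrl, scopesPath, credentials, fetchImpl = fetch) {
  const response = await fetchImpl(baseUrl + scopesPath, {
    headers: buildAuthHeaders(credentials.username, credentials.password)
  })
  if (!response.ok) {
    throw new Error(`request failed with status ${response.status}`)
  }
  return response.json()
}
```

The first call is expected to block until the 2FA prompt is approved, so a caller would typically allow a generous timeout for it.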

Config

this was for debugging before the web API was developed; it is left here for possible CLI development
add credentials to config.sample.json and rename the file to config.json

Benchmarks

The following benchmark was measured from an off-campus location. The Excalibur instance was hosted on AWS Lambda in Tokyo, and the PHP API was hosted on the USThing server on the HKUST campus (with Cloudflare CDN in between, which takes ~5ms)

(Note: Apparently the Singapore region has better ping to HK; will revisit this later)

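| Response time (ms) | Excalibur run 1 | Excalibur run 2 | Excalibur run 3 | Excalibur mean | PHP API run 1 | PHP API run 2 | PHP API run 3 | PHP API mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ping | 55.469 | 57.405 | 56.426 | 56.433 | 7.201 | 6.431 | 6.541 | 6.724 |
| Login with cookie cached | 3016 | 5937 | 3066 | 4006.333 | 2622 | 2679 | 2224 | 2508.333 |
| Timetable | 7677 | 7321 | 7107 | 7368.333 | 816 | 893 | 850 | 853 |
| Grades | 14713 | 14866 | 14375 | 14651.333 | 838 | 861 | 753 | 817.333 |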

Discussion

compared to the PHP-based API, Excalibur responses were much slower by mean response time: about 1.6x for login, 8.6x for Timetable and 18x for Grades
this is probably due to the time consumed by real browser rendering and the lack of result caching (the PHP API caches past grades for quick re-access)
another major factor is the physical distance between AWS datacenters and Hong Kong, where the latency penalty is multiplied by the number of pages fetched
further enhancements could be made to Excalibur, e.g. filters for fetching only the latest semester's data (useful for grades / waitlist refresh, or for USThing's use case, where past data is stored on the client)
TODO: re-run benchmarks for Excalibur on a local server
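
To make the distance argument above concrete, here is a rough back-of-the-envelope model. It assumes page loads happen one after another and that each page needs a fixed number of sequential round trips. The function name and all input values are illustrative assumptions, not measurements from this benchmark.

```javascript
// Rough model: every page load costs some number of sequential network round
// trips, and each round trip pays the extra latency of the remote region.
function estimateExtraLatencyMs ({ pages, roundTripsPerPage, extraRttMs }) {
  for (const [name, value] of Object.entries({ pages, roundTripsPerPage, extraRttMs })) {
    if (!Number.isFinite(value) || value < 0) {
      throw new RangeError(`${name} must be a non-negative number`)
    }
  }
  return pages * roundTripsPerPage * extraRttMs
}

// Illustrative inputs only, not measurements from the benchmark above.
const extraMs = estimateExtraLatencyMs({ pages: 10, roundTripsPerPage: 3, extraRttMs: 50 })
console.log(extraMs) // 1500
```

Under these assumed inputs, an extra 50ms per round trip adds 1.5s to a scrape, which is why fetching fewer pages (for example, only the latest semester) or hosting closer to Hong Kong could both help.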

Contributing

Open Issue -> Discussion -> Pull Request -> Merge after Review -> Our world made better :)
Please use StandardJS linting and async-await

License

Open sourced under MIT License
