Asura Enkhbayar edited this page Jun 24, 2021 · 9 revisions

Research Log

  • 01.03.2021 -- started the data collection
  • 09.03.2021 -- finally got the cron job working. automatic collection script running at 6pm every day
  • 15.03.2021 -- noticed that two of the sources (sciblogs, sciline) haven't been publishing frequently enough.
  • 19.03.2021 -- added three new sources: HealthDay, News Medical, and MedPageToday and started data collection
  • 29.03.2021 -- changed popsci RSS feed URL as the old one did not work anymore
  • 31.03.2021 -- frequency of RSS collection has been increased to every 3hrs in order to determine if feeds are capping at 10 entries
  • 31.03.2021 -- newsmed was still maxing out with 10 articles. collecting every hour now
  • 12.04.2021 -- popsci seems broken. call with juan: decision to write Twitter crawlers for popsci and iflscience
  • 31.04.2021 -- collection server taken down. collection migrated to new machine.
  • 03.05.2021 -- collection was stopped
  • 23.06.2021 -- final filtering was adjusted (some Spanish articles were missed, major bug in the removal of older articles). final sample was created with 100 articles per source.
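
The final filtering and sampling step from the last entry could look roughly like the sketch below. It is not the project's actual script: the record fields (`source`, `lang`, `published`), the function name `build_sample`, the cutoff date, and the use of seeded random sampling are all assumptions made for illustration. The log only states that non-English and older articles were filtered out and that 100 articles per source were kept.

```python
import random
from collections import defaultdict
from datetime import date


def build_sample(articles, start, per_source=100, seed=0):
    """Filter articles, then draw up to `per_source` articles for each source.

    `articles` is a list of dicts with the hypothetical keys
    "source", "lang" and "published" (a datetime.date).
    """
    by_source = defaultdict(list)
    for art in articles:
        # Drop non-English articles (e.g. the Spanish ones mentioned in the log).
        if art["lang"] != "en":
            continue
        # Drop articles published before the assumed collection start.
        if art["published"] < start:
            continue
        by_source[art["source"]].append(art)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for source, items in sorted(by_source.items()):
        # A source with fewer eligible articles than `per_source` keeps all of them.
        k = min(per_source, len(items))
        sample[source] = rng.sample(items, k)
    return sample
```

Called as `build_sample(articles, start=date(2021, 3, 1))`, this returns a dict mapping each source name to its sampled articles. The cutoff of 01.03.2021 is taken from the first log entry as a plausible default and may not match the one actually used.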