Integrate scraping #15

Merged: 24 commits into dev from feature/scrape, Jul 4, 2024

freddyheppell commented Jul 3, 2024

Integrate wp-json-scraper into this codebase.

Closes #10

  • integrate library
  • add cli command and remove unneeded code
  • additional code removal
    • refactor out console library
    • remove or refactor out utils
  • improve resilience of scrape process
  • make media dl use request session
  • HTML scrape (or command to generate list to pass to wget?)
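
As a rough illustration of the "improve resilience of scrape process" item, the sketch below retries a failing request with exponential backoff. It is not the code from this PR: `fetch_with_retries`, its parameters, and the choice of backoff schedule are all hypothetical, and it uses only the standard library so that it runs on its own.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Hypothetical helper for illustration only; not part of this project.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError:
            # Connection errors and timeouts in the standard library derive
            # from OSError, as does requests.exceptions.RequestException.
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait backoff_seconds, then 2x, 4x, ... before the next attempt
            sleep(backoff_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing the request as a callable keeps the retry policy separate from the HTTP client, so the same wrapper could sit around a call made through a shared request session, as the "make media dl use request session" item suggests.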

freddyheppell linked an issue on Jul 3, 2024 that may be closed by this pull request
freddyheppell merged commit 4c9c672 into dev on Jul 4, 2024
6 checks passed
freddyheppell deleted the feature/scrape branch on July 4, 2024 16:03
freddyheppell added a commit that referenced this pull request Jul 11, 2024
* migrate to poetry (#3)

* migrate to poetry

* move poetry install first

* update lockfile

* fix script declaration

* update docs

* bump workflow versions

* Support more Python versions (#4)

* test up to py3.12

* drop py3.8 as it's nearly eol

* bump package versions

* Fix removed numpy NaN

* update py version in readme

* Change linting to Ruff (#5)

* Install ruff

* remove makefile

* ruff check fix

* manual ruff fixes

* undo .at to .loc with noqa

* update lint workflow

* delete old flake8 config

* fix CI lint commands

* convert pickle fixtures to json tables (#6)

* Remove yoast plugin requirement (#12)

* better handling without yoast plugin

* run ruff

* Better support no scrape (#13)

* Allow empty scrape properly

* ruff lint

* Integrate scraping (#15)

* incorporate wp-json-scraper module

* ruff autofix

* docconvert

* manual fixes

* fix noqa comment

* refactor cli to support subcommands

* Basic integration of downloader

* swap progress bar implementation

* ruff

* remove object display code

* add media dl support

* lint

* remove totally unused modules

* remove unused csv export code

* ruff

* implement proxy/cookies/auth on download client

* improve subparser documentation

* ruff

* add NOTICE file

* remove plugin list

* remove console util

* dl utils cleanup

* implement request robustness features

* make media dl use request session

* Add docs (#18)

* docs

* ruff

* fix broken test

* fix typing union

* add deploy dir to gitignore

* change project name to match package name

* hotfix: remove docconvert

* hotfix: fix package name for version get

* Packaging improvements (#19)

* package meta and readme updates

* relax dependency constraints

* regen lock

* readme tweaks

* put back gh markdown admonition

* rename main package for consistency (#20)

* rename main package

* ruff

* Feature/prefix consistency (#21)

* wip

* make prefix behaviour consistent between commands

* update cli docs

* Some tests for downloader (#22)

* add downloader tests

* test no prefix downloader too

* test wpapi and exporter

* ruff

* hotfix: print and wrong command name

* fix dl input directory path

* Prerelease restructure (#23)

* expose dl api a bit better

* improve docs for dl module

* enable pyupgrade rules

* enable ruff specific rules

* remove some doc noqas

* ruff reformat

* fix export decorator usage

* change dl media to use pathlib

* fix build system in pyproject

* minor cli and docs fixes

* add changelog for 1.0.0

* clean up naming of translation pickers

* set version to 1.0.0rc1

* change pytest to use importlib mode (#25)

* Change CLI to use click (#26)

* swap to click for cli

* ruff

* remove old argparse support

* use download instead of dl consistently

* update docs for cli

* ruff

* Feature/packaging meta (#27)

* add classifiers

* update license docs to match new path

* rename repo

* Remove print statements (#28)

* fix name in notice file

* replace prints with logs

* add cli version

* Ruff

* prepare release 1.0.0

* add publish workflow

* Add building to CI test
Development

Successfully merging this pull request may close these issues.

Linked issue: Integrate site scraping