- specific exception classes within `trove`
- better search api error responses
- better search-api html experience
- more static vocabs
- fix various errors
- fix: jsonapi renderer now chooses `type` consistently
- speed up oai-pmh queries
- improve trove simple-json and html experience
- add "simple json" renderer for search api responses
- update django to 3.2.25
- fix oai-pmh feed
- add `osfmap:hasCedarTemplate` to `trove.vocab`
- fix: allow date literals for legacy sharev2_elastic deriver
- add docs:
  - /trove/docs/openapi.json
  - /trove/docs/openapi.html
  - /vocab/2023/trove/...
- allow adding propertypaths to `cardSearchText` and `valueSearchText`
  - e.g. `cardSearchText[creator.name]=...`
- anywhere a set of propertypaths is encoded in query params, allow simple glob-paths (`*`, `*.*`, `*.*.*`) that match any propertypath of the given length
  - note: partial globs (e.g. `*.name` or `publisher.*`) are not supported (...yet?)
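A minimal sketch of length-based glob matching, assuming each glob step is written `*` and steps are joined with `.`; the function name `matches_glob_path` is hypothetical and not taken from trove:

```python
def matches_glob_path(glob_path: str, propertypath: str) -> bool:
    """Return True when an all-wildcard glob-path has as many steps as the propertypath."""
    glob_steps = glob_path.split(".")
    if any(step != "*" for step in glob_steps):
        # partial globs (a mix of "*" and named steps) are not supported
        raise ValueError(f"unsupported glob-path: {glob_path!r}")
    return len(propertypath.split(".")) == len(glob_steps)
```

Under these assumptions a two-step glob matches `creator.name` but not `creator.affiliation.name`.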
- when an iri value returned by an index-value-search has a full index-card, include that index-card instead of the stub built from indexed values
- friendlier FeatureFlag admin list
- BREAKING: allow multiple propertypaths in query params
  - use `.` to delimit steps in a path; e.g. `creator.affiliation` is a path of two steps (previously would be `creator,affiliation`)
  - use `,` to delimit multiple paths; e.g. `creator.name,contributor.name` would be two paths (previously impossible)
  - hidden behind feature flag: `periodic_propertypaths`
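The delimiter rules above could be parsed roughly like this; a sketch only, with a hypothetical helper name (`parse_propertypaths`) and an assumed policy of rejecting empty steps:

```python
def parse_propertypaths(param_value: str) -> list[tuple[str, ...]]:
    """Split a query-param value into paths (on ",") and each path into steps (on ".")."""
    paths = []
    for raw_path in param_value.split(","):
        steps = tuple(step.strip() for step in raw_path.split("."))
        if not all(steps):
            # assumption: an empty step (e.g. "creator..name") is an error
            raise ValueError(f"empty step in propertypath: {raw_path!r}")
        paths.append(steps)
    return paths
```

For example, `parse_propertypaths("creator.name,contributor.name")` yields two paths of two steps each.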
- add missing OSFMAP shorthands
- fix: in `index-card-search`, do not show "next" link when no results
- more consistent pagination over randomly ordered results
- correct test setup for `trove_indexcard_flats`
- skip "first" link from first page
- disable pagination on large, randomly-sorted result sets
- more efficient random sort (for sorting by relevance to nothingness)
- remove `trove_indexcard` (fully replaced by `trove_indexcard_flats`)
- `trove_indexcard_flats` updates:
  - log search queries when in DEBUG mode
  - disable "unnamed filter values" aggregations (expensive and yet unused)
- fix: `trove_indexcard_flats` would clobber some iri values while flattening
- skip indexing cards that don't have `osfmap_json`
- more gracefully handle erroneously circular `skos:Concept` hierarchies
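One way to tolerate a circular hierarchy is to track visited concepts while walking upward. This is a sketch, not trove's code: it assumes each concept has at most one broader concept, held in a plain dict (`broader_by_concept`, a hypothetical structure):

```python
def broader_chain(concept, broader_by_concept):
    """Follow broader-concept links upward, stopping if a concept repeats."""
    chain = []
    seen = {concept}
    current = broader_by_concept.get(concept)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = broader_by_concept.get(current)
    return chain
```

With a cycle such as a -> b -> c -> a, the walk from `a` ends after `c` instead of looping forever.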
- lil optimization to skip unhelpful aggregations
- disable tests using elasticsearch5 on github actions
- (will soon reenable or remove elastic5 altogether)
- add `trove_indexcard_flats` index strategy
  - copy of `trove_indexcard` with flatter queries (and more info on the root doc)
- fix: allow more than 11 related properties on an `index-card-search` to have non-zero count
- small improvements to `trove_indexcard` index strategy
  - skip indexing metadata with `osfmap:contains` in the path (don't index file metadata with its container)
  - better consolidate `nested_iri` to reduce number of nested docs
- introducing "trove"
- store metadata records as small rdf documents called "index cards"
- ingest rdf
- add iri-centric search
- "shtrove": working to preserve back-compat (because trove may be trouble)
- make `SourceConfig.disabled` prevent `harvest` tasks from running
- downgrade to python 3.10 (for now)
- improve logging
  - replace `raven` (deprecated) with `sentry-sdk`
  - add logging formatter for json with `severity` (for logging in deployments)
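A JSON formatter with a `severity` key can be sketched with the standard library alone. The class name and the other field names here are assumptions for illustration, not the project's actual formatter:

```python
import json
import logging


class JsonSeverityFormatter(logging.Formatter):
    """Render each log record as one JSON object with a `severity` key."""

    def format(self, record):
        payload = {
            "severity": record.levelname,  # e.g. "WARNING"
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # include the formatted traceback when an exception is attached
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonSeverityFormatter())` on any `logging.Handler`.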
- remove squashed migrations, dead code
- fix a typo
- admin interface: allow re-ingesting all data for a source config (see "ingest" buttons at `/admin/share/sourceconfig/`)
- address possible cause of some backfill gaps
- fix logging errors
- upgrade to python 3.11
- upgrade to elasticsearch 8
- add `share.search.index_strategy` to act as a slippery abstraction layer between search-engine backend and planned friendly search api
  - configure two index strategies (and make it easy to add more in the future):
    - `sharev2_elastic5`: the existing/legacy SHAREv2 search index as exists on elasticsearch5 and exposed via `/api/v2/search/creativeworks/_search`
    - `sharev2_elastic8`: a mirror/replacement for `sharev2_elastic5` with all the same `_source` docs (but possible incompatibilities for the existing pass-thru api)
- add a happy-path index-backfill workflow to the admin interface at `/admin/search-indexes`
  - when changing index-strategy settings/mappings/whatever, the "happy path" is to create, backfill, verify a new copy of the index; then switch which is used for searching, verify again, and finally delete the old index.
  - not intended to have the power of a full elasticsearch management interface -- just enough visibility to see whether things are going ok and where to start looking if something goes wrong
- for testing, support `indexStrategy` query param to `/api/v2/search/creativeworks/_search`, `/api/feeds/rss`, `/api/feeds/atom`
  - may request a configured strategy (e.g. `indexStrategy=sharev2_elastic8`) or a specific version of an index within a strategy (e.g. `indexStrategy=sharev2_elastic8__bcaa90e8fa8a772580040a8edbedb5f727202d1fca20866948bc0eb0e935e51f`)
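A sketch of building such a test request URL. The host passed in is a placeholder, both helper names are hypothetical, and splitting on `__` is only inferred from the example value above:

```python
from urllib.parse import urlencode

SEARCH_PATH = "/api/v2/search/creativeworks/_search"


def search_url(base_url, index_strategy=None):
    """Build the search URL, adding `indexStrategy` only when one is requested."""
    url = base_url.rstrip("/") + SEARCH_PATH
    if index_strategy:
        url += "?" + urlencode({"indexStrategy": index_strategy})
    return url


def split_index_strategy(value):
    """Split 'name__version' into (name, version); version is None if absent (assumed format)."""
    name, _, version = value.partition("__")
    return name, (version or None)
```

Leaving `index_strategy` unset yields the plain search URL, which falls back to whatever strategy the server treats as default.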
- add `FeatureFlag` model, use it to switch default search strategy (`name="elastic_eight_default"`)
- add `suid` value to `sharev2_elastic` index
- easy additive elastic mapping changes
- add `osf_related_resource_types` field
- dockerfile updates
- update raven
- update and consolidate docs
- audit and upgrade all dependencies
- switch to github actions for tests/ci
- fix: feeds should not break on null date_published
- fix: oai_dc formatter breaks on deletions
- big rend! remove many things:
  - concepts:
    - merging data from multiple sources together (aiming instead for a simple, robust repository of metadata records -- let's talk later/soon about how we might do merging well)
  - models:
    - `ShareObject` and all its descendants
    - `ShareObjectVersion` and all its descendants
    - `Change`
    - `ChangeSet`
    - `SubjectTaxonomy`
    - `UnusedCeleryProviderTask`
    - `UnusedCeleryTask`
  - api routes:
    - all auto-generated `ShareObject` routes (e.g. `/api/v2/creativeworks/`)
    - all `schema` routes (except the root `/api/v2/schema/`)
      - auto-generated schema routes (e.g. `/api/v2/schema/disputes/`)
      - work type hierarchy (`/api/v2/schema/creativeworks/hierarchy/`)
    - `/api/v2/graph/`
- admin features/improvements
- add FormattedMetadataRecord admin
- when investigating a problem, start by finding the suid and navigate relationships from there
- add action to delete all FormattedMetadataRecords for some chosen suid(s) (good for spam control)
- fix a 500 error at `/api/v2/`
- fix sending useful debugging info to sentry
- make the oai-pmh feed respect switch-flipping
- give an accurate `date_created` in sharev2_elastic formatter
- fix admin bug -- don't hide the search box
- add django-debug-toolbar to dev dependencies
- tidy up some admin inefficiencies
- expose a few models in read-only json:api, so the frontend can be useful given a suid
  - `/api/v2/formattedmetadatarecords/`
  - `/api/v2/sourceconfigs/`
  - `/api/v2/suids/`
- add new atom/rss feeds that get results from the new backcompat index
  - `/api/v2/feeds/atom/`
  - `/api/v2/feeds/rss/`
  - (old feeds now deprecated, will be gone with ShareObject)
- add `--pls-reingest` arg to format_metadata_records command
- fix: facility != funder (in gov.clinicaltrials transformer)
- remove feature: oai_dc formatter no longer puts first author last
- add utility: `share.util.names.get_related_agent_name` for consistently getting an agent name from an "agent-work relation" node
  - if missing both `cited_as` and `name` (true of some old, unregulated production data), reluctantly apply some cultural assumptions and build a name from parts (`given_name`, `additional_name`, `family_name`, `suffix`)
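The fallback described above might look like the following. This is a sketch over a flat dict of fields, not the real `get_related_agent_name`; the dict shape, the function name, and the preference for `cited_as` over `name` are assumptions:

```python
NAME_PARTS = ("given_name", "additional_name", "family_name", "suffix")


def agent_name_from_fields(fields: dict) -> str:
    """Prefer an explicit name; otherwise join whatever name parts are present."""
    for key in ("cited_as", "name"):
        if fields.get(key):
            return fields[key]
    # cultural assumption: given name first, family name later, suffix last
    parts = (fields.get(part) for part in NAME_PARTS)
    return " ".join(part for part in parts if part)
```

An empty dict produces an empty string, so callers would still need to decide what to do with nameless agents.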
- bugfix: in share.util.graph, handle merging nodes with dictionary values
- bugfix: when formatting oai_dc, strip characters illegal in XML
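For reference, stripping characters that XML 1.0 does not allow can be done with one regex built from the spec's `Char` production. A sketch with a hypothetical function name, not the project's formatter code:

```python
import re

# Anything outside the XML 1.0 `Char` ranges:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)
```

Tabs, newlines, and carriage returns survive; other control characters such as NUL are dropped.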
- when regulating, discard gravatars as agent identifiers
- bugfix: deduping subjects in custom taxonomies
- fix up `populate_osf_suids` with more useful messaging
- improve "central node" guessing to handle old osf data on prod
- speed up `populate_osf_suids` -- exclude `NormalizedData` with null `raw`, since they'll be ignored anyway
- fix `populate_osf_suids` script to handle fun situations
- new model: `FormattedMetadataRecord`
- new sharectl commands:
  - `sharectl search purge`
  - `sharectl search setup <index_name>`
  - `sharectl search setup --initial`
  - `sharectl search set_primary <index_name>`
  - `sharectl search reindex_all_suids <index_name>`
- new management commands:
  - `format_metadata_records`
  - `populate_osf_suids`
- new doc: `README-docker-quickstart.md` -- the easy way to get started
- define the "share schema" statically (in `share.schema`)
  - stop inferring everything from the `ShareObject` models
- add a parallel ingestion path, preparing for a future without `ShareObject`
  - use only the most recent `NormalizedData` for each suid (no merging)
  - allow explicitly stating the suid when pushing a `NormalizedData`
    - if not specified, try looking for an OSF guid
  - build a `FormattedMetadataRecord` for each metadata format
  - currently two metadata formatters (and room for more):
    - `sharev2_elastic`: for a back-compatible elasticsearch index -- builds a document just like `share.search.fetchers.CreativeWorkFetcher`, but from a `NormalizedData` instead of all the `ShareObject` tables
    - `oai_dc`: dublin core XML, for the OAI-PMH feed
- indexer daemon overhaul
- assorted cleanup; dead/useless code removal
- add `ElasticManager` to encapsulate all requests sent to elasticsearch
- add `IndexSetup` concept to describe how to get/build documents for an index and what messages to send to that index's daemon
- currently two index setups:
  - `share_classic`: index by `AbstractCreativeWork` id, using existing `share.search.fetchers` logic
  - `postrend_backcompat`: index by `SourceUniqueIdentifier` id, using the `sharev2_elastic` `FormattedMetadataRecord`s
- add a parallel OAI-PMH that uses `FormattedMetadataRecord` with `oai_dc`
  - remains dormant for the moment -- enable with `pls_trove` query param
  - NOTE: when we switch over, OAI-PMH datestamps will all be new and recent
- admin updates:
  - search `IngestJob` by suid value
- Add a decorator for marking views deprecated
- Mark some views deprecated
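A framework-free sketch of what a "deprecated view" decorator could look like. The name `deprecated_view` and the `is_deprecated` attribute are hypothetical; the project's actual decorator may behave differently (for instance by setting response headers):

```python
import functools
import warnings


def deprecated_view(reason):
    """Mark a view callable as deprecated and warn each time it is called."""
    def decorator(view_func):
        @functools.wraps(view_func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{view_func.__name__} is deprecated: {reason}",
                DeprecationWarning,
                stacklevel=2,
            )
            return view_func(*args, **kwargs)
        wrapper.is_deprecated = True  # lets tooling list deprecated views
        return wrapper
    return decorator
```

Usage would be `@deprecated_view("use the new feed")` above a view function.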
- Sources added via API default to canonical
- Automatically schedule `ingest` tasks after harvesting
- Schedule `ingest` tasks in admin `reenqueue` action
- Pin `faker` to 4.0.3
- Update `.travis.yml`
- Fix bug in `io.osf.registrations` transformer
- Ensure order in oai-pmh
- Exclude frankenworks from oai-pmh
- Reduce oai-pmh page size
- Pin `graphql-relay` to a compatible version
- Dockerfile fixes & improvements
- Optimize oai-pmh endpoint to avoid timeouts
- Add `reindex_works` shell util
- Pin python-dateutil to a version that doesn't break tests (2.8.0)
- Temporarily (i hope) skip tests broken by 19.0.5
- Temporary fix to avoid slow IngestJob queries
- Possibly fix a rare forceingest error
- Skip indexing works with too many agent relations
- Make the indexer more configurable by environment variables
- Fix indexer deadlock
- Allow turning off ingestion (but not harvest) for non-canonical sources
- Ingestion perf improvements (faster attr access in MutableGraph)
- Handle indexer errors better
- Ingestion perf improvements
- Update `requests` dependency
- Make it easier to reingest all OSF data
- Fix worker out of memory errors
- Update nameparser dependency
- Add datacite oai-1.1 schema namespace
- Fix common datacite transform errors
- Update django to 1.11.16
- Clean up disambiguation logic to make extending it less painful
- Extend disambiguation to match contributors with different name formats
- Rename `fixpreprintdisambiguations` command to `forceingest`
- Handle more complex merges
- Improve error message for transformer errors
- Fix OSF registration transformer
- Update NSF harvester to look farther into the past
- Fix a bug in the OSF project harvester
- Fix --osf-only flag in fix_datacite command
- When a job is marked "skipped", not even `superfluous` will re-run it
- All retried jobs should be marked "rescheduled"
- Harvest jobs that are retried when the same source is already being harvested should be marked "rescheduled" rather than "failed"
- Handle OSF harvest errors gracefully
- Pin kombu to 4.1.0
- Harvest all set specs from CSIC
- Allow sorting Atom feed by `date_created` and `date_published`
- Don't create unnecessary source configs for each new source
- Update pytest-django dependency to avoid version conflict
- Fix bug in indexer daemon, stop all threads when one dies
- Fix typo in `sharectl ingest` that prevented bulk reingestion
- Fix date range filtering in com.figshare.v2 harvester
- Bulk reingestion with `IngestScheduler.bulk_reingest()` and `sharectl ingest`
- Admin interface updates
- More stable and reliable indexer daemon
- "Urgent" queues for ingestion and indexing, allowing pushed data to jump ahead of harvested data
- Various source config updates
- Fix PeerJ transformer error
- Prevent infinite task loop for certain types of errors
- Update raw data janitor to skip over datums from disabled/deleted sources
- Fix bug in fixpreprintdisambiguations command
- Fix a broken test
- Fix some time-sensitive tests
- Add IngestJob, used to keep track of a RawDatum's ingestion status
  - Exposed in API at `/api/v2/ingestjobs/`
  - In the response to pushed data, include a link to the IngestJob
- Rename HarvestLog to HarvestJob
- Combine `transform` and `disambiguate` tasks into `ingest` task
- Catch all errors caused by bad input data, store them on the IngestJob
- Add Regulator, a place to put logic/transforms/validation that should run on all data, regardless of source
- Fix: Prevent indexer daemon threads from exiting when elasticsearch times out
- Map work relation types in MODS transformer
- Update edu.utah source config to include more approved sets
- Update edu.umassmed source config to use HTTPS
- Update pendulum dependency to avoid infinite janitor loop
- Fix elasticsearch_janitor task
- Expect (and give) str arguments, avoiding error
- Use the indexer daemon by default
- Speed up update_elasticsearch task:
- Don't count the works just for a log message
- Use the indexer daemon by default, instead of index_model tasks
- Only run one update_elasticsearch task at a time
- Add `--delete-related` and `--superfluous` flags to `enforce_set_lists`
- Improve script output by including ids in ShareObject.repr
- Devops updates for new environment
- Actually speed up OAI feed
- Speed up OAI feed when filtering by `set`
- Delete merged works with no identifiers in `fixpreprintdisambiguations`
- Allow omitting arXiv from `fix_datacite` script
- Add parameters to `fix_datacite` script
- Use normalized agent name in Atom feed, instead of `cited_as`
- Update psycopg dependency
- Type map for Columbia Academic Commons (edu.columbia)
- Type map for University of Cambridge (uk.cambridge)
- Allow reading/writing `Source.canonical` at `/api/v2/sources/`
- Include `<author>` in atom feed at `/api/v2/atom/`
- ScholarsArchive@OSU source config for their new API
- Prevent OSF harvester from being throttled
- Update NSFAwards harvester/transformer to include more fields
- Use request context to build URLs in the API instead of SHARE_API_URL setting
  - Stop displaying `localhost:8000` links
- Add `--from` parameter to `fixpreprintdisambiguations` management command
- Support for set blacklists for sources that follow OAI-PMH protocol
- `enforce_set_lists` command to enforce set blacklist and whitelist
- Set whitelist for UA Campus Repository
- Support for encrypted json field and start using it in SourceConfig model
- Enable Coveralls
- Include work lineage (based on IsPartOf relations) in the search index payload
- Add `self` links to objects returned by the API
- Collect metadata in MODS format from UA Campus Repository
- Update columbia.edu harvester source config (disabled set to false)
- Improve creating Sources at `/api/v2/sources/`
  - Use POST to create, PATCH to update
  - Respond with sensible status codes (409 on name conflict, etc.)
- Backfill CHANGELOG.md to include `2.10.0` and `2.11.0`
- Correctly encode &, <, > characters in the Atom feed
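For illustration, the standard library already escapes these three characters. This sketch shows the general technique, not the feed's actual code, and the wrapper name is hypothetical:

```python
from xml.sax.saxutils import escape


def atom_text(value: str) -> str:
    """Escape &, <, and > so the value is safe as XML character data."""
    return escape(value)
```

Note that `escape` replaces `&` first, so text that is already escaped would be escaped a second time.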
- Avoid DB connection leak by disabling persistent connections
- `editsubjects` management command to modify `share/subjects.yaml`
- Replace `share/models/subjects.json` with `share/subjects.yaml`
- Update central subjects taxonomy to match Bepress' 2017-07 update
- Symbiota as a source
- AEA as a source
- Used django-include for a faster OAI-PMH endpoint
- Updated regex for compatibility with Python 3.6
- University of Arizona as a source
- NAU Open Knowledge as a source
- Started collecting analytics on source APIs (response time, etc.)
- Support for custom taxonomies
- sharectl command line tool
- Profiling middleware for local development
- Janitor tasks to find and process unprocessed data
- Timestamp field to RawData
- Mendeley Harvester!
- Started to use deprecation warning
- Timeouts for harvests
- The concept of "Bots"
- A lot of dead code
- A GPL licenced library
- Upgraded to Celery 4.0
- Deleted works now return 403s from the API
- Deleted works are now excluded from the API
- Corrected the date fields used to audit the Elasticsearch index
- Strongly defined the Harvester interface
- Harvests are now scheduled in a more friendly manner
- Updated the configurations for many OAI sources
- HarvestLogs no longer get stuck in progress
- Text parsing transformer utilities
- MODS transformer looks at the location field in addition to other fields for a work identifier
- Elasticsearch Janitor task to keep Postgres and ES in sync
- Concurrently added indexes
- Admin updates to allow quicker fixing of broken data
- More test coverage
- Elasticsearch's scroll API explicitly disabled
- Upgraded to Django 1.11
- Elasticsearch now pulls last_modified from itself rather than Postgres
- API pagination no longer times out on large collections
- Timestamps are now included in the ATOM feed
- OAI endpoint
- Sources
- OpenBU
- Updated documentation
- Sources
- A table for managing SHARE data sources
- Replaces the apps in the providers folder
- SourceConfigs
- A table for managing different methods of acquiring data from a given source
- Replaces nested apps/app labels
- HarvestLogs
- First class support for managing harvesting/back harvesting
- Source Unique Identifiers
- First class representation of what was RawData.provider_doc_id
- The Django admin now supports starting harvesters over long periods of time
- Support for the MODS OAI-PMH prefix
- Provider Django applications have been removed
- Source specific fields have been removed from ShareUser
- Harvesters have been relocated to share/harvesters/
- Various renaming/vocabulary changes
- RawData -> RawDatum
- Favicon -> Icon
- Provider -> Source
- Provider App -> SourceConfig
- Normalizer -> Transformer
- Updates to the getting started guide
- Squashed migrations to speed up local development
- Harvesters are now expected to return utf-8 strings
- Sources are no longer tied to the ShareUser model
- Title now has an "exact" multi-field in elasticsearch
- A robot that archives old succeeded celery jobs
- New Harvesters
- Scholarly Commons @ JMU
- Compensate for potential race conditions with the push API
- New Harvesters
- Research Registry Harvester
- SSOAR
- Status API endpoint
- Updated set_specs for University of Kansas
- ClinicalTrials.gov now outputs registrations
- Source icons are now stored in the database
- Removed "Notify" from the page title in the browsable API
- Support for OSF Registries
- New Harvesters
- University of Utah
- Updated the API
- Improved Elasticsearch mappings
- Updated NIH and NSFAwards
- Affiliations are now gathered
- Non-Unique URLs are no longer collected
- Lots of under-the-hood changes to make devs' lives easier
- New Harvesters
- es.csic
- edu.purdue.epubs
- Site status banners
- Retraction harvesting
- A little bit of documentation
- OAuth login failure pages look nice now
- Cascade deletes are now implemented as database cascades
- New Harvesters
- edu.cornell
- edu.richmond
- edu.scholarworks_montana
- edu.ucf
- edu.umd
- edu.utahstate
- org.seafdec
- Relations between creative works
- Updated harvesters
- Figshare v2 API
- PeerJ XML API
- Pubmed PMC prefix
- Datacite 4.0
- BePress Taxonomy for subjects
- Travis now uses postgres 9.5
- Comprehensive test suite for normalization and disambiguation
- Updated data model
- More expressive relations between people/organizations and works
- Type hierarchies
- Creative works: Publication, Preprint, DataSet, Patent, Thesis, Software, etc.
- Agents: Person, Organization, Institution, Consortium
- More aggressive and intelligent data parsing
- Stricter validation of incoming data
- Prune duplicate objects from submitted changesets
- Various bug fixes
- Formalized disambiguation methods
- App bootstrap time improved by 4x
- Better elasticsearch mappings
- URI may now be searched/matched directly
- Prettier table names
- Backport of the V1 push API
- New and improved source registration form
- JSON schema endpoint
- New sources
- College of William and Mary
- University of Wisconsin