Show recent deleted pages #102
Labels: enhancement
Draft approach using ES based on "search deleted phrases":
GET _search
{
  // don't return any hits; all the data comes from aggregations
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dataset": "web_objects_revisions"
        }
      }
    }
  },
  "aggs": {
    "top-urls": {
      // divide all revisions into buckets, one object/resource per bucket
      "terms": {
        "field": "data.web_objects_revisions.object_id",
        "size": 10
        //"order": {
        //  carried over from the phrase-search draft, where order was affected by a `should` query (not present here)
        //  "data.web_objects_revisions.timestamp": "desc"
        //}
      },
      "aggs": {
        "top_hits": {
          "top_hits": {
            // one revision per bucket (object/resource) is enough
            "size": 1
          }
        },
        // last matching revision (the page answered with HTTP 200)
        "matching": {
          "filter": {
            "term": {
              "http_code": 200
            }
          },
          "aggs": { "last_seen": { "max": { "field": "data.web_objects_revisions.timestamp" }}}
        },
        // first non-matching revision (the page stopped answering with HTTP 200)
        "not_matching": {
          "filter": {
            "bool": {
              "must_not": {
                "term": {
                  "http_code": 200
                }
              }
            }
          },
          "aggs": { "first_seen": { "min": { "field": "data.web_objects_revisions.timestamp" }}}
        },
        // show only deleted pages:
        // bucket_selector keeps only the objects whose first non-200 revision is later than their last 200 revision
        "deleted phrases": {
          "bucket_selector": {
            "buckets_path": {
              "first_not_matching_date": "not_matching.first_seen",
              "last_matching_date": "matching.last_seen"
            },
            "script": "params.first_not_matching_date > params.last_matching_date"
          }
        }
      }
    }
  }
}
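To make the selection rule in the draft query easier to check, here is a minimal Python sketch of the same comparison done in memory. It assumes each revision is a dict with `object_id`, `timestamp` and `http_code` keys (names borrowed from the query fields); the function name `find_deleted_objects` is hypothetical and not part of the project.

```python
def find_deleted_objects(revisions):
    """Return ids of objects whose first non-200 revision is later than their last 200 revision.

    `revisions` is an iterable of dicts with `object_id`, `timestamp`
    and `http_code` keys (an assumed shape, mirroring the query fields).
    """
    last_ok = {}       # object_id -> latest timestamp with HTTP 200 ("matching.last_seen")
    first_not_ok = {}  # object_id -> earliest timestamp with any other code ("not_matching.first_seen")
    for rev in revisions:
        oid, ts = rev["object_id"], rev["timestamp"]
        if rev["http_code"] == 200:
            last_ok[oid] = max(ts, last_ok.get(oid, ts))
        else:
            first_not_ok[oid] = min(ts, first_not_ok.get(oid, ts))
    # Same comparison as the bucket_selector script; objects missing either side are dropped.
    return sorted(
        oid for oid in last_ok
        if oid in first_not_ok and first_not_ok[oid] > last_ok[oid]
    )


sample = [
    {"object_id": 1, "timestamp": 10, "http_code": 200},
    {"object_id": 1, "timestamp": 20, "http_code": 404},
    {"object_id": 2, "timestamp": 10, "http_code": 200},
    {"object_id": 3, "timestamp": 5, "http_code": 500},
    {"object_id": 3, "timestamp": 10, "http_code": 200},
    {"object_id": 3, "timestamp": 20, "http_code": 404},
]
print(find_deleted_objects(sample))  # [1]
```

Object 3 in the sample is not reported even though its latest revision is a 404: its earliest non-200 revision predates its last 200 revision, so the min/max comparison fails. Whether that is acceptable for this feature is an open question for the draft.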
Early draft of SQL approach:
SELECT r.object_id, r.timestamp, r.code, last_problematic_revisions.timestamp, last_problematic_revisions.code
FROM web_objects_revisions r
INNER JOIN
-- THIS IS MISSING: (SELECT code, timestamp FROM web_objects_revisions WHERE id = )
(SELECT object_id, MAX(id) FROM web_objects_revisions WHERE code >= 400 GROUP BY object_id) AS last_problematic_revisions
ON r.object_id = last_problematic_revisions.object_id;
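One possible way to complete the draft is to join the per-object `MAX(id)` back to the revisions table on `id`, which yields the timestamp and code of the last problematic revision without the missing subquery. The sketch below runs that idea against an in-memory SQLite database. The column set (`id`, `object_id`, `timestamp`, `code`), the assumption that `id` grows with time, and the helper names are all assumptions, not the project's actual schema or code.

```python
import sqlite3

# Assumed rewrite of the draft: pick the highest id with code >= 400 per object,
# then join back on id to read that revision's timestamp and code.
QUERY = """
SELECT r.object_id, r.timestamp, r.code
FROM web_objects_revisions r
INNER JOIN (
    SELECT object_id, MAX(id) AS last_id
    FROM web_objects_revisions
    WHERE code >= 400
    GROUP BY object_id
) AS last_problematic_revisions
ON r.id = last_problematic_revisions.last_id
ORDER BY r.timestamp DESC;
"""


def demo_connection():
    """Build an in-memory table with a hypothetical schema and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_objects_revisions ("
        "id INTEGER PRIMARY KEY, object_id INTEGER, timestamp TEXT, code INTEGER)"
    )
    conn.executemany(
        "INSERT INTO web_objects_revisions VALUES (?, ?, ?, ?)",
        [
            (1, 1, "2018-09-01", 200),
            (2, 1, "2018-09-05", 404),
            (3, 2, "2018-09-02", 200),
            (4, 3, "2018-09-03", 500),
            (5, 3, "2018-09-10", 410),
        ],
    )
    return conn


def last_problematic_revisions(conn):
    """Return (object_id, timestamp, code) of each object's last revision with code >= 400."""
    return conn.execute(QUERY).fetchall()


print(last_problematic_revisions(demo_connection()))
```

This only finds the last problematic revision per object; it does not check whether a later revision answered 200 again, so an extra comparison would still be needed to decide that a page is currently deleted.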
KrzysztofMadejski added two commits that referenced this issue on Sep 11, 2018
@danielmacyszyn ready for styling: http://archiwum.io/deleted-pages
Deleted page: the URL answers with 404.
Possibly deleted page: the URL may answer with a redirect.
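The two cases above could be expressed as a small status-code check. This is only a sketch: the function name `classify_response`, the returned labels, and treating every 3xx code as a redirect are assumptions, not something the issue specifies.

```python
def classify_response(status_code):
    """Map an HTTP status code to the categories described in this issue.

    The labels are hypothetical; codes other than 404 and 3xx are
    left unclassified because the issue does not say how to treat them.
    """
    if status_code == 404:
        return "deleted"
    if 300 <= status_code < 400:
        return "possibly deleted"
    return "other"


for code in (200, 301, 404):
    print(code, classify_response(code))
```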