
Improve handling of apparent duplicates from EQ realtime #60

Open
timlinux opened this issue Feb 2, 2017 · 9 comments

timlinux (Contributor) commented Feb 2, 2017

  • When we receive duplicate shakes, we should only process one.
    • remove the second version
  • Use the shake id as a way to determine whether it is a duplicate (see the sketch after this list)
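
A minimal sketch of that shake-id based check, assuming a simple in-memory set of already processed ids; the function and variable names are illustrative only, not existing Realtime code:

```python
processed_shake_ids = set()


def should_process(shake_id):
    """Return True the first time a shake id is seen, False for duplicates."""
    if shake_id in processed_shake_ids:
        # Apparent duplicate: skip the second version.
        return False
    processed_shake_ids.add(shake_id)
    return True
```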
timlinux added the ready label Feb 2, 2017
lucernae (Contributor) commented Feb 6, 2017

Shouldn't we update with the newer one?

timlinux (Contributor, Author) commented Feb 6, 2017

They are both the same event. @ivanbusthomi shared his Google doc explaining this with you. Can you take a look at that?

lucernae added the bug label Feb 6, 2017
gubuntu changed the title from "We need to remove duplicates from EQ realtime" to "Improve handling of apparent duplicates from EQ realtime" Mar 2, 2017
gubuntu (Collaborator) commented Mar 2, 2017

  • There are up to three stages or levels of shakemap for the same location. They appear to be duplicates at present because they lack distinguishing metadata.
  • @lucernae is figuring out how to manage unique IDs and is requesting extra metadata from BNPB to clearly distinguish them.
  • @lucernae will figure out a way to process, display and push each one.

gubuntu (Collaborator) commented Sep 26, 2017

From a conversation with Pak Yedi from BMKG:

The shakemap id is purely a timestamp reflecting generation time. There should not be duplicate IDs in the 'raw' (modelled) shakemaps from his department (earthquakes and tsunamis).

Shakemaps coming from Pak Hartadi's department (earthquake engineering) are generated from scratch completely separately from accelerometer data and will most likely have a different id for the same event.

So one can assume duplicates are events that match on location, depth, magnitude etc.
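
A minimal sketch of the matching rule this implies, in Python; the tolerance values and event field names here are assumptions for illustration, not values agreed in this thread:

```python
from datetime import timedelta

# Hypothetical tolerances; the real values would need to be agreed with BMKG.
TIME_TOLERANCE = timedelta(minutes=10)
LOCATION_TOLERANCE = 0.1   # degrees of latitude/longitude
DEPTH_TOLERANCE = 10.0     # km
MAGNITUDE_TOLERANCE = 0.3  # magnitude units


def is_apparent_duplicate(event_a, event_b):
    """Return True if two shake events look like the same physical event.

    Both arguments are assumed to be dicts with ``time`` (datetime),
    ``lat``, ``lon``, ``depth`` and ``magnitude`` keys read from grid.xml.
    """
    return (
        abs(event_a['time'] - event_b['time']) <= TIME_TOLERANCE
        and abs(event_a['lat'] - event_b['lat']) <= LOCATION_TOLERANCE
        and abs(event_a['lon'] - event_b['lon']) <= LOCATION_TOLERANCE
        and abs(event_a['depth'] - event_b['depth']) <= DEPTH_TOLERANCE
        and abs(event_a['magnitude'] - event_b['magnitude']) <= MAGNITUDE_TOLERANCE
    )
```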

Charlotte-Morgan (Member) commented:

Please use the following logic to search for and handle apparent duplicates. This logic handles wanted and unwanted duplicates and will help to identify links between original and post-processed shakemaps.

[image: duplicate-handling logic diagram]
cc @hadighasemi @RikkiWeber

lucernae (Contributor) commented Dec 22, 2017

Some questions from me:

  1. What are the folders?

Let's suppose BMKG provides us with two folders: one for initial shakemaps, which we call shakemaps, and a second for processed shakemaps, which we call shakemaps-processed. In the shakemaps folder, BMKG has a directory structure like this: shakemaps\<shake_id_for_initial_shakemaps>\grid.xml. In shakemaps-processed, the directory structure is something like this: shakemaps-processed\<shake_id_for_processed_shakemaps>\output\grid.xml.

Now, when you said "if the duplicate shakemaps are in the same folder", which of these did you mean?

shakemaps\20171222051458 is the folder, and if in the future this folder were updated (the grid.xml is changed), do we call it the same folder?

or:

shakemaps is the folder, and if in the future we have more subfolders, such as 20171222051458 and 20171220095545, do we process them all?

Then, when you said "if the duplicate shakemaps are in different folders", this means shakemaps is one folder and shakemaps-processed is the other. So, in the case of two grid.xml files containing (maybe) the exact same shake id but coming from different folders, one is initial (from shakemaps) and one is processed (from shakemaps-processed).

  2. Which variable is which?

To identify duplicate shakemaps, which fields in grid.xml should we look at? This is just to clarify.

For comparison, this is a screenshot of grid.xml from the initial shakemaps:
[screenshot: grid.xml from an initial shakemap]

and this is from the processed shakemaps:

[screenshot: grid.xml from a processed shakemap]

Possible field candidates (see the sketch after question 4 below):

  • event_time: event\event_timestamp
  • event_magnitude: event\magnitude
  • location: event\lat and event\lon
  • shake_id: shakemap_grid\shakemap_id or shakemap_grid\event_id (not sure about the difference, but we want the most consistent one)
  3. Is it possible to have one initial shakemap, but more than one processed shakemap linked to it? This is to handle the possibility of duplicates among the processed shakemaps.

  4. Which shakemap product/report should we link to InAWARE?
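
To make the folder layout and the candidate fields above concrete, here is a minimal sketch of how both trees could be scanned and the fields read from each grid.xml. The directory patterns and element/attribute names are taken from the discussion above but have not been verified against a real grid.xml, so treat them as assumptions:

```python
import glob
import os
import xml.etree.ElementTree as ET

# Assumed layout, as described above:
#   shakemaps/<shake_id>/grid.xml                    (initial)
#   shakemaps-processed/<shake_id>/output/grid.xml   (post-processed)
GRID_PATTERNS = {
    'initial': os.path.join('shakemaps', '*', 'grid.xml'),
    'processed': os.path.join('shakemaps-processed', '*', 'output', 'grid.xml'),
}


def read_candidate_fields(grid_path):
    """Read the candidate comparison fields from one grid.xml.

    Assumes the root element is shakemap_grid (carrying shakemap_id and
    event_id attributes) with a child event element carrying the
    event_timestamp, magnitude, lat, lon and depth attributes.
    """
    root = ET.parse(grid_path).getroot()
    # grid.xml usually declares an XML namespace; match on the local name
    # of the tag so the namespace can be ignored here.
    event = next(child for child in root if child.tag.endswith('event'))
    return {
        'shake_id': root.get('shakemap_id') or root.get('event_id'),
        'time': event.get('event_timestamp'),
        'magnitude': float(event.get('magnitude')),
        'lat': float(event.get('lat')),
        'lon': float(event.get('lon')),
        'depth': float(event.get('depth')),
    }


def scan_all_grids():
    """Yield (source, path, fields) for every grid.xml in both folders."""
    for source, pattern in GRID_PATTERNS.items():
        for path in glob.glob(pattern):
            yield source, path, read_candidate_fields(path)
```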

Implementation Plan

Depending on the answers, we can plan how to manage these different products. The way I previously did it is (see the model sketch after this list):

  1. event_id/shake_id is no longer a primary key in the database, so we can't rely on it as a unique key. This will affect many parts of the code that assume shake_id is unique.
  2. A new column (in the database) for the shake event needs to be added to differentiate the product type. I was thinking of it as a source column.
  3. Create another table to manage the link relation between these shakemap products, so the relationship is handled by a database table rather than code.
  4. Create an appropriate UI for these linked shakemap products.
  5. Need to change the logic and parameters of the Shakemap REST API.
  6. Need to change the download filename and URL.
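
A rough sketch of what points 1 to 3 could look like as Django-style models; the class and field names (ShakeEvent, source, ShakeEventLink) are placeholders for illustration, not the actual Realtime schema:

```python
from django.db import models


class ShakeEvent(models.Model):
    """One shakemap product; shake_id alone is no longer assumed unique."""

    SOURCE_INITIAL = 'initial'
    SOURCE_PROCESSED = 'processed'
    SOURCE_CHOICES = (
        (SOURCE_INITIAL, 'Initial shakemap'),
        (SOURCE_PROCESSED, 'Post-processed shakemap'),
    )

    shake_id = models.CharField(max_length=50)
    # Point 2: a source column to differentiate the product type.
    source = models.CharField(max_length=20, choices=SOURCE_CHOICES)

    class Meta:
        # Point 1: shake_id is only unique per source, not globally.
        unique_together = ('shake_id', 'source')


class ShakeEventLink(models.Model):
    """Point 3: a separate table pairing an initial shakemap with its
    post-processed counterpart, so the relationship lives in the database
    rather than in code."""

    initial = models.ForeignKey(
        ShakeEvent, on_delete=models.CASCADE, related_name='processed_links')
    processed = models.OneToOneField(
        ShakeEvent, on_delete=models.CASCADE, related_name='initial_link')
```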

cc @Charlotte-Morgan @hadighasemi @RikkiWeber

Charlotte-Morgan (Member) commented:

  1. Folders are shakemaps and shakemaps-processed, i.e. the top level of the structure for the two different sources of shakemaps from BMKG. There are only two sources; to find and process duplicates, the code will look within each source (both of them) and then between sources.

  2. For the variables, I would assume we are talking about the variables you already use in the reporting, and you can continue to use the same ones. These look to me like the event variables you list above, except for shakemap_id, which is a shakemap_grid variable.

  3. We should expect that duplicates can occur in either source or in both. And it may be that a 'duplicate' could be a triplicate, i.e. three events with variables within the search parameters.

If the process runs smoothly, the single record for a shakemap event (from the initial source) will be the most recent of any duplicates and will use the shakemap_id of this event. Its paired record will also be the most recent of any duplicates among the shakemap_processed events (from the post-processed source). This paired record will assume a linking shakemap_id the same as its duplicate from the shakemap source (see the sketch after point 4 below).
I am assuming that it's not likely to get a shakemap_processed without having a duplicate in the shakemap source.

  4. I am assuming you already link on the shakemap_id, and that the proposed process for identifying and removing duplicates and for linking post-processed reports to initial shakemaps will not change the method for linking to InAWARE.
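
One way to read the pairing rule described above, as an illustrative Python sketch; the dict keys and the pre-grouped candidates are assumptions, not existing Realtime code:

```python
def pair_duplicates(candidates):
    """Collapse a group of matching shake events into one initial/processed pair.

    ``candidates`` is assumed to be a list of dicts with ``source``
    ('initial' or 'processed'), ``shake_id`` and ``time`` keys, already
    grouped by the duplicate-matching step. Keeps only the most recent
    record from each source and gives the processed record the shake_id
    of its initial counterpart for linking.
    """
    initial = [c for c in candidates if c['source'] == 'initial']
    processed = [c for c in candidates if c['source'] == 'processed']

    latest_initial = max(initial, key=lambda c: c['time']) if initial else None
    latest_processed = max(processed, key=lambda c: c['time']) if processed else None

    if latest_initial and latest_processed:
        # The paired processed record assumes the initial shake_id as its link.
        latest_processed['linked_shake_id'] = latest_initial['shake_id']

    return latest_initial, latest_processed
```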

Maybe @gubuntu or @timlinux can comment on the implementation plan

lucernae (Contributor) commented:

  1. OK, it's clear now.
  2. Yes, this one is just for clarification, in case I'm using the wrong variable. So, since the shake id of the processed shakemaps is very long (and different), we should use the shake id of the initial shakemaps (to be shown in the table), right?
  3. Basically, there can't be more than one shakemaps_processed linked to an initial shakemap at one point in time (the latest shakemaps_processed will overwrite the old one, if any), right?
  4. What I meant by this is the product URL to link to InAWARE. Do we include both PDF reports (initial and processed shakemaps) for each language (which means a total of 4 Realtime product URLs listed in InAWARE), or do we only include the initial shakemaps, like the current one?

Charlotte-Morgan (Member) commented:

  1. yes
  2. no
  3. ideally all 4
