
Improve handling of apparent duplicates from EQ realtime #60

Open
timlinux opened this issue Feb 2, 2017 · 9 comments

timlinux (Contributor) commented Feb 2, 2017

  • When we receive duplicate shakes, we should only process one.
    • remove the second version
  • Use the shake id as a way to determine whether it is a duplicate (see the sketch after this list)
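
A minimal sketch of that shake-id based check, assuming a simple in-memory set of already processed ids; the function and variable names are illustrative only, not existing Realtime code:

```python
processed_shake_ids = set()


def should_process(shake_id):
    """Return True the first time a shake id is seen, False for duplicates."""
    if shake_id in processed_shake_ids:
        # Apparent duplicate: skip the second version.
        return False
    processed_shake_ids.add(shake_id)
    return True
```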
timlinux added the ready label Feb 2, 2017
lucernae (Contributor) commented Feb 6, 2017

Shouldn't we update with the newer one?

timlinux (Contributor, Author) commented Feb 6, 2017

They are both the same event. @ivanbusthomi shared his Google doc explaining this with you. Can you take a look at that?

lucernae added the bug label Feb 6, 2017
gubuntu changed the title from "We need to remove duplicates from EQ realtime" to "Improve handling of apparent duplicates from EQ realtime" Mar 2, 2017
gubuntu (Collaborator) commented Mar 2, 2017

  • There are up to three stages or levels of shakemap for the same location. They appear to be duplicates at present because they lack distinguishing metadata.
  • @lucernae is figuring out how to manage unique IDs and is requesting extra metadata from BNPB to clearly distinguish them.
  • @lucernae will figure out a way to process, display and push each one.

gubuntu (Collaborator) commented Sep 26, 2017

From a conversation with Pak Yedi from BMKG:

The shakemap id is purely a timestamp reflecting generation time. There should not be duplicate IDs in the 'raw' (modelled) shakemaps from his department (earthquakes and tsunamis).

Shakemaps coming from Pak Hartadi's department (earthquake engineering) are generated from scratch completely separately from accelerometer data and will most likely have a different id for the same event.

So one can assume duplicates are events that match on location, depth, magnitude etc.
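
A minimal sketch of the matching rule this implies, in Python; the tolerance values and event field names here are assumptions for illustration, not values agreed in this thread:

```python
from datetime import timedelta

# Hypothetical tolerances; the real values would need to be agreed with BMKG.
TIME_TOLERANCE = timedelta(minutes=10)
LOCATION_TOLERANCE = 0.1   # degrees of latitude/longitude
DEPTH_TOLERANCE = 10.0     # km
MAGNITUDE_TOLERANCE = 0.3  # magnitude units


def is_apparent_duplicate(event_a, event_b):
    """Return True if two shake events look like the same physical event.

    Both arguments are assumed to be dicts with ``time`` (datetime),
    ``lat``, ``lon``, ``depth`` and ``magnitude`` keys read from grid.xml.
    """
    return (
        abs(event_a['time'] - event_b['time']) <= TIME_TOLERANCE
        and abs(event_a['lat'] - event_b['lat']) <= LOCATION_TOLERANCE
        and abs(event_a['lon'] - event_b['lon']) <= LOCATION_TOLERANCE
        and abs(event_a['depth'] - event_b['depth']) <= DEPTH_TOLERANCE
        and abs(event_a['magnitude'] - event_b['magnitude']) <= MAGNITUDE_TOLERANCE
    )
```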

Charlotte-Morgan (Member) commented:

Please use the following logic to search for and handle apparent duplicates. This logic handles wanted and unwanted duplicates and will help to identify links between original and post-processed shakemaps.

[image: duplicate-handling logic diagram]
cc @hadighasemi @RikkiWeber

lucernae (Contributor) commented Dec 22, 2017

Some questions from me:

  1. What are the folders?

Let's suppose BMKG provides us with two folders: one for initial shakemaps, which we call shakemaps, and a second for processed shakemaps, which we call shakemaps-processed. In the shakemaps folder, BMKG has a directory structure like this: shakemaps\<shake_id_for_initial_shakemaps>\grid.xml. In shakemaps-processed, the directory structure is something like this: shakemaps-processed\<shake_id_for_processed_shakemaps>\output\grid.xml.

Now, when you said "if the duplicate shakemaps are in the same folder", which of these did you mean?

shakemaps\20171222051458 is the folder, and if in the future this folder were updated (the grid.xml is changed), do we call it the same folder?

or:

shakemaps is the folder, and if in the future we have more subfolders, such as 20171222051458 and 20171220095545, do we process them all?

Then, when you said "if the duplicate shakemaps are in different folders", this means shakemaps is one folder and shakemaps-processed is the other. So, in the case of two grid.xml files containing (maybe) the exact same shake id but coming from different folders, one is initial (from shakemaps) and one is processed (from shakemaps-processed).

  2. Which variable is which?

To identify duplicate shakemaps, which fields in grid.xml should we look at? This is just to clarify.

For comparison, this is a screenshot of grid.xml from the initial shakemaps:
[screenshot: grid.xml from an initial shakemap]

and this is from the processed shakemaps:

[screenshot: grid.xml from a processed shakemap]

Possible field candidates (see the sketch after question 4 below):

  • event_time: event\event_timestamp
  • event_magnitude: event\magnitude
  • location: event\lat and event\lon
  • shake_id: shakemap_grid\shakemap_id or shakemap_grid\event_id (not sure about the difference, but we want the most consistent one)
  3. Is it possible to have one initial shakemap, but more than one processed shakemap linked to it? This is to handle the possibility of duplicates among the processed shakemaps.

  4. Which shakemap product/report should we link to InAWARE?
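
To make the folder layout and the candidate fields above concrete, here is a minimal sketch of how both trees could be scanned and the fields read from each grid.xml. The directory patterns and element/attribute names are taken from the discussion above but have not been verified against a real grid.xml, so treat them as assumptions:

```python
import glob
import os
import xml.etree.ElementTree as ET

# Assumed layout, as described above:
#   shakemaps/<shake_id>/grid.xml                    (initial)
#   shakemaps-processed/<shake_id>/output/grid.xml   (post-processed)
GRID_PATTERNS = {
    'initial': os.path.join('shakemaps', '*', 'grid.xml'),
    'processed': os.path.join('shakemaps-processed', '*', 'output', 'grid.xml'),
}


def read_candidate_fields(grid_path):
    """Read the candidate comparison fields from one grid.xml.

    Assumes the root element is shakemap_grid (carrying shakemap_id and
    event_id attributes) with a child event element carrying the
    event_timestamp, magnitude, lat, lon and depth attributes.
    """
    root = ET.parse(grid_path).getroot()
    # grid.xml usually declares an XML namespace; match on the local name
    # of the tag so the namespace can be ignored here.
    event = next(child for child in root if child.tag.endswith('event'))
    return {
        'shake_id': root.get('shakemap_id') or root.get('event_id'),
        'time': event.get('event_timestamp'),
        'magnitude': float(event.get('magnitude')),
        'lat': float(event.get('lat')),
        'lon': float(event.get('lon')),
        'depth': float(event.get('depth')),
    }


def scan_all_grids():
    """Yield (source, path, fields) for every grid.xml in both folders."""
    for source, pattern in GRID_PATTERNS.items():
        for path in glob.glob(pattern):
            yield source, path, read_candidate_fields(path)
```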

Implementation Plan

Depending on the answers, we can plan how to manage these different products. The way I previously did it is (see the model sketch after this list):

  1. event_id/shake_id is no longer a primary key in the database, so we can't rely on it as a unique key. This will affect many parts of the code that assume shake_id is unique.
  2. A new column (in the database) for the shake event needs to be added to differentiate the product type. I was thinking of it as a source column.
  3. Create another table to manage the link relation between these shakemap products, so the relationship is handled by a database table rather than code.
  4. Create an appropriate UI for these linked shakemap products.
  5. Need to change the logic and parameters of the Shakemap REST API.
  6. Need to change the download filename and URL.
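
A rough sketch of what points 1 to 3 could look like as Django-style models; the class and field names (ShakeEvent, source, ShakeEventLink) are placeholders for illustration, not the actual Realtime schema:

```python
from django.db import models


class ShakeEvent(models.Model):
    """One shakemap product; shake_id alone is no longer assumed unique."""

    SOURCE_INITIAL = 'initial'
    SOURCE_PROCESSED = 'processed'
    SOURCE_CHOICES = (
        (SOURCE_INITIAL, 'Initial shakemap'),
        (SOURCE_PROCESSED, 'Post-processed shakemap'),
    )

    shake_id = models.CharField(max_length=50)
    # Point 2: a source column to differentiate the product type.
    source = models.CharField(max_length=20, choices=SOURCE_CHOICES)

    class Meta:
        # Point 1: shake_id is only unique per source, not globally.
        unique_together = ('shake_id', 'source')


class ShakeEventLink(models.Model):
    """Point 3: a separate table pairing an initial shakemap with its
    post-processed counterpart, so the relationship lives in the database
    rather than in code."""

    initial = models.ForeignKey(
        ShakeEvent, on_delete=models.CASCADE, related_name='processed_links')
    processed = models.OneToOneField(
        ShakeEvent, on_delete=models.CASCADE, related_name='initial_link')
```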

cc @Charlotte-Morgan @hadighasemi @RikkiWeber

Charlotte-Morgan (Member) commented:

  1. Folders are shakemaps and shakemaps-processed, i.e. the top level of the structure for the two different sources of shakemaps from BMKG. There are only two sources; to find and process duplicates, the code will look within each source (both of them) and then between sources.

  2. For the variables, I would assume we are talking about the variables you already use in the reporting, and you can continue to use the same ones. These look to me like the event variables you list above, except for shakemap_id, which is a shakemap_grid variable.

  3. We should expect that duplicates can occur in either source or in both. And it may be that a 'duplicate' could be a triplicate, i.e. three events with variables within the search parameters.

If the process runs smoothly, the single record for a shakemap event (from the initial source) will be the most recent of any duplicates and will use the shakemap_id of this event. Its paired record will also be the most recent of any duplicates among the shakemap_processed events (from the post-processed source). This paired record will assume a linking shakemap_id the same as its duplicate from the shakemap source (see the sketch after point 4 below).
I am assuming that it's not likely to get a shakemap_processed without having a duplicate in the shakemap source.

  4. I am assuming you already link on the shakemap_id, and that the proposed process for identifying and removing duplicates and for linking post-processed reports to initial shakemaps will not change the method for linking to InAWARE.
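
One way to read the pairing rule described above, as an illustrative Python sketch; the dict keys and the pre-grouped candidates are assumptions, not existing Realtime code:

```python
def pair_duplicates(candidates):
    """Collapse a group of matching shake events into one initial/processed pair.

    ``candidates`` is assumed to be a list of dicts with ``source``
    ('initial' or 'processed'), ``shake_id`` and ``time`` keys, already
    grouped by the duplicate-matching step. Keeps only the most recent
    record from each source and gives the processed record the shake_id
    of its initial counterpart for linking.
    """
    initial = [c for c in candidates if c['source'] == 'initial']
    processed = [c for c in candidates if c['source'] == 'processed']

    latest_initial = max(initial, key=lambda c: c['time']) if initial else None
    latest_processed = max(processed, key=lambda c: c['time']) if processed else None

    if latest_initial and latest_processed:
        # The paired processed record assumes the initial shake_id as its link.
        latest_processed['linked_shake_id'] = latest_initial['shake_id']

    return latest_initial, latest_processed
```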

Maybe @gubuntu or @timlinux can comment on the implementation plan

lucernae (Contributor) commented:

  1. OK, it's clear now.
  2. Yes, this one is just for clarification, in case I'm using the wrong variable. So, since the shake id of the processed shakemaps is very long (and different), we should use the shake id of the initial shakemaps (to be shown in the table), right?
  3. Basically, there can't be more than one shakemaps_processed linked to an initial shakemap at one point in time (the latest shakemaps_processed will overwrite the old one, if any), right?
  4. What I meant by this is the product URL to link to InAWARE. Do we include both PDF reports (initial and processed shakemaps) for each language (which means a total of 4 Realtime product URLs listed in InAWARE), or do we only include the initial shakemaps, like the current one?

Charlotte-Morgan (Member) commented:

  1. yes
  2. no
  3. ideally all 4
