Different files format #17

Open
dhrou opened this issue Jan 31, 2017 · 8 comments

Comments

@dhrou
Contributor

dhrou commented Jan 31, 2017

This is the proposal for the file format

hits

hits.csv
event id, hit id, x, y
Event ID : integer
hit ID : integer
x, y : float in mm, with 4 digits after the decimal point (micron-level precision)
hits : should be ordered, for example by layer and then by phi (-pi to pi); this is probably better than randomizing
??? currently we have this:
100000710,10000060,[977.9869479074608, 208.68578813535396]
==> how are brackets dealt with by the csv reading module ???
==> also, what is the point of not starting at 10000000 ?
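
A minimal sketch of the ordering and precision proposed above, assuming that the layers are concentric circles so that the radius can stand in for the layer. The helper names `order_hits` and `format_hit` and the sample hits are hypothetical, not part of the project.

```python
import math

def order_hits(hits):
    """Sort (hit_id, x, y) tuples by radius (a proxy for the layer), then by phi."""
    def key(hit):
        _, x, y = hit
        radius = math.hypot(x, y)
        phi = math.atan2(y, x)  # angle in [-pi, pi]
        return (round(radius, 4), phi)
    return sorted(hits, key=key)

def format_hit(event_id, hit_id, x, y):
    """One hits.csv line with 4 digits after the decimal point."""
    return f"{event_id},{hit_id},{x:.4f},{y:.4f}"

# Dummy hits: (hit_id, x, y) in mm
hits = [
    (2, 0.0, 50.0),    # inner layer, phi = +pi/2
    (0, 100.0, 0.0),   # outer layer, phi = 0
    (1, 0.0, -50.0),   # inner layer, phi = -pi/2
]
ordered = order_hits(hits)
lines = [format_hit(0, hit_id, x, y) for hit_id, x, y in ordered]
```

Here the hit ids are kept as they were before sorting; whether they should be renumbered after ordering is a separate choice.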

ground truth

track parameters and list of hits for each track:
tracks.csv
event id, track id, signP, phi, d0, hit id1, hit id2, hit id3 ....
signP : momentum signed by charge (+1 or -1, no neutral) ??? other option : signed inverse momentum, which is proportional to curvature
phi : angle ]-pi,pi]
d0 : impact parameter in mm, with 4 digits after decimal point (so micron precision)

??? : is it a problem if the number of hits is not always the same ? should we maybe fill with zeroes up to the maximum number of hits (= number of layers) ?
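
As a quick check of the question above, Python's standard `csv` module does not require every row to have the same number of fields. The sketch below reads a hypothetical tracks.csv with a varying number of hit ids per track; the values are made up.

```python
import csv
import io

# Hypothetical tracks.csv content: event id, track id, signP, phi, d0, hit ids...
text = """0,0,1.5,0.1000,0.0020,1,2,3
0,1,-2.0,-1.2000,0.0100,4,5
"""

tracks = []
for row in csv.reader(io.StringIO(text)):
    event_id, track_id = int(row[0]), int(row[1])
    sign_p, phi, d0 = (float(v) for v in row[2:5])
    hit_ids = [int(v) for v in row[5:]]  # however many columns are left
    tracks.append((event_id, track_id, sign_p, phi, d0, hit_ids))
```

Readers that expect a rectangular table may be stricter than `csv.reader`, which is where padding could still be useful.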

The reasons for not giving the track id in the hits file :

  • same format for training and testing files (people are not given the tracks file for testing)
  • possibility of merged hits (one hit id appearing in several tracks)

Solution file

tracks_soln.csv
event id, reco track id, hit id1, hit id2, hit id3,...
reco track id : arbitrary integer (should be unique within one event id). Not strictly needed, but probably easier for debugging
??? require a maximum number of hits ?
??? leave room for reco track parameter (even if not required for the challenge), so that there is a nice symmetry with ground truth file ?

@dhrou changed the title from "Output file format" to "Different files format" on Jan 31, 2017
@tboser
Collaborator

tboser commented Jan 31, 2017

hits.csv
what is the difference between event ID and hit ID?
as for ordering hits.csv - that is fine, this can be sorted before the file is written or after it is written.

  • I think there are many ways to deal with brackets depending on what you're looking to do, simplest would be to simply strip them before processing each line.
  • It can start wherever you want it to start (I think it does start at 10,000,00?)
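
To illustrate the bracket point, here is a small sketch of what the standard `csv` module does with the current line format, and how stripping the brackets recovers the coordinates. Only the sample line comes from the thread; the variable names are illustrative.

```python
import csv
import io

line = "100000710,10000060,[977.9869479074608, 208.68578813535396]\n"

row = next(csv.reader(io.StringIO(line)))
# csv.reader splits on every comma, so the bracketed pair becomes two fields:
# ['100000710', '10000060', '[977.9869479074608', ' 208.68578813535396]']

event_id, hit_id = int(row[0]), int(row[1])
# Strip spaces and brackets from the two coordinate fields before converting
x, y = (float(field.strip(" []")) for field in row[2:4])
```

So the brackets do not break the reader, but they do leave stray characters in the fields that have to be cleaned up by hand.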

ground truth
looks good. i don't think it matters if the number of hits isn't the same, and i don't think it is necessary to fill the list.

solution file
reco track id - i'll have to think about this one (may be helpful when creating solution file but past that i can't see how it would help).

  • no need to require maximum number of hits (python is flexible)
  • not necessary to leave room (could program it so that it can take input of either, but it is probably simpler to just give a strict guideline of what the solution file should look like so there is no ambiguity). it's worth noting that the only things that can really be compared between ground truth and solution are hits, so in that capacity all that is really necessary for the solution is sets of hit_ids that are part of the same particle's path.
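
A rough sketch of what comparing on hits alone could look like: each true track is matched to the reco track it shares the most hit ids with, so the reco track ids themselves never matter. The function `match_tracks` and the sample data are hypothetical, and this is not meant as the challenge's scoring metric.

```python
def match_tracks(truth, reco):
    """For each true track, find the reco track sharing the most hits.

    truth, reco: dicts mapping a track id to an iterable of hit ids.
    Returns {true_track_id: (best_reco_track_id, number_of_shared_hits)}.
    """
    reco_sets = {rid: set(hits) for rid, hits in reco.items()}
    matches = {}
    for tid, hits in truth.items():
        true_hits = set(hits)
        best_id, best_shared = None, 0
        for rid, reco_hits in reco_sets.items():
            shared = len(true_hits & reco_hits)
            if shared > best_shared:
                best_id, best_shared = rid, shared
        matches[tid] = (best_id, best_shared)
    return matches

truth = {0: [1, 2, 3], 1: [4, 5, 6]}   # true track id -> hit ids
reco = {10: [1, 2, 6], 11: [4, 5]}     # arbitrary reco track ids
matches = match_tracks(truth, reco)
```

With these inputs, true track 0 is matched to reco track 10 and true track 1 to reco track 11, each with two shared hits.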

@dhrou
Contributor Author

dhrou commented Jan 31, 2017

what is the difference between event ID and hit ID?

  • We typically have p different events, each one with q tracks on average and so with ~nlayer * q hits, not just one event.
    The event id would be a number, starting e.g. at 0. It is a shame to repeat it for all the hits of that event, but this is presumably gzipped away
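
One way to check the "gzipped away" assumption is to compare file sizes with and without the event id column, before and after compression. This is only a sketch on random dummy data; the event ids, the number of hits, and the coordinate range are all made up.

```python
import gzip
import random

rng = random.Random(0)  # fixed seed so the dummy data is reproducible

with_id, without_id = [], []
for event_id in range(1000, 1010):      # 10 hypothetical events
    for hit_id in range(500):           # 500 dummy hits per event
        x = rng.uniform(-1000.0, 1000.0)
        y = rng.uniform(-1000.0, 1000.0)
        without_id.append(f"{hit_id},{x:.4f},{y:.4f}")
        with_id.append(f"{event_id},{hit_id},{x:.4f},{y:.4f}")

def sizes(lines):
    """Return (raw size, gzipped size) in bytes for the given CSV lines."""
    raw = ("\n".join(lines) + "\n").encode("ascii")
    return len(raw), len(gzip.compress(raw))

raw_with, gz_with = sizes(with_id)
raw_without, gz_without = sizes(without_id)
raw_overhead = raw_with - raw_without   # bytes spent on the event id column
gz_overhead = gz_with - gz_without      # the same cost after gzip
print(raw_overhead, gz_overhead)
```

The two printed numbers give the cost of the repeated column in the plain and in the compressed file; real hit data may of course compress differently.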

I think there are many ways to deal with brackets depending on what you're looking to do, simplest would be to simply strip them before processing each line.

  • I just mean, if reading with the csv module, will the brackets not cause a problem ?

It can start wherever you want it to start (I think it does start at 10,000,00?)

  • well, I find it more readable to read 1 2 3 than 10000001 10000002 10000003... that's all

no need to require maximum number of hits (python is flexible)

  • OK, I'm only a bit worried about the csv reading module, but I may be wrong

not necessary to leave room

  • I tend to agree this is not necessary

(disconnecting now...)

@hushchyn-mikhail
Collaborator

hushchyn-mikhail commented Feb 2, 2017

Yet another proposal for the file format.

It will probably be more convenient if the data is provided in a table-like format:

hits.csv:
event_id, hit_id, x, y
1, 1, 1.1234, 2.3455
1, 2, 3.2453, 5.1234
...

tracks.csv:
event_id, track_id, signP, phi, d0, hit_id
1, 1, 0, 1.2345, 3.4521, 1
1, 1, 0, 1.2345, 3.4521, 2
1, 1, 0, 1.2345, 3.4521, 3
...
n, k, 1, 1.2345, 3.4521, m

Solution file:
event_id, reco_track_id, hit_id
1, 1, 1
1, 1, 2
1, 1, 3
...
n, k, m
n, k, m+1
...

Files in this format can be easily read using numpy or pandas.
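
As an illustration of reading the one-hit-per-row layout, the sketch below groups a solution file back into per-track hit lists. It uses only the standard library so it runs without extra dependencies; `pandas.read_csv` followed by a `groupby` on the id columns would be the equivalent route. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict

text = """event_id, reco_track_id, hit_id
1, 1, 1
1, 1, 2
1, 1, 3
1, 2, 4
1, 2, 5
"""

tracks = defaultdict(list)
# skipinitialspace drops the blank after each comma, in the header as well
reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
for row in reader:
    key = (int(row["event_id"]), int(row["reco_track_id"]))
    tracks[key].append(int(row["hit_id"]))
```

After the loop, `tracks` maps each (event id, reco track id) pair to its list of hit ids in file order.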

@dhrou
Contributor Author

dhrou commented Feb 2, 2017

mikhail : for hits.csv, we agree, but for tracks.csv and tracks_soln.csv it would be very inefficient to repeat the track parameters on each line.
Since we have a well-defined number of layers, nlayer, and the layers are simple circles, a true track has at most nlayer hits (it could have fewer because of inefficiency; also, we would not simulate loopers).
Same for a reco track: I don't see how a reco track could have more than nlayer hits.
In my proposal I was imagining nlayer columns, one for each hit, padded with 0 or -1 if there are fewer hits and if the csv reading module requires it
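
A small sketch of the padded layout described above, assuming 5 layers and -1 as the padding value (both are placeholders, as are the helper names). Padding with -1 rather than 0 avoids a clash if hit ids are allowed to start at 0.

```python
import csv
import io

NLAYER = 5   # assumed number of detector layers
PAD = -1     # assumed padding value for missing hits

def pad_hits(hit_ids, nlayer=NLAYER, pad=PAD):
    """Extend a list of hit ids to exactly nlayer entries."""
    if len(hit_ids) > nlayer:
        raise ValueError("a track cannot have more hits than layers")
    return list(hit_ids) + [pad] * (nlayer - len(hit_ids))

def unpad_hits(values, pad=PAD):
    """Drop the padding entries again after reading."""
    return [v for v in values if v != pad]

# Write one row (event id, track id, then nlayer hit columns) and read it back
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([0, 7] + pad_hits([11, 12, 13]))

row = next(csv.reader(io.StringIO(buffer.getvalue())))
hit_ids = unpad_hits([int(v) for v in row[2:]])
```

Every row then has the same number of columns, which keeps table-oriented readers happy at the cost of a few extra characters per track.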

@hushchyn-mikhail
Collaborator

Now I see and agree.

@yetkinyilmaz
Contributor

Hi.
Sorry for not contributing to the discussion earlier, but for the ramp I would like the file formats to be as follows:

Input:
https://www.dropbox.com/s/9iq0qbt54vtq489/input.csv?dl=0

Output:
https://www.dropbox.com/s/7r3udt7v8omu8nt/result.csv?dl=0

or another version of output where truth is also kept:
https://www.dropbox.com/s/1n90o0mwbetp6eg/result_truth.csv?dl=0

This has already been communicated to @tboser. Thanks.

@hushchyn-mikhail
Collaborator

Hi,
Is this the final input/output data format?

@yetkinyilmaz
Contributor

yetkinyilmaz commented Feb 23, 2017 via email

4 participants