Different files format #17

dhrou · 2017-01-31T22:01:52Z

This is the proposal for the file format

hits

hits.csv
event id, hit id, x, y
Event ID : integer
hit ID : integer
x, y : float in mm, with 4 digits after the decimal point (= 0.1 micron precision)
hits : should be ordered, for example by layer and then by phi (-pi to pi) (this is probably better than randomizing)
??? currently we have this:
100000710,10000060,[977.9869479074608, 208.68578813535396]
==> how are brackets dealt with by the csv reading module ???
==> also, what is the point of not starting at 10000000 ?
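
A minimal sketch of how a writer could produce hits.csv in the layout proposed above. The function name `write_hits`, the hit ids counting from 0 within an event, and the use of radius as a stand-in for layer are all assumptions made for illustration, not part of the proposal.

```python
import csv
import io
import math


def write_hits(out, event_id, points):
    """Write one event's hits as rows of: event id, hit id, x, y.

    `points` is a list of (x, y) tuples in mm. Hits are ordered by radius
    (an assumed stand-in for layer) and then by phi in (-pi, pi].
    """
    ordered = sorted(
        points,
        key=lambda p: (round(math.hypot(p[0], p[1]), 4), math.atan2(p[1], p[0])),
    )
    writer = csv.writer(out, lineterminator="\n")
    for hit_id, (x, y) in enumerate(ordered):
        # Fixed 4 digits after the decimal point, as in the proposal.
        writer.writerow([event_id, hit_id, f"{x:.4f}", f"{y:.4f}"])


buf = io.StringIO()
write_hits(buf, 0, [(977.9869479074608, 208.68578813535396), (0.0, 39.0), (39.0, 0.0)])
hits_text = buf.getvalue()
print(hits_text)
```

Writing the coordinates as plain formatted numbers, rather than as a Python list, also avoids the bracket question raised above.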

ground truth

track parameters and list of hits for each track
tracks.csv
event id, track id, signP, phi, d0, hit id1, hit id2, hit id3 ....
signP : momentum signed by charge (+1 or -1, no neutral) ??? other option : signed inverse momentum, which is proportional to curvature
phi : angle ]-pi,pi]
d0 : impact parameter in mm, with 4 digits after the decimal point (so 0.1 micron precision)

??? : is it a problem if the number of hits is not always the same ? maybe we should fill with zeroes up to the maximum number of hits (= number of layers)

The reasons for not giving the track id in the hits file :

same format for training and testing files (people are not given the tracks file for testing)
possibility of merged hits (one hit id appears in several tracks)

Solution file

tracks_soln.csv
event id, reco track id, hit id1, hit id2, hit id3,...
reco track id : arbitrary integer (should be unique for one event id). Not absolutely needed, but probably easier for debugging
??? require maximum number of hits ?
??? leave room for reco track parameter (even if not required for the challenge), so that there is a nice symmetry with ground truth file ?

tboser · 2017-01-31T23:01:29Z

hits.csv
what is the difference between event ID and hit ID?
as for ordering hits.csv - that is fine, this can be sorted before the file is written or after it is written.

I think there are many ways to deal with brackets depending on what you're looking to do, simplest would be to simply strip them before processing each line.
It can start wherever you want it to start (I think it does start at 10,000,000?)
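
To make the bracket question concrete, here is a small sketch using Python's standard `csv` module on the line quoted in the first comment. It shows what the reader returns and one possible way to strip the brackets; the variable names are illustrative only.

```python
import csv
import io

# The line quoted in the first comment, as currently written out.
raw = "100000710,10000060,[977.9869479074608, 208.68578813535396]\n"

# csv.reader treats '[' and ']' as ordinary characters, so the bracketed
# pair is split at its comma into two fields that still carry the brackets.
fields = next(csv.reader(io.StringIO(raw)))
print(fields)

# Stripping brackets and spaces from each field recovers plain numbers.
cleaned = [f.strip(" []") for f in fields]
event_id, hit_id = int(cleaned[0]), int(cleaned[1])
x, y = float(cleaned[2]), float(cleaned[3])
```

So the brackets do not break the reader, but every consumer of the file has to clean the last two fields before converting them to floats.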

ground truth
looks good, i don't think it matters if the number of hits isn't the same, and i don't think it is necessary to pad the list.
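
A sketch of reading a tracks.csv whose rows carry different numbers of hit ids, using the standard `csv` module. The file content is made up for illustration, and the column order follows the ground truth proposal in the first comment.

```python
import csv
import io

# Hypothetical tracks.csv content: event id, track id, signP, phi, d0, hit ids...
# The values are made up; the two rows have 3 and 2 hit ids respectively.
tracks_text = "0,0,1.5,0.1234,0.0021,4,9,17\n0,1,-2.0,-2.5000,0.0100,5,11\n"

N_FIXED = 5  # event id, track id, signP, phi, d0

tracks = []
for row in csv.reader(io.StringIO(tracks_text)):
    event_id, track_id = int(row[0]), int(row[1])
    sign_p, phi, d0 = (float(v) for v in row[2:N_FIXED])
    hit_ids = [int(v) for v in row[N_FIXED:]]  # however many are present
    tracks.append((event_id, track_id, sign_p, phi, d0, hit_ids))
```

The row-by-row reader accepts ragged rows as they are. Table-oriented readers generally expect the same number of columns on every line, which is the case where padding would help.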

solution file
reco track id - i'll have to think about this one (may be helpful when creating solution file but past that i can't see how it would help).

no need to require maximum number of hits (python is flexible)
not necessary to leave room (could program it so that it can take input of either, probably simpler to just give a strict guideline of what the solution file should look like so there is no ambiguity). it's worth noting the only thing that can really be compared between ground truth and solution is hits, so in that capacity all that is really necessary for the solution is sets of hit_ids that are part of the same particle's path.
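
The point about sets of hit_ids can be sketched as follows. This is only an exact-match comparison on made-up data, not the scoring used for the challenge, and the helper name `as_hit_sets` is hypothetical.

```python
def as_hit_sets(tracks):
    """Turn {track id: [hit ids]} into a set of frozensets, dropping the ids."""
    return {frozenset(hits) for hits in tracks.values()}


# Made-up example for one event: truth and solution use different track ids.
truth = {0: [1, 2, 3], 1: [4, 5, 6]}
solution = {17: [4, 5, 6], 42: [1, 2, 3]}

same_grouping = as_hit_sets(truth) == as_hit_sets(solution)
print(same_grouping)  # True
```

Because only the grouping of hits is compared, the reco track id and any track parameters in the solution file play no role in the comparison.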

dhrou · 2017-01-31T23:30:22Z

what is the difference between event ID and hit ID?

We typically have p different events, each one with q tracks on average and ~nlayer * q hits, not just one event.
event id would be a number, starting e.g. at 0. It is a shame to repeat it for all the hits of that event, but this is presumably gzipped away

I think there are many ways to deal with brackets depending on what you're looking to do, simplest would be to simply strip them before processing each line.

I just mean: if reading with the csv module, will the brackets not cause problems ?

It can start wherever you want it to start (I think it does start at 10,000,000?)

well I find it more readable to read 1 2 3 than 10000001 10000002 10000003... that's all

no need to require maximum number of hits (python is flexible)

OK I'm only a bit worried by the csv reading module but I may be wrong

not necessary to leave room

I tend to agree this is not necessary

(disconnecting now...)

hushchyn-mikhail · 2017-02-02T13:13:22Z

Yet another proposal for the file format.

It will probably be more convenient if the data is provided in a table-like format:

hits.csv:
event_id, hit id, x, y
1, 1, 1.1234, 2.3455
1, 2, 3.2453, 5.1234
...

tracks.csv:
event_id, track_id, signP, phi, d0, hit_id
1, 1, 0, 1.2345, 3.4521, 1
1, 1, 0, 1.2345, 3.4521, 2
1, 1, 0, 1.2345, 3.4521, 3
...
n, k, 1, 1.2345, 3.4521, m

Solution file:
event_id, reco_track_id, hit_id
1, 1, 1
1, 1, 2
1, 1, 3
...
n, k, m
n, k, m+1
...

Files in this format can be easily read using numpy or pandas.
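
As an illustration of the one-row-per-hit layout, here is a sketch that rebuilds the hit list of each track. It uses only the standard library so that it runs anywhere; grouping by the two id columns in pandas would be the equivalent. The file content is made up, and the header is written without spaces after the commas, which is an assumption.

```python
import csv
import io
from collections import defaultdict

# Made-up solution file in the one-row-per-hit layout proposed above.
soln_text = """event_id,reco_track_id,hit_id
1,1,1
1,1,2
1,2,3
2,1,1
"""

# Collect the hit ids of each (event, track) pair.
hits_per_track = defaultdict(list)
for row in csv.DictReader(io.StringIO(soln_text)):
    key = (int(row["event_id"]), int(row["reco_track_id"]))
    hits_per_track[key].append(int(row["hit_id"]))
```

In this layout every row has the same columns, so no padding question arises; the cost is that ids (and, in tracks.csv, the track parameters) are repeated on each line.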

dhrou · 2017-02-02T14:42:24Z

mikhail : for hits.csv we agree, but for tracks.csv and tracks_soln.csv it would be very inefficient to repeat the track parameters on each line.
Since we have a well defined number of layers nlayer, and the layers are simple circles, a true track has at most nlayer hits (it could have fewer because of inefficiency) (also we would not simulate loopers).
Same for a reco track, I don't see how a reco track could have more than nlayer hits.
In my proposal I was imagining nlayer columns, one for each hit, padded with 0 or -1 if there are fewer and if the csv reading module requires it
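
A sketch of the padding idea, assuming an illustrative value for nlayer. The helper names `pad_hits` and `unpad_hits` are hypothetical. The sketch pads with -1, since 0 would be ambiguous if hit ids start at 0.

```python
N_LAYER = 5   # assumed number of detector layers, for illustration only
PAD = -1      # filler for missing hits; 0 would clash with a hit id of 0


def pad_hits(hit_ids, n_layer=N_LAYER, pad=PAD):
    """Return hit_ids extended with `pad` to exactly n_layer entries."""
    if len(hit_ids) > n_layer:
        raise ValueError("a track cannot have more hits than layers")
    return list(hit_ids) + [pad] * (n_layer - len(hit_ids))


def unpad_hits(padded, pad=PAD):
    """Drop the filler values again after reading."""
    return [h for h in padded if h != pad]


row = pad_hits([4, 9, 17])
print(row)  # [4, 9, 17, -1, -1]
```

With every row padded to nlayer hit columns, the file becomes a rectangular table that any csv reader can load without special handling.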

hushchyn-mikhail · 2017-02-02T16:53:59Z

Now I see and agree.

yetkinyilmaz · 2017-02-13T11:47:58Z

Hi.
Sorry for not contributing to the discussion earlier, but for the ramp I would like the file formats to be in the following way:

Input:
https://www.dropbox.com/s/9iq0qbt54vtq489/input.csv?dl=0

Output:
https://www.dropbox.com/s/7r3udt7v8omu8nt/result.csv?dl=0

or another version of output where truth is also kept:
https://www.dropbox.com/s/1n90o0mwbetp6eg/result_truth.csv?dl=0

This is communicated to @tboser already. Thanks.

hushchyn-mikhail · 2017-02-23T10:20:16Z

Hi,
Is this the final input/output data format?

yetkinyilmaz · 2017-02-23T11:14:09Z

Hi. This will be the input and output format:

Input:
https://www.dropbox.com/s/9iq0qbt54vtq489/input.csv?dl=0

The output file I wish to have should have the following format, similar to the input, but the 'particle' column is replaced by the 'track' column, and the absolute values of track ids don't matter, only their connections do.

Output:
https://www.dropbox.com/s/7r3udt7v8omu8nt/result.csv?dl=0

or another version of output where truth is also kept:
https://www.dropbox.com/s/1n90o0mwbetp6eg/result_truth.csv?dl=0

This is the minimum info, we are also considering adding "layer" and "iphi" information into the input, which are redundant but may be useful.
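
The remark that only the connections between track ids matter can be sketched as a relabeling check. The exact columns of the linked files are not reproduced here; the sketch assumes only that each hit carries one label (a 'particle' id in the input, a 'track' id in the output), and the helper name `canonical_labels` is hypothetical.

```python
def canonical_labels(labels):
    """Renumber labels 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    return [mapping.setdefault(label, len(mapping)) for label in labels]


# Made-up per-hit labels for one event, listed in the same hit order.
particle = [7, 7, 3, 7, 3]    # truth: which particle produced each hit
track = [50, 50, 90, 50, 90]  # solution: same grouping, different ids

same_connections = canonical_labels(particle) == canonical_labels(track)
print(same_connections)  # True
```

Two label columns that differ only in the numbers chosen for the ids map to the same canonical list, so they describe the same assignment of hits to tracks.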


dhrou changed the title ~~Output file format~~ Different files format Jan 31, 2017

dhrou mentioned this issue Jan 31, 2017

Large quantity of changes #15

Merged
