Skip to content
This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

backfill remaining exchanges #3912

Closed
wants to merge 63 commits into from
Closed

Conversation

chadwhitacre
Copy link
Contributor

Picks up where #3807 left off. Hopefully we can close #2779, which is blocking gratipay/finances#3.

@chadwhitacre
Copy link
Contributor Author

type of exchange number % of total
fine 49,947 62.3
null ref 30,276 37.7
null route 24,116 30.0
null status 7,930 9.9
total 80,233 100.0

@rohitpaulk
Copy link
Contributor

null status

Ooh, what are we going to do about those? Mark them all as succeeded?

@chadwhitacre
Copy link
Contributor Author

@rohitpaulk I guess so? @kaguillera and I were thinking we should approach this month by month alongside gratipay/finances#3, rather than trying to get the backfilling all done before the catching up. For June, 2012 we only have 66 exchanges, and I think we can assume they all succeeded?

@chadwhitacre
Copy link
Contributor Author

@kaguillera points out that payday logfiles include username and payment network ("charging foo on Stripe"), which should be helpful here.

@chadwhitacre chadwhitacre force-pushed the backfill-remaining-exchanges branch from c758d96 to 3a8076e Compare February 6, 2016 02:02
@chadwhitacre
Copy link
Contributor Author

Script (and branch.sql) in 3a8076e. Watch for input logs to accumulate in https://github.com/gratipay/logs/tree/master/misc/3912.

@chadwhitacre
Copy link
Contributor Author

@rohitpaulk @aandis et al. If this script looks good then @kaguillera and I can start loading up data with it once we are able to produce the input CSVs.

sql/branch.sql Outdated
ALTER TYPE payment_net ADD VALUE 'samurai';
ALTER TYPE payment_net ADD VALUE 'stripe';

ALTER TABLE exchanges ADD UNIQUE (ref);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this unique constraint be for a combination of (network + ref)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait - network is in the exchange routes table. Hmm, how would this work out? Isn't it possible that the ref from a braintree and that from a balanced payment collide?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair enough. Fixed in 8bec476.

@chadwhitacre
Copy link
Contributor Author

P.S. I'm not expecting this PR to be merged. I'm expecting to use this PR to track code and comments for backfilling the database (#2779).

@chadwhitacre
Copy link
Contributor Author

Blech. We don't have network in exchanges, only in exchange_routes. It'll take a bit work to enforce this uniqueness constraint.

@chadwhitacre
Copy link
Contributor Author

And I guess we'll have to merge this PR or another one like it, to apply the schema changes.

@chadwhitacre
Copy link
Contributor Author

@kaguillera and I are working on a script to match up Stripe transfer details with exchanges in our database. The problem we're up against is ... usernames as primary key (#835). 😊

@chadwhitacre
Copy link
Contributor Author

The way we log username changes, we don't record the old username, only the new one. This makes it impossible to trace back to what username a participant had at a particular point in time. We do have old database snapshots we could look at, but we didn't add user id until late enough that an old snapshot wouldn't help us map old username to current user id anyway.

@chadwhitacre
Copy link
Contributor Author

Let this be a lesson, kids! 😳

@chadwhitacre
Copy link
Contributor Author

Immutable user id implemented in #680.

@chadwhitacre
Copy link
Contributor Author

Event logging added in #2006. Username change events were backfilled at that point using the username at that time but with the original claimed time (+ 0.01 s).

@chadwhitacre
Copy link
Contributor Author

Tech debt, anyone? :)

@chadwhitacre
Copy link
Contributor Author

🍷 🍒

@chadwhitacre
Copy link
Contributor Author

But we can still use the old database backups to get the exchange id and username-at-the-time from exchanges.

@chadwhitacre
Copy link
Contributor Author

Our offline backups only go back to August of 2013.

@chadwhitacre
Copy link
Contributor Author

So I guess we manually reconstruct before then?

@chadwhitacre
Copy link
Contributor Author

Making progress on this, I'm pretty close to matching all Stripe transactions with exchanges in our database. Now it looks like I have to back up a step because I'm seeing duplicate entries in the Stripe input files.

@chadwhitacre
Copy link
Contributor Author

Here's the script I used for https://github.com/gratipay/logs/commit/f0ab1ba9f3fce55b1a7b0b7d4e8e393a8aa89872:

#!/usr/bin/env python2 -u
from __future__ import absolute_import, division, print_function, unicode_literals

import os


def process_month(year, month):
    seen = set()
    lines = list()
    filepath = '3912/{}/{}/_stripe-payments.csv'.format(year, month)
    for line in open(filepath, 'r'):
        if line not in seen:
            lines.append(line)
            seen.add(line)
    open(filepath, 'w+').writelines(lines)


def main():
    for year in os.listdir('3912'):
        if not year.isdigit(): continue
        for month in os.listdir('3912/' + year):
            if not month.isdigit(): continue
            process_month(year, month)


if __name__ == '__main__':
    main()

And here's the variant I used for https://github.com/gratipay/logs/commit/1ea21621d84b1fb1c8292c2d0fcc420cbcfdc0dd:

--- find-dupes.py   2016-03-09 13:39:39.000000000 -0500
+++ find-dupes.py.new   2016-03-09 13:39:31.000000000 -0500
@@ -9,9 +9,10 @@
     lines = list()
     filepath = '3912/{}/{}/_stripe-payments.csv'.format(year, month)
     for line in open(filepath, 'r'):
-        if line not in seen:
+        stripped = line.strip()
+        if stripped not in seen:
             lines.append(line)
-            seen.add(line)
+            seen.add(stripped)
     open(filepath, 'w+').writelines(lines)

@chadwhitacre
Copy link
Contributor Author

Okay! The match-stripe.py script in b4e9c3a seems to be working! We're able to find matches for all of the payments that we're finding in CSVs downloaded from Stripe.

That means we're about ready to dial back out and run the backfill.py script for 2012-07 (our first month with Stripe data).

@chadwhitacre
Copy link
Contributor Author

@kaguillera are thinking that we'll backfill exchanges in the database month by month in lockstep with gratipay/finances#3, because we'll be able to learn and adapt more easily as we discover bugs and edge cases and whatnot.

@chadwhitacre
Copy link
Contributor Author

Rebased, was 43804a3.

@chadwhitacre chadwhitacre force-pushed the backfill-remaining-exchanges branch from 43804a3 to 063a8cf Compare March 1, 2017 20:12
@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented Mar 1, 2017

Gosh. Bringing @dmk246 up to speed here, and it's getting worse. We're seeing a route for yours truly that is of unknown network, and has 219 exchanges on it from 2012-06-01 through 2016-11-17. Amount and notes and circumstances indicate that it in fact represents a number of different networks: Samurai, PayPal, Braintree, Balanced, and Coinbase at the least. 😳

@chadwhitacre
Copy link
Contributor Author

Alright, this is A Big Deal™. It's by far our biggest piece of technical debt. This underpins our whole accounting function, so we have to get this done. And it's going to be a lot of work. It sounds like @kaguillera is going to dive back into this one, along with @dmk246 as she gets up to speed.

As @dmk246 pointed out IRL, the immediate next task is to ensure that we're not adding new bad data ... which we are: we are not storing ref for any new transactions—record_exchange doesn't even allow for it. 😮

!m @kaguillera @dmk246

@kaguillera
Copy link
Contributor

Diving back into this.

As @dmk246 pointed out IRL, the immediate next task is to ensure that we're not adding new bad data ... which we are: we are not storing ref for any new transactions—record_exchange doesn't even allow for it. 😮

This is taken care of by #4361

Now back to figuring out the corrupted data problem. @whit537 I think I will need a snapshot of the tables that are involved...exchanges, routes and exchange_route if that is at all possible.

@kaguillera
Copy link
Contributor

@whit537 can I have a the actual records from exchanges and exchange_routes for the month of 2012-06 to begin with. I guess a csv would be good or any format that would be easiest and fastest for you. I don't want to take you away from it what you are working on. Maybe you could drop it some where in logs repo.
I want to analyze it along with the payday logs and see if I can figure out the source and reason for the data corruption. Then maybe we can come up with a solution.

@chadwhitacre
Copy link
Contributor Author

I exported the exchange_routes table, id and username from participants, and the first three months of exchanges.

@kaguillera
Copy link
Contributor

tanx...going to start looking at this

@chadwhitacre chadwhitacre mentioned this pull request May 4, 2017
15 tasks
@kaguillera
Copy link
Contributor

This ticket is getting too large as it pertains to the comments so I am moving the research of the errors to #4442

@chadwhitacre
Copy link
Contributor Author

Bah! You can never have too many comments! :-)

@chadwhitacre
Copy link
Contributor Author

We added unknown in #3975.

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented Jun 2, 2017

We haven't had an unknown network for six months and I'm trying to remember why.

gratipay::DATABASE=> select timestamp, network from exchanges e join exchange_routes er on e.route = er.id where network='unknown' order by timestamp desc limit 10;
┌───────────────────────────────┬─────────┐
│           timestamp           │ network │
├───────────────────────────────┼─────────┤
│ 2016-11-17 21:33:03.859138+00 │ unknown │
│ 2016-11-17 21:33:02.467701+00 │ unknown │
│ 2016-11-17 21:33:01.429801+00 │ unknown │
│ 2016-11-17 21:33:00.15087+00  │ unknown │
│ 2016-11-17 21:32:59.618481+00 │ unknown │
│ 2016-11-17 21:32:59.212181+00 │ unknown │
│ 2016-11-17 21:32:58.823265+00 │ unknown │
│ 2016-11-17 21:32:58.47406+00  │ unknown │
│ 2016-11-17 21:32:56.693921+00 │ unknown │
│ 2016-11-17 21:32:56.283582+00 │ unknown │
└───────────────────────────────┴─────────┘
(10 rows)

gratipay::DATABASE=>

@chadwhitacre
Copy link
Contributor Author

We plugged the ref hole in #4361. What about the network hole?

@chadwhitacre
Copy link
Contributor Author

The last unknown networks are for PayPal masspays. Those run through record an exchange which depends on existing routes. How do we have a route with a null network and why did it stop seven months ago?

Last change to routes/associate.json.spt is four months ago, then before that is two years.

https://github.com/gratipay/gratipay.com/blame/24559cb0ee6d9fff77f64fcdba914a19fc0a4b98/www/~/%25username/routes/associate.json.spt

@chadwhitacre
Copy link
Contributor Author

gratipay::DATABASE=> select count(*) from exchange_routes where network='unknown';
┌───────┐
│ count │
├───────┤
│  1626 │
└───────┘
(1 row)

@chadwhitacre
Copy link
Contributor Author

We didn't fully plug the ref leak in #4361. We plugged it for braintree-cc but not for paypal, i.e., MassPay.

gratipay::DATABASE=> select count(*) from exchanges where ref is null and timestamp::text > '2017-03';
┌───────┐
│ count │
├───────┤
│  1089 │
└───────┘
(1 row)

gratipay::DATABASE=>

@chadwhitacre
Copy link
Contributor Author

Looking at those 1,089 it is clear from the note that the only thing still leaking NULL refs since #4361 is MassPay. A PR! A PR!

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented Jun 2, 2017

We have four things we need to backfill, two are now using unknown (status and network) and two are still using NULL (route and ref)

In general we want to plug leaks before backfilling. Moving back over to #3822 to #4442 ...

@kaguillera
Copy link
Contributor

Once #4518 is merged this should be taken care of.

@chadwhitacre
Copy link
Contributor Author

Closing per gratipay/inside.gratipay.com#1196.

@chadwhitacre chadwhitacre deleted the backfill-remaining-exchanges branch November 10, 2017 14:19
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

backfill the status, route, and ref columns of the exchanges table
3 participants