Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement: workbook_iterate and workbook_flatten #28

Open
djay opened this issue Sep 16, 2021 · 7 comments
Open

enhancement: workbook_iterate and workbook_flatten #28

djay opened this issue Sep 16, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@djay
Copy link

djay commented Sep 16, 2021

I wrote two higher level functions that could be useful to others if included in your library?

def workbook_iterate(url, **selects):
    "generates combinations of workbooks from combinations of parameters, selects or filters"
def workbook_flatten(wb, date=None, **mappings):
    """return a single DataFrame from a workbook flattened according to mappings
    mappings is worksheetname=columns
    if columns is type str puts a single value into column
    if columns is type dict will map worksheet columns to defined dataframe columns
    if those column names are in turn dicts then the worksheet will be pivoted and the values mapped to columns
    e.g.
    worksheet1="Address", 
    worksheet2=dict(ws_phone="phone", ws_state="State"), 
    worksheet3=dict(ws_state=dict(NSW="State: New South Wales", ...))
    """
    # TODO: generalise what to index by and default value for index

The code and examples of how I'm using it

Used in combination you can reliably scrape lots of data with not too much code, at least in the case of similar to what I've used it for?

@bertrandmartel
Copy link
Owner

@djay Thank you, that's great work!

I'm very interested in workbook_iterate function since there are many usecases when we need to iterate the parameters/filters (server side rendering / get all region/county/province data etc...). This addition would greatly reduce boilerplate

workbook_flatten seems quite advanced, maybe too advanced for most people that will use this library but I may be wrong.

Do you think you can provide a PR with a sample usage for one or both of these feature ?

@bertrandmartel bertrandmartel added the enhancement New feature or request label Oct 2, 2021
@djay
Copy link
Author

djay commented Oct 2, 2021

I;ve only really used it on my one usecase for now.
Screen Shot 2021-10-02 at 8 32 51 am

For this the iterate and flatten work together to reduce the code to.

        dates = reversed(pd.date_range("2021-02-01", today() - relativedelta(hours=7)).to_pydatetime())
        for get_wb, idx_value in workbook_iterate(url, param_date=dates, D2_Province="province"):
            date, province = idx_value
            if province is None:
                continue
            province = get_province(province)
            if skip_valid(df, (date, province), allow_na):
                continue
            if (wb := get_wb()) is None:
                continue
            row = workbook_flatten(
                wb,
                date,
                D2_Vac_Stack={
                    "DAY(txn_date)-value": "Date",
                    "vaccine_plan_group-alias": {
                        "1": "1 Cum",
                        "2": "2 Cum",
                        "3": "3 Cum",
                    },
                    "SUM(vaccine_total_acm)-value": "Vac Given",
                },
                D2_Walkin="Cases Walkin",
                D2_Proact="Cases Proactive",
                D2_Prison="Cases Area Prison",
                D2_NonThai="Cases Imported",
                D2_New="Cases",
                D2_NewTL={
                    "AGG(stat_count)-alias": "Cases",
                    "DAY(txn_date)-value": "Date"
                },
                D2_Lab2={
                    "AGG(% ติดเฉลี่ย)-value": "Positive Rate Dash",
                    "DAY(txn_date)-value": "Date"
                },
                D2_Lab={
                    "AGG(% ติดเฉลี่ย)-alias": "Positive Rate Dash",
                    "ATTR(txn_date)-alias": "Date",
                },
                D2_Death="Deaths",
                D2_DeathTL={
                    "AGG(num_death)-value": "Deaths",
                    "DAY(txn_date)-value": "Date"
                },
            )

results in

Date,Province,Cases,Cases Area Prison,Cases Imported,Cases Proactive,Cases Walkin,Deaths,Hospitalized Severe,Positive Rate Dash,Tests,Vac Given 1 Cum,Vac Given 2 Cum,Vac Given 3 Cum
...
2021-09-30,Trang,85.0,0.0,0.0,0.0,85.0,0.0,0.0,,,674308.0,433842.0,26718.0
2021-09-30,Trat,76.0,2.0,0.0,0.0,74.0,0.0,91.0,,,268796.0,193600.0,8403.0
2021-09-30,Ubon Ratchathani,178.0,0.0,0.0,0.0,178.0,1.0,0.0,,,512693.0,332288.0,24226.0
2021-09-30,Udon Thani,123.0,0.0,0.0,0.0,123.0,2.0,0.0,,,496948.0,286579.0,16244.0
2021-09-30,Uthai Thani,25.0,0.0,0.0,1.0,24.0,0.0,1.0,,,124028.0,74468.0,5276.0
2021-09-30,Uttaradit,10.0,0.0,0.0,0.0,10.0,0.0,0.0,,,178581.0,111562.0,5184.0
2021-09-30,Yala,30.0,0.0,0.0,2.0,28.0,2.0,0.0,,,374607.0,227236.0,9466.0
2021-09-30,Yasothon,563.0,2.0,0.0,0.0,561.0,1.0,0.0,,,254452.0,150356.0,5982.0

Maybe you could point me to a more permanent and simpler workbook I could scrape and I can use that as an example instead?
I think one thing I might have to do is generalise flatten to work for non timeseries data as it currently assumes this which makes it less useful. Know of a good example which is not indexed by something other than date?

@djay
Copy link
Author

djay commented Oct 2, 2021

yeah maybe the top5leagues one with a row per player.

Doesn't show off the merging of an embedded graph inside the workbook but chances are the usecase for that one is only going to be timeseries.

@djay
Copy link
Author

djay commented Nov 10, 2021

@bertrandmartel maybe workbook_flatten would be more useful if it worked more automatically to try to return a single dataframe from one workbook. Then you can rename columns yourself after to clean it up.
So in the example above it would be

df = wb.flatten(datatime.now())
df = df.rename({"D2_Vac_Stack: vaccine_plan_group-alias: 1": "Vac Given 1",...

However for that to work it needs to know what the index is for every table inside a workbook and assume they are the same index. And also that single value tables will have index value passed in. Im not yet sure where in the information is of what the index is for an internal plot.

@djay
Copy link
Author

djay commented Nov 10, 2021

@bertrandmartel actually in that example it would never work for pivoting an internal table/plot. It wouldn't know which column to pivot on. Maybe the way to do that more simply would be have an exclude param on flatten and the user to do that pivot manually and combine themselves.

In addition the code I had dealt with combining internal plots and single values that represented the same data (but potentially different dates). So that would have to be done manually.

So the example would be

df = wb.flatten(datatime.now(), exclude=["D2_Vac_Stack"])
df = df.rename(columns=dict(D2_New="Cases", D2_NewTL="Cases2", D2_Death="Deaths",...))
df = df.combine_first(df["Cases2"].to_frame("Cases")).drop(columns="Cases2")
...
vac = wb.getWorksheet("D2_Vac_Stack").pivot_table(....
df = df.combine_first(vac)

I'm not sure if the end result saves more work or not...

@ghost
Copy link

ghost commented Nov 13, 2021

@bertrandmartel inclusion of workbook_iterate would be very helpful especially if my understanding is correct and it would accept a parameter column and then iterate the values. Though I supose doing this is fairly easy, it would save a lot of time.

I am presently unable to use Python (due system restrictions) but it does seem like your library would be very helpful for getting scraped data into Microsoft Power BI as it supports running of Python scripts where a dataframe is the result. For the time being, many of your posts Stackoverflow have helped me to get a solution working in M (Power Query). Though it is much less advanced it does the job for now.

Thanks for you work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants