Merge pull request #789 from naishasinha/main
Google Summer of Code '24 Mid-Program Blog Post (Automating Quantifying the Commons)
TimidRobot authored Jul 10, 2024
2 parents 56039bd + 5ba9b49 commit 5c1810f
Showing 7 changed files with 205 additions and 0 deletions.
14 changes: 14 additions & 0 deletions content/blog/authors/NaishaSinha/contents.lr
@@ -0,0 +1,14 @@
username: naishasinha
---
name: Naisha Sinha
---
md5_hashed_email: c6f768d61d96f508d9523bf28664cb64
---
about:

Naisha worked on [Automating Quantifying the Commons][repository] as a developer for [Google
Summer of Code (GSoC) 2024](/programs/history/). <br>
GitHub: [`@naishasinha`][github]

[repository]: https://github.com/creativecommons/quantifying
[github]: https://github.com/naishasinha
1 change: 1 addition & 0 deletions content/blog/categories/big-data/contents.lr
@@ -0,0 +1 @@
name: big-data
1 change: 1 addition & 0 deletions content/blog/categories/cc-software/contents.lr
@@ -0,0 +1 @@
name: cc-software
1 change: 1 addition & 0 deletions content/blog/categories/gsoc-2024/contents.lr
@@ -0,0 +1 @@
name: gsoc-2024
(Two files in this commit could not be displayed.)
188 changes: 188 additions & 0 deletions content/blog/entries/2024-07-10-automating-quantifying/contents.lr
@@ -0,0 +1,188 @@
title: Automating Quantifying the Commons: Part 1
---
categories:
gsoc-2024
gsoc
big-data
quantifying-the-commons
cc-software
open-source
community
---
author: naishasinha
---
pub_date: 2024-07-10
---
body:

![GSoC 2024](Automating - GSoC Logo.png)

## Introduction
***

Quantifying the Commons, an initiative that emerged from the UC Berkeley Data Science Discovery Program,
aims to quantify the frequency of public domain and CC license usage for future accessibility and analysis purposes
(refer to the initial CC article for Quantifying **[here!][quantifying]**).
To date, previous project advancements have not included automation or combined reporting,
both of which are necessary to minimize the potential for human error and allow for more timely updates,
especially for a system that engages with substantial streams of data. <br>

As a selected developer for Google Summer of Code 2024,
my goal this summer is to develop automation software for data gathering, flow, and report generation,
ensuring that reports are never more than 3 months out-of-date. This blog post serves as a technical journal
for my endeavor through the midterm evaluation period. Part 2 will be posted after successful completion of the
entire summer program.

## Pre-Program Knowledge and Associated Challenges
***

As an undergraduate CS student, I had not yet had any experience working with codebases
as intricate as this one; the most complex software I had worked on prior to this undertaking
was probably a medium-complexity full-stack application. In my pre-GSoC contributions to Quantifying, I did successfully
implement logging across all the Python files (**[PR #97][logging]**), but admittedly, I was not familiar with many of the other modules
being used in these files, which caused minor inconveniences to my development process from the very beginning.
For example, being inexperienced with the operating system (OS) modules left me confused about how to
join directory paths. In addition, I had never worked with such large streams of data before, so it was initially a
challenge to map out pseudocode for handling big data effectively. The next section elaborates on my development process and how I resolved these setbacks.
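
Before moving on: for anyone hitting the same confusion with path handling, here is a minimal Python sketch of the kind of path joining involved (the directory and file names are purely illustrative, not the actual Quantifying codebase):

```python
import os

# Build a path relative to this script, then create the directory if needed.
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
data_dir = os.path.join(BASE_DIR, "data", "2024Q3", "1-fetch")  # illustrative names
os.makedirs(data_dir, exist_ok=True)

# os.path.join avoids hard-coding path separators across operating systems.
csv_path = os.path.join(data_dir, "fetched_data.csv")
print(csv_path)
```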

## Development Process (Midterm)
***

### I. Data Flow Diagram Construction
Before starting the code implementation, I decided to develop a **Data Flow Diagram (DFD)**, which provides a visual
representation of how data flows through a software system. While researching effective DFDs for inspiration, I came across
a **[technical whitepaper by Amazon Web Services (AWS)][AWS-whitepaper]** on Distributed Data Management, and I found it very helpful in drafting
my own DFD. As I was still relatively new to the codebase, it helped me simplify
the current system into manageable components and better understand how to implement the rest of the project.

![DFD](DFD.png)
This was the initial layout for the data directory flow; however, the more I delved into the development process,
the more the steps changed. I will present the final directory flow in Part 2 at the end of the program.

### II. Identifying the First Data Source to Target
The main approach for implementing this project was to target one specific data source and complete its data extraction, analysis,
and report generation process before adding more data sources to the codebase. There were two possible strategies to consider:
(1) work on the easier data sources first, or (2) begin with the highest complexity data source and then add the easier
ones later. Both approaches have notable pros and cons; however, I decided to adopt the second strategy of
starting with the most complex data source first. Although this would take slightly longer to implement, it would simplify the process
later on. As a result, I began implementing the software for the **Google Custom Search**
data source, which has the greatest data retrieval potential of all the sources.

### III. Directory Setup + Code Implementation
Based on the DFD, **[Timid Robot][timid-robot]** (my mentor) and I laid out the directory structure as follows: within our `scripts` directory, we would have
separate sub-directories reflecting the phases of data flow: `1-fetch`, `2-process`, and `3-report`. The code would then be
set up so that the phases interact in chronological order. Additionally, a `shared` directory was implemented to consolidate common functions and paths. <br>
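
To make that layout concrete, here is a rough sketch of how a script inside one of the phase directories can import the `shared` module; the file names are hypothetical and this is not necessarily the project's exact approach:

```python
# Hypothetical layout:
#   scripts/
#     shared.py              <- common functions and paths
#     1-fetch/fetch_data.py  <- this file
#     2-process/
#     3-report/
import os
import sys

# Add the parent `scripts` directory to the import path so the phase
# sub-directories can reuse the same shared helpers.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import shared  # noqa: E402

print(shared.__file__)
```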

**`1-fetch`**

As I mentioned in the previous sections, starting to code the initial file was a challenge, as I had to learn how to use
new technologies and libraries on the go. In fact, my struggles began when I couldn't even import the
shared module correctly. Slowly but surely, though, consistent research of the available documentation, along with
constant insights from Timid Robot, brought me to the point where I understood everything I was working with. A few
specific things helped me especially, and I would like to share them here in case they help any software
developer reading this post:

1. **Reading Technical Whitepapers:** As I mentioned earlier, I studied a technical whitepaper by AWS to help me design my DFD.
From this, I realized that consulting relevant whitepapers by industry giants to see how they approach similar tasks
helped me a lot in understanding best practices for implementing the system. Here is another resource by Meta that I referenced,
called **[Composable Data Management at Meta][meta-whitepaper]** (I mainly used the _Building on Similarities_ section
to study the logical components of data systems).

2. **Referencing the Most Recent Quantifying Codebase:** The pre-automation code that was already implemented by previous developers
for _Quantifying the Commons_
was the closest thing to my own project that I could reference. Although not all of the code was relevant to the Automating project,
there were many aspects of the codebase I found very helpful to take inspiration from, especially when online research led to a
dead end.

3. **Writing Documentation for the Code:** As a part of this project, I assigned myself the task of developing documentation for
the Automating Quantifying the Commons project (**[can be accessed here!][documentation]**). Heavily inspired by the "rubber duck debugging"
method, where explaining the code or problem step-by-step to someone or something will make the solution present itself, I decided to create documentation
for future developers to reference, in which I break down the code step-by-step to explain each module or function. I found that in
doing this, I was able to understand my own code better.

As for the license data retrieval process using the Google Custom Search API Key,
I did have a little hesitation running everything for the first time.
Since I had never worked with confidential information or such large data inputs before,
I was scared of messing something up. Sure enough, the first time I ran everything with the language and country parameters,
it did cause a crash, since the API's query-per-day limit was exceeded by a single script run. As I continued to update
the script, I learned a very useful trick when it comes to handling big data:
to avoid hitting the query limit while testing, you can replace the actual API calls
with logging statements to show the parameters being used. This helps you
understand the outputs without actually consuming API quota, and it can help you identify bugs more easily. <br>
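
As a concrete illustration, here is a simplified sketch of that idea, assuming the Google API Python client (`googleapiclient`); it is not the project's actual fetch script, and the environment variable names and dry-run flag are assumptions:

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

DRY_RUN = True  # flip to False only when you actually want to spend API quota


def fetch_result_count(query, country=None, language=None):
    """Return the total result count for a query, or None in dry-run mode."""
    params = {"q": query, "cr": country, "lr": language}
    if DRY_RUN:
        # Log the parameters instead of calling the API: a full test run costs
        # zero quota, and any parameter bugs show up directly in the log.
        LOGGER.info("DRY RUN - would query Google Custom Search with %s", params)
        return None

    from googleapiclient.discovery import build

    service = build("customsearch", "v1", developerKey=os.environ["GCS_DEVELOPER_KEY"])
    response = (
        service.cse()
        .list(cx=os.environ["GCS_CX"], **{k: v for k, v in params.items() if v})
        .execute()
    )
    return int(response["searchInformation"]["totalResults"])


fetch_result_count('"creativecommons.org/licenses/by/4.0"', country="countryUS", language="lang_en")
```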

A notable aspect of this software is the directory organization. Throughout the process, I designed it so that the datasets are automatically stored within their
respective quarter's directories rather than being stored all together. This ensures efficient organization so that users can easily access them in the future,
especially as the number of datasets multiplies.
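
As a small sketch of how a quarter label can be derived so each dataset lands in the right directory (purely illustrative, not the project's exact code):

```python
import datetime


def quarter_label(day=None):
    """Return a label like '2024Q3' for the given (or current) date."""
    day = day or datetime.date.today()
    return f"{day.year}Q{(day.month - 1) // 3 + 1}"


# Datasets can then be filed under e.g. data/2024Q3/1-fetch/ automatically.
print(quarter_label(datetime.date(2024, 7, 10)))  # -> 2024Q3
```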

Upon successful completion of basic data retrieval and state management in Phase 1,
I felt much more confident about the trajectory of this project, and implementing
future steps and fixing new bugs became progressively easier.
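
The state management mentioned here can be as simple as a small file that records progress between runs; below is a hedged sketch of that pattern (the file name and fields are hypothetical, not the project's actual state format):

```python
import json
import os

STATE_FILE = "state.json"  # hypothetical name and location


def load_state(defaults):
    """Resume from a previous run if a state file exists, else start fresh."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as file_obj:
            return json.load(file_obj)
    return dict(defaults)


def save_state(state):
    """Persist progress so an interrupted run (e.g. a quota hit) can resume."""
    with open(STATE_FILE, "w") as file_obj:
        json.dump(state, file_obj, indent=2)


state = load_state({"last_completed_index": -1})
# ...fetch the next batch starting after state["last_completed_index"]...
state["last_completed_index"] += 1
save_state(state)
```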

**`2-process`**

The long-term goal of the Quantifying project is to have comprehensive datasets for each quarter, encompassing
license data that scales up to millions and even billions. For the `2-process` phase specifically, the aim is
to analyze and compare data between quarters so it can be displayed in the reports. However, given our Google Custom Search
API constraints as well as the timeline we're working with (the GSoC period falls mostly within
2024Q3), it is not possible to have a fully completed Phase 2. In order to deploy the automation software in as complete a state as possible,
I have set up basic pseudocode that can be implemented
and built upon by future development efforts as more data is collected in the upcoming quarters/years.
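
For illustration, here is a rough pandas sketch of the kind of quarter-over-quarter comparison this phase is meant to grow into (the file paths and column names are hypothetical):

```python
import pandas as pd

# Hypothetical inputs: one CSV of per-license counts for each quarter.
previous = pd.read_csv("data/2024Q2/1-fetch/gcs_fetched.csv")
current = pd.read_csv("data/2024Q3/1-fetch/gcs_fetched.csv")

# Align the two quarters on license type and compute the change in counts.
merged = current.merge(previous, on="license_type", suffixes=("_current", "_previous"))
merged["change"] = merged["count_current"] - merged["count_previous"]
merged["percent_change"] = 100 * merged["change"] / merged["count_previous"]

print(merged[["license_type", "change", "percent_change"]])
```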

**`3-report`**

As mentioned earlier, the Google Custom Search API constraints made it difficult to create a comprehensive and detailed dataset, so I plan to
initiate the development of a more fleshed-out Google Custom Search implementation post-GSoC, when more data can be accumulated (discussed further in the next section).
As of now, there are three main completed report visualization schemes: **(1)** Reports by Country, **(2)** Reports by License Type,
and **(3)** Reports by Language. Although the visualizations are basic in design, I made sure to incorporate accessibility into them
for a better user experience. This included adding elements like labels with the specific counts on top of the bars for better
readability and understanding of the reports. In addition, I included three key features in the reports codebase to cater to various possible
needs of the report users.

1. **Key Feature #1:** I implemented command line arguments with which users can choose any quarter to visualize, as I believe this would be useful
for anyone in need of individual reports from previous quarters, not just the current quarter (a sketch of such an argument follows this list).
2. **Key Feature #2:** The program stores reports in the data reports directory specific to each quarter for optimal organization (similar to the
dataset organization in Phase 1). In this way,
reports from one quarter will not be mixed up with reports from another, making it easier for users to navigate and use them.
3. **Key Feature #3:** The program automatically generates and/or updates an individual `README` file for each quarter's reports. This `README` organizes
all generated report images within that quarter into one page, alongside basic report descriptions.
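
Here is a hedged sketch of what the quarter-selection argument from Key Feature #1 could look like (the argument name and default are illustrative, not the project's actual CLI):

```python
import argparse


def parse_arguments():
    """Let the user pick any quarter to visualize, not just the current one."""
    parser = argparse.ArgumentParser(description="Generate quarterly report visualizations")
    parser.add_argument(
        "--quarter",
        default="2024Q3",  # a real script would default to the current quarter
        help="Quarter to visualize, e.g. 2023Q4",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()
    print(f"Generating reports for {args.quarter}")
```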

## Mid-Program Conclusions and Upcoming Tasks
***

Overall, my understanding and skillset for this project increased ten-fold after completing all the phases for Google Custom Search.
Going into the second half of the Google Summer of Code program, I expect to complete the remaining data sources at a faster, more efficient rate,
given the license data sizes and my heightened expertise. In fact, as of now (the midterm evaluation point), I have completed
a relatively detailed Phase 1 for Flickr, which only involves 10 licenses. My biggest takeaway from the first half of the coding period is that rather than developing
a basic querying process and adding on later, it's easier to start off with a complex and detailed version before moving on to Phases 2 and 3. Additionally, using
the `shared` module within the scripts can be very beneficial to simplify the coding process.

In the second half of the GSoC program, I plan to keep both of these takeaways in mind when developing scripts for the rest of the data sources. On a formal level,
the final goal for the end of GSoC 2024 is to have a working codebase for Phases 1, 2, and 3 of all data sources,
including a completed automation setup for these scripts. Due to the effectiveness of the current directory organization and report generation features,
I will be standardizing them across all data sources.

Finally, after the software is complete to the extent possible during the GSoC period,
I plan to raise issues in the repository for each of the next steps that
could be taken post-GSoC by open-source developers to build a more comprehensive software system.

So far, my journey at Creative Commons has significantly enhanced my skillset as a software developer, and I have never felt more motivated to take on more challenging tasks.
I'm looking forward to more levels of growth and accomplishments in the second half of the program.
I'll be back with Part 2 at the end of the summer with an updated, completed project!

## Additional Readings
***

- (Stay posted for the second part of this series, coming soon!) Automating Quantifying the Commons: Part 2
- [Data Science Discovery: Quantifying the Commons][quantifying] | Author: Dun-Ming Huang (Brandon Huang) | Dec. 2022

[quantifying]: https://opensource.creativecommons.org/blog/entries/2022-12-07-berkeley-quantifying/
[logging]: https://github.com/creativecommons/quantifying/pull/97
[AWS-whitepaper]: https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/distributed-data-management.html
[meta-whitepaper]: https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/
[timid-robot]: https://opensource.creativecommons.org/blog/authors/TimidRobot/
[documentation]: https://unmarred-gym-686.notion.site/Automating-Quantifying-the-Commons-Documentation-441056ae02364d8a9a51d5e820401db5?pvs=4




