Exploring and Visualizing Mixed Data in R with ggplot2 #606

hawc2 · 2024-03-19T15:46:50Z

Programming Historian in English has received a proposal for a lesson, 'Visualizing data with R and ggplot2,' by @rogorido and @nabsiddiqui.

I have circulated this proposal for feedback within the English team. We have considered this proposal for:

Openness: we advocate for use of open source software, open programming languages and open datasets
Global access: we serve a readership working with different operating systems and varying computational resources
Multilingualism: we celebrate methodologies and tools that can be applied or adapted for use in multilingual research-contexts
Sustainability: we're committed to publishing learning resources that can remain useful beyond present-day graphical user interfaces and current software versions

We are pleased to have invited @rogorido and @nabsiddiqui to develop this Proposal into a Submission under the guidance of @semanticnoodles as editor.

The Submission package should include:

Lesson text (written in Markdown)
- For guidance, we recommend Sarah Simpkin's lesson Getting Started with Markdown
Figures: images / plots / graphs (if using)
Data assets: codebooks, sample dataset (if using)

We ask @rogorido and @nabsiddiqui to share their Submission package with our Publishing team by email, copying in @semanticnoodles.

We've agreed a submission date of April. We ask @rogorido and @nabsiddiqui to contact us if they need to revise this deadline.

When the Submission package is received, our Publishing team will process the new lesson materials, and prepare a Preview of the initial draft. They will post a comment in this Issue to provide the locations of all key files, as well as a link to the Preview where contributors can read the lesson as the draft progresses.

If we have not received the Submission package by April, @semanticnoodles will attempt to contact @rogorido and @nabsiddiqui. If we do not receive any update, this Issue will be closed.

Our dedicated Ombudspersons are Ian Milligan (English), Silvia Gutiérrez De la Torre (español), Hélène Huet (français), and Luis Ferla (português). Please feel free to contact them at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudspersons will have no impact on the outcome of any peer review.

semanticnoodles · 2024-03-19T23:00:21Z

I confirm @rogorido and @nabsiddiqui shared with me access to their repository containing all the required files, and that I handed them over to @anisa-hawes to allow the publishing team to generate the preview, thanks.

anisa-hawes · 2024-03-20T13:19:36Z

Hello Giulia @semanticnoodles, Igor @rogorido and Nabeel @nabsiddiqui,

Many thanks for sharing the lesson submission materials with me. I've now checked the Markdown file, and added some key elements of metadata. I've also checked the accompanying images and assets, ensuring each element meets our requirements.

You can find the key files here:

You can review a Preview of the lesson here:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/exploring-visualizing-mixed-data-r-ggplot2

--

A few initial notes:

I've made a slight adjustment to the Header sizes used in the lesson. Our typesetting convention is that ## Header 2 is the largest.
I've added placeholder alt_text + captions for each of your images. We have committed to providing alt-text for all figure images, plots and graphs included in our lessons, so you'll need to add this as part of your revisions. These notes on Descriptive Alt text may be useful to you.
I've checked to ensure that you both have the Write access you'll need to edit your draft directly. We ask authors to work on their own files with direct commits: (we prefer you don't fork our repo, or use the Pull Request system in ph-submissions).
I imagine Giulia @semanticnoodles may have noted this too, but I noticed that you include both a .tsv and a .csv version of the dataset, although only the .csv appears to be used in the lesson. Is the .tsv alternative required too?

anisa-hawes · 2024-03-20T13:40:55Z

Hello again Igor @rogorido and Nabeel @nabsiddiqui.

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 2: Initial Edit.

In this Phase, your editor Giulia @semanticnoodles will read your lesson, and provide some initial feedback. Giulia will post feedback and suggestions as a comment in this Issue, so that you can revise your draft in the following Phase 3: Revision 1.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 1 <br> Submission
Who worked on this? : Publishing Manager (@anisa-hawes) 
All  Phase 1 tasks completed? : Yes
Section Phase 2 <br> Initial Edit
Who's working on this? : Editor (@semanticnoodles)  
Expected completion date? : April 20
Section Phase 3 <br> Revision 1
Who's responsible? : Authors (@rogorido + @nabsiddiqui) 
Expected timeframe? : ~30 days after feedback is received

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

rogorido · 2024-03-20T20:41:18Z

@anisa-hawes Thanks for your comments. As for the tsv file: no, it is not required. It can be deleted.

I'll add the alternative captions. Thanks.

rogorido · 2024-04-08T16:40:47Z

I added captions and alt texts (10a6a9e), but Nabeel should take a look whether it looks 'Englishly' enough...

semanticnoodles · 2024-04-15T17:18:30Z

Hello @rogorido and @nabsiddiqui,

here follows my preliminary feedback; I am aware it is quite extensive, but I believe these indications could help you strengthen your tutorial. If you need any clarification, please do not hesitate to ask!

Overall feedback

In general, your tutorial provides valuable guidance on navigating and producing a wide range of visualisations, effectively walking through the various features of ggplot2. The piece meets the accessibility and inclusivity goals of the Programming Historian fairly well, and in most cases the language is easy to understand and straightforward. However, some elements need further work, mostly falling under two intertwined aspects discussed in the following paragraphs.

Usability: Enhancing the logical structure of the lesson

In my opinion, this is the most critical point to consider. The tutorial lacks a cohesive element to tie its components together and the organisation of the content could benefit from a more linear and less convoluted approach. The case study you propose (sister cities) seems to be just a tool to obtain a series of visualisations. This is fair enough, but it could benefit from further methodological contextualisation and unpacking: the people following your tutorial may not be historians, nor have a clear understanding of the methods you are using -- although they may be familiar with R.

In terms of improving the overall content, I think there are two possible directions for you to consider: either revising the content to follow a visualisation task-based narrative or placing more emphasis on the structure of the case study. The first option would privilege the visualisation tasks (but still require some methodological support for the case study), while the second would require you to generate stronger and sharper research questions from the case study, to be answered (at least in part) by the visualisation tasks. I think @nabsiddiqui did a very good job of structuring the content in the lesson Data Wrangling and Management in R, so I would recommend keeping that in mind as a reference.

The title of the proposal could benefit from being more specific - or at least mentioning the context of application. The table of contents looks unbalanced: the headings and their actual wording could be better aligned with the content they cover, and the nesting could be more linear.

You give very clear information about the concept of the grammar of graphics - this is really the cornerstone of understanding how ggplot2 is designed. I really appreciate you explaining this and including many useful resources, although I think they could be arranged more organically, instead of including relatively short hints throughout the tutorial, as they tend to overshadow the walkthrough steps on several occasions.

Sustainability: Critically reviewing the data analysis narrative

The dataset looks more than adequate for the visualisation tasks you have set as objectives, but the data narrative and its wording could benefit from further tuning. What you offer in this lesson is mostly visualisation of data distributions and there is little statistical testing involved. As your topic is sister cities, it makes perfect sense to talk about relationships, although what you observe are mostly trends or tendencies that you could try to explain through further research; sometimes you clearly point that out and sometimes it looks rather implicit. I think this is just a matter of fine-tuning the language, nothing more.

Section-specific feedback

Para stands for paragraph number; please refer to the preview generated by @anisa-hawes

Introduction, Lesson Goals and Data

Para 1, line 2: there is an extra )
Lesson’s goals could be more specific (you could pick outcomes that have more resonance than adding meaningful labels to plots)
No reference to the dataset is presented here (it comes from Wikidata, right?). Make sure you include at least a couple of words about it here.
Review the heading accordingly with the edits.

ggplot2: General Overview

This acts more like an introductory section, although it is nested under the previous one. Bring it to the same level as the previous or put it before it to give a more comprehensive introduction (or re-arrange it for better consistency, please).
A couple of words about the Tidyverse here would better contextualise the workflow.
Para 7 could be added to the Additional Resources section.
Para 8 could mention more strategically the arguments – review it for a better alignment with the walkthrough. You could even think of following the official layers featured in the introduction to ggplot2 vignette, adapting that to match with the elements you thoroughly explain.
Review the heading accordingly with the edits.

Sister cities in Europe

Please clarify your understanding of sister cities by giving a working definition. This would clarify the starting point of your research.
The rationale of your case needs some more unpacking; please add some context here, also about the provenance of your dataset.
The research questions here listed are somewhat aligned with the steps you propose. I would recommend you to review them for enhanced consistency.
Review the heading accordingly with the edits. Most importantly, from here on you start with the walkthrough. Make sure you clarify this by tuning the headings.

Loading Data with `readr`

If you referenced the tidyverse above you won’t need to explain tibbles extensively here. Please review this part for conciseness.
Including head(eudata) could support your explanation about the observations occurring in the dataset – this is also considered good practice in data science.
Para 16 could benefit the previous section.
Consider raising the level of this heading and review it accordingly.
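
The suggestion to include head(eudata) can be illustrated with a minimal sketch. It assumes nothing about the lesson's real file: the inline CSV text, its three columns, and all values are invented for illustration, and base R's read.csv() stands in for the readr loader the lesson uses.

```r
# Toy stand-in for the lesson's dataset (values invented for illustration).
csv_text <- "origincityLabel,sistercityLabel,dist
Berlin,Paris,878
Madrid,Lisbon,503
Rome,Vienna,765
Prague,Warsaw,517
Dublin,Oslo,1266
Athens,Sofia,526
Helsinki,Tallinn,82"

# read.csv(text = ...) parses the string as if it were a file on disk.
eudata <- read.csv(text = csv_text)

# head() returns the first six rows by default, enough to check
# column names and types before plotting.
preview <- head(eudata)
print(preview)

# str() gives a compact summary of each column's type.
str(eudata)
```

Showing the first rows right after loading lets readers confirm that their copy of the data matches the one described in the text.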

Creating a bar graph

IMPORTANT: There is no typecountry column included in your dataset. I tested the walkthrough using the data contained in the eu column, just remember to send us the correct version of the dataset.
Paras 20-23 could be more focused on the walkthrough; anticipating para 23 once obtained the barplot could enhance the clarity.
Para 30 could use a bit more details about the interpretation of the results. If you plan
Review the heading accordingly with the edits.
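
A short guard can catch the missing-column problem before plotting. The sketch below is an illustration under stated assumptions, not the lesson's code: the toy eudata frame and its values are invented, and it falls back to the eu column, as in the test described above, when typecountry is absent.

```r
library(ggplot2)

# Toy data without a typecountry column (values invented for illustration).
eudata <- data.frame(
  origincityLabel = c("Berlin", "Madrid", "Rome", "Oslo"),
  eu = c("EU", "EU", "EU", "Non-EU")
)

# Use typecountry when present; otherwise fall back to eu.
group_col <- if ("typecountry" %in% colnames(eudata)) "typecountry" else "eu"

# geom_bar() counts the rows in each category of the chosen column.
# .data[[group_col]] looks the column up by its name stored as a string.
p <- ggplot(eudata, aes(x = .data[[group_col]])) +
  geom_bar() +
  labs(x = group_col, y = "Number of sister-city pairs")

# Call print(p) to draw the bar graph.
```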

Other Geoms: Histograms, Distribution Plots and Boxplots

Para 31, penultimate line: comma missing space afterwards.
Para 33, please review this for clarity (here you should explain once and for all why you used log10, or move the explanation to another spot. Consider explaining why none of the methods is ideal)

This leads to an uninformative histogram. We can take log10(dist) as our variable or filter to exclude values above 5000kms. None of these methods is ideal, but as far as we know, we are operating with manipulated data making it less problematic
Para 36, please review it for clarity (the reason you employed ECDF is left implicit).
Para 41, same issue: you refer to ANOVA without explaining why you foresee that as a viable statistical test, cutting the paragraph short.
Review the heading accordingly with the edits.
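
The histogram and ECDF points above can be made concrete with a small sketch. It uses simulated, right-skewed distances rather than the lesson's data (the dist values and the seed are assumptions for illustration; the 5000 km cut-off is the one quoted above). A log scale spreads out the crowded short distances, filtering discards the long tail, and an ECDF avoids choosing a bin width altogether.

```r
library(ggplot2)

set.seed(42)  # make the simulated data reproducible
# Simulated right-skewed distances in km (not the lesson's data).
toy <- data.frame(dist = rlnorm(500, meanlog = 6, sdlog = 1))

# Option 1: keep every row but put the x axis on a log10 scale.
p_log <- ggplot(toy, aes(x = dist)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# Option 2: drop the long tail; this discards data, so record how much.
toy_short <- subset(toy, dist <= 5000)
n_dropped <- nrow(toy) - nrow(toy_short)
p_filtered <- ggplot(toy_short, aes(x = dist)) +
  geom_histogram(bins = 30)

# ECDF: for each distance d, the share of pairs at or below d.
# There is no bin width to choose, and quantiles can be read off the curve.
p_ecdf <- ggplot(toy, aes(x = dist)) +
  stat_ecdf() +
  labs(y = "Cumulative share of pairs")

# The same quantity computed directly with base R.
share_under_1000 <- ecdf(toy$dist)(1000)
```

Neither option is free: the log scale changes how bar widths should be read, and the filter silently changes the population being described, which is why stating the choice once in the text helps.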

Manipulating the Look of Graphs

This section would be more logically following the Other Geoms section. Evaluate how to make this and the following sections more cohesive.
Para 42 could be revised for clarity – especially the research question. Mind that you first performed the random subsampling and then explained it.
Para 45 does not add much information to the following steps. Instead of pointing out which elements you want to manipulate, consider laying out clearly the goal for your tasks.
Para 55, review for conciseness (sometimes less is more).
Review the heading accordingly with the edits.

Scales: Colors, Legends, and Axes

Para 65, please review for straightforwardness - advantage of using a continuous scale? Also a repetition in the last line (“represent the distance”).
Para 68, review for accuracy: the way it is phrased seems like ggplot2 does not use discrete colour scales at all.
Para 70, would better fit in the Additional Resources section.
Para 74, review for accuracy.
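
To illustrate the point about colour scales: ggplot2 picks a continuous or a discrete colour scale depending on the type of the mapped variable. The sketch uses invented toy data, and the specific palettes (scale_colour_viridis_c() and scale_colour_brewer()) are one possible choice, not necessarily the lesson's.

```r
library(ggplot2)

# Toy data (values invented for illustration).
toy <- data.frame(
  originpopulation = c(1e5, 5e5, 1e6, 2e6),
  destinationpopulation = c(2e5, 3e5, 8e5, 3e6),
  dist = c(120, 800, 1500, 300),
  eu = c("EU", "EU", "Non-EU", "Non-EU")
)

base <- ggplot(toy, aes(x = originpopulation, y = destinationpopulation))

# Numeric variable mapped to colour: a continuous gradient.
p_continuous <- base +
  geom_point(aes(colour = dist)) +
  scale_colour_viridis_c()

# Categorical variable mapped to colour: a discrete palette.
p_discrete <- base +
  geom_point(aes(colour = eu)) +
  scale_colour_brewer(palette = "Set1")
```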

Faceting a Graph

This section would be more logically part of the Other Geoms section and use a title anticipating also the theme changes.
Para 75, review for clarity and conciseness (“split by categories [space time and so]” is not very straightforward. Consider explaining straightforwardly what faceting is.)
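
Put plainly, faceting splits one plot into a grid of small panels, one per category, that share the same axes. A minimal sketch with invented toy data, assuming eu as the faceting variable:

```r
library(ggplot2)

# Toy data (values invented for illustration).
toy <- data.frame(
  dist = c(120, 800, 1500, 300, 950, 60),
  eu = c("EU", "EU", "Non-EU", "Non-EU", "EU", "Non-EU")
)

# facet_wrap(~ eu) draws one histogram panel per value of eu.
p_facet <- ggplot(toy, aes(x = dist)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ eu)

# One panel is drawn for each distinct category.
n_panels <- length(unique(toy$eu))
```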

Themes: Changing Static Elements

As the previous, this section would be more logically following the Other Geoms section.

Extending ggplot2 with Other Packages

Para 84, extra comma not rendering the link for Ridgeline plots
As the previous, this section would be more logically following the Other Geoms section.

Additional Resources

Consider reviewing and incorporating other elements into this section, following more closely the tools used in the tutorial instead of pointing towards general-purpose resources. A critical list of resources would be more useful to your readers.

Format & style

Two quick comments on the form and style.

Please homogenise the use of capitalisation in the headings (exclusion made for ggplot2 that always comes lowercased, but you know it 😄)
Please homogenise the way you refer to R functions and arguments – using the code format or not, you choose. Consistency is the only requirement.

Thank you for the great work done so far!

rogorido · 2024-04-16T08:08:40Z

@semanticnoodles thanks for your extensive comments. I will have a look at the enhancements you're proposing in the next days.

anisa-hawes · 2024-04-17T11:43:01Z

What's happening now?

Hello Igor @rogorido and Nabeel @nabsiddiqui. Your lesson has been moved to the next phase of our workflow which is Phase 3: Revision 1.

This Phase is an opportunity for you to revise your draft in response to @semanticnoodles's initial feedback. You can make direct commits to your file here: /en/drafts/originals/exploring-visualizing-mixed-data-r-ggplot2.md. @charlottejmc or I are here to help if you encounter any practical problems!

When both of you + Giulia are happy with the revised draft, we will move forward to Phase 4: Open Peer Review.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 2 <br> Initial Edit
Who worked on this? : Editor (@semanticnoodles) 
All  Phase 2 tasks completed? : Yes
Section Phase 3 <br> Revision 1
Who's working on this? : Authors (@rogorido + @nabsiddiqui)  
Expected completion date? : May 17
Section Phase 4 <br> Open Peer Review
Who's responsible? : Reviewers (TBC) 
Expected timeframe? : ~60 days after request is accepted

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

semanticnoodles · 2024-05-21T10:28:20Z

Hello Igor @rogorido and Nabeel @nabsiddiqui, I hope you are doing well!

Just checking in with you about the draft revision (Phase 3 / Revision 1) as the deadline of the 17th of May has passed. If you need some extra time let me know approximately how much, so we can set up a new deadline -- and @anisa-hawes or @charlottejmc can update the Mermaid timeframe.

If you have doubts or need any clarification, please do not hesitate to keep in touch.

nabsiddiqui · 2024-05-23T05:20:55Z

Hello @semanticnoodles,

I have tried to rework a lot of the tutorial. I feel that changing some of the headings will make the flow more obvious. Let me know if it makes sense the way I have done it or if there should be additional changes. Here are some of the points I reviewed based on your timeline. The rest I will leave to @rogorido unless he has an objection:

Introduction, Lesson Goals and Data

Para 1, line 2: there is an extra )
Lesson’s goals could be more specific (you could pick outcomes that have more resonance than adding meaningful labels to plots)
No reference to the dataset is presented here (it comes from Wikidata, right?). Make sure you include at least a couple of words about it here.
Review the heading accordingly with the edits.

ggplot2: General Overview

This acts more like an introductory section, although it is nested under the previous one. Bring it to the same level as the previous or put it before it to give a more comprehensive introduction (or re-arrange it for better consistency, please).
A couple of words about the Tidyverse here would better contextualise the workflow.
Para 7 could be added to the Additional Resources section.
Para 8 could mention more strategically the arguments – review it for a better alignment with the walkthrough. You could even think of following the official layers featured in the introduction to ggplot2 vignette, adapting that to match with the elements you thoroughly explain.
Review the heading accordingly with the edits.

Sister cities in Europe

Please clarify your understanding of sister cities by giving a working definition. This would clarify the starting point of your research.
The rationale of your case needs some more unpacking; please add some context here, also about the provenance of your dataset.
The research questions here listed are somewhat aligned with the steps you propose. I would recommend you to review them for enhanced consistency.
Review the heading accordingly with the edits. Most importantly, from here on you start with the walkthrough. Make sure you clarify this by tuning the headings.

Loading Data with `readr`

If you referenced the tidyverse above you won’t need to explain tibbles extensively here. Please review this part for conciseness.
Including head(eudata) could support your explanation about the observations occurring in the dataset – this is also considered good practice in data science.
Para 16 could benefit the previous section.
Consider raising the level of this heading and review it accordingly. (Felt it was better at this level)

Creating a bar graph

IMPORTANT: There is no typecountry column included in your dataset. I tested the walkthrough using the data contained in the eu column, just remember to send us the correct version of the dataset.
Paras 20-23 could be more focused on the walkthrough; anticipating para 23 once obtained the barplot could enhance the clarity.
Para 30 could use a bit more details about the interpretation of the results. If you plan
Review the heading accordingly with the edits.

Other Geoms: Histograms, Distribution Plots and Boxplots

Para 31, penultimate line: comma missing space afterwards.
Para 33, please review this for clarity (here you should explain once and for all why you used log10, or move the explanation to another spot. Consider explaining why none of the methods is ideal)

This leads to an uninformative histogram. We can take log10(dist) as our variable or filter to exclude values above 5000kms. None of these methods is ideal, but as far as we know, we are operating with manipulated data making it less problematic
Para 36, please review it for clarity (the reason you employed ECDF is left implicit).
Para 41, same issue: you refer to ANOVA without explaining why you foresee that as a viable statistical test, cutting the paragraph short.
Review the heading accordingly with the edits.

Manipulating the Look of Graphs

This section would be more logically following the Other Geoms section. Evaluate how to make this and the following sections more cohesive.
Para 42 could be revised for clarity – especially the research question. Mind that you first performed the random subsampling and then explained it.
Para 45 does not add much information to the following steps. Instead of pointing out which elements you want to manipulate, consider laying out clearly the goal for your tasks.
Para 55, review for conciseness (sometimes less is more).
Review the heading accordingly with the edits.

Scales: Colors, Legends, and Axes

Para 65, please review for straightforwardness - advantage of using a continuous scale? Also a repetition in the last line (“represent the distance”).
Para 68, review for accuracy: the way it is phrased seems like ggplot2 does not use discrete colour scales at all.
Para 70, would better fit in the Additional Resources section.
Para 74, review for accuracy.

Faceting a Graph

This section would be more logically part of the Other Geoms section and use a title anticipating also the theme changes.
Para 75, review for clarity and conciseness (“split by categories [space time and so]” is not very straightforward. Consider explaining straightforwardly what faceting is.)

Themes: Changing Static Elements

As the previous, this section would be more logically following the Other Geoms section.

Extending ggplot2 with Other Packages

Para 84, extra comma not rendering the link for Ridgeline plots
As the previous, this section would be more logically following the Other Geoms section.

Additional Resources

Consider reviewing and incorporating other elements into this section, following more closely the tools used in the tutorial instead of pointing towards general-purpose resources. A critical list of resources would be more useful to your readers.

Format & style

Two quick comments on the form and style.

Please homogenise the use of capitalisation in the headings (exclusion made for ggplot2 that always comes lowercased, but you know it 😄)
Please homogenise the way you refer to R functions and arguments – using the code format or not, you choose. Consistency is the only requirement.

Other

Change Title to be More Descriptive

anisa-hawes · 2024-05-29T09:51:49Z

Thank you, @nabsiddiqui!

@semanticnoodles will review these revisions and advise if we are ready to move onwards to the next Phase of the workflow (which will be Phase 4 Open Peer Review). Giulia is away this week, returning on June 3rd.

In the meantime, @charlottejmc and I can help with ensuring that functions and arguments are typographically consistent. These are aspects we always check as part of typesetting at Phase 6, but we'll do a quick scan now so that this isn't a distraction for Reviewers.

charlottejmc · 2024-05-29T11:10:48Z

Hello @nabsiddiqui and @semanticnoodles,

I've made some adjustments to add backticks to functions, arguments and other parts of code, trying to stay consistent with our house style.

semanticnoodles · 2024-06-10T15:14:27Z

Hello everybody, I am back! While I was away I got the chance to go through the tutorial and I can say you did upgrade the lesson quite a lot. Brilliant work @nabsiddiqui and @rogorido -- and many many thanks to @charlottejmc and @anisa-hawes for their support!

I will take another quick reading as I think I spotted another couple of small things to fix, but I believe now it is almost ready to move onwards to Phase 4. Sorry for the slight delay in my answer -- I will get back to you in a few hours.🖥

semanticnoodles · 2024-06-18T13:34:51Z

It took longer than expected (hours became days..). Nevertheless, if @rogorido and @nabsiddiqui can quickly fix the elements in the list below I believe we can move to the open peer review (Phase 4). The most urgent is the first element, the following are about simple formalities/typos.

remember to upload the correct dataset containing the typecountry column
paras 86-88 are missing list formatting
Conclusion, paras 126/127 display several contracted forms, e.g. "you'll" -- please expand them.
para 133 extra space before the dot
paras 136/137 missing end dot
para 141, explore typo "epxlore"

Thank you for the patience!

rogorido · 2024-06-19T21:03:33Z

@semanticnoodles (and @nabsiddiqui): I have already corrected all typos (I hope). And I have the correct dataset. But my question is: where exactly should I upload it?

Many thanks for your work!

rogorido · 2024-09-13T07:03:41Z

@justinwigard and @regan008 Thank you very much for the detailed corrections!

semanticnoodles · 2024-09-24T15:56:23Z

Hi @rogorido and @nabsiddiqui, here is my review/feedback summary (it took a while); thanks a million @regan008 and @justinwigard for all the food for thought and complementary feedback you provided! Both of you highly recommend the lesson for publication 🎉🎉🎉: @regan008 appreciates particularly the explanations about the tibbles and the Grammar of Graphics; on the other hand, @justinwigard appreciates the engaging tone of the lesson and the way it explains the potential of ggplot2.

Here is a quick recap of the core elements you highlight -- that I recommend @rogorido & @nabsiddiqui to go through carefully.

Notes on Amanda’s feedback

@regan008 makes some detailed comments about typos, potential clarifications (e.g., on plotting packages, coordinate systems), and a suggestion to link out where ECDF is mentioned, clarifying the contents of para 56-60. She also notes that while maps are mentioned, the lesson does not cover them explicitly (might be a chance to link to Using Geospatial Data to Inform Historical Research in R).

There may be an opportunity to use additional line charts, as @regan008 suggests, but requiring further transformations/brand new additions, e.g. using long/lat or population size between sister cities. The structure of the lesson works and I would like you to prioritise the refinements she suggests rather than adding brand new extensions. She makes a good point, but please only add additional data filtering/visualisation if you have time to devote to the task.

Notes on Justin’s feedback

@justinwigard highlights a number of areas where the lesson is already strong, as well as offering thoughtful suggestions for improvement under the four sections he articulated. The minor typographical and grammatical suggestions certainly require your attention, as does the consistency of the sister cities spelling and of the geoms.

On a functional level, he notes that some additional context could be helpful for readers unfamiliar with the tidyverse or Wikidata. He noted that providing counter-examples alongside some of the figures, like Figure 6, could help readers compare different cases, as well as adding more references on the choice of binwidth size (very often a rule of thumb, in my experience). He additionally suggests listing the tidyverse packages explicitly, and including a link to Wikidata, making more evident the line about the dataset download. He also suggests incorporating a screenshot to show how the tibble should appear after loading (I believe I suggested you to consider something similar previously, like running head(eudata), it might be really worth getting a screenshot). Many more technical insights from his side follow, and I suggest you have a look at them carefully.

Again, as I noted in Amanda's feedback, please focus on refinement/consolidation first, and then consider expanding your lesson further.

A few extras

Here are a few extra comments from my side, mostly technically oriented.
Following @justinwigard's notes, I ran all the code to see if I could provide some extra technical feedback (using R version 4.3.0 [2023-04-21] on my RStudio version Cranberry Hibiscus, 2024.9.0.375).

The tibble size is in fact 13081 x 15, with the following colnames (I believe the index X could be removed from the dataset).

> colnames(eudata)
 [1] "X"                       
 [2] "origincityLabel"         
 [3] "origincountry"           
 [4] "originlat"               
 [5] "originlong"              
 [6] "originpopulation"        
 [7] "sistercityLabel"         
 [8] "destinationlat"          
 [9] "destinationlong"         
[10] "destinationpopulation"   
[11] "destination_countryLabel"
[12] "dist"                    
[13] "eu"                      
[14] "samecountry"             
[15] "typecountry"
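
For context, an X column usually appears when a CSV was written with row names and then read back with base R's read.csv(). Whether that is how the lesson's file was produced is an assumption. The sketch below reproduces the effect on invented toy data and shows two ways to avoid it.

```r
# Toy data (values invented for illustration).
toy <- data.frame(origincityLabel = c("Berlin", "Madrid"), dist = c(878, 503))

# write.csv() stores row names in an unnamed first column by default.
path <- tempfile(fileext = ".csv")
write.csv(toy, path)

# read.csv() names that unnamed column "X".
with_index <- read.csv(path)
print(colnames(with_index))

# Fix 1: drop the column after reading.
cleaned <- with_index[, colnames(with_index) != "X", drop = FALSE]

# Fix 2: avoid writing row names in the first place.
write.csv(toy, path, row.names = FALSE)
without_index <- read.csv(path)
```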

The overall code formatting for several chunks is a bit weird in fact: if you could remove the extra spaces or check returns that break the code (e.g. paras 38, 44, etc.) I believe it could facilitate the end users. I realise this is probably not your doing but it might be 100% dependent on the style format packages/export to .md.
para 64: it’s eudata.filtered (eudata missing filtered).
para 71: the y axis goes up to 15 max, and I also get the warning message Warning: Removed 956 rows containing missing values (geom_point()); the same happens with the codeblock in para 73. My plots look just like the ones from @justinwigard.
Last but not least, please consider using a more specific title for this tutorial, like Visualizing Distributions and Relationships with R and ggplot2 (or something more task-specific).
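
On the warning quoted above: a "Removed ... rows containing missing values" message typically means that some rows were NA, or fell outside the axis limits, by the time the points were drawn. Whether that is the cause in the lesson is an assumption. The sketch below, on invented toy data with an assumed limit of 15, shows how to count such rows and how coord_cartesian() zooms without discarding them.

```r
library(ggplot2)

# Toy data (values invented for illustration).
toy <- data.frame(x = 1:6, y = c(2, 8, 14, 20, 30, NA))

# Rows already NA, plus rows that a y limit of 15 would turn into NA.
n_affected <- sum(is.na(toy$y) | toy$y > 15)

# Limits set on the scale replace out-of-range values with NA,
# so geom_point() drops them with a warning when the plot is drawn.
p_limits <- ggplot(toy, aes(x, y)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 15))

# coord_cartesian() only zooms the view; every row stays in the data.
p_zoom <- ggplot(toy, aes(x, y)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 15))
```

Counting the affected rows first makes it easier to tell readers what number to expect in the warning, or to decide that the rows should be kept.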

A huge thank you for all your patience and hard work!🌟

anisa-hawes · 2024-09-24T21:40:31Z

Hello Igor @rogorido and Nabeel @nabsiddiqui,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 5: Revision 2.

This phase is an opportunity for you to revise your draft in response to the peer reviewers' feedback.

Giulia @semanticnoodles has summarised their suggestions, but feel free to ask questions if you are unsure.

Please make revisions via direct commits to your file: /en/drafts/originals/visualizing-data-with-r-and-ggplot2.md. @charlottejmc and I are here to help if you encounter any difficulties.

When you and Giulia are all happy with the revised draft, the Managing Editor @hawc2 will read it through and provide additional feedback/suggestions as necessary before we move forward to Phase 6: Sustainability + Accessibility.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 4 <br> Open Peer Review
Who worked on this? : Reviewers (@justinwigard + @regan008)
All  Phase 4 tasks completed? : Yes
Section Phase 5 <br> Revision 2
Who's working on this? : Authors (@rogorido + @nabsiddiqui)
Expected completion date? : October 24
Section Phase 6 <br> Sustainability + Accessibility
Who's responsible? : Publishing Team
Expected timeframe? : 7~21 days

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

rogorido · 2024-09-25T06:01:14Z

@semanticnoodles thanks for your review/feedback. We will make all corrections in the next few days.

rogorido · 2024-10-01T09:27:17Z

@regan008 and @justinwigard: Many thanks again for your comments and corrections. I have incorporated many of them (cdfc89f); @nabsiddiqui and I still need to think about two or three of the changes you propose, which may have more profound consequences for the tutorial.

In any case, just some comments:

@justinwigard:

the differences between your graphs and the graphs in the tutorial come (as far as I can see) from the fact that we use sample_frac(), which takes a random sample of the data. We should add a warning for the reader...
As for your question, "Would it be helpful to provide a counter-example to Germany here? How should we read Portugal’s relationship, or Bulgaria, to sister-cities?": no. We give the reader some hints for analysis, but this is not a tutorial about sister-city relationships; it is about using ggplot2 to analyze and visualize them.

@regan008:

as for other packages: plotly was created mainly for Python and nowadays has extensions for R, Julia, etc. As far as I know, it is not used very much in R in comparison to 'native' solutions like ggplot (see here the number of stars on GitHub, for instance). dygraphs is also rather an interface to the dygraphs JavaScript library and nothing 'R-native';
you are right: a line chart showing change over time would be best for historians. Unfortunately it is not easy (if it is possible at all) to extract that kind of information from Wikidata for the data we are working with.
You are right about maps, GIS, etc. I have tried to make explicit that we do not cover maps in this lesson (maybe someone can point to a PH lesson on this topic?).

In any case, we will still work on some of your comments (@semanticnoodles). Many thanks again.

anisa-hawes · 2024-10-02T14:42:32Z

Thank you for your work so far, Igor @rogorido and Nabeel @nabsiddiqui ✨

Please let Giulia @semanticnoodles know when you feel you've completed the revisions. She will read through the draft again to confirm that she's satisfied with the suggestions integrated.

rogorido · 2024-10-04T07:42:48Z

@anisa-hawes yes will do it!

nabsiddiqui · 2024-10-08T18:43:52Z

Hello @rogorido, @anisa-hawes, and @semanticnoodles,

Igor and I have added our edits, and I believe that we are all set to move to the next stage now.

anisa-hawes · 2024-10-09T10:12:59Z

Thank you, @nabsiddiqui and @rogorido.

Giulia @semanticnoodles will read through your revisions later this week, and advise if she feels any further adjustments are needed.

After that, Alex will read it through and share additional feedback/suggestions as necessary.

When both Giulia and Alex are happy, we will move forward to Phase 6: Sustainability + Accessibility which will begin with copyediting 🙂

rogorido · 2024-10-09T11:23:06Z

@anisa-hawes OK, many thanks!

semanticnoodles · 2024-11-15T10:42:29Z

Hello @rogorido & @nabsiddiqui,

I apologise for the delay in posting this feedback. I have been going through the whole lesson again with @justinwigard's and @regan008's comments at hand. I think you have done a wonderful job of polishing the lesson, and we are almost ready for Phase 6! 🎉

Please review the following points and we will be ready to move on - looking forward to seeing this brilliant lesson of yours available to the PH audience!

General Comments

Missing rows warning: so that readers do not freak out when they encounter `Warning: Removed xyz rows containing missing values (geom_point())`, can you spend a line or so just saying they do not have to worry?
Title: I understand that it might not be easy, but as I mentioned in my previous comment, I would like you to think if the title could be improved, to be more informative for the PH audience – you are doing much more than teaching how to plot graphs here! Something like Exploring and Visualizing Data in R with ggplot2 might make the difference already, but you can consider referring to the grammar of graphics or anything that (and massive thanks @anisa-hawes for the brainstorming session on this):
- relates to your dataset.
- better clarifies the scope of the tools you are using.
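As an illustration of the missing rows warning mentioned above, here is a minimal sketch, assuming the warning comes from `NA` values in a plotted column (ggplot2 also emits it for values outside the axis limits; which cause applies in the lesson is not stated in this thread). The `toy` data frame is hypothetical and is not the lesson's dataset.

```r
library(ggplot2)

# Hypothetical toy data with one missing y value (not the lesson's dataset).
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, NA, 30, 40))

# Printing this plot triggers a "Removed 1 row(s) containing missing values"
# style warning: the incomplete row is dropped and the other points are drawn.
p_warn <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Passing na.rm = TRUE to the geom drops the same rows silently.
p_quiet <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

# Counting the incomplete rows shows how many points the warning refers to.
n_dropped <- sum(!complete.cases(toy))
```

In both plots the same points are drawn; the warning is informational and only reports how many rows could not be placed on the plot.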

Paragraph-specific comments

¶ 25: link missing a [ to be rendered
¶ 28-30: following @justinwigard observation, to make the dataset download less skippable can you put at the end of paragraph 28:

You can download the dataset at [this link](https://github.com/programminghistorian/ph-submissions/tree/gh-pages/assets/visualizing-data-with-r-and-ggplot2/sistercities.csv).

and then in paragraph 30 change the phrasing to:

Let’s go ahead and place the dataset in our project’s current working directory.
¶ 38: The paragraph seems a little messed up (trimmed?). Please check it.
¶ 39 (@regan008 ’s): I think the figure caption for Fig 1 is mixed up. This chart appears to show the count of locations not the total percentage.
¶ 44: instead of “tutorial” can you please use its full name (Data Wrangling and Management in R)?
¶ 49 (@justinwigard ’s): I think there’s a sentence that was unfinished, potentially? “…the column for different bars, and We also added”:

We passed a new parameter to the ggplot() command named fill, indicating the column for the bars. We also added…

Here I believe you meant something like “We mapped the origincountry column to the fill aesthetic in the ggplot() command, which defines the color range of the bars. We also added…”
¶ 64: in the code chunk it’s eudata.filtered (eudata missing filtered).
¶ 117: The Wallstreet Journal -> The Wall Street Journal
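To make the ¶ 49 suggestion concrete, here is a minimal sketch of mapping a column to the `fill` aesthetic in a bar graph. The column name `origincountry` comes from the comment above; the `toy` data frame and its values are hypothetical stand-ins, and this is not necessarily the exact code used in the lesson.

```r
library(ggplot2)

# Hypothetical stand-in data: one row per sister-city relationship.
toy <- data.frame(
  origincountry = c("Germany", "Germany", "Germany", "Portugal", "Bulgaria")
)

# Mapping origincountry to both x and fill gives each country its own bar
# and its own bar color; geom_bar() counts the rows per country.
p <- ggplot(toy, aes(x = origincountry, fill = origincountry)) +
  geom_bar()

# The counts that the bar heights represent.
counts <- table(toy$origincountry)
```

Because `fill` is set inside `aes()`, the colors are driven by the data and ggplot2 adds a matching legend automatically.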

rogorido · 2024-11-15T15:45:41Z

@semanticnoodles Thanks a lot for your comments. We will work on your corrections and I hope we will be ready in 2-3 days.

nabsiddiqui · 2024-11-20T15:41:24Z

Hello @semanticnoodles. @rogorido and I have finished our edits. I have set a seed in the R code to allow for reproducibility. I have also updated the images to reflect the sample data the user will get due to the seed.
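For readers following this thread, here is a minimal sketch of why setting a seed makes `sample_frac()` reproducible. The `toy` data frame and the seed value `123` are assumptions for illustration only; they are not the values used in the lesson.

```r
library(dplyr)

# Hypothetical stand-in data (not the lesson's sistercities.csv).
toy <- data.frame(id = 1:100, value = (1:100) * 2)

# sample_frac() draws a random fraction of the rows, so two readers would
# normally get different samples and therefore slightly different plots.
# Resetting the same seed before each draw makes the samples identical.
set.seed(123)  # arbitrary seed value, chosen only for this example
sample_a <- sample_frac(toy, 0.1)

set.seed(123)
sample_b <- sample_frac(toy, 0.1)

same_sample <- identical(sample_a, sample_b)
```

With the seed fixed, every reader who runs the code draws the same rows, which is why the figures can be made to match what readers see.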

For the title, we were thinking perhaps "From Historical Data to Visual Analytics: The Grammar of Graphics in Practice"? I don't know what would be needed to change the title since the folders are based on the title. I am sure @anisa-hawes can help. Look forward to moving this ahead.

anisa-hawes · 2024-11-20T17:16:26Z

Thank you, @nabsiddiqui. Yes, of course we can help with the practicalities of adjustments to any file and directory names.

However, I think what Giulia @semanticnoodles is aiming towards is finding a title that is more specific. Fundamentally, we want to help readers find lessons that meet their learning goals. A clear title facilitates discovery through search, and offers a quick, basic sense of what can be learned.

Reviewing our lesson directory, I think the most successful titles generally comprise:

a verb or a noun which defines the main learning activity, method or process: Transcribing, Analysing, Visualising, Mapping, Text Mining, Facial Recognition
the kind of data readers will handle in the lesson: YouTube comment data, historical photographs, OCR text files
the names of key tools, software libraries or programming languages readers will use: R, Python, Neo4j, OpenRefine, SPARQL.

The current title is: Visualizing Data with R and ggplot2
Giulia has suggested the subtle adjustment: Exploring and Visualizing Data in R with ggplot2

I was wondering whether your title could clarify what kind of data readers are handling with these methods? The concept of Sister Cities is mentioned but what are you describing in general: demographic data? geographical/spatial data? ('mixed' data? - is the fact that you are selecting methods to visualise a range of different data types the key? 🤔)

My sense is that an effective lesson title is usually simple and succinct. So, I think I'd suggest avoiding the semicolon and compound structure (more often encountered for an expanded research article title) and focus on providing straight-forward keys to the lesson.

rogorido · 2024-11-22T22:37:26Z

@anisa-hawes After talking with @nabsiddiqui, I think we will stick with the title proposed by Giulia.

hawc2 · 2024-11-23T01:23:56Z

to @anisa-hawes' point, it would be nice to clarify what type of data this lesson teaches how to visualize - would it be fair to label it "Demographic Data"?

nabsiddiqui · 2024-11-23T03:51:22Z

I think it is more mixed data since some of it is about the cities themselves and some of it is about the demographics of the city.

I like "Exploring and Visualizing Mixed Data in R with ggplot2".

@rogorido is this ok with you?

rogorido · 2024-11-23T05:16:45Z

@nabsiddiqui Yes, perfect!

semanticnoodles · 2024-11-23T09:43:24Z

Hi everybody, thank you @anisa-hawes and @hawc2 for stimulating these productive exchanges 🧠! The title you settled on sounds quite good to me; let us know your thoughts, @hawc2 and @anisa-hawes.

Thanks a lot for fixing the last items, @rogorido & @nabsiddiqui, I highly appreciated you adding the seed for reproducibility 👏

On my side I think we are ready to move to phase 6 🎉

anisa-hawes · 2024-11-25T23:06:23Z

Thank you, Giulia @semanticnoodles.

Hello Igor @rogorido and Nabeel @nabsiddiqui,
Many thanks for all your work, and for taking the time to rethink the title. I agree with Giulia that your suggestion works well.

The Managing Editor Alex @hawc2 will now read the lesson to confirm if it should be moved onwards to our Phase 6 (Sustainability and Accessibility checks, beginning with copyediting), or if he'd like to suggest any final revisions.

Best,
Anisa

charlottejmc · 2024-11-26T09:51:51Z

Hello Igor @rogorido, Nabeel @nabsiddiqui, and Giulia @semanticnoodles,

Thank you for your thoughtful consideration of the lesson title.

I've now updated the title across all the files: the lesson's new slug is now exploring-visualizing-mixed-data-r-ggplot2.

rogorido · 2024-11-26T10:04:57Z

@charlottejmc, @anisa-hawes @semanticnoodles thanks!

hawc2 · 2024-11-26T22:48:03Z

@rogorido and @nabsiddiqui this looks like a solid and well developed lesson with a clear scope and utility for those looking to learn how to present their research with R.

My only request for further revision pertains to our prior discussion about the title and what kind of data this lesson shows the reader how to analyze and present. The concept of "mixed data" doesn't get discussed currently in the lesson, so the title will raise a basic question for the reader as to what that means. I must admit I'm not quite sure myself what "mixed data" refers to, so I do think a couple additional paragraphs of explanation about your dataset, at a high level, would help contextualize the following parts of the lesson.

The core questions I think need further response: How does this lesson show visualization techniques specifically useful for "mixed data"? What is it about "mixed data" that is particularly complex, necessitating different measures for presentation than required for less mixed data? What types of data is this mixed dataset a mixture of, exactly?

The opening section of the lesson jams together three different subsections (Introduction, Lesson goals, and Data). My recommendation would be to break those into three separate sections with headings provided for each, and to take more time to introduce the type of data (and specific sample dataset) at the core of your lesson on presenting data visualizations. It is worth going through the lesson as a whole with a mind to this question, as it would be nice to see you bring up the concept of mixed data (or the data central to this lesson) again during the central and concluding sections.

Once you make revisions to address this issue, I'll do a brief line edit, and assuming I don't have any remaining questions, I'll send it on to copyedits and preparation for publication. Please let me know if you have any questions!

rogorido · 2024-11-28T10:05:50Z

@hawc2 Thanks for your comments. We will work on them in the next days!

anisa-hawes · 2024-12-05T14:05:45Z

Thank you, Igor @rogorido. Remember that the lesson title + filename have been adjusted so the key links are slightly changed.

You'll find the Markdown file here: /en/drafts/originals/exploring-visualizing-mixed-data-r-ggplot2.md
And the live preview here: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/exploring-visualizing-mixed-data-r-ggplot2

Please let Charlotte or me know if there's anything we can help with 🙂

hawc2 · 2024-12-11T14:57:26Z

@rogorido As an alternate title, I'd recommend changing it to: "Visualizing Urban and Demographic Data in R with ggplot2." I'm just not sold on the idea of 'mixed data,' and from what @nabsiddiqui said to my prior comment, the mixed data is a mixture of Urban and Demographic data, so why not just say that? It seems to be giving yourselves unnecessary work to try to explain "mixed data," which I fear doesn't have any legit research or technical precedent for you to lean on.

Regardless of what title you pick, the lesson itself will need to be revised to explain the data you're using in more detail, and guide the reader into the lesson with introductory steps in the first few sections.

hawc2 added English 0. Proposal Original labels Mar 19, 2024

hawc2 assigned semanticnoodles Mar 19, 2024

hawc2 added this to Active Lessons Mar 19, 2024

anisa-hawes moved this to 0 Proposal in Active Lessons Mar 19, 2024

anisa-hawes moved this from 0 Proposal to 1 Submission in Active Lessons Mar 20, 2024

anisa-hawes added 1. Submission and removed 0. Proposal labels Mar 20, 2024

anisa-hawes moved this from 1 Submission to 2 Initial Edit in Active Lessons Mar 20, 2024

anisa-hawes added 2. Initial Edit and removed 1. Submission labels Mar 20, 2024

anisa-hawes added 3. Revision 1 and removed 2. Initial Edit labels Apr 17, 2024

anisa-hawes moved this from 2 Initial Edit to 3 Revision 1 in Active Lessons Apr 17, 2024

anisa-hawes added 5. Revision 2 and removed 4. Open Peer Review labels Sep 24, 2024

anisa-hawes moved this from 4 Open Peer Review to 5 Revision 2 in Active Lessons Sep 24, 2024

charlottejmc changed the title ~~Visualizing data with R and ggplot2~~ Exploring and Visualizing Mixed Data in R with ggplot2 Nov 26, 2024
