Skip to content

Latest commit

 

History

History
525 lines (417 loc) · 42.7 KB

notes-flowcytometry.md

File metadata and controls

525 lines (417 loc) · 42.7 KB

notes on flow cytometry Prashant Kalvapalle started - 5/March/2022 tags: #notes

Reading

Read papers that does bacterial flow cytometry to see what kind of analysis they do : Read a bunch of Tabor papers : Castillo hair etc.

  • Sexton et al, MSB, 2020 "Multiplexing cell‐cell communication." Molecular systems biology 16.7 (2020): e9618.
    • 60% SSC threshold ; 20,000–30,000 E. coli-sized events were recorded; 85% events density gated
  • Sebastian, 2019 Schmidl, Sebastian R., et al. "Rewiring bacterial two-component systems by modular DNA-binding domain swapping." Nature Chemical Biology 15.7 (2019): 690-698.
    • 35% SSC threshold ; 20,000 events ; 30% density retained : Nice image showing gating workflow (S15)
    • Combined histograms of 3 replicates after gating
    • Tried B. subtilis, S oneidensis; used geometric mean?; combines separate day replicates for histograms; Took avg (arithmetic?) of 3 replicates' geometric mean for
  • Klümper, Uli, et al. 2015 "Broad host range plasmids can invade an unexpectedly diverse fraction of a soil bacterial community." The ISME journal 9.4 (2015): 934-945.
    • Sorting: 0.1% to 82% enrichment fast sorting; second 100% sorting, 10,000 cells : 2,000 eps speed

tutorials

Full tutorial (flowcore, ggcyto): https://jchellmuth.com/posts/FACS-with-R/ openCyto automatic gating schemes ggcyto plotting help, adding gates, axis limits

Which tool to use?

  1. Loading files
  2. Attaching names from google sheet :: R done, python can try
  3. Processing :: FlowCal ahead, FlopR working but too slow. Need to automate FlowCal, ~~and use beads named for FlopR~~
    • FlowCal lets you pre-trim data before automatic clustering and MEFLing, FlopR does not currently
  4. Gating :: Useful for determining % of population that is red etc. R/flopR workflow is more familiar right now
    • (verify) Neither have a nice gating heirarchy, but R's flowworkset has good ones for gating and stats
  5. Plotting distribution : Who has the better violins automated? FlowCal : violins, but without sample name and matplotlib is not compositional like ggplot; flopr -- check
  6. Retrieving summary stats :: FlowCal.stats module Python, flowWorkspace::gs_pop_get_count_fast() in R

Other considerations

  1. Run-time
    • FlowCal : Run time for single .fcs ~ 9-10 sec with plotting and 1 sec without plotting. using %timeit -n 1 -r 1 _process_single_fcs(..). For full workflow of 4 files + beads calibration = 23.4 sec (without plotting calibration details)
    • FlopR : ~ 1 min per sample ; maybe turn plotting off? Which step takes the most time?
  2. What kind of stats do we want from data?
    • (S048 mainly) % and numbers of cells in red high, red low and gr ..
    • Other expts : distributions of fcs data plotted
  3. Can we skip MEFL calibration for certain experiments? - Especially if only population fraction in a gate is desired (S048)? Could stay within R if doing this -

Conclusions

  • 10/6/22 : Use adhoc R openCyto, flowWorkspace - for S048 data -- flowcal is discarding useful data. Re-calibrate FlowCal to retain more events (~ >80%) in the future?
  • 14/5/22 : Let's focus on trim-mefl processing with FlowCal since it is faster
  • 30/4/22 : ~~FlopR, since R is good for the other steps -

Ideal workflow

Script in rmd

  1. R script loads google sheet names
  2. python script for calibration w flow cal
    • Pass variable of which well(s) are beads; Save the calibrated data to .fcs again
    1. Can't proceed to next step .. Figure out a way to beautify the existing plot; Ask John Sexton for custom code
  3. Re-read processed fcs, attach names and condense replicates
  4. plot using ggcyto + geom_violin/ggridges

Running flowcytometer

  • Check with Harsha: Issues with clogs/air from Vibrio? Running vibrio through the machine seems to clog the SEA Sony in BRC. Even beads have air issues with these sample sets (S0045, S045b) and the data looks like this with SSC around 0 ; but beads run on the same day in S046 are fine..
  • How old are the cells? vibrio are known to aggregate- Swetha says ; Especially true if they are in stationary phase right?

![[Pasted image 20220530125021.png]]

Python/Flowcal

Advantage of flowcal

  • Nice looking plots to show transformations, ex: gated vs full samples, MEFLing
  • Built-in normalization
  • Density based gating : ~automatic but needs a user based selection of the % of cells that should be retained
    • Check if this can be done in R too?

Bugs

  • to_mef = FlowCal.mef.get_transform_fxn in g.15.. : fails when only red channels are present in fluorescence_channels :: uses the mGreenLantern mefl vals for it :(
  • Code: input() function called from within a module in jupyterlab does not work. See analyze_fcs_flowcal/Line 121 for beads with low events
  • code stops in the pipeline but runs adhoc : to_mef() step : on gfp and mcherry2 data S066x_Ara dose-1 -- fluorescence_channels is empty, not being recognized -- due to PBS
  • _(fixed now) Looks like singlet gating is using the wrong y axis- should be FSC-H?

Bimodality after MEFL issue

  • Investigate the bimodality around 0 for non fluorescent cells. Ex: S055/S063. is this an artifact of the Sony flow cyt software doing background subtraction?

  • Verdict : Don't use MEFLing till I figure out why the beads are giving shitty slope m (~ 0.8) even after strong thresholding (low = 5,000 / 10,000) like Sebastian did

  • Check on the github issue I made on FlowCal. Sebastian confirms that m needs to be closer to 1. Beads have too much debris (probably data too?). Need to clear more debris with gate.high_low(beads_sample, channels=['FSC-A'], low=(5000))

    • First, make sure that your beads processing is correct. m=0.68 is highly atypical and suspicious. Ideally you would have it between 0.9 - 1.1, and these should be extremely consistent across your runs (i.e. if you got 0.92 once, you should get +/- 0.1 at most every time in the same instrument)
    • The other option is to force m=1. the beads curve won't fit well in this case.. compare S055 : Raw ![[S055_51 in 8 organisms-raw-ridge density-raw.png|250]] Processed ![[S055_51 in 8 organisms-processed-ridge density-processed.png|250]]
  • Take the negative control sample from this dataset and trace the intermediate distributions in the flowCal process - Is this because of subsetting the data or due to MEFL transformation? -- Looks like it is. How is it that S061 sensory only doesn't have the bimodal

  • trying on negative control of S063c/D01 (SS marine) : Raw ![[S063c_Ds_empty-raw.png|250]] Processed : ![[S063c_Ds_empty-processed.png|250]] MEFLing : ![[S063c_Ds_empty-MEFLing.png|500]] ; Histogram at 200 bins : ![[S063c_Ds_empty-processed_200bins.png|250]]

    • Follow up to trace steps in the processing to the origination of bimodality ;
      • Looks fine after gate_and_reduce() : Plotted with this FlowCal.plot.hist1d(processed_single_fcs.gated_data, channel = fluorescence_channels[0], xlim = (-100, 100), bins = 1000)
      • Seeing bimodal around 0 after MEFLing
  • Effect of the FlowCal processing on a positive sample : Ex: 'S063a_B02_gr_raw..' Raw : ![[S063a_B02_gr_raw.png|250]] Processed : ![[S063a_B02_gr_processed.png|250]]

  • check some sample that is negative with the above method intermediate steps in flowcal. ex: S067b1/B09; which is 143/So - d-1_U which loos clearly bimodal without zoom in ![[S067b1_143_B9_processed_bimodal.png | 200]] Plotting density 2d is more informative than 1d. Here's FlowCal.plot.density2d(.., mode = 'scatter') Zoomed in here with smooth and with smooth = False ![[S067b1_143_B9_raw_density.png | 300]] ![[S067b1_143_B9_processed_density.png| 300]]

  • Scatter lot on linearscale (this one should do no approximations, plotting exact data) : Shows that MEFL is producing values away from 0 -- need to inspect the MEFLing vector?

![[S067b1_143_B9_raw_scatter.png|300]] ![[S067b1_143_B9_processed_scatter.png|300]]

  • MEFLing is producing a gap at 0 - check formula if 0 is not an output of MEFL transformation, this could cause the effect. How to get around it? Running to_mef.fitting() with full_output = True gives these details about the calibration model 'beads_params': array(6.84108707e-01, 4.13784424e+00, 3.43116109e+03), 'beads_model_str': 'm*log(fl_rfi) + b = log(fl_mef_auto + fl_mef)' where m (slope) = 0.684 b (intercept) = 4.137 fl_mef_auto (autofluor.) = 3.43e3 --- this seems too high, is something weird going on?

  • Post a reply with the density plot - say that bin edges is not an issue, maybe the model parameters..

Check clen peak : S055

Checking the bead calibration of S055_B2 (MG1655 control) which looks clean (not bimodal) in a mGreenLantern plot ; it is still bimodal in mScarlet! fluorphore order : green (good), red (bimodal) ``'beads_params': [array([ 0.89199098, 2.5864598 , 792.15587603]), array([7.93298660e-01, 2.76010590e+00, 1.09402666e+03])],`

the m (slope) close to 1 seems to be less bimodal

The values are not that different but see the scatterplots ![[S055_B2_green_good.png|300]] ![[S055_B2_red_bad.png|300]]

S063c_2 bimodal

Here's the calibration parameters for green : 0.88250639, 2.46310895, 675.87577154] the m is closer to 1, but the bimodality is also more subtle here : example data D01 - same as above ![[S063c_Ds_empty-processed_200bins.png|250]]

Tasks

  • Make a list of samples with the # of events before filtering to check if something is wrong?

  • Practicing the flowcal tutorial

  • Read in a bunch of .fcs files from given directory into a vectorized FlowCal.io.FCSRead

  • (working) Vectorize the flowcal processing script by putting it in a function / or vectorizing each step with list comprehensions

  • (connecting through R better) : Bring in plate layout from the google sheets, Convert plate layout to columns like in R -- using pandas? or a dplyr for python, Attach the names (with some kind of regex matching)

Plotting - matplotlib

  • Plot the initial and final event count (x-axis) by well name (short, y-axis) for easy identification of outliers and mis-processed wells. Or do a side by side scatter plot, labelling only the outliers (outside 2 SD..?). requires deep dive into matplotlib :(
  • (_~ not urgent_) figure out the best channel to plot during flowcal processing
  • (ignore, can do in R) Add sample names to the plot using plt.legend(list of names in the same order, loc = 'best')
    • Or figure out what variable in the .fcs file is being made the title of the plots?
  • Figure out how to compose multiple data into a matplotlib by colour etc. -- Don't know if it will work as good as ggplot; and if FlowCal does it automatically as flowworkspace
  • Add sample names to the median violin plots? docs matplotlib

File/fcs handling

  • Is pData saved with write.FCS()? _NO; needs to me remade

  • (mutliple blocks analysis) How to load multiple folders, join metadata and analyze together? Useful when samples are split among multiple blocks, ex: S062, S063 or multi-day S050 like experiments. join file loading and pData replacement into a function and run map2 with plate number/day do add a feature. How to cat multiple cytosets? try rbind2 and flowset_to_cytoset()

  • (Guava data) To expand single .fcs file into multiple .fcs : use subprocess.run module to open an R function use case; documentation

  • (not doing this) Getting plate layout google sheet : Can use the same approach to call the existing R function to do this for us

  • (doing this in R, automatic) Merge data for replicates : as simple as ndarray.concatenate? do before plotting violins/ distributions

    All fluorescence histograms shown are composed of three histograms taken from samples on three separate days and combined into 128 logarithmically spaced bins between 101–106 MEFL units (Supplementary Fig. 15c) unless otherwise stated. Source: Schmidl, Sebastian R., et al. "Rewiring bacterial two-component systems by modular DNA-binding domain swapping." Nature Chemical Biology 15.7 (2019): 690-698.

  • Saving output .fcs: Idea - from FCSData/numpy ndarray use fcswrite to convert to .fcs 3.0 file. Can write any numpy array. For metadata etc, can convert to the data format used by fcsparser or flowcytometry tools which are alluded to in fcswrite documentation.

    • solved : ~~Error: ValueError - If odd number of keys + values detected (indicating an unpaired key or value). due to duplicates in HEADER and TEXT

Error-handling

  • Could have a user input if beads data gating looks acceptable before proceeding to MEFLing

Gating

  • Make a variable for density gating percentage,
    • (fancy) get the density gating percentage from user input after showing a plot of 50%, interactive analysis.?

Processing

  • Show the parameters of fitting in the output ; + have user review before proceeding with MEFLing?

Other information

  • How do we get volume information to get cell density data (Cells/ul) from the .fcs file?
    • Is the flow rate recorded in the .FCS file so we can use the time units (assume seconds?) to do: $\Large \frac{/sec}{flow rate = (ul/ sec)}$

Quality controls

  • Write an output log file as .txt in the output folder - include the initial event count, density fraction gated, final event fraction, calibrated or not, what beads file was used
  • Add initial event count to the summary data from python workflow (Hint: Line 209 in analyze_fcs.py)
  • (Added to output .csv data) Show the number of (final gated?) events in all files loaded
  • (will do graph instead) Look for anomalies in event count with any automated metric and add a warning.
    • (too much information, ignore for now) Implement a QC data output by counting events in each gating step (for every .fcs) and save as csv file -- and/Or make a plot to help look for anomalies?

html output

  • convert to Quarto document with jupytext : Advantages
    1. Make the saved html file after the folder name
    2. Run through R studio and integrate with the later plotting code..?
  • Also convert to pluto and see if we can slowly inject julia for practice?
  • Add a separator between output of individual well plots '----' and print the well name maybe?
    • Separate the beads from the data files
    • split into functions that can be run independantly too --
  • option to Save plots along the gating procedure for the first 3 (or 3 random) .fcs files? better than showing for all right?
  • Would it be worthwhile to save all plots along the gating process for each file in the dataset for manual QC purposes -- a RMD style html output would be good
    • How do we dump the plots from each .fcs file from within the loop into an RMD output? using jupyter-notebook
    • (can't use since subplots are already made by FlowCal.plots) Can squeeze into a pdf by doing subplots - source
    • Would a pluto notebook (or jupyter notebook) work for calling the python script and collecting all the plots generated?

Literature

  • read the flowcal introduction paper to understand the data storage format and theme etc.
    • wondering how extendable the formats are compared to the R/bioconductor ones that are building on the original FlowCore so more future proof?

Data display

Nice plot from tabor lab, 4c : Schmidl, Sebastian R., et al. "Rewiring bacterial two-component systems by modular DNA-binding domain swapping." Nature Chemical Biology 15.7 (2019): 690-698.

![[Pasted image 20220526020446.png]]

R/cytoset/cytoframes

Packages/tools

  • flowCore : core data format (flowFrame etc.)
  • flowWorkspace : gating and operations on facs :

samples, groups, transformations, compensation matrices, gates, and population statistics in the gating tree, which is represented as a GatingSet object in R. - Also includes the more efficient formats : cytosets for holding .fcs data

  • openCyto - Automated gating functions and workflows for serial gating
  • ggcyto - plotting with ggplot semantics

(note) : When installing these packages, specify the location of install for BiocManager with the command : BiocManager::install('ggcyto', lib = .libPaths()[1]) ; 1 being the path in the local "My Documents" folder instead of program files which needs write permissions.

data format

Need to understand the data format a bit flowframe/cytoframe = single files ; set = set of files. cyto - stores data in C for efficiency

  • The cytoset for S032 dataset has 13 observables, and 26 cells : I assume each observable before and after gain adjustment/filtering and such?

  • What is TIME in fl.set %>% colnames()? Time is recorded as each event passes through.. Since the first 4 in S032 have fewer, it was because I terminated the PBS samples as they were taking too long..

  • Need to figure out how to get the wellid from .fcs file headers. used it to name the individual files when saving them.

    • Could record these wellids and put them into a column that I could use to merge metadata
  • Learn how to explore a flowworkspace::cytoset

  • Make exploratory_plots.rmd run a loop and plot each colname modality of data as scatter and histogram -- the file will be pretty bulky already :(

  • How to incorporate sample names into the cytoset (would be great to show up on the automatic plots)

    • Looks like you can make a data.frame of the annotations and use pData(gating_set) <- annotations.data.frame (Cytometry on air, Ryan Duggan, youtube, 2014)
    • Or sampleNames(cytoset) <- new_values as a vector cytoset documentation
    • How to ensure that the order of samples is matched by the replacement file name? I guess the data.frame workflow might be easier, make a new column called well

file handling

  • Move the multifile .fcs to Archive when expanded into a folder already
  • Plan a workflow for analysing a subset of files from multiple sub-directories for S050 - multiday expt)
    • Can use simple regex? Gets only wells, not easy to get replicates + induced, uninduced etc. Loading with regex 'A01' - works, and adds numbers to the sampleNames in the cytoset to make them unique . Will be hard to match to proper day ; not generalizable..
    • Implement a function to load processed data, change filenames to the metadata and save to another folder ; also copy the logfile if present and delete the original folder. Start from merge_cytosets_and_save.R script.
        • include a key for the dataset captured using regex such as - letters : S0xx[:alpha:] / day key : d[:digit:] from the folder name ; with user input, skip this when details are present in the dataset already ; how does this change for outer folder vs S050 which has inner folders -- this is rare/onetime only
      • Show the first filename example and ask user to proceed before doing the batch renaming
    • Load these renamed samples using regex as needed ; set gates with a designated sample and save them somehow
      • How to save gates?
  • How to solve the problem of non unique names of wells within different runs inside S050? _This is solved in combine_data_workflow.R
  • (solved, by replacing pData and sampleNames) Attach the names from the template to cytoset? cf_rename_channel(x, old, new). there is non uniqueness here too for replicates -- need to merge before attaching names
  • (done) Another application : to join data from different experiments to plot on the same graph (48 + 51 + 78, 79 etc.). temporary fix would be to just copy the (processed) files/renamed with expt and sample name into a temp folder ; since it is just a few files not too much space consumed // Can copy the logfiles and delete the processed folders -- these can always be remade with FlowCal given the logfile parameters

Directory checking in 1-reading_multidata_fcs

  • Check for empty directory to future proof when a directory exists but has no files in it

  • Change sampleNames of the fcs set when reading files by passing a vector(?) to

    • name.keyword: An optional character vector that specifies which FCS keyword to use as the sample names. If this is not set, the GUID of the FCS file is used for sampleNames, and if that is not present (or not unique), then the file names are used.
  • Remove biological replicate counts from cytometry template layouts (metadata) - justification Could attach the numbering in R, will save some effort while making template?, Unless you want to mark biological replicates from technical replicates, which will complicate the analysis by making more columns .. etc. Could also think about this when multiple dilutions are read and need to be analyzed?

  • Equalize processing for Guava vs Sony :: use alias feature to harmonize names to green/red or fluorophores.. #manually supply the alias vs channel options mapping as a data.frame in read.FCS

FCS format notes :

  • Scaling: Do we need to do any non default transformation (such as Pn6 scaling - power transform for parameters stroed on a log scale) :

    • keyword(single_fcs) and element P1D shows "Logarithmic,6,1"
    • does that mean the data needs to be log transformed and is this done by the default linearize option when loading the file?
    • Ref documentation for load_cytoframe_from_fcs
  • (related to scaling) Why does R/flowWorkspace's load_cytoset_from_fcs() give different values than python/FlowCal's ... The summary stats differ between the software. The difference in graphs could be in their logicle transformations, but difference in summary stats needs explanation

    • Start with the transformation parameter in load_cytoframe.. :

      linearize (default), linearize-with-PnG-scaling, or scale. linearize transformation applies the appropriate power transform to the data.. Also, when the transformation keyword of the FCS header is set to "custom" or "applied", no transformation will be used.

    • FlowCal :

      An FCS file normally stores raw numbers as they are are reported by the instrument sensors. These are referred to as “channel numbers”. The FCS file also contains enough information to transform these numbers back to proper fluorescence units, called Relative Fluorescence Intensities (RFI), or more commonly, arbitrary fluorescence units (a.u.). Depending on the instrument used, this conversion sometimes involves a simple scaling factor, but other times requires a non-straigthforward exponential transformation. The latter is our case.

    • Look at .fcs header : using FlowCal/R for both raw data and processed data. Look for the transformation keyword to figure out if R is doing the right thing by ignoring the linearize transform - I'm passing trasformation = FALSE by default. Don't know why I did that. Maybe need to use the ignored FlowCal.transform.to_rfi() command and redo all the FlowCal processing?
  • Why are there negative values in the fluorescence? ex: S063_Marine promoters-2-green.png. Do the FSC, SSC also have negatives? Take one sample and test it ; check other expts

    • is this an artifact of the logicle scaling? : Re-look at the Github issues on FlowCal about negative values
    • Negative values could result from : baseline subtraction ; or compensation
    • FSC and SSC also have negative values : Ex: 'S063_B02_FSC_raw' : ![[S063a_B02_FSC-flowcal.png|350]]

Processing

  • How to merge replicates of data before taking summary() stats, ~median
  • R/flowWorkspace and python/FlowCal's data encoding seems different. Seen in S040 graphs - ridgeline/R vs points/Python
    • why does the R/flowWorkspace's calculation have positive medians vs negative medians for python/FlowCal's median calculation but with exact same magnitude. This happens only for a few samples on the lower end -- but needs to be sorted if using data from both sources to label the plot ![[Pasted image 20221006162045.png |1000]]
  • (use ggcyto for plotting, no need to get raw data out) How to get the raw-data from the cytoset to just plot mean/median similar to how Lauren Gambill's script does with flow..python
  • Break the processing modules into functions that can be called interactively ex: while figuring out the correct density fraction etc.
  • (ggcyto is already merging them) Merge data from biological replicates (as mentioned in paper) : ideas CytoTree ; post issue in flowWorkspace or flowCore?

    CombineFCS; This function can be used to combine 2 FCS files having a set of shared markers and return one FCS file (matrix) with the total number of cells is equal to the summation of cells in both FCS files, with each cell has an extended number of measured markers. CyTOFmerge

flow rate

  • Tried using the flowAI's flow_auto_qc but it does not work due to number of cells incompatibility
  • Try the interactive method flowiQC
  • Go into the functions of the flowAI package and fish out the one that is making the flow rate plot and use it by itself
  • Calculate an avg flow rate (either all stuff or only gated cells) once you understand what the $Time variable is storing.. and its units

Gating

  • Is there a way to gate a mindensity to be in the middle of two samples (one positive and one negative)? Start looking at openCyto and then flowWorkspace documentations

  • How to plot only the gated data points? fs <- getData(tbdata, "CD3") - ggcyto

  • Is there a flowWorkspace way to gate (quad gate..?) on a single sample and use the same gating parameters across all samples?

    • (use biexp): mindensity gate value causes convergence problems with logicle transform
  • figure out how to add filterID with mindensity

  • RemoveMargins : Remove margin events in flow cyt data. PeacoQC documentation; github;

    • The PeacoQC package provides quality control functions that will check for monotonic increasing channels and that will remove outliers and unstable events introduced due to e.g. clogs, speed changes etc. during the measurement of your sample. It also provides the functionality of visualising the quality control result of only one sample and the visualisation of the results of multiple samples in one experiment.

Plotting

  • (was due to error in geom_hex() ; solved by updating to ggplot 3.2.4) Explain the ugly plots of FSC, SSC in S063 processed and raw data. Could this be related to the bin edge [issue](flowcal issue) that also causes bimodality of negative samples around 0?
    • Something changed after Oct 6th commit 620f4e5 made a good plot for fluors for S040; aes_string(x = as.name(fluor_chnls[['red']])) format - which also produces a colourkey/legend.
    • Bad plot first seen on 31 Jan for S062a?
    • Steps to try
      • Try reproducing the old S044 plots with current scripts?
      • Read documentations and try geom_hex() -> stat_bin_hex() alternatives
      • Replace ggcyto call with ggplot call directly?
    • Ex: S063a_B02: ![[S063a_B02_raw.png|200]] vs flowcal ![[S063a_B02_raw-flowcal.png|300]]

Also check S067b1 : plotting with geom_hex() and geom_point() : hex is messing up representing the density bigtime - was there any change in the function behavior from the past? ![[S067b1_143_ww-raw-sctr_hex.png | 300]] ![[S067b1_143_ww-raw-sctr_point.png | 300]]

  • extra plot : Show the density of highly fluorescent events in FSC-SSC plot : Would be useful to see if any gating/density filtration by FlowCal is distorting the data
    • Not straightforward, has wierd errors ~
  • (too hard to automate) De-clutter ridges : Remove labels and medians for overlapping sample names only.
    • Or (Currently doing with user input) remove all medians and labels when > 2 categories (colours) occur for > 5 sample names for exploratory plotting
    • For high density data (> 1 or 2 colours/sample_category), need to remove median highlighting / control it with a switch -- good first application when you make the ridge plotting into a function
  • Plotting order : Currently using red as the default I assume, but can have a user key for primary fluorophore of interest 'red/2' or 'green/1'? Need to pass it to .fluor_colour in arrange_in_order_of_fluorophore()
  • Re-arrange the ridgeline plots in descending order of fluorescence. Currently it orders in ascending order so messes up for multiple coloured plots (check S063c)
  • Overlay accurate median value labels on a ridgeline/density plot -- difficulty is in calculating this since the replicates are merged while plotting but not before-hand. Currently the labels and values of "mean_medians" are shifted to lower than the lines plotted; visible in scale_x_log10()
  • (archive) figure out the mystery : scale_x_log10() is plotting different values compared to scale_x_flowjo_biexp(). The biexp shows negative values too, and MG1655 is centered around 0 almost which is what we expect from the machine's current calibration. The mean_median labels match the data better in this case too
  • Note: The name column of the pData appears as facets ; and samples with same name are merged before plotting (verified for histograms)
  • (fixed using aes_string(as.name(ch))) Pass channel names stored in a variable (by reference) to the ggcyto aes call does not work easily -- something about (quasi)quotation?
  • Error: xlim(c(-100, 1e3)) not working on ggcyto + geomhex + geomdensity2d plot
  • changing title of the plots : do it manually for now
  • Replicates merge: ggcyto automatically merges replicates!

Git organization

For manual scripts with possible changes across experiments, should they be in a separate branch for each dataset used vs copying over into different files? until of course, the script is eventually broken into functions that don't change across expts. 11-manual_gating.R -> S048..

  • Branch : new universal changes can be ported by merging main into branch (moving the branch along; but manually rejecting experiment specific changes) ; Static branch should be in a working state ; retains history
  • Copy : (universal) Changes in everything other than the current file will be retained but could break the code. breaking has not happened so far for qPCR files where I apply this philosophy using fucntions from general code
  • branch + copy single file: doing both could enable mergeability, cause confusion too? -- thinking too much.. go do some actual work now

R/flopR

What exactly does flopR do?

  1. Log10 transform, trim zero values (-Inf after log transf)
  2. Density gating
  3. Singlet gating ; May not be as useful if density gating already makes it clean enough.; beads seem to matter -- compare with FlowCal for S044 beads
  4. Autofluorescence (geometric mean) subtraction ; straightforward task
  5. MEFL calculation

Issues

  • Is too slow. 5 files, with calibration takes >5 mins even failing in the beads step.
    • flowClust of the beads is the slowest step, approx 1 min ; ggplot saving is also long -- too many points?
    • Put a profiler to figure out where it is being slowed down, and how to improve it..
user system elapsed
155.77 12.20 291.44
  • Some error when processing beads --

  • Get_calibration not selecting the correct bead population. Our sample has too much junk, but the process_fcs that was getting bacteria and then singlets was doing ok -- so maybe short-circuit the program to do this for the beads too..

Tasks

  • Fixed : issue on github! : showing error in process_fcs from the get_bacteria function.

     [1] "2 clusters found"
     [1] "only debris found"
     'x' must be an object of class flowframe
    
    • this was due to a hardcoded 1e4 threshold for calling debris. I guess our FSC/SSC voltages are on the lower side, and increases them might separate out debris more easily (especially for vibrio)
  • Fix the hardcoded beads file : pattern = utils::glob2rx("*beads*.fcs"). Can make a variable to fill with the appropriate well; with this pattern as the default

  • Feed in the calibration data as list of lists, with proper names

  • Channel names : Generalize with an if loop later, Currently hard-coded to Sony

  • Test a small dir with 5 files + beads and autofluorescence subtractor (if even relevant for Sony). _Running into errors : x is empty.., negative values retained in log-transform _

  • For now, I'm moving onto using flowcal; since flopR is too slow and failing with the density plot of beads. Decision was initially reconsidered after issue with cells not found was fixed

R/other packages

FlopR looks to be better than flowcal, does doublet discrimination and background subtraction as well (though it is trivial)

FlowAI does some QC truncation of data based on flow rate anomalies (OUP, 2016)

  • Is also useful to get the flowrates and volume estimations for our case on the sorter data
    • Need this to do CFU or cells/ml volume normalization
table 1 table 2
God knows What does
Who is God really?

Julia/python

Notes to reproduce running python scripts inside Pluto notebooks as a way to capture output plots in a nice html format - 21/5/22

Setting up Julia -> python

  • Load julia with julia in the gitbash terminal
  • Create new julia environment using ] then activate .
  • Add packages with add .. : Added Pluto, Plots, PyCall
  • load package with using PyCall. following PyCall documentation
  • Add python of the desired conda environment to path variable ENV["PYTHON"] = "C:/Users/new/.conda/envs/flowcal/python.exe"
  • Load Pkg with using Pkg and build PyCall (need to do everytime the path changes or any major changes are made) with Pkg.build("PyCall")

Error

  • from scripts_general_fns.g3_python_utils_facs import * causes ModuleNotFoundError("No module named 'scripts_general_fns'")
  • Same error in jupyter-lab
  • Looks like you need empty __init__.py files within subdirectories where the modules are located (or all dirs starting from the closest path in sys.path)

Jupyter-lab - to save html of plots

  • Idea: have the base python code as a standalone runnable script. and call the ..py from jupyter or pluto when you need the plots to be saved.

Quarto transition

Goal : Want to transition the jupyter notebook into a quarto (or Rmd) workflow so I can run it from within Rstudio.

adhoc_flowcal_analysis.qmd works when run in Rstudio chunk by chunk :) Should inspire the full workflow flowcal_html_output.qmd now!

  • current error : pandoc .. openFile: does not exist (No such file or directory) git issue

Features

  1. (convenience) Need to read in the filename directly from R script -.5-user_inputs.R instead of the python file g10.user_config.py to prevent duplication

Issues

  • Quarto has a different way of managing attachments. It is misplacing the attachments in two different places : 1. head directory - images ; 2. scripts_archive/ - libs : contains formatting etc. - Moving them both to html_outputs enables a cleaner .html output : Check S071a_d-1,0.html - [ ] Read on why this is different from .Rmd and if there are options to make them same
  • Modules loaded through R (using reculate) don't update when the .py file is updated. Requires to restart R
  • Show IPython.core.display.Markdown object in quarto document like jupyter notebook does it

connecting to conda env

Need to figure out how to use the correct conda env 'flowcal'. is this set by reticulate or by quarto or Rstudio itself?. Works with R's reticulate library if Sys.setenv(RETICULATE_PYTHON = "C:/Users/new/.conda/envs/flowcal/python.exe") is used before library(reticulate).

convert .ipnb to .qmd : jupytext?

  • Might be better to rewrite -- directory changes and modules loading might differ need to convert onetime only and not link notebooks
  • Use quarto convert - from jupytext docs below
  • Check jupytext docs

run with custom output.html like .Rmd?

how to do this? Need to get the quarto library says quarto docs

From the R console using the quarto R package:

library(quarto)
quarto_render("document.qmd") # defaults to html
quarto_render("document.qmd", output_format = "pdf")

Notes

quarto produces some extraneous folder that holds random files, is this going to be overwritten when running multiple datasets or needs to be saved for actual html to be visible?

Rmd test run

Seems to take longer but not sure. The markdown calls from python are displayed as output <IPython.core.display.Markdown object>. Maybe quarto will do better?

Notes of individual analyses

S050

  • fixed by resaving data? Noticed that name of fluophores: gfpmut3, mcherry2 names are not showing up in channel names on d0/d1/d8 data. Did we forget to unmix or something else gone wrong?
    • Is the fluor data completely missing? Confirmed in the Sony software: Happens when the spectrum does not show up, implying that compensation removes the fluorescenc channels from the data
    • Check sizes for comparable number of samples : data for d-1 has gfpmut3 and mcherry2 channels
data # samples size Notes
S048 35 164 MB
S050.d8 78 80 MB definitely missing data :(
S052 33 53 MB mGL, mSc present; Maybe less off target events due to thresholding on SSC instead of FSC?
  • S052 PBS wells were missing the fl channels and preventing making of cytoset.. Could this property have been carried over from the S050 dataset?

> colnames(fl.set) # vector [1] "FSC-A" "FSC-H" "FSC-W" "SSC-A" "SSC-H" "SSC-W" "TIME"

in comparison, S048 gives

[1] "FSC-A" "FSC-H" "FSC-W" "SSC-A" "SSC-H" "SSC-W"
[7] "mGreenLantern cor-A" "mGreenLantern cor-H" "mGreenLantern cor-W" "mScarlet-I-A" "mScarlet-I-H" "mScarlet-I-W"
[13] "TIME"

  • S050 issue when combining data workflow : 8 files are missing
    • Fixed by loading the proper files now
    • missing files presumably because they were not in the plate grid ; d-1/E12, G12, H12 - probably PBS and beads?

Data sets to re-analyze without MEFL

Use case : Currently doing this only for showing full distribution figs

  • (gating) Check for S070/S050 data if with and without MEFLing changes gated fractions significantly (gate above C-)

Memory paper

  • S050 : Need for gating (if it makes a difference?) ; (Fig S only) Can use it to show trend of decay in suppl.
  • (?) S071 : memory WW visual glance
  • (check) S070 Ara dose response

Marine

  • S063a, b, c : Redo processing without MEFLing

Recombinase memory paper

Here's some notes on analysis used for memory paper

  • Fig 2c : S070 - gated fractions
    • (with MEFLing)
    • gated on sample : B02 ; quantile 99, sets at 647.82
  • Fig 3 : S050 - gated fractions
    • No MEFLing
    • Gating :
      • green : 'A07_d-1' ; 99 quantile gate : sets at 172.39
      • red : 'A06_d-1' ; 300 manual gate -- since tail is long, eyeballing 300 based on full distribution
  • Fig S11? : S061_WW-AHL production
    • No MEFLing
    • Gate : C08 ; 99 quantile gate : sets at 2668.32

Logs of data archival

Processed data are getting too bulky and can be archived for old and published stuff onto box.com

  • Eventually will need a plan if these intermediate data need to be kept and where once I leave Rice
  • Discuss with Lauren the plan for data archival incl. big data such as flow cyt..

Here's the folders using the most space. Screenshot from WizTree! ![[flowcyt_disk_usage.png]]

Moved to box.com

**Recombinase memory

  • S071_combined and MEFL. Expt: Memory in WW II
  • S050 individual, S050_combined, and MEFLs for both. Expt: Long duration memory in E. coli

SS_marine work

  • S062_a and b. Marine promoter library I
  • S063_a, b and c_2 & combined. iteration II. Re-conjugation