scRNA-Dimentionality-Reduction¶

This ReadTheDocs site provides supporting documentation for the benchmarking of dimensionality reduction methods conducted by the researchers who wrote this repo. It is divided in to XX main sections: - A - B - C

Step 00 – Data Prep¶

00-00_install-r-packages.R
- attempt to install prerequisite R packages …
00-01_download_sce_datasets.sh
- clones into my forked repo for scRNA.seq.datasets, and uses the relevent bash/Rscripts to create RDS files
- note: there may be some bugs in this step – consider rerunning to make sure it is smooth
00-02_convert_sce_to_csv.R
- for each *.rds in data/datasets/rds, convert the SingleCellExperiment data structure into csv format – two files are produced:
for log-transformed and non-transformed counts.
00-03_convert_csv_to_pd.py
- input: dataset name as ds
- converts *.csv.gz files to pickled pandas.DataFrame

Step 01 – Embedding Calculation¶

src/generate_embedding.py
01-00-main_generate_all_embeddings.sh
- bash script wrapper which uses qsub to submit run_generate_embedding.sh for each required embedding
01-01-main_generate_gpu_embeddings.sh
- bash script wrapper to sequentially generate embeddings for the two gpu methods (vasc and ???)

Step 02 – IVM Calculation¶

src/get_internal_validation_measures.py

Step 03 – Calculate Preservation of Global Structure¶

03-00-main_parallel_pairwise.sh
- see below
03-00-a_measure_pairwise_distances.py
- note: further work needed
- not sure what the difference between these three files is, but caclculates the correlation of pairwise distances before and after DR
- move to subfolder?
parallel_pairwise_args.txt
- used in combination with xargs and parallel_pairwise.sh to compute measure_pairwise_distances.py in parallel
- move to subfolder?

Step 04 – DBSCAN optimization preprocessing¶

04-00_get_eps_bounds.py
- calculate pairwise distances in order to determine the minimum value of epsilon such that DBSCAN will result in at least one cluster
04-01-main_generate_queues.sh
- calls generate_trials.py for each of the distiance measures to create a queue of trials for DBSCAN
- suggested usage is cat data/results/eps_upperbounds.csv | head -n+2 | xargs -n1 -P 8 -I {} bash scripts/generate_queues.sh {}
- requires epsilon upperbounds to have been calculated – input format is rather particular to the output of get_eps_bounds.py
04-01-a_generate_trials.py
- randomly samples according to provided parameters to provide hyperparameters for DBSCAN optimization
04-02_setup_pool.sh
- sets up file-system based queue to run each of the DBSCAN optimization jobs
- need to include cluster_pool.sh

Step 05 – DBSCAN cluster calculation¶

src/run_dbscan_trials.py
05-00-main_run_trial.sh
- bash script wrapper around src/run_dbscan_trials.py
05-00-sge_self_submitting.sh
- self submitting script for Raijin to repetedly call cluster-pool.sh

Step 06 – IVM Analyses¶

06-00_plot_internal_results_heatmap.py
- input: data/reults/internal_validation_measures/internal_measures_reduced.csv
- creates the heatmap for Figure 2
06-01_ivm_concordance_analysis.R ivm_concordance_analysis-medians.R
- same as ivm_analysis.R, but uses medians in place of means
06-02_ivm_sign_tests.R
- calculate Sign test results between methods for each IVM
06-03_ivm_sign_test_heatmap.py
- creates Figure 3 displaying heatmaps of p-values comaparing methods within an IVM

Step 07 – Global Structure Analyses¶

07-00_combine_ivm_correlation_data.py
- input: data/results/internal_validation_measures/internal_measures_reduced.csv, data/results/pairwise_distances/pairwise_correlations_all.csv
- output: data/results/pw_correlations/best_ivm_combined_pw_cor.csv
07-01_correlation_with_pw_cor.R
- calculate spearman correlations between IVMs and preservation of global structure … consider removing as I don’t believe this is a valid anlysis
07-02_pairwise_correlation_boxplots.R
- create boxplots showing the correlation of pairwise distances across datasets for each method
07-03_plot_pairwise_distance_correlations.py
- creates data/results/pw_correlations/pw_correlations_by_best_ivm.csv
- created writeup/plots/pw_correlations.pdf, but this is now commented out

Step 08 – DBSCAN results analysis¶

08-00_get_best_dbscan_trial_parallel.py
- find the “best” dbscan clusterings in parallel (using multiple cores)
08-01_plot_dbscan_results_heatmap.py
- input: metric, acc, opt
- creates heatmap with boxplots of the style for Figure 5
08-03_dbscan_median_analysis.R
- prints the median ARI across datasets for each method – one row for each pair of distance metric and IVM.
- should be used in Table 4 to replace Averages …
- note: this should probably be refactored to output a csv
08-03_alt_dbscan_mean_analysis.R
- prints the mean ARI of each method’s performance on DBSCAN optimization for each of the distance metric/ivm combinations
- note: this should probably be refactored to output a csv
08-04_dbscan_ivm-and-distance-metric_comparison.R
- calculate pairwise Sign and Wilcoxon Sign Rank tests to compare within DRM differences in ARI from different distance metric/IVM pairings.
- note: this should probably be refactored to output a csv
08-05_dbscan_seu-vrc_pairwise_tests.R
- Calculate pairwise Sign and Wilcoxon Sign Rank tests to compare between DRM differences in ARI when using SEU-VRC optimized clusterings.
08-06_dbscan_seu-vrc_pairwise_heatmaps.py
- intputs: writeup/spreadsheets/dbscan_vrc-seu_sign-test-by-methods.csv,`writeup/spreadsheets/dbscan_vrc-seu_sign-test-by-methods_filtered.csv`,
writeup/spreadsheets/dbscan_vrc-seu_wilcox-test-by-methods.csv, writeup/spreadsheets/dbscan_vrc-seu_wilcox-test-by-methods_filtered.csv - note: check which script produce these csvs – probably dbscan_seu-vrc_pairwise_tests.R - used in Figure 6
08-07_dbscan_vs_ivm_analysis.R
- plots mean/meadian rank of DRM in the IVM analysis vs DBSCAN analysis – also calculate
08-08_dbscan_barplots.R
- output barplots of correlation of ARI from DBSCAN optimization with dataset properties n_classes, n_samples, min_class_perc,
max_class_perc, protocol.

Step 09 – Resource Analysis¶

09-00_get_resource_results.sh
- parse through the log files to get time and memory information for each embedding
09-01_plot_resources.R
- create some plots for memory/time usage