Documentation for github.com/ForrestCKoch/scRNA-Dimentionality-Reduction¶
This ReadTheDocs site provides supporting documentation for the benchmarking of dimensionality reduction methods conducted by the authors of this repository. It is divided into XX main sections: - A - B - C
Step 00 – Data Prep¶
- 00-00_install-r-packages.R
- attempt to install prerequisite R packages …
- 00-01_download_sce_datasets.sh
- clones my forked repo for scRNA.seq.datasets, and uses the relevant bash/R scripts to create RDS files
- note: there may be some bugs in this step – consider rerunning to make sure it is smooth
- 00-02_convert_sce_to_csv.R
- for each *.rds in data/datasets/rds, convert the SingleCellExperiment data structure into csv format – two files are produced, one for log-transformed counts and one for non-transformed counts.
- 00-03_convert_csv_to_pd.py
- input: dataset name as ds
- converts *.csv.gz files to pickled pandas.DataFrame
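As a rough illustration of the last conversion step, the sketch below reads a gzip-compressed CSV into a pandas.DataFrame and pickles it. The function name csv_gz_to_pickle, the toy table, and the assumption that the first column holds row names are all hypothetical; the actual 00-03_convert_csv_to_pd.py may work differently.

```python
import os
import tempfile

import pandas as pd


def csv_gz_to_pickle(csv_path, pkl_path):
    """Read a gzip-compressed CSV and save it as a pickled DataFrame."""
    # index_col=0 assumes the first column holds row names (an assumption).
    df = pd.read_csv(csv_path, index_col=0, compression="gzip")
    df.to_pickle(pkl_path)
    return df


# Round trip on a tiny made-up table written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "toy.csv.gz")
pkl_path = os.path.join(tmp_dir, "toy.pkl")

toy = pd.DataFrame(
    {"cell_a": [0, 3], "cell_b": [5, 1]},
    index=["gene_1", "gene_2"],
)
toy.to_csv(csv_path, compression="gzip")

converted = csv_gz_to_pickle(csv_path, pkl_path)
reloaded = pd.read_pickle(pkl_path)
```

Pickled frames load faster than re-parsing the CSV, which is presumably why the later steps work from the pickles.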
Step 01 – Embedding Calculation¶
- src/generate_embedding.py
- 01-00-main_generate_all_embeddings.sh
- bash script wrapper which uses qsub to submit run_generate_embedding.sh for each required embedding
- 01-01-main_generate_gpu_embeddings.sh
- bash script wrapper to sequentially generate embeddings for the two gpu methods (vasc and ???)
Step 02 – IVM Calculation¶
- src/get_internal_validation_measures.py
Step 03 – Calculate Preservation of Global Structure¶
- 03-00-main_parallel_pairwise.sh
- see below
- 03-00-a_measure_pairwise_distances.py
- note: further work needed
- not sure what the difference between these three files is, but calculates the correlation of pairwise distances before and after DR
- move to subfolder?
- parallel_pairwise_args.txt
- used in combination with xargs and parallel_pairwise.sh to compute measure_pairwise_distances.py in parallel
- move to subfolder?
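The idea behind this step, correlating pairwise distances before and after dimensionality reduction, can be sketched as follows. This is a minimal illustration and not the code in 03-00-a_measure_pairwise_distances.py: the function name, the choice of Pearson correlation, and the Euclidean default are assumptions, and the data are random.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr


def pairwise_distance_correlation(original, embedding, metric="euclidean"):
    """Correlate the pairwise distances of the same cells in two spaces."""
    # pdist returns the condensed upper triangle, so each pair is counted once.
    d_orig = pdist(original, metric=metric)
    d_emb = pdist(embedding, metric=metric)
    r, _ = pearsonr(d_orig, d_emb)
    return r


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))  # 50 made-up cells, 20 made-up features

identical = pairwise_distance_correlation(X, X)        # same space
scaled = pairwise_distance_correlation(X, 2.0 * X)     # uniform rescaling
first_two = pairwise_distance_correlation(X, X[:, :2])  # crude "reduction"
```

A correlation close to 1 indicates that the embedding keeps far-apart cells far apart, which is what "preservation of global structure" refers to here.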
Step 04 – DBSCAN optimization preprocessing¶
- 04-00_get_eps_bounds.py
- calculate pairwise distances in order to determine the minimum value of epsilon such that DBSCAN will result in at least one cluster
- 04-01-main_generate_queues.sh
- calls generate_trials.py for each of the distance measures to create a queue of trials for DBSCAN
- suggested usage is cat data/results/eps_upperbounds.csv | head -n+2 | xargs -n1 -P 8 -I {} bash scripts/generate_queues.sh {}
- requires epsilon upperbounds to have been calculated – input format is rather particular to the output of get_eps_bounds.py
- 04-01-a_generate_trials.py
- randomly samples according to provided parameters to provide hyperparameters for DBSCAN optimization
- 04-02_setup_pool.sh
- sets up file-system based queue to run each of the DBSCAN optimization jobs
- need to include cluster_pool.sh
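The two ideas in this step, bounding epsilon from the pairwise distances and then randomly sampling hyperparameters, might look roughly like the sketch below. It follows one reading of the descriptions above and is not taken from 04-00_get_eps_bounds.py or 04-01-a_generate_trials.py. The function names, the uniform sampling scheme, and the min_samples range are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN


def min_eps_for_one_cluster(dist_matrix, min_samples):
    """Smallest eps at which some point becomes a DBSCAN core point."""
    # Sorting each row puts the self-distance (0) first, so column
    # min_samples - 1 is the radius at which that point gains enough
    # neighbours (itself included) to be a core point.
    kth = np.sort(dist_matrix, axis=1)[:, min_samples - 1]
    return float(kth.min())


def sample_trials(eps_low, eps_high, n_trials, min_samples_range=(2, 20), seed=0):
    """Draw random (eps, min_samples) pairs; uniform sampling is an assumption."""
    rng = np.random.default_rng(seed)
    eps = rng.uniform(eps_low, eps_high, size=n_trials)
    low, high = min_samples_range
    min_samples = rng.integers(low, high + 1, size=n_trials)
    return [{"eps": float(e), "min_samples": int(m)} for e, m in zip(eps, min_samples)]


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))  # made-up embedding
D = squareform(pdist(X))

eps_min = min_eps_for_one_cluster(D, min_samples=5)

# Reusing the same precomputed matrix avoids floating-point mismatches.
labels_at = DBSCAN(eps=eps_min, min_samples=5, metric="precomputed").fit_predict(D)
labels_below = DBSCAN(eps=0.99 * eps_min, min_samples=5, metric="precomputed").fit_predict(D)

trials = sample_trials(eps_min, float(D.max()), n_trials=10)
```

Below the bound every point is labelled as noise, so sampling epsilon from below it would only waste trials.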
Step 05 – DBSCAN cluster calculation¶
- src/run_dbscan_trials.py
- 05-00-main_run_trial.sh
- bash script wrapper around src/run_dbscan_trials.py
- 05-00-sge_self_submitting.sh
- self-submitting script for Raijin to repeatedly call cluster-pool.sh
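A single DBSCAN trial could be sketched as below: cluster an embedding with one set of hyperparameters and score the result against known labels with the adjusted Rand index (ARI). This is only an illustration; the function run_trial, the result dictionary, and the synthetic blobs are hypothetical and do not reflect the interface of src/run_dbscan_trials.py.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def run_trial(embedding, true_labels, eps, min_samples, metric="euclidean"):
    """Run DBSCAN once and summarise the clustering."""
    pred = DBSCAN(eps=eps, min_samples=min_samples, metric=metric).fit_predict(embedding)
    # DBSCAN marks noise points with -1, which is not a cluster.
    n_clusters = len(set(pred)) - (1 if -1 in pred else 0)
    return {
        "eps": eps,
        "min_samples": min_samples,
        "n_clusters": n_clusters,
        "ari": adjusted_rand_score(true_labels, pred),
    }


# Three well-separated synthetic blobs stand in for an embedding.
X, y = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Too small, reasonable, and too large values of eps.
results = [run_trial(X, y, eps, 5) for eps in (0.05, 1.0, 50.0)]
best = max(results, key=lambda r: r["ari"])
```

In this toy case the very small eps leaves every point as noise and the very large eps merges everything into one cluster, so only the middle value recovers the blobs.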
Step 06 – IVM Analyses¶
- 06-00_plot_internal_results_heatmap.py
- input: data/results/internal_validation_measures/internal_measures_reduced.csv
- creates the heatmap for Figure 2
- 06-01_ivm_concordance_analysis.R ivm_concordance_analysis-medians.R
- same as ivm_analysis.R, but uses medians in place of means
- 06-02_ivm_sign_tests.R
- calculate Sign test results between methods for each IVM
- 06-03_ivm_sign_test_heatmap.py
- creates Figure 3 displaying heatmaps of p-values comparing methods within an IVM
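For readers unfamiliar with the Sign test used in this step, here is a minimal Python sketch of a paired two-sided Sign test between two methods. The repository's own implementation is the R script 06-02_ivm_sign_tests.R, which this does not reproduce. The function name and the scores are made up for illustration.

```python
import numpy as np
from scipy.stats import binomtest


def sign_test(scores_a, scores_b):
    """Two-sided paired Sign test; ties are dropped."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n_pos = int(np.sum(diff > 0))
    n_neg = int(np.sum(diff < 0))
    n = n_pos + n_neg
    if n == 0:
        # Every pair is tied, so there is no evidence of a difference.
        return 1.0
    # Under the null hypothesis each sign is a fair coin flip.
    return binomtest(n_pos, n, p=0.5, alternative="two-sided").pvalue


# Made-up IVM scores for two methods on the same eight datasets.
method_a = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.52]
method_b = [0.58, 0.50, 0.64, 0.47, 0.60, 0.51, 0.69, 0.49]

p_value = sign_test(method_a, method_b)
p_same = sign_test(method_a, method_a)
```

Because method_a wins on all eight datasets, the p-value here is 2 × 0.5^8 ≈ 0.0078. The test uses only the direction of each difference, not its size.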
Step 07 – Global Structure Analyses¶
- 07-00_combine_ivm_correlation_data.py
- input: data/results/internal_validation_measures/internal_measures_reduced.csv, data/results/pairwise_distances/pairwise_correlations_all.csv
- output: data/results/pw_correlations/best_ivm_combined_pw_cor.csv
- 07-01_correlation_with_pw_cor.R
- calculate Spearman correlations between IVMs and preservation of global structure … consider removing as I don’t believe this is a valid analysis
- 07-02_pairwise_correlation_boxplots.R
- create boxplots showing the correlation of pairwise distances across datasets for each method
- 07-03_plot_pairwise_distance_correlations.py
- creates data/results/pw_correlations/pw_correlations_by_best_ivm.csv
- created writeup/plots/pw_correlations.pdf, but this is now commented out
Step 08 – DBSCAN results analysis¶
- 08-00_get_best_dbscan_trial_parallel.py
- find the “best” dbscan clusterings in parallel (using multiple cores)
- 08-01_plot_dbscan_results_heatmap.py
- input: metric, acc, opt
- creates heatmap with boxplots of the style for Figure 5
- 08-03_dbscan_median_analysis.R
- prints the median ARI across datasets for each method – one row for each pair of distance metric and IVM.
- should be used in Table 4 to replace Averages …
- note: this should probably be refactored to output a csv
- 08-03_alt_dbscan_mean_analysis.R
- prints the mean ARI of each method’s performance on DBSCAN optimization for each of the distance metric/ivm combinations
- note: this should probably be refactored to output a csv
- 08-04_dbscan_ivm-and-distance-metric_comparison.R
- calculate pairwise Sign and Wilcoxon signed-rank tests to compare within-DRM differences in ARI across different distance metric/IVM pairings.
- note: this should probably be refactored to output a csv
- 08-05_dbscan_seu-vrc_pairwise_tests.R
- calculate pairwise Sign and Wilcoxon signed-rank tests to compare between-DRM differences in ARI when using SEU-VRC optimized clusterings.
- 08-06_dbscan_seu-vrc_pairwise_heatmaps.py
- inputs: writeup/spreadsheets/dbscan_vrc-seu_sign-test-by-methods.csv, writeup/spreadsheets/dbscan_vrc-seu_sign-test-by-methods_filtered.csv, writeup/spreadsheets/dbscan_vrc-seu_wilcox-test-by-methods.csv, writeup/spreadsheets/dbscan_vrc-seu_wilcox-test-by-methods_filtered.csv
- note: check which script produces these csvs – probably dbscan_seu-vrc_pairwise_tests.R
- used in Figure 6
- 08-07_dbscan_vs_ivm_analysis.R
- plots mean/median rank of DRM in the IVM analysis vs DBSCAN analysis – also calculate
- 08-08_dbscan_barplots.R
- output barplots of the correlation of ARI from DBSCAN optimization with the dataset properties n_classes, n_samples, min_class_perc, max_class_perc, and protocol.
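Several notes above suggest refactoring the summary scripts to output a csv. One possible shape for the median-ARI summary, sketched in Python with pandas rather than R, is shown below. The column names (method, dataset, metric, ivm, ari), the method names, and all values are hypothetical and not taken from the repository's result files.

```python
import io

import pandas as pd

# Made-up long-format results: one row per method and dataset.
toy = pd.DataFrame(
    {
        "method": ["pca", "pca", "pca", "umap", "umap", "umap"],
        "dataset": ["d1", "d2", "d3", "d1", "d2", "d3"],
        "metric": ["euclidean"] * 6,
        "ivm": ["vrc"] * 6,
        "ari": [0.2, 0.4, 0.9, 0.5, 0.6, 0.7],
    }
)


def median_ari_table(results):
    """One row per (distance metric, IVM) pair, one column per method."""
    return (
        results.groupby(["metric", "ivm", "method"])["ari"]
        .median()
        .unstack("method")
    )


table = median_ari_table(toy)

# Write to an in-memory buffer here; a real script would pass a file path.
buffer = io.StringIO()
table.to_csv(buffer)
csv_text = buffer.getvalue()
```

Swapping .median() for .mean() would give the corresponding table of averages.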
Step 09 – Resource Analysis¶
- 09-00_get_resource_results.sh
- parse through the log files to get time and memory information for each embedding
- 09-01_plot_resources.R
- create some plots for memory/time usage
sc_dr¶
sc_dr.datasets¶
class sc_dr.datasets.FromPickle(path)¶
Load a Dataset from a pickled object. At this stage, however, labels will not be available for the Dataset.
It is currently being used in scripts to compare embeddings. Labels can be taken from the full dataset.
sc_dr.metrics¶
sc_dr.metrics.davies_bouldin_score(X, labels)¶
Taken from https://github.com/scikit-learn/scikit-learn/pull/12760 to avoid errors.
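The call pattern for a function with this (X, labels) signature can be illustrated with scikit-learn's own sklearn.metrics.davies_bouldin_score, used here as a stand-in on the assumption that the vendored sc_dr version is called the same way. The data below are synthetic.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)

# Two compact, well-separated clusters.
tight = np.vstack(
    [rng.normal(0.0, 0.1, size=(30, 2)), rng.normal(5.0, 0.1, size=(30, 2))]
)
# Same cluster centres, but much more spread within each cluster.
loose = np.vstack(
    [rng.normal(0.0, 2.0, size=(30, 2)), rng.normal(5.0, 2.0, size=(30, 2))]
)

score_tight = davies_bouldin_score(tight, labels)
score_loose = davies_bouldin_score(loose, labels)
```

Lower Davies-Bouldin values indicate more compact, better-separated clusters, so score_tight should come out below score_loose.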