This script performs statistical tests on a k-mer count matrix using parallel processing. It ranks the test results and writes the top results to output files. The script supports several statistical tests: t-test, pi-test, variance, Wilcoxon test, zero-inflated Wilcoxon test, and ANOVA.
python3 kamscan.py [options]
- -i, --input
  Input file path containing the count matrix.
- -o, --output_folder
  Folder path where the statistical test results will be stored. (You don't have to create the folder.)
- -t, --top_tags
  The number of top elements to be selected based on the best test statistics. Default: 200,000.
  - If the value of --top_tags is between 0 and 1 (e.g., 0.9), it will select the corresponding percentage of the best results (e.g., 0.9 will select the top 90%).
  - If the value is greater than 1 (e.g., 100), it will select the exact number of top results (e.g., 100 will select the top 100 results).
- -c, --chunk_size
  The size of each data chunk for parallel processing. Default: 10,000.
- -p, --processes
  The number of CPUs used for parallel processing. Default: Number of available CPUs.
- -d, --condition_folder
  Folder path containing the design files that will be processed.
- -n, --normalize
  Perform Counts Per Million (CPM) normalization ((raw_abundance * 1,000,000) / number_of_kmers_in_the_dataset), using a file containing the total number of k-mers for each sample. The file should be a text file with two columns separated by spaces or tabs, formatted as the design_kmers_nb_per_patient file in the GitHub repository.
- --test_type
  Specify the type of statistical test to be performed.
  Choices: ttest (t-test), pitest (pi-test), wilcoxon (Wilcoxon signed-rank test), variance, anova (Analysis of variance + covariates).
  Default: ttest.
- --covariates
  Covariates file for the anova (Analysis of variance + covariates) statistical test, formatted as the covariates_file.tsv file in the GitHub repository.
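The two interpretations of --top_tags described above can be illustrated with a small sketch. It is not taken from kamscan.py: the function names `resolve_top_n` and `select_top` are hypothetical, larger scores are assumed to be better, fractions are assumed to round down, and a value of exactly 1 is assumed to mean a count of one.

```python
def resolve_top_n(top_tags, n_results):
    """Translate a --top_tags value into a number of results to keep.

    Assumptions (not confirmed by the script): values strictly between
    0 and 1 are fractions of all results, rounded down; values of 1 or
    more are absolute counts, capped at the number of results.
    """
    if top_tags <= 0:
        raise ValueError("top_tags must be positive")
    if top_tags < 1:
        return int(n_results * top_tags)
    return min(int(top_tags), n_results)


def select_top(scores, top_tags):
    """Return the k-mers with the largest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:resolve_top_n(top_tags, len(ranked))]


# Made-up scores for four k-mers.
scores = {"AAAC": 0.2, "ACGT": 3.1, "GGTA": 1.7, "TTGC": 0.9}
print(select_top(scores, 0.5))  # fraction: keeps the best half
print(select_top(scores, 3))    # count: keeps the best three
```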
python3 kamscan.py -i kmer_count_matrix -o results_folder -t 10000 -c 5000 -p 8 -d condition_folder -n normalization_file.txt --test_type ttest
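To give an idea of what the -c and -p options control, here is a minimal sketch of chunked, parallel scoring. It is an illustration, not the code of kamscan.py. All function names are hypothetical, the rows are made up, and only the standard library is used. It computes a Welch t statistic with no p-value, whereas the script itself may rely on scipy.

```python
import math
import statistics
from functools import partial
from multiprocessing import Pool


def welch_t(values_a, values_b):
    """Welch's t statistic for two groups (each needs at least 2 values)."""
    diff = statistics.mean(values_a) - statistics.mean(values_b)
    standard_error = math.sqrt(
        statistics.variance(values_a) / len(values_a)
        + statistics.variance(values_b) / len(values_b)
    )
    if standard_error == 0:
        # Both groups are constant: 0 if equal, otherwise infinite.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / standard_error


def score_chunk(chunk, group_a_idx, group_b_idx):
    """Score every (kmer, counts) row of one chunk."""
    results = []
    for kmer, counts in chunk:
        group_a = [counts[i] for i in group_a_idx]
        group_b = [counts[i] for i in group_b_idx]
        results.append((kmer, welch_t(group_a, group_b)))
    return results


def make_chunks(rows, chunk_size):
    """Split the rows into consecutive chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def run_tests(rows, group_a_idx, group_b_idx, chunk_size=10_000, processes=1):
    """Score all rows chunk by chunk, optionally across several processes."""
    chunks = make_chunks(rows, chunk_size)
    worker = partial(score_chunk, group_a_idx=group_a_idx, group_b_idx=group_b_idx)
    if processes == 1:
        per_chunk = [worker(chunk) for chunk in chunks]  # serial path
    else:
        # On platforms that spawn workers, call this from under
        # an `if __name__ == "__main__":` guard.
        with Pool(processes) as pool:
            per_chunk = pool.map(worker, chunks)  # keeps chunk order
    return [item for chunk_result in per_chunk for item in chunk_result]


# Made-up matrix: three "normal" samples followed by three "tumoral" ones.
rows = [
    ("ACGT", [10, 12, 11, 30, 33, 31]),
    ("GGTA", [5, 5, 6, 5, 6, 5]),
    ("TTGC", [0, 1, 0, 8, 9, 7]),
]
results = run_tests(rows, [0, 1, 2], [3, 4, 5], chunk_size=2, processes=1)
for kmer, t_stat in results:
    print(kmer, round(t_stat, 2))
```

In this sketch, passing `chunk_size=5000, processes=8` would mirror `-c 5000 -p 8` in the command above. Larger chunks mean fewer tasks to dispatch, while smaller chunks spread work more evenly across CPUs.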
- Make sure to provide the correct condition folder containing the design files. The script accepts multiple design files as input, each containing a table with the sample_id in the first column and the corresponding condition in the second column. For example, the condition can be normal or tumoral.
- The script uses parallel processing with multiple CPUs to speed up the computation of statistical tests.
- The top tags, in the number set by --top_tags, are selected and stored in a file in the output folder.
- If --normalize normalization_file.txt is provided, the script will perform Counts Per Million (CPM) normalization using this file; otherwise, the statistical test is performed without normalization.
- The --test_type anova option fits a linear regression, allowing group effects to be tested while controlling for covariates in the form y ~ group + covariates. You will need to provide a tab-separated file with a header, with the sample_id in the first column followed by the covariates.
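As an illustration of the CPM formula used by --normalize, (raw_abundance * 1,000,000) / number_of_kmers_in_the_dataset, here is a minimal standard-library sketch. The function names, sample identifiers, and numbers are hypothetical. The parser assumes one "sample_id total" pair per line and skips any line whose second field is not a number, such as a header. The script's own implementation may differ, for example by using pandas.

```python
def parse_totals(lines):
    """Parse 'sample_id total' lines separated by spaces or tabs."""
    totals = {}
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) != 2:
            continue  # blank or malformed line
        try:
            totals[fields[0]] = float(fields[1])
        except ValueError:
            continue  # non-numeric second field, e.g. a header line
    return totals


def cpm_normalize(counts, totals):
    """Scale raw counts to Counts Per Million, sample by sample.

    counts: {kmer: {sample_id: raw_abundance}}
    totals: {sample_id: total number of k-mers in that sample}
    """
    return {
        kmer: {
            sample: raw * 1_000_000 / totals[sample]
            for sample, raw in per_sample.items()
        }
        for kmer, per_sample in counts.items()
    }


# Made-up totals and counts for two samples.
totals = parse_totals(["sample_id total", "patient_1\t2000000", "patient_2 500000"])
counts = {"ACGT": {"patient_1": 4, "patient_2": 4}}
normalized = cpm_normalize(counts, totals)
print(normalized)
```

The same raw count of 4 yields different CPM values for the two samples because their totals differ. This makes counts comparable across samples of different depth.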
To run this script, you will need the following Python packages:
pandas: 1.4.2
numpy: 1.22.3
scipy: 1.10.1
multiprocessing: (built-in with Python)
argparse: (built-in with Python)
functools: (built-in with Python)
shutil: (built-in with Python)
glob: (built-in with Python)
os: (built-in with Python)
statistics: (built-in with Python)
math: (built-in with Python)
To simplify the setup, a Conda environment YAML file is provided (kamscan.yaml). You can use it to create and activate the environment.