KamScan

Perform parallel statistical tests over k-mer or contig count matrices

This script performs various statistical tests on a k-mer count matrix using parallel processing. It ranks the statistical results and writes the top-ranked ones to output files. The script supports different types of statistical tests, including the t-test, pi-test, variance, Wilcoxon test, zero-inflated Wilcoxon test, and ANOVA.

Usage:

python3 kamscan.py [options]

Arguments:

-i, --input
Input file path containing the count matrix
-o, --output_folder
Folder path where the statistical test results will be stored. (You don't have to create the folder)
-t, --top_tags
The number of top elements to be selected based on the best test statistics. Default: 200,000.
- If the value of --top_tags is between 0 and 1 (e.g., 0.9), it will select the corresponding percentage of the best results (e.g., 0.9 will select the top 90%).
- If the value is greater than 1 (e.g., 100), it will select the exact number of top results (e.g., 100 will select the top 100 results).
-c, --chunk_size
The size of each data chunk for parallel processing. Default: 10,000.
-p, --processes
The number of CPUs used for parallel processing. Default: Number of available CPUs.
-d, --condition_folder
Folder path containing the design files that will be processed.
-n, --normalize
Perform Counts Per Million (CPM) normalization, computed as (raw_abundance * 1,000,000) / number_of_kmers_in_the_dataset, using a file containing the total number of k-mers for each sample. The file should be a text file with two columns separated by spaces or tabs, formatted like the design_kmers_nb_per_patient file in the GitHub repository.
--test_type
Specify the type of statistical test to be performed.
Choices: ttest (t-test), pitest (pi-test), wilcoxon (Wilcoxon signed-rank test), variance, anova (Analysis of variance + covariates)
Default: ttest.
--covariates
File with covariates for the anova (Analysis of variance + covariates) statistical test, formatted like the covariates_file.tsv file in the GitHub repository.

Example Usage:

python3 kamscan.py -i kmer_count_matrix -o results_folder -t 10000 -c 5000 -p 8 -d condition_folder --normalize normalization_file.txt --test_type ttest

Note:

Make sure the condition folder you provide contains the design files to be processed. The script accepts several design files as input, each containing a table with the sample_id in the first column and the corresponding condition in the second column. For example, the condition can be normal or tumoral.
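To illustrate the two-column layout described above, a design file could be read along these lines. This is a minimal sketch, not KamScan's own parser: the function name `read_design`, the whitespace delimiter, and the absence of a header line are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def read_design(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sample_id to its condition.

    Assumes (hypothetically) whitespace-separated columns and no header line.
    """
    design = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        sample_id, condition = fields[0], fields[1]
        design[sample_id] = condition
    return design


# Made-up content; a real design file would be read with open(path).
example = ["sample_A\tnormal", "sample_B\ttumoral", "sample_C\ttumoral"]
design = read_design(example)
group_sizes = Counter(design.values())
print(design)
print(group_sizes)
```

Counting samples per condition, as in `group_sizes`, is a quick way to check that each group is populated before running a test.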
The script uses parallel processing with multiple CPUs to speed up the computation of statistical tests.
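The interplay between --chunk_size and --processes can be pictured as follows. This is only a sketch of the general chunk-and-map pattern using the standard `multiprocessing` module; the function names and the placeholder statistic (a row mean) are hypothetical and do not reflect how kamscan.py is actually organized.

```python
from multiprocessing import Pool
from typing import Iterator, List, Sequence


def iter_chunks(rows: Sequence, chunk_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most chunk_size rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def score_chunk(chunk: Sequence[Sequence[float]]) -> List[float]:
    """Placeholder per-row statistic (row mean); a real test would go here."""
    return [sum(row) / len(row) for row in chunk]


def run_parallel(rows: Sequence, chunk_size: int, processes: int) -> List[float]:
    """Score every row, distributing chunks over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        per_chunk = pool.map(score_chunk, iter_chunks(rows, chunk_size))
    # pool.map preserves chunk order, so results line up with the input rows.
    return [score for chunk in per_chunk for score in chunk]


if __name__ == "__main__":
    # Made-up count rows: one row per k-mer, one value per sample.
    rows = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0], [0.0, 0.0, 6.0]]
    print(run_parallel(rows, chunk_size=2, processes=2))
```

Larger chunks mean fewer hand-offs between processes but more memory per worker, which is the usual trade-off when tuning these two options.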
The top tags, in the number or proportion set by --top_tags, are selected according to their test statistics and stored in a file in the output folder.
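The two readings of --top_tags (a fraction below 1, or an absolute count) can be sketched as below. The helper names, the choice to rank by ascending p-value, and the rounding of fractional selections are assumptions made for illustration; the README does not say how a value of exactly 1 is treated, and this sketch reads it as a count.

```python
def resolve_top_n(top_tags: float, n_results: int) -> int:
    """Turn a --top_tags value into a number of results to keep."""
    if top_tags <= 0:
        raise ValueError("top_tags must be positive")
    if top_tags < 1:
        # Fraction of all results; rounding to the nearest integer is an assumption.
        return int(round(top_tags * n_results))
    # Absolute count, capped at the number of available results.
    return min(int(top_tags), n_results)


def select_top(results, top_tags):
    """Keep the best-ranked (tag, p_value) pairs, smallest p-value first."""
    ranked = sorted(results, key=lambda pair: pair[1])
    return ranked[:resolve_top_n(top_tags, len(ranked))]


# Made-up results for four tags.
results = [("AAAC", 0.20), ("ACGT", 0.001), ("GGTA", 0.03), ("TTGC", 0.50)]
print(select_top(results, 2))    # absolute count
print(select_top(results, 0.5))  # fraction of all results
```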
If --normalize normalization_file.txt is provided, the script performs Counts Per Million (CPM) normalization using this file; otherwise, the statistical test is performed without normalization.
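The CPM formula given for --normalize, (raw_abundance * 1,000,000) / number_of_kmers_in_the_dataset, works out as in the sketch below. The function names and the per-sample totals are made up for illustration; in practice the totals would come from the two-column normalization file.

```python
def cpm(raw_count: float, total_kmers: int) -> float:
    """Counts Per Million: (raw_abundance * 1,000,000) / total k-mers in the sample."""
    return raw_count * 1_000_000 / total_kmers


def normalize_row(counts: dict, totals: dict) -> dict:
    """Normalize one k-mer's counts, sample by sample."""
    return {sample: cpm(count, totals[sample]) for sample, count in counts.items()}


# Made-up totals, standing in for the contents of the normalization file.
totals = {"sample_A": 2_000_000, "sample_B": 500_000}
row = {"sample_A": 10, "sample_B": 10}
normalized = normalize_row(row, totals)
print(normalized)
```

The same raw count of 10 yields different normalized values because the two samples differ in total k-mer count, which is the effect the normalization is meant to correct for.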
With --test_type anova, the test is a linear regression that allows group effects to be tested while controlling for covariates, in the form y ~ group + covariates. You will need to provide a tab-separated file with a header, with the sample_id in the first column followed by the covariates.
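To make the model form y ~ group + covariates concrete, the sketch below fits it by ordinary least squares with NumPy on made-up, noise-free data. It only shows how the design matrix is laid out and how the group coefficient is estimated; the variable names and the single `age` covariate are hypothetical, and the significance testing that KamScan reports is not reproduced here.

```python
import numpy as np

# Made-up data for one k-mer across six samples.
group = np.array([0, 0, 0, 1, 1, 1], dtype=float)      # e.g. 0 = normal, 1 = tumoral
age = np.array([30, 45, 60, 35, 50, 65], dtype=float)  # one hypothetical covariate
y = 1.0 + 2.0 * group + 0.5 * age                      # noise-free, so the fit is exact

# Design matrix for y ~ group + age: intercept, group indicator, covariate.
X = np.column_stack([np.ones_like(group), group, age])

# Ordinary least squares estimate of the coefficients.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, group_effect, age_effect = coef
print(group_effect)
```

Because the covariate has its own column, the estimated group effect reflects the difference between conditions after accounting for age rather than the raw difference in means.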

Dependencies:

To run this script, you will need the following Python packages:

pandas: 1.4.2
numpy: 1.22.3
scipy: 1.10.1
multiprocessing: (built-in with Python)
argparse: (built-in with Python)
functools: (built-in with Python)
shutil: (built-in with Python)
glob: (built-in with Python)
os: (built-in with Python)
statistics: (built-in with Python)
math: (built-in with Python)

To simplify the setup, a Conda environment YAML file is provided (kamscan.yaml). You can use it to create and activate the environment.
