Transipedia/KamScan

KamScan

Perform parallel statistical tests over k-mer or contig count matrices

This script performs statistical tests on a k-mer count matrix using parallel processing. It ranks the statistical results and writes the top results to output files. Supported tests include the t-test, pi-test, variance, Wilcoxon test, zero-inflated Wilcoxon test, and ANOVA.
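
To illustrate what such a test computes, the following minimal sketch applies a two-sample t-test to every row of a small count matrix and ranks the rows. It is not KamScan's own code: the toy matrix, the group labels, and the choice to rank by ascending p-value are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical toy matrix: 4 k-mers (rows) x 6 samples (columns).
kmers = ["AAAC", "AAGT", "ACGT", "TTGA"]
counts = np.array([
    [10, 12, 11, 30, 33, 29],   # much higher in the second group
    [ 5,  6,  5,  5,  6,  5],   # same values in both groups
    [ 0,  1,  0,  4,  6,  5],   # moderately higher in the second group
    [20, 18, 22, 19, 21, 20],   # same mean in both groups
], dtype=float)

# Hypothetical design: first three samples "normal", last three "tumoral".
is_tumoral = np.array([False, False, False, True, True, True])

# One t-test per row: axis=1 compares the two groups along the sample axis.
t_stat, p_val = stats.ttest_ind(
    counts[:, ~is_tumoral], counts[:, is_tumoral], axis=1
)

# Rank rows from smallest to largest p-value (assumed ranking criterion).
order = np.argsort(p_val)
ranked = [(kmers[i], float(p_val[i])) for i in order]

for name, p in ranked:
    print(f"{name}\t{p:.4g}")
```

Computing all rows in one vectorized call is what makes it practical to split the matrix into chunks and hand each chunk to a separate worker process.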

Usage:

python3 kamscan.py [options]

Arguments:

  • -i, --input
    Input file path containing the count matrix

  • -o, --output_folder
    Folder path where the statistical test results will be stored. You do not need to create the folder beforehand.

  • -t, --top_tags
    The number of top elements to be selected based on the best test statistics. Default: 200,000.

    • If the value of --top_tags is between 0 and 1 (e.g., 0.9), the corresponding fraction of the best results is selected (0.9 selects the top 90%).
    • If the value is greater than 1 (e.g., 100), exactly that number of top results is selected (100 selects the top 100 results).
  • -c, --chunk_size
    The size of each data chunk for parallel processing. Default: 10,000.

  • -p, --processes
    The number of CPUs used for parallel processing. Default: Number of available CPUs.

  • -d, --condition_folder
    Folder path containing the design files that will be processed.

  • -n, --normalize
    Perform Counts Per Million (CPM) normalization, computed as (raw_abundance * 1,000,000) / number_of_kmers_in_the_dataset, using a file containing the total number of k-mers for each sample. The file must be a text file with two columns separated by spaces or tabs, formatted like the design_kmers_nb_per_patient file in the GitHub repository.

  • --test_type
    Specify the type of statistical test to be performed.
    Choices: ttest (t-test), pitest (pi-test), wilcoxon (Wilcoxon signed-rank test), variance, anova (Analysis of variance + covariates)
    Default: ttest.

  • --covariates
    File of covariates for the anova test (analysis of variance + covariates), formatted like the covariates_file.tsv file in the GitHub repository.
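
The two modes of --top_tags described above can be sketched as a small helper. This is an illustration under stated assumptions, not the script's actual logic: the function name resolve_top_tags is hypothetical, fractions are truncated to an integer, counts are capped at the number of available results, and a value of exactly 1 is rejected because its meaning is not documented.

```python
def resolve_top_tags(top_tags: float, n_results: int) -> int:
    """Return how many results to keep for a given --top_tags value.

    Assumed behaviour, based on the option description:
    - 0 < top_tags < 1: keep that fraction of all results (truncated);
    - top_tags > 1: keep exactly that many results, capped at n_results.
    A value of exactly 1 is not documented, so this sketch rejects it
    rather than guess.
    """
    if top_tags <= 0 or top_tags == 1:
        raise ValueError("top_tags must be in (0, 1) or greater than 1")
    if top_tags < 1:
        return int(n_results * top_tags)
    return min(int(top_tags), n_results)


print(resolve_top_tags(0.5, 10))      # fraction mode
print(resolve_top_tags(100, 1000))    # exact-count mode
```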

Example Usage:

python3 kamscan.py -i kmer_count_matrix -o results_folder -t 10000 -c 5000 -p 8 -d condition_folder --normalize normalization_file.txt --test_type ttest
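
The CPM normalization requested through the normalization file in this command follows the formula given for the --normalize option. A minimal sketch of that arithmetic, with made-up counts and made-up per-sample totals (all values and variable names here are assumptions for illustration):

```python
import numpy as np

# Hypothetical raw counts: 2 k-mers (rows) x 3 samples (columns).
raw_counts = np.array([
    [10.0, 20.0, 5.0],
    [ 0.0,  4.0, 1.0],
])

# Hypothetical total number of k-mers per sample, in column order
# (the kind of values a normalization file would provide).
total_kmers = np.array([2_000_000.0, 4_000_000.0, 500_000.0])

# CPM = raw_abundance * 1,000,000 / total number of k-mers of the sample.
# Broadcasting divides each column by its own sample total.
cpm = raw_counts * 1_000_000 / total_kmers

print(cpm)
```

After this step, counts from samples sequenced at different depths are on a comparable scale before any test is applied.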

Note:

  • Make sure to provide the correct condition folder containing the design files. The script accepts multiple design files as input, each containing a table with the sample_id in the first column and the corresponding condition in the second column. For example, the condition can be normal or tumoral.
  • The script uses parallel processing with multiple CPUs to speed up the computation of statistical tests.
  • The top tags meeting the statistical criteria are selected and stored in a file in the output folder.
  • If --normalize normalization_file.txt is provided, the script performs Counts Per Million (CPM) normalization using this file; otherwise, the statistical test is performed without normalization.
  • With --test_type anova, the test is a linear regression of the form y ~ group + covariates, which allows group effects to be tested while controlling for covariates. You need to provide a tab-separated file with a header, with the sample_id in the first column followed by the covariates.
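
One common way to test a group effect in a model of the form y ~ group + covariates is a nested-model F-test: fit the regression with and without the group term and compare the residual sums of squares. The sketch below shows that technique for a single k-mer with two groups and one covariate. It is an assumption about how such a test can be computed, not a description of KamScan's implementation; the function name, the 0/1 group coding, and the data are all hypothetical.

```python
import numpy as np
from scipy import stats


def group_effect_f_test(y, group, covariates):
    """F-test for the group term in y ~ group + covariates (two groups)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    intercept = np.ones((n, 1))
    cov = np.asarray(covariates, dtype=float).reshape(n, -1)
    grp = np.asarray(group, dtype=float).reshape(n, 1)  # 0/1 coding

    reduced = np.hstack([intercept, cov])       # y ~ covariates
    full = np.hstack([intercept, cov, grp])     # y ~ group + covariates

    def rss(design):
        # Ordinary least squares fit, then residual sum of squares.
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid)

    rss_reduced, rss_full = rss(reduced), rss(full)
    df_num = full.shape[1] - reduced.shape[1]   # parameters added by group
    df_den = n - full.shape[1]                  # residual degrees of freedom
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    p_value = float(stats.f.sf(f_stat, df_num, df_den))
    return float(f_stat), p_value


# Hypothetical data: abundance of one k-mer in 8 samples.
y = [10, 12, 13, 11, 21, 24, 20, 22]
group = [0, 0, 0, 0, 1, 1, 1, 1]            # e.g. normal = 0, tumoral = 1
age = [30, 45, 52, 38, 41, 60, 35, 48]      # a single covariate

f_stat, p_value = group_effect_f_test(y, group, age)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

With more than two groups, the group column would be replaced by several indicator columns, and df_num would grow accordingly.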

Dependencies:

To run this script, you will need the following Python packages:

pandas: 1.4.2
numpy: 1.22.3
scipy: 1.10.1
multiprocessing: (built-in with Python)
argparse: (built-in with Python)
functools: (built-in with Python)
shutil: (built-in with Python)
glob: (built-in with Python)
os: (built-in with Python)
statistics: (built-in with Python)
math: (built-in with Python)

To simplify the setup, a Conda environment YAML file is provided (kamscan.yaml). You can use it to create and activate the environment.

About

A simple Python script for scanning a k-mer count matrix and selecting lines with desired statistics
