In this part, we will acquaint you with widely used bioinformatics public data resources and computational tools essential for genomic analysis. You do not need to use all of them, but you can treat this list as a foundation for building your own tool map for future research. Some of these resources will also be used in the upcoming pipeline examples, providing practical insight into their application.
In our research journey here, working with genomic datasets will be a frequent and essential part of the work, since these datasets serve as the foundation for most analyses and investigations. Here we list some widely used data sources:
Database | Contents | When you might need it | Similar resources |
---|---|---|---|
Reference data | | | |
Ensembl | Model organisms: assembly and annotation | Analysis of human or mouse sequencing data | RefSeq |
InterPro | Protein families | Annotate amino acid sequences of unknown function | |
KEGG | Hierarchical gene functions | Functional analysis | |
GTDB | Microbiome reference genomes | Metagenomic analysis | RefSeq |
BV-BRC | Pathogen database | | |
Genomic sequencing data | | | |
NCBI | Almost everything | Search for specific data | EBI, GEO, SRA |
TCGA | Cancer genomic data | | |
ENCODE | Functional genomic data | | |
HMP | Real microbial data | | |
CAMI | Simulated microbial data for computational challenges | | |
10X Genomics | Single-cell data | | |
For data mining | | | |
Kaggle | Cleaned data for the data science community | | |
Most of these tools run on Linux or are written in Python or R, and it is common to find yourself switching among these environments, so stay adaptable and flexible. The first lesson I learned in bioinformatics, which resolves over 95% of my challenges, is simple: when in doubt, Google it (nowadays you can also ask ChatGPT).
Task | Software | Comments |
---|---|---|
Comprehensive platform | Galaxy | A web-based platform that provides a user-friendly interface for running bioinformatics tools and workflows. |
Sequence search | BLAST | |
Aligner | BWA, Bowtie2, STAR, Minimap2 | |
Trim adapter | Cutadapt, Trimmomatic | |
Sequence analysis | FastQC, Samtools, Bedtools, Picard | |
Genome analysis | GATK, IGV | |
NCBI interaction | SRA-Toolkit, NCBI Entrez Utilities, NCBI Datasets | A set of tools for accessing and retrieving biological data from the NCBI databases (see the sketch after this table). |
Genome browser | UCSC genome browser, WashU Epigenome Browser | Visualize your data with genome annotations |
Differential analysis | DESeq2, EdgeR | |
Enrichment analysis | GO enrichment analysis, DAVID | |
Metagenomics | Sourmash, Kraken, QIIME |
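As a quick illustration of the NCBI-interaction tools listed above, here is a minimal sketch of pulling raw reads from SRA with the SRA-Toolkit; the accession number below is a made-up placeholder, not a real dataset.

```bash
# placeholder accession for illustration only; replace with a real SRA run ID
ACC=SRR0000001

# download the .sra object for this run
prefetch ${ACC}

# convert it to FASTQ; paired-end reads go to separate files
mkdir -p fastq
fasterq-dump ${ACC} --split-files -O fastq/
```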
conda activate bootcamp
# conda install -y -c conda-forge parallel
mkdir -p test_parallel
cd test_parallel
######## example1: when you have tons of scripts
for i in {1..100}; do
echo -e "sleep 1\necho done script ${i}" > temp_script_${i}.sh
done
# the usual way if I want to run all of them
for script in *.sh; do
echo "Processing ${script} now"
bash ${script}
done
# you can press Ctrl+C to terminate the loop
# running serially can get painfully slow when each script is long, such as a sequence alignment
# a parallel way
ls *.sh | parallel -j 10 bash {}
# clean folder
rm *.sh
######## example2: when you need to process tons of files
for i in {1..100}; do
echo "This is file ${i}" > temp_file_${i}.txt
done
# let's define some function
process_some_file ()
{
input_file=$1
some_other_parameter=$2
# whatever functions you like
sleep 1
echo "We have a second parameter ${some_other_parameter}"
cat ${input_file}
}
export -f process_some_file
# now let's run it in parallel
ls temp*txt | parallel -j 10 process_some_file {} 100
# clean dir
rm temp*txt
######## example3: when you want to use the same file but loop through tons of parameters
loop_parameters ()
{
input_file=$1
para1=$2
para2=$3
para3=$4
# whatever functions you like
sleep 1
echo "Running command for ${input_file} with para1 ${para1}, para2 ${para2}, para3 ${para3}"
}
export -f loop_parameters
touch assume_im_test_file.fastq
# ::: is used to separate the command from the argument list.
# check here: parallel -h
parallel -j 10 loop_parameters assume_im_test_file.fastq {1} {2} {3} ::: {1..10} ::: M F ::: {1..22}
# sometimes your parameters are complex
echo -e "a1\nb1\nc1\nd1" > para1_list.txt
echo -e "a2\nb2\nc2\nd2\ne2" > para2_list.txt
echo -e "a3\nb3\nc3\nd3\ne3\nf3" > para3_list.txt
# solution1:
parallel -j 10 loop_parameters assume_im_test_file.fastq {1} {2} {3} ::: $(cat para1_list.txt) ::: $(cat para2_list.txt) ::: $(cat para3_list.txt)
# solution 2: merge everything together first
while IFS= read -r line1; do
while IFS= read -r line2; do
while IFS= read -r line3; do
echo "$line1 $line2 $line3" >> merged_parameters.txt
done < para3_list.txt
done < para2_list.txt
done < para1_list.txt
cat merged_parameters.txt | parallel --colsep " " -j 10 loop_parameters assume_im_test_file.fastq {1} {2} {3}
# clean folder
rm *txt assume_im_test_file.fastq
As bioinformaticians, we constantly handle complex datasets, and making sure they are cleaned according to our requirements demands careful attention.
######## Example 4: partition fasta file of KEGG KOs
echo ">aaa:Acav_0001|dnaA; chromosomal replication initiator protein|K02313
MTEEPSRSPDFDT
>aaa:Acav_0002|dnaN; DNA polymerase III subunit beta [EC:2.7.7.7]|K02338
MIVLKATQDKVL
>aaa:Acav_0003|gyrB; DNA gyrase subunit B [EC:5.6.2.2]|K02470
MTAENTLPEPTLP
>aaa:Acav_0005|hsdM; type I restriction enzyme M protein [EC:2.1.1.72]|K03427
MLDAQQQYAIRSAL
>aaa:Acav_0007|hsdM; type I restriction enzyme M protein [EC:2.1.1.72]|K03427
MSLTLDTLESWLWE
>aaa:Acav_0008|hsdS; type I restriction enzyme, S subunit [EC:3.1.21.3]|K02470
MKSMGTAEQVTPKA
>aaa:Acav_0014|TROVE2, SSA2; 60 kDa SS-A/Ro ribonucleoprotein|K02338
MVNTQLFQTLKAR
>aac:Aaci_2089|hemE, UROD; uroporphyrinogen decarboxylase [EC:4.1.1.37]|K02313
MRNGDFAVNNRFLR" > pseudo_ko_seq.faa
# if we need a collection of all KOs
grep '>' pseudo_ko_seq.faa | rev | cut -d'|' -f1 | rev | sort -u
# grep '>': keep the FASTA header lines (those containing ">")
# rev: reverse each line back to front
# cut -d'|' -f1: take the 1st field, which is the last field of the original line (the KO ID)
# rev: reverse back to the normal order
# sort -u: sort and deduplicate
# if we want to group sequences by KO
grep '>' pseudo_ko_seq.faa | rev | cut -d'|' -f1 | rev | sort -u > pseudo_ko_list.txt
for ko in $(cat pseudo_ko_list.txt); do
echo ${ko}
grep -A 1 --no-group-separator -Fw ${ko} pseudo_ko_seq.faa
echo " "
done
# Question: can we do this in parallel?
cat pseudo_ko_list.txt | parallel -j 4 'ko={}; grep -A 1 --no-group-separator -Fw ${ko} pseudo_ko_seq.faa > pseudo_out_${ko}.txt'
# clean dir
rm pseudo_*
Gradually get familiar with the following command-line utilities, and then enjoy playing with them (a small example combining a few of them follows the table).
Tools | Function |
---|---|
awk, sed | text manipulation and extraction |
grep | search for patterns in files |
cut | select columns/fields |
sort | sort lines |
find | search for files in a directory tree |
uniq | deduplicate adjacent lines |
echo | print text |
xargs | build and run commands from standard input |
head, tail | show the beginning/end of a file |
tr | translate or delete characters (often delimiters) |
wget, curl | download files |
pwd, cd | print/change the working directory |
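To see how a few of these commands chain together, here is a minimal sketch; the file samples.tsv and its columns are invented purely for illustration.

```bash
# invent a small tab-separated table: sample_id, condition, read_count
printf "s1\ttumor\t1200\ns2\tnormal\t800\ns3\ttumor\t950\n" > samples.tsv

# keep only tumor samples, pull out the read-count column, sort numerically
grep -w "tumor" samples.tsv | cut -f3 | sort -n

# count how many samples fall into each condition
cut -f2 samples.tsv | sort | uniq -c

# sum the read counts with awk
awk -F'\t' '{sum += $3} END {print "total reads:", sum}' samples.tsv

# clean up
rm samples.tsv
```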
Snakemake is a powerful, Python-based workflow management tool that aims to improve the reproducibility and scalability of workflows across different systems. It allows users to define customized rules for deriving output files from input files, making it easier to build a workflow.
In this part of module 3, we will first introduce the basic syntax of Snakemake, and then use an ATAC-seq pipeline as an example to show how to build it with Snakemake. The interactive code of the ATAC-seq analysis can be found here.
Basic knowledge about Python and Bash is required.
To build a workflow with Snakemake, you need to create a so-called `Snakefile`, which commonly ends with the suffix `.smk` (e.g., "workflow.smk"). Adding this suffix to the file name helps people recognize it as a snakefile, but Snakemake can still read the file even if the suffix is absent. A snakefile is a collection of different rules.
A rule is composed of a few directives: `input`, `params`, `output`, and `shell`/`run`.
Here is a basic framework of a rule:
rule <a rule name>:
    input:
        <your input files>
    params:
        <some required parameters>
    output:
        <your output files>
    shell/run:
        <your command lines>
A Snakemake rule has to start with the keyword `rule` followed by a name of your choice (e.g., "download_data"). The `input` and `output` directives are followed by lists of files (separated by `,`) that the rule is expected to use or create. Both are optional directives that we will discuss later when we need them. The `params` directive is also optional and holds parameters used in the command lines. The last, and required, directive is `shell` or `run`, followed by a shell command string (for `shell`) or a few simple lines of Python code (for `run`) to execute.
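For instance, here is a minimal, hypothetical rule; the file names and the gzip command are made up for illustration and are not part of the ATAC-seq pipeline below.

```
rule compress_report:
    input:
        "results/report.txt"
    output:
        "results/report.txt.gz"
    shell:
        "gzip -c {input} > {output}"
```

When asked to produce results/report.txt.gz, Snakemake looks for results/report.txt and runs the shell command, substituting {input} and {output} with those file names.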
We will use an ATAC-seq pipeline as an example to further demonstrate how to use these directives.
To run a workflow described in a snakefile, we can execute:
snakemake -c <number of cores to use> -s <path to snakemake file>
The `-c` (or `--cores`) flag is required and specifies the number of cores to use. Snakemake will use that many cores on the machine to run the workflow in parallel, according to the dependencies between rules. If `-c` is given without a number, all available cores are used.
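For example, assuming your snakefile were saved as workflow.smk, the following would run the whole workflow on four cores:

```bash
snakemake -c 4 -s workflow.smk
```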
It is always a good habit to do a dry run first, to check for syntax errors or missing pieces in the snakefile:
snakemake -n -s <path to snakemake file>
With the `-n` (or `--dry-run`) flag, Snakemake only shows the execution plan instead of actually performing the steps.
Testing is crucial in software development. Before a software product reaches the end user, it typically undergoes various tests, including integration tests, system tests, and acceptance tests. The most basic level of software testing is the unit test, a form of automated testing. If you work in computational biology, you will often develop algorithms or software, and learning how to write unit tests can greatly benefit your research.
In this part, we will learn how to use a popular Python testing framework, `pytest`. It is open source and lets developers write concise test suites for unit testing, functional testing, and API testing. Compared with Python's built-in testing framework, `unittest`, I find it more flexible and convenient.
Basic knowledge about Python is required.
Since pytest is a third-party package, you first need to install it, simply via a pip command:
pip install pytest
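As a quick taste, here is a minimal, hypothetical test file (test_gc_content.py; the function and file names are invented for illustration). Pytest automatically discovers files and functions whose names start with `test_`:

```python
# test_gc_content.py
import pytest


def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def test_gc_content_basic():
    # half of the bases in "ATGC" are G or C
    assert gc_content("ATGC") == 0.5


def test_gc_content_empty_sequence():
    # an empty sequence should raise an error
    with pytest.raises(ValueError):
        gc_content("")
```

Running `pytest` (or `pytest -v` for more detail) in the same directory will discover and execute both tests; in a real project the function under test would live in its own module and be imported, but keeping it here makes the sketch self-contained.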
Please click this link to access a Jupyter notebook with illustrations.