In this part, we will acquaint you with widely used bioinformatics public data resources and computational tools essential for genomic analysis. You do not need to use all of them, but you can treat this list as a foundation for building your own tool map for future research. Some of these resources will also be used in the upcoming pipeline examples, providing practical insight into their application.
In our research journey here, working with genomic datasets will be a frequent and essential part of the work, since these datasets serve as the foundation for most analyses and investigations. Here we list some widely used data sources:
Database | Contents | When you might need it | Similar resources |
---|---|---|---|
Reference data | | | |
Ensembl | Model organisms: assembly and annotation | Analysis of human or mouse sequencing data | RefSeq |
InterPro | Protein families | Annotate amino acid sequences of unknown function | |
KEGG | Hierarchical gene functions | Functional analysis | |
GTDB | Microbiome reference genomes | Metagenomic analysis | RefSeq |
BV-BRC | Pathogen database | | |
Genomic sequencing data | | | |
NCBI | Almost everything | Search for specific data | EBI, GEO, SRA |
TCGA | Cancer genomic data | | |
ENCODE | Functional genomic data | | |
HMP | Real microbial data | | |
CAMI | Simulated microbial data for computational challenges | | |
10X Genomics | Single-cell data | | |
For data mining | | | |
Kaggle | Cleaned data for the data science community | | |
Most of these tools run on Linux or are written in Python or R, and it is common to find yourself switching among these environments, so stay adaptable and flexible. The first lesson I learned in bioinformatics, which resolves over 95% of my challenges, is simple: when in doubt, Google it (nowadays you can also ask ChatGPT).
Task | Software | Comments |
---|---|---|
Comprehensive platform | Galaxy | A web-based platform that provides a user-friendly interface for running bioinformatics tools and workflows. |
Sequence search | BLAST | |
Aligner | BWA, Bowtie2, STAR, Minimap2 | |
Trim adapter | Cutadapt, Trimmomatic | |
Sequence analysis | FastQC, Samtools, Bedtools, Picard | |
Genome analysis | GATK, IGV | |
NCBI interaction | SRA-Toolkit, NCBI Entrez Utilities, NCBI Datasets | A set of tools for accessing and retrieving biological data from the NCBI databases (see the sketch after this table). |
Genome browser | UCSC genome browser, WashU Epigenome Browser | Visualize your data with genome annotations |
Differential analysis | DESeq2, EdgeR | |
Enrichment analysis | GO enrichment analysis, DAVID | |
Metagenomics | Sourmash, Kraken, QIIME |
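As a quick illustration of the NCBI-interaction tools listed above, here is a minimal sketch of pulling raw reads from SRA with the SRA-Toolkit; the accession number below is a made-up placeholder, not a real dataset.

```bash
# placeholder accession for illustration only; replace with a real SRA run ID
ACC=SRR0000001

# download the .sra object for this run
prefetch ${ACC}

# convert it to FASTQ; paired-end reads go to separate files
mkdir -p fastq
fasterq-dump ${ACC} --split-files -O fastq/
```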
conda activate bootcamp
# conda install -y -c conda-forge parallel
mkdir -p test_parallel
cd test_parallel
######## example1: when you have tons of scripts
for i in {1..100}; do
echo -e "sleep 1\necho done script ${i}" > temp_script_${i}.sh
done
# the usual way if I want to run all of them
for script in *.sh; do
echo "Processing ${script} now"
bash ${script}
done
# you can press Ctrl+C to terminate the loop
# running serially can get painfully slow when each script is long, such as a sequence alignment
# a parallel way
ls *.sh | parallel -j 10 bash {}
# clean folder
rm *.sh
######## example2: when you need to process tons of files
for i in {1..100}; do
echo "This is file ${i}" > temp_file_${i}.txt
done
# let's define some function
process_some_file ()
{
input_file=$1
some_other_parameter=$2
# whatever functions you like
sleep 1
echo "We have a second parameter ${some_other_parameter}"
cat ${input_file}
}
export -f process_some_file
# now let's run it in parallel
ls temp*txt | parallel -j 10 process_some_file {} 100
# clean dir
rm temp*txt
######## example3: when you want to use the same file but loop through tons of parameters
loop_parameters ()
{
input_file=$1
para1=$2
para2=$3
para3=$4
# whatever functions you like
sleep 1
echo "Running command for ${input_file} with para1 ${para1}, para2 ${para2}, para3 ${para3}"
}
export -f loop_parameters
touch assume_im_test_file.fastq
# ::: is used to separate the command from the argument list.
# check here: parallel -h
parallel -j 10 loop_parameters assume_im_test_file.fastq {1} {2} {3} ::: {1..10} ::: M F ::: {1..22}
# sometimes your parameters are complex
echo -e "a1\nb1\nc1\nd1" > para1_list.txt
echo -e "a2\nb2\nc2\nd2\ne2" > para2_list.txt
echo -e "a3\nb3\nc3\nd3\ne3\nf3" > para3_list.txt
# solution1:
parallel -j 10 loop_parameters assume_im_test_file.fastq {1} {2} {3} ::: $(cat para1_list.txt) ::: $(cat para2_list.txt) ::: $(cat para3_list.txt)
# solution 2: merge everything together first
while IFS= read -r line1; do
while IFS= read -r line2; do
while IFS= read -r line3; do
echo "$line1 $line2 $line3" >> merged_parameters.txt
done < para3_list.txt
done < para2_list.txt
done < para1_list.txt
cat merged_parameters.txt | parallel --colsep " " -j 10 loop_parameters assume_im_test_file.fastq {1} {2} {3}
# clean folder
rm *txt assume_im_test_file.fastq
As bioinformaticians, we constantly handle complex datasets, and making sure they are cleaned according to our requirements demands careful attention.
######## Example 4: partition fasta file of KEGG KOs
echo ">aaa:Acav_0001|dnaA; chromosomal replication initiator protein|K02313
MTEEPSRSPDFDT
>aaa:Acav_0002|dnaN; DNA polymerase III subunit beta [EC:2.7.7.7]|K02338
MIVLKATQDKVL
>aaa:Acav_0003|gyrB; DNA gyrase subunit B [EC:5.6.2.2]|K02470
MTAENTLPEPTLP
>aaa:Acav_0005|hsdM; type I restriction enzyme M protein [EC:2.1.1.72]|K03427
MLDAQQQYAIRSAL
>aaa:Acav_0007|hsdM; type I restriction enzyme M protein [EC:2.1.1.72]|K03427
MSLTLDTLESWLWE
>aaa:Acav_0008|hsdS; type I restriction enzyme, S subunit [EC:3.1.21.3]|K02470
MKSMGTAEQVTPKA
>aaa:Acav_0014|TROVE2, SSA2; 60 kDa SS-A/Ro ribonucleoprotein|K02338
MVNTQLFQTLKAR
>aac:Aaci_2089|hemE, UROD; uroporphyrinogen decarboxylase [EC:4.1.1.37]|K02313
MRNGDFAVNNRFLR" > pseudo_ko_seq.faa
# if we need a collection of all KOs
grep '>' pseudo_ko_seq.faa | rev | cut -d'|' -f1 | rev | sort -u
# grep '>': keep the FASTA header lines (those containing ">")
# rev: reverse each line back to front
# cut -d'|' -f1: take the 1st field, which is the last field of the original line (the KO ID)
# rev: reverse back to the normal order
# sort -u: sort and deduplicate
# if we want to group sequences by KO
grep '>' pseudo_ko_seq.faa | rev | cut -d'|' -f1 | rev | sort -u > pseudo_ko_list.txt
for ko in $(cat pseudo_ko_list.txt); do
echo ${ko}
grep -A 1 --no-group-separator -Fw ${ko} pseudo_ko_seq.faa
echo " "
done
# Question: can we do this in parallel?
cat pseudo_ko_list.txt | parallel -j 4 'ko={}; grep -A 1 --no-group-separator -Fw ${ko} pseudo_ko_seq.faa > pseudo_out_${ko}.txt'
# clean dir
rm pseudo_*
Gradually get familiar with the following command-line utilities, and then enjoy playing with them (a small example combining a few of them follows the table).
Tools | Function |
---|---|
awk, sed | text manipulation and extraction |
grep | search for patterns in files |
cut | select columns/fields |
sort | sort lines |
find | search for files in a directory tree |
uniq | deduplicate adjacent lines |
echo | print text |
xargs | build and run commands from standard input |
head, tail | show the beginning/end of a file |
tr | translate or delete characters (often delimiters) |
wget, curl | download files |
pwd, cd | print/change the working directory |
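To see how a few of these commands chain together, here is a minimal sketch; the file samples.tsv and its columns are invented purely for illustration.

```bash
# invent a small tab-separated table: sample_id, condition, read_count
printf "s1\ttumor\t1200\ns2\tnormal\t800\ns3\ttumor\t950\n" > samples.tsv

# keep only tumor samples, pull out the read-count column, sort numerically
grep -w "tumor" samples.tsv | cut -f3 | sort -n

# count how many samples fall into each condition
cut -f2 samples.tsv | sort | uniq -c

# sum the read counts with awk
awk -F'\t' '{sum += $3} END {print "total reads:", sum}' samples.tsv

# clean up
rm samples.tsv
```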
Snakemake is a powerful, Python-based workflow management tool that aims to improve the reproducibility and scalability of workflows across different systems. It allows users to define customized rules for deriving output files from input files, making it easier to build a workflow.
In this part of module 3, we will first introduce the basic syntax of Snakemake, and then use an ATAC-seq pipeline as an example to show how to build it with Snakemake. The interactive code of the ATAC-seq analysis can be found here.
Basic knowledge about Python and Bash is required.
To build a workflow with Snakemake, you need to create a so-called `Snakefile`, which commonly ends with the suffix `.smk` (e.g., "workflow.smk"). Adding this suffix to the file name helps people recognize it as a snakefile, but Snakemake can still read the file even if the suffix is absent. A snakefile is a collection of different rules.
A rule is composed of a few directives: `input`, `params`, `output`, and `shell`/`run`.
Here is a basic framework of a rule:
rule <a rule name>:
    input:
        <your input files>
    params:
        <some required parameters>
    output:
        <your output files>
    shell/run:
        <your command lines>
A Snakemake rule has to start with the keyword `rule` followed by a name of your choice (e.g., "download_data"). The `input` and `output` directives are followed by lists of files (separated by `,`) that the rule is expected to use or create. Both are optional directives that we will discuss later when we need them. The `params` directive is also optional and holds parameters used in the command lines. The last, and required, directive is `shell` or `run`, followed by a shell command string (for `shell`) or a few simple lines of Python code (for `run`) to execute.
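For instance, here is a minimal, hypothetical rule; the file names and the gzip command are made up for illustration and are not part of the ATAC-seq pipeline below.

```
rule compress_report:
    input:
        "results/report.txt"
    output:
        "results/report.txt.gz"
    shell:
        "gzip -c {input} > {output}"
```

When asked to produce results/report.txt.gz, Snakemake looks for results/report.txt and runs the shell command, substituting {input} and {output} with those file names.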
We will use an ATAC-seq pipeline as an example to further demonstrate how to use these directives.
To run a workflow described in a snakefile, we can execute:
snakemake -c <number of cores to use> -s <path to snakemake file>
The `-c` (or `--cores`) flag is required and specifies the number of cores to use. Snakemake will use that many cores on the machine to run the workflow in parallel, according to the dependencies between rules. If `-c` is given without a number, all available cores are used.
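For example, assuming your snakefile were saved as workflow.smk, the following would run the whole workflow on four cores:

```bash
snakemake -c 4 -s workflow.smk
```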
It is always a good habit to do a dry run first, to check for syntax errors or missing pieces in the snakefile:
snakemake -n -s <path to snakemake file>
With the `-n` (or `--dry-run`) flag, Snakemake only shows the execution plan instead of actually performing the steps.
Testing is crucial in software development. Before a software product reaches the end user, it typically undergoes various tests, including integration tests, system tests, and acceptance tests. The most basic level of software testing is the unit test, a form of automated testing. If you work in computational biology, you will often develop algorithms or software, and learning how to write unit tests can greatly benefit your research.
In this part, we will learn how to use a popular Python testing framework, `pytest`. It is open source and lets developers write concise test suites for unit testing, functional testing, and API testing. Compared with Python's built-in testing framework, `unittest`, I find it more flexible and convenient.
Basic knowledge about Python is required.
Since pytest is a third-party package, you first need to install it, simply via a pip command:
pip install pytest
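As a quick taste, here is a minimal, hypothetical test file (test_gc_content.py; the function and file names are invented for illustration). Pytest automatically discovers files and functions whose names start with `test_`:

```python
# test_gc_content.py
import pytest


def gc_content(seq: str) -> float:
    """Return the fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def test_gc_content_basic():
    # half of the bases in "ATGC" are G or C
    assert gc_content("ATGC") == 0.5


def test_gc_content_empty_sequence():
    # an empty sequence should raise an error
    with pytest.raises(ValueError):
        gc_content("")
```

Running `pytest` (or `pytest -v` for more detail) in the same directory will discover and execute both tests; in a real project the function under test would live in its own module and be imported, but keeping it here makes the sketch self-contained.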
Please click this link to access a Jupyter notebook with illustrations.