What is bbmap?

bbmap is a tool for aligning RNA and DNA reads produced by sequencing instruments such as Illumina or PacBio.


bbmap setup guide

These instructions assume Ubuntu.

1. Download bbmap from http://sourceforge.net/projects/bbmap/files/latest/download (e.g., BBMap_38.34.tar.gz).


2. Move BBMap_38.34.tar.gz to the folder of your choice and extract it. (Usage: tar zxf BBMap_38.34.tar.gz)

--------------------- Once you have completed the steps up to this point, you can use BBMap or BBDuk locally. --------------------------


3. Set PATH (the line below assumes the archive was extracted in $HOME, which creates $HOME/bbmap)

export PATH=$PATH:$HOME/bbmap
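
To check that the PATH change took effect, a small Python sketch can look up the launcher scripts. It assumes the extracted bbmap folder provides bbmap.sh and bbduk.sh; the helper name missing_tools is made up for illustration.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # bbmap.sh and bbduk.sh are the launcher scripts assumed to live in $HOME/bbmap.
    missing = missing_tools(["bbmap.sh", "bbduk.sh"])
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("bbmap.sh and bbduk.sh are both on PATH")
```

Run it in a new shell after the export line; if a script is reported as missing, the export probably points at a different folder than the one you extracted into.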



*Step 1: Create a new directory named ncbi-blast+

mkdir ncbi-blast+

*Step 2: Enter the created directory

cd ncbi-blast+

(The remaining steps apply to Fedora, CentOS, and Ubuntu.)

*Step 3: Download the software from the NCBI FTP site

wget -N ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.6.0/ncbi-blast-2.6.0+-1.x86_64.rpm

*Step 4: Install

On Fedora/CentOS:

sudo yum install ncbi-blast-2.6.0+-1.x86_64.rpm --nogpgcheck

On Ubuntu (the alien package must already be installed, e.g. with sudo apt-get install alien):

sudo alien -i ncbi-blast-2.6.0+-1.x86_64.rpm

*Step 5: Check that the installation succeeded

blastn -h

Output: USAGE
blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-entrez_query entrez_query]
[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
[-subject subject_input_file] [-subject_loc range] [-query input_file]
[-out output_file] [-evalue evalue] [-word_size int_value]
[-gapopen open_penalty] [-gapextend extend_penalty]
[-perc_identity float_value] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
[-min_raw_gapped_score int_value] [-template_type type]
[-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-window_size int_value]
[-off_diagonal_range int_value] [-use_index boolean] [-index_name string]
[-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
[-version]

**If you see this usage text, NCBI-BLAST+ version 2.6.0 is installed on your system, and you can now run sequence similarity searches locally against your own databases and queries.
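
The same check can be scripted. The following is a minimal sketch, assuming blastn is on PATH and using the -version flag that appears in the usage text above; the helper name tool_version is hypothetical.

```python
import subprocess


def tool_version(command):
    """Run a command and return its first output line, or None if it cannot be run."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    # -version is one of the flags listed in the blastn usage text.
    version = tool_version(["blastn", "-version"])
    print(version if version else "blastn could not be run; check the installation")
```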


Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
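
The shorthand maps almost directly onto regular-expression syntax. As a rough sketch (the function name motif_to_regex is made up here), only the {X} form needs rewriting, since [XY] is already a valid character class:

```python
import re


def motif_to_regex(motif):
    """Translate the {X} 'anything except X' shorthand into a regex character class."""
    # [XY] is already valid regex syntax, so only {X} is rewritten to [^X].
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)


if __name__ == "__main__":
    pattern = motif_to_regex("N{P}[ST]{P}")
    print(pattern)
    # "NASA" is a toy string: N, not-P, S or T, not-P.
    print(bool(re.match(pattern, "NASA")))
```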

You can see the complete description and features of a particular protein by its access ID "uniprot_id" in the UniProt database, by inserting the ID number into

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
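
Building the FASTA address from an access ID might look like the sketch below. The helper name fasta_url is hypothetical, and dropping everything after the first underscore (so that P07204_TRBM_HUMAN becomes P07204) is an assumption about which part of the ID the URL expects.

```python
def fasta_url(access_id):
    """Build the FASTA URL for an access ID such as 'B5ZC00' or 'P07204_TRBM_HUMAN'."""
    # Assumption: only the accession before the first underscore belongs in the URL.
    accession = access_id.strip().split("_")[0]
    return "http://www.uniprot.org/uniprot/%s.fasta" % accession


if __name__ == "__main__":
    for access_id in ["B5ZC00", "P07204_TRBM_HUMAN"]:
        print(access_id, "->", fasta_url(access_id))
```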

Given: At most 15 UniProt Protein Database access IDs.

Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.


Sample Dataset

A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST



Sample Output

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
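
Note that motifs may overlap: P07204_TRBM_HUMAN reports both 115 and 116. A plain left-to-right regex scan would skip the second one, so one possible approach (a sketch, not the only way) is a zero-width lookahead. The name motif_positions and the toy sequence are made up for illustration.

```python
import re

# Zero-width lookahead: each match consumes nothing, so overlapping motifs are all reported.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")


def motif_positions(sequence):
    """Return the 1-based start positions of every N-glycosylation motif."""
    return [match.start() + 1 for match in N_GLYCOSYLATION.finditer(sequence)]


if __name__ == "__main__":
    # Toy sequence (not a real protein): motifs start at positions 1 and 2.
    print(motif_positions("NNSTA"))
```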



Problem explanation

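The task is to look up each protein on UniProt by its access ID, retrieve the protein data, and find a specific motif. From the dataset it looked like the data would have to be fetched from the web, so I did that. The fetched content came wrapped in a pre tag, so I saved it to a text file and read it back in (there may be a better way that I did not find because of my limited scraping experience). I then used a regular expression to find the N-glycosylation motif in each sequence.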

Source

from bs4 import BeautifulSoup
import requests
import os
import re


def get_html(url):
    """Return the response body, or an empty string on a non-200 status."""
    _html = ""
    resp = requests.get(url)
    if resp.status_code == 200:
        _html = resp.text
    return _html


rosadirect = "D:/gs/rosalind/"
ids, total = [], []

# Read the access IDs, one per line.
with open(rosadirect + "rosalind_mprt.txt", "r") as f:
    for line in f:
        if line.strip():
            ids.append(line.strip())

# N-glycosylation motif N{P}[ST]{P} written as a regular expression.
p = re.compile(r"N[^P][ST][^P]")

for ide in ids:
    # Assumption: only the accession before "_" goes in the URL (P07204_TRBM_HUMAN -> P07204).
    URL = "https://www.uniprot.org/uniprot/%s.fasta" % ide.split("_")[0]
    html = get_html(URL)
    soup = BeautifulSoup(html, 'html.parser')
    # Round-trip through a temporary text file, as described above.
    with open(rosadirect + "temp.txt", "w") as f:
        f.write(soup.getText())
    with open(rosadirect + "temp.txt", "r") as f:
        lines = f.readlines()
    # Skip the FASTA header line and join the sequence lines.
    seq = "".join(line.strip() for line in lines[1:])
    # Try the pattern at every offset so that overlapping motifs are found.
    totemp = [k + 1 for k in range(len(seq)) if p.match(seq, k)]
    total.append(totemp)

# Print only the proteins that contain the motif.
for ide, locations in zip(ids, total):
    if locations:
        print(ide)
        print(" ".join(map(str, locations)))

os.remove(rosadirect + "temp.txt")
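
The temporary file is not strictly needed if the response text is handled in memory. Here is a minimal sketch, assuming the .fasta response is plain FASTA text with a single header line; the helper name sequence_from_fasta and the example record are made up for illustration.

```python
def sequence_from_fasta(text):
    """Return the sequence from single-record FASTA text, dropping the '>' header line."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


if __name__ == "__main__":
    # Made-up record for illustration, not real UniProt data.
    example = ">sp|X00000|TOY_EXAMPLE made-up header\nMKNNS\nTAQ\n"
    print(sequence_from_fasta(example))
```

With a helper like this, the text returned by get_html could be passed in directly instead of being written to temp.txt and read back.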



Motif Implies Function

Figure 1. The human cyclophilin family, as represented by the structures of the isomerase domains of some of its members.

As mentioned in “Translating RNA into Protein”, proteins perform every practical function in the cell. A structural and functional unit of the protein is a protein domain: in terms of the protein's primary structure, the domain is an interval of amino acids that can evolve and function independently.

Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, etc.). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.

Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions; see Figure 1.

A component of a domain essential for its function is called a motif, a term that in general has the same meaning as it does in nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, etc.). Protein motifs are usually evolutionarily conserved, meaning that they appear without much change in different species.

Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.

