Problem
To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
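The shorthand maps almost directly onto regular-expression character classes: [XY] is already a valid class, and {X} becomes the negated class [^X]. A minimal sketch of that translation follows; the helper name motif_to_regex is hypothetical and not part of the problem statement, and it assumes the motif uses only uppercase amino acid letters.

```python
import re


def motif_to_regex(motif: str) -> str:
    """Translate the [XY] / {X} shorthand into a regular-expression string.

    Hypothetical helper for illustration; assumes uppercase amino acid letters.
    """
    # {X} means "any amino acid except X", i.e. the negated class [^X].
    # [XY] needs no change because it is already a regex character class.
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)


print(motif_to_regex("N{P}[ST]{P}"))  # N[^P][ST][^P]
```

The resulting string can be passed to re.compile and used to scan a protein sequence.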
You can see the complete description and features of a particular protein by its access ID "uniprot_id" in the UniProt database, by inserting the ID number into
Alternatively, you can obtain a protein sequence in FASTA format by following
For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
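A FASTA record consists of a header line that starts with ">" followed by one or more lines of sequence. The sketch below shows one way to turn such a record into a single protein string; the function name parse_fasta and the sample record are made up for illustration and are not real UniProt data.

```python
def parse_fasta(text):
    """Split a single FASTA record into (header, sequence).

    Hypothetical helper; assumes the text holds exactly one record.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    header = lines[0][1:]  # drop the leading ">"
    # The sequence is usually wrapped across lines, so join them back together.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


# Made-up record, only to show the shape of the input.
record = ">example_protein hypothetical header\nMKNT\nSANK\n"
header, sequence = parse_fasta(record)
print(header)
print(sequence)
```

Raising an error on a missing header makes a failed or empty download visible instead of silently yielding an empty sequence.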
Given: At most 15 UniProt Protein Database access IDs.
Return: For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.
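Occurrences of the motif can overlap, so a plain left-to-right search that consumes each match may skip some locations. One possible approach, sketched below, wraps the pattern in a zero-width lookahead so that every start position is tested. The function name motif_locations is hypothetical, and positions are reported 1-based as the problem requires.

```python
import re

# The outer (?=...) consumes no characters, so overlapping matches are all found.
MOTIF = re.compile(r"(?=N[^P][ST][^P])")


def motif_locations(seq):
    """Return the 1-based start positions of every N{P}[ST]{P} occurrence."""
    return [m.start() + 1 for m in MOTIF.finditer(seq)]


# Two overlapping occurrences: "NNTS" at position 1 and "NTSY" at position 2.
print(motif_locations("NNTSY"))  # [1, 2]
```

Testing the pattern at each index of the sequence, as the solution further down does, is an equivalent alternative.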
Sample Dataset
Sample Output
Problem explanation
Source
import re
import requests


def get_html(url):
    # Return the response body, or an empty string if the request fails.
    _html = ""
    resp = requests.get(url)
    if resp.status_code == 200:
        _html = resp.text
    return _html


rosadirect = "D:/gs/rosalind/"

# Read the UniProt access IDs, one per line.
with open(rosadirect + "rosalind_mprt.txt", "r") as f:
    ids = [line.strip() for line in f if line.strip()]

# N-glycosylation motif N{P}[ST]{P} written as a regular expression.
p = re.compile(r"N[^P][ST][^P]")

total = []
for ide in ids:
    URL = "https://www.uniprot.org/uniprot/%s.fasta" % ide
    fasta = get_html(URL)
    # The response is plain-text FASTA: skip the header line and join the rest.
    seq = "".join(fasta.splitlines()[1:])
    # Try a match at every start position so overlapping motifs are all found.
    totemp = []
    for k in range(len(seq)):
        if p.match(seq, k) is not None:
            totemp.append(k + 1)  # locations are 1-based
    total.append(totemp)

# Print only the proteins that contain the motif.
for ide, positions in zip(ids, total):
    if positions:
        print(ide)
        print(" ".join(map(str, positions)))
Motif Implies Function
As mentioned in “Translating RNA into Protein”, proteins perform every practical function in the cell. A structural and functional unit of the protein is a protein domain: in terms of the protein's primary structure, the domain is an interval of amino acids that can evolve and function independently.
Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, etc.). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.
Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions; see Figure 1.
A component of a domain essential for its function is called a motif, a term that in general has the same meaning as it does in nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, etc.). Protein motifs are usually evolutionarily conserved, meaning that they appear without much change in different species.
Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.