Bioinformatics

Overlap Graphs

2018. 12. 7. 11:16

Problem

A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail $v$ and head $w$ is represented by $(v, w)$ (but not by $(w, v)$ ). A directed loop is a directed edge of the form $(v, v)$ .

For a collection of strings and a positive integer $k$ , the overlap graph for the strings is a directed graph $O_{k}$ in which each string is represented by a node, and string $s$ is connected to string $t$ with a directed edge when there is a length $k$ suffix of $s$ that matches a length $k$ prefix of $t$ , as long as $s \neq t$ ; we demand $s \neq t$ to prevent directed loops in the overlap graph (although directed cycles may be present).

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.

Return: The adjacency list corresponding to $O_{3}$ . You may return edges in any order.

Sample Dataset

>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG

Sample Output

Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323

Problem explanation

문제의 조건들은 다음과 같습니다.

1. 비교 하는 대상이 앞에 올때 s, 비교 하는 대상이 뒤에 올때 t로 명명합니다. (이때 s와 t는 같을 수 없습니다.)

2. s의 전체 시퀀스의 뒤에서 k 만큼의 bp(시퀀스 한개 1bp)는 v를 의미하며 t의 전체 시퀀스의 앞에서 k 만큼의 bp를 추출한 것을 w로 판단합니다.

3. v = w 의 조건에 만족한 영역들만 추출합니다.( w = v 일때는 포함되면 안됩니다.)

Source

f = open("D:/gs/rosalind/rosalind_grph.txt","r") 
k = 3
lines = f.readlines()
id = []
seq = []
#rosalind 에서 주는 txt 파일을 불러와서 나의 inputdata에 맞게 reform 합니다.
def reform(lines):
    for line in lines:
        if line.startswith(">"):
            a = line.replace(">", "")
            a = a.replace("\n", "")
            id.append(a)
        else:
            if len(id) > len(seq):
                line = line.replace("\n","")
                seq.append(line)
            else:
                line = line.replace("\n", "")
                seq[len(id) - 1] = seq[len(id) - 1] + line
    return id , seq
id, seq = reform(lines)

s = []
t = []

for i in range(len(id)):
    for j in range(len(seq)):

        if i != j:
            if seq[i][-k:] == seq[j][:k]:
                s.append(id[i])
                t.append(id[j])

for i in range(len(s)):
    print(s[i] + " " + t[i])

f.close()

A Brief Introduction to Graph Theory

Networks arise everywhere in the practical world, especially in biology. Networks are prevalent in popular applications such as modeling the spread of disease, but the extent of network applications spreads far beyond popular science. Our first question asks how to computationally model a network without actually needing to render a picture of the network.

First, some terminology: graph is the technical term for a network; a graph is made up of hubs called nodes (or vertices), pairs of which are connected via segments/curves called edges. If an edge connects nodes $v$ and $w$ , then it is denoted by $v, w$ (or equivalently $w, v$ ).

an edge $v, w$ is incident to nodes $v$ and $w$ ; we say that $v$ and $w$ are adjacent to each other;
the degree of $v$ is the number of edges incident to it;
a walk is an ordered collection of edges for which the ending node of one edge is the starting node of the next (e.g., ${v_{1}, v_{2}}$ , ${v_{2}, v_{3}}$ , ${v_{3}, v_{4}}$ , etc.);
a path is a walk in which every node appears in at most two edges;
path length is the number of edges in the path;
a cycle is a path whose final node is equal to its first node (so that every node is incident to exactly two edges in the cycle); and
the distance between two vertices is the length of the shortest path connecting them.

Graph theory is the abstract mathematical study of graphs and their properties.

'Python > rosaland' 카테고리의 다른 글

Finding a Shared Motif (0)	2018.12.11
Calculating Expected Offspring (0)	2018.12.07
Mortal Fibonacci Rabbits (0)	2018.12.05
Consensus and Profile (0)	2018.12.05
Combing Through the Haystack (0)	2018.11.30

Mortal Fibonacci Rabbits

2018. 12. 5. 20:03

Problem

Figure 4. A figure illustrating the propagation of Fibonacci's rabbits if they die after three months.

Recall the definition of the Fibonacci numbers from “Rabbits and Recurrence Relations”, which followed the recurrence relation $F_{n} = F_{n - 1} + F_{n - 2}$ and assumed that each pair of rabbits reaches maturity in one month and produces a single pair of offspring (one male, one female) each subsequent month.

Our aim is to somehow modify this recurrence relation to achieve a dynamic programming solution in the case that all rabbits die out after a fixed number of months. See Figure 4 for a depiction of a rabbit tree in which rabbits live for three months (meaning that they reproduce only twice before dying).

Given: Positive integers $n \leq 100$ and $m \leq 20$ .

Return: The total number of pairs of rabbits that will remain after the $n$ -th month if all rabbits live for $m$ months.

month = 83
lifespan = 19
cycle = [0] * lifespan
cycle[0] = 1

for i in range(month-1):
    temp = 0
    for j in range(lifespan-1, 0, -1):
        temp = temp + cycle[j]
        cycle[j] = cycle[j-1]
    cycle[0] = temp

print(sum(cycle))

Wabbit Season

Figure 1. A c.1905 photo from Australia of a cart loaded to the hilt with rabbit skins.

Figure 2. Western Australia's rabbit fence is actually not the longest fence in the world as the sign claims. That honor goes to a 3,500 mile fence in southeastern Australia built to keep out dingoes. Courtesy Matt Pounsett.

Figure 3. An 1884 cartoon from the Queensland Figaro proposing how the rabbits viewed their fence.

In “Rabbits and Recurrence Relations”, we mentioned the disaster caused by introducing European rabbits into Australia. By the turn of the 20th Century, the situation was so out of control that the creatures could not be killed fast enough to slow their spread (see Figure 1).

The conclusion? Build a fence! The fence, intended to preserve the sanctity of Western Australia, was completed in 1907 after undergoing revisions to push it back as the bunnies pushed their frontier ever westward (see Figure 2). If it sounds like a crazy plan, the Australians at the time seem to have concurred, as shown by the cartoon in Figure 3.

By 1950, Australian rabbits numbered 600 million, causing the government to decide to release a virus (called myxoma) into the wild, which cut down the rabbits until they acquired resistance. In a final Hollywood twist, another experimental rabbit virus escaped in 1991, and some resistance has already been observed.

The bunnies will not be stopped, but they don't live forever, and so in this problem, our aim is to expand Fibonacci's rabbit population model to allow for mortal rabbits.

'Python > rosaland' 카테고리의 다른 글

Calculating Expected Offspring (0)	2018.12.07
Overlap Graphs (0)	2018.12.07
Consensus and Profile (0)	2018.12.05
Combing Through the Haystack (0)	2018.11.30
Translating RNA into Protein (0)	2018.11.30

Consensus and Profile

2018. 12. 5. 10:38

Problem

A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$ , we write $A_{i, j}$ to indicate the value found at the intersection of row $i$ and column $j$ .

Say that we have a collection of DNA strings, all having the same length $n$ . Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1, j}$ represents the number of times that 'A' occurs in the $j$ th positionof one of the strings, $P_{2, j}$ represents the number of times that C occurs in the $j$ th position, and so on (see below).

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$ th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$ -th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

	A T C C A G C T
	G G G C A A C T
	A T G G A T C T
DNA Strings	A A G C A A C C
	T T G G A A C T
	A T G C C A T T
	A T G G C A C T

	A 5 1 0 0 5 5 0 0
Profile	C 0 0 1 4 2 0 6 1
	G 1 1 6 3 0 1 0 0
	T 1 5 0 0 0 1 1 6

Consensus	A T G C A A C T

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

f= open("D:/gs/rosalind/rosalind_cons1.txt","r")

lines  = f.readlines()
seq = []
arr1 =[]
for i in range(len(lines)):
    if i%2==1:
        seq.append(lines[i])

for i in range(len(seq)):
    arr2=[]
    for j in range(len(seq[i])):
        if seq[i][j] == '\n':
            continue
        else:
            arr2.append(seq[i][j])
    arr1.append(arr2)



A = []
C = []
G = []
T = []
Max = []

for i in range(len(arr1[i])):
    a = 0
    c = 0
    g = 0
    t = 0
    temp = []
    for j in range(len(arr1)):
        if arr1[j][i]=="A":
            a+=1
        elif arr1[j][i]=="T":
            t+=1
        elif arr1[j][i]=="G":
            g+=1
        else:
            c+=1
    A.append(a)
    C.append(c)
    G.append(g)
    T.append(t)
    temp.append(a)
    temp.append(c)
    temp.append(g)
    temp.append(t)
    Max.append(temp.index(max(temp)))

count = 0
for ma in Max:
    if ma == 0:
        Max[count] = "A"
        count+=1
    elif ma == 1:
        Max[count] = "C"
        count += 1
    elif ma == 2:
        Max[count] = "G"
        count += 1
    else:
        Max[count] = "T"
        count += 1



print("".join(Max))
A = [str (i) for i in A]
C = [str (i) for i in C]
G = [str (i) for i in G]
T = [str (i) for i in T]

print("A: "+" ".join(A))
print("C: "+" ".join(C))
print("G: "+" ".join(G))
print("T: "+" ".join(T))

Finding a Most Likely Common Ancestor

In “Counting Point Mutations”, we calculated the minimum number of symbol mismatches between two strings of equal length to model the problem of finding the minimum number of point mutations occurring on the evolutionary path between two homologous strands of DNA. If we instead have several homologous strands that we wish to analyze simultaneously, then the natural problem is to find an average-case strand to represent the most likely common ancestor of the given strands.

'Python > rosaland' 카테고리의 다른 글

Overlap Graphs (0)	2018.12.07
Mortal Fibonacci Rabbits (0)	2018.12.05
Combing Through the Haystack (0)	2018.11.30
Translating RNA into Protein (0)	2018.11.30
Mendel's First Law (0)	2018.11.26

PREV 1 2 3 4 5 6 7 ···17 NEXT

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

Bioinformatics

Overlap Graphs

Problem

Sample Dataset

Sample Output

Problem explanation

Source

A Brief Introduction to Graph Theory

'Python > rosaland' 카테고리의 다른 글

Mortal Fibonacci Rabbits

Problem

Wabbit Season

'Python > rosaland' 카테고리의 다른 글

Consensus and Profile

Problem

Finding a Most Likely Common Ancestor

'Python > rosaland' 카테고리의 다른 글

+ Recent posts

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역