Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, "CG" is a common substring of "ACGTACGT" and "AACCGTATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "ACGTACGT" and "AACCGTATA".

Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".
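To make the non-uniqueness concrete, here is a minimal sketch that lists every longest common substring of two strings. The function name all_longest_common_substrings is made up for illustration, and this is a brute-force check meant for short inputs only.

```python
def all_longest_common_substrings(a, b):
    """Return every longest substring shared by a and b, sorted."""
    shorter, longer = sorted((a, b), key=len)
    # Try lengths from longest to shortest; the first length with any
    # shared substring is the maximum one.
    for length in range(len(shorter), 0, -1):
        found = {
            shorter[i:i + length]
            for i in range(len(shorter) - length + 1)
            if shorter[i:i + length] in longer
        }
        if found:
            return sorted(found)
    return []


print(all_longest_common_substrings("AACC", "CCAA"))           # ['AA', 'CC']
print(all_longest_common_substrings("ACGTACGT", "AACCGTATA"))  # ['CGTA']
```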

Given: A collection of k (k ≤ 100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)


Sample Dataset

>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA


Sample Output

AC




Problem explanation

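This code finds the longest common substring of sequences given in FASTA format. I first tried to solve it with the LCS algorithm, but this problem is a slight variant, so that approach was hard to apply. My second approach was to build a list of every possible substring of each string and compare them by set intersection, but it was far too heavy: the function took more than 120 seconds for 100 reads of 1000 bp each, so I gave it up. The solution I found last was very fast and accurate.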
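The abandoned second approach could look roughly like the sketch below. It is a reconstruction under assumptions, not the code that was actually timed; the names all_substrings and lcs_by_intersection are hypothetical. A string of length n has n(n+1)/2 substrings, so for 1000 bp reads each set holds up to about 500,000 strings, which is why this approach gets so heavy.

```python
def all_substrings(s):
    """Return the set of every non-empty substring of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def lcs_by_intersection(seqs):
    """Intersect the substring sets of all strings and keep a longest one.

    Assumes seqs is a non-empty list of strings.
    """
    common = all_substrings(seqs[0])
    for s in seqs[1:]:
        common &= all_substrings(s)
    # Ties are broken arbitrarily; any longest common substring is valid.
    return max(common, key=len) if common else ""


result = lcs_by_intersection(["GATTACA", "TAGACCA", "ATACA"])
print(result)  # one of the length-2 answers, e.g. AC
```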



Source

import time

start_time = time.time()


def FindCommonString(seqs):
    # Return a longest substring shared by every string in seqs ("" if none).
    # Candidates are taken from the shortest string, longest first, so the
    # first candidate found in all strings is a longest common substring.
    shortest = min(seqs, key=len)
    n = len(shortest)
    for i in range(n):
        for j in range(i + 1):
            token = shortest[j: n - i + j]
            if all(token in s for s in seqs):
                return token
    return ""


def reform(lines):
    # Parse FASTA lines into parallel lists of record IDs and sequences.
    ids = []
    seq = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            ids.append(line[1:])
            seq.append("")
        elif line and seq:
            # A sequence may span several lines; join them.
            seq[-1] += line
    return ids, seq


with open("D:/gs/rosalind/rosalind_lcsm.txt", "r") as f:
    lines = f.readlines()

ids, seq = reform(lines)
com = FindCommonString(seq)
print(com)
print("--- %s seconds ---" % (time.time() - start_time))


Searching Through the Haystack

In “Finding a Motif in DNA”, we searched a given genetic string for a motif; however, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.

The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database; such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.
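Since any common substring of length L contains common substrings of every shorter length, the search for a long shared motif can also be done by binary search on the length. The sketch below is an alternative technique, not the method used in the Source section; common_of_length and longest_common_substring are hypothetical names, and the input is assumed to be a non-empty list of strings.

```python
def common_of_length(seqs, length):
    """Return one substring of the given length shared by all seqs, or None."""
    common = {seqs[0][i:i + length] for i in range(len(seqs[0]) - length + 1)}
    for s in seqs[1:]:
        common &= {s[i:i + length] for i in range(len(s) - length + 1)}
        if not common:
            return None
    # min() gives a deterministic pick when several substrings tie.
    return min(common) if common else None


def longest_common_substring(seqs):
    """Binary-search the largest length that still has a shared substring."""
    lo, hi = 0, min(len(s) for s in seqs)
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        hit = common_of_length(seqs, mid)
        if hit is not None:
            best, lo = hit, mid   # a shared substring exists; try longer
        else:
            hi = mid - 1          # none at this length; try shorter
    return best


print(longest_common_substring(["GATTACA", "TAGACCA", "ATACA"]))  # AC
```

Each step only builds the substrings of one fixed length, so memory stays far below the all-substrings intersection.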

