Problem

Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.

Given two strings s and t of equal length, the Hamming distance between s and t, denoted dH(s,t), is the number of corresponding symbols that differ in s and t. See Figure 2.

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance dH(s,t).


# Count the positions at which the two equal-length strings differ.
a = "GAGCCTACTAACGGGAT"
b = "CATCGTAATGACGGCCT"
count = 0
for i in range(len(a)):
    if a[i] != b[i]:
        count += 1

print(count)
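
The same count can be wrapped in a reusable function. This is only a minimal sketch: the name hamming_distance is a hypothetical helper, not part of the original solution, and the length check is an added safeguard, since the problem already guarantees equal lengths.

```python
def hamming_distance(s, t):
    """Return the number of positions at which s and t differ."""
    # Safeguard (an assumption of this sketch): reject unequal lengths.
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # zip pairs up corresponding symbols; each mismatch contributes 1.
    return sum(1 for x, y in zip(s, t) if x != y)


print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```

The printed value, 7, matches the distance stated in the caption of Figure 2.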



Evolution as a Sequence of Mistakes

Figure 1. A point mutation in DNA changing a C-G pair to an A-T pair.

A mutation is simply a mistake that occurs during the creation or copying of a nucleic acid, in particular DNA. Because nucleic acids are vital to cellular functions, mutations tend to cause a ripple effect throughout the cell. Although mutations are technically mistakes, a very rare mutation may equip the cell with a beneficial attribute. In fact, the macro effects of evolution are attributable to the accumulated result of beneficial microscopic mutations over many generations.

The simplest and most common type of nucleic acid mutation is a point mutation, which replaces one base with another at a single nucleotide. In the case of DNA, a point mutation must change the complementary base accordingly; see Figure 1.

Two DNA strands taken from different organism or species genomes are homologous if they share a recent ancestor; thus, counting the number of bases at which homologous strands differ provides us with the minimum number of point mutations that could have occurred on the evolutionary path between the two strands.

We are interested in minimizing the number of (point) mutations separating two species because of the biological principle of parsimony, which demands that evolutionary histories should be as simply explained as possible.



Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
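
The definition and the reverse-complement remark can be checked on the example string. This is a minimal sketch; the helper names gc_content and reverse_complement are hypothetical and introduced only for illustration.

```python
def gc_content(dna):
    """Return the percentage of symbols in dna that are 'C' or 'G'."""
    return (dna.count("C") + dna.count("G")) / len(dna) * 100


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(dna))


s = "AGCTATAG"
print(gc_content(s))                      # 37.5
print(gc_content(reverse_complement(s)))  # 37.5
```

Complementing swaps every 'C' with a 'G' and vice versa, and reversing changes only the order, so the combined count of 'C' and 'G' is unchanged.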

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
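
To make the format concrete, here is a minimal sketch of a FASTA reader working on an in-memory string. The function name parse_fasta is hypothetical, and the two records are made up for illustration; they are not taken from a Rosalind dataset.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA label to its full sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its label.
            label = line[1:]
            records[label] = []
        elif label is not None:
            # Any other line continues the sequence of the current record.
            records[label].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records for illustration only.
sample = """>Rosalind_0001
AGCT
ATAG
>Rosalind_0002
GGCC
AT
"""
print(parse_fasta(sample))
```

Each sequence is accumulated as a list of lines and joined once at the end, so a record that spans several lines comes back as a single string.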

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.


Code

# Read the FASTA file and report the record with the highest GC-content.
with open("D:/gs/rosalind/rosalind_gc.txt", "r") as f:
    lines = f.readlines()

name = []   # record IDs, without the leading ">"
score = []  # GC-content of each record, in percent
seq = []    # full sequence of each record
temp = []   # sequence lines of the record currently being read

for line in lines:
    line = line.strip()
    if line.startswith(">"):
        name.append(line[1:])
        # A new header closes the previous record, if there is one.
        if temp:
            seq.append("".join(temp))
            temp = []
    elif line:
        temp.append(line)

# The last record has no following header, so flush it after the loop.
if temp:
    seq.append("".join(temp))

for s in seq:
    score.append((s.count("C") + s.count("G")) / len(s) * 100)

maxvalue = score.index(max(score))
print(name[maxvalue])
print(round(score[maxvalue], 6))





DataSet


>Rosalind_1856

TTGAGTTCGGCGCTAACACGTGCAGAGCGCTCCGAAGGCCAAGGGTTAGGCACAGATTAG

CCGTGAAGTGCTGTATGGACGCAACGGTTGTCGATTTCACAGATTCGAATCTGACTGTAC

TACGTTAGATCAGCCCCAGAACTAATTCGAGTTTTATGCTCTAAAATCTTGCGGCTAGTG

ACTAAGTGCGATGACGGGCCTAAGGACGAGCCATCGCGATCATAAGTTACTGCTCGTCAC

TTTATGCGGGACAAGAGCCCCCTACAGTTACCGGGAAATCACCTATCAGCCTCTCTAATG

GGAAGCGCCAATGGAACGGACCAAGCGCCATTGATAACGATTGGCGCTGATTACTATATA

CCGACTCAGCCTCCCAATCTTTCCTCCGTGCAAGGTAGGCAAGGCGTTTATGAAAAGGGC

TGGCGCATGCTCAGTTAATGTGTGTGTTTATGGCTCAACAGAAACATCACCCGGACCAGA

ACTACTGCAAGGACAGATATATTACACTGCGAATCCAAGGCCTTGTGGCGACCTTTTCCT

ACCCTCAAGACCAGCATAGATTCGGCTGGGTTGGAGCTCGCTTTAGAATGGTTACAGTAG

TGTTACTACCAAAGGATTCCGAGATGTGTACCTCGAAGAGAGATTCTTTCTGATGCAAGT

GTTGTCGTATAGAAGGTCGAAATCTTATGATTTAGGCCACGTGTGTTAGGGCTCCACACC

CCCGATTGTCTACATACGGACTAGGTGGATATGCCGACTTAGTGTGTAATGGGGTCCGCG

TCCCTAACCTTTTTTTCCACCATTTTTCCTTTCGGTCTGTACTTCCTCCAATCAGCGTTG

TTGCAATACTATACTGCAAGATAAAGTAACGCGACCCTGGACTGCAATCACGCCTCTGGC

CAA

>Rosalind_4787

ACACCAAGAGGGAGCGTATTGTGACGTCGTCCCGGGATATCTGTTGGTAGCGCATTATCG

ATTACTCTCATCGACTAGGCAGTAAGGTCCACCGACGACTATGGAATTCGAGCTCTGCTA

TCAGGTAGGCGATACTCGCTTGGTTTAATCGAAGATTTGGCGACCTAGACAAAGGTAAAC

GCTCAGGGGGGCCTACTCGGCCTTATTCCTGACTACGTCTCAGGTGATTTCGAACGCACG

AAAGGGGGAAGTTCTGTATTTCGGTTGATTTTGCGTACACTGTGGAATTAACATACGAGT

TCCCCCTTTTCATCACTCTATCTTTACTTGCATAGGCCCGTTCATCTCACACCTCACAGT

GAACTGTGTTGGATTGGCGTCTCGTCGCTAACCTTGGATATCAGGGGGGCATGGATTCGG

AAGAAGGCGAGCGTCGCGAGGATGTTGTTCAGAGGCTACAAGTATATCGCACTCGCGTCA

CTCGCGGCTGATGAGAGTAATACCACAAAACTTATCCGTCCTAAAGGGACGAGAGGCGGC

ACTCTCTACAATCGAATACACGAGATGTTATTTCGAAAATTATTATTTTCTTATGTCGCA

TATAAGTCCCACCAGCAGACGCGCGTGAAGATAAGCGCCGTAAAGCCTCCCTGGTGATCC

GCTCGCTGCTGTAATAAACAGTTCAGCCCCATCTCACGTCTCGACGCTTTACCTGGGTGA

GGTCGCGAGTTTCCTAGTAGTGGCCCGGCTTCGATCATGTATGTTAATAACGATGTAGTG

CAATGGGTACTGTAACTAGGACTGGGCAGGCTTGGATCAGCCTGTATACGGATGTTTTCC

TCGGCACGCCCAGAAGATACGTATTAAGTTTCCACGGTCACAGTGCCCCCTTGAGACGTG

TCTTTGTTTCTTTGTGGTATGGAGGTGACAAGTGTCCACCGCAAGATATGTGGAT

>Rosalind_0276

CTTAAGTATCAGACAGAACCGTTTGTGGGGTATCCAGGTTATGGATCCGACACCTTAGCT

AGACTGACTCATAACGAGCGTTATCTTCTTTCGGCTTCGACTCAACTTCGAGGGAATTGA

TCCAACTAAGTTACGGGGCTAGCGCGCACTGCATTATCGTTAGTCCCAAGCGCATATATG

TTACACTGTTCTAGTAATCGGTCAACACTAGAGATGTCAATCATGCCCTTCCGTGAACGC

GTCGAAATTAAGCGTACCTTAATTCCTATCGCATTACGCTATGCAGAGCCCGTTTAGCAG

TACGCGAAGTCCCGTACCCAGTAGAGGATTTGTGTGATGGAGCCACAGCTGGGTACCAGG

GGATTAGAGTATGGATGCTTATCTATATGAGCACTGTCAAAATTGTCGCCGCCGGCCTGG

GTCGCTACCGTTATGGCAGTCGATGTAATTGTATGCGCAAGCTCACGCCTCGCAGAGTAA

TTAAGCGGTTGGGATTCGGCTGACCTCTTTGGTAGTCATCGAGATGTGTCCCCTCAATGG

TTAATTGATGCACGGACTTTTACATCTGTCCCGTCGACTCTTTGACAGTTTCACTTGTGC

TACAAGTTGCTTCTAAATATGCGCGAATGTGCTAATACATGTAGTAGATACGCCTCGTAG

TATTGCTTGAAACATGAACTAGTGGCCCCCAACTACCATAGGCGCAACTCATTGTGCACA

CCGGTAGCAGTATAACAACATCGTGCAGCGAGTTCCGTAGCCGTCCTGTCTCTTAGAGCG

GAAAAACATAGCTCGCTATTTTTATGTTGTGTCGGTTATACCGCCGGTTGCGTAGAGATC

TGTTGCCAGTGAAATAAGCCAAGTATTTGGTAGTTCTATTGGGACGGTGGGCTTCGGACT

CGTAGT

>Rosalind_0058

GCTGCATGACTTGTTACCTTACAAGCCAGCGGTTTCATGCATGTTACAGCGGCGGAGTAT

ACTTTGTAGTTGGAGTCGTACTGGAGAAAGCTAATATACTAGTATGCCATACCGGCCGTC

TTACTCCGTTGTCGGAGAGGCCACCTGTATCAGGGTATAGAAACAGCACACTGCTAGCTT

CGACTTGTTGTGGGGTCGGTAGCCTAGGTGCGTCTTGATTTCCGGCGCTTGAACGGATAT

ACACCGTTCACTCCGTCCAGGTTCCTCGCATGCCCGGTCGTCCCCACGTACTCTATACAC

CGATCACTCCTGATTGTAAAGCGTAGGAGGACATATAGCAACTTACCGGCCCATGGATTG

ATTGCCCGAAGTCGTGTCCGTGCCGGTAGTGTCAGGTTTCCGCGATGGACAGCTTCCACT

ACTTGAATAGGCCGACACTGGGTCACTCGCCCTTTTGGCCCGTCGTGGAGACATTAGCTT

GCGATACAATTCCGCAAGCGCGCATCGCTCTACAGCGCAGAACAAAGAGAAGACGCTGAG

GCGTCGGAAACACCCCTAAGATATTGGACACAGCGGACACTAGGTTAAAGACCCTGTTGT

TCCCACATCGAACATGCCTAGATGCCTGGTGGTTAAGGATCTTCAGGCTGCCTCTAGAAC

TCGCTCACAAAAAGTAGGTGGACGACTCCTATTGTCGTCCAAGTAGAACCAAGCCAACGC

TTGATTTCGGTGTTTCCGAAACTGAGTATAGTTAAACCATTCCCGAGCTCACAGCTGGAT

AGCCCGCCTCGCTGGCTCCTAAGTCCAACACCTAATTAAGACTCTATGGTCACGTGGGAT

CACGGTACGGGAGTTTTGTAGAATACGGCTGCCACGTCCTGTTGTCCGATAGCATTACTC

GGCCTCCTTAGGAGGGCTTTGCGAGGATGTTTGGTCAAACCAGGTTCCGCAGCGGACATT

ACCAACTTCGCGGTCCGCTAACTTCGGAACT

>Rosalind_9108

AACATTGCATACGGCGCAGGGTTTGTTAGGAGAATAGCCCTCTGAATTTTGCACCTGTAG

TTGGCATTCATACCGTTCACGGCTTCTTACATACTTTCGCCTCGACCATGCGAATCGACC

ACGCGAGGCCGTGCATGGTATCAAACGTGACGAGACGGATCACCGAGCTGCCAGCGCATT

AACGTCTGGTGTGACTTTACTTTAGTCTTTGTAGAGCCAACAGATTCTGTAATGGATGCG

AATCGGTGACCGGTGACACCGCTGTAAACGGTCTCTCTATACGATGGTAGAGCCAAGCGT

TGACGGATGTAAGTACCTAACGGTTAAGGAGCGCAGGGCTATACGCGCTAGGCGGCATGT

TTTGACGCGCCTGAGATATGCCTGATACGCGACCCTCTTAATAAGTAAATGACATATGTC

CATTGCCAGTAGTCTGAAAGGACAGAACCTGACGTGAGCCAACAGACATTACTACTTAGA

GTCTGCCGGTACTTGCATATGTCTTCAAGGTACGGACATCATTCCTGATGATCTGAACGA

AGACATTCAGGGCAACAAAAAATTTTGCCCACCTTGTGACACAACTAACATGTAGCCCTC

ACCATAGGACTGGAGGATATAAGGCCTTGATCTTCTTTCGTGTGGTCTCAGTTAGGGATG

GGGCTACGGGGTGCAACACCGTGTCACATTGATTAGCTGTTCGACTTAAGCTTCATTTTG

GACGCTAATTTCTTCTGCGCTGAAAAGTACGACATGTATGGATACTAACGTCACTAACCT

TAGCCTATAGGCGATGTAAGCATTCAGAAACGGGCGTAAGCGCGACTGACTTGGGACTAG

TGAATTCCCCATGAGCAATTACAGTTTCATGATATGACCAGTGACCCCTCTACGGTAGCG

TTTGCAAAACTTATTTCTGGATTACTCCATCGGGCTACACGAAAGCTGTGCGACATTTTA

TCTCCATGTCTTGATGCTGTGAGGTGAGCTAACCG

>Rosalind_8703

TCTAAGTATCCGAAAACAGACTGATCACGGGAGGGGCCACAGTTACACAAGTCATGAGAT

TAGAAGCAAAAACCCGTGGTGCGGGTCATTACGAATGGTTGACCCAATGTTCTATCCAGG

CCGTGAGATACGACCCATCCAGCCCTACGAGTACGCGGACGGAGTCCAGGGCTGGGTGCC

AGACGCAGTTACCGATGATTCTGTAGCTCCAACCTCTGCCTGTTCGCCCATTCCAGTACA

TCCACACGCCCGTTAAGATGTAATTCGAGTCCCGGACTAAAAAGTTGCGCAAGCCTTGGA

GTGCCGATGTCGAGGTGCCCGATCTTCAACCCCCCGGCTTTGGACTCATTACGGGGTCCA

CGTGTGAACGGAAACTTTACTATGGTTTCCTAACCACAAAGCCTAAGAGGAAGTCGGATT

CGGTCTGGGATAGGTCGTAATGCGCCTCTTTGCCGAATGGATGGGACGTACCATACTAGC

AGGGCGTAAATTACCCTTCGTGAGAAGTCGGATGTTCCGCTACTATGTAAATGGACTGGT

TATCGACGTATGACTTTTGCGACTTAGGGGCCGTAGTCCAATCATTGGGCGCAGATGTTG

CCAGGATAGTGTGTTTGACCCCGCCGATCATCGTGCTGGGCGCTATGGGGCGTCTCAACA

TCTAACCGTTCGGAACAGACCTGACCGCACCGGTTTCGAGTTACGCGGGGTCAGGGGAAC

CCCTTCAAGCTCCTTCCCTTATCTCGTACTACGTATAGTATGTAGTGGCTGTCCACCTTG

AAAAATAGTAAGAGCACCGACCTACAGTTGATCGGCACTGTCTCCTAACTTG

>Rosalind_1385

GGCACCATCCACGGCGTAGCGCGGGTGACTTGCTCCAATGTGTCCCATCGTGGTTGATGG

GACTGCGTGCCTAGCGCGCTAATGGGCTTCAGGAGGGAATCGACACCCTGCCGGCCCGGC

AAGACAGACGATGCCCACGAAGTAGGGTAACGCGACCAAGGGCAGTAACGGACGGGTCGT

AGCCAGGAAGGTCTAACGGGAAACCGTGTCAGGCATTGCATCAGTATGCCGCAAGAGATA

CCTCGAACTGCTCGAGACATCAATCAACCGCGCCCGCTGAGATAGGCCTGCAGCACCCTT

TAGTTCTGCTTATGCGATTAATTGCGCTTATGAGTCCCCGGGGCCGAGTGCCGGTTTCCA

ACCTAGTCATTCAGTTGCCGGCGAGGTCACCTCTGACTTTTAACGTATCATACTCACAGG

GCGGGCACAGTCTGCGACGACCTATGCGCCAGGCTCTAAGACTTCCAGTGACTCCGTTGG

TTCTTCAGCTTTCATATTTCAGTGAAGTTGTAACCCTCTGTTATAGTATGGGAGGCCCTG

TACACCAGTGCTATTCCGGTATTAGAATGCTAGACTAGTCACTTCATGCGAAGGATCGCA

TACATATACGCCTCCTTAGGCAGTGCAAAAGGCATTTAGTAGCGTACCTAATTCCGAATT

CACACAGGATTGGCACGACAGGCCAGACATACCATTCGTGTATAGGGGGAAGCGTATTGT

TGCAGCCGAAACTGTTACTATCCATGAGCGGAAATGCAATACACTTTAAGTCCTAATACT

TTCTCCTTTCTGTAGGCCGCTACGGGGAAATCTACCATCACGTAAGGGGGACCATCGGAA

AAGTCTTACTGACATCCCGAGTGCTGCCGAATGGGAGTGGTAGGCCATTTTCTTTTAACG

ATTCCGTGTACTGATATGAGAATCACGGACGGATCAACAGTGAAGGAGCAGTACTTCGAC

GAACTCGATTTGGGCCTAGATTGATGAG

>Rosalind_4521

TTCGGAGTACCATGCCGAGCGGACCTTGTAATGCGAACTTGCTAGGATTTGGTGCACTTA

TCCGAAAGAGTTAAGATCGGGGCTAGTGTGACAAGGTTCAGGGCGCAGAGTGCTAATTCC

AGGGAGCCCTTACGATGAAGCGAAGGGTTAAAGTCGTGTCTATTTTTAATTTGGTTATGG

GAGTACTGCGCATTCTTGAAGGTGTCTCGTGTCTTTTCAGATGATGCTCTATTTCAGATG

TCGCGTTCCTATACATCTCGCTACCGTGATTACAGGCGGTTGCCCGTTCTGCGTCGTAAT

CGCCACCATCGTCCTCCAGCCATTAGGGTCTCTCACAAAGTATTTCAAGGCATGCCATAG

AAAAGAGGGCATTTCTAGTGAGCATTCCGAAGTAGTTAATGGCTTGCGTTTCACGTATGG

TATAGTCAAAAGTGGGGACGAGGGTAATATCAGAAGGATGTCTCTGTCCAGGTTGCGGGG

CAGCCATTCGATACGCTGAAAGGGATACGTCATCATGGAAGAGACCAGGGTTTGCGTAAA

CGCTATTTAACCGATGCTGTATAAATTCTACTTGCCAGGGATGATCTGAATAATGCTGGT

TGCGAGCCCTAGGAATACGCTCAGCAGTTACTGTGCTAAGGGCCTGGTGCTGAAAGACTC

GAAAGCTAACGGACCCCCGACCTTCGGTACTCAGGGCAATGGAGGAGAACCCGTGATGTA

AGTTATGGCAAGGCTTTGCCCACTAATCGGTATTACGCATAGGTTAGTTATTTCGTATTT

AGGGCTATGTCCTTTCCTCGGGCGTCAGGCATGGGGTAGTAACGCCCTCGGTGCGAACGT

CCGGAGATTCACTTGAAATGAAGTGAAGACAGGCCCCCTTTTGGCGACGACGAAAACCAA

GGCCAAACATCTGACATTAACACCATTACGTCCAGTATCTCATTCCGGGACTCGAGCCAC

CGCATCATA

>Rosalind_2518

CGCTGGCGTTTTGCCTGCACGCAAGTGGGGTGCCGAGACTAACCGGTGGGGCTGCGAATA

CCCTAGGGGCATCCGGTCACTATTTGCTTCATTGTAGGTCCGGCCCTTTTATGCACGGCC

CACGTTACTCCGTCAGGGTAGGGGACATCCCCTTATGATCCGACGCTATGACGCCAGTCT

AACCGCCTTCCCAGCGCGGGGCCGCGTCAACATATGAAGGCATCGAGATTATCGACTTCG

ACATTGAGACCGACGGGTTTTACTGATTTTGTCACAGGATCCAGCCCTCTTTACCTGCAC

CGTCGACCACCCTGTAATCAATCAAACTTAAGATGGCAGATCCAGGATTTTATATGTCAT

TGAACCCGAGTCAGTACTCCCTCAGTACGGAGCGGTTAGACTATGTAATCGACTGCTTGA

GTACCAAATTGGCTGATCCTAGGATTAACATTTATCAATTAAATGGTCTAATCCATCGTT

CGAGCAGAAGCCTGCAGGGGTACTTAACATGTAAAATGCGGCGAGGGGTTAGAAGATTCT

TAAATTCGCTGTTGCTGACTCCGGGCGAGTTTCTTGAAATGGCTCTGGGGTCCGGTGGAT

CACGGGCTTACTACGCGGCGGCTGGGCAAGCATATAGTATCAAACACTCGAAATTTCAGG

CACCATCGAACGTGCACCTCCGAATGGCAGTTGTTCTCTGCGCAGCGCACCCGCTGGGCC

GACCTGCCATCCTCCATCGAGACCCGAATCCACATCGGAAGGGTCGTGACGCGTGACTCG

CTCTTCAGGCAGAATATTCATGCGGCTTTCTTACGGATTGATCGATCGACAGCTCGAAGC

AACCCGTCCACGGTCCTGACTGGAACACGCATCAGTTGACATGATGGGAGGATGTCAGCA

ATATAG



arrange.vars <- function(data, vars){
    ##stop if not a data.frame (but should work for matrices as well)
    stopifnot(is.data.frame(data))
    ##sort out inputs
    data.nms <- names(data)
    var.nr <- length(data.nms)
    var.nms <- names(vars)
    var.pos <- vars
    ##sanity checks
    stopifnot( !any(duplicated(var.nms)),
               !any(duplicated(var.pos)) )
    stopifnot( is.character(var.nms),
               is.numeric(var.pos) )
    stopifnot( all(var.nms %in% data.nms) )
    stopifnot( all(var.pos > 0),
               all(var.pos <= var.nr) )
    ##prepare output
    out.vec <- character(var.nr)
    out.vec[var.pos] <- var.nms
    out.vec[-var.pos] <- data.nms[ !(data.nms %in% var.nms) ]
    stopifnot( length(out.vec)==var.nr )
    ##re-arrange vars by position
    data <- data[ , out.vec]
    return(data)
}


Usage


dataframe <- arrange.vars(data frame to rearrange, c("colname" = target position))


mergedat <- arrange.vars(mergedat,c("Fold_Change"=2))
