- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563348 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174227
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
& K& n( U/ P. e k7 z! f, t$ f5 @$ S基因组测序模拟$ `6 c$ N) S/ c& X+ A( Q& G) i
, g4 k" N# m9 j" h一、摘要* r7 C$ ~7 n; ^8 _, f
* G( F6 N1 V+ j3 F
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件. q3 y* {# J5 t, J) F
# a$ }: q) }3 w! M4 n" E' c# L
二、材料和方法
! v- A9 V: }, d# M, U) B2 P3 s8 |* ]; t+ ]8 I& \
1、硬件平台( |$ \( y {0 `/ Z# y
4 w+ H6 H& ]; m) {3 x
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
0 \/ Q/ `: \) n9 }* R$ P安装内存(RAM):16.0GB
/ O, N, |" t( `2 l7 i% t7 f1 g5 e5 W% K8 T
2、系统平台$ s/ B( r- {' g- Y4 y
Windows 8.1,Ubuntu
" N2 w# H: P' y) {
5 H2 @4 {& X3 V/ c* O+ _3、软件平台
U8 p# c1 G# [" L( L
A1 `5 k( q. Zart_454
3 z0 V* n3 c) k7 P7 d& X% P9 CGenomeABC http://crdd.osdd.net/raghava/genomeabc/
3 a; w# X( h" t" ~2 d! @# kPython3.5
3 k& r3 \* \' v$ h3 ]+ IBiopython5 r2 J0 z8 m/ E, s9 Y9 s2 }+ K
4、数据库资源* v" w! o7 c! k: t, x
/ J/ A: k6 y9 |0 [
NCBI数据库:https://www.ncbi.nlm.nih.gov// |6 i6 D. \( ^- ]' U6 {
2 A K; \) k2 m2 P/ [& C* u* q5 m5、研究对象
& k7 y8 ]# l8 Q- u4 T0 d- n7 ^8 ?2 @; [ \2 \! }! I: l
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
) f4 |+ {" s) C/ N9 Oftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
; d% d: |; W. b4 n0 z2 T- e: j9 }2 C1 a
6、方法& u0 P( @5 u: q9 h! z0 Q3 R" h6 I
) i! K C. @( f3 W
art_454的使用
: P4 v% _- q" ?$ P首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。, ^4 K1 v* P( z% J0 M8 E
GenomeABC
. B* e" a, [1 {& j% |进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
1 J" ^* s, I) ?) P编程模拟测序
6 @5 X( f" D% F: N L下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
2 L, f, K8 y' k! y3 s三、结果
( g4 q9 @7 Z2 C( o! x ?7 P1 b
4 y* }% m$ ^3 X0 X( D3 ?- B0 F1 g1、art_454的运行结果
: R3 M; d# y2 z# d: H& G3 e$ `+ \7 X# c \5 P, o" z
无参数art_454运行,阅读帮助文档
! Z& p& ^/ |5 I9 ? x8 S/ m E
# z& }% [3 S: _- f3 \; B图表 1无参数art_454运行
' A1 ^2 z8 T2 p' T8 E5 b对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
s, d1 t, N: S6 p下图为模拟单端测序,程序运行过程及结果 7 c. k7 l+ w% H. I( B5 P" S
) Z! n$ @/ i) B" L# t图表 2 art454单端测序 ' L" i8 k( F3 W7 [5 g9 G
5 ^, d( \. J/ w2 g% ~' r
图表 3 art454单端模拟结果
5 c' g e6 j' @4 P* M5 e双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 5 k0 E$ n7 `- {2 Z5 |
下图为模拟双端测序,程序运行过程及结果 2 ?# C' M7 q D+ @
. L1 e9 t5 S3 |& ?/ c# O
图表 4 art454双端测序
# o- y$ y; N; h# e
4 a2 B( w4 o) F( ^图表 5 art454双端模拟结果 # g9 m/ y ]5 J [& }
2、GenomeABC
: q+ O# x3 k3 M# W: S! a. ]0 h6 o6 m下图为设置参数页面 2 X3 D5 J& \; u1 l8 B# o
3 s* ]) \$ H3 o: w3 a8 j
下图为结果下载页面
: ?- |+ | H% u/ W8 c' K' P
& [" ~9 i9 t4 n& M+ Z图表 6 结果下载页面 $ [9 ?8 d" K: Q) R! s
3、编程模拟测序结果 + V& @6 y8 q9 A" p4 A9 r+ g
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
8 F" q$ m3 N6 T; x单端测序 . N+ n) X2 H; ?
. y! x* y8 v4 r0 K图表 7 程序模拟单端测序 $ n! J; U6 a1 ~+ J
双端测序 5 \' z! |$ {$ j3 J% S* O" n" u
: p$ F7 t6 v y% d图表 8 程序模拟双端测序
; ~7 o' M7 K6 y, U& i测序结果 + m0 M% t% Y+ g* v
4 \. h' b% r1 j9 F+ R
图表 9 结果文件
2 a; Y: w# a* {, v! K$ l
& L6 P! t/ t* q; X) ?3 \因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 . y( P3 k9 r. Q
测序结果统计表* _$ c! R) L& r: `' L
8 s) f7 D" Q3 u$ J
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
: `; X, f) X5 I, r) G! E) b单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
5 v" z' a2 o, b- ?. M: I! _单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424# l. C6 i" Y& k
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.713888 O3 c( D, d( B+ U! V
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886- b, m$ [5 J- Y3 j2 A
四、讨论和结论' ~- q) ^8 k8 L! j
! g* N0 x$ U7 K. V程序运行方法
/ _+ U" z% a- [# s& S* I) E7 T* l. {+ V _' O" y
在类的构造方法init()中,调整参数。 & y* P9 W6 M3 j$ o0 E! o3 k
Averagefragmentlength为片段平均的长度;
B" p/ m8 z) X5 S, C$ \' K4 x1 Ominfragmentlength和maxfragmentlength是保留片段的范围;
D( B! ?. f3 S$ m1 P8 s! NcloneRetainprobability是克隆的保留率; ! I0 _+ }/ ?1 ]2 W
minreadslength和maxreadslength是测序reads的长度范围
8 y& e$ _" Q- Y B. a6 ` g
4 e* X# Y7 G6 c9 v) T! N& f- y) G模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。. i1 k W0 j' Y5 o! v5 W
, w+ {& H( w8 Y" a+ g附录
7 A* y, D& [# c r! a( m# z# X- O1 {9 {1 [
from Bio import SeqIO8 Z5 L5 E" h9 |1 a2 I, s, J% m
from math import exp
4 W7 C9 M' A p# I7 ]: j# j: ]. d# }1 Ximport random
% }* I, Y. L& c5 ]8 A, Z9 T; Q I" u2 x* H5 [
class Sequencing:: r$ W* |) h7 \( }' d2 x1 G% h
# N代表拷贝份数) B5 J7 H1 h. v6 C
def __init__(self)- p2 q8 p1 N \* ]1 [8 z
self.fragmentList = []
4 ?4 p) `6 c) H4 K. V8 d- Q4 b self.readsID = 1- C: P5 a6 l7 D1 y$ K
self.readsList = []
" Q* l, o/ V: E: c2 ] self.averagefragmentlength = 6507 b8 x/ j& O7 l4 M! A; Z5 O. [
self.minfragmentlength = 500
7 P2 Z& z% {2 r6 [- i self.maxfragmentlength = 800
$ O9 T$ q9 {9 x) _6 M+ X self.cloneRetainprobability = 1" P4 v/ u) {( Q/ Y5 i% M1 A
self.minreadslength = 50
: C5 ]0 Z6 J: V: q( s self.maxreadslength = 150
9 M4 i3 e- ?% C self.N = 10
/ y& _+ p" p% k! V self.genomeLength = 0
3 _* B2 k" V. |1 X/ ? self.allreadslength = 0+ t: I a4 K i( T# r( {; ?
& L6 {, V6 S6 N: D # 生成断裂点
! O% y- @9 A8 ^ v* B( k def generatebreakpoint(self, seqlen, averageLength):/ W: Y/ {* K& _" O: i
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数): ~9 X+ ^! j* H& C0 r
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
. U* \* d. Q% g7 c breakpoint.append(seqlen)( m' g/ w! L. ^. }6 q: p
breakpoint.append(0)
: y% S l' O6 E2 j' ]- y, F # 把随机断裂点从小到大排序
! `- K- `* q6 f2 X( I7 @ breakpoint.sort()
* A) u3 i8 U. E- H0 p% O return breakpoint
& N% l) t, d3 N$ t- u* i' V" Y2 I$ q+ I
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
$ F& S, y, ^! t" x" q def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):" P X0 D% [6 |( b7 i3 G
for i in range(len(breakpoint) - 1):
! A. ] k- s$ l) m/ K/ b fragment = seq[breakpoint:breakpoint[i + 1]]
; r7 F1 |+ |* G4 O if maxfragmentlength > len(fragment) > minfragmentlength:. w) F" L5 K: b' j5 D6 i7 v. C; \
self.fragmentList.append(fragment)
" S) r4 H" t4 H# \3 x, ?) d1 i8 m/ P return self.fragmentList+ h" B% D' U& ^, A: z4 ]* W% H
5 A n& b+ N) O+ h # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率$ \" I* F" z* P# u5 d+ |. p0 @. {
def clonefragment(self, fragmentList, cloneRetainprobability):+ h/ S7 v4 B7 R+ A
clonedfragmentList = []! ^/ ]. l. F" v, b& n# ~; z
Lossprobability = [random.random() for _ in range(len(fragmentList))]1 r o) d/ O& v, P' |
for i in range(len(fragmentList)):- Z) W6 H- E! F! ^: V5 }+ V5 S
if Lossprobability <= cloneRetainprobability:
: j. d3 `! T* z2 |: m0 w clonedfragmentList.append(fragmentList)7 p6 u3 f+ C* d+ n
return clonedfragmentList9 Y7 r! r5 p% |. c1 @& y# E$ P
i0 V* U( T. ]( e # 模拟单端测序,并修改reads的ID号$ ]. h) r2 x1 ~$ K5 z
def singleread(self, clonedfragmentList):. @" L! K" n) d, J9 n( }; F4 ~
for fragment in clonedfragmentList:( p) C( U: Z5 ?0 e
fragment.id = ""
6 @, O3 M7 k& X7 P; J( n' T. y fragment.name = ""
3 a1 H! W) M- T4 I) [1 M fragment.description = fragment.description[12:].split(",")[0]$ Y+ @0 \4 d# @8 d4 b
fragment.description = str(self.readsID) + "." + fragment.description
! n$ e5 q: A6 w- W' | self.readsID += 1
9 c( e( k: a8 B& _7 B readslength = random.randint(self.minreadslength, self.maxreadslength)
+ T8 }# ]% K5 O6 C* L self.allreadslength += readslength6 g8 M" m, X# K+ M, R
self.readsList.append(fragment[:readslength])
5 }* X# L# ^( E/ V) s( G1 K. u3 q! s2 x" \
def singlereadsequencing(self, genomedata, sequencingResult):! B4 c( Z/ H+ }5 A. [/ w
for seq_record in SeqIO.parse(genomedata, "fasta"):. Q; y0 r- e8 T% M) h* p
seqlen = len(seq_record)0 b8 B; b2 p! [2 B7 w( x
self.genomeLength += seqlen; j- J. `' C+ e$ P
for i in range(self.N):1 ~$ o* m" E6 r& l. X# E5 Z
# 生成断裂点
" Y/ p$ _! u; V5 d- f breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
& q& |3 v+ L' X3 H2 w, k% ] # 沿断裂点打断基因组& d# a! J; d" ~2 E5 D0 O3 G. D
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)+ s( |& [5 ?$ Y% n! g5 i/ W& e6 g
# 模拟克隆时的随机丢失情况
# I! `& o. w( t" L: y' l2 |$ U clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)% R9 v- \/ i; Z3 Q5 B
# 模拟单端测序 ]- `: g$ m: L) _
self.singleread(clonedfragmentList)
9 u; ]6 w8 ]3 H7 R3 G4 ? SeqIO.write(self.readsList, sequencingResult, "fasta")/ G; Y$ o! t* m: h4 b. h$ q
1 V# X; _$ f! N& q' Z' l2 Z
def pairread(self, clonedfragmentList):
+ [/ f$ K, w1 c( v \4 N. h for fragment in clonedfragmentList:
S5 X# L4 a- E8 Z3 Y# ` fragment.id = ""
$ T/ I: y9 o% H( u; ^1 H fragment.name = ""
! V* _ L0 V* X0 y& u% u: ]1 E description = fragment.description[12:].split(",")[0] b* C+ p1 V N# `4 r" Q7 `+ Z
fragment.description = str(self.readsID) + "." + description# V9 i7 h ]% ]( r _
readslength = random.randint(self.minreadslength, self.maxreadslength)
. a+ f7 T6 X+ V' H0 ? self.allreadslength += readslength+ O9 p ^' ?) P. r5 q6 r& j( D
self.readsList.append(fragment[:readslength])) \$ E) ^8 I! }% B5 J" K+ S4 R
0 M! l2 m6 V, F* K+ d8 u
readslength = random.randint(self.minreadslength, self.maxreadslength)0 ?1 S+ f8 [: G1 f8 _. p8 n
self.allreadslength += readslength( z" Z3 ?! s8 h4 t- S5 O8 S" p* O
0 u1 z) c d: q. f fragmentcomplement = fragment.reverse_complement()$ o$ I8 j8 A! M+ L
fragmentcomplement.id = ""( u2 E& w$ H7 u# d9 v
fragmentcomplement.name = ""$ [# i/ A6 I' Y9 ]5 Q( @+ I, \5 U6 V
fragmentcomplement.description = str(self.readsID) + "." + description
4 g5 ?; z8 F0 [- N1 w self.readsList.append(fragmentcomplement[:readslength]), b# H9 k* O+ o/ D. G5 M% s
4 S1 Q# B9 n* |6 V4 i' M6 k self.readsID += 17 b5 g/ C+ H, Q0 S) k
) Q7 q0 t9 P# H6 [5 S def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):) K+ ^/ R3 K' L- O
for seq_record in SeqIO.parse(genomedata, "fasta"):0 v7 \$ B0 k5 D& `! S
seqlen = len(seq_record)# n4 ~$ z9 I# C. B
self.genomeLength += seqlen
% x9 A6 M8 P- G& P% L. ? H0 N for i in range(self.N):+ B( |& w) F7 w( y' d; S X) }9 t
# 生成断裂点
# @9 E$ {7 g& X% n. S# W breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)- F L+ r6 Y# T9 t: y
# 沿断裂点打断基因组6 g5 ]8 b- B7 Q# S4 j
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
}! ~; o$ Z: X. b, ^ # 模拟克隆时的随机丢失情况& I0 g; D' u9 K
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability), |. `' k4 N3 ?$ M
# 模拟双端测序
; r! L- f6 ^) e% r self.pairread(clonedfragmentList)6 W$ {% v) K1 Q+ s) n
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
/ @, e3 k( ^8 m4 o readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1] B0 t9 ]8 ]" r# ]3 [
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
. @3 s" I$ O" {: ]6 l+ ^ } SeqIO.write(readsList_2, sequencingResult_2, "fasta")
5 g8 g/ V; V. I1 p1 b! E5 V- w
3 N1 N8 R* Q3 J+ n7 h; \* [" F def resultsummary(self):
& T, _ T/ w: ?8 l/ e6 h; ? print("基因组长度:" + str(self.genomeLength / 1000) + "kb")9 u9 a4 y$ R) C$ S% I2 y& h3 i5 _
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))# ~1 |) Q% |! f; ~' }5 r
print("N值:" + str(self.N))
5 n4 L, V& L" E$ G print("期望片段长度:" + str(self.averagefragmentlength))
+ N4 {- H: ?. ~5 J7 Z1 |1 j# z5 K print("克隆保留率:" + str(self.cloneRetainprobability))
& m, r7 Y. s9 L print("片段数量:" + str(len(self.fragmentList)))& u- o$ F2 g k% O2 S
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))4 g& `" L0 p0 z* e. F
print("reads总数量:" + str(len(self.readsList)))
4 r- u1 J% g6 Y5 o o. n print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
3 ]% U# v6 r7 ` m = self.allreadslength / self.genomeLength% R Z( {$ u% K+ _% f
print("覆盖度(m值):" + str(round(m, 5)))
; D2 c. t- i7 M6 @' c: n print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
% p( Z/ C! W1 O& d3 a print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))% d K% E2 \# F( Y9 p
# -------------------------------------------主程序-------------------------------------------
W- P6 a! W+ I7 A/ A& A( o# 模拟单端测序
1 e* s5 ?% F H' k' }sequencingObj = Sequencing()5 E6 K0 _, \" x2 @2 s9 v
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")# I% q5 h2 N9 X' f. p
sequencingObj.resultsummary()
* d2 d7 R r: A2 ~$ Z) d& k# R2 Q6 S- h2 P' Q7 t- q
# 模拟双端测序
! N' M4 F! `: usequencingObj = Sequencing()
' g! a( b- l: A% W" SsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")- t2 X/ m: J6 {8 J- n
sequencingObj.resultsummary()
( c9 s# C1 Q9 W2 M3 g# Qfrom Bio import SeqIO5 j9 j, U* R, J# [" c- l' D, f; o( e
from math import exp8 \! J9 o0 e8 D0 o
import random) R& s1 f( l( m
2 s& C8 b' K. i8 \5 O- K
class Sequencing:
; P% t) E6 J3 L% E" k # N代表拷贝份数% Q4 G/ K: @6 `% l8 J* ~
def __init__(self):2 o( S8 a6 l P" Y- ?
self.fragmentList = []5 d! |3 Z' X; R( b M0 w7 Q7 G
self.readsID = 1
9 Y2 F3 S* o) A$ M self.readsList = []
8 q u) x* F0 B1 ~; s6 D/ Z' C; O self.averagefragmentlength = 6508 W; b/ l1 P9 A8 S3 Q4 `
self.minfragmentlength = 500
: u! N) D- w: i* b self.maxfragmentlength = 8006 a/ d' I0 Q2 ^8 L/ [' d& h0 Y! _, R
self.cloneRetainprobability = 13 R( ?! ^6 J8 F8 \( ~
self.minreadslength = 50
9 m9 u# q8 @+ x7 _( X! v+ Z0 J self.maxreadslength = 150
7 C A7 z1 y3 @; Z8 u- { self.N = 10# R+ R; X) x$ {3 K* M
self.genomeLength = 04 e* c3 Z- @' x
self.allreadslength = 0
6 P6 ~ [0 e$ E7 I; L W- B. e- j/ m8 {' s
# 生成断裂点; m* g- O3 }6 p; I8 q v" O' ]
def generatebreakpoint(self, seqlen, averageLength):
! F: G. \* f! k, \ # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
# U/ f3 z2 G6 I& V, o/ ^: p' f8 X breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]5 o. r' y3 X7 ]. Y# p7 t
breakpoint.append(seqlen)+ _" \, E9 t' b0 |" C
breakpoint.append(0)" u" ?2 h [3 I9 `9 ]- K
# 把随机断裂点从小到大排序
y. r% u/ T4 d/ u: \3 K breakpoint.sort()
- v2 D. r. X$ X. X0 |. ^* H4 ?4 T return breakpoint
% Q- K# A! p# V n/ K* D* j f' H- z/ {7 o3 }) q0 m3 `
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp2 P+ @. M4 l) r$ Z e
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
8 ]# ]. ~& I) ~ for i in range(len(breakpoint) - 1):# e' D: P# Z1 j8 v* P5 x
fragment = seq[breakpoint:breakpoint[i + 1]]# o" ]; h( M6 S+ c- \' A$ `2 y
if maxfragmentlength > len(fragment) > minfragmentlength:
6 h) \0 [7 b9 m; r ~+ h self.fragmentList.append(fragment)
3 H) l' c2 m# }4 j; W$ y( Y w9 } return self.fragmentList
# q; w: q5 a+ Y5 Y; e8 K
* E0 J% g3 d, k9 i # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
2 K' v5 b; [ d. d0 I$ l) \9 _! H- _ def clonefragment(self, fragmentList, cloneRetainprobability):
, b9 w; C) Z0 r8 u4 U0 z1 ?" o2 p2 H+ C clonedfragmentList = []% z; K0 [( x1 O* a4 |) U
Lossprobability = [random.random() for _ in range(len(fragmentList))]
! p0 p6 B, D$ V, `9 T for i in range(len(fragmentList)):, |$ }7 h5 n( t' H* _+ z
if Lossprobability <= cloneRetainprobability:3 Y6 C ?/ T' n$ v# P! W% N
clonedfragmentList.append(fragmentList)
- Q: t c8 h: L+ c" R/ `+ | return clonedfragmentList
. y( O$ L4 L" P# H9 V: J& J7 k. \4 t8 q* U4 l! t, @) g7 Q, ?6 P) s2 H: G
# 模拟单端测序,并修改reads的ID号% C, }+ z! v( M5 O5 r
def singleread(self, clonedfragmentList):- [, e& G; t2 R9 u+ @( ~. w
for fragment in clonedfragmentList:
u+ u, e/ u$ b0 L$ w8 |6 w fragment.id = ""
W9 @5 A+ u! U9 q- o5 A fragment.name = ""
+ c7 O* j# J! n3 [9 X fragment.description = fragment.description[12:].split(",")[0]1 j5 p6 q+ d7 x$ @# |5 P k0 A1 g. P
fragment.description = str(self.readsID) + "." + fragment.description9 ]# v9 G. Z5 W* a' y
self.readsID += 1" q; Z' b- `5 ?" g
readslength = random.randint(self.minreadslength, self.maxreadslength)/ Q+ q+ F4 o) Q" |
self.allreadslength += readslength
1 g9 X# I% x! S4 s9 n2 b* T self.readsList.append(fragment[:readslength]): l2 g/ i& }& {3 z* B3 k& ?7 V
& |, v' {1 C; Q; |6 e
def singlereadsequencing(self, genomedata, sequencingResult):
- U2 N8 v4 }& m+ M for seq_record in SeqIO.parse(genomedata, "fasta"):3 N n+ x* S4 g5 K* @. k2 o- h3 p
seqlen = len(seq_record)) S1 B! \' D; F; Z K' A
self.genomeLength += seqlen' K* R& x. G; A' z* v+ G, W- I z
for i in range(self.N):# _/ P8 O3 H' [+ r8 Q5 u( a! F
# 生成断裂点
3 {$ m; e) ^+ ]9 `! a/ W6 Y# H breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
1 q4 m; o. Y/ g8 v- a; x # 沿断裂点打断基因组
, A4 _; _4 r7 ]. G! I7 O8 p self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)( I1 ?; b) p' n
# 模拟克隆时的随机丢失情况
8 V$ P! _3 W3 d9 C, l3 `7 S- l, q clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
* y* J+ q k& g. z/ l # 模拟单端测序
* o/ D: O$ m. V/ d4 j1 j7 { self.singleread(clonedfragmentList)
3 t+ L, f, M" y. [ SeqIO.write(self.readsList, sequencingResult, "fasta") s0 ~, m( v9 @ r% A3 v
2 h" i6 o I1 h0 ^) q
def pairread(self, clonedfragmentList):. h5 M* h. B; ?$ m2 ?4 ~' D
for fragment in clonedfragmentList:) g. Y5 ]6 B( i
fragment.id = ""
& r. r6 D1 a# G) |: @ }. w+ } fragment.name = ""2 {1 m1 W* J4 D7 m; l3 G1 s
description = fragment.description[12:].split(",")[0]
* b/ B* g2 [+ }0 c fragment.description = str(self.readsID) + "." + description
- J; |0 q* q8 G* |# X- A: K readslength = random.randint(self.minreadslength, self.maxreadslength)
2 R0 _# [, |3 N7 E0 S+ C* @ self.allreadslength += readslength6 ?8 G( [1 C, n: D, S9 J
self.readsList.append(fragment[:readslength])* X4 V/ x0 @1 X+ T% p/ c
) O3 ^0 v9 C" R+ f a
readslength = random.randint(self.minreadslength, self.maxreadslength)
8 E; S/ Q M1 d5 n" R5 Z self.allreadslength += readslength
7 C1 }% D3 U7 }, h3 c4 Z0 c
/ c$ D* N* u6 j5 T* ^ fragmentcomplement = fragment.reverse_complement()
$ U+ o$ o3 [8 u7 Z fragmentcomplement.id = ""
, M; h: @3 f& J0 ~# S6 T4 e; j$ ` fragmentcomplement.name = ""
8 L/ R1 ?: R+ h( D fragmentcomplement.description = str(self.readsID) + "." + description) B! y- T3 o( k5 S
self.readsList.append(fragmentcomplement[:readslength])
) z H) y8 E3 A1 \. i% x1 H$ b; Z# R) D& d, V
self.readsID += 1; Y5 m8 [3 Y" d" o" O1 W: N
2 E- o- ?' W; p) I
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):) t9 J5 F8 k! V; D" Q6 I7 M
for seq_record in SeqIO.parse(genomedata, "fasta"):
) {3 a X' r0 M2 w seqlen = len(seq_record)
: M2 M, x% F3 V6 |' l! J6 s% u self.genomeLength += seqlen
z% T# y, i! z; s8 _ for i in range(self.N):2 \0 R( }7 {; I0 I# {
# 生成断裂点
" P `) c- Q5 J. H" j8 ] breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength). x- L" e0 ` r$ Z: A1 L- ? B& h
# 沿断裂点打断基因组8 V1 Z! v0 s: E. \0 m* ^: |
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
- ^4 u/ `/ D6 u( u& u: Y+ h) [1 R # 模拟克隆时的随机丢失情况- M/ D- H( ?( \, I A
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)0 S. C" C" l9 h7 h! I
# 模拟双端测序
. G d! ^3 ^! j# t6 r' v {0 c4 {5 V self.pairread(clonedfragmentList)
! Y+ @/ x( u4 o3 u0 E2 ^5 B readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
: [: ] U g( ?* R( j. x5 [ readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]$ T; | y; P( w9 [, d! O( O# q
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
7 N$ l; v+ x6 d+ V SeqIO.write(readsList_2, sequencingResult_2, "fasta")
7 f! a& K" Q* V2 R c2 M* U r" n3 a! B& R! C# L
def resultsummary(self):
; ~* Z7 ~% U1 b* p+ d% ?! b print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
( g! f& C- `8 C print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))" I" x; ?7 `" Z& k8 H) x' b
print("N值:" + str(self.N))* u% i4 I, i5 u8 c6 a/ Y
print("期望片段长度:" + str(self.averagefragmentlength))% V3 ], E6 ^- @1 a" e4 }/ S- \# a) j5 H
print("克隆保留率:" + str(self.cloneRetainprobability))
' p1 X) E* j, e2 d n. R print("片段数量:" + str(len(self.fragmentList)))# l" J, e c5 @
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
- `+ \4 u+ Q ~ print("reads总数量:" + str(len(self.readsList)))
. ^( X( `4 O: y* t0 K print("reads总长度:" + str(self.allreadslength / 1000) + "kb")+ |9 g6 C/ p; b3 y0 U. I
m = self.allreadslength / self.genomeLength4 {' v W0 c5 O! Z
print("覆盖度(m值):" + str(round(m, 5)))
7 B3 ] {5 H) | print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))* ]( x' Z0 Q" T5 \& @3 V$ k
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
2 D' G. D J8 l# -------------------------------------------主程序-------------------------------------------
2 h' u1 c2 B: B% S# 模拟单端测序
1 l, c, l. K" D" _( b( [1 w" hsequencingObj = Sequencing()* [; ?0 v( t3 ~4 L; B: L# q. R
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")" G/ t" _7 c8 Y
sequencingObj.resultsummary()
' r% B3 s- \7 I4 p0 M3 G
) X' }' Z6 i, z; P8 B- p# 模拟双端测序
, C- t% F; o% ?) \: gsequencingObj = Sequencing()) q. M% {, w4 z
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
, i! J; w; ]+ m8 p- |5 @sequencingObj.resultsummary()
2 j* @7 N. n/ M0 R6 z4 o
* P @" }+ M, B @: i4 \& l, k
5 w. j4 l6 Q: I+ ?, u5 S7 T0 Q6 d
! K/ ^3 o, y1 f! T3 N2 h( R6 O7 L 8 v4 u" C: O/ I9 @$ d: A, G) M
|
zan
|