Bioinformatics home

How to run SOAPdenovo-31mer / SOAPdenovo-63mer / SOAPdenovo-127mer

2011-07-31 03:52:43 | Category: Default

I ran into a problem using SOAPdenovo-31mer: it crashed with a 'segmentation fault'.
That is ridiculous. I had downloaded a .sra file from the SRA database at NCBI (GenBank). SOAPdenovo-31mer needs its input in FASTQ or FASTA format.
FASTQ files are not offered for download there because of space constraints, so we are given a tool, the SRA Toolkit, to convert .sra to .fastq.
I checked the format of the dumped data; it looks like this:
@SRR063318.1 FC617AVAAXX:2:1:1021:831 length=150
NTCGCAGTTAGGTCTACATGAGTGANAAATTCGGCTTGCATAGTTTGCCTTGTGCATTGCATCGACAACAAATANCCTGCATGGAAGTGGTTTGAGAAGCAAGATGGCATCCGATCATTGCAGAAGAATAACAAAGATCCTGCCCCAGAA
+SRR063318.1 FC617AVAAXX:2:1:1021:831 length=150
!########################!################################################!CCC@C@CACCCACCCCCCCACCCCCCCC?ACC;CCCCCBCCCCCCCCCC3B>AAA;?A=<A>AAA@C9C?CBC@C
Oh my god, each record holds a paired-end read: the two 75 bp mates are concatenated into one 150 bp sequence.
We need to split every record into two reads, together with their quality strings.
The right format is as below:

@SRR063318.1.1 FC617AVAAXX:2:1:1021:831.1 length=75
NTCGCAGTTAGGTCTACATGAGTGANAAATTCGGCTTGCATAGTTTGCCTTGTGCATTGCATCGACAACAAATAN
+SRR063318.1.1 FC617AVAAXX:2:1:1021:831.1 length=75
!########################!################################################!
@SRR063318.1.2 FC617AVAAXX:2:1:1021:831.2 length=75
CCTGCATGGAAGTGGTTTGAGAAGCAAGATGGCATCCGATCATTGCAGAAGAATAACAAAGATCCTGCCCCAGAA
+SRR063318.1.2 FC617AVAAXX:2:1:1021:831.2 length=75
CCC@C@CACCCACCCCCCCACCCCCCCC?ACC;CCCCCBCCCCCCCCCC3B>AAA;?A=<A>AAA@C9C?CBC@C

According to the SRA Toolkit documentation, fastq-dump can do the job.
I tried fastq-dump xxx.sra, but it gave me the concatenated records shown at the top.
To generate the right format above, I wrote a Perl script, read.pl, that splits each record into its two mates.

read.pl
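#!/usr/bin/perl
# Split each 150 bp record (two concatenated 75 bp mates) into two 75 bp records.
# usage: perl read.pl input.fastq output.fastq
use strict;
use warnings;

my ($fileName, $outName) = @ARGV;
open my $file, '<', $fileName or die "cannot open $fileName: $!";
open my $out, '>', $outName or die "cannot open $outName: $!";
while (my $line1 = <$file>)
{
  my $line2 = <$file>;   # bases
  my $line3 = <$file>;   # '+' line
  my $line4 = <$file>;   # qualities
  chomp($line1, $line2, $line3, $line4);   # the original called chmod here by mistake
  for my $mate (1, 2)
  {
    my ($head, $plus) = ($line1, $line3);
    # "SRR063318.1 FC...:831 length=150" -> "SRR063318.1.1 FC...:831.1 length=75"
    s/^(\S+) (\S+) length=150$/$1.$mate $2.$mate length=75/ for $head, $plus;
    my $offset = 75 * ($mate - 1);
    print $out $head, "\n";
    print $out substr($line2, $offset, 75), "\n";
    print $out $plus, "\n";
    print $out substr($line4, $offset, 75), "\n";
  }
}
close $out;
close $file;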

Then I ran the command:
./SOAPdenovo-V1.05/bin/SOAPdenovo-31mer all -s ./config.config -K 29 -R -o graph_prefix

Finally, it worked. Damn it.
Why can't fastq-dump do the right job by itself? Did I use the wrong parameters? Can anyone tell me?
Anyway, I succeeded. I hope my experience is helpful to others.
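
A possible answer to the fastq-dump question: the SRA Toolkit's fastq-dump has options that split paired spots at dump time. --split-files writes each mate to its own file (xxx_1.fastq and xxx_2.fastq), and --split-spot keeps a single file but emits the mates as separate records. The sketch below is only a thin wrapper; the function name dump_paired is made up for illustration, xxx.sra is the same placeholder used above, and it has not been re-run on this dataset or against the toolkit version from 2011.

```shell
# dump_paired FILE.sra: hypothetical wrapper around the real fastq-dump CLI.
# Assumes the SRA Toolkit is installed and fastq-dump is on the PATH.
dump_paired() {
  sra_file=$1
  if ! command -v fastq-dump >/dev/null 2>&1; then
    echo "fastq-dump not found on PATH" >&2
    return 127
  fi
  # One output file per mate: <name>_1.fastq and <name>_2.fastq.
  # Alternative: --split-spot keeps one file with the mates as separate records.
  fastq-dump --split-files "$sra_file"
}
```

Usage would be dump_paired xxx.sra. Read names and length= fields in the output may still differ from the hand-made format above, so it is worth looking at the first few records before assembling.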

config.config
#maximal read length
max_rd_len=75
[LIB]
#average insert size
avg_ins=200
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 50 bps of each read
rd_len_cutoff=50
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (default 3)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (default 32)
map_len=32

#fastq file for single reads
q=/home/xx/gzsweet.fastq

[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
# cutoff of pair number for a reliable connection
#(default 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location
#(default 35 for large insert size)
map_len=35
q=/home/xx/gzsweet.fastq
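
A note on the q= lines: in the SOAPdenovo config format, q= declares a FASTQ file of single reads, while paired reads kept in two files are declared with q1= and q2=. So with q= the pairing between the .1 and .2 records is probably not used for scaffolding. If the mates were written to two separate files instead, a library section might look like the fragment below. The two file names are hypothetical, and the other values are simply copied from the first library above.

```
[LIB]
avg_ins=200
reverse_seq=0
asm_flags=3
rank=1
#fastq file for read 1 (hypothetical file name)
q1=/home/xx/gzsweet_1.fastq
#fastq file for read 2, always follows the file for read 1 (hypothetical file name)
q2=/home/xx/gzsweet_2.fastq
```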


The format of gzsweet.fastq is as below:

@SRR063318.1.1 FC617AVAAXX:2:1:1021:831.1 length=75
NTCGCAGTTAGGTCTACATGAGTGANAAATTCGGCTTGCATAGTTTGCCTTGTGCATTGCATCGACAACAAATAN
+SRR063318.1.1 FC617AVAAXX:2:1:1021:831.1 length=75
!########################!################################################!
@SRR063318.1.2 FC617AVAAXX:2:1:1021:831.2 length=75
CCTGCATGGAAGTGGTTTGAGAAGCAAGATGGCATCCGATCATTGCAGAAGAATAACAAAGATCCTGCCCCAGAA
+SRR063318.1.2 FC617AVAAXX:2:1:1021:831.2 length=75
CCC@C@CACCCACCCCCCCACCCCCCCC?ACC;CCCCCBCCCCCCCCCC3B>AAA;?A=<A>AAA@C9C?CBC@C
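
Before handing a file like this to SOAPdenovo, a quick sanity check can rule out malformed records, which is one cheap thing to exclude when chasing a crash. The sketch below assumes a POSIX shell with awk. The function name check_fastq is made up for illustration, and the limit passed as the second argument is meant to mirror max_rd_len from config.config.

```shell
# check_fastq FILE MAXLEN: hypothetical helper, not part of SOAPdenovo or the SRA Toolkit.
# Reports records whose header, separator or quality line look wrong, or whose
# read is longer than MAXLEN. Exits non-zero if any problem is found.
check_fastq() {
  awk -v max="$2" '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad = 1 }
    NR % 4 == 2 { seq_len = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad = 1 }
    NR % 4 == 0 {
      if (length($0) != seq_len) { print "line " NR ": quality length differs from sequence length"; bad = 1 }
      if (seq_len > max) { print "line " NR ": read is longer than " max; bad = 1 }
      records++
    }
    END {
      if (NR % 4 != 0) { print "file ends in the middle of a record"; bad = 1 }
      printf "%d records checked\n", records
      exit bad
    }
  ' "$1"
}
```

For the file above the call would be check_fastq /home/xx/gzsweet.fastq 75. A clean file prints only the record count.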