Tuesday, June 9, 2009

Calling all Pythonistas!

Okay so I have a question:


I have a pile of sequences (200+) in FASTA format in a text file, e.g.

>geneA
asdfdfasfsdfasdfsdfsdfsfdsfsdfsdfsdfasdfsdfsdfasdfasfsdfafasdfsdsdfsfs
afasdfasdfasdfsdfasdfdfafdasdfsdfafasdfasfafdsfasfafsdfsdfasfsfasdfsdf

>geneB
asdfdfasfsdfasdfsdfsdfsfdsfsdfsdfsdfasdfsdfsdfasdfasfsdfafasdfsdsdfsfs
afasdfasdfasdfsdfasdfdfafdasdfsdfafasdfasfafdsfasfafsdfsdfasfsfasdfsdf

>geneC
asdfdfasfsdfasdfsdfsdfsfdsfsdfsdfsdfasdfsdfsdfasdfasfsdfafasdfsdsdfsfs
afasdfasdfasdfsdfasdfdfafdasdfsdfafasdfasfafdsfasfafsdfsdfasfsfasdfsdf

I need to make it into columns like this:

geneA asdfsdfasdfsdfsdfsdfasdfsfasdfsdfafsdfsdfasdfasdfsdfasdfs
geneB asdfsdfasdfsdfsdfsdfasdfsfasdfsdfafsdfsdfasdfasdfsdfasdfs
geneC asdfsdfasdfsdfsdfsdfasdfsfasdfsdfafsdfsdfasdfasdfsdfasdfs

all in one line. I want the text file set up like this because I want to use Python to stuff it into an SQL table, only I don't know how to do it. I can concatenate the sequence part in Excel for one of the sequences, but that doesn't work for the 200+ other sequences I have. Everything I have found on FASTA and/or concatenation involves simple exercises or pulling down individual FASTA sequences from GenBank, which didn't help me a lot.

So what I want to know is: how do I get the sequence name in one column and each full sequence in the other, so that I can make an output file that I can open with a Python script and then stuff into an SQL table? Any ideas?

E.

3 comments:

Hermitage said...

I had to do this once and can't for the life of me remember how I sorted it out. If I remember I'll definitely get back to you (I know this is not very helpful!).

PhizzleDizzle said...

If this is already in a text file, then use Python to manipulate it into the format you want. Or even just go straight to SQL.
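Going straight to SQL could look roughly like the sketch below. It assumes the built-in sqlite3 module, since the post doesn't say which database is in use, and the table name, column names and example pairs are made up for illustration.

```python
import sqlite3

# Hypothetical example data; in practice these (name, sequence) pairs
# would be read from the FASTA text file.
records = [
    ("geneA", "asdfdfasfsdf"),
    ("geneB", "afasdfasdfas"),
]

# ":memory:" keeps the database in RAM; pass a filename to keep it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (name TEXT PRIMARY KEY, sequence TEXT)")

# The "?" placeholders let sqlite3 handle the quoting of each value.
conn.executemany("INSERT INTO sequences (name, sequence) VALUES (?, ?)", records)
conn.commit()

# Read the rows back to check that the inserts worked.
rows = conn.execute(
    "SELECT name, length(sequence) FROM sequences ORDER BY name"
).fetchall()
```

With another database the connect call and the placeholder style would likely differ, but the shape (create table, insert pairs, commit) should carry over.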

Have you looked at the re package?

Something like

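import re  # regular expression package

name_rx = re.compile(r">(\w+)")  # grab the name after the ">"

sequence = False  # True when the next line should hold a sequence
mysequence = ''
name = ''
# text mode, so the str pattern above can match each line
with open("mytextfile.txt", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if sequence:
            mysequence = line
            sequence = False
            # insert name and mysequence into sql db
        else:
            name = name_rx.search(line).groups()[0]
            sequence = True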

This is really hackish; I wrote it stream of consciousness. You might have to do additional stuff if your sequences have newlines in the middle (which I somehow doubt). Hopefully this framework (or at least the pointer to re) will help. Good luck! I love Python!
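For sequences that do wrap over several lines, as in the example in the post, the "additional stuff" could be sketched like this. It is only one way to do it: the function names read_fasta and fasta_to_columns and the output filename are made up here, and it assumes the gene name is the first word after the ">".

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs, joining sequences that wrap over lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines between records carry no data
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            header = line[1:].split()
            # assumption: the first word of the header is the gene name
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # one more piece of the current sequence
    if name is not None:
        yield name, "".join(chunks)  # the last record has no ">" after it


def fasta_to_columns(in_path, out_path):
    """Write one 'name<TAB>sequence' line per record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for name, seq in read_fasta(src):
            dst.write(name + "\t" + seq + "\n")
```

Something like fasta_to_columns("mytextfile.txt", "columns.txt") would then give one gene per line, with the name and the whole sequence separated by a tab.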

tideliar said...

Good God... I've been trying to learn Python for...damn near 76 hrs and I can't get some of the stuff in the sodding tutorial to work. Even when I paste it into the Python Shell.

I wish you the best of luck!