My biology paper in Science (really)

Think I’m pranking you, right?

You can see the paper right here (“Synthetic recombinase-based state machines in living cells,” by Nathaniel Roquet, Ava P. Soleimany, Alyssa C. Ferris, Scott Aaronson, and Timothy K. Lu).  [Update (Aug. 3): The previous link takes you to a paywall, but you can now access the full text of our paper here.  See also the Supplementary Material here.]  You can also read the MIT News article (“Scientists program cells to remember and respond to series of stimuli”).  In any case, my little part of the paper will be fully explained in this post.

A little over a year ago, two MIT synthetic biologists—Timothy Lu and his PhD student Nate Roquet—came to my office saying they had a problem they wanted help with.  Why me? I wondered.  Didn’t they realize I was a quantum complexity theorist, who so hated picking apart owl pellets and memorizing the names of cell parts in junior-high Life Science that he had avoided taking a single biology course ever since?  (Not counting computational biology, taught in a CS department by Richard Karp.)

Nevertheless, I listened to my biologist guests—which turned out to be an excellent decision.

Tim and Nate told me about a DNA system with surprisingly clear rules, which led them to a strange but elegant combinatorial problem.  In this post, I’ll first tell you the rules; then I’ll pose the problem; and lastly I’ll give its solution.  There are no mathematical prerequisites for this post, and certainly no biology prerequisites: everything will be completely elementary, like learning a card game.  Pen and paper might be helpful, though.

As we all learn in kindergarten, DNA is a finite string over the 4-symbol alphabet {A,C,G,T}.  We’ll find it more useful, though, to think in terms of entire chunks of DNA bases, which we’ll label arbitrarily with letters like X, Y, and Z.  For example, we might have X=ACT, Y=TAG, and Z=GATTACA.

We can also invert one of these chunks, which means writing it backwards while also swapping the A’s with T’s and the G’s with C’s.  We’ll denote this operation by * (the technical name in biology is “reverse-complement”).  For example:

X*=AGT, Y*=CTA, Z*=TGTAATC.

Note that (X*)*=X.
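
For concreteness, here’s the * operation in a few lines of Python (a throwaway illustration of mine, nothing from the paper):

COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(s):
    # Reverse the string while swapping A<->T and G<->C.
    return "".join(COMP[b] for b in reversed(s))

print(revcomp("ACT"))               # AGT, matching X* above
print(revcomp(revcomp("GATTACA")))  # GATTACA, since (X*)* = X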

We can then combine our chunks and their inverses into a longer DNA string, like so:

ZYX*Y* = GATTACA TAG AGT CTA.

From now on, we’ll work exclusively with the chunks, and forget completely about the underlying A’s, C’s, G’s, and T’s.

Now, there are also certain special chunks of DNA bases, called recognition sites, which tell the little machines that read the DNA when they should start doing something and when they should stop.  Recognition sites come in pairs, so we’ll label them using various parenthesis symbols like ( ), [ ], { }.  To convert a parenthesis into its partner, you invert it: thus ( = )*, [ = ]*, { = }*, etc.  Crucially, the parentheses in a DNA string don’t need to “face the right ways” relative to each other, and they also don’t need to nest properly.  Thus, both of the following are valid DNA strings:

X ( Y [ Z [ U ) V

X { Y ] Z { U [ V

Let’s refer to X, Y, Z, etc.—the chunks that aren’t recognition sites—as letter-chunks.  Then it will be convenient to make the following simplifying assumptions:

  1. Our DNA string consists of an alternating sequence of recognition sites and letter-chunks, beginning and ending with letter-chunks.  (If this weren’t true, then we could just glom together adjacent recognition sites and adjacent letter-chunks, and/or add new dummy chunks, until it was true.)
  2. Every letter-chunk that appears in the DNA string appears exactly once (either inverted or not), while every recognition site that appears, appears exactly twice.  Thus, if there are n distinct recognition sites, there are 2n+1 distinct letter-chunks.
  3. Our DNA string can be decomposed into its constituent chunks uniquely—i.e., it’s always possible to tell which chunk we’re dealing with, and when one chunk stops and the next one starts.  In particular, the chunks and their reverse-complements are all distinct strings.

The little machines that read the DNA string are called recombinases.  There’s one kind of recombinase for each kind of recognition site: a (-recombinase, a [-recombinase, and so on.  When, let’s say, we let a (-recombinase loose on our DNA string, it searches for (‘s and )’s and ignores everything else.  Here’s what it does:

  • If there are no (‘s or )’s in the string, or only one of them, it does nothing.
  • If there are two (‘s facing the same way—like ( ( or ) )—it deletes everything in between them, including the (‘s themselves.
  • If there are two (‘s facing opposite ways—like ( ) or ) (—it deletes the (‘s, and inverts everything in between them.

Let’s see some examples.  When we apply [-recombinase to the string

A ( B [ C [ D ) E,

we get

A ( B D ) E.

When we apply (-recombinase to the same string, we get

A D* ] C* ] B* E.

When we apply both recombinases (in either order), we get

A D* B* E.

Another example: when we apply {-recombinase to

A { B ] C { D [ E,

we get

A D [ E.

When we apply [-recombinase to the same string, we get

A { B D* } C* E.

When we apply both recombinases—ah, but here the order matters!  If we apply { first and then [, we get

A D [ E,

since the [-recombinase now encounters only a single [, and has nothing to do.  On the other hand, if we apply [ first and then {, we get

A D B* C* E.

Notice that inverting a substring can change the relative orientation of two recognition sites—e.g., it can change { { into { } or vice versa.  It can thereby change what happens (inversion or deletion) when some future recombinase is applied.

One final rule: after we’re done applying recombinases, we remove the remaining recognition sites like so much scaffolding, leaving only the letter-chunks.  Thus, the final output

A D [ E

becomes simply A D E, and so on.  Notice also that, if we happen to delete one recognition site of a given type while leaving its partner, the remaining site will necessarily just bounce around inertly before getting deleted at the end—so we might as well “put it out of its misery,” and delete it right away.
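
If you’d rather let a computer do the pen-and-paper work, here’s a minimal Python sketch of the rules (my own toy encoding, not anything from the paper): letter-chunks are strings like "A" or "A*", recognition sites are single bracket characters, and there’s one pair of sites per flavor.

PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
PARTNER = {**PAIRS, **{v: k for k, v in PAIRS.items()}}

def invert(tokens):
    # The * operation on a chunk list: reverse the order, star/unstar each
    # letter-chunk, and swap each recognition site with its partner.
    return [PARTNER[t] if t in PARTNER
            else (t[:-1] if t.endswith("*") else t + "*")
            for t in reversed(tokens)]

def apply_recombinase(tokens, flavor):
    # Apply the recombinase for one flavor of site, e.g. flavor="(".
    sites = {flavor, PAIRS[flavor]}
    idx = [i for i, t in enumerate(tokens) if t in sites]
    if len(idx) < 2:
        return list(tokens)  # no sites, or only one: do nothing
    i, j = idx[0], idx[1]
    if tokens[i] == tokens[j]:  # facing the same way, like ( ( : delete
        return tokens[:i] + tokens[j + 1:]
    return tokens[:i] + invert(tokens[i + 1:j]) + tokens[j + 1:]  # invert

def finish(tokens):
    # Remove the leftover recognition sites ("scaffolding") at the end.
    return [t for t in tokens if t not in PARTNER]

s = "A ( B [ C [ D ) E".split()
print(finish(apply_recombinase(apply_recombinase(s, "["), "(")))
# ['A', 'D*', 'B*', 'E'], matching the worked example above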

My coauthors have actually implemented all of this in a wet lab, which is what most of the Science paper is about (my part is mostly in a technical appendix).  They think of what they’re doing as building a “biological state machine,” which could have applications (for example) to programming cells for medical purposes.

But without further ado, let me tell you the math question they gave me.  For reasons that they can explain better than I can, my coauthors were interested in the information storage capacity of their biological state machine.  That is, they wanted to know the answer to the following:

Suppose we have a fixed initial DNA string, with n pairs of recognition sites and 2n+1 letter-chunks; and we also have a recombinase for each type of recognition site.  Then by choosing which recombinases to apply, as well as which order to apply them in, how many different DNA strings can we generate as output?

It’s easy to construct an example where the answer is as large as 2^n.  Thus, if we consider a starting string like

A ( B ) C [ D ] E { F } G < H > I,

we can clearly make 2^4=16 different output strings by choosing which subset of recombinases to apply and which not.  For example, applying [, {, and < (in any order) yields

A B C D* E F* G H* I.

There are also cases where the number of distinct outputs is less than 2^n.  For example,

A ( B [ C [ D ( E

can produce only 3 outputs—A B C D E, A B D E, and A E—rather than 4.

What Tim and Nate wanted to know was: can the number of distinct outputs ever be greater than 2^n?

Intuitively, it seems like the answer “has to be” yes.  After all, we already saw that the order in which recombinases are applied can matter enormously.  And given n recombinases, the number of possible permutations of them is n!, not 2^n.  (Furthermore, if we remember that any subset of the recombinases can be applied in any order, the number of possibilities is even a bit greater—about e·n!.)

Despite this, when my coauthors played around with examples, they found that the number of distinct output strings never exceeded 2^n.  In other words, the number of output strings behaved as if the order didn’t matter, even though it does.  The problem they gave me was either to explain this pattern or to find a counterexample.
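
Incidentally, you can replicate their experiment by brute force, reusing the helpers from the toy simulator above (again, just an illustration of mine):

from itertools import permutations

def all_outputs(tokens, flavors):
    # Distinct final strings, over every subset of the recombinases
    # applied in every possible order.
    outputs = set()
    for r in range(len(flavors) + 1):
        for order in permutations(flavors, r):
            s = list(tokens)
            for f in order:
                s = apply_recombinase(s, f)
            outputs.add(tuple(finish(s)))
    return outputs

print(len(all_outputs("A ( B ) C [ D ] E { F } G < H > I".split(), "([{<")))  # 16 = 2^4
print(len(all_outputs("A ( B [ C [ D ( E".split(), "([")))                    # 3 < 2^2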

I found that the pattern holds:

Theorem: Given an initial DNA string with n pairs of recognition sites, we can generate at most 2^n distinct output strings by choosing which recombinases to apply and in which order.

Let a recombinase sequence be an ordered list of recombinases, each occurring at most once: for example, ([{ means to apply (-recombinase, then [-recombinase, then {-recombinase.

The proof of the theorem hinges on one main definition.  Given a recombinase sequence that acts on a given DNA string, let’s call the sequence irreducible if every recombinase in the sequence actually finds two recognition sites (and hence, inverts or deletes a nonempty substring) when it’s applied.  Let’s call the sequence reducible otherwise.  For example, given

A { B ] C { D [ E,

the sequence [{ is irreducible, but {[ is reducible, since the [-recombinase does nothing.

Clearly, for every reducible sequence, there’s a shorter sequence that produces the same output string: just omit the recombinases that don’t do anything!  (On the other hand, I leave it as an exercise to show that the converse is false.  That is, even if a sequence is irreducible, there might be a shorter sequence that produces the same output string.)

Key Lemma: Given an initial DNA string, and given a subset of k recombinases, every irreducible sequence composed of all k of those recombinases produces the same output string.

Assuming the Key Lemma, let’s see why the theorem follows.  Given an initial DNA string, suppose you want to specify one of its possible output strings.  I claim you can do this using only n bits of information.  For you just need to specify which subset of the n recombinases you want to apply, in some irreducible order.  Since every irreducible sequence of those recombinases leads to the same output, you don’t need to specify an order on the subset.  Furthermore, for each possible output string S, there must be some irreducible sequence that leads to S—given a reducible sequence for S, just keep deleting irrelevant recombinases until no more are left—and therefore some subset of recombinases you could pick that uniquely determines S.  OK, but if you can specify each S uniquely using n bits, then there are at most 2^n possible S’s.
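
Before seeing the proof, you can also sanity-check the Key Lemma by brute force, again with the earlier toy helpers: collect the outputs of every irreducible ordering of a subset, and observe that there’s never more than one.

from itertools import permutations

def irreducible_outputs(tokens, subset):
    # Outputs of every irreducible ordering of `subset`: an ordering is
    # irreducible iff each recombinase actually changes the string.
    results = set()
    for order in permutations(subset):
        s, irreducible = list(tokens), True
        for f in order:
            t = apply_recombinase(s, f)
            if t == s:  # this recombinase did nothing
                irreducible = False
                break
            s = t
        if irreducible:
            results.add(tuple(finish(s)))
    return results

print(irreducible_outputs("A { B ] C { D [ E".split(), "[{"))
# {('A', 'D', 'B*', 'C*', 'E')} -- only the ordering [{ is irreducible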

Proof of Key Lemma.  Given an initial DNA string, let’s assume for simplicity that we’re going to apply all n of the recombinases, in some irreducible order.  We claim that the final output string doesn’t depend at all on which irreducible order we pick.

If we can prove this claim, then the lemma follows, since given a proper subset of the recombinases, say of size k<n, we can simply glom together everything between one relevant recognition site and the next one, treating them as 2k+1 giant letter-chunks, and then repeat the argument.

Now to prove the claim.  Given two letter-chunks—say A and B—let’s call them soulmates if either A and B or A* and B* will necessarily end up next to each other, whenever all n recombinases are applied in some irreducible order, and whenever A or B appears at all in the output string.  Also, let’s call them anti-soulmates if either A and B* or A* and B will necessarily end up next to each other if either appears at all.

To illustrate, given the initial DNA sequence,

A [ B ( C ] D ( E,

you can check that A and C are anti-soulmates.  Why?  Because if we apply all the recombinases in an irreducible sequence, then at some point, the [-recombinase needs to get applied, and it needs to find both [ recognition sites.  And one of these recognition sites will still be next to A, and the other will still be next to C (for what could have pried them apart?  nothing).  And when that happens, no matter where C has traveled in the interim, C* must get brought next to A.  If the [-recombinase does an inversion, the transformation will look like

A [ … C ] → A C* …,

while if it does a deletion, the transformation will look like

A [ … [ C* → A C*.

Note that C’s [ recognition site will be to its left, if and only if C has been flipped to C*.  In this particular example, A never moves, but if it did, we could repeat the analysis for A and its [ recognition site.  The conclusion would be the same: no matter what inversions or deletions we do first, we’ll maintain the invariant that A and C* (or A* and C) will immediately jump next to each other, as soon as the [ recombinase is applied.  And once they’re next to each other, nothing will ever separate them.

Similarly, you can check that C and D are soulmates, connected by the ( recognition sites; D and B are anti-soulmates, connected by the [ sites; and B and E are soulmates, connected by the ( sites.

More generally, let’s consider an arbitrary DNA sequence, with n pairs of recognition sites.  Then we can define a graph, called the soulmate graph, where the 2n+1 letter-chunks are the vertices, and where X and Y are connected by (say) a blue edge if they’re soulmates, and by a red edge if they’re anti-soulmates.

When we construct this graph, we find that every vertex has exactly 2 neighbors, one for each recognition site that borders it—save the first and last vertices, which border only one recognition site each and so have only one neighbor each.  But these facts immediately determine the structure of the graph.  Namely, it must consist of a simple path, starting at the first letter-chunk and ending at the last one, together with possibly a disjoint union of cycles.

But we know that the first and last letter-chunks can never move anywhere.  For that reason, a path of soulmates and anti-soulmates, starting at the first letter-chunk and ending at the last one, uniquely determines the final output string, when the n recombinases are applied in any irreducible order.  We just follow it along, switching between inverted and non-inverted letter-chunks whenever we encounter a red edge.  The cycles contain the letter-chunks that necessarily get deleted along the way to that unique output string.  This completes the proof of the lemma, and hence the theorem.
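
For the computationally inclined, here’s a sketch that builds the soulmate graph and reads off the unique output along the path (the “utopian string,” in the terminology below), in the same toy encoding as before, and assuming, as in all the examples here, that no letter-chunk starts out inverted:

def soulmate_edges(tokens):
    # Edges (chunk, chunk, anti) of the soulmate graph; anti=True marks a
    # red (anti-soulmate) edge.  Assumes one pair of sites per flavor.
    edges = []
    for open_ in PAIRS:
        idx = [m for m, t in enumerate(tokens) if t in (open_, PAIRS[open_])]
        if len(idx) != 2:
            continue
        i, j = idx
        if tokens[i] == tokens[j]:  # aligned pair: an eventual deletion
            edges.append((tokens[i - 1], tokens[j + 1], False))  # outer, blue
            edges.append((tokens[i + 1], tokens[j - 1], False))  # excised loop
        else:                       # anti-aligned pair: an eventual inversion
            edges.append((tokens[i - 1], tokens[j - 1], True))
            edges.append((tokens[i + 1], tokens[j + 1], True))
    return edges

def utopian_string(tokens):
    # Walk the path from the first letter-chunk to the last, flipping
    # orientation whenever a red edge is crossed.
    adj = {}
    for a, b, anti in soulmate_edges(tokens):
        adj.setdefault(a, []).append((b, anti))
        adj.setdefault(b, []).append((a, anti))
    out, flipped, prev, cur = [tokens[0]], False, None, tokens[0]
    while cur != tokens[-1]:
        nxt, anti = next(e for e in adj[cur] if e[0] != prev)
        prev, cur, flipped = cur, nxt, flipped ^ anti
        out.append(cur + "*" if flipped else cur)
    return out

print(utopian_string("A [ B ( C ] D ( E".split()))  # ['A', 'C*', 'D*', 'B', 'E']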

 

There are other results in the paper, like a generalization to the case where there can be k pairs of recognition sites of each type, rather than only one.  In that case, we can prove that the number of distinct output strings is at most 2^{kn}, and that it can be as large as ~(2k/3e)^n.  We don’t know the truth between those two bounds.

Why is this interesting?  As I said, my coauthors had their own reasons to care, involving the number of bits one can store using a certain kind of DNA state machine.  I got interested for a different reason: because this is a case where biology threw up a bunch of rules that look like a random mess—the parentheses don’t even need to nest correctly?  inversion can also change the semantics of the recognition sites?  evolution never thought about what happens if you delete one recognition site while leaving the other one?—and yet, on analysis, all the rules work in perfect harmony to produce a certain outcome.  Change a single one of them, and the “at most 2^n distinct DNA sequences” theorem would be false.  Mind you, I’m still not sure what biological purpose it serves for the rules to work in harmony this way, but they do.

But the point goes further.  While working on this problem, I’d repeatedly encounter an aspect of the mathematical model that seemed weird and inexplicable—only to have Tim and Nate explain that the aspect made sense once you brought in additional facts from biology, facts not in the model they gave me.  As an example, we saw that in the soulmate graph, the deleted substrings appear as cycles.  But surely excised DNA fragments don’t literally form loops?  Why yes, apparently, they do.  As a second example, consider the DNA string

A ( B [ C ( D [ E.

When we construct the soulmate graph for this string, we get the path

A–D–C–B–E.

Yet there’s no actual recombinase sequence that leads to A D C B E as an output string!  Thus, we see that it’s possible to have a “phantom output,” which the soulmate graph suggests should be reachable but that isn’t actually reachable.  According to my coauthors, that’s because the “phantom outputs” are reachable, once you know that in real biology (as opposed to the mathematical model), excised DNA fragments can also reintegrate back into the long DNA string.
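
With the earlier sketches, the phantom takes two lines to exhibit:

phantom = "A ( B [ C ( D [ E".split()
print(utopian_string(phantom))                                   # ['A', 'D', 'C', 'B', 'E']
print(tuple("A D C B E".split()) in all_outputs(phantom, "(["))  # False: not reachable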

Many of my favorite open problems about this model concern algorithms and complexity.  For example: given as input an initial DNA string, does there exist an irreducible order in which all n recombinases can be applied?  Is the “utopian string”—the string suggested by the soulmate graph—actually reachable?  If it is reachable, then what’s the shortest sequence of recombinases that reaches it?  Are these problems solvable in polynomial time?  Are they NP-hard?  More broadly, if we consider all the subsets of recombinases that can be applied in an irreducible order, or all the irreducible orders themselves, what combinatorial conditions do they satisfy?  I don’t know—if you’d like to take a stab, feel free to share what you find in the comments!

What I do know is this: I’m fortunate that, before publishing your first biology paper, the editors at Science don’t call up your 7th-grade Life Science teacher to ask how you did in the owl pellet unit.


More in the comments:

  • Some notes on the generalization to k pairs of recognition sites of each type
  • My coauthor Nathaniel Roquet’s comments on the biology

Unrelated Announcement from My Friend Julia Wise (July 24): Do you like science and care about humanity’s positive trajectory? July 25 is the final day to apply for Effective Altruism Global 2016. From August 5-7 at UC Berkeley, a network of founders, academics, policy-makers, and more will gather to apply economic and scientific thinking to the world’s most important problems. Last year featured Elon Musk and the head of Google.org. This year will be headlined by Cass Sunstein, the co-author of Nudge. If you apply with this link, the organizers will give you a free copy of Doing Good Better by Will MacAskill. Scholarships are available for those who can’t afford the cost.  Read more here.  Apply here.

61 Responses to “My biology paper in Science (really)”

  1. lewikee Says:

    When you say:

    “For example, applying [, {, and < (in any order) yields

    A B C D* E F* G H* I."

    It should instead yield

    A B* C D* E F* G H* I.

  2. lewikee Says:

    Never mind, it looks like you didn’t apply the (-recombinase.  Although that’s odd, since you stated “and we also have a recombinase for each type of recognition site,” and you clearly have the ( recognition site in your original string.

  3. Scott Says:

    lewikee: I was trying to illustrate how you could target an arbitrary subset of the recognition sites, rather than necessarily all of them, and thereby get 2^n distinct output strings.

  4. lewikee Says:

    Oh I see, you have a recombinase available to use if you wish to use it.

    Sorry – misunderstood that sentence in the instructions. I thought that once a recognition site was in a string you were forced to use its recombinase, and that the “choice” mentioned in the instructions was about which recognition sites you wished to include in the string in the first place (which would equivalently make it a choice on which to apply). Sorry – I had a wrong understanding!

  5. eniteris Says:

    I’ve been interested in this field for quite some time now (synthetic biology, currently doing my Master’s), and I’ve never liked recombinase-dependent logic, as the processes are not (easily) reversible.  It allows for easy inheritance of memory-states, but I don’t think that’s worth forcing your finite state machine to be a directed acyclic graph.

  6. Scott Says:

    eniteris #5: Yes, the irreversibility is a striking feature here, which prevents us from analyzing the situation using group theory, and makes it all the more surprising how much structure still survives. Tim and Nate explained to me that figuring out how to reverse the deletions is a significant challenge on the synthetic biology side.

  7. lewikee Says:

    I was also struck by how much predictable structure remained, given that the rules being applied seemed so destructive.

    Also, I wonder if deletions are more likely to leave behind characteristic patterns in the strings that might allow someone to – if not reverse deletions – have an idea about when and where they occurred.

  8. Scott Says:

    lewikee #7: I guess a lone recognition site with no matching partner would be one extremely telltale sign of past deletions! Although if you adopt my convention, that lone recognition sites get pruned automatically, then obviously you’d no longer have that. In which case, I don’t see any way to tell from examining a string whether deletions (or for that matter, inversions) have happened or not, without additional facts from biology. Here I’m assuming that, when you examine the string, adjacent letter-chunks with no recognition sites between them automatically get glommed together, so that you can’t tell what’s happened by simply examining the number of letter-chunks versus the number of recognition sites.

  9. Scott Says:

    Incidentally, writing up this post caused me to notice two new structural facts about the model.

    First, there exist DNA strings for which the “utopian string” is reachable, but is not reachable by applying all n recombinases in an irreducible order. As an example:

    A ( B [ C { D [ E { F ( G.

    This has AG as its utopian string. But on the way to AG, if the [-recombinase acts then the { one can’t act, and vice versa.

    Second, something weaker is true: namely, whenever the utopian string is reachable, it’s reachable by applying all n recombinases in some order. To see this: clearly the utopian string must be reachable by applying some subset S of the recombinases in an irreducible order. Now suppose that after we do that, there are still matching pairs of recognition sites left in the DNA string. Then there are letter-chunks that are currently next to each other, but could be moved away from each other (or switched in their relative orientations) by further recombinase applications. Hence some letter-chunks were not paired off with their soulmates, so the string was not utopian, contrary to assumption. It follows that we can now apply all the remaining recombinases (the ones not in S), without any of them taking us away from the utopian string.

  10. Jon K. Says:

    Very interesting.  This reminds me of Gregory Chaitin’s latest endeavor, which is to make “a toy model of biology.”  I believe the models he is exploring relate to how a single mutating genome/computer program will evolve under different rules and a fitness measure.  He says that he isn’t too concerned with making his model biologically accurate, since he is most interested in getting some proofs out of his work.  He believes his model needs to be very simple in order to make theorem proving tractable, but here you are proving things based on more realistic biological models.  Of course you haven’t proven anything about evolution, which I think he sees as the ultimate goal for his toy biological models, but this work still seems relevant.  You should share this with him.

  11. Sniffnoy Says:

    OK, but if you can specify each S uniquely using n bits, then there are at most 2^n possible S’s.

    I guess that’s one way of saying “Since there are at most 2^n possible S, there are at most 2^n possible S.” 😛

  12. Sniffnoy Says:

    But wow, yeah, that’s pretty unexpected, that there are only at most 2^n! It would certainly seem like there ought to be more.

    For 1≤k≤4, where 2k/3e<1, I’m guessing 2^n is the largest you know how to get?

  13. Scott Says:

    Jon K. #10: I’m familiar with Chaitin’s work on computability and evolution. I read his book on it, and I also attended a talk by him on it a few years ago. Unfortunately, it didn’t go well. When I asked him technical questions, just trying to understand his result better—i.e., the same thing I do at any other technical talk—he somehow perceived it as a personal attack, and ranted about how I only dismissed his work because I’m a sheeplike academic conformist, or something to that effect.

    I’ve since heard from others that he’s known for this sort of thing. I still feel bad about it, since before this incident, I’d liked Chaitin and had had some great conversations with him.

    The real irony is that, far from “dismissing” his work, I’d actually found it the most interesting thing at that workshop—although I’d probably tone down the claims he made for its world-changing importance to biology by at least 4 orders of magnitude. 🙂

    Briefly, Chaitin studies a model of evolution for which the word “idealized” is an understatement—which is actually fine by me! He has a population of exactly one organism, “mutations” that can be arbitrary computable maps and that never make things worse—oh, and a fixed fitness function that requires an oracle for the halting problem in order to evaluate it. The question that interests him is whether “mutation”—or I’d simply call it randomized hill-climbing—can discover “genuinely new information,” which in his mind, can only mean learning additional bits of Chaitin’s constant Ω. 🙂

    I.e., if you told him that the evolution that takes a bacterium to a cheetah was governed solely by computable processes (even huge ones taking billions of years), then unless I badly misunderstood, he’d say that that evolution was ipso facto uninteresting: the only evolution that matters for him is the kind that has some uncomputable aspect.

    Setting aside any qualms we might have about that, his main result is that, yes, a suitably-defined randomized hill-climbing process can keep discovering additional bits of Ω, provided (again) that we have a HALT oracle with which to compute the fitness function.

    At a technical level, his result boils down to the following observation, which while simple was completely new to me. If you define Busy Beavers using a suitable programming language (say, a prefix-free one), and you know the nth Busy Beaver, then someone can let you compute the (n+1)st Busy Beaver by giving you only O(log n) bits of new information. So in particular, if you randomly guessed the additional information (and then verified it with your HALT oracle), you’d have a 1/poly(n) probability of getting it right, and thereby “improving your fitness”—that is, learning a new Busy Beaver number, and correspondingly one or more new bits of Ω.

    I’ll let you decide for yourself whether this is biology or vaguely-biologically-inspired computability theory. 😉

    Anyway, as far as I can see, my result “relates” to his only insofar as they both involve both math and biology! Besides that, they’ve got nothing to do with each other—for starters, mine says nothing about evolution, while his says nothing about DNA.

  14. amy Says:

    Hooray! Finally, a paper of yours I can read!

  15. Scott Says:

    amy #14: And also, perhaps, the first paper of mine that I can’t read most of. 😉

  16. Jay Says:

    >If you (…) know the nth Busy Beaver, then someone can let you compute the (n+1)st Busy Beaver by giving you only O(log n) bits of new information.

    We know the 2-symbol 2-state BB. We can guess O(log 3) bits of new information. But we don’t know the 3-symbol 3-state BB. What’s the catch?

  17. Scott Says:

    Jay #16: We don’t have a HALT oracle to tell us when we guessed right.

  18. Jay Says:

    O_ok thx

  19. Scott Says:

    Sniffnoy #12: No, you can also give constructions for particular small values of k.  For example, when k=2, I can get 5^{n/2} distinct outputs, which is better than 2^n, although it still doesn’t achieve the upper bound of 2^{2n}.

    Incidentally, I’d be careful about putting an exclamation point after 2^n!

    (self-referentiality intended 🙂 )

  20. Sniffnoy Says:

    Interesting, thanks!

  21. Jon K. Says:

    Comment #13:
    Exactly, they both involve math and biology. Haha. 🙂

    Thanks for the detailed description of Chaitin’s work. I read Meta Math and Goedel’s Way, but not his biology-themed book.

    I have seen panel talks with Chaitin and physicists duking it out on stage. From an audience perspective, I love it! We need people in there mixing it up, causing trouble, and challenging conventional wisdom. 🙂 I think you are good at asking difficult questions too, but usually from the more orthodox perspective, keeping speculative theories in check. I think both sides are needed for progress. Open-mindedness (even when conflicting with prior beliefs) and skepticism can both serve a scientist well.

    I do think looking at biology in a computational and informational way is a pretty big idea, although nobody has any robust biological models yet. I avoided biology classes as well growing up, but after seeing how mRNA and ribosomes work like a tape and tape head, I got more interested in the mechanics of biology. I think CRISPR experiments and machine learning should help with analyzing the complexity and building better models over time. If statistical models of drug efficacy, disease predisposition, etc. become more statistically accurate in the future, they start to look more like a deterministic computer program model than a statistical model that would need random seeds in order to be simulated. So in that sense, drawing an analogy between biology and computer programs might be useful.

  22. Alison Miller Says:

    Scott @ 13: So I think what you’re saying is that there is an algorithm which, given n and BB(n), computes a set of n integers one of which must equal BB(n+1)?

    Except that this can’t be right, since if I iterate the algorithm then I can get a computable upper bound for BB(n) for any n.

    So what am I missing? I guess maybe the algorithm could be invoking the oracle for HALT, but then I don’t see the point, since if I have the oracle I can just compute BB(n+1) anyway.

    (Also, re: the original post — very nice paper!)

  23. Alison Miller Says:

    (Correction: the set should have size polynomial in n, not necessarily equal to n.  But the argument stands.)

  24. Scott Says:

    Hi Alison, good to have you back! You’re exactly right—the algorithm is invoking the HALT oracle; otherwise this would be false. What makes it somewhat nontrivial, even with the HALT oracle, is that you’re trying to learn (say) n+1 bits—namely, the (n+1)-bit program that runs for the longest finite time—but you only get O(log n) bits depending on the HALT oracle, plus n bits that you might have thought were totally irrelevant to your problem, namely the n-bit program that runs the longest. Nevertheless this suffices, again as long as your programs are written in a prefix-free language (one where one valid program is never a proper prefix of another valid program).

  25. Alison Miller Says:

    OK, I think I get it now. Thanks — that’s pretty interesting.

    So an example of how this could be the case would be e.g. if one knew that the longest-running terminating program of length (n+1) belonged to a specific subset of size O(log(n))? Then you could query your oracle O(log(n+1)) times to determine which programs halted, and then run only those to find BB(n+1).

    The issue I was being confused about before was: why can’t I do the following?  I make myself O(poly(n)) “fake oracles”, each of which gives a different set of answers to my queries, such that at least one of them is correct.  Then I run the algorithm once using each fake oracle, and I get a set of possible values for BB(n+1), at least one of which is correct.  However, this is impossible, as explained in my previous post.

    But now that I have the example above, I see what the issue is: if I invoke my algorithm with the wrong fake oracle, it won’t halt — and I have no way of knowing for which fake oracles this will be the case without having a legit oracle for HALT.

  26. Julia Belk Says:

    Hi Scott,

    I want to thank you for maintaining this blog and making your research accessible to the general public. Hearing about the interdisciplinary and collaborative approach the team took is awesome, and I wouldn’t have independently read the paper without seeing this post. I’m an undergraduate in EECS with a latent interest in synthetic biology—this has motivated me to take a closer look and check out Professor Weiss’s class (6.580) in the fall. Thank you!

  27. Sniffnoy Says:

    Scott #19: Oh, hm — I was hoping to find the other lower bounds in the paper, but I finally got around to looking through it and it looks like they aren’t there.  So, uh, what are they?  How does one fill in the intermediate values of k?

  28. James Cross Says:

    Any thoughts on “Why genetic information processing could have a quantum basis”.

    https://arxiv.org/pdf/quant-ph/0105001v2.pdf

  29. Scott Says:

    Sniffnoy #27: A fair question! Since I’m about to board a plane and don’t have time to write a new essay for you, I’m simply going to cut and paste from the emails that Nate Roquet and I exchanged about this, with some light editing. Let me know if you have any questions.

    For k=2, the key point is just that there’s a way to use 2 pairs of recognition sites to get 5 distinct DNA strings (rather than the 4 that would be the maximum possible in the k=1 case).  So then, if we have n recombinases available (let n be even), we just break them up into n/2 pairs, and repeat the construction over and over along the string to get 5^{n/2} output strings.

    Here’s the way to get 5, which Nate and Tim discovered:

    (‘ S1 ( S2 [ S3 ) S4 [ S5 )’

    Here, the primed () brackets are “orthogonal” to the unprimed () brackets but belong to the same flavor.

    if we treat with the ()-recombinase we get:
    S5* S4* S2 S3 S1*

    if we treat with the []-recombinase we get:
    S1 S2 S5

    if we treat with the ()-recombinase followed by the []-recombinase we get:
    S5* S2* S4 S3 S1*

    if we treat with the []-recombinase followed by the ()-recombinase we get:
    S5* S2* S1*

    OK, now let’s generalize and improve the bound!

    CLAIM: Suppose we have n flavors of recombinase, and k subflavors per flavor (where k<=3n/2).  Then we can generate a number of distinct strings that grows asymptotically at least like (2k/3e)^n.

    PROOF: The string

    (' A ( B [ C ) D [ E )'

    records not only whether the (-recombinase was applied and whether the [-recombinase was applied, but also the order in which they were applied (assuming they both were).  Thus, given h recombinase flavors, suppose we create a substring like the above for each of the choose(h,2) pairs of flavors, using different subflavors for each substring.  For example, with the h=3 flavors (, [, {, we would have

    (' A ( B [ C ) D [ E )'
    ['' F [' G { H ]' I { J ]''
    {'' F {' G ('' H }' I ('' J }''

    In general, doing this will require about k=3h/2 subflavors per flavor (since each flavor, say (, is paired with h-1 other flavors, and for half of them we need two subflavors of (, while for the other half we only need one).  And this construction lets us generate about e*(h!) distinct strings -- one for every ordered list of the h recombinases in which each one appears at most once.  So, if we divide the n flavors into n/h blocks of h flavors each, and do the above construction for each block, then the total number of strings we can generate is about

    (e*(h!))^(n/h) ~ (h/e)^n ~ (2k/3e)^n.  QED

    The above bound isn’t tight for small k, and working things out carefully in the special case k=6, we find that we can get at least 326^floor(n/5) ~ 3.18^n distinct strings.

    Some further clarifications here:

    To get from (e*(h!))^(n/h) to (h/e)^n, I simply used Stirling’s formula: h! is about (h/e)^h.

    The 326^floor(n/5) ~ 3.18^n does not come from the formula (2k/3e)^n, but from the construction that yielded the formula.  When k=6, you can check that it’s possible to implement the sorting construction with n=5 flavors.  (In fact, k=6 subflavors is just *barely* enough for this -- nothing is left over!!)  Furthermore, the n=5 sorting construction lets you differentiate among 5! * (1 + 1 + 1/2! + 1/3! + 1/4! + 1/5!) = 326 possibilities.  So then, if we have n>>5 flavors, we just group them into n/5 blocks of 5 flavors each, and apply the construction to each block.

    So, for the problem of how many distinct strings can be generated, using n flavors and k subflavors per flavor, I now have an upper bound of 2^{kn} and a lower bound of k^{cn} (for some constant c). Between these two, I conjecture that the lower bound is closer to the truth — i.e., that the k should be in the base rather than in the exponent. Certainly we know that the lower bound becomes tight (and the upper bound non-tight) when k~n, since no matter how large k is, we always have the obvious upper bound e*(n!) on the number of possible recombinase sequences.

    Here are a few other thoughts.

    In my proof of the 2^n upper bound, for the special case k=1 (i.e., one pair of parentheses of each flavor), you might wonder where I used the assumption that k=1.  (Clearly I used it *somewhere*, since if k>1 then the bound is false!)  In fact, there’s only one place in the entire proof where I used that k=1.  That’s in the statement that, if a recombinase sequence S of length n is irreducible, then when we apply S, no parenthesis of a given flavor (say, the ( flavor) can ever get deleted as a byproduct of applying some *other* flavor of recombinase (say, the [ flavor).  This is false for k>1, because applying the [-recombinase might delete the (-subflavor but not the (‘-subflavor — so that there would still be a reason to apply the (-recombinase later.  Indeed, that’s exactly what happens in the first construction: when we apply the [-recombinase to

    (‘ A ( B [ C ) D [ E )’

    we get

    (‘ A B E )’

    which has deleted the ( subflavor, but left the (‘ subflavor, so that applying the (-recombinase still does something to the string, changing it to

    E* B* A*.

    And because “no parentheses deleted as byproducts” fails, it’s no longer true that every symbol in the string has unique left and right “soulmates”: which symbol it ends up next to could depend on whether certain parentheses got “accidentally deleted along the way,” which in turn could depend on the order in which the recombinases were applied. So, in the above example, if the ( and ) get “accidentally deleted” as a byproduct of first applying the [-recombinase, then B* ends up next to A*, since there’s no longer anything to pry them apart. But if we apply the (-recombinase first and the [-recombinase second, then A and C* get to “consummate their soulmate relationship” :-), and we end up with

    E* B* D C A*.

    How much do we care about more precise bounds on the number of possible output strings? For example, when k=2, what I know right now is that the number is at least sqrt(5)^n (via the sorting construction) and at most 4^n (via my 2^{kn} bound). Should we try to determine the exact base between sqrt(5) and 4?

    To answer the above question, it would help to know something about the following closely-related question: let’s go back to the case k=1. And suppose I have n recombinases, and suppose I have to use ALL of them — but in any order I want. (Call a sequence that uses all n of the recombinases a “complete” sequence.) Then how many distinct DNA strings can I generate?

    Via sorting constructions, I can show that the answer is at least 2^{n/2}, and in fact at least 3^{n/3} (which is slightly larger). But the only upper bound I know is 2^n, from my general 2^n bound.

    Trying out examples, I can’t get anywhere close to 2^n: the best I can find so far is 2 different outcomes from complete sequences when n=2, or 3 different outcomes from complete sequences when n=3. So I suspect that the 2^n upper bound can be improved. However, *if* it were possible to get ~2^n different outcomes from complete sequences, then I can show that would have an implication for the other problem: it would mean that, when k=2, there would be a way to generate at least 3^n distinct DNA strings.

  30. Nathaniel Says:

    Hi all. I worked with Scott on this paper. Sorry I’m a little late to the party, but I can help with responding to comments that are more on the biological side. I’ll start now with comment #5 and I hope to answer more later in the day.

    Re comment #5 (Eniteris): There are many ways to implement logic in biological systems. They each have advantages and disadvantages depending on context and application. Recombinases are good for applications that require state-dependent logic because the recombinases encode states (memories) in DNA, which makes the states very stable. The state is easily inherited between generations of cells (as Eniteris mentions), and it is maintained with little energetic cost to the cell (in fact, the DNA state is maintained even when its surrounding cell chassis dies).

    In our research paper, we used a subfamily of recombinases called large serine recombinases (LSRs), which perform irreversible operations on DNA (inversion or excision, as Scott discusses in his post). The irreversible nature of these recombinases gives us the added benefit of enhanced programmability. That is, when an LSR is expressed in response to an input, it will drive all DNA to the downstream state. Imagine instead if we had used recombinases that perform reversible operations (in an unbiased fashion, not favoring one state over another). Then equilibrium tells us that no more than ~50% of the DNA will end up in the downstream state. So the use of LSRs was crucial for the proper function of our state machines, albeit limiting in the sense that they constrained our state machines to directed acyclic graphs.

    That being said, bioengineers are exploring strategies for programmable, recombinase-based circuits that can reverse state. Two examples that come to mind are from the Endy Lab at Stanford (Bonnet et al. 2012: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3384180/) and the Voigt Lab at MIT (Fernandez-Rodriguez et al. 2015: http://pubs.acs.org/doi/10.1021/acssynbio.5b00170). In the former example, researchers build a DNA inversion switch using an LSR to drive the switch to one state and then the same LSR together with a recombination directionality factor (RDF) to drive the switch back to its original state. In the latter example, researchers use antagonistic recombinases (FimE and HbiF) to drive a DNA inversion switch between its two states.

  31. Ashley Lopez Says:

    Scott,

    Not always would a layperson get an opportunity to ask a mathematician who has a new proof about the thought process behind the discovery – if it is not too much trouble and if you don’t mind, could you please share how you came up with the proof? (Especially since this math is ‘easily accessible’.)

  32. Scott Says:

    Ashley #31: Interesting request!

    Fortunately for me, this was a pretty easy problem, one that required just a day or so of trial-and-error (less, I’m sure, had they asked someone in the theory group other than me 🙂 ).

    As far as I remember, what I did was basically just, first, to fill sheet after sheet of paper applying the rules by hand to a bunch of examples (with 2-4 pairs of recognition sites each), with every possible order on every possible subset, to see what happened.

    This convinced me that the “2^n conjecture” was almost certainly true, but left me confused about why. At some point, though, I must’ve noticed that certain pairs of letter-chunks had a propensity to stick together in the final output, which would’ve led me to the concept of “soulmates,” which must be brought together under certain conditions no matter how far apart they are on the string, e.g. because they both border a [ recognition site, and the recognition sites will just travel along with them when recombinases other than [ are applied. But under what conditions will these letter-chunks be brought together? Does it suffice for the [ recombinase to be applied? No, clearly not, since maybe it won’t even find the [ recognition sites to act on. This, in turn, would’ve led me to the concept of an irreducible sequence, and thence to the theorem.

  33. Ashley Lopez Says:

    Thanks Scott!

    When I read the proof – I was trying to retrace your possible steps – I stupidly assumed that you must have imagined up the concept of an irreducible sequence and the main lemma (only because they were presented first in the text!) BEFORE you noticed the ‘soulmates’ thing (in an attempt to prove the lemma AFTERWARDS). But I could not visualize how one could imagine those up at the beginning of the proof attempt, subject to P != NP 🙂 . Now it makes sense.

  34. Nathaniel Says:

    Re comment #31 (Scott): I’ve wondered this myself. The way you describe it sounds like you took a very bottom-up approach: first looking for any interesting properties of the system and then looking at those interesting properties and how they could come together to prove the theorem at hand.
    .
    The key lemma states that applying all irreducible sequences of the same subset of recombinases will lead to the same DNA string. Did you know at the outset of your work that the proof of the 2^N upper bound conjecture would likely involve a lemma of this nature (some sort of statement about the equivalence of recombinase sequences made from the same subset of recombinases)? I don’t see why the proof would have necessarily involved something like this, but it seems like it would have been a good hunch.

  35. Nathaniel Says:

    Re comment #7 (Lewikee): We introduce the initial DNA string into the system (so we know the starting configuration of letter-chunks). Later on, when we read the DNA (with DNA sequencing), if we notice that a letter chunk is missing, then we know that a deletion operation must have occurred. If we are clever about the way we design the starting DNA string then we will know exactly which deletion operation occurred and when it occurred relative to the operations of other recombinases. One of the main aims of our paper was to find starting DNA strings that would record this sort of information, which we theoretically did for up to 7 recombinases (experimentally validating for up to 3). This required designing DNA strings with multiple pairs of recognition sites per recombinase – otherwise, as Scott explains in his post, this feat would be impossible.
    .
    In my comment #30, I explain that strategies have been (and continue to be) developed to reverse the activity of directional recombinases. To explain why reversal is possible, I need to delve into some biological detail. When two recognition sites are acted on by a recombinase, they recombine. The recombination “pastes” the back end of one recognition site in a pair with the front end of the other, and vice-versa. See Figure S2 in the supplement of our paper for a diagram of how recombination occurs in our system. Depending on the type of recombinase, recombination can sometimes lead to recombined sites that no longer look like the original recognition sites, as is the case with the large serine recombinases (LSRs) used in our research. When this happens, the original recombinase can no longer (without extra co-factors) operate on the recombined sites, because they no longer look like the original recognition sites; however, the recombined sites can still act as recognition sites for a separate factor (an LSR+RDF complex, or perhaps a separate recombinase entirely) that can operate on them in the same way (pasting the back end of one site to the front end of the other), and therefore reverse the operation of the original recombinase.
    .
    Let us refer to recognition sites in their starting state as BP sites, and let us refer to their recombined configurations (made up of conjoined halves of each BP site) as LR sites. An inversion operation will take two anti-aligned BP sites on a DNA string and turn them into two anti-aligned LR sites. Reversing this reaction is then simply (in theory, at least) a matter of applying a recombination factor that acts on the LR sites.
    .
    However, reversing a deletion operation on BP sites is a separate beast. Deletion leaves one LR site on the original DNA string and the other LR site on the deleted fragment. Biologically, unless the deleted DNA fragment has components (e.g. an origin of replication) that allow it to be maintained and replicated by the cell, it will not be inherited when cells divide and will essentially become lost after several generations. Therefore, even with a recombination factor that can act on LR sites, reversing deletion operations would still be challenging (and likely time-sensitive) due to a loss of LR substrate.

  36. Scott Says:

    Nathaniel #34:

      Did you know at the outset of your work that the proof of the 2^N upper bound conjecture would likely involve a lemma of this nature (some sort of statement about the equivalence of recombinase sequences made from the same subset of recombinases)?

    No, I didn’t know anything of the kind. At first, I was just on an open-ended hunt for any mathematical structure I could find, any clue for why you couldn’t get more than 2^N sequences. For a problem of this kind, I can’t overstate the importance of simply trying a huge number of examples, not just one or two of them. (Well, maybe other people can proceed more abstractly, but I can’t!) Nor would it have helped much, in this case, to write a program to work out the examples for me, since the whole point was to think carefully through each one and try to deduce what would happen more generally.

  37. Craig Says:

    This seems to be proof of intelligent design. Random evolution isn’t smart enough to prove theorems like this.

  38. Scott Says:

    Craig #37: DNA didn’t prove the theorem; I proved the theorem. And take it from me—as theorems go, this wasn’t one that you needed to be all that smart to prove.

  39. Craig Says:

    I meant that you said, “because this is a case where biology threw up a bunch of rules that look like a random mess—the parentheses don’t even need to nest correctly? inversion can also change the semantics of the recognition sites? evolution never thought about what happens if you delete one recognition site while leaving the other one?—and yet, on analysis, all the rules work in perfect harmony to produce a certain outcome. Change a single one of them, and the “at most 2^n distinct DNA sequences” theorem would be false. ”

    Seems like intelligent design that it works in perfect harmony, not random evolution.

  40. Ashley Lopez Says:

    Craig #39: A bunch of random mess producing something that serves some biological purpose would be IN FAVOR of evolution, right? (Other random mess, which did not bring forth enough ‘fitness’, would not have survived.)

    Trying to think, on the contrary, about whether an intelligent designer would have thought all this up has got me wondering: would this PROBLEM have ever come from pure math?

  41. Scott Says:

    Craig #39: You’re talking to someone who has personal experience of “blind” optimization algorithms (even ones run on the desktop computers of 1996 or so rather than the earth’s biosphere) producing solutions with an alien coherence and even beauty.

  42. Craig Says:

    I would love to see examples of blind optimization algorithms producing “solutions with an alien coherence and even beauty.” Scott, can you please point us to such examples?

  43. Nathaniel Says:

    Re comment #39 (Craig):

    I want to make it clear that the system that we describe in our paper, though biological in the sense that it uses parts derived from living matter, is artificial in the sense that we (the authors) composed the parts together to make the system behave the way it does. Yes, recombinases evolved in nature to perform recombination between two specific sites. Specifically, the recombinases used in our research were derived from bacteriophage (viruses that only infect bacteria), and the bacteriophage use them in nature to integrate their viral DNA into the host genome. But biological engineers are the ones who repurposed recombinases to perform logic in living cells. So if our system seems designed, it’s because it is: by scientists. We did not design it with the intention of having at most 2^N states when you use only one recognition site pair per recombinase; this just happened to be an interesting mathematical result.

  44. Scott Says:

    Craig #42: The examples I found in the 1990s (the last time I messed around with genetic algorithms) involved things like hypertext link structures and de Bruijn-like sequences. I don’t have the details handy anymore, but fortunately, the sort of thing I’m talking about is not at all uncommon. One class of examples that’s received a fair bit of attention is evolved antennas, totally unlike anything a human would come up with (just look at them!), which nevertheless outperform human-designed antennas.

  45. Chris Says:

    Craig #42, Scott #44: John Koza and his colleagues created a number of such examples, including re-creation of patented electronic circuits:

    http://www.eecs.harvard.edu/~rad/courses/cs266/papers/koza-sciam03.pdf

  46. Ashley Lopez Says:

    Nathaniel # 43,

    When you say the system is designed, which of the ‘rules’ in the problem statement were DESIGNED and which of them were from nature? For example, the rule that the parentheses need not face each other or nest correctly comes from the biology and not from the design of your experiment, right? (As you can tell from my question, my knowledge of biology is pretty minimal.)

  47. DLI Says:

    Scott #44: Neat! Where can I buy one? Or better yet, a TV antenna evolved for your living room, as a service.

    That being said, I’ve been in software cleaning up other people’s messes (and making a few of my own) for long enough that “totally unlike anything a human would come up with” feels like a rather naive statement. I’ve *seen* things, man.

    I even tried to come up with an objective way to talk about incomprehensible code. After about four years, I’m still encountering code that is incomprehensible in its own unique way that’s totally orthogonal to my previous categorization attempts.

  48. Sniffnoy Says:

    So, let me see if I have this straight — the sequence of best-known lower bounds (or approximate lower bounds) on the base of the exponent goes as follows?

    2, √5, √5, √5, √5, 326^(1/5), 326^(1/5), 326^(1/5), 326^(1/5), 326^(1/5), 326^(1/5), 326^(1/5), 26/3e, 28/3e, 10/e, 32/3e, …

  49. Nathaniel Says:

    Ashley #46
    .
    I’ll start with the statement that the parentheses need not face each other or nest correctly. If this statement means what I think it means, then a different way of saying it would be that there is no physical restriction as to how we can arrange recognition sites on a DNA string. In other words, if you give me some sort of arbitrary arrangement of recognition sites, then I should be able to physically build a DNA molecule with that arrangement. While I can’t think of any DNA in nature (not built by scientists) that contains nested or overlapping recognition sites from different recombinases, I also don’t see any reason why such a DNA molecule should not exist in nature. TLDR: It is a “rule” intended for design, but also applicable to nature (though probably not relevant).
    .
    The rules about how a recombinase operates on a DNA string are from nature, but again, not necessarily relevant to nature. For example, we didn’t engineer the [-recombinase to delete everything between two [‘s facing the same way, nor did we engineer it to invert everything between two [‘s facing opposite ways. These were things the [-recombinase already “knew” how to do when we took it out of its natural context. That being said, the [-recombinase may not have necessarily evolved for the purpose of doing either of these things (deletion or inversion). For example, none of the recombinases that we used in our research actually perform inversion operations in nature. That is because none of these recombinases see oppositely facing recognition sites in nature. But they do in our artificial system. So I guess you could say that we re-purposed the recombinases: we didn’t change the way they work, but we did change the context in which they work.
    .
    I hope this explanation helps answer your question, please let me know if it doesn’t.

  50. Filip Says:

    Exciting paper and a very fun problem. Thank you for posting it!

    It sounds to me like the system exhibits a confluence property in rewriting parlance (https://en.wikipedia.org/wiki/Confluence_(abstract_rewriting)).

    I haven’t had the chance to read the full article, so I’m not sure what framework you used, but have you considered results from rewriting systems? Would it offer any further insight, or would it simply be an equivalent reformulation?

    I’m curious to look at the problem that way myself but I thought it valuable to ask in the meantime.

  51. Ashley Lopez Says:

    Nathaniel # 49

    Thank you very much for the reply. Yes it helps.

    Thank you also for providing me with some motivation to read some biology! Especially this has got me curious: one would never see any of the recombinases you used actually perform inversion operations in nature as they never get the opportunity, but then you (or your community) somehow figured out that they HAVE that capability.

  52. Nathaniel Says:

    Ashley #51
    .
    Can’t take credit for this one! It was people way before me that figured out the inversion capability of the recombinases used in our study.
    .
    I should also mention that there are different types of recombinases. While the recombinases we used in our study (BxbI, A118, and TP901) are primarily used by viruses in nature to pop their DNA in and out of host cell genomes (integration/deletion operations, but not inversion), there are other recombinases out there that actually do perform inversion in nature (e.g., https://en.wikipedia.org/wiki/Hin_recombinase).

  53. Muhammad Ahmad Tirmazi Says:

    Dear Scott, the second (August 3) link you added also leads to a paywall I think. Or maybe I am unable to view it from my country, perhaps. Is there any other place where one can read your paper?

  54. NPP Says:

    Continuing in Ashley’s direction of questions for a more “real” model.

    A recognition site can be any DNA sub-string as long as it or its reverse-complement appears –without overlap– more than once in the DNA string. Right?

    Isn’t DNA folded? Do recombinases attach to one recognition site and then unfold the DNA to find the immediate next site, or do they attach to any two “exposed” recognition sites, and thus a recombinase may operate on two non-adjacent recognition sites?

  55. Spencer Says:

    Scott #29: Can you clarify what you mean by “orthogonal … but the same flavor” relating to the primes in (‘ S1 ( S2 [ S3 ) S4 [ S5 )’? In your initial problem statement you define n to be the number of pairs of recognition sites, which would seem to be 3 in your example. If n is instead the number of recombinases, then can you specify how the sites are paired up for each recombinase? From biology, I would expect all pairs of sites of the same flavor to be paired in some outcomes. In the above example, applying the (-recombinase only seems like it should give a mixture of

    S5* S4* S2 S3 S1* – pair (‘ with )’
    S2 S3 – pair (‘ with (
    S3* S2* – pair (‘ with )

    This still doesn’t get us over 2^n combinations though.

    (‘ S1 ( S2 [ S3 ) S4 [ S5 )’ – apply nothing
    S5* ] S4* S2 [ S3 S1* – pair (‘ with )’
    S2 [ S3 – pair (‘ with (
    S3* ] S2* – pair (‘ with )
    S5* S2 S4* S3 S1* – pair (‘ with )’, then [
    (‘ S1 ( S2 S5 )’ – apply [
    S5* S2* ) S1* – apply [, pair (‘ with )’
    (‘ S1 S5* S2* – apply [, pair ( with )’

    Or did I miss some?

  56. Scott Says:

    Spencer #55: No, n still means the number of recombinases (so, n=2 in the example in question). For each recombinase, you then get k pairs of recognition sites. Acting with the recombinase then pairs each recognition site with its partner, where we’ll assume the order among the k pairs doesn’t matter (if it does matter, then what happens is undefined).

    So, in the example in question:

    (‘ A ( B [ C ) D [ E )’

    stays itself if we do nothing, or becomes

    (‘ A ( B E )’

    if we apply [-recombinase,

    E* B* A*

    if we apply [ and then (,

    E* ] D* B [ C A*

    if we apply (, and

    E* B* D C A*

    if we apply ( and then [.  This is 5 > 2^2 possibilities.

  57. Spencer Says:

    Scott #56: I looked up the biology in the supplemental (Fig. S2) and I now understand the biology that makes the recognition site pairing possible. I was previously under the impression that a recombinase cut at one and only one particular sequence.

  58. s Says:

    Very nice work.

    Is it known what happens when there is more than one (non-orthogonal) recombinase site in a sequence? e.g.:

    A(B(C(D

    Thanks!

  59. s Says:

    Missing a word in the previous post: more than one (non-orthogonal) recombinase site *pair* in a sequence. And an additional example (with 2 full pairs):

    A(B(C(D(E

    Thanks!

  60. Scott Says:

    s #58: Sorry for the delay! That’s an interesting question. No, the model that I was given doesn’t specify what happens in that case—you’re promised that each recognition site occurs exactly twice. However, one could also ask what would happen in actual biology, if you engineered by hand a DNA sequence with 3 identical recognition sites. My guess is that, depending on the order in which the recognition sites are reached, any 2 of the 3 recognition sites might get acted on by the recombinase. But I’m not sure—you’d need to ask a biologist.

  61. Brian Says:

    Link #2 is paywalled too 🙁