Thursday, October 20, 2011

Steve Jobs is the Tyler Durden of design

All the ways you wish you could be, that's me. I talk like you want to talk, I control like you want to control, I am smart, capable, and most importantly, I am free in all the ways that you are not.

Wednesday, October 12, 2011

Stevey's Google Platforms Rant

Archiving this here in case it gets deleted. Accidentally made public from Google+, oh the irony:
Stevey's Google Platforms Rant

I was at Amazon for about six and a half years, and now I've been at Google for that long. One thing that struck me immediately about the two companies -- an impression that has been reinforced almost daily -- is that Amazon does everything wrong, and Google does everything right. Sure, it's a sweeping generalization, but a surprisingly accurate one. It's pretty crazy. There are probably a hundred or even two hundred different ways you can compare the two companies, and Google is superior in all but three of them, if I recall correctly. I actually did a spreadsheet at one point but Legal wouldn't let me show it to anyone, even though recruiting loved it.

I mean, just to give you a very brief taste: Amazon's recruiting process is fundamentally flawed by having teams hire for themselves, so their hiring bar is incredibly inconsistent across teams, despite various efforts they've made to level it out. And their operations are a mess; they don't really have SREs and they make engineers pretty much do everything, which leaves almost no time for coding - though again this varies by group, so it's luck of the draw. They don't give a single shit about charity or helping the needy or community contributions or anything like that. Never comes up there, except maybe to laugh about it. Their facilities are dirt-smeared cube farms without a dime spent on decor or common meeting areas. Their pay and benefits suck, although much less so lately due to local competition from Google and Facebook. But they don't have any of our perks or extras -- they just try to match the offer-letter numbers, and that's the end of it. Their code base is a disaster, with no engineering standards whatsoever except what individual teams choose to put in place.

To be fair, they do have a nice versioned-library system that we really ought to emulate, and a nice publish-subscribe system that we also have no equivalent for. But for the most part they just have a bunch of crappy tools that read and write state machine information into relational databases. We wouldn't take most of it even if it were free.

I think the pubsub system and their library-shelf system were two out of the grand total of three things Amazon does better than google.

I guess you could make an argument that their bias for launching early and iterating like mad is also something they do well, but you can argue it either way. They prioritize launching early over everything else, including retention and engineering discipline and a bunch of other stuff that turns out to matter in the long run. So even though it's given them some competitive advantages in the marketplace, it's created enough other problems to make it something less than a slam-dunk.

But there's one thing they do really really well that pretty much makes up for ALL of their political, philosophical and technical screw-ups.

Jeff Bezos is an infamous micro-manager. He micro-manages every single pixel of Amazon's retail site. He hired Larry Tesler, Apple's Chief Scientist and probably the very most famous and respected human-computer interaction expert in the entire world, and then ignored every goddamn thing Larry said for three years until Larry finally -- wisely -- left the company. Larry would do these big usability studies and demonstrate beyond any shred of doubt that nobody can understand that frigging website, but Bezos just couldn't let go of those pixels, all those millions of semantics-packed pixels on the landing page. They were like millions of his own precious children. So they're all still there, and Larry is not.

Micro-managing isn't that third thing that Amazon does better than us, by the way. I mean, yeah, they micro-manage really well, but I wouldn't list it as a strength or anything. I'm just trying to set the context here, to help you understand what happened. We're talking about a guy who in all seriousness has said on many public occasions that people should be paying him to work at Amazon. He hands out little yellow stickies with his name on them, reminding people "who runs the company" when they disagree with him. The guy is a regular... well, Steve Jobs, I guess. Except without the fashion or design sense. Bezos is super smart; don't get me wrong. He just makes ordinary control freaks look like stoned hippies.

So one day Jeff Bezos issued a mandate. He's doing that all the time, of course, and people scramble like ants being pounded with a rubber mallet whenever it happens. But on one occasion -- back around 2002 I think, plus or minus a year -- he issued a mandate that was so out there, so huge and eye-bulgingly ponderous, that it made all of his other mandates look like unsolicited peer bonuses.

His Big Mandate went something along these lines:

1) All teams will henceforth expose their data and functionality through service interfaces.

2) Teams must communicate with each other through these interfaces.

3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.

4) It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.

5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

6) Anyone who doesn't do this will be fired.

7) Thank you; have a nice day!

Ha, ha! You 150-odd ex-Amazon folks here will of course realize immediately that #7 was a little joke I threw in, because Bezos most definitely does not give a shit about your day.

#6, however, was quite real, so people went to work. Bezos assigned a couple of Chief Bulldogs to oversee the effort and ensure forward progress, headed up by Uber-Chief Bear Bulldog Rick Dalzell. Rick is an ex-Armgy Ranger, West Point Academy graduate, ex-boxer, ex-Chief Torturer slash CIO at Wal*Mart, and is a big genial scary man who used the word "hardened interface" a lot. Rick was a walking, talking hardened interface himself, so needless to say, everyone made LOTS of forward progress and made sure Rick knew about it.

Over the next couple of years, Amazon transformed internally into a service-oriented architecture. They learned a tremendous amount while effecting this transformation. There was lots of existing documentation and lore about SOAs, but at Amazon's vast scale it was about as useful as telling Indiana Jones to look both ways before crossing the street. Amazon's dev staff made a lot of discoveries along the way. A teeny tiny sampling of these discoveries included:

- pager escalation gets way harder, because a ticket might bounce through 20 service calls before the real owner is identified. If each bounce goes through a team with a 15-minute response time, it can be hours before the right team finally finds out, unless you build a lot of scaffolding and metrics and reporting.

- every single one of your peer teams suddenly becomes a potential DOS attacker. Nobody can make any real forward progress until very serious quotas and throttling are put in place in every single service.

- monitoring and QA are the same thing. You'd never think so until you try doing a big SOA. But when your service says "oh yes, I'm fine", it may well be the case that the only thing still functioning in the server is the little component that knows how to say "I'm fine, roger roger, over and out" in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA. So they're a continuum.

- if you have hundreds of services, and your code MUST communicate with other groups' code via these services, then you won't be able to find any of them without a service-discovery mechanism. And you can't have that without a service registration mechanism, which itself is another service. So Amazon has a universal service registry where you can find out reflectively (programmatically) about every service, what its APIs are, and also whether it is currently up, and where.

- debugging problems with someone else's code gets a LOT harder, and is basically impossible unless there is a universal standard way to run every service in a debuggable sandbox.

That's just a very small sample. There are dozens, maybe hundreds of individual learnings like these that Amazon had to discover organically. There were a lot of wacky ones around externalizing services, but not as many as you might think. Organizing into services taught teams not to trust each other in most of the same ways they're not supposed to trust external developers.

This effort was still underway when I left to join Google in mid-2005, but it was pretty far advanced. From the time Bezos issued his edict through the time I left, Amazon had transformed culturally into a company that thinks about everything in a services-first fashion. It is now fundamental to how they approach all designs, including internal designs for stuff that might never see the light of day externally.

At this point they don't even do it out of fear of being fired. I mean, they're still afraid of that; it's pretty much part of daily life there, working for the Dread Pirate Bezos and all. But they do services because they've come to understand that it's the Right Thing. There are without question pros and cons to the SOA approach, and some of the cons are pretty long. But overall it's the right thing because SOA-driven design enables Platforms.

That's what Bezos was up to with his edict, of course. He didn't (and doesn't) care even a tiny bit about the well-being of the teams, nor about what technologies they use, nor in fact any detail whatsoever about how they go about their business unless they happen to be screwing up. But Bezos realized long before the vast majority of Amazonians that Amazon needs to be a platform.

You wouldn't really think that an online bookstore needs to be an extensible, programmable platform. Would you?

Well, the first big thing Bezos realized is that the infrastructure they'd built for selling and shipping books and sundry could be transformed an excellent repurposable computing platform. So now they have the Amazon Elastic Compute Cloud, and the Amazon Elastic MapReduce, and the Amazon Relational Database Service, and a whole passel' o' other services browsable ataws.amazon.com. These services host the backends for some pretty successful companies, reddit being my personal favorite of the bunch.

The other big realization he had was that he can't always build the right thing. I think Larry Tesler might have struck some kind of chord in Bezos when he said his mom couldn't use the goddamn website. It's not even super clear whose mom he was talking about, and doesn't really matter, because nobody's mom can use the goddamn website. In fact I myself find the website disturbingly daunting, and I worked there for over half a decade. I've just learned to kinda defocus my eyes and concentrate on the million or so pixels near the center of the page above the fold.

I'm not really sure how Bezos came to this realization -- the insight that he can't build one product and have it be right for everyone. But it doesn't matter, because he gets it. There's actually a formal name for this phenomenon. It's called Accessibility, and it's the most important thing in the computing world.

The. Most. Important. Thing.

If you're sorta thinking, "huh? You mean like, blind and deaf people Accessibility?" then you're not alone, because I've come to understand that there are lots and LOTS of people just like you: people for whom this idea does not have the right Accessibility, so it hasn't been able to get through to you yet. It's not your fault for not understanding, any more than it would be your fault for being blind or deaf or motion-restricted or living with any other disability. When software -- or idea-ware for that matter -- fails to be accessible toanyone for any reason, it is the fault of the software or of the messaging of the idea. It is an Accessibility failure.

Like anything else big and important in life, Accessibility has an evil twin who, jilted by the unbalanced affection displayed by their parents in their youth, has grown into an equally powerful Arch-Nemesis (yes, there's more than one nemesis to accessibility) named Security. And boy howdy are the two ever at odds.

But I'll argue that Accessibility is actually more important than Security because dialing Accessibility to zero means you have no product at all, whereas dialing Security to zero can still get you a reasonably successful product such as the Playstation Network.

So yeah. In case you hadn't noticed, I could actually write a book on this topic. A fat one, filled with amusing anecdotes about ants and rubber mallets at companies I've worked at. But I will never get this little rant published, and you'll never get it read, unless I start to wrap up.

That one last thing that Google doesn't do well is Platforms. We don't understand platforms. We don't "get" platforms. Some of you do, but you are the minority. This has become painfully clear to me over the past six years. I was kind of hoping that competitive pressure from Microsoft and Amazon and more recently Facebook would make us wake up collectively and start doing universal services. Not in some sort of ad-hoc, half-assed way, but in more or less the same way Amazon did it: all at once, for real, no cheating, and treating it as our top priority from now on.

But no. No, it's like our tenth or eleventh priority. Or fifteenth, I don't know. It's pretty low. There are a few teams who treat the idea very seriously, but most teams either don't think about it all, ever, or only a small percentage of them think about it in a very small way.

It's a big stretch even to get most teams to offer a stubby service to get programmatic access to their data and computations. Most of them think they're building products. And a stubby service is a pretty pathetic service. Go back and look at that partial list of learnings from Amazon, and tell me which ones Stubby gives you out of the box. As far as I'm concerned, it's none of them. Stubby's great, but it's like parts when you need a car.

A product is useless without a platform, or more precisely and accurately, a platform-less product will always be replaced by an equivalent platform-ized product.

Google+ is a prime example of our complete failure to understand platforms from the very highest levels of executive leadership (hi Larry, Sergey, Eric, Vic, howdy howdy) down to the very lowest leaf workers (hey yo). We all don't get it. The Golden Rule of platforms is that you Eat Your Own Dogfood. The Google+ platform is a pathetic afterthought. We had no API at all at launch, and last I checked, we had one measly API call. One of the team members marched in and told me about it when they launched, and I asked: "So is it the Stalker API?" She got all glum and said "Yeah." I mean, I was joking, but no... the only API call we offer is to get someone's stream. So I guess the joke was on me.

Microsoft has known about the Dogfood rule for at least twenty years. It's been part of their culture for a whole generation now. You don't eat People Food and give your developers Dog Food. Doing that is simply robbing your long-term platform value for short-term successes. Platforms are all about long-term thinking.

Google+ is a knee-jerk reaction, a study in short-term thinking, predicated on the incorrect notion that Facebook is successful because they built a great product. But that's not why they are successful. Facebook is successful because they built an entire constellation of products by allowing other people to do the work. So Facebook is different for everyone. Some people spend all their time on Mafia Wars. Some spend all their time on Farmville. There are hundreds or maybe thousands of different high-quality time sinks available, so there's something there for everyone.

Our Google+ team took a look at the aftermarket and said: "Gosh, it looks like we need some games. Let's go contract someone to, um, write some games for us." Do you begin to see how incredibly wrong that thinking is now? The problem is that we are trying to predict what people want and deliver it for them.

You can't do that. Not really. Not reliably. There have been precious few people in the world, over the entire history of computing, who have been able to do it reliably. Steve Jobs was one of them. We don't have a Steve Jobs here. I'm sorry, but we don't.

Larry Tesler may have convinced Bezos that he was no Steve Jobs, but Bezos realized that he didn't need to be a Steve Jobs in order to provide everyone with the right products: interfaces and workflows that they liked and felt at ease with. He just needed to enable third-party developers to do it, and it would happen automatically.

I apologize to those (many) of you for whom all this stuff I'm saying is incredibly obvious, because yeah. It's incredibly frigging obvious. Except we're not doing it. We don't get Platforms, and we don't get Accessibility. The two are basically the same thing, because platforms solve accessibility. A platform is accessibility.

So yeah, Microsoft gets it. And you know as well as I do how surprising that is, because they don't "get" much of anything, really. But they understand platforms as a purely accidental outgrowth of having started life in the business of providing platforms. So they have thirty-plus years of learning in this space. And if you go to msdn.com, and spend some time browsing, and you've never seen it before, prepare to be amazed. Because it's staggeringly huge. They have thousands, and thousands, and THOUSANDS of API calls. They have a HUGE platform. Too big in fact, because they can't design for squat, but at least they're doing it.

Amazon gets it. Amazon's AWS (aws.amazon.com) is incredible. Just go look at it. Click around. It's embarrassing. We don't have any of that stuff.

Apple gets it, obviously. They've made some fundamentally non-open choices, particularly around their mobile platform. But they understand accessibility and they understand the power of third-party development and they eat their dogfood. And you know what? They make pretty good dogfood. Their APIs are a hell of a lot cleaner than Microsoft's, and have been since time immemorial.

Facebook gets it. That's what really worries me. That's what got me off my lazy butt to write this thing. I hate blogging. I hate... plussing, or whatever it's called when you do a massive rant in Google+ even though it's a terrible venue for it but you do it anyway because in the end you really do want Google to be successful. And I do! I mean, Facebook wants me there, and it'd be pretty easy to just go. But Google is home, so I'm insisting that we have this little family intervention, uncomfortable as it might be.

After you've marveled at the platform offerings of Microsoft and Amazon, and Facebook I guess (I didn't look because I didn't want to get too depressed), head over to developers.google.com and browse a little. Pretty big difference, eh? It's like what your fifth-grade nephew might mock up if he were doing an assignment to demonstrate what a big powerful platform company might be building if all they had, resource-wise, was one fifth grader.

Please don't get me wrong here -- I know for a fact that the dev-rel team has had to FIGHT to get even this much available externally. They're kicking ass as far as I'm concerned, because they DO get platforms, and they are struggling heroically to try to create one in an environment that is at best platform-apathetic, and at worst often openly hostile to the idea.

I'm just frankly describing what developers.google.com looks like to an outsider. It looks childish. Where's the Maps APIs in there for Christ's sake? Some of the things in there are labs projects. And the APIs for everything I clicked were... they were paltry. They were obviously dog food. Not even good organic stuff. Compared to our internal APIs it's all snouts and horse hooves.

And also don't get me wrong about Google+. They're far from the only offenders. This is a cultural thing. What we have going on internally is basically a war, with the underdog minority Platformers fighting a more or less losing battle against the Mighty Funded Confident Producters.

Any teams that have successfully internalized the notion that they should be externally programmable platforms from the ground up are underdogs -- Maps and Docs come to mind, and I know GMail is making overtures in that direction. But it's hard for them to get funding for it because it's not part of our culture. Maestro's funding is a feeble thing compared to the gargantuan Microsoft Office programming platform: it's a fluffy rabbit versus a T-Rex. The Docs team knows they'll never be competitive with Office until they can match its scripting facilities, but they're not getting any resource love. I mean, I assume they're not, given that Apps Script only works in Spreadsheet right now, and it doesn't even have keyboard shortcuts as part of its API. That team looks pretty unloved to me.

Ironically enough, Wave was a great platform, may they rest in peace. But making something a platform is not going to make you an instant success. A platform needs a killer app. Facebook -- that is, the stock service they offer with walls and friends and such -- is the killer app for the Facebook Platform. And it is a very serious mistake to conclude that the Facebook App could have been anywhere near as successful without the Facebook Platform.

You know how people are always saying Google is arrogant? I'm a Googler, so I get as irritated as you do when people say that. We're not arrogant, by and large. We're, like, 99% Arrogance-Free. I did start this post -- if you'll reach back into distant memory -- by describing Google as "doing everything right". We do mean well, and for the most part when people say we're arrogant it's because we didn't hire them, or they're unhappy with our policies, or something along those lines. They're inferring arrogance because it makes them feel better.

But when we take the stance that we know how to design the perfect product for everyone, and believe you me, I hear that a lot, then we're being fools. You can attribute it to arrogance, or naivete, or whatever -- it doesn't matter in the end, because it's foolishness. There IS no perfect product for everyone.

And so we wind up with a browser that doesn't let you set the default font size. Talk about an affront to Accessibility. I mean, as I get older I'm actually going blind. For real. I've been nearsighted all my life, and once you hit 40 years old you stop being able to see things up close. So font selection becomes this life-or-death thing: it can lock you out of the product completely. But the Chrome team is flat-out arrogant here: they want to build a zero-configuration product, and they're quite brazen about it, and Fuck You if you're blind or deaf or whatever. Hit Ctrl-+ on every single page visit for the rest of your life.

It's not just them. It's everyone. The problem is that we're a Product Company through and through. We built a successful product with broad appeal -- our search, that is -- and that wild success has biased us.

Amazon was a product company too, so it took an out-of-band force to make Bezos understand the need for a platform. That force was their evaporating margins; he was cornered and had to think of a way out. But all he had was a bunch of engineers and all these computers... if only they could be monetized somehow... you can see how he arrived at AWS, in hindsight.

Microsoft started out as a platform, so they've just had lots of practice at it.

Facebook, though: they worry me. I'm no expert, but I'm pretty sure they started off as a Product and they rode that success pretty far. So I'm not sure exactly how they made the transition to a platform. It was a relatively long time ago, since they had to be a platform before (now very old) things like Mafia Wars could come along.

Maybe they just looked at us and asked: "How can we beat Google? What are they missing?"

The problem we face is pretty huge, because it will take a dramatic cultural change in order for us to start catching up. We don't do internal service-oriented platforms, and we just as equally don't do external ones. This means that the "not getting it" is endemic across the company: the PMs don't get it, the engineers don't get it, the product teams don't get it, nobody gets it. Even if individuals do, even if YOU do, it doesn't matter one bit unless we're treating it as an all-hands-on-deck emergency. We can't keep launching products and pretending we'll turn them into magical beautiful extensible platforms later. We've tried that and it's not working.

The Golden Rule of Platforms, "Eat Your Own Dogfood", can be rephrased as "Start with a Platform, and Then Use it for Everything." You can't just bolt it on later. Certainly not easily at any rate -- ask anyone who worked on platformizing MS Office. Or anyone who worked on platformizing Amazon. If you delay it, it'll be ten times as much work as just doing it correctly up front. You can't cheat. You can't have secret back doors for internal apps to get special priority access, not for ANY reason. You need to solve the hard problems up front.

I'm not saying it's too late for us, but the longer we wait, the closer we get to being Too Late.

I honestly don't know how to wrap this up. I've said pretty much everything I came here to say today. This post has been six years in the making. I'm sorry if I wasn't gentle enough, or if I misrepresented some product or team or person, or if we're actually doing LOTS of platform stuff and it just so happens that I and everyone I ever talk to has just never heard about it. I'm sorry.

But we've gotta start doing this right.

Wednesday, October 5, 2011

The Nymwars: Anonymity as a Security Strategy

I swore  I wouldn't post about this, it was too stupid. The Nymwars people were freaking out over nothing, I thought. Just don't use Google+, it's not hard, right? But even within Google there were fights over it. Then the EFF chimed in and I think they're right, and so does JWZ. So I'm pro-pseudonym I guess, because people I like are too.

But that's not really it, is it? At The City we don't have a "real name policy" as such, but the system makes the assumption that people are going to put their real name and picture on their profile because everyone they interact with using the service is presumably someone they could meet any given Sunday. We have a "nickname" field but that's only to accomodate Dave/David preferences and doesn't hide their real name in a meaningful way. So as a product we have a real name assumption and have steadily resisted hiding real names because the scale is small, and users are naturally bucketed into churches.

As a church grows, that scale gets bigger, and the probability of someone meeting someone they don't like any given Sunday gets bigger. But on when you put everyone in the church in a flat, searchable namespace on The City, that probability becomes 1 as soon as that someone joins. So we get requests from churches to allow hidden profiles or other privacy measures to make it less obvious that someone with a certain name has an online presence

We've long dismissed this as a "human sin problem" and simply added ways to limit who can contact them rather than hide their existence. Because The City doesn't do anything for a bad person except show you someone's name, something they could already see anyway using a phone book. But we haven't taken it far enough and I think this debate opens up how. While pseudonymity doesn't really work in the Church, for us, and I think other social sites of appreciable scale, you can choose to require real names if you like. But there had better be complete security controls around displaying presence, and affordance for anonymous interaction where it makes sense.

The exemplar is credit card billing records in a small business. They need to know your real name and billing info and may know your purchase history. Groups of customers may know each other if they choose to announce their patronage, but may choose not to. The business doesn't get to go announcing exactly who purchased from them without raising the ire of their customers, who didn't ask that.

Real names have power, and we should not expect users to trust us with theirs unless we handle them responsibly. In our case the user may be required to use The City to engage in their church community, and that's a lot to ask in any case. Reducing the fear about engaging in close community online only happens if that community and the people in it are carefully protected.

Friday, September 30, 2011

Rails Migrations Best Practices

The City is a very long-lived Rails application and as such has accumulated over 600 migrations. We've had enough of them fail in enough interesting ways to learn some rules to follow when writing and applying them.

One Atomic Change Per Migration

Migrations create a schema version that your database applies atomically. That version should specify one reversible change to the database. Generally speaking this means one add_column, create_table, or add_index call per migration. This way if any of your migrations fail you can roll back one and only one migration with a db:rollback. Remember, just because you tested a migration with a dozen add_column calls doesn't mean it will actually apply in your live site - timeouts, connection errors, and more can all happen and you don't want to have to do SQL tricks to get a half-applied migration to complete. 

While it is conceptually nice to bundle an entire feature's worth of database changes into its associated migration - say you're adding a Story model and you'd like to do something like this:

class CreateStoryTable < ActiveRecord::Migration
  def self.up
    create_table :stories do |t|
      t.string  :title
      t.integer :journal_id
      t.integer :user_id
      t.integer :account_id
      t.timestamps
    end
 
    add_index :stories, [:user_id, :account_id, :journal_id]
 
    add_column :journals, :stories_count, :integer
  end
 
  def self.down
    remove_index :stories, [:user_id, :account_id, :journal_id]
    drop_table :stories
    remove_column :journals, :stories_count
  end
end
This is a bad idea. While it's conceptually consistent, if that add_column call fails, you'll be in a half-migrated state, where the database has been modified but the schema version has not. The database will want to run this migration again the next time you run db:migrate, but guess what? The table is already there, and the migration will fail once again.

Atomicity Must Be Preserved

Ok, so we have to finish running this migration. You could run a db:migrate:redo, right? Sure, except redo is going to run the down migration, and the first thing it does is remove an index which was never created. You can't move the remove_index line down below the drop_table call because the index won't be there when the table is gone.

Let's rewrite this migration to do the 3 things it needs to do in separate migrations:
class CreateStoryTable < ActiveRecord::Migration
  def self.up
    create_table :stories do |t|
      t.string  :title
      t.integer :journal_id
      t.integer :user_id
      t.integer :account_id
      t.timestamps
    end
  end
 
  def self.down
    drop_table :stories
  end
end
class AddStoriesIndexOnUserAccountAndJournal < ActiveRecord::Migration
  def self.up
    add_index :stories, [:user_id, :account_id, :journal_id]
  end
 
  def self.down
    remove_index :stories, [:user_id, :account_id, :journal_id]
  end
end
class AddStoriesCountToJournals < ActiveRecord::Migration
  def self.up
    add_column :journals, :stories_count, :integer
  end
 
  def self.down
    remove_column :journals, :stories_count
  end
end

Much better. Now if you deploy this whole set of migrations and need to roll back the entire feature, you can just use rake db:rollback STEP=3 (if you want, but in the case of this example, you might not have to - see below), or if any individual one fails you can roll back only the changes it made.

Code-Safety Within The Release

As much as possible, avoid performing migrations that break compatibility with currently running code. Perform remove_column and drop_table migrations one release after the code change that drops dependency on them. If you're making a structure change that breaks compatibility, you're best off shipping a compatibility change first and migrating later. Pedro Belo at Heroku wrote the definitive guide to this. If you're not as sensitive to scheduled downtime, by all means take it first and ask questions later.

Time Your Tests

You should of course be testing your releases against production data before deploying them, and when doing so you must verify not only data correctness (with tests if you can) but change timeliness. If you have a migration that takes 20 minutes with production data and performs breaking changes, you had better know that in advance so you can prepare with either downtime or a staggered release.

No Code In Migrations

I know the Rails docs say it's ok (with caveats) to use models directly in migrations, but having migrations do too much has been a source of problematic bugs for us in practice. If you have data changes that must be performed as part of the migration, they should be done in a rake task. That way the whole app in its post-migrated state is available to work on, and more importantly, as the size of your dataset grows, you can distribute large dataset transformation operations to a background process like Resque. On a large database, a call like Product.all.each {...} could take a very long time, time you could save by parallelizing the work.

This has its caveats too - it complicates release management by creating a separate task to do just to keep data correct, so YMMV. We've done this out of need because we have so much data.

No Shortcuts

Migrations, despite their simple appearance, are one of the easiest ways to screw yourself up and lose customer data. Back up before applying them, test and time them carefully, and don't get lazy. Your database is not like code, testing it for correctness and rolling back changes is not easy. Love your users by being rabid about keeping their trust, and they will love you back.

Friday, March 4, 2011

Using Amazon EC2 like a cheap, confusing VPS

Let's say you wanted a VPS-like server somewhere that runs all the time, gives you root access, and lets you install whatever you want. Let's say you also wanted to attach tons of space, a CDN, a database server, and anything else. Well you'd want to use some of the Amazon AWS products but you'd probably not want to have your VPS service hosted somewhere else, so you can centralize.

So you'd think, EC2 lets you run servers, and EBS lets you use them like they have regular hard drives attached, let's just buy a yearlong EC2 reservation and get a fat server for an amortized $75/month! It's a great plan, until you start to try it and realize everything in AWS is designed in little tiny pieces, not systems.

So then you try something like RightScale, or Judo, or Scalr, and you think, I don't need this autoscaling clouding magic scripty crap, I just want a server! Is that so hard? Well no, but it's $250/month from Slicehost, and you don't want to pay that.

Let's do something different. Let's get as close to a VPS as we can in EC2 using the simplest tools possible - just the console and the ec2 API tools.

First, some concepts. AWS has its own language that is not very familiar. I strongly recommend taking an hour or two and reading the EC2 User Guide so you can get a handle on what's going on here. I tried to avoid it but really it's best to just read it straight through. This isn't a quickie project.

The core units we'll be focusing on are EBS volumes and EBS-backed AMIs. These are going to form the core of your server's identity.

EBS volumes can start as either a Snapshot, which is basically a tape backup of a drive at a point in time, or as an empty drive. Once created, they can then exist either as a detached ("available") volume, which is just like an unplugged hard drive sitting on the table, or an attached ("in-use") volume, which is like a drive plugged into a server. (Note that the server they are attached to doesn't have to be running.) You can leave EBS volumes sitting around detached as long as you like, and take snapshots of them whenever.

EBS-backed AMIs are awesome, because we can take any Snapshot and make it the root device of an AMI. Then whenever you start the AMI, it would be like like taking that Snapshot (remember, tape backup), buying a server, copying the tape data onto the server's main hard drive, and starting it up. You still have the tape, and the server now has the tape data (as copied to its hard drive, a new EBS volume) as its starting point.

The thing about AMIs is they don't really exist as servers, they exist as the idea of a server. Sort of a specification, not a saved state. They have a particular set of disks (or snapshots) to attach, a kernel to run, and an architecture, but that's about it. In most cases they're designed to destroy all the data they create during their lifetime, because AWS likes you to run things on-demand, not forever. Well in a VPS you want forever, so we're going to do some tricks to get there.

What we're going to create in AWS is an AMI that, when you create an instance from it, creates and boots from an EBS volume containing the data in a Snapshot. When the instance is stopped (shut down) or terminated (deleted), the EBS volume it made will sit there detached in your list.

This is important because while you could use an EBS-backed instance as a VPS simply by never terminating it once it's running, if for some reason it was terminated, you'd lose all the data on it unless you took a snapshot, but even then you'd be restoring from your last snapshot, not the moment the instance stopped. This is why termination protection exists, but a checkbox on a webpage is not enough to protect production data.

Let's do some work. Look for a the AMI you want in the AMI list. At the time of writing ami-3202f25b (Ubuntu 10.04 20110201.1) is a good one. Launch it, and go to Instances. When it's running, right-click it and select Create Image (EBS AMI). You'll see an AMI go to Pending in your list of AMIs Owned By Me, and a Snapshot go Pending as well. Get some paper and write down the Snapshot ID, then go back to AMIs and write down the Kernel ID.

Now drop into your terminal where we'll be using some of the ec2 tools (you installed them, right?). You can't do what we want from the web console, which is to set up a server that keeps all its disks around when you terminate it. Use ec2reg like so:
ec2reg -n 'Ubuntu 10.04 20110201.1 Base' -d 'Basic ubuntu server configuration' --root-device-name /dev/sda1 -b /dev/sda1=your-snap-id:8:false -a x86_64 --kernel your-kernel-id
What you just did was make an AMI (again, the idea of a server) that runs Ubuntu 10.04. What's special about it is that when you make instances out of it, those instances don't destroy the data they made over their lifetime. The false at the end of the -b argument is the trick.

So how many servers do you want? Just launch as many of those AMIs as you like and even if you terminate them, the stuff they do will still exist. You probably never actually want to terminate them, only stop them, but you would be safe even if you did. Now you can safely install software right on your instances without running launch scripts or pasting in a bunch of userdata.

Once you start running your instances, you'll still want to take snapshots of their volumes every so often. As long as you don't terminate your instances, only stop them, they'll work just like servers with hard drives. If you do terminate one, you'll need to do a little runaround. You'll have a detached volume that was that instance's root device. You can't boot a server from a detached EBS volume, only a snapshot. So take a snapshot of the detached root device, then run ec2reg again with that snapshot as the snap-id. You now have a new 'rescue' AMI that really only represents that one particular instance, so when you launch a replacement instance from it, you should be right back where you left off. You can keep that AMI around if you want, the important thing is the snapshot. You can always create another AMI based on a standing snapshot that has the data you want, just make sure you pick the right kernel and mountpoints. Documentation helps here.

So this is great for software on a server's root drive, but let's say you want to run a data storage engine on your VPS-like EC2 setup. You wouldn't be getting much of the benefits out of EC2 by having all that data stored right on the boot drive, far better to use a separate EBS volume for your data, or even a couple of them RAIDed together. This allows you to take data snapshots separately from system snapshots.

There's 2 ways you could do this. First, we'll design a setup where all the instances you want will have a data volume mounted when they are launched.

We'll need to make a different AMI to represent this, because you can't edit the launch block device configuration of an AMI once you've made it. We'll use ec2reg again:
ec2reg -n 'Ubuntu server with 10GB at sdf1' -d 'Ubuntu 10.04 with new 10GB volume at /dev/sdf1' --root-device-name /dev/sda1 -b /dev/sda1=your-snap-id:8:false -b /dev/sdf1=:10:false -a x86_64 --kernel your-kernel-id
This will create a new 10GB EBS volume at /dev/sdf1 when this AMI is launched. It won't actually be mounted in the OS, or even formatted, because the root device snapshot we're working with has no idea that it exists. If you're launching these for the first time this should be fine, because you can format it, declare it in fstab, etc. and your changes will be safe due to the persistent root device. If you terminate an instance and have to reattach its snapshot to a new AMI later, you'll need to snapshot both the root and data volumes and include them in the rescue AMI:
ec2reg -n 'Redis slave server rescue AMI' -d 'Rescuing the redis slave from snapshots' --root-device-name /dev/sda1 -b /dev/sda1=instance-snap-id:8:false -b /dev/sdf1=instance-data-snap-id:10:false -a x86_64 --kernel your-kernel-id
This will make an AMI that will let you relaunch that instance with both its root device and its data intact.

The other way to do this is to simply create some EBS volumes and attach them to the instances you made from the first AMI. They won't be destroyed if you terminate the instances because they were attached after the instances launched. Just don't forget to reattach them if you have to rescue the instances, or include them in the rescue AMI as shown above.

Taking into account the costs of data transfer, snapshot and EBS storage, and instance type runtime costs, this may or may not actually be cheaper than a standard VPS. You get a lot more headroom though, and easy integration with other AWS products.

Friday, February 25, 2011

Rails routing gotcha: Don't name a route not_found

When moving our app to Bundler, I got the following error:

(__DELEGATE__):2:in `not_found': wrong number of arguments (2 for 0)
(ArgumentError)
from (__DELEGATE__):2:in `send'
from (__DELEGATE__):2:in `not_found'
from /app_dir/vendor/rails/activesupport/lib/active_support/
option_merger.rb:20:in `__send__'
from /app_dir/vendor/rails/activesupport/lib/active_support/
option_merger.rb:20:in `method_missing'
from /app_dir/config/routes/plaza_routes.rb:5
from /app_dir/vendor/rails/activesupport/lib/active_support/core_ext/
object/misc.rb:78:in `with_options'
from /app_dir/vendor/rails/actionpack/lib/action_controller/routing/
route_set.rb:51:in `namespace'

The secret was a route that looked like this:

global.not_found '/not_found', :controller => 'home', :action =>
'not_found'

The solution? Rename the route to something else. No idea why this wasn't exposed before, but at least there's a solution!