seperis ([personal profile] seperis) wrote 2009-04-13 09:20 pm

amazon and codefixes - oh, this is something i might know something about!

A possible explanation, gakked from [livejournal.com profile] trobadora:

AmazonFail: An Inside Look at What Happened

Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as "adult," the source said. (Technically, the flag for adult content was flipped from 'false' to 'true.')


Note: If they are telling the truth about what happened, this applies. Actually, it would apply even if they lied, just worse: one error is one thing, but if this was a deliberate system-wide build that made the change, pretty much the same thing applies, only with less sympathy.
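
For the non-coders: a flag like that is usually just a true/false field on the item's record, and a bulk update is one function call away from disaster. A toy sketch in Python, with completely invented field names (I have no idea what amazon's actual schema looks like):

    # Toy sketch only; invented field names, nothing like amazon's real schema.
    catalog = [
        {"title": "Beauty's Punishment", "category": "erotica", "adult": False},
        {"title": "Heather Has Two Mommies", "category": "lgbt", "adult": False},
    ]

    def set_adult_flag(items, category, value):
        # One call, one wrong argument, and every item in the category flips.
        for item in items:
            if item["category"] == category:
                item["adult"] = value

    set_adult_flag(catalog, "lgbt", True)  # meant False, or meant another category

Scale the catalog up and one of those calls is your 50,000 items.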

My expertise is not expertise, it is anecdata, but it's also ten builds and fifty emergency releases of professional anecdata, so take that as you will.

I am a professional tester because at some point, it occurred to people that things worked better when there was a level of testing that was specifically designed to mimic the experiences of the average user with a change to a program. Of course, they didn't use average users, they used former caseworkers and programmers, but the point stands.



I'm a professional program tester and do user acceptance, which means I am the last line of defense for users before we release a change to the program, major or minor. It's a web-based program with three very idiotic ways for a user to interface with it online and about fifty ways for other agencies to do so automatically, and I won't go into our vendor interfaces because it hurts me inside. I am one of thirty user acceptance testers for this program, because it's huge and covers a massive number of things and interfaces with federal and state-level agencies outside of our own internal agencies. I test things straight from the hands of coders in emergency releases and also after they've gone through two other levels of testing in our quarterly builds.

This does ring true to my experience of when something just goes stupid. And when I say stupid, I mean someone accidentally killed off pieces of welfare policy with a misflag once, and that's not even the stupidest thing I've had to test. The program was built, and is still coded, modularly; the coders are in different parts of the country, and sometimes at home in India, while working on it. And none of them ever know what anyone else is doing.

While I have no idea what amazon's model looks like, to do a rollback on a change for us, even a minor one, it goes like this:

1.) Report
2.) Reproduction in one of our environments.
3.) Code fix and discussion and so many meetings, God. (emergency releases may not go through this.)
4.) DEV environment 1 (theoretical construct of the program, works very well, nothing like the real thing)
5.) DEV environment 2 (closer to the actual program, but not by much) (sometimes we don't use both Dev 1 and Dev 2)
6.) SIT (sometimes skipped for emergency releases) (I have issues with their methodology.)
7.) User Acceptance (me! And some other people; somewhat close to field conditions, with a database as of the last mass update, usually two to three months before)
8.) Prodfix (optional) (also me! And some other people, almost perfect mirror of field conditions with full database)
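
If it helps to see that as data, here's the same path as a toy Python structure, with the stages an emergency release can skip. The skip rules are my rough reading of the list above, not anybody's actual tooling:

    # Rough model of the promotion path above; stage names from the list,
    # skip rules are approximate and for illustration only.
    PIPELINE = [
        ("report",           False),  # never skipped
        ("reproduce",        False),
        ("fix_and_meetings", True),   # emergency releases may skip the meetings
        ("dev_1",            True),   # sometimes only one DEV environment is used
        ("dev_2",            True),
        ("sit",              True),   # sometimes skipped for emergency releases
        ("user_acceptance",  False),  # or replaced by prodfix when desperate
        ("prodfix",          True),   # optional
    ]

    def stages(release_type):
        if release_type == "emergency":
            return [name for name, skippable in PIPELINE if not skippable]
        return [name for name, _ in PIPELINE]

    print(stages("emergency"))  # the short, scary path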

If it's really desperate, it goes to prodfix instead of, or in addition to, User Acceptance; prodfix is the only environment we have that nearly perfectly mirrors live field conditions and is fully updated with our field database as of five o'clock COB the day before. For me to do a basic test, they give me a (really horrifyingly short) version of events, and if I get lucky, I get to see screenshots of the problem in progress.

[If I win the lottery, someone has uploaded the specific patches themselves for me to look at, and I get to see what's going on pre-compiling. That has happened once. I did not take advantage of it. I kick myself sometimes.]

Usually, I get a fifth-hand account that's gone through eight other people of what went wrong, what function I'm supposed to test, and what order to do it in. Depending on severity, I have four hours to four days to write the test (or several tests, or several variations of the same test for different user conditions or different input conditions), send it to the person who owns the defect, have them check it, then run the test in full, then fail or pass it. Or run it in full, fail or pass it, then run it in prodfix and fail or pass it.
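
Those variations are usually the same steps with the conditions swapped out. A made-up illustration of the shape of it (the conditions are invented):

    # Invented example of how one test fans out into variations:
    # same steps, different user and input conditions.
    variations = [
        {"user": "caseworker", "input": "open case",   "expected": "update saves"},
        {"user": "caseworker", "input": "closed case", "expected": "edit blocked"},
        {"user": "supervisor", "input": "closed case", "expected": "saves with override"},
    ]

    for v in variations:
        print("Run steps 1-9 as a %s against a %s; expected result: %s"
              % (v["user"], v["input"], v["expected"]))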

[Sometimes, I have a coder call me and we both stare in horror at our lot in life when both of us really don't know what the hell went wrong and hope to God this didn't break more things.]

The fastest I've ever seen an emergency release fix go through is three days from report to implementation, and at least once, we had a massive delay when they were too eager and crashed our database because the rollback didn't match the new information entered into the system since the problem started.
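
The rollback problem, in miniature: restoring a pre-problem snapshot over a live database quietly throws away everything written since. A toy version, nothing like our actual tooling:

    # Toy illustration of a naive rollback; not anybody's real tooling.
    snapshot = {"case_1001": {"benefit": 200}}  # taken before the bad release

    live = {
        "case_1001": {"benefit": 250},  # updated after the snapshot
        "case_1002": {"benefit": 180},  # created after the snapshot
    }

    live = dict(snapshot)  # naive restore: overwrite live with the snapshot
    # case_1001's update is lost and case_1002 is gone entirely; the rollback
    # "didn't match the new information entered into the system."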

[And since this is welfare and under federal jurisdiction, the state gets fined by the feds when we cannot issue benefits correctly or have egregious errors. The feds are really, really politely nasty about this sort of thing. And OIG, who audits us for errors, hates this program like you would not believe. To say there is motivation for speed is to understate the case.]

The program I test is huge, and terrifyingly complicated, and unevenly coded, and we can easily crash the servers with incredibly stupid, small-seeming things. Amazon is about a hundred times larger. We do four major builds and four minor ones (just like major, just with a different name) per year, plus upwards of thirty emergency releases between builds. Our releases aren't live but batched overnight when the program goes to low use after 8 PM, so we have some leeway if something goes dramatically bad or our testing isn't thorough enough. Which, you know, also happens. Amazon is always up, and while it has the same constant database updates we do, I'm betting it also has more frequent normal code updates, both automatic and human-initiated.

If this is actually what happened, then the delay in fixing it makes sense, at least in my experience. Unless they release live code without testing it in an environment that's updated to current database conditions, which, um, wow, see the thing where we crashed the state servers? The state is cheap and they suck, and even they won't try a minor release without at least my department getting to play with it first and give a yea or nay, because of that.



Short version: this matches my testing experience and also tells you more than you ever wanted to know about my daily life and times. YMMV for those who have a different model for code releases and updates.

And to add, again, if this is true, I am seriously feeling for the tech dept right now. Having to do unplanned system-wide fixes sucks. Someone is leaving really unkind post-it notes for the French coder. Not that I ever considered doing that or anything.

ETA: For us, there are two types of builds and fixes: mod (modification) and main (maintenance). The former is actual new things added to the code, like, I don't know, adding an interface or new policy or changing the color scheme. Maintenance is stuff that is already there that broke and needs to be fixed, like suddenly you can't make a page work. Emergency fixes in general are maintenance, something broken that needs fixing, with the occasional mod when the legislature did something dramatic.

None of this means they aren't lying and it wasn't deliberate. My department failed an entire build once due to the errors in it.

Actually, the easiest way to find out if it was deliberate is to hunt down whoever did their testing and check the scripts they wrote; or, if amazon does it all automated, the automated testing scripts will also tell you exactly what was being tested. If it was deliberate, there were several scripts specifically created to test this change.

Example:

Say I wrote the user script and was running it in a near-field environment:

Step Four: Query for Beauty's Punishment from main page.
Expected Result: Does not display.
Actual Result: Does not display.
(add screenshot here)

Step Five: Query for Beauty's Punishment from Books.
Expected Result: Displays.
Actual Result: Displays.
(add screenshot here)

We're like the evidence trail. Generally, a tester has to know what they are supposed to be testing to test it. If this was live beta'ed earlier this year with just a few authors, it still had to, at some point, go through some kind of formal testing procedure and record the results. And there would be a test written specifically to see if X Story Marked Adult would appear if searched from the main page, and one specifically written to check that X Story Marked Adult was showing sales figures, either human-run or automated.
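
An automated version of the same evidence trail might look like this; invented harness, toy catalog, and made-up function names, since I have no idea what amazon's test suite actually calls anything:

    import unittest

    # Toy catalog and search, standing in for the application under test.
    CATALOG = [{"title": "Beauty's Punishment", "adult": True}]

    def search(query, scope="main"):
        hits = [b for b in CATALOG if query in b["title"]]
        if scope == "main":
            hits = [b for b in hits if not b["adult"]]  # main page filters adult items
        return [b["title"] for b in hits]

    class TestAdultFlagVisibility(unittest.TestCase):
        # A test like this only exists if someone meant to test this behavior;
        # the script itself is the evidence trail.
        def test_step_four_hidden_from_main_page(self):
            self.assertNotIn("Beauty's Punishment", search("Beauty's", scope="main"))

        def test_step_five_visible_from_books(self):
            self.assertIn("Beauty's Punishment", search("Beauty's", scope="books"))

    if __name__ == "__main__":
        unittest.main()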

[identity profile] lyorn.livejournal.com 2009-04-14 12:34 pm (UTC)
(here via trobadora)

I work closely with our software testers, and, word.

If that's the case, it's a code issue. Amazon makes it sound as if it was a manual flagging of items or the setting of a parameter, with the code doing what it's supposed to do. But I don't expect them to tell their customers these things -- TMI.

[identity profile] seperis.livejournal.com 2009-04-14 01:32 pm (UTC)
If that's the case, it's a code issue. Amazon makes it sound as if it was a manual flagging of items or the setting of a parameter, with the code doing what it's supposed to do. But I don't expect them to tell their customers these things -- TMI.

Their description totally left me confused about what exactly went wrong. He just happened to fill out this one field in this one thing that just happened to propagate through GLBT, erotica, and some feminist literature only?

It's--I mean, honestly, a bad code update makes more sense.

[identity profile] tienriu.livejournal.com 2009-04-14 02:53 pm (UTC)
Actually, this bit made sense to me. From the report (caveat: I got the impression the information in the article had been filtered from a tech person to a non-tech person to a blogger/reporter, so I had to apply my reporter-to-technical filter over it all), it sounds like somebody made a mistake with the catalog by flipping the wrong switch.

If it's a flag (in a database, that's a 0 or a 1 value, which usually means it's represented in applications as a checkbox) and this was a long manual process (which it would be if it was affecting this many books), then the developer is likely to have set up some form of shortcut to enter the values they needed to enter. Or alternatively, created a script that uploaded everything into the system for them. If they did this on screen, rather than reading each field and then entering it in, they'd remember sequences of clicks that were identical (i.e. the one that controlled the adult flag). Thence a mistake made with the first book would be made for all the books. The same would apply if they used a script and accidentally put a '1' for the flag rather than a '0' in the code (which would make sense, since the flag would probably have been '0' by default).
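
Something like this, maybe; a guessed-at sketch of the kind of ad hoc script being described, where one wrong default value does all the damage:

    # Guessed-at sketch of an ad hoc bulk-update script; entirely hypothetical.
    def update_books(book_ids, db, adult=1):  # default should have been 0
        for book_id in book_ids:
            db[book_id]["adult"] = adult      # same mistake, mechanically repeated

    db = {
        42: {"title": "a memoir", "adult": 0},
        43: {"title": "a study of gender", "adult": 0},
    }

    update_books([42, 43], db)  # run against the whole list, flag never passed
    # Every book on the list is now flagged adult.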

Anyway, how much do you want to bet that developer is now incredibly worried for his or her job? Let's hope it wasn't an intern.

[identity profile] seperis.livejournal.com 2009-04-14 03:03 pm (UTC)
No, that part makes sense. The part that strikes me as odd is which books the script would have been set to find; I don't think you can accidentally hit LGBT and erotica and random feminist etc. with a script error that's working off preset parameters. If this came about because a guy was updating the database inventory records, the script generally would have had to specify already that, say, all LGBT-flagged items matched, and to apply the change retroactively.

Or more specifically, it means there was probably already a decision table for this that wasn't being used and got activated, sending books on the site through it to decide flagging or not, using a preset metadata category to decide the adult rating. Which I'm inclined to possibly believe, because the problem looks like it's been around for a couple of months for some people, just a much smaller group.
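
i.e., something shaped roughly like this, with purely hypothetical category names and rules:

    # Purely hypothetical decision table; the point is that the category
    # list has to be written down somewhere before it can be applied.
    ADULT_BY_CATEGORY = {
        "erotica":  True,
        "lgbt":     True,   # a human put this row here
        "feminism": True,   # and this one
        "cooking":  False,
    }

    def reflag(catalog):
        # Retroactive pass over existing inventory, driven by the table.
        for item in catalog:
            item["adult"] = ADULT_BY_CATEGORY.get(item["category"], item["adult"])

A table like that doesn't hit LGBT, erotica, and feminist titles by accident; somebody entered those rows.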

[identity profile] tienriu.livejournal.com 2009-04-15 12:30 am (UTC)
From the areas that seemed affected by the script, my best guess is that it wasn't one area, it was several, and one of them was 'sexuality' or 'identity' (perhaps even gender identity specifically).

Or rather, that the developer was updating or adding meta tags to books and other items that fell within certain categories (sexuality, gender identity, self-help; mind you, unless I can see a list of all the books, I can only take a guess based on the ones that are being listed. I'd be curious to see if there were other books that were tagged as well that simply haven't been identified as such because they weren't as high-profile). His update script (which I doubt was a database script, but rather something command-line based or similar, given that, if it was accidental, this sounds very much like a haphazard, ad hoc, developer-created way to reduce time-consuming manual updates) ran against a list he had of all the books he had to update and changed a few tags over by accident.

That is, of course, if this was all accidental (which I suspect it was, given that my faith in the ability of humans to be very silly is higher than my belief in conspiracies of this magnitude and ill-conception).

On the other hand, if Amazon IS flagging LGBT books as adult (but NOT filtering them from view) is that a bad thing or a good thing or a no-commentary thing? I'm sort of confused. Aren't most LGBT books supposed to be adult or aimed at an adult (as defined legally that is - which is theoretically 16 in some countries)?