Monday, April 13th, 2009 09:20 pm
amazon and codefixes - oh, this is something i might know something about!
A possible explanation, gakked from trobadora:
AmazonFail: An Inside Look at What Happened
Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as "adult," the source said. (Technically, the flag for adult content was flipped from 'false' to 'true.')

Note: If they are telling the truth about what happened, this applies. Actually, it would apply even if they lied, just worse: one error is one thing, but if this was a deliberate system-wide build that made the change, pretty much the same process applies, only with less sympathy.
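To make the quoted mechanism concrete: one mis-filled field on a category record, feeding a bulk update, is a plausible way a single mistake flips tens of thousands of items. This is purely a speculative sketch; the table, field, and category names are invented and have nothing to do with Amazon's actual systems:

```python
import sqlite3

# Hypothetical catalog: every item carries an adult_flag that search
# and ranking code downstream uses to exclude "adult" items.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER, category TEXT, adult_flag INTEGER)")
db.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "books/gay-lesbian", 0), (2, "books/erotica", 0), (3, "books/thrillers", 0)],
)

# One wrongly filled field on a category screen becomes one bulk update...
miscategorized = ["books/gay-lesbian", "books/erotica"]  # invented example
db.executemany(
    "UPDATE items SET adult_flag = 1 WHERE category = ?",
    [(c,) for c in miscategorized],
)

# ...and every item in those categories silently drops out of search.
print(db.execute("SELECT id FROM items WHERE adult_flag = 0").fetchall())
# [(3,)] -- only the thrillers survive
```

The point is just that "one field" and "more than 50,000 items" aren't contradictory when the field sits upstream of a batch job.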
My expertise is not expertise, it is anecdata, but it's also ten builds and fifty emergency releases of professional anecdata, so take that as you will.
I am a professional tester because at some point, it occurred to people that things worked better when there was a level of testing that was specifically designed to mimic the experiences of the average user with a change to a program. Of course, they didn't use average users, they used former caseworkers and programmers, but the point stands.
I'm a professional program tester and do user acceptance, which means I am the last line of defense for users before we release a change to the program, major or minor. It's a web-based program with three very idiotic ways for a user to interface with it online and about fifty for other agencies to do so automatically, and I won't go into our vendor interfaces because it hurts me inside. I am one of thirty user acceptance testers for this program, because it's huge, covers a massive number of things, and interfaces with federal and state level agencies outside of our own internal agencies. I test things straight from the hands of coders in emergency releases, and also after they've gone through two other levels of testing in our quarterly builds.
This does ring true to my experience of when something just goes stupid. And when I say stupid, I mean someone accidentally killed off pieces of welfare policy with a misflag once, and that's not even the stupidest thing I've had to test. The program was built, and is still coded, modularly, and the coders are in different parts of the country, and sometimes at home in India, while working on this. And none of them ever know what anyone else is doing.
While I have no idea what Amazon's model looks like, to do a rollback on a change for us, even a minor one, it goes like this (a rough code sketch follows the list):
1.) Report
2.) Reproduction in one of our environments.
3.) Code fix, and discussion, and so many meetings, God. (Emergency releases may not go through this.)
4.) DEV environment 1 (a theoretical construct of the program; works very well, nothing like the real thing)
5.) DEV environment 2 (closer to the actual program, but not by much) (sometimes we don't use both DEV 1 and DEV 2)
6.) SIT, i.e. system integration testing (sometimes skipped for emergency releases) (I have issues with their methodology.)
7.) User Acceptance (me! And some other people; somewhat close to field conditions, with the database as of the last mass update, usually two to three months before)
8.) Prodfix (optional) (also me! And some other people; an almost perfect mirror of field conditions, with the full database)
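If it helps to see that as code, here is a toy model of the promotion path; the stage names and the emergency-skip flags are my own invention for illustration, not how our system (much less Amazon's) is actually configured:

```python
# Toy model of the promotion path above. In a really desperate case,
# prodfix can also stand in for User Acceptance outright (see below).
PIPELINE = [
    ("report", False),            # (stage, skippable_in_emergency)
    ("reproduction", False),
    ("code_fix_and_meetings", True),
    ("dev1", False),
    ("dev2", True),               # sometimes only one DEV environment is used
    ("sit", True),
    ("user_acceptance", False),
    ("prodfix", True),            # optional in normal builds
]

def stages_for(emergency: bool) -> list[str]:
    """List the stages a fix actually passes through."""
    return [name for name, skippable in PIPELINE
            if not (emergency and skippable)]

print(stages_for(emergency=True))
# ['report', 'reproduction', 'dev1', 'user_acceptance']
```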
If it's really desperate, it goes to prodfix instead of, or in addition to, User Acceptance; prodfix is the only environment we have that nearly perfectly mirrors live field conditions, fully updated with our field database as of five o'clock COB the day before. For me to do a basic test, they give me a (really horrifyingly short) version of events, and if I get lucky, I get to see screenshots of the problem in progress.
[If I win the lottery, someone has uploaded the specific patches themselves for me to look at, and I get to see what's going on pre-compiling. That has happened once. I did not take advantage of it. I kick myself sometimes.]
Usually, I get a fifth-hand account, filtered through eight other people, of what went wrong, what function I'm supposed to test, and what order to do it in. Depending on severity, I have four hours to four days to write the test (or several tests, or several variations of the same test for different user conditions or different input conditions), send it to the person who owns the defect, and have them check it; then I run the test in full and fail or pass it. Or I run it in full, fail or pass it, then run it in prodfix and fail or pass it there.
[Sometimes, I have a coder call me and we both stare in horror at our lot in life when both of us really don't know what the hell went wrong and hope to God this didn't break more things.]
The fastest I've ever seen an emergency release fix go through is three days from report to implementation, and at least once we had a massive delay when they were too eager and crashed our database, because the rollback didn't match the new information entered into the system since the problem started.
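That rollback trap is worth spelling out, because it's exactly what a fix at Amazon's scale would have to dodge: you can't just restore a pre-incident snapshot, because legitimate work has landed on top of the bad change. A minimal sketch, with invented records:

```python
# Toy illustration of why "just roll it back" is dangerous once new
# data has been entered on top of a bad change. All records invented.
before_incident = {101: {"name": "case A", "benefit": 200}}

# The bad change corrupts an existing record...
live = {101: {"name": "case A", "benefit": 0}}
# ...and normal work keeps happening on top of it:
live[102] = {"name": "case B", "benefit": 350}  # entered after the bug hit

# Naive rollback: restore the old snapshot wholesale.
naive = dict(before_incident)
assert 102 not in naive  # case B silently vanishes (the crash scenario)

# Careful rollback: revert only what the bad change touched, keeping
# everything written since.
careful = {k: dict(v) for k, v in live.items()}
careful[101]["benefit"] = before_incident[101]["benefit"]
assert careful[102] == live[102]
assert careful[101]["benefit"] == 200
```

Doing the careful version across a live production database, without downtime, is a large part of why "three days" counts as fast.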
[And since this is welfare and under federal jurisdiction, the state gets fined by the feds when we cannot issue benefits correctly or have egregious errors. The feds are really, really politely nasty about this sort of thing. And the OIG, which audits us for errors, hates this program like you would not believe. To say there is motivation for speed is to understate the case.]
The program I test is huge, and terrifyingly complicated, and unevenly coded, and we can easily crash the servers over incredibly stupid, small-seeming things. Amazon is about a hundred times larger. We do four major builds and four minor ones (just like major, just with a different name) per year, plus upwards of thirty emergency releases between builds. Our releases aren't live but batched overnight, when the program goes to low use after 8 PM, so we have some leeway if something goes dramatically bad or our testing isn't thorough enough. Which, you know, also happens. Amazon is always up, and while it has the same constant database updates we do, I'm betting it also has more frequent normal code updates, both automatic and human-initiated.
If this is actually what happened, then the delay in fixing it makes sense, at least in my experience. Unless they release live code without testing it in an environment that is updated to current database conditions, which, um, wow, see the thing where we crashed the state servers? The state is cheap and they suck, and even they don't try to do even a minor release without at least my department getting to play with it first and giving yea or nay, because of that.
Short version: this matches my testing experience and also tells you more than you ever wanted to know about my daily life and times. YMMV for those who have a different model for code releases and updates.
And to add, again, if this is true, I am seriously feeling for the tech dept right now. Having to do unplanned system-wide fixes sucks. Someone is leaving really unkind post-it notes for the French coder. Not that I ever considered doing that or anything.
ETA: For us, there are two types of builds and fixes: mod (modification) and main (maintenance). The former is actual new things added to the code: like, I don't know, adding an interface, or new policy, or changing the color scheme. Maintenance is stuff that is already there that broke and needs to be fixed, like suddenly you can't make a page work. Emergency fixes in general are maintenance, something broken that needs fixing, with the occasional mod when the legislature does something dramatic.
None of this means they aren't lying and it wasn't deliberate. My department failed an entire build once due to the errors in it.
Actually, the easiest way to find out whether it was deliberate is to hunt down whoever did their testing and check the scripts they wrote. Conversely, if Amazon does it all automated, the automated testing scripts will tell you exactly what was being tested. If it was deliberate, there were several scripts specifically created to test this change.
Example:
If I had written the user script and were running it in a near-field environment, it might read:
Step Four: Query for Beauty's Punishment from main page.
Expected Result: Does not display.
Actual Result: Does not display.
(add screenshot here)
Step Five: Query for Beauty's Punishment from Books.
Expected Result: Displays.
Actual Result: Displays.
(add screenshot here)
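And if a shop runs those checks automated instead of by hand, the same steps just live in code. Here's a rough sketch of what that might look like; the search helper and the tiny flagged catalog are invented stand-ins, not anything I know about Amazon's stack:

```python
# Hypothetical automated version of the manual script above. search()
# stands in for whatever query API a test harness would expose.
def search(title: str, scope: str) -> list[str]:
    """Pretend storefront query; returns matching titles."""
    # Invented state: the title is adult-flagged, so the main page
    # index excludes it while the Books index still carries it.
    catalog = {"main": [], "books": ["Beauty's Punishment"]}
    return [t for t in catalog[scope] if t == title]

def test_adult_flag_hides_title_from_main_page():
    # Step Four: query from the main page. Expected: does not display.
    assert search("Beauty's Punishment", scope="main") == []

def test_adult_flag_still_displays_in_books():
    # Step Five: query from Books. Expected: displays.
    assert search("Beauty's Punishment", scope="books") == ["Beauty's Punishment"]

if __name__ == "__main__":
    test_adult_flag_hides_title_from_main_page()
    test_adult_flag_still_displays_in_books()
    print("both steps pass")
```

Either way, a test that specific only exists because somebody meant to test that behavior, which is the whole point.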
We're like the evidence trail. Generally, a tester has to know what they are supposed to be testing in order to test it. If this was live beta'ed earlier this year with just a few authors, it still had to, at some point, go through some kind of formal testing procedure and record the results. And there would be a test written specifically to see whether X Story Marked Adult would appear if searched from the main page, and one specifically written to check whether X Story Marked Adult was showing sales figures, either human-run or automated.
Comments:

Comment (Anonymous, 2009-04-14 03:16 am UTC): You do not think it is a bit TOO easy, considering how the media have been brainwashing us into blaming France for everything?
Xenophobia is not much prettier than homophobia, for anyone keeping track. I could possibly have believed this story if it were not playing on our well-known negative bias towards France.
Not that I think French programmers are exempt from making mistakes, by the way, but I am a database admin myself and also design complex enterprise-level software, and blaming scapegoats in a different department/company/country is a well-known tactic when a system goes awry. You wouldn't believe how often I've seen it happen.
Comment: And, well, one wonders why the whole thing does not collapse sometimes!
Reply: Specifically, this doesn't defend them; it just clarifies why they literally may not be able to do a full rollback very fast, even if they really, really want to.
Comment: I'm the software release manager for a very tiny arm of a Fortune 10 company, and knowing both how we work and how the rest of the enterprise works, I find "it takes 3 days to change this back" credible as well.
Someone with my responsibility sat in a meeting and argued for just another day of testing, and described the process for dealing with this kind of thing, but it still comes down to someone making a call about how much is needed and how much heat the company can take.
Reply: "Someone with my responsibility sat in a meeting and argued for just another day of testing, and described the process for dealing with this kind of thing, but it still comes down to someone making a call about how much is needed and how much heat the company can take."
Yes. I've honestly wished at least the test supervisors could go to some of the meetings to explain why release is a bad idea until we can check a few more things. I've had to use a single test run for three or four separate variables, and I absolutely hate having to do that. It will always come up two or three builds later, when a user finds out it affected something entirely unexpected that might have been caught if we'd a.) had more information or b.) had more time.
Comment: I'm enough of a geek that I actually found that little explanation very interesting. Admittedly, the closest I get to coding is fiddling with our Access database (a teeny, tiny thing, but I get the whole "I changed this little thing in X, so why the heck have the figures for A DROPPED OFF THE FACE OF THE EARTH?!" aspect, and that sometimes big things go wrong without anybody being entirely sure why or precisely how to fix it). But, still, it was interesting to read.
Reply: For maintenance items, most of the time, this is how it starts. When I look up a defect (an error in the program) that is in the process of being fixed, there's a log with comments as it moves from the help desk to the coders to development testing, and it usually takes a while to identify the specific issue. And even then, it often takes a while to figure out how to fix it without killing the program, especially if it's integrated with eight other things.
And that doesn't even include the arguments between coders, policy specialists, analysts, etc.
Comment: http://www.feministing.com/archives/014797.html
Amazon Rep: This was not a "glitch"
Reply: If they are, in fact, going to remove the code, that is. This won't apply if they have no intention of changing it. But if they do, again, it will still need time for the code to be removed, code to be rewritten, and code to be tested.
Reply: From what I understand, we do load testing and--balance testing? there's another term for it--every week, period, because our servers suck and they go down a *lot*, as in daily. During the big updates, they do it with SIT and potentially (this is the part that tends to be weird) the night before a build goes live (usually Saturdays). There's also random testing when the servers go down for more than two hours.
Now, what environment they do it in is a mystery. We have two separate UAT environments, one specific for interface testing and the other for general testing. I'd always assumed that prodfix, being almost-field conditions, was the one they used for that, since we only use it when specific tests need to be run on it and leave it alone otherwise. However, it doesn't get new code until we pass it in UAT.
[Interestingly, I'll soon know more than I want to about it, because the state does not have this program in use state-wide, just in specific locations with a very, very slow, constantly delayed rollout. Right now, the userbase is comparatively tiny, and adding even a few counties will crash us fairly consistently for days. We're adding more soon, at which time our environments will collapse about once every ten minutes and many frantic emails will be sent across the hall. *G* Including from me.]
Comment: This whole thing is pretty damn amazing. I did notice one thing tonight (apropos of nothing in your post, I don't think). Ironically, I'd just performed a ton of searching in m/m fiction a couple of days ago, mainly due to the release of two friends' new books. And when I signed on today, voila! It tells me my last search, recs books accordingly, and... it's basing everything off the ONE non-m/m book I looked at, even though I looked at about 50 m/m, lol. Now, it *could* be that that was the one I looked at last, but I really, really don't think so. So it's possibly not only doing all the things everyone else has noted, but even remembering customers wrong. LOL, it's like, "no, you did not really want to find gay erotica, that was just a figment of your imagination... let me rec you all these (totally uninteresting to me) thrillers!"
Anyway, thanks for the fascinating look into your world, and how something like this might look at a micro level.
Edited to clarify what in the world I was talking about with it using my prior search.
Comment: I never realized they had an adult filter, because they don't tell you this (at least not anywhere prominent). And I'm suspicious that they'll now backpedal in public and "fix" this for the well-known books, the ones they actually didn't want to give the leper flag, but that in the end they'll still stick with this forced "adult content filtering" policy for books that are less well known and/or have actual sex in them, the ones that need the search functions and related-books displays the most for their exposure, compared to books that won awards and had movie deals. But because it'll only be fewer titles again, and Amazon "fixed" the hack'n'slash method of removal (or misflagging or whatever it was), the outrage will have died down when they quietly mess with the rankings again, and you won't even notice that they filtered some gay stripper memoir you never heard of from your suggestion list.
Reply: Yes, this.
If they reply to my mail that they were filtering for adult content, "where do I turn the filter off" will be my next mail to them.
Comment: I don't know that I'm willing to entirely trust their explanation, plausibility aside.
B
Comment: One would think that many different pieces of information make a nice diversion tactic.
Comment: I work closely with our sw testers, and, word.
That is, if it's a code issue. Amazon makes it sound as if it was a manual flagging of items, or the setting of a parameter, with the code doing what it's supposed to do. But I don't expect them to tell their customers these things. TMI.
Reply: Their description totally left me confused about what exactly went wrong. He just happened to fill in this field in this one thing that just happened to propagate through GLBT, erotica, and some feminist literature only?
It's... I mean, honestly, a bad code update makes more sense.
Comment: For example: the causal linking between oil and authoritarianism is anecdata at best.
Comment (Anonymous, 2009-04-14 02:51 pm UTC): It WAS a policy change, no ifs, buts, or maybes.
Amazon just got afraid and realized that they had to reverse their changes when they saw the level of outrage this generated. And the implementation was clearly buggy anyway.
Reply: Yes, I know. It's been mentioned.
I've disclaimed this a few times in the entry, but I'll try this again. Glitch or deliberate, if they are truthful about fixing it, it will still take time to roll it back to the state it was in before this update, whenever it was implemented, and not lose subsequent updates and database changes. So it can in fact be a complete and total lie. But for the purposes of making a website-wide system change like this, it wouldn't matter whether they meant to or not; they still have to roll back to fix it.
Comment: So I pretty much got that. Which is wow.
Comment: My take on it?
If it was a deliberate inside job (which it may have been), Amazon isn't going to want to publicize it. If it was an outside job where someone capitalized on a code weakness, Amazon isn't going to want to publicize it. If it was a complete hack of the system, again, they're not going to want to publicize it. If it was a policy issue that has since been reversed, they're not going to want to publicize it.
Honestly, "glitch" is about the most explanation we're ever going to get.
Reply: Even if it's about as effective as the little Dutch boy with his finger in a dike.