From failure to mastery with Janelle Klein

Janelle Klein, Founder of Open Mastery

This week we sit down with Janelle Klein, author of the book Idea Flow: How to Measure the PAIN in Software Development (leanpub.com/ideaflow) and founder of Open Mastery, an industry peer mentorship network focused on data-driven software mastery.

We discuss several lessons she learned throughout her career, how they became the genesis for Open Mastery, and her goal to rally the industry to work and learn together to break down the wall of ignorance between managers and developers that drives our software projects into the ground.

Full Transcript:

[music]

Announcer:

You’re listening to the “No Fluff Just Stuff podcast,” the JVM conference series with dozens of dates around the country. There’ll be one near you. Check out our tour dates at nofluffjuststuff.com.

Michael Carducci:

Hey everybody, this is your host, Michael Carducci. This week we talk with the newest regular on the No Fluff Just Stuff tour, Janelle Klein.

Janelle has been doing some really fascinating work around improving the daily lives of software engineers: tools that make the pain and the problems we face every day visible to management, in ways we can communicate on their terms.

She’s even working on some talks with her own ideas around world peace. If you get a chance to check out some of her talks on the tour, I highly recommend you do so. She’s fascinating, she’s brilliant, and let’s hear what she has to say.

[music]

Michael:

We’re here at ArchConf in San Diego. Have you been enjoying it?

Janelle Klein:

Hi, I’ve been having a great time so far.

Michael:

I’m joined here with Janelle. Janelle, Janelle Klein. Janelle is one of the speakers on the tour. She’s speaking here at ArchConf. The first time I saw you speak, you told a story of your career that really resonated with me. Would you mind sharing that with us now?

Janelle:

Sure, I can do that. This was back about eight years ago or so. I was working on this process control project in semiconductor manufacturing.

Michael:

PLCs, that kind of thing, or…?

Janelle:

Statistical process control, so SPC. Our software was responsible for detecting when things went wrong in the manufacturing plant and then shutting down the tools responsible. It was a 24/7 factory operation.

We had this really awesome team. It was an old, legacy kind of system, though. It had been in production a long time, a 12-year-old application, with all the things that come with that.

But, we had this great team, and we had CI unit testing, design reviews, code reviews — all that stuff that you’re supposed to have.

Michael:

All the safety nets. All the things that prevent bad things from happening.

Janelle:

Exactly. Very disciplined team, and QA process, and such. I had been working on performance improvements. They hired me because I had Oracle wizardry skills when it came to doing performance magic.

I started looking at the application, and I got this idea of how I could make performance improvements by essentially turning the architecture inside-out. In my test runs, I was seeing order-of-magnitude performance improvements with the changes I was testing.

We tied a bow on the new software and shipped it to production. We got on this conference call with IT as they installed it. I just remember this guy on the call who started screaming in the background. We were like, “Oh my God, what’s wrong?”

Apparently, we shut down every single tool in the entire factory. We all felt terrible about it, and rolled back the software, and tried to figure out what happened. It was this configuration change that didn’t quite make it to production. Just one of those little things that gets missed sometimes.

We fixed the problem, and tied a bow on it, and shipped to production again. Then, later that night, we were back on the conference call with IT again, and guess what happened.

Michael:

Somebody started screaming?

Janelle:

[laughs] Something like that. Everything went down one more time. This time, we rolled it back again, and we were running tests in our test environment and we couldn’t reproduce the problem at all. It didn’t matter what we threw at it, we couldn’t make the system crash. Everything looked like it was working fine, but it failed in production.

Months and months started passing by with us trying to figure out how to reproduce this problem. Meanwhile, our development team was sitting pretty much idle waiting for us to get this release out the door. Management just told them to go ahead with the next release, because what else are we going to do with all these idle developers?

They started working on the next release. A few more months rolled by. Eventually, we figured it out. It was this synchronized call buried deep in some multi-threaded code. In a data processing engine that was highly parallelized, everything funneled through one synchronization block.

In fact, it was a log statement calling a method that ended up being synchronized. Basically, somebody added a log statement in the wrong place, calling the wrong thing, which snowballed the system. I found out we had to have a realistic data stream to be able to reproduce that particular issue.

We ended up building an entire production-like test harness just to be able to reproduce it.
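The failure mode described here, a single synchronized call on the hot path of a parallel engine, can be sketched in a few lines of Java. This is a hypothetical illustration, not the actual system's code: the class and method names are invented. The point is that every worker thread must take the same lock just to log, so a highly parallel pipeline quietly degrades to single-file execution under a realistic event stream.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the failure mode: one innocuous-looking log call
// that funnels every worker thread through a single synchronized block.
class LegacyLogger {
    private int entries = 0;

    // Every thread that logs must acquire this one lock, so a log
    // statement on the hot path serializes the whole pipeline.
    public synchronized void log(String message) {
        entries++; // stand-in for formatting and writing the entry
    }

    public synchronized int entryCount() {
        return entries;
    }
}

public class ContentionSketch {
    public static void main(String[] args) throws Exception {
        LegacyLogger logger = new LegacyLogger();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        int eventsPerWorker = 10_000;

        for (int w = 0; w < 8; w++) {
            pool.submit(() -> {
                for (int i = 0; i < eventsPerWorker; i++) {
                    logger.log("processed event"); // hot-path log call
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);

        // All 80,000 events passed through one lock, one thread at a time.
        System.out.println(logger.entryCount());
    }
}
```

With a trickle of test traffic the lock is rarely contended and everything looks fine; only under a production-like data stream do the eight workers pile up behind it, which matches why the crash could not be reproduced without a realistic test harness.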

Michael:

Great, you solved it.

Janelle:

Solved the problem, tied a bow on it, and crossed our fingers, and got ready to ship to production again. Back on the conference call with IT. I remember we were just in this huge room, and now there were three times as many people on the call. We were all just holding our breath and watching these real-time activity charts just hoping everything would be OK.

They spun up the server and, finally, everything looked fine. Went home that night feeling good that things would finally be all right again and go back to normal. We could get back to a nice rhythm of releases. Then, about 3:00 AM, my phone rang, and it was my team lead calling.

He asked me about some code that I’d written. Remember that performance improvement I’d made at the very beginning? Yeah, that code I wrote months ago finally went live in production. Apparently, I had introduced a memory leak that ground the system, again, to a screeching halt.

Michael:

Rollback again?

Janelle:

We tried to roll back again, but this time, the rollback failed. At that point, I got out my laptop and started looking around for plan B. I noticed I had a feature toggle that I’d added in the code that disabled the part that was exploding.

Michael:

I always write my best code when I’m suddenly awoken at 3:00 in the morning.

Janelle:

I told my team lead about the feature toggle, and he disabled the feature toggle. Then, the log file started filling with null pointer exceptions.

Not only could we not roll back the code or disable it, production was down, we couldn’t get our software out of production, and it was completely my fault. There I was at 3:00 in the morning, hacking out a fix to disable this code so we could finally get the software out of production.
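A minimal sketch of how a feature toggle can fail this way, with invented names and a deliberately simplified shape, not the actual production code: the toggle skips the feature's work, but code elsewhere still dereferences state that only the enabled path populates, so flipping the toggle off trades one failure for a stream of NullPointerExceptions.

```java
// Hypothetical reconstruction of the toggle failure: the flag disables the
// feature's work, but downstream code still assumes the state the feature
// would have produced.
class ProcessingFeature {
    static boolean newPathEnabled = true; // the feature toggle
    static String lastResult = null;      // only populated by the new path

    static void process(String event) {
        if (newPathEnabled) {
            lastResult = "processed:" + event;
        }
        // When the toggle is off, lastResult silently stays null.
    }

    static int reportResultLength() {
        // Written against the enabled path; throws NullPointerException
        // the moment the toggle is flipped off at runtime.
        return lastResult.length();
    }
}

public class ToggleSketch {
    public static void main(String[] args) {
        ProcessingFeature.newPathEnabled = false; // ops disables the feature
        ProcessingFeature.process("sensor-event");
        try {
            ProcessingFeature.reportResultLength();
        } catch (NullPointerException e) {
            // The failure mode from the story: the logs fill with NPEs.
            System.out.println("NullPointerException from disabled feature");
        }
    }
}
```

The design lesson is that a kill switch is only safe if everything downstream of the toggle is written to tolerate the disabled state, which is exactly what a 3:00 AM patch ends up retrofitting.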

Finally got the patch installed, deployed it to production, and got the thing barely running. Looking back on what I’d done: I had this 24/7 operation, and I had this idea in my head about what was important, about making things better, with making the code look pretty being an important thing to optimize for.

I just felt so confident in myself that I couldn’t screw up. I’m a hotshot developer.

Michael:

This is beautiful code.

Janelle:

It’s beautiful code. What could be wrong with that? That was really tough. I went back to work the next morning, and I just felt terrible.

Michael:

I know that feeling.

Janelle:

My boss called me in his office, and he’s like, “What happened?” I just completely broke down sobbing. He’s like, “You know, you’ve got two choices. You can put it behind you and try and let it go, or you can face your failure with courage and learn everything that it has to teach you.”

I really took that to heart and looked at my situation and where we were at. Went back to my team. There’s really nothing we could do at that point. You can’t rewind the past. You can’t change what’s already happened. All you can do is start from where you are in that moment.

Michael:

Just own it. What are we going to do differently?

Janelle:

We’d been trying to do all these things that we thought were the right, most important things to do. This was one of those moments that really shattered my faith in best practices. It’s not just about writing beautiful code or following all the best practices.

You really have to take ownership of the risk you’re introducing with any change and what that means to the business, and come up with a realistic strategy to solve whatever problems are on your team.

We ended up taking an entirely different approach to solving the problems, just because we didn’t know what to do from that point. Where do you go from “we’re doing all the right things already”? Where do we go from here?

We were failing catastrophically, despite all those things. How do you get better at that point?

Michael:

It seems like, from listening to a lot of your talks, that became this defining moment of your life, of your career, and started to shape some of the things that you’ve gone and done since.

Janelle:

Probably the thing that made the biggest difference was just that I learned that I couldn’t really start looking at things a different way until I let go of all the assumptions I had already. I had to lose faith in best practices to be able to be open to see a different way of looking at the world.

It taught me that learning starts with uncertainty. Until you’re ready to doubt, as long as you’re holding on to an existing belief, you can’t really see alternatives. That’s when I started looking for alternative explanations for what is good, what is better. How do we define these things in some way other than best practices?

Practices are all just a means to an end. What is that end that we’re actually aiming for? What is better? What does better actually mean?

Michael:

If I understand correctly, what you’re saying about best practices is that they might be too rigid, that there are just far too many variables to distill good code versus bad code into these neat, little boxes?

Janelle:

I think we spend a lot of time trying to take reality and shove reality into some kind of box that we already have defined, because it makes life simpler and it makes decision-making simpler. In reality, probably the main thing I’ve learned is it’s really easy to do all the right things from a practice standpoint and solve the general problems without solving your specific problems.

For example, you can write lots of automation and not actually write automation that catches your bugs that you have in your software.

If you don’t understand what’s actually causing the people on the team to make mistakes or causing the people on the team to miss problems, why defects are getting missed, you can write automation all day and keep having quality problems, even when you have mountains of automation.

That’s what we learned. It’s not about having a mountain of automation. It’s about understanding what your problems are and then designing solutions to fix them. The key lesson I learned was really to stop making generalizations about these things and to start digging into the details of what’s actually going wrong and understanding your problems.

The hard part with improvement isn’t solving the problems. It’s identifying the right problems to solve.

Michael:

That’s harder than it sounds on the surface, and on the surface, it sounds difficult. I don’t want to diminish it. This has led you to do some really interesting research around these problems that teams face. Tell us a little bit about that, what you’re doing right now.

Janelle:

Originally, we thought the main problem was technical debt building up in the code base and causing us to make mistakes. I wrote this tool that could detect high-risk changes in the code and let us know where we needed to do extra testing.

When we did that, what we found in the data wasn’t really what we expected at all. Most of our mistakes were actually in the most well-written parts of the code. Not the crufty, technical-debt-filled stuff; it was the code written by our most senior engineers.

At first, we were just totally confused by that. Then, once we started digging into it a little more, we found that most of the mistakes were associated with low familiarity, like people making changes to code that they didn’t write themselves. That made some sense, low familiarity causing more mistakes.

We’ve had this experience in coding for years where you interact with complex code, and it’s really painful. When I started thinking about what makes development feel more painful and trying to find answers to those questions, I started keeping track of the pain I was experiencing during development and actually visualizing my pain on a timeline.

Once I started doing that, I started realizing that a lot of the problems I was running into were caused by more human interaction factors as opposed to problems in the code itself.

For example, a stale memory mistake is when I make an assumption about the code because I have some memory of how it worked, and it doesn’t work that way anymore because somebody changed it. The memory I have in my head is stale. It’s not really a problem with the code itself. It’s a consequence of how we interact with the code as a team.

Michael:

You’re going in with assumptions about, “Well, this code would be the same as I left it when I last touched it.”

Janelle:

Yeah. When we look at the properties of the code itself and ignore the humans in that process, we make a lot of bad assumptions about what our problems are.

I think it’s easier to externalize things and blame the code for our problems, but those human interaction factors make the situation way more complex: how change occurs and how evolution occurs, not just of the code but of our team, of how a feature’s needs change over time.

There’s all these evolutionary factors and interaction factors that make our work difficult that aren’t necessarily properties of the code itself. I started visualizing all of these things and then realizing that I could start aggregating all this data to identify the biggest causes of pain.

We ultimately ended up identifying our biggest problems on the team and solving our problems with a data-driven feedback loop. That’s the discovery that led to all of this growth around data-driven software mastery.

I started using these tools in consulting and mentorship. It gave me a universal definition of better, which is essentially optimizing for developer experience as opposed to optimizing the code itself.

Michael:

For being faster, or being more according to the “right” way to do it, or things like that, all these other ideas that we have around code?

Janelle:

It’s more than that, in that there are a lot of biases that come into play. For example, one of the things I realized when I started recording things in elapsed time is that our sense of time when we work is way off. One of the things developers commonly optimize for is lowering execution time.

When I actually started looking at the data, at how much time I was spending on execution versus how much time I was spending on human-cycle stuff, like setting up testing, looking at results, trying to figure out what the hell was going on in the system, and diagnosing problems, most of the pain, by several orders of magnitude, was in those human cycles as opposed to execution time.

When you’re executing stuff, you’re waiting, and waiting always feels slow. Whereas, when you’re intensely focused on trying to troubleshoot a problem, the time just…

Michael:

You can look up and it’s 1:00 in the morning.

Janelle:

Exactly. Because of that time skew, we have this bias where we want execution time to be faster. But once I had a data-driven feedback loop and could see all the time that was being taken up, I had a way to say, “These are the things that are actually taking up the majority of my time.”

I started shifting my focus to improving human cycle time related problems instead, and the amount of time I spent troubleshooting dropped dramatically. I started coming up with all these patterns and principles around optimization rules for how to optimize developer experience based on human interaction factors.

Michael:

You’ve done a lot of this research. You’ve got a lot of people involved in the process now. The tools that you’ve created to do this are incredible. I was actually really impressed with what I saw. You’ve just released a book literally here at the conference on stage.

Janelle:

I did. I did. I’ve been working on this book for five years, so it was a big deal to finally hit the publish button. It’s one of those things where people say, “Writing a book is a hard thing to do. It’s really hard.” But until you actually go and do it and write a book, you don’t realize it’s way more work than you ever think it’s going to be.

No matter how much you try and think it’s going to be hard, just multiply that by 10 or by 100 and you’ll get a little closer to what that’s like. Yeah, I published my book on stage when you gave me a nice drum roll.

Michael:

I did. I provided the drum roll. I’m happy to help. I’m just glad to have played a small part in the release of your book. Tell us a little more about the book, though.

Janelle:

The book starts with my story of tragedy and talks about the discoveries that we made and how we ultimately turned things around on the team as well as the tooling and method we’re using for measuring development pain.

Then, I’m also looking at the systemic problems across our industry with how projects get run into the ground over and over again with just organizational business pressure.

The other major problem that we have in development is that, despite all the things we try to do with writing maintainable code, we don’t really talk about how you actually pull it off in the context of a business.

We have all these communication problems between the engineering world and the management world with struggling to communicate our pain, and being under constant urgency to deliver features.

The other major theme of the book is, how do we solve those organizational problems? Once we have visibility, it really starts a chain reaction effect with being able to solve problems with communication and organizational structure so you start working together.

In the book, I start with visibility down in the weeds and take it all the way up to the organizational investment level with redesigning our financial tools that we use as an organization to really put the pain on center stage and then start using that data-driven feedback loop to learn our way to better.

Michael:

That alone is huge: actually gathering this empirical data, which I don’t think anybody has really done at the scale that you’re doing it, and then using that data to articulate these problems and potential solutions across the organization.

Also, identifying and seeing these problems that we’ve felt for so long but have had no syntax for. We’ve had no way to really articulate what these things are.

Janelle:

I think that’s probably the thing I’m most excited about, though. Don’t get me wrong, the ability to transform our organization with visibility at the center of it is pretty cool, but the thing that I’m really excited about is just being able to share our experience in a universal language that we’ve never really had before.

To be able to derive patterns and principles from lessons that we learn, and share them across our industry, and have a shared knowledge base that’s actually backed by evidence and data as opposed to just anecdote is a huge leap forward for our industry.

Michael:

You’ve got some really big plans, though, for what you’re trying to do.

Janelle:

I do. [laughs] I am not short of big plans, that’s for sure. The other thing that’s happening this month, about two weeks ago, I started a new company.

Michael:

This month being April, because I’m not sure when this is going to get released, but hopefully soon.

Janelle:

I’m kicking off this company, Open Mastery, a pay-it-forward industry peer learning network. In addition to the stuff with visibility, I’ve actually got a second book (I split my book in two, so it’s 80 percent written already) on the Open Mastery learning framework.

It’s unique in that, rather than everybody being off solving all of our problems alone, since we have this universal language we can get much more leverage out of it by working together. So I designed a learning framework that is built on an industry knowledge-sharing hub.

The learning framework is integrated with a support community. Essentially, we’re supporting anybody that’s interested in using these tools. We broke down the learning framework for the organizational transformation into an iterative road map. If you want to work your way there as an organization, we’re going to help people to do it.

Since it’s a pay-it-forward industry peer learning network model, we’re basically going to dump our hearts into helping support the community, and solving these problems, and working together. The expectation if you join the group is to pay it forward.

I’ve been working with Austin Software Mastery Circle, which is my community group, and we’ve been designing peer mentorship protocols for digging into these problems.

It took us about six months or so to get a process down, but we’ve got a process that can essentially scale up to industry level so that this can truly be an open environment that anybody can join and get help and support with digging into their problems.

We’ll ask people lots of questions and help them understand what’s causing their pain using this same data-driven method. People can walk away with the most important thing to a successful improvement effort, a better understanding of their problems. I’m really excited about that, too.

The high-level, long-term plan is to repair the broken feedback loop in our socioeconomic system with respect to industry and education. Right now, we’re in this state where the education system has become completely irrelevant because it’s so disconnected from industry.

You spend all this money to go and get an accredited education, and 90 percent of the things that you learned you don’t really need to know, and you don’t learn 90 percent of the things you really needed to know.

There have been some companies that do craftsmanship training, but you’re still talking $20K for an eight-week boot camp. What we really need in our industry is mentorship programs. The art of software development is really learned through mentorship.

I’m trying to fill that gap. Once we build up this knowledge base in the community, the plan is to turn that into an education and mentorship system online and make mastery-level education for software craftsmanship free to everyone in the world.

Michael:

Wow. That is incredible. I’m really excited by the things that you’re speaking about and what they add to the overall texture of the tour itself, so it’s been really fun being on the road with you and seeing all of this come out. So, where can we find your book?

Janelle:

You can find my book at leanpub.com/ideaflow, and then openmastery.org is the website.

Michael:

Wonderful. Thank you so much for your time and sharing your story. It’s incredible. We all have so many of these learning experiences in our careers, and I know it takes a tremendous amount of courage just to stand up and say, “I completely screwed this thing up.”

Janelle:

I think that’s what we need on stage, though. We all learn so much more from failures than from the stuff that falls between the cracks. I’m a big believer in putting mistakes and pain on center stage. The theme of all this stuff is: let’s face what’s really going on and then work on solving these problems together.

Michael:

Wonderful. It was so much fun talking with you. Thank you again for joining us.

[music]

Janelle:

Thank you.

Michael:

I’ll see you at the next stop.

[music]

Announcer:

At No Fluff Just Stuff, we bring the best technologists to you on a roadshow format. Early bird discounts are available for the 2016 season. Check out the entire show lineup and tour dates at nofluffjuststuff.com.

Michael:

I’m your host, Michael Carducci. Thanks for listening and stay subscribed.

 
