This is an exclusive interview with Mircea Strugaru, the man behind Crossover’s software engineering. In this interview, I wanted to go deep into how they’re enabling full-time remote work for thousands of professionals around the world and how they’re achieving their world-famous performance boost in software engineering teams. My name is Sinan Ata and I’m the founder of Remote Tips and Director of Local Operations at Crossover. Let’s not waste any time and let me take you to the interview.
Can you tell me a bit about yourself and your background?
I started coding when I was around 12. My father was the head of the computer science department at a Romanian university in Timisoara, and he gave me the computer bug. Since then I have always been close to technology.
In 2000, I had my first job. I started working as a developer but from day one I asked my boss to let me run the development team. We were in a trial phase as a development center. In that trial, the German parent company was evaluating us as a team. We were serving as an outsourcing team to this software engineering group. In a couple of years, we managed to actually replace all the other outsourcing teams. Back then they had software engineering operations in 3-4 sites and we managed to get all of their budgets and consolidated all software engineering related efforts on our site. That was a very good success of mixing development with management.
I spent 3 years with that company, then 10 years at a corporation, Alcatel-Lucent, at various levels of software engineering management. Then I got sick of the bureaucracy I was facing. I knew that I needed to find the next thing to grow my career and to become creative and productive again, but I had no clue where to find it. My situation was similar to Slava’s, who recently told his story to the Russian-speaking world. I didn’t want to go to Seattle, Dublin, etc. to work with world giants like Google or Microsoft. I wanted to stay home.
Then I met Andy Tryba, and he explained to me what he wanted to build. I understood the logic of the model and met some key players like Andy Montgomery. That was everything I needed to decide whether I should join. I understood this was something I must be a part of, and 3 months later I was starting as the VP of Engineering for Crossover.
What exactly are you doing right now in Crossover and the parent company, ESW Capital?
I do a few things in parallel, one of them is to run software engineering for Andy Tryba’s companies. You know he has Crossover, he has EngineYard, he has DNN and he’s acquiring more and more companies this year. He recently established a billion dollar fund named Think3 to acquire SaaS companies around the world.
That’s the first branch of my responsibilities, the other branch is running the central group for maintenance. We have about 130 engineers doing maintenance for the entire group. For all companies and all products. That’s the epitome of a factory. That’s what we do in the group. We create large teams and we run them as a Worksmart Factory.
And the third thing I’m doing is cross-functional. I’m deploying some end-to-end processes such as customer issue resolution, defect resolution, roadmap to release, and outage recovery. The challenge there is to build something that spans more than one function. Worksmart Pro optimizes mostly at the team level, and we need processes that unite more than one function, so that’s what I’m also doing. Those are the 3 main things that I’m currently handling as the VP of Engineering for Crossover.
What are the first steps right after acquiring a software company, how are you transforming companies and teams?
Of course, there are the financial calculations at first, but from the software engineering perspective, the first thing we do is due diligence, before the acquisition. In this due diligence process, we’re mostly interested in understanding the quality of the product. How well is it designed? What technologies have been used? How well does it align with our standard model? We do have a standard model. Once the product is acquired, we do what we call an Import. During the Import, we take whatever the company has created and restructure it, improving it to match our standardized model for software products.
What does that mean? It means that we have a standard way of doing CI/CD, so continuous integration and continuous deployment pipelines are always run with the same technologies and in the same way. Then we change how the product gets deployed, from whatever model they use to Docker. So we only have dockerized products running in containers in all of our production and dev environments. Then we change the database technology from whatever they have been using to Amazon Aurora.
We only run products in the cloud. We don’t run on-premise. Whatever deployment they may have in a data center or with another cloud provider, we migrate to Amazon: we pull things out of data centers, move them to Amazon, dockerize them, and run CI/CD. In some cases, we also change the language in which the product was built, either fully or partially. We have a set of preferred languages, and if the product we’re acquiring is not built with those languages, we usually rewrite part of the code.
For front-end, we’ve created a UI factory which is using Angular and for the back-end, we’re using .NET and Java for the vast majority of our products.
What are your key metrics while you’re transforming software engineering teams?
While we’re acquiring a new company or product, we aim to keep the top talent from the existing teams and start injecting tested talent from Crossover into these teams to increase overall quality. You know Crossover’s mission is to reach the top 1% of talent from all around the world, and we’re pretty good at that.
All the companies that we acquire have, at best, a division between QA and development, so they have a bunch of developers and a bunch of testers, and that’s pretty much all the segmentation they have within software engineering. We go way beyond that. We split the engineers doing feature development from the engineers doing maintenance, from the engineers doing performance improvements (we call that Faster), and from the engineers changing how the UIs work (we call that Easier).
Ok, so we have 4 major internal software factories: Feature, Faster, Easier, and Maintenance. When we acquire a company, we take its top performers and place them within those teams (Feature, Faster, Easier, Maintenance), and the next thing we do is run a pair work program so that their knowledge is spread across the group.
We have something called a cross-training factor, which hovers between 3 and 5, meaning that each person can work on 3 to 5 products. Obviously not at the same time, but they can switch between products.
And then, these teams are measured with a single metric and they’re working based on a common process. So imagine for example for maintenance we have 80 products that we’re doing maintenance for, we have 130 developers around the world working on them, they have exactly the same process, exactly the same quality bar, exactly the same SLAs across all products.
This uniform way of doing things across all products gives us an opportunity to optimize and to see who the best people in a large team are. This is what generates an economy of scale: we become very productive, and we’re also able to identify the top talent across more than 100 contributors.
Within a process we have the following two criteria:
Productivity in Software Engineering
Cost per unit
- This is how expensive it is to deliver a single unit of the team’s metric. It’s an indicator that we’re always looking at; we measure it per individual, per project, per product, per team and also per unit. The unit has to be something that delivers value to the customer, and it is specific to each team. Units are not subjective: we only work with things that quantifiably create business value, unlike a subjective measure such as story points. For maintenance it’s a defect fix, for feature teams it’s a capability that the customer can actually use, for the performance team it’s a percentage speed improvement, etc.
Cycle time
- When we say cycle time, we’re referring to the time between when work starts and when it is delivered. We aim to shorten this cycle time every quarter. To give you an example, we started Q4 2017 with a cycle time for defect fixing of 30 days. So we needed 30 days on average to deliver 1 defect fix to production, and now our cycle time is 6 days. So we made our cycle time 5 times shorter in less than 2 quarters. We’re planning to shorten it further.
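The two productivity indicators described here can be sketched in a few lines of Python. This is only an illustration, not Crossover’s actual tooling: the function names, the sample tickets, and the $100,000 team cost are assumptions made up for the example. Only the 30-day and 6-day cycle times come from the interview.

```python
from datetime import date
from statistics import mean


def cost_per_unit(total_cost, units_delivered):
    """Cost of delivering one unit (e.g. one defect fix) for a team."""
    if units_delivered <= 0:
        raise ValueError("units_delivered must be positive")
    return total_cost / units_delivered


def average_cycle_time_days(tickets):
    """Mean number of days between work started and work delivered.

    `tickets` is an iterable of (started, delivered) date pairs.
    """
    return mean((delivered - started).days for started, delivered in tickets)


def speedup(old_cycle_time, new_cycle_time):
    """How many times shorter the new cycle time is than the old one."""
    return old_cycle_time / new_cycle_time


# Made-up tickets, each taking 6 days from start to delivery.
sample_tickets = [
    (date(2018, 3, 1), date(2018, 3, 7)),
    (date(2018, 3, 5), date(2018, 3, 11)),
]

print(cost_per_unit(100_000, 100))              # hypothetical team cost and output
print(average_cycle_time_days(sample_tickets))  # average days per ticket
print(speedup(30, 6))                           # 30-day and 6-day figures from the interview
```

Because both indicators are plain ratios over countable things (dollars, delivered units, days), they can be computed the same way for every team, which is what makes them comparable across products.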
Quality in Software Engineering
The failure rate on quality bars
- We do the following: we set quality bars for the transitions between teams. One of the consequences of the team specialization we just talked about is that it introduces a number of handoffs between the teams. Those handoffs need to happen according to a certain quality bar. We define the quality bar, we make it very objective, and we try to automate it as much as we can. Once we have it, every transition either passes or fails the quality bar. When a transition fails the quality bar, we record that. The failure rate is the number of failed transitions as a percentage of the total number of transitions. Say you have a quality bar failure rate of 10%; your end state is obviously 0%, so that no work gets rejected. We want a 100% pass rate, or 0% fail rate, for the transitions between states.
SLAs (service-level agreements)
- Then, because we have this kind of fragmentation, we also have a certain time in which we want each function to do its job. For that, we’re introducing another thing called SLAs (service-level agreements): how long a ticket can spend in each state before it needs to be advanced. Again we do the same thing. If a ticket is going through a certain state, for example QA, our SLA for QA is 24 hours. If the QA team gets a ticket today at 13:00, they have to finish the QA by tomorrow at 13:00. If they haven’t finished it, they’ll keep working on it for sure, but their SLA will be broken. If they do 100 tickets and finish half of them within 24 hours, their failure rate will be 50%. That’s the other metric that we have. So failure rates are what we use to quantify whether our process is of high quality. It’s about the time that you spend in a certain state. If you need to spend more time than the desired time, it means the process is still broken, so the quality of the process is not good. If the work that you advanced gets rejected, it means the quality of the work is not good. We also consider that a quality aspect.
- To give you an example, a particular case of quality bar failure is regressions. If we have a ticket that gets deployed to production and then returns to software engineering as having caused a regression, that is a quality bar failure for releasing. If a developer releases a ticket to QA internally and it fails QA and goes back into development, we’re not saying it’s ok, it’s normal. We don’t accept that; we want the developer to have a 100% success rate in delivering tickets without defects, so that QA clears them for release. If the success rate is not 100%, we do deep dives to find root causes. Maybe the project doesn’t have enough unit testing, maybe the build is broken, maybe you have the universal developer syndrome, which is “it works for me but not for you”. So these numbers tell us where to look when we’re improving the process. If you have a 0% failure rate, your work is as smooth as it gets. We think of this as the aerodynamics of our organization, as if it were a formula car. To make the formula car go faster, we shorten the SLAs. The question for next quarter is: what if the QA process could be done in 1 hour instead of 24 hours? Then the failure rate may be 50% at first, but we can get the pass rate back to 100%, and our car will be going faster compared to last quarter. If that changes the failure rate for regressions in production, we’ll learn that we’re overspeeding. We’ll know we’re not ready for that. We could relax the SLAs (which we usually do not do); instead we look at what’s missing. Do we need test automation? What do we need to do to stay at 1 hour instead of 24 hours?
- This is something we do that other companies are not doing. Other companies say “go faster and do not break things”. We say “go fast, and if you see something breaking, don’t slow down but fix it”. In Formula 1 you don’t tell the drivers not to crash the car; it’s obvious. You tell the drivers to go as fast as they can, and you make improvements to the engineering of the car: make the drag smaller, make the grip of the tires higher, make the cornering speed higher so they can take corners faster. It’s not about avoiding an accident; that’s kind of implicit. Your mandate is not simply to avoid mistakes; you have carte blanche to fix what is creating those mistakes. You’re requested to go after the root causes, and you’re not allowed to hide behind the “do not make mistakes” excuse for not going faster.
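As a rough sketch of how the two failure rates above could be computed: each handoff or state either passes or fails, and the rate is the share of failures. The 24-hour QA SLA and the 13:00 example come from the interview; the function names and the sample data are assumptions made up for illustration.

```python
from datetime import datetime, timedelta

QA_SLA = timedelta(hours=24)  # QA SLA quoted in the interview


def sla_met(entered, left, sla=QA_SLA):
    """True if the ticket left the state within the SLA window."""
    return left - entered <= sla


def failure_rate(outcomes):
    """Share of failed checks, where each outcome is True (pass) or False (fail).

    Works for both quality bar transitions and SLA checks.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(1 for passed in outcomes if not passed) / len(outcomes)


# A ticket received at 13:00 must be finished by 13:00 the next day.
received = datetime(2018, 5, 1, 13, 0)
print(sla_met(received, datetime(2018, 5, 2, 13, 0)))  # finished exactly on time
print(sla_met(received, datetime(2018, 5, 2, 13, 1)))  # one minute late

# 100 tickets, half of them finished within the SLA.
print(failure_rate([True] * 50 + [False] * 50))
```

Shortening the SLA, for example from 24 hours to 1 hour, only means passing a smaller `timedelta` as `sla`; the failure rate then shows how far the process is from supporting the new target.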
NPS (net promoter score)
- Obviously, we’re also looking at the NPS (net promoter score). In the end, we also go to our customers and ask them if our work is ok.
- Are you a promoter of our work or are you a detractor? We’re using NPS as a final check to see if we’re going in the right direction.
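For readers unfamiliar with the metric, here is a minimal sketch of the conventional NPS calculation. The interview does not say how Crossover collects or scores responses, so this assumes the standard convention: a 0-10 rating, with 9-10 counted as promoters and 0-6 as detractors. The sample ratings are made up.

```python
def net_promoter_score(ratings):
    """Percentage of promoters minus percentage of detractors.

    Assumes the conventional 0-10 scale: 9-10 are promoters,
    0-6 are detractors, 7-8 are passives and only count in the total.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Made-up survey: 5 promoters, 2 passives, 3 detractors.
sample_ratings = [10, 9, 8, 7, 6, 3, 10, 9, 2, 10]
print(net_promoter_score(sample_ratings))
```

The score ranges from -100 (all detractors) to 100 (all promoters), which makes it usable as a single final check alongside the internal failure rates.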
Can you give us some dollar figures from your previous business achievements?
Yes, I could also give percentages. Actually, let me tell you a bit more about compliance. Compliance is complementary to performance. It is correlated with performance, in terms of both productivity and quality, in the sense that it drives certain behaviors. When you drive those behaviors in the team, you can do two things. You can be very agile, because you can run a bunch of experiments with certain behaviors and see the results. And when you’re happy with the results, compliance lets you roll out those behavioral incentives, if you will, across a very large number of people instantaneously.
Let’s say you have a set of compliance criteria, and then you add 2 more because you have had a successful experiment. You can roll them out across everyone. If the rest of the organization, which hasn’t gone through the experiment, is not exhibiting those behaviors, their compliance will drop. They will have to adjust quickly, they will adopt those behaviors, and then your entire organization is doing those things. You don’t need to write long documents or long playbooks, and you don’t need to rely on people reading your documents. You don’t have to do checks on everyone; you just look at the score, and the score calculation is automated. So it’s about driving behavior. When you’re driving behavior and achieving performance results, you see it in the number of units delivered.
Our standard which is very rarely missed is that we improve 25% quarter over quarter, which is our mantra. So, if you’re running compliance on a team, you’re expected to deliver at least 25% cost per unit reduction which means your team is 25% more productive. And if you do it for a year, obviously you’re more than twice as cost-effective and/or productive.
To give you an example, our time-motion studies showed that for every defect we attempted to reproduce, we were spending 1 day on the reproduction attempt. By enforcing a quality bar on that defect, we were able to save 90% of that day. When you multiply that by 130 engineers and all the days in a year, you end up with massive dollar savings.
Here’s another example: we’re fixing 5000 defects per quarter. From a productivity perspective, we have improved our ability to fix a defect by going deep and looking at all the costs involved, and our average cost per defect fix went down 25% per quarter for 4 quarters in a row. We reduced it to $316 from $1000 per defect fix. This means that with this optimization alone, we save about $13.6M per year going forward.
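The arithmetic behind these figures can be checked with a short calculation. All inputs below are the numbers quoted in the interview ($1000 starting cost, 25% reduction per quarter, 4 quarters, 5000 defects per quarter); the variable and function names are just illustrative, and the savings line assumes the defect volume stays constant over the year.

```python
def compounded_cost(start_cost, reduction, quarters):
    """Cost per unit after applying the same percentage reduction each quarter."""
    return start_cost * (1 - reduction) ** quarters


start_cost = 1000          # dollars per defect fix before optimization
cost_after_year = compounded_cost(start_cost, reduction=0.25, quarters=4)
productivity_gain = start_cost / cost_after_year

defects_per_quarter = 5000
# Uses the rounded $316 figure quoted in the interview.
annual_savings = (start_cost - 316) * defects_per_quarter * 4

print(f"Cost per fix after 4 quarters: ${cost_after_year:.2f}")
print(f"Productivity multiple: {productivity_gain:.2f}x")
print(f"Annual savings: ${annual_savings:,}")
```

Four consecutive 25% reductions compound to roughly a 68% reduction, which is where the $316 figure comes from, and it corresponds to a team that is about three times as cost-effective. The savings work out to $13.68M per year, in line with the roughly $13.6M quoted.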
Let me tell you another optimization story from the infrastructure side of our world. People used to call what wasn’t virtualized “bare metal”, and now, it’s kind of ironic, but we call the plain cloud “bare cloud”. For many people, deploying your software in the cloud is the latest and greatest thing: our product is in the cloud, we benefit from the cloud’s elasticity, we’re going to AWS, etc. That’s supposedly as good as it gets. For us, that’s really not enough. Moving everything to the cloud was predictable and easy to handle, and we did it already. What we are now able to do is a lot more than that. We are dockerizing our products and deploying them on very large Docker hosts, where we have thousands of containers running within a single virtual machine. So it’s like a cloud in a cloud. We bought the largest Amazon virtual machine instances that money can buy. Rather than deploying our products on those as simple AMI images, because you can really only run one of those per EC2 instance, we take the very large instance and run a Docker host on it. Within the Docker host, we deploy 1000 containers, which may be handling 30-40 products. By doing that, we’re achieving huge economies of scale. We have thousands of containers, and instead of running them as separate AWS instances, we’re consolidating them into fewer than 20 very large Amazon instances rather than 3-4 thousand.
Imagine you have a cargo ship where you can place thousands of containers. You could do the same work with small boats each carrying a single container. The point here is operating and loading the cargo ship in full, so you get a better transportation price per kilo.
With this, we have a successful track record of cutting operating costs by more than 95%. So imagine we’re acquiring your SaaS company: if you were spending $1M a year on operating your product, we can get it down to $50K a year.
Do you have any suggestions for software engineering companies and startups of any size?
What we do works very well at scale; we have hundreds of people doing maintenance, feature development, performance improvement, etc. But the things that we’re doing are also most definitely applicable at a startup or SME level. I think the very first thing we strive for, and people usually forget about, is simplicity. People are over-engineering; they’re building complexity that they don’t need. There’s even a principle for this called YAGNI (“You Aren’t Gonna Need It”). Software gurus like Martin Fowler also advocate adding complex design only when necessary, not before, but we’re taking this very, very seriously. Keep simplicity in your software, in your deployment, in your team organization, in your metrics; keep simplicity everywhere. That is the most important thing that I’d advise people to do. Simplicity allows you to debug more easily, to make decisions more quickly, and to understand where you need to pay more attention.
Only add complexity if you get a 10 times improvement in performance. That’s lesson number one. Keeping things simple.
The second piece of advice, which is applicable to teams of any size, is measurement. Measuring everything we do is key to understanding and improving. We’re using Worksmart Pro, right? We’re measuring ourselves: how many hours we’re working, what our focus is, what our intensity is, where we spend time and on what kind of activities. I’ve seen some of my startup friends install tools similar to Worksmart Pro and realize they’re wasting about a third of their time looking at Facebook and other things. And then they realize they’re doing it because their processes are not automated, and they’re filling their time with random stuff instead of working on what really matters. So put a single metric on your team and measure your productivity, your quality, and your speed. Measurements and metrics are another important asset which is applicable to everyone.
The third thing is automation. You can get huge improvements if you automate everything you do. Some of the things are obvious, having a one-click deployment, automating your deployment, automating your dev environment creation, automating your tests, automating your monitoring of production environment, automating the recovery from errors or outages or failures.
Automation is very very important for gaining productivity because it saves you from manual work.
When we talk about focus, automation saves developers from sacrificing their focus by context switching to another activity. It’s about saving work and preserving focus. If you regularly spend half an hour clicking on a couple of buttons, you can write a script and automate that process once, which saves you hours over the whole year.
Do you have any advice for software engineers from any technology stack on how to improve themselves to be a global player?
I think exposure is something important. You need to be exposed to projects which are using cutting-edge technology. There’s a certain attitude where companies and managers are just settling in, they feel comfortable, they are sitting in a comfort zone, they don’t want to fix so many defects with the fear of breaking the product. They’re unwilling to innovate, unwilling to drive progress and to drive improvement.
We see many candidates who answer questions related to productivity improvement, technology refresh, and bringing simplicity with “it wasn’t needed” or “it wasn’t requested”. So, when you’re in an organization like this, which is not driving for continuous improvement, there’s a very high chance that you as an engineer will become less and less competitive.
Leave Your Comfort Zone
How to battle that? Join a startup, write a product from scratch, have the courage to rewrite an old, shitty module with a new technology. Even if you have to do it in your private time, just to prove to your management that it can be done, do it. Try not to settle for “not good enough”. Being in a comfort zone is a sure way to become non-competitive and not to be among the best in the world. Imagine how many days Elon Musk or his direct reports sat in their comfort zone in the past couple of years, and start today.
Hopefully, anyone reading this interview should have a better idea on who might be the best fit for our culture and organization. If you want to join our team, you can see all the available positions we have here.
— End of the Interview —
Thanks Mircea for sharing your knowledge. I’m pretty sure thousands of software engineers and managers around the world will find inspiration in your words.