Scaling Techniques to Increase Magento Capacity

At Meet Magento NY 2015, Kevin Bortnick, Senior Magento Solutions Architect at Clustrix, hosted a workshop titled “Scaling Techniques to Increase Magento Capacity.” There he spoke about scaling strategies used to overcome performance bottlenecks associated with the MySQL database used by most Magento implementations. Kevin highlighted the shortcomings of ‘read slaves’, ‘multiple masters’ and ‘sharding’ and shared his real-world experiences. Check out Kevin’s presentation below on how a scale-out database opens new possibilities for scaling to meet these demands, either in the datacenter or in the cloud.

View the Slides

Kevin’s presentation slides can be viewed on SlideShare. Please enjoy.


View on SlideShare

Want the Audio?

Want to listen to the audio presentation? Visit SoundCloud to download the Podcast.


Download on SoundCloud

Read the Transcript

Magento NYC: Scaling Techniques to Increase Magento Capacity

(0:00:22.1) Good afternoon. I am Kevin Bortnick and this is the presentation on scaling techniques to increase Magento capacity, or how to get ready for the holiday and cope with fast growth. My name is Kevin Bortnick. I am a solution architect with Clustrix. I’ve been working with Magento for over six years now, since back on 1.3. I’ve done everything from small deployments to very, very large deployments for Fortune 500s, and extremely customized deployments. In that time, I’ve become a Magento architect and there’s a lot of key things that I’ve started to look at when doing either development or architecting server infrastructures.

(0:01:03.8) The first and most critical to me is how easy is it to maintain, and the reason I always look at this is because if it’s hard to maintain, it’s going to come back on me. I’m going to have to do a lot of work to keep it going, and I’m going to run out of time to do new development and new projects. Similarly, how can someone else support it? Is someone else able to come in, look at what I’ve done, replicate it and understand it? Because if they can’t, they’re going to have to come back and ask questions, and if they can’t ask me a question, they can’t fix it, and it’s no good. The third is how can it handle growth, and how well does it handle growth? This is really important, especially in the e-commerce field, because hopefully our customers are growing bigger, and they’ll also have seasons of great, large growth where if you’re not testing and thinking of how well this is going to scale, you’re going to run into significant bottlenecks. No one wants to be up on Thanksgiving night trying to fix a problem. The fourth is can it be used for something else? Since a lot of us are working with software for multiple companies or doing different projects, I always try and figure out how can we make this a little bit more generic, how can we make it reusable? Because a few hours extra in the planning phase can sometimes save you weeks or months down the road building something twice. The final is when is it going to break? Almost anything you build is going to break under some sort of edge case or use condition, and you want to spot these as early as possible. Because if you don’t, you’re going to have some late nights.

(0:02:35.3) Symptoms of problems in terms of Magento capacity and load: the first thing that you’re going to start noticing is increases in page load time. Hopefully, you’re running some sort of user tracking such as New Relic to tell you how fast your pages are loading, errors, and things like that. If you’re noticing consistent slowdown either during certain hours of the day, or even just week to week, you probably have a problem coming. The next is spikes and errors. If you notice sudden bursts of extreme page load time when there are user spikes, or even errors appearing in the error logs under high load, that’s a good indicator. The third is a pretty obvious indicator: your site crashes. If the site crashes and goes down, there’s a good chance there’s a bottleneck somewhere that’s blocking your way. The fourth, and less common one, is support tickets. It’s always good to have some way for users to report issues on your site, just because there’s a lot of unusual interactions that can happen. One of the big ones is things like read latency bottlenecks, where users will add something to the cart and then won’t be able to read it back out. These kinds of issues are usually very hard to detect in development unless you’re doing load testing scenarios, so your support tickets can be a critical way to identify issues that would otherwise go unnoticed.

(0:04:00.5) So, how do we actually make our sites fast? The first thing is you look in the obvious places, and I’m sure most of us have already done most of these. The first is scaling Magento web nodes. You can stick Magento on multiple web nodes and use that for higher throughput. Second is adding Memcached or Redis for session variables. This allows you to access these quickly without adding any additional load to either the file system or the database. Third is adding Varnish or some other full page cache solution. This allows you to dramatically reduce the processing load on your servers. Once all of these are tuned, though, your database is going to become a bottleneck, and the database becoming a bottleneck is an issue that people have had a lot of trouble solving. For me, I’ve gone through a number of potential solutions. There are good ones, there are bad ones, they all have their pros and cons, and there are a few key things I start to look at. The first is: is it MySQL compatible? If you’re not MySQL compatible you’re going to have a really bad time. You’re going to be doing a lot of initial development, a lot of late development, and you’re going to lose a lot of the features of Magento. That’s a critical piece, to me. The second is improved performance. Does this actually improve your performance, or are you just shifting things around so it performs better here but worse elsewhere? The third is scaling reads. Magento is very, very read heavy, we all know that, so any solution that we look at should be able to scale reads, preferably indefinitely. The fourth is scaling writes. Scaling writes is really important because this is the lifeblood of an e-commerce system. If you can’t do writes, you can’t do checkouts, you can’t do add to cart. Not having that is brutal (laughs). The fifth is no application changes. To the best of our ability, we want to not have to modify the original application to get these scaling solutions to work. Any time we have to dig into that, it makes the solution more complicated and increases the chance of failure. Speaking of failure, the final is no single point of failure. The last thing you want in an e-commerce system is for one server to go out and your whole system to go down.
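
For reference, the Redis session setup mentioned here usually comes down to a few lines in app/etc/local.xml. The sketch below is not from the talk; it assumes the Cm_RedisSession module bundled with recent Magento 1 releases, and the host, port, and tuning values are placeholders to adapt to your environment.

```xml
<!-- app/etc/local.xml (sketch): store sessions in Redis instead of files or MySQL -->
<config>
    <global>
        <!-- Cm_RedisSession hooks in through the "db" session handler -->
        <session_save>db</session_save>
        <redis_session>
            <host>127.0.0.1</host>                            <!-- placeholder Redis host -->
            <port>6379</port>
            <db>0</db>
            <timeout>2.5</timeout>
            <compression_threshold>2048</compression_threshold>
            <compression_lib>gzip</compression_lib>
        </redis_session>
    </global>
</config>
```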

(0:06:17.5) The first thing I do is I install Percona. Percona is a drop-in replacement for MySQL. It’s super easy to set up, it’s very quick. The key benefit of it is its high consistency of performance under higher loads, so if your boxes are running at 75, 80 percent load, you’re going to see more consistent performance out of it. On the other side, it is still using the MySQL base, so if your boxes are hitting 100 percent, capping out, and dying, this might not be the best thing for you. Additionally, if you’re only running 10 to 20 percent on your boxes, you’re not going to see too big of a difference. Finally, since this is just a MySQL fork, there’s no high availability or disaster recovery built in. You’re still going to have to have a DBA go in and architect additional features around that.

(0:07:05.0) Once I have Percona installed, you start looking at what are we actually doing here? The first solution, and the easiest, is faster hardware. When you get a bigger box, your stuff goes faster and you can support more. It’s pretty straightforward, it’s very easy to do. You do a deployment, you get a bigger box, do a deployment, get a bigger box, do a deployment, and then you start getting to a limit. The cost of doing this becomes higher than you want it to be or than is reasonable, or you straight up run out of space. We’ve worked with clients that have 64 cores and 192 gigs of RAM. There’s not too much further you can go once you’re up that high. Similarly, when you’re running boxes that big to support your software, it makes backups and failovers much more expensive. In order to have those running, you need boxes of the same size that you aren’t using, which can double or triple your cost, and when it comes to exotic hardware that can be very expensive. Similarly, you end up with excess hardware during off-peak seasons. A lot of companies just eat this cost, or they have to spend DBA time to downscale. During the holiday season, you’re dealing with anywhere between three to 10 times your normal load. If you have the hardware for that running in the off-peak season, that can get pretty expensive.

(0:08:27.8) Once you’ve figured out what your relatively good box size is, the next option is to start adding read slaves. Read slaves are a great way for you to scale up your read throughput without too much of a detriment to your software. It requires no additional changes and it’s very easy to do. You can just go to the Magento core config, add a new read connection, and you’re good to go. If you want to have multiple read slaves you would usually stick these behind a load balancer, and then you can have as many as you want and it can scale pretty much indefinitely. There are two downsides to this one. One of them is that it does not help solve write bottlenecks. If you are getting so many checkouts that your master can’t handle it, this will offload some of the pressure on the master, but it won’t clear the bottleneck completely. Once you cap out on what your master can do, you’re stuck again. The second is eventual consistency. Because you’re now using a replication solution, there is going to be a small delay. This can cause things like race conditions where a user will add something to the cart on the master and then attempt to read the cart from the slave, and they won’t see the product appear. This is a very rare use case, and it usually occurs on some of the very largest of sites. The only way that I’ve seen around this is tracking and reading back from the master.
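
The “new read connection” described here is a small override in app/etc/local.xml. Below is a minimal sketch, assuming the slaves sit behind a load balancer; the hostname and credentials are placeholders, and writes continue to flow through the default (master) connection.

```xml
<!-- app/etc/local.xml (sketch): send Magento's reads to a slave, or to a load balancer in front of several -->
<config>
    <global>
        <resources>
            <default_read>
                <connection>
                    <use/>  <!-- clear the inherited <use>default_setup</use> so this host takes effect -->
                    <host><![CDATA[read-slave-lb.internal]]></host>  <!-- placeholder host -->
                    <username><![CDATA[magento_ro]]></username>
                    <password><![CDATA[secret]]></password>
                    <dbname><![CDATA[magento]]></dbname>
                    <type>pdo_mysql</type>
                    <model>mysql4</model>
                    <initStatements>SET NAMES utf8</initStatements>
                    <active>1</active>
                </connection>
            </default_read>
        </resources>
    </global>
</config>
```

Writes still go through the default_setup/default_write connection pointed at the master, which is why this helps read throughput but not write bottlenecks.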

(0:09:54.5) Another solution is a double-master setup. This is a very specific use case. This is for sites where the company is doing a lot of updates very frequently in the admin, or they’re running a 24-hour store. When you’re doing these updates, you’re sending writes to the current master, and when you’re running re-indexers that puts a lot of load and pressure on the database. By splitting out your admin onto a separate database server, the company can run these updates and run indexers on a completely separate server, and so long as you’re doing row-level replication, it will leave the performance on the main store intact. This is absolutely critical for 24-hour stores.
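
One common way to wire up this split, sketched below rather than taken from the talk, is to run the admin on its own web node whose local.xml points at a second master, while the frontend nodes keep pointing at the primary; master-master replication keeps the two databases in sync. Hostnames and credentials are placeholders.

```xml
<!-- app/etc/local.xml on the admin-only web node (sketch) -->
<config>
    <global>
        <resources>
            <default_setup>
                <connection>
                    <host><![CDATA[db-admin-master.internal]]></host>  <!-- placeholder: second master used only by the admin -->
                    <username><![CDATA[magento]]></username>
                    <password><![CDATA[secret]]></password>
                    <dbname><![CDATA[magento]]></dbname>
                    <type>pdo_mysql</type>
                    <model>mysql4</model>
                    <initStatements>SET NAMES utf8</initStatements>
                    <active>1</active>
                </connection>
            </default_setup>
        </resources>
    </global>
</config>
```

The frontend nodes carry the same file pointed at the primary master, so heavy admin saves and reindexing land on the second box.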

(0:10:38.9) The downside, and I’ll explain this a bit more as I get into multimaster, is accommodating for edge cases. There are some things you can’t split. If there’s a customer service department that places orders and things like that in the backend, you’re going to have to create a backend that’s attached to the main store, because you don’t want to be splitting those unless you do a lot of additional software development. Additionally, you’re now doing an active-active replication scheme, which dramatically increases the chance of locking and failure, so you’re going to have to have a DBA to monitor those issues and be able to resolve them quickly. Finally, it won’t fix frontend-only bottlenecks. If something like your checkout is your absolute limiter, and it’s not something like indexers or doing a lot of updates, this isn’t going to be too helpful for you.

(0:11:24.5) The final is true multimaster, and this is sticking multiple masters on the front end. This is quite the task, to say the least. You can get overall better performance, but only in very specific use cases. This is absolutely fantastic for companies that have large product catalogs with large order loads, especially if the orders or the products being sold are well distributed. The downside is that you have to significantly modify your code and architecture. The first thing that you want to do, depending on the number of servers you have, is offset your auto-increment keys so that you never have collisions, as well as go in and do things like offset your purchase order numbers. So, server one might be taking odd-numbered orders, and server two might be taking even-numbered orders. That way you never have a scenario where two orders come in and get the same order number at the same time. This is the same issue that you would have if your customer service department was placing orders in the backend in the previous scenario.

(0:12:27.4) The second issue with these is inventory. The inventory update inside of the order transaction is going to want to go to the same place. You can’t update the inventory separately on both of these for the same product or you’re going to run into a collision. So you’re going to have to either do some software-level tracking, funnel it all into a single server, split your inventory, or move your inventory over to a Redis database. There are a lot of solutions, and they’re all a little bit challenging to do.

(0:13:00.4) The final problem is latency causing sync issues. Again, going back to this idea of a user adding something to their cart and then trying to read it again: if they add something to master one, it’s now going to take two replication hops to get over to slave 2A. Since active-active replication is going to be slower because of load, this dramatically increases the chance of that race condition failing, and so you’re going to have to go in and modify the application to start tracking what server it’s working on. So if they write to master one, they’re now going to have to start reading from slave 1A and slave 2B. Obviously, this is a fairly complex endeavor and it also reduces the performance you can actually get out of multimaster. Of course, all of this is expensive to develop and maintain, and as your complexity increases your scalability is going to drop.

(0:13:58.8) The final piece is partitioning. This is breaking pieces of the database out into other sections. The primary example is if you’re a company that needs to run Mage_Log, which tracks users on the site by doing a write to the database every time a user requests a page. You’re not going to want that hitting your primary server. If you need to run that module and can’t turn it off, what you can do is go into the configuration of these modules, on a per-module basis, and point them at new databases. By partitioning this stuff out, you push that load off onto another server. This is a technique that Magento 2 is implementing in its new deployments to get a lot of performance improvement. The downside is that you can’t join between these partitioned-off segments. If you move a table over to a different database, you can no longer run queries joining them together and get those full results. This is actually really problematic in Magento 1 because a lot of the modules are extremely intertwined. With Magento 2, they’ve gone through with a bit of a scalpel and distributed it into, I believe, three or four distinct sections, allowing you to actually partition them out. In Magento 1 that’s going to be a lot of additional development time, though. Because this isn’t default in Magento 1 with this architecture, you’re also going to have to have a lot of documentation. Since you’re writing these things in the configs of the modules, it’s going to be a little spread out, and it’s going to be a little bit hard to track. Anyone who’s coming onto the system later is going to have a challenging time figuring out why stuff is saving in different places if you don’t have proper documentation.
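
The per-module pointing described here is done through the resource configuration. The sketch below is for Mage_Log and assumes the stock Magento 1 behavior where the module’s resource models look up log_read and log_write connections (falling back to the core ones if they aren’t defined); the connection name, host, and credentials are placeholders.

```xml
<!-- app/etc/local.xml (sketch): push Mage_Log reads and writes to a separate database -->
<config>
    <global>
        <resources>
            <!-- Extra connection for the log tables (placeholder host/credentials) -->
            <log_database>
                <connection>
                    <host><![CDATA[db-log.internal]]></host>
                    <username><![CDATA[magento_log]]></username>
                    <password><![CDATA[secret]]></password>
                    <dbname><![CDATA[magento_log]]></dbname>
                    <type>pdo_mysql</type>
                    <model>mysql4</model>
                    <initStatements>SET NAMES utf8</initStatements>
                    <active>1</active>
                </connection>
            </log_database>
            <!-- Point the module's read and write resources at the new connection -->
            <log_write>
                <connection><use>log_database</use></connection>
            </log_write>
            <log_read>
                <connection><use>log_database</use></connection>
            </log_read>
        </resources>
    </global>
</config>
```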

(0:15:36.2) Once you’re done scaling all that up and you’re hitting your limits, we start looking at some exotic solutions. The number one thing that people have been going to is NoSQL, and they’ve been trying this. There have been a lot of attempts to do NoSQL and a lot of failures doing NoSQL. One of the primary things about NoSQL is that it runs extremely, extremely fast, but the reason it does that is because it drops some of the ACID compliance features. One of the biggest performance improvements they’ve found is allowing you to finish your commit before your data is written to disk, while it’s still only in memory. The problem is, if your server crashes, you’ve got a transaction that was claimed to be completed and then it disappears. Of course, if you start turning off these features that shortcut ACID, the performance starts to drop back down. So a lot of the performance improvements that you actually end up getting from NoSQL come from completely rewriting the Magento ORM layer in order to use it, and in doing so, you lose a lot of Magento’s features that come with the relational database. Additionally, there are no cross-document transactions, which means that if you want to update your inventory or product while adding an order and your server crashes in the middle, there’s no automated rollback functionality for this kind of stuff. This is very, very risky in a large e-commerce platform. In most cases, companies that (0:17:03.4 unclear) NoSQL end up doing it for the product catalog only, and they move that out so the actual transactions and financial stuff remain on a relational, transactional database. Again, this takes huge amounts of resources to develop and prove, and most of your performance is going to come from cutting corners. As a general rule for me, as an architect, I don’t want to drop ACID.

(0:17:37.0) The final piece, and the reason I’m really here today, is ClustrixDB. ClustrixDB is not based off the MySQL architecture; it’s not based off InnoDB or MyISAM. It’s its own platform. Part of this is that it’s designed to scale reads and writes. It’s a distributed data architecture that tracks where the data is located and pushes the queries to the nodes. The way that we handle the data allows us to parallelize lots of queries, and it allows us to distribute the load in a way that’s not possible in a standard replication architecture. Additionally, we have data redundancy built in. This means that if you lose a node, there’s a copy of that data somewhere else, so your site doesn’t go down immediately. Additionally, the software will detect that a node has dropped and it will immediately begin rebalancing that data, sharing it out so that you again have multiple replicas of the data. Additionally, because of the way that it’s architected, you can have much smaller servers doing your data backup for disaster recovery, allowing you to cut costs there. Finally, we have a software solution called Flex. This allows you to scale your database up and down on the fly, just like you would for your web nodes. The next thing is that we have a great administration UI. This is fairly new for us, and it’s really fun to use. It gives you a lot of data, from transaction speeds and things like that to real-time load on your server, making it very, very easy to debug issues. If you need to figure out what query is running slow and why it’s running slow, we offer really in-depth analytics on that. I’ll play a little video real quick; this is a video off of our website.

(0:19:41.6-0:19:59.3 Video plays.)

(0:19:59.7) This is the beginning of a clip from our website on how to actually do the Flex-up. She runs a single command, and this automatically deploys the version of Clustrix running on this node to another node. It sets up the software, sets up all the configuration, and then your hardware is ready to go. From there you run a single MySQL query, ALTER CLUSTER ADD with the new IP address, and that new server is now added to your cluster. It takes minutes to do, and you suddenly have dramatically increased capacity that scales linearly with the number of servers you’ve added. This takes things that would normally take days or months of planning and turns them into minutes of work. The Flex commands are also in our new UI that’s coming soon, so you won’t even need to log in to the command line to do it. You’ll be able to go directly into the admin interface and just add servers on the fly. Going back to the beginning, of all these solutions, the reason I joined Clustrix is because it’s the only one that I’ve found that can really nail all the points. There’s a lot of good stuff out there you can do. There are a lot of ways that you can improve your scaling capabilities with different architectures, but I came to Clustrix for a reason. At Clustrix my number one job is basically working directly with Magento and only Magento, and figuring out how we can rewrite things like the ORM layer to use it more efficiently so you guys can scale to infinity. And that’s about it. Any questions? Nope? I would be happy to talk about these more in depth at my booth, and I’ll be here throughout the week.

(Applause.)