A Brief Lesson in History
A brief look at the history of infrastructure and what has led us to where we are today.
The datacenter has evolved significantly over the last several decades. The following sections will examine each era in detail.
The mainframe ruled for many years and laid the core foundation of where we are today. It allowed companies to leverage the following key characteristics:
- Natively converged CPU, main memory, and storage
- Engineered internal redundancy
But the mainframe also introduced the following issues:
- The high costs of procuring infrastructure
- Inherent complexity
- A lack of flexibility and highly siloed environments
With mainframes, it was very difficult for organizations within a business to leverage these capabilities, which partly led to the entrance of pizza boxes, or stand-alone servers. Key characteristics of stand-alone servers included:
- CPU, main memory, and direct-attached storage (DAS)
- Higher flexibility than the mainframe
- Accessed over the network
These stand-alone servers introduced more issues:
- Increased number of silos
- Low or unequal resource utilization
- The server became a single point of failure (SPOF) for both compute AND storage
Businesses always need to make money, and data is a key piece of that puzzle. With direct-attached storage (DAS), organizations either needed more space than was locally available, or needed data high availability (HA) so that a server failure wouldn’t cause data unavailability.
Centralized storage replaced both the mainframe and the stand-alone server with sharable, larger pools of storage that also provided data protection. Key characteristics of centralized storage included:
- Pooled storage resources led to better storage utilization
- Centralized data protection via RAID eliminated the chance that server loss caused data loss
- Storage operations were performed over the network
Issues with centralized storage included:
- They were potentially more expensive; however, data is more valuable than the hardware
- Increased complexity (SAN Fabric, WWPNs, RAID groups, volumes, spindle counts, etc.)
- They required another management tool / team
At this point in time, compute utilization was low and resource efficiency was impacting the bottom line. Virtualization was then introduced and enabled multiple workloads and operating systems (OSs) to run as virtual machines (VMs) on a single piece of hardware. Virtualization enabled businesses to increase utilization of their pizza boxes, but also increased the number of silos and the impacts of an outage. Key characteristics of virtualization included:
- Abstracting the OS from hardware (VM)
- Very efficient compute utilization led to workload consolidation
Issues with virtualization included:
- An increase in the number of silos and management complexity
- A lack of VM high-availability, so if a compute node failed the impact was much larger
- A lack of pooled resources
- The need for another management tool / team
The hypervisor became a very efficient and feature-filled solution. With the advent of tools including VMware vMotion, HA, and DRS, users gained the ability to provide VM high availability and migrate compute workloads dynamically. The only caveat was the reliance on centralized storage, which caused the two paths to merge. The downside was an increased load on the storage array, as VM sprawl led to contention for storage I/O.
Key characteristics included:
- Clustering led to pooled compute resources
- The ability to dynamically migrate workloads between compute nodes (DRS / vMotion)
- The introduction of VM high availability (HA) in the case of a compute node failure
- A requirement for centralized storage
Issues included:
- Higher demand on storage due to VM sprawl
- Requirements to scale out more arrays, creating more silos and more complexity
- Higher $ / GB due to the requirement of an array
- The possibility of resource contention on the array
- Much more complex storage configuration, due to the necessity to ensure:
  - VM to datastore / LUN ratios
  - Spindle count to facilitate I/O requirements
SSDs helped alleviate this I/O bottleneck by providing much higher I/O performance without the need for tons of disk enclosures. However, given the extreme advances in performance, the controllers and network had not yet evolved to handle the vast I/O available. Key characteristics of SSDs included:
- Much higher I/O characteristics than traditional HDD
- Essentially eliminated seek times
SSD issues included:
- The bottleneck shifted from storage I/O on disk to the controller / network
- Silos still remained
- Array configuration complexity still remained
The term cloud can be very ambiguous by definition. Simply put, it’s the ability to consume and leverage a service hosted somewhere and provided by someone else.
With the introduction of cloud, the perspectives of IT, the business, and end-users have shifted.
Business groups and IT consumers require that IT provide the same capabilities as cloud, including its agility and time to value. If not, they will go directly to cloud, which causes another issue for IT: data security.
Core pillars of any cloud service:
- Self-service / On-demand
  - Rapid time to value (TTV) / little barrier to entry
- Service and SLA focus
  - Contractual guarantees around uptime / availability / performance
- Fractional consumption model
  - Pay for what you use (some services are free)
Most general classifications of cloud fall into three main buckets (starting at the highest level and moving downward):
- Software as a Service (SaaS)
  - Any software / service consumed via a simple URL
  - Examples: Workday, Salesforce.com, Google search, etc.
- Platform as a Service (PaaS)
  - Development and deployment platform
  - Examples: Amazon Elastic Beanstalk / Relational Database Service (RDS), Google App Engine, etc.
- Infrastructure as a Service (IaaS)
  - VMs / containers / NFV as a service
  - Examples: Amazon EC2/ECS, Microsoft Azure, Google Compute Engine (GCE), etc.
Cloud poses an interesting dilemma for IT. They can embrace it, or they can try to provide an alternative. They want to keep the data internal, but need to allow for the self-service, rapid nature of cloud.
This shift forces IT to act more as a legitimate service provider to their end-users (company employees).
The table below characterizes the various latencies for specific types of I/O:
|Item|Latency|Comments|
|---|---|---|
|L1 cache reference|0.5 ns||
|L2 cache reference|7 ns|14x L1 cache|
|DRAM access|100 ns|~14x L2 cache, 200x L1 cache|
|3D XPoint based NVMe SSD read|~10,000 ns (expected)|10 us or 0.01 ms|
|NAND NVMe SSD R/W|20,000 ns|20 us or 0.02 ms|
|NAND SATA SSD R/W|50,000-60,000 ns|50-60 us or 0.05-0.06 ms|
|Read 4K randomly from SSD|150,000 ns|150 us or 0.15 ms|
|P2P TCP/IP latency (phy to phy)|150,000 ns|150 us or 0.15 ms|
|P2P TCP/IP latency (vm to vm)|250,000 ns|250 us or 0.25 ms|
|Read 1MB sequentially from memory|250,000 ns|250 us or 0.25 ms|
|Round trip within datacenter|500,000 ns|500 us or 0.5 ms|
|Read 1MB sequentially from SSD|1,000,000 ns|1 ms, 4x memory|
|Disk seek|10,000,000 ns or 10,000 us|10 ms, 20x datacenter round trip|
|Read 1MB sequentially from disk|20,000,000 ns or 20,000 us|20 ms, 80x memory, 20x SSD|
|Send packet CA -> Netherlands -> CA|150,000,000 ns|150 ms|
(credit: Jeff Dean, https://gist.github.com/jboner/2841832)
The table above shows that the CPU can access its caches at anywhere from ~0.5-7ns (L1 vs. L2). For main memory, these accesses occur at ~100ns, whereas a local 4K SSD read is ~150,000ns or 0.15ms.
If we take a typical enterprise-class SSD (in this case the Intel S3700), this device is capable of the following:
- Random I/O performance:
  - Random 4K Reads: Up to 75,000 IOPS
  - Random 4K Writes: Up to 36,000 IOPS
- Sequential bandwidth:
  - Sustained Sequential Read: Up to 500MB/s
  - Sustained Sequential Write: Up to 460MB/s
- Latency:
  - Read: 50us
  - Write: 65us
For traditional storage, there are a few main types of media for I/O:
- Fibre Channel (FC)
  - 4-, 8-, 16- and 32-Gb
- Ethernet (including FCoE)
  - 1-, 10-Gb, (40-Gb IB), etc.
For the calculation below, we are using the 500MB/s Read and 460MB/s Write BW available from the Intel S3700.
The calculation is done as follows:
numSSD = ROUNDUP((numConnections * connBW (in GB/s))/ ssdBW (R or W))
NOTE: Numbers were rounded up as a partial SSD isn’t possible. This also does not account for the necessary CPU required to handle all of the I/O and assumes unlimited controller CPU power.
|Controller connectivity|Available network BW|SSDs required to saturate network BW (read I/O)|SSDs required to saturate network BW (write I/O)|
|---|---|---|---|
|Dual 4Gb FC|8Gb/s == 1GB/s|2|3|
|Dual 8Gb FC|16Gb/s == 2GB/s|4|5|
|Dual 16Gb FC|32Gb/s == 4GB/s|8|9|
|Dual 32Gb FC|64Gb/s == 8GB/s|16|18|
|Dual 1Gb ETH|2Gb/s == 0.25GB/s|1|1|
|Dual 10Gb ETH|20Gb/s == 2.5GB/s|5|6|
As the table shows, if you wanted to leverage the theoretical maximum performance an SSD could offer, the network can become a bottleneck with as few as 1 SSD and, even on the fastest links listed, with fewer than 20 SSDs, depending on the type of networking used.
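The calculation above can be sketched in a few lines of Python. This is a minimal sketch, not a sizing tool: the function name `ssds_to_saturate` is made up for illustration, decimal units are assumed (1 Gb/s = 125 MB/s), and protocol overhead and controller CPU limits are ignored, as in the note above.

```python
import math

# Sustained sequential bandwidth of the Intel S3700 quoted above, in MB/s.
SSD_READ_MBPS = 500
SSD_WRITE_MBPS = 460


def ssds_to_saturate(num_connections, conn_gbit, ssd_mbps):
    """Return how many SSDs it takes to fill the given network links.

    Assumption: decimal units, so 1 Gb/s == 1000 Mb/s == 125 MB/s.
    """
    network_mbps = num_connections * conn_gbit * 1000 / 8
    # Round up, since a partial SSD isn't possible.
    return math.ceil(network_mbps / ssd_mbps)


links = [("Dual 4Gb FC", 4), ("Dual 8Gb FC", 8), ("Dual 16Gb FC", 16),
         ("Dual 32Gb FC", 32), ("Dual 1Gb ETH", 1), ("Dual 10Gb ETH", 10)]

for name, gbit in links:
    reads = ssds_to_saturate(2, gbit, SSD_READ_MBPS)
    writes = ssds_to_saturate(2, gbit, SSD_WRITE_MBPS)
    print(f"{name}: {reads} SSDs (read), {writes} SSDs (write)")
```

Swapping in the bandwidth of a different device or link speed only requires changing the constants, which makes it easy to see how quickly faster media moves the bottleneck onto the network.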
Given that typical main memory latency is ~100ns (will vary), we can perform the following calculations:
- Local memory read latency = 100ns + [OS / hypervisor overhead]
- Network memory read latency = 100ns + NW RTT latency + [2 x OS / hypervisor overhead]
If we assume a typical network RTT of ~0.5ms (will vary by switch vendor), which is ~500,000ns, that comes to:
- Network memory read latency = 100ns + 500,000ns + [2 x OS / hypervisor overhead]
If we theoretically assume a very fast network with a 10,000ns RTT:
- Network memory read latency = 100ns + 10,000ns + [2 x OS / hypervisor overhead]
What that means is even with a theoretically fast network, there is a 10,000% overhead when compared to a non-network memory access. With a slow network this can be upwards of a 500,000% latency overhead.
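The percentages above follow from dividing the network round trip by the local access time. A minimal sketch of that arithmetic, with the OS / hypervisor overhead left out on both sides as in the text (the function name `network_overhead_pct` is an illustrative choice):

```python
DRAM_NS = 100  # typical main memory latency used above, in nanoseconds


def network_overhead_pct(rtt_ns, local_ns=DRAM_NS):
    """Latency added by the network, as a percentage of a local memory read."""
    return rtt_ns / local_ns * 100


fast = network_overhead_pct(10_000)      # theoretical fast network, 10,000ns RTT
typical = network_overhead_pct(500_000)  # typical network, ~0.5ms RTT

print(f"fast network:    {fast:,.0f}% overhead")
print(f"typical network: {typical:,.0f}% overhead")
```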
In order to alleviate this overhead, server side caching technologies are introduced.
One frequently debated topic is the argument between doing things in kernel vs. in user-space. Here I’ll explain what each is and their respective pros/cons.
Any operating system (OS) has two core areas of execution:
- Kernel space
  - The most privileged part of the OS
  - Handles scheduling, memory management, etc.
  - Contains the physical device drivers and handles hardware interaction
- User space
  - “Everything else”
  - This is where most applications and processes live
  - Protected memory and execution
These two spaces work in conjunction for the OS to operate. Now before moving on let’s define a few key items:
- System call
  - A.k.a. kernel call: a request, made via interrupt (more on this later) from an active process, that something be done by the kernel
- Context switch
  - Shifting the execution from the process to the kernel and vice-versa
For example, take the use-case of a simple app writing some data to disk. In this case, the following would take place:
- App wants to write data to disk
- Invokes a system call
- Context switch to kernel
- Kernel copies data
- Executes write to disk via driver
The following shows a sample of these interactions:
User and Kernel Space Interaction
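The write path described above can be observed from user space with Python's low-level `os` functions, which are thin wrappers around the corresponding system calls. This is a minimal sketch: the helper name `write_via_syscalls` and the file name are made up for illustration, and the comments describe the usual buffered-write behavior rather than any specific kernel.

```python
import os
import tempfile


def write_via_syscalls(path, data):
    """Write bytes to path using the os wrappers around system calls."""
    # open(2): the process asks the kernel for a file descriptor.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # write(2): each call switches into the kernel, which copies the
        # user-space buffer. Loop because a write may be partial.
        while written < len(data):
            written += os.write(fd, data[written:])
        # fsync(2): ask the kernel to flush its cached copy to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


# Hypothetical file name, used only for this illustration.
demo_path = os.path.join(tempfile.gettempdir(), "syscall_demo.bin")
payload = b"hello from user space\n"
print(write_via_syscalls(demo_path, payload) == len(payload))
```

On Linux, running a script like this under a tracer such as `strace` shows each of these calls crossing into the kernel.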
Is one better than the other? In reality there are pros and cons for each:
- User space
  - Very flexible
  - Isolated failure domains (process)
  - Can be inefficient
    - Context switches cost time (~1,000ns)
- Kernel space
  - Very rigid
  - Large failure domain
  - Can be efficient
    - Reduces context switches
Another core component is how the interaction between the two is handled. There are two key types of interaction:
- Polling
  - Constantly “poll”, e.g. consistently ask for something
  - Examples: Mouse, monitor refresh rate, etc.
  - Requires constant CPU, but much lower latency
  - Eliminates expense of kernel interrupt handler
    - Removes context switch
- Interrupt
  - “Excuse me, I need foo”
  - Example: Raising hand to ask for something
  - Can be more “CPU efficient”, but not necessarily
  - Typically much higher latency
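The trade-off between the two styles can be illustrated with a user-space analogy. The sketch below does not touch real hardware interrupts; it assumes a `threading.Event` standing in for "the device is ready", and all function names are illustrative. One waiter spins on the flag (polling), the other blocks until it is notified (interrupt-like).

```python
import threading


def wait_by_polling(flag):
    """Spin on the flag until it is set; returns how many checks were made."""
    checks = 0
    while not flag.is_set():
        checks += 1  # burns CPU the whole time, but notices the change quickly
    return checks


def wait_by_notification(flag):
    """Block until the flag is set; the thread sleeps instead of spinning."""
    flag.wait()  # no CPU used while waiting, but wake-up adds latency
    return True


def run(waiter, delay_s=0.05):
    """Set the flag from another thread after delay_s and run the waiter."""
    flag = threading.Event()
    timer = threading.Timer(delay_s, flag.set)
    timer.start()
    result = waiter(flag)
    timer.join()
    return result


print("polling checks made:", run(wait_by_polling))
print("notified:", run(wait_by_notification))
```

The polling waiter performs a large number of checks for a single event, which is the "requires constant CPU" cost noted above; the blocking waiter does no work until it is woken.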
As devices have become far faster (e.g. NVMe, Intel Optane, pMEM), the kernel and device interaction has become a bottleneck. To eliminate these bottlenecks, a lot of vendors are moving things out of the kernel to user space with polling and seeing much better results.
Great examples of this are the Intel Storage Performance Development Kit (SPDK) and Data Plane Development Kit (DPDK). These projects are geared toward maximizing performance and reducing latency as much as possible, and have shown great success.
This shift is composed of two core changes:
- Moving device drivers to user space (instead of kernel)
- Using polling (instead of interrupts)
This enables far superior performance when compared to the kernel based predecessors, as it eliminates:
- Expensive system calls and the interrupt handler
- Data copies
- Context switches
The following shows the device interaction using user space drivers:
User Space and Polling Interaction
In fact, a piece of software Nutanix developed for its AHV product (vhost-user-scsi) is actually being used by Intel for the SPDK project.
web·scale - /web ‘ skãl/ - noun - computing architecture
a new architectural approach to infrastructure and computing.
This section will present some of the core concepts behind “Web-scale” infrastructure and why we leverage them. Before I get started, I want to state clearly that Web-scale doesn’t mean you need to be “web-scale” (e.g. Google, Facebook, or Microsoft). These constructs are applicable and beneficial at any scale (three nodes or thousands of nodes).
Historical challenges included:
- Complexity, complexity, complexity
- Desire for incremental based growth
- The need to be agile
There are a few key constructs used when talking about “Web-scale” infrastructure:
- Software defined intelligence
- Distributed autonomous systems
- Incremental and linear scale out
Other related items:
- API-based automation and rich analytics
- Security as a core tenet
The following sections will provide a technical perspective on what they actually mean.
There are differing opinions on what hyper-convergence actually is. It also varies based on the scope of components (e.g. virtualization, networking, etc.). However, the core concept comes down to the following: natively combining two or more components into a single unit. ‘Natively’ is the key word here. In order to be the most effective, the components must be natively integrated and not just bundled together. In the case of Nutanix, we natively converge compute + storage to form a single node used in our appliance. For others, this might be converging storage with the network, etc.
What it really means:
- Natively integrating two or more components into a single unit which can be easily scaled
- Single unit to scale
- Localized I/O
- Eliminates traditional compute / storage silos by converging them
Software-defined intelligence is taking the core logic from normally proprietary or specialized hardware (e.g. ASIC / FPGA) and doing it in software on commodity hardware. For Nutanix, we take the traditional storage logic (e.g. RAID, deduplication, compression, etc.) and put that into software that runs in each of the Nutanix Controller VMs (CVM) on standard hardware.
Nutanix currently supports both x86 and IBM POWER architectures.
What it really means:
- Pulling key logic from hardware and doing it in software on commodity hardware
- Rapid release cycles
- Elimination of proprietary hardware reliance
- Utilization of commodity hardware for better economics
- Lifespan investment protection
To elaborate on the last point: old hardware can run the latest and greatest software. This means that a piece of hardware years into its depreciation cycle can run the latest shipping software and be at feature parity with new deployments shipping from the factory.
Distributed autonomous systems involve moving away from the traditional concept of having a single unit responsible for doing something and distributing that role among all nodes within the cluster. You can think of this as creating a purely distributed system. Traditionally, vendors have assumed that hardware will be reliable, which, in most cases, can be true. However, core to distributed systems is the idea that hardware will eventually fail, and handling that fault in an elegant and non-disruptive way is key.
These distributed systems are designed to accommodate and remediate failure, to form something that is self-healing and autonomous. In the event of a component failure, the system will transparently handle and remediate the failure, continuing to operate as expected. Alerting will make the user aware, but rather than being a critical time-sensitive item, any remediation (e.g. replace a failed node) can be done on the admin’s schedule. Another way to put it is fail in-place (rebuild without replace). For items where a “leader” is needed, an election process is utilized. In the event this leader fails, a new leader is elected. To distribute the processing of tasks, MapReduce concepts are leveraged.
What it really means:
- Distributing roles and responsibilities to all nodes within the system
- Utilizing concepts like MapReduce to perform distributed processing of tasks
- Using an election process in the case where a “leader” is needed
- Eliminates any single points of failure (SPOF)
- Distributes workload to eliminate any bottlenecks
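The idea of electing a new leader when the current one fails can be sketched as follows. This is a toy illustration only: the "lowest live node ID wins" rule and the node names are assumptions chosen for brevity, not the election protocol of any particular product. Real systems typically rely on a consensus or coordination service to agree on which nodes are alive.

```python
def elect_leader(live_nodes):
    """Pick a leader deterministically: the lowest node ID still alive.

    Assumption: every node sees the same set of live nodes, so every
    node independently arrives at the same answer.
    """
    if not live_nodes:
        raise ValueError("no live nodes to elect from")
    return min(live_nodes)


# Hypothetical three-node cluster.
nodes = {"node-1", "node-2", "node-3"}

leader = elect_leader(nodes)
print("initial leader:", leader)

# Simulate the leader failing; the remaining nodes elect a replacement
# without any administrator involvement.
nodes.discard(leader)
new_leader = elect_leader(nodes)
print("leader after failure:", new_leader)
```

Because the role is re-assigned automatically, the failed node can be replaced on the admin's schedule rather than as an emergency.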
Incremental and linear scale out relates to the ability to start with a certain set of resources and, as needed, scale them out while linearly increasing the performance of the system. All of the constructs mentioned above are critical enablers in making this a reality. For example, traditionally you’d have three layers of components for running virtual workloads: servers, storage, and network, all of which are scaled independently. When you scale out the number of servers, for instance, you’re not scaling out your storage performance. With a hyper-converged platform like Nutanix, when you scale out with new node(s) you’re scaling out:
- The number of hypervisor / compute nodes
- The number of storage controllers
- The compute and storage performance / capacity
- The number of nodes participating in cluster wide operations
What it really means:
- The ability to incrementally scale storage / compute with linear increases to performance / ability
- The ability to start small and scale
- Uniform and consistent performance at any scale
In summary:
- Inefficient compute utilization led to the move to virtualization
- Features including vMotion, HA, and DRS led to the requirement of centralized storage
- VM sprawl led to increased load and contention on storage
- SSDs came in to alleviate the issues but changed the bottleneck to the network / controllers
- Cache / memory accesses over the network face large overheads, minimizing their benefits
- Array configuration complexity still remains the same
- Server-side caches were introduced to alleviate the load on the array / impact of the network; however, they introduce another component to the solution
- Locality helps alleviate the bottlenecks / overheads traditionally faced when going over the network
- Shifts the focus from infrastructure to ease of management and simplifying the stack
- The birth of the Web-Scale world!