HPE Tackles AI Ops R&D to Improve Energy Efficiency, Sustainability and Resiliency in Data Centers

11.18.2019

Hewlett Packard Enterprise (HPE) announced today an AI Ops R&D collaboration with the U.S. Department of Energy’s National Renewable Energy Laboratory (NREL) to develop Artificial Intelligence (AI) and Machine Learning (ML) technologies to automate and improve operational efficiency, including resiliency and energy usage, in data centers for the exascale era. The effort is part of NREL’s ongoing mission as a world leader in advancing energy efficiency and renewable energy technologies to create and implement new approaches that reduce energy consumption and lower operating costs.

The project is part of a three-year collaboration that introduces monitoring and predictive analytics to power and cooling systems in NREL’s Energy Systems Integration Facility (ESIF) HPC Data Center.

HPE and NREL are using more than five years’ worth of historical data, totaling more than 16 terabytes¹, collected from sensors in NREL’s supercomputers, Peregrine and Eagle, and its facility, to train anomaly detection models that predict and help prevent issues before they occur.
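As an illustration of what anomaly detection on sensor telemetry can look like, the following is a minimal sketch using a rolling z-score in plain Python. It is not the model HPE and NREL trained; the function name, window size, threshold and temperature values are all assumptions made up for this example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that deviate strongly from the preceding window.

    Hypothetical helper: a simple statistical stand-in for a trained model.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(readings):
        # Only score a reading once a full window of earlier readings exists.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Made-up inlet temperatures in degrees Celsius with one injected spike.
inlet_temps_c = [21.0, 21.2, 20.9, 21.1, 21.0, 21.3, 21.1, 29.5, 21.2, 21.0]
anomalies = rolling_zscore_anomalies(inlet_temps_c)
print(anomalies)
```

In practice a production system would likely combine many sensors and a learned model; the sketch only shows the basic idea of flagging readings that break from recent history.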

The collaboration will also address future water and energy consumption in data centers, which in the U.S. alone is projected to reach approximately 73 billion kWh and 174 billion gallons of water by 2020.² HPE and NREL will focus on monitoring energy usage to optimize energy efficiency and sustainability as measured by key metrics such as Power Usage Effectiveness (PUE), Water Usage Effectiveness (WUE), and Carbon Usage Effectiveness (CUE).
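For readers unfamiliar with these metrics, the sketch below computes them using their commonly cited definitions: PUE is total facility energy divided by IT equipment energy, WUE is site water use in liters per kWh of IT energy, and CUE is carbon emissions in kg CO2-equivalent per kWh of IT energy. The function name and the input figures are assumptions chosen for illustration and are not NREL measurements.

```python
def efficiency_metrics(total_facility_kwh, it_equipment_kwh, water_liters, co2_kg):
    """Compute PUE, WUE and CUE for one reporting period (hypothetical helper)."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return {
        # Dimensionless; 1.0 would mean every kWh goes to IT equipment.
        "PUE": total_facility_kwh / it_equipment_kwh,
        # Liters of water per kWh of IT energy.
        "WUE": water_liters / it_equipment_kwh,
        # Kilograms of CO2-equivalent per kWh of IT energy.
        "CUE": co2_kg / it_equipment_kwh,
    }


# Made-up annual figures for a small facility.
metrics = efficiency_metrics(
    total_facility_kwh=1_100_000,
    it_equipment_kwh=1_000_000,
    water_liters=500_000,
    co2_kg=400_000,
)
print(metrics)
```

Lower values are better for all three metrics, which is why they are natural targets for the kind of monitoring the collaboration describes.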

Early results based on models trained with historical data have successfully predicted or identified events that previously occurred in NREL’s data center, demonstrating the promise of using predictive analytics in future data centers.

The AI Ops project sprang from HPE’s R&D efforts under PathForward, a program backed by the U.S. Department of Energy to accelerate the nation’s technology roadmap for exascale computing, which represents the next major leap in supercomputing. HPE realized a critical need to develop AI and automation capabilities to manage and optimize data center environments for the exascale era. Applying AI-driven operations to an exascale supercomputer, which will run at a speed representing a thousandfold increase over today’s systems, will enable energy-efficient operations and increase resiliency and reliability through smart and automated capabilities.

“We are passionate about architecting new technologies that are impactful to powering the next era of innovation with exascale computing and its extent of operational needs,” said Mike Vildibill, vice president of Advanced Technologies Group, HPE. “We believe our journey to develop and test AI Ops with NREL, one of our longstanding and innovative partners, will allow the industry to build and maintain smarter and more efficient supercomputing data centers as they continue to scale power and performance.”

“Our research collaboration will span the areas of data management, data analytics, and AI/ML optimization for both manual and autonomous intervention in data center operations,” said Kristin Munch, manager for Data, Analysis and Visualization Group, National Renewable Energy Laboratory (NREL). “We’re excited to join HPE in this multi-year, multi-staged effort, and we hope to eventually build capabilities for an advanced smart facility after demonstrating these techniques in our existing data center.”

The project will use open source software and libraries such as TensorFlow, NumPy and scikit-learn to develop machine learning algorithms. The project will focus on the following key areas:

Monitoring: Collect, process and analyze vast volumes of IT and facility telemetry from disparate sources before applying algorithms to data in real time
Analytics: Big data analytics and machine learning will be used to analyze data from various tools and devices spanning the data center facility
Control: Algorithms will be applied to enable machines to solve issues autonomously as well as intelligently automate repetitive tasks and perform predictive maintenance on both the IT and the data center facility
Data center operations: AI Ops will evolve to become a validation tool for continuous integration (CI) and continuous deployment (CD) for core IT functions that span the modern data center facility
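
To make the predictive maintenance idea in the Control area more concrete, here is a minimal sketch that fits a straight line to recent sensor samples and estimates how many sampling intervals remain before a limit is crossed. This is a simple least-squares trend, named plainly as such, and not the algorithm used in the project; the function name, the pump temperature values and the 35-degree limit are hypothetical.

```python
def steps_until_threshold(samples, threshold):
    """Estimate sampling intervals until a rising linear trend reaches threshold.

    Hypothetical helper. Returns None when the fitted trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    # Ordinary least-squares fit of y = intercept + slope * x, with x = 0..n-1.
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    # Measure from the most recent sample; 0.0 means the limit is already reached.
    return max(0.0, crossing_x - (n - 1))


# Made-up coolant pump temperatures (degrees Celsius), one sample per hour.
pump_temps_c = [30.0, 30.5, 31.0, 31.5, 32.0]
hours_left = steps_until_threshold(pump_temps_c, threshold=35.0)
print(hours_left)
```

An operations tool could use an estimate like this to schedule maintenance before the limit is reached, or to trigger an automated response when the remaining time falls below a chosen margin.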

HPE plans to demonstrate additional capabilities in the future with the enhancement of the HPE High Performance Cluster Management (HPCM) system to provide complete provisioning, management, and monitoring for clusters scaling to 100,000 nodes at a faster rate. Other testing plans include exploring integration of HPE InfoSight, a cloud-based AI-driven management tool that monitors, collects and analyzes data on IT infrastructure. HPE InfoSight is used to predict and prevent probable events to maintain the overall health of server performance.

The solution will be showcased at HPE booth 1325 at Supercomputing 2019 (SC19) in Denver, Colorado.