The latest version of Cloudera Data Platform adopts Apache Spark 3.0 accelerated by NVIDIA technology, which helped an operations team achieve an 8x performance improvement and successfully run a job that had seemed impossible.
Deborah Taylor completed an impossible task with perseverance and the right tools.
As a data scientist, Taylor was tasked with sifting through a database of more than 300 TB at the U.S. Internal Revenue Service to find patterns that may help identify identity theft and other fraud. But even when she let a large number of CPU servers work all night, the job would not finish.
When she came back in the morning, she found that the job had failed, so she tried again, and it failed again.
Just then, Nasheb Ismaily, a solutions engineer at Cloudera, knocked on the door of Rahul Tikekar, Taylor’s boss. Tikekar manages the technical team that supports data analysts at the IRS. Ismaily asked Tikekar’s team whether they wanted to try the Cloudera Data Platform (CDP) with its built-in, GPU-accelerated Apache Spark 3.0 software.
“I jumped at the chance. Although our standalone servers are equipped with NVIDIA graphics cards, we had no way to use them with Spark on distributed clusters, so this was a great opportunity for us,” Tikekar said.
Breaking Through Obstacles
After a quick test of the software, Taylor immediately sped up many steps of the job fivefold without modifying any code, but several parts still lagged behind.
Ismaily brought in a team of NVIDIA data scientists to examine the core of the code. They soon found that some operations on very poorly structured data were still running on the CPU. So they wrote code to handle this work and inserted it into Spark’s interface to RAPIDS. RAPIDS is a set of open-source libraries for running data analytics on GPUs.
Taylor ran another test and found that everything now ran smoothly on the GPUs of the distributed Spark cluster, with a very noticeable speedup. She ran the whole program on a four-node cluster.
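As a rough illustration of what running Spark on GPUs without code changes can look like, the sketch below assembles settings commonly used to enable the RAPIDS Accelerator plugin and to log which operations stay on the CPU. The article does not show the IRS team's actual configuration, so the helper names, the one-GPU-per-executor default, and the four-tasks-per-GPU value are assumptions chosen only for illustration.

```python
from typing import Dict, List


def rapids_spark_conf(gpus_per_executor: int = 1,
                      tasks_per_gpu: int = 4,
                      explain: str = "NOT_ON_GPU") -> Dict[str, str]:
    """Build Spark settings that turn on GPU acceleration via the RAPIDS plugin.

    The numeric defaults are illustrative assumptions, not values from the article.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be >= 1")
    return {
        # Load the RAPIDS Accelerator so supported SQL/DataFrame operations run on the GPU.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Log the parts of each query plan that were NOT placed on the GPU,
        # which is how CPU fallbacks like the ones described above can be spotted.
        "spark.rapids.sql.explain": explain,
        # Spark 3.0 resource scheduling: GPUs per executor and the GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(rapids_spark_conf())))
```

Because the settings are passed at submit time, the job's own code stays untouched; the explain setting only adds log output and does not change where operations run.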
“Through the technical integration of Cloudera and NVIDIA, we can use data-driven insights to power mission-critical use cases,” said Joe Ansaldi, technical director of research, applied analytics and statistics at the IRS.
“We are currently applying this integration, which speeds up our data engineering and data science workflows by more than 10x at half the cost,” Ansaldi added.
Spark 3.0 + GPU = New Vision
The IRS team is exploring some of the potential benefits of applying this technology.
With a Spark cluster built from GPU-powered servers, the team can accelerate all of its current jobs and run others previously considered impossible. That, in turn, helps the team work with the large datasets it holds.
Tikekar said: “Before Spark 3.0, we could not complete these tasks, but now we have greatly improved the speed with GPUs, and we can expect to solve problems that could not be solved before.”
Mapping an AI Roadmap
The team plans to apply what it has learned to data preparation, that is, the extract, transform, load (ETL) stage of data analytics. The next major plan is to accelerate all kinds of AI inference.
Tikekar said: “This cooperation with Cloudera and NVIDIA helps us harness the GPUs in the cluster. When such technological advances occur, it takes some time to understand their power and develop applications that can use them, so Deborah Taylor has indeed charted a new roadmap for us. She is the hero of the whole story.”
Specifically, the team will next focus on natural language processing and analytics by building large-scale deep learning neural networks.
Rich Machine Learning Applications
This is the kind of machine learning transformation that many enterprises are pursuing today.
“I personally think machine learning has incredible potential to make possible things that were difficult to achieve in the past,” Tikekar said. He holds a Ph.D. in computer science, joined the IRS 13 years ago, and previously taught at Southern Oregon University for ten years.
“For example, today we can scan forms and use optical character recognition to read bits and pieces. But with AI, we can read the forms much more efficiently and find patterns that help identify identity theft or reduce waste. Many applications benefit from AI in many ways,” he added.
To learn more about how CDP 7.1.6 uses NVIDIA GPUs to accelerate Cloudera workloads, watch the GTC talk released in October 2020 (free to view after registration). The two companies also announced their partnership at that time.