A Study on OpenCL-Based Performance Optimization of the Viola-Jones Face Detection Algorithm

ISSN: 0254-4164
Journal: 计算机学报 (Chinese Journal of Computers)
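Classification: TP302 [Automation and Computer Technology - Computer System Architecture; Automation and Computer Technology - Computer Science and Technology]
Author affiliation: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
Funding: This work was supported by the National Natural Science Foundation of China (61133005, 61272136, 61521092, 61402441).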

Keywords: OpenCL, workload imbalance, task queue, dynamic mapping between threads and tasks, performance portability

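Chinese abstract (translated):

The Viola-Jones face detection algorithm is one of the most successful practical face detection algorithms. However, as the scale of data processed in its application domains keeps growing, the performance of existing implementations is increasingly unable to meet rising interactivity and real-time requirements. Using GPU computing platforms to improve the algorithm's performance and meet these real-time requirements has become a research hotspot. When implemented and optimized on GPUs, however, the algorithm exhibits an irregular characteristic, namely workload imbalance among threads, so conventional optimization methods alone can hardly achieve high performance on GPU computing platforms. To address this, the paper builds a parallel optimization framework for this class of algorithms. By applying Uberkernel, coarse-grained parallelism, Persistent Thread, dynamic mapping between threads and data, and global and local queues, it breaks through the performance bottleneck caused by workload imbalance and substantially improves the performance of face detection on GPU computing platforms. The paper also achieves performance portability of the algorithm across different GPU computing platforms by defining, extracting, and passing key performance parameters of those platforms. Experimental results show that, compared with the highly optimized CPU version in OpenCV 2.4 running on an Intel Xeon X5550 CPU, the optimized algorithm achieves speedups of 11.24-20.27x on the AMD HD7970 and 9.24-17.62x on the NVIDIA GTX680, delivering both high performance and performance portability across GPU computing platforms.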
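
The workload imbalance and queue-based dynamic mapping described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's OpenCL implementation: the function names, the task costs, and the worker count are hypothetical, and the "workers" stand in for persistent GPU threads. It compares a fixed up-front assignment of tasks to workers with workers that repeatedly pull the next task from a shared queue.

```python
import heapq

def static_makespan(costs, num_workers):
    """Fixed mapping: each worker gets one contiguous block of tasks up front."""
    block = -(-len(costs) // num_workers)  # ceiling division
    loads = [sum(costs[w * block:(w + 1) * block]) for w in range(num_workers)]
    return max(loads)

def dynamic_makespan(costs, num_workers):
    """Dynamic mapping: persistent workers pull the next task from a shared queue.

    Simulated with a min-heap of worker finish times: whichever worker
    becomes free first takes the next task in queue order.
    """
    free_at = [0] * num_workers
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Hypothetical per-window costs: most windows are cheap (cost 1),
# while a cluster of 8 windows is expensive (cost 20).
costs = [1] * 24 + [20] * 8
static = static_makespan(costs, 4)
dynamic = dynamic_makespan(costs, 4)
print(static, dynamic)  # 160 46
```

With the fixed mapping, one worker receives all eight expensive tasks and finishes at time 160 while the others idle after time 8. With the shared queue, the total work of 184 units is spread evenly and all four workers finish at time 46.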

English abstract:

The Viola-Jones face detection algorithm is one of the most successful and practical face detection algorithms. However, as the volume of data to be processed keeps growing, the performance of existing implementations is increasingly unable to meet rising interactivity and real-time requirements. Improving the algorithm's performance on GPU computing platforms to meet these real-time requirements has become a hot topic. When ported to GPUs, however, the face detection algorithm exhibits an irregular characteristic: workload imbalance among threads. It is hard to obtain high performance using conventional optimization methods alone. In this paper, we present a high-performance OpenCL implementation of the Viola-Jones face detection algorithm on GPUs based on five main techniques: kernel merge, coarse-grained parallelism, persistent threads, dynamic mapping between threads and tasks, and global queues. Furthermore, the paper achieves performance portability across different GPU computing platforms by defining, extracting, and delivering key hardware performance parameters. We also demonstrate the high performance of our implementation by comparing it with a well-optimized CPU version from the OpenCV library. Experimental results show speedups of 11.24-20.27 times and 9.24-17.62 times on the AMD HD7970 and NVIDIA GTX680 GPUs, respectively, achieving both high performance and performance portability across different GPU computing platforms.
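
The idea of achieving portability by passing hardware parameters into the implementation can be sketched as follows. The abstract does not say which parameters the paper defines, so everything here is an assumption for illustration: the `DeviceParams` fields, the `launch_config` helper, and the `groups_per_unit` and `waves_per_group` tuning values are hypothetical. The compute-unit counts and SIMD widths used in the example are the commonly published figures for the two GPUs named in the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceParams:
    """Hypothetical per-device parameters (not the paper's actual set)."""
    name: str
    compute_units: int    # what OpenCL reports as CL_DEVICE_MAX_COMPUTE_UNITS
    simd_width: int       # wavefront (AMD) or warp (NVIDIA) width
    groups_per_unit: int  # assumed tuning knob: resident work-groups per unit

def launch_config(dev, waves_per_group=4):
    """Derive a persistent-thread launch size from the device parameters."""
    local_size = dev.simd_width * waves_per_group
    num_groups = dev.compute_units * dev.groups_per_unit
    return {"local_size": local_size, "global_size": local_size * num_groups}

# groups_per_unit values below are illustrative guesses, not measured optima.
hd7970 = DeviceParams("AMD HD7970", compute_units=32, simd_width=64, groups_per_unit=2)
gtx680 = DeviceParams("NVIDIA GTX680", compute_units=8, simd_width=32, groups_per_unit=4)

amd_cfg = launch_config(hd7970)
nv_cfg = launch_config(gtx680)
```

The same kernel source would then be launched with a different configuration on each device, so that only the parameter record changes between platforms.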

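Projects with papers in this journal

Research on the Architecture and Design Methods of Ultra-Parallel High-Efficiency Computers

Journal papers: 13

Research on Key Technologies for Parallel Dense Matrix Computation on Many-Core Processors

Journal papers: 2

High-Efficiency Scheduling Theory and Methods for Large-Scale Heterogeneous Parallel Systems

Journal papers: 15

Research on Parallel Computing Models and an Adaptive Algorithm Tuning Framework for Many-Core Architectures

Journal papers: 3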

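Journal papers from the same project

A User Utility Optimization Model for Stochastic Tasks in Cloud Environments

DAG Reliability Model and Fault-Tolerant Algorithm for Heterogeneous Distributed Systems

An Efficient Tile Self-Assembly Model for the Maximum Clique Problem

A Low-Risk Bayesian Spam Filtering Algorithm Based on the Multinomial Model

Optimization of a CUDA-Based Parallel AES Algorithm

A Low-Overhead Rollback Recovery Implementation Method Based on Exploiting Concurrency

A MapReduce-Based Parallel Frequent Itemset Mining Algorithm

A Hybrid Population Migration Algorithm for Solving Pharmacokinetic Parameters

A Load Balancing Algorithm Based on Virtual Server Splitting in DHT Networks

Optimized FPGA Implementation of the AES Cipher in Verilog HDL

Research on a Side-Channel Attack Method for the Elliptic Curve Cryptography (ECC) Algorithm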

Memory Efficient Two-Pass 3D Coprocessor FFT Algorithm for Intel(R) Xeon Phi(TM)

Energy-aware scheduling with reconstruction and frequency equalization on heterogeneous systems

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

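A Low-Power High-Speed Clock and Data Recovery Circuit

Design and Optimization of a Time-Division-Multiplexed Network-on-Chip

Helius: A Lightweight Big Data Computing System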

Parallel Incremental Frequent Itemset Mining for Large Data

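An Arbiter Physical Unclonable Function with Built-In Self-Adjustment

An Incremental Optimization Method for Periodic Queries in Data Warehouses

A Spatial Instruction Scheduling Method Based on Dataflow Blocks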

DLPlib: A Library for Deep Learning Processor

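Design of a Model of the Impact of DVFS on Program Performance in Data Centers

Performance Study of In-Memory Key-Value Storage Systems Based on Non-Volatile Memory Devices

LFF: A Memory Access Fairness Scheduling Mechanism for Many-Core Processors Targeting Big Data Applications

Design of a High-Speed Transmitter Based on Dynamic Circuits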

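Journal information

计算机学报 (Chinese Journal of Computers)
Peking University Core Journal (2011 edition)

Supervising body: Chinese Academy of Sciences
Sponsors: China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences
Editor-in-chief: Sun Ninghui (孙凝晖)
Address: No. 6 Kexueyuan South Road, Zhongguancun, Beijing
Postal code: 100190
Email: cjc@ict.ac.cn
Phone: 010-62620695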

International standard serial number: ISSN 0254-4164
Domestic unified serial number: CN 11-1826/TP
Postal distribution code: 2-833

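Awards:
China Journal Matrix "Double-Benefit" Journal

Indexed in domestic and international databases:
Mathematical Reviews (online edition, USA); abstract and citation database (Netherlands); Engineering Index (USA); Cambridge Scientific Abstracts (USA); Japan Science and Technology Agency database (Japan); China Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)

Citations: 48433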