2019 IEEE International Conference on Visual Communications and Image Processing (VCIP)

December 1-4, 2019 • Sydney, AUSTRALIA

Keynote Speeches

Keynote Speech 1: The Re-Emergence of Gating Networks - Edge-Aware Sparse Representations for Image Processing and Compression

Time:        09:00-10:00 Monday, 2 December

Mixture-of-Experts models are stochastic Gating Networks that were invented 25 years ago and have found interesting applications in classification and regression tasks. In the last few years, these networks have gained significant attention for the design of dedicated Gating Layers that enable conditional computation in “outrageously” large Deep Neural Networks.

Gating Networks can also be used to arrive at novel forms of sparse representations of images and video, suitable for various applications such as compression, denoising and graph processing. One of the most intriguing features of Gating Networks is that they can be designed to model non-stationarities in high-dimensional pixel data as well. As such, sparse representations can be made edge-aware and allow soft-gated partitioning and reconstruction of pixels in regions of 2D images, or of any N-dimensional signal in general. In contrast, wavelet-type representations such as Steerable Wavelets and their extensions struggle to address pixel data beyond 3D efficiently. The unique mathematical framework of Gating Networks admits elegant end-to-end optimisation of network parameters with objective functions that also address visual quality criteria such as SSIM, as well as coding rate and network size. This makes the approach attractive for many image processing tasks.
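The soft-gated partitioning described above can be illustrated with a minimal sketch of a Mixture-of-Experts prediction step: a gating network produces a softmax weighting over a set of simple experts, and the output is the gate-weighted blend of the experts' predictions. This is an illustrative toy (random weights, linear experts, and the function names `softmax` and `moe_predict` are our own), not the talk's actual edge-aware model:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax: gates form a soft partition (rows sum to 1).
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_predict(x, gate_w, expert_w):
    """Soft-gated Mixture-of-Experts prediction.

    x        : (n, d)  input features (e.g. pixel coordinates)
    gate_w   : (d, k)  gating weights -> soft assignment of inputs to k experts
    expert_w : (k, d)  one linear expert per (soft) region
    """
    gates = softmax(x @ gate_w)            # (n, k) soft partition of the input space
    experts = x @ expert_w.T               # (n, k) each expert's prediction
    return (gates * experts).sum(axis=1)   # gate-weighted blend, shape (n,)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))
gate_w = rng.normal(size=(2, 3))
expert_w = rng.normal(size=(3, 2))
y = moe_predict(x, gate_w, expert_w)
```

Because both the gates and the experts are differentiable, all parameters can be trained end-to-end against a single objective, which is the property the abstract highlights for rate/quality/size trade-offs.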

This talk will give an introduction to the novel field of Gating Networks, with particular emphasis on segmentation and compression of N-dimensional pixel data. We will show that these networks allow the design of powerful and disruptive compression algorithms for images, video and light fields that depart completely from existing JPEG- and MPEG-type approaches built on blocks, block transforms and motion vectors.

Thomas Sikora


Institute of Telecommunications Systems, Department of Telecommunications

Technical University of Berlin

Thomas Sikora is professor and director of the Communication Systems Lab and director of the Institute for Telecommunications at Technische Universität Berlin, Germany.

He received the Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from Bremen University, Bremen, Germany, in 1985 and 1989, respectively. In 1990, he joined Siemens Ltd. and Monash University, Melbourne, Australia, as a Project Leader responsible for video compression research activities in the Australian Universal Broadband Video Codec consortium. Between 1994 and 2001, he was the Director of the Interactive Media Department, Heinrich Hertz Institute (HHI) Berlin GmbH, Germany. Dr. Sikora is co-founder of Vis-a-Pix GmbH (www.visapix.com) and imcube media GmbH (www.imcube.com), two Berlin-based start-up companies involved in research and development of audio and video signal processing and compression technology. He is an Appointed Member of the Advisory and Supervisory board of a number of German companies and international research organizations. Currently he serves as a board member of the Berlin-Brandenburg Academy of Science. He frequently works as an industry consultant and patent reviewer on issues related to interactive digital audio and video.

Dr. Sikora has been involved in international ITU and ISO standardization activities as well as in several European research activities for more than 15 years. As the Chairman of the ISO-MPEG (Moving Picture Experts Group) video group, he was responsible for the development and standardization of the MPEG-4 and MPEG-7 video algorithms.

Dr. Sikora is a member of the German Society for Information Technology (ITG) and recipient of the 1996 ITG Award. He received the 1996 Engineering Emmy Award of the US National Academy of Television Arts and Sciences (NATAS) as a member of the ISO MPEG-2 video standards group.

He has published more than 300 journal and conference papers related to audio and video processing and he is a frequent invited speaker at international conferences. He co-authored three books: "Introduction to MPEG-7: Multimedia Content Description Interface", Wiley 2002, "MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval", Wiley 2005 and "3D Videocommunication: Algorithms, Concepts and Real-Time Systems in User-Centered Communications", Wiley 2005.

He is a past Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Video Technology. From 1998 to 2002, he was an Associate Editor of the IEEE Signal Processing Magazine. He is an Advisory Editor for the EURASIP Signal Processing: Image Communication Journal and was an Associate Editor of the EURASIP Signal Processing Journal.

Keynote Speech 2: Health Surveillance

Time:        13:30-14:30 Monday, 2 December

Though surveillance evokes mixed feelings, visual surveillance has become a dominant technology all over the globe. The primary purpose of intelligent surveillance systems is to detect and monitor evolving situations in real time using all observed data, with the implicit goal of helping to manage them. Increasingly, surveillance is being used in managing health, particularly personal health. Technologies ranging from diverse medical imaging to video-based facial expression detection and gait analysis are being used to understand an individual’s health. The central element of personalization is a model of the person from a health perspective. Deep personal models require a personal chronicle of events not only from cyberspace, as used by many current search systems and social networks, but also from physical, environmental, and biological sources; episodic models are too shallow for personalization. Multimodal processing, including computer vision, plays a key role in creating detailed personal chronicles, aka Personicles, for such emerging applications. We are building such Personicles for health applications using smartphones, wearable devices, different biological sensors, cameras, and social media. These Personicles and other relevant event streams may then be used to build personal models using event mining and related AI approaches. In this presentation, we will discuss and demonstrate an approach to building Personicles from diverse data streams and show how this can result in deeper personal models for applications such as personal health navigators.
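The core data structure behind a Personicle is a chronological merge of heterogeneous event streams. A minimal sketch of that merging step, with made-up stream names and a hypothetical `Event` record (the real system's schema and event mining are far richer), could look like this:

```python
import heapq
from dataclasses import dataclass

@dataclass(order=True)
class Event:
    # timestamp first so events compare chronologically
    timestamp: float
    source: str
    payload: str

def build_personicle(*streams):
    """Merge per-sensor event streams (each already sorted by time)
    into a single chronological chronicle via a k-way merge."""
    return list(heapq.merge(*streams))

# Toy streams from two hypothetical sensors:
hr = [Event(1.0, "wearable", "hr=72"), Event(4.0, "wearable", "hr=88")]
cam = [Event(2.5, "camera", "activity=walking")]
chronicle = build_personicle(hr, cam)
```

Once the streams are on one timeline, event-mining and model-building approaches of the kind mentioned in the abstract can operate over the unified chronicle.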

Ramesh Jain


Director of the Institute for Future Health

University of California, Irvine, United States

Ramesh Jain is an entrepreneur, researcher, and educator.

He is a Donald Bren Professor in Information & Computer Sciences at the University of California, Irvine. His research interests have covered Control Systems, Computer Vision, Artificial Intelligence, and Multimedia Computing. His current research passion is addressing health issues using cybernetic principles, building on progress in sensor, mobile, processing, artificial intelligence, computer vision, and storage technologies. He is the founding director of the Institute for Future Health at UCI. He is a Fellow of AAAS, ACM, IEEE, AAAI, IAPR, and SPIE.

Ramesh co-founded several companies, managed them in initial stages, and then turned them over to professional management. He enjoys new challenges and likes to use technology to solve them. He is participating in addressing the biggest challenge for us all: how to live long in good health.

Keynote Speech 3: AVS3 -- A New Generation of Video Coding Standard for 8K and VR/AR

Time:        09:00-10:00 Tuesday, 3 December

The 4K video format is becoming mainstream in the TV device market, and the 8K format is on its way, pushed by two major industrial driving forces. The first is the flat-panel industry, which keeps increasing panel resolution; the second is the video broadcasting industry, which makes content richer and richer in terms of resolution, HDR, VR and AR. As content becomes richer, data size increases sharply, which raises the cost of transmission and storage. A more efficient video coding standard is therefore the key to solving this cost problem. Indeed, video coding standards have played a key role in the TV industry over the last three decades: MPEG-1/2 created the digital TV industry from 1993, MPEG-4 AVC/H.264 and AVS+ supported the HDTV industry from 2003, and HEVC/H.265 and AVS2 supported 4K TV from 2016. Now 8K TV is coming, expected to support VR/AR with 4-10 times the data size of 4K video. A new generation of video coding standard is therefore needed to meet this new demand.

AVS3 is such a new-generation video coding standard, developed by the China Audio and Video Coding Standard Workgroup (AVS), which targets the emerging 8K UHD and VR/AR applications. The first phase of AVS3 was released in March 2019. Compared to the previous AVS2 and HEVC/H.265 standards, AVS3 aims to achieve about 50% bitrate saving. Recently, HiSilicon announced the world's first AVS3 8K video decoder chip at IBC 2019, which supports real-time decoding of 8K video at 120 fps. This marks the opening of a new era of 8K and immersive video experiences. This talk will give a brief introduction to the AVS3 standard, including its development process, key techniques, and applications.

Wen Gao


Director of the Faculty of Information and Engineering Sciences

Peking University, China

Wen Gao is a Boya Chair Professor at Peking University. He has also served as president of the China Computer Federation (CCF) since 2016. He received his Ph.D. degree in electronics engineering from the University of Tokyo in 1991. He was with the Harbin Institute of Technology from 1991 to 1995, and with the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), from 1996 to 2005. He joined Peking University in 2006.

Prof. Gao works in the areas of multimedia and computer vision, on topics including video coding, video analysis, multimedia retrieval, face recognition, multimodal interfaces, and virtual reality. His most-cited contributions are in model-based video coding and face recognition. He has published seven books, over 280 papers in refereed journals, and over 700 papers in selected international conferences. He is a Fellow of the IEEE, a Fellow of the ACM, and a member of the Chinese Academy of Engineering.

Keynote Speech 4: Long term learning of ‘action objects’ and prediction of object interaction

Time:        09:00-10:00 Wednesday, 4 December

By using body-mounted and head-mounted first-person video (FPV) cameras, we obtain a detailed view of a person’s interactions with and responses to objects and stimuli in the environment. What is he doing? What is he looking at? How does he reach out to grasp an object? FPV allows us to ground our representation of the person’s behavior in ego-centric space. By anchoring personal observations to the ego-centric space, we accumulate strong behavioral priors, allowing recognition of behavior at a fine grain. This is important because, for useful robot/human interactions to occur, a co-robot must do more than react in real time; it must think “ahead of real time” to predict what is going to happen next.

Jianbo Shi


Professor, GRASP Laboratory

University of Pennsylvania, United States

Jianbo Shi was born in Shanghai, China. Since then he has been moving.

Jianbo Shi studied Computer Science and Mathematics as an undergraduate at Cornell University, where he received his B.A. in 1994. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 1998 for his thesis on the Normalized Cuts image segmentation algorithm. He joined The Robotics Institute at Carnegie Mellon University in 1999 as a research faculty member, where he led the Human Identification at a Distance (HumanID) project, developing vision techniques for human identification and activity inference. In January 2003, he joined the Department of Computer & Information Science at the University of Pennsylvania, where he is currently a Professor.

His current research focuses on human behavior analysis and image recognition and segmentation. His other research interests include image/video retrieval and vision-based desktop computing. His long-term interests center on the broader area of machine intelligence: he wishes to develop a "visual thinking" module that allows computers not only to understand the environment around us, but also to achieve higher-level cognitive abilities such as machine memory and learning.
