2019 IEEE International Conference on Visual Communications and Image Processing (VCIP)

December 1-4, 2019 • Sydney, AUSTRALIA

Keynote Speeches

Keynote Speech 1: The Re-Emergence of Gating Networks - Edge-Aware Sparse Representations for Image Processing and Compression

Time:        09:00-10:00 Monday, 2 December

Mixture-of-Experts models are stochastic Gating Networks that were invented 25 years ago and have found interesting applications in classification and regression tasks. In the last few years, these networks have gained significant attention for the design of dedicated Gating Layers that enable conditional computation in “outrageously” large Deep Neural Networks.

Gating Networks can also be used to arrive at novel forms of sparse representations of images and video, suitable for various applications such as compression, denoising and graph processing. One of the most intriguing features of Gating Networks is that they can be designed to model non-stationarities in high-dimensional pixel data as well. As such, sparse representations can be made edge-aware and allow soft-gated partitioning and reconstruction of pixels in regions of 2D images, or of any N-dimensional signal in general. In contrast, wavelet-type representations such as Steerable Wavelets and their extensions struggle to address pixel data beyond 3D efficiently. The unique mathematical framework of Gating Networks admits elegant end-to-end optimisation of network parameters with objective functions that also address visual quality criteria such as SSIM, as well as coding rate and network size. This makes the approach attractive for many image processing tasks.
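The soft-gated partitioning described above can be illustrated with a minimal sketch of a Mixture-of-Experts prediction step: a gating network produces a softmax weighting over a set of simple experts, and the output is the gate-weighted blend of the experts' predictions. This is an illustrative toy (random weights, linear experts, and the function names `softmax` and `moe_predict` are our own), not the talk's actual edge-aware model:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax: gates form a soft partition (rows sum to 1).
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_predict(x, gate_w, expert_w):
    """Soft-gated Mixture-of-Experts prediction.

    x        : (n, d)  input features (e.g. pixel coordinates)
    gate_w   : (d, k)  gating weights -> soft assignment of inputs to k experts
    expert_w : (k, d)  one linear expert per (soft) region
    """
    gates = softmax(x @ gate_w)            # (n, k) soft partition of the input space
    experts = x @ expert_w.T               # (n, k) each expert's prediction
    return (gates * experts).sum(axis=1)   # gate-weighted blend, shape (n,)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))
gate_w = rng.normal(size=(2, 3))
expert_w = rng.normal(size=(3, 2))
y = moe_predict(x, gate_w, expert_w)
```

Because both the gates and the experts are differentiable, all parameters can be trained end-to-end against a single objective, which is the property the abstract highlights for rate/quality/size trade-offs.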

This talk will give an introduction to the novel field of Gating Networks, with particular emphasis on segmentation and compression of N-dimensional pixel data. We will show that these networks allow the design of powerful and disruptive compression algorithms for images, video and light fields that depart completely from existing JPEG- and MPEG-type approaches built on blocks, block transforms and motion vectors.

Thomas Sikora


Institute of Telecommunications Systems, Department of Telecommunications

Technical University of Berlin

Thomas Sikora is professor and director of the Communication Systems Lab and director of the Institute for Telecommunications at Technische Universität Berlin, Germany.

He received the Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from Bremen University, Bremen, Germany, in 1985 and 1989, respectively. In 1990, he joined Siemens Ltd. and Monash University, Melbourne, Australia, as a Project Leader responsible for video compression research activities in the Australian Universal Broadband Video Codec consortium. Between 1994 and 2001, he was the Director of the Interactive Media Department, Heinrich Hertz Institute (HHI) Berlin GmbH, Germany. Dr. Sikora is co-founder of Vis-a-Pix GmbH (www.visapix.com) and imcube media GmbH (www.imcube.com), two Berlin-based start-up companies involved in research and development of audio and video signal processing and compression technology. He is an Appointed Member of the Advisory and Supervisory board of a number of German companies and international research organizations. Currently he serves as a board member of the Berlin-Brandenburg Academy of Science. He frequently works as an industry consultant and patent reviewer on issues related to interactive digital audio and video.

Dr. Sikora has been involved in international ITU and ISO standardization activities as well as in several European research activities for more than 15 years. As the Chairman of the ISO-MPEG (Moving Picture Experts Group) video group, he was responsible for the development and standardization of the MPEG-4 and MPEG-7 video algorithms.

Dr. Sikora is a member of the German Society for Information Technology (ITG) and recipient of the 1996 ITG Award. He received the 1996 Engineering Emmy Award of the US National Academy of Television Arts and Sciences (NATAS) as a member of the ISO MPEG-2 video standards group.

He has published more than 300 journal and conference papers related to audio and video processing and he is a frequent invited speaker at international conferences. He co-authored three books: "Introduction to MPEG-7: Multimedia Content Description Interface", Wiley 2002, "MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval", Wiley 2005 and "3D Videocommunication: Algorithms, Concepts and Real-Time Systems in User-Centered Communications", Wiley 2005.

He is a past Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Video Technology. From 1998 to 2002, he was an Associate Editor of the IEEE Signal Processing Magazine. He is an Advisory Editor for the EURASIP Signal Processing: Image Communication Journal and was an Associate Editor of the EURASIP Signal Processing Journal.

Keynote Speech 2: Health Surveillance

Time:        13:30-14:30 Monday, 2 December

Though surveillance evokes mixed feelings, visual surveillance has become a dominant technology all over the globe. The primary purpose of intelligent surveillance systems is to detect and monitor evolving situations in real time using all observed data, with the implicit goal of helping to manage them. Increasingly, surveillance is being used in managing health, particularly personal health. Technologies ranging from diverse medical imaging to video-based facial expression detection and gait analysis are being used to understand an individual’s health. The central element of personalization is a model of the person from a health perspective. Deep personal models require a personal chronicle of events not only from cyberspace, as used by many current search systems and social networks, but also from physical, environmental, and biological sources; episodic models are too shallow for personalization. Multimodal processing, including computer vision, plays a key role in creating detailed personal chronicles, aka Personicles, for such emerging applications. We are building such Personicles for health applications using smartphones, wearable devices, different biological sensors, cameras, and social media. These Personicles and other relevant event streams may then be used to build personal models using event mining and related AI approaches. In this presentation, we will discuss and demonstrate an approach to building Personicles from diverse data streams and show how this can result in deeper personal models for applications such as personal health navigators.
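The core data structure behind a Personicle is a chronological merge of heterogeneous event streams. A minimal sketch of that merging step, with made-up stream names and a hypothetical `Event` record (the real system's schema and event mining are far richer), could look like this:

```python
import heapq
from dataclasses import dataclass

@dataclass(order=True)
class Event:
    # timestamp first so events compare chronologically
    timestamp: float
    source: str
    payload: str

def build_personicle(*streams):
    """Merge per-sensor event streams (each already sorted by time)
    into a single chronological chronicle via a k-way merge."""
    return list(heapq.merge(*streams))

# Toy streams from two hypothetical sensors:
hr = [Event(1.0, "wearable", "hr=72"), Event(4.0, "wearable", "hr=88")]
cam = [Event(2.5, "camera", "activity=walking")]
chronicle = build_personicle(hr, cam)
```

Once the streams are on one timeline, event-mining and model-building approaches of the kind mentioned in the abstract can operate over the unified chronicle.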

Ramesh Jain


Director of the Institute for Future Health

University of California, Irvine, United States

Ramesh Jain is an entrepreneur, researcher, and educator.

He is a Donald Bren Professor in Information & Computer Sciences at the University of California, Irvine. His research interests have covered Control Systems, Computer Vision, Artificial Intelligence, and Multimedia Computing. His current research passion is addressing health issues using cybernetic principles, building on progress in sensor, mobile, processing, artificial intelligence, computer vision, and storage technologies. He is the founding director of the Institute for Future Health at UCI. He is a Fellow of AAAS, ACM, IEEE, AAAI, IAPR, and SPIE.

Ramesh co-founded several companies, managed them in initial stages, and then turned them over to professional management. He enjoys new challenges and likes to use technology to solve them. He is participating in addressing the biggest challenge for us all: how to live long in good health.

Keynote Speech 3: AVS3 -- A New Generation of Video Coding Standard for 8K and VR/AR

Time:        09:00-10:00 Tuesday, 3 December

The 4K video format is becoming mainstream in the TV device market, and the 8K format is on its way, pushed by two major industrial driving forces. The first is the flat-panel industry, which keeps increasing panel resolution; the second is the video broadcasting industry, which makes content richer and richer in terms of resolution, HDR, VR and AR. As content becomes richer, data size increases sharply, which raises the cost of transmission and storage. A more efficient video coding standard is therefore the key to solving this cost problem. Indeed, video coding standards have played a key role in the TV industry over the last three decades: MPEG-1/2 created the digital TV industry from 1993, MPEG-4 AVC/H.264 and AVS+ supported the HDTV industry from 2003, and HEVC/H.265 and AVS2 supported 4K TV from 2016. Now 8K TV is coming, expected to support VR/AR with 4-10 times the data size of 4K video. A new generation of video coding standard is therefore needed to meet this new demand.

AVS3 is such a new-generation video coding standard, developed by the China Audio and Video Coding Standard Workgroup (AVS), which targets the emerging 8K UHD and VR/AR applications. The first phase of AVS3 was released in March 2019. Compared to the previous AVS2 and HEVC/H.265 standards, AVS3 aims to achieve about 50% bitrate saving. Recently, HiSilicon announced the world's first AVS3 8K video decoder chip at IBC 2019, which supports real-time decoding of 8K video at 120 fps. This marks the opening of a new era of 8K and immersive video experiences. This talk will give a brief introduction to the AVS3 standard, including its development process, key techniques, and applications.

Wen Gao


Director of the Faculty of Information and Engineering Sciences

Peking University, China

Wen Gao is a Boya Chair Professor at Peking University. He has also served as president of the China Computer Federation (CCF) since 2016. He received his Ph.D. degree in electronics engineering from the University of Tokyo in 1991. He was with the Harbin Institute of Technology from 1991 to 1995, and with the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), from 1996 to 2005. He joined Peking University in 2006.

Prof. Gao works in the areas of multimedia and computer vision, on topics including video coding, video analysis, multimedia retrieval, face recognition, multimodal interfaces, and virtual reality. His most-cited contributions are in model-based video coding and face recognition. He has published seven books, over 280 papers in refereed journals, and over 700 papers in selected international conferences. He is a Fellow of the IEEE, a Fellow of the ACM, and a member of the Chinese Academy of Engineering.

Keynote Speech 4: Long term learning of ‘action objects’ and prediction of object interaction

Time:        09:00-10:00 Wednesday, 4 December

By using body-mounted and head-mounted first-person video (FPV) cameras, we obtain a detailed view of a person’s interactions with and responses to objects and stimuli in the environment. What is he doing? What is he looking at? How does he reach out to grasp an object? FPV allows us to ground our representation of the person’s behavior in ego-centric space. By anchoring personal observations to the ego-centric space, we accumulate strong behavioral priors, allowing recognition of behavior at a fine grain. This is important because, for useful robot/human interactions to occur, a co-robot must do more than react in real time; it must think “ahead of real time” to predict what is going to happen next.

Jianbo Shi


Professor, GRASP Laboratory

University of Pennsylvania, United States

Jianbo Shi was born in Shanghai, China. Since then he has been moving.

Jianbo Shi studied Computer Science and Mathematics as an undergraduate at Cornell University, where he received his B.A. in 1994. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 1998 for his thesis on the Normalized Cuts image segmentation algorithm. He joined The Robotics Institute at Carnegie Mellon University in 1999 as a research faculty member, where he led the Human Identification at a Distance (HumanID) project, developing vision techniques for human identification and activity inference. In January 2003, he joined the Department of Computer & Information Science at the University of Pennsylvania, where he is currently a Professor.

His current research focuses on human behavior analysis and image recognition and segmentation. His other research interests include image/video retrieval and vision-based desktop computing. His long-term interests center on the broader area of machine intelligence: he wishes to develop a "visual thinking" module that allows computers not only to understand the environment around us, but also to achieve higher-level cognitive abilities such as machine memory and learning.
