Professor Xiao Hu and her collaborators Binyang Dai and Ke Yi from the Hong Kong University of Science and Technology have received a for their paper, Reservoir Sampling over Joins. Their work, which presents a new algorithm that maintains a random sample over joins in a streaming setting while achieving near-linear computational complexity, also received an .
SIGMOD Research Highlights showcase research projects that exemplify core database research. In particular, the projects address an important problem, represent a definitive milestone in solving the problem, and have the potential for significant impact. SIGMOD Research Highlights also aim to make the selected works widely known in the database community, to industry partners, and the broader ACM community.
鈥淐ongratulations to Xiao and her colleagues on receiving a 2025 SIGMOD Research Highlight Award,鈥 said Raouf Boutaba, University Professor and Director of the Cheriton School of Computer Science. 鈥淭his marks the third recognition of her research excellence this year, following her Best Paper and Distinguished Paper Awards at PODS 2025.鈥

Professor 鈥檚 research aims to equip modern data systems with efficient query processing algorithms that offer scalability, low latency, and privacy guarantees across emerging data applications and computing architectures.
About this award-winning research
In large-scale data analytics, complicated functions often need to be computed on top of query results over the underlying relational database. The join operator, however, presents a major challenge, as the join size can be orders of magnitude larger than the original database. Computing and storing the join results is costly, especially as data size increases. For many tasks, the full join does not need to be computed. Sampling the join results uniformly is often sufficient to estimate analytical queries or to train machine-learning models, at much lower computational cost.
The challenge becomes even more difficult when data arrive as a stream. Here, the goal is to maintain a uniform sample of the join results as new tuples arrive. They classic reservoir sampling algorithm works well for the special case of a single table but fails in the multi-table case because the sampling space 鈥 i.e., the number of join results 鈥 grows exponentially with the size of the input data stream.
In their paper, Professor Hu and her colleagues introduce a novel algorithm that solves this problem with near-linear time complexity. Unlike previous approaches, their new reservoir sampling algorithm for maintaining a uniform sample of size 饾憳 over join result has a running time of 饾憘 (饾憗 log 饾憗 + 饾憳 log 饾憗 log 饾憗/k), where 饾憳 is the given sample size, and 饾憗, unknown in advance, is the size of the data stream. In other words, the result is an approach that scales nearly linearly with the size of the incoming data stream, a dramatically more efficient run-time than previous methods.
Their experimental results, conducted on both graph and relational data, show significant improvements over current state-of-the-art algorithms. This work contributes not only a theoretical advancement but also practical tools for real-time, scalable analytics in modern data systems.
To learn more about the research on which this article is based, please see . Binyang Dai, Xiao Hu, Ke Yi. 2024. Proceedings of the ACM on Management of Data, Volume 2, Issue 3, Article 118, 1鈥26.