The Data Systems Seminar Series provides a forum for presenting and discussing key issues in data systems, both current and emerging. It complements our internal meetings by welcoming insights from external colleagues.
The schedule for the 2024–25 academic year is outlined below and will be updated as additional speakers are confirmed.
Seminars are typically held on Mondays at 10:30 a.m. in DC 1302 during the winter 2025 term, unless otherwise noted. Some sessions may be held virtually on Zoom; these will be clearly marked.
The talks are open to the public.
We will post the presentation videos whenever possible. Past DSG Seminar videos are available online.
The Data Systems Seminar Series is supported by

Kris Shrishak
Arijit Khan
Tilmann Rabl
Miao Qiao
Philip Bernstein
Weiran Liu
Jianguo Wang
Thomas Haigh
Ricardo Baeza-Yates
Lingyang Chu
Ana Klimovic
Karima Echihabi
Stavros Sintos
Stefan Böttcher
26 September 2024; 2:00 p.m. (Note the special time)
Title: | Privacy and PETs: An Interaction with Human Rights Law |
Speaker: | Kris Shrishak, Enforce
Abstract: | Privacy enhancing technologies (PETs) have been researched and promoted for the past few decades. Amidst greater public awareness of personal data collection and the application of data protection regulations, the number of implementations of PETs has increased in the past few years. Given the promise and expectation of PETs to protect people’s privacy and the hope of researchers to see use cases of PETs, what is the reality of PETs in today’s world? Are the privacy needs of people being met? This talk will take you through a journey of PETs, visiting data protection law and international human rights law along the way. |
Bio: | Dr. Kris Shrishak is a public interest technologist and a Senior Fellow at Enforce. He advises legislators on emerging technologies and global AI governance (including the EU AI Act). He is regularly invited to speak at the European Parliament and has testified at the Irish Parliament. His work focuses on privacy tech, anti-surveillance, emerging technologies, and algorithmic decision making. His expert commentary appears in The New York Times, The Washington Post, the BBC, the LA Times, Süddeutsche Zeitung, Politico, The Irish Times and other leading media. He has been interviewed on TV and radio, including on CNN, the BBC, Euronews and France24. He has written for the Bulletin of the Atomic Scientists, Nikkei Asia and Euronews, among others. He works on the kind of cryptography that allows computing on encrypted data and proving the existence of information without revealing it. These technologies, broadly known as privacy enhancing technologies (PETs), could be beneficial. However, there are risks that have not been sufficiently researched. Previously, Kris was a researcher at Technical University Darmstadt in Germany, where he worked on applied cryptography, PETs and Internet infrastructure security. |
7 October 2024; 10:30
21 October 2024; 10:30
Title: | Efficiency in Data Systems |
Speaker: | Tilmann Rabl, University of Potsdam
Abstract: | For the longest time, acquiring new hardware resulted in significant software efficiency gains due to exponential improvements in hardware capabilities. Physical limits in hardware manufacturing have brought former niche designs into standard components, such as multiple cores and specialized circuits. Even with these new designs, hardware improvements are slowing, while software and applications are still becoming increasingly complex and resource demanding. In this talk, we will discuss the efficiency of data systems. We will start with a general discussion of system efficiency and look at the design of efficient architectures. Incorporating estimates of the carbon intensity of hardware manufacturing and power production, we will then discuss hardware replacement frequencies, try to establish new rules of thumb for ideal hardware lifecycles in database deployments, and discuss the implications for database development. |
Bio: | Tilmann Rabl is a professor for Data Engineering Systems at the Digital Engineering Faculty of the University of Potsdam and the Hasso Plattner Institute. His research focuses on efficiency of database and ML systems, real-time analytics, hardware efficient data processing, and benchmarking. |
28 October 2024; 10:30
Title: | Scalable Query Processing with Graphs |
Speaker: | Miao Qiao, University of Auckland
Abstract: | Graph-based query processing faces scalability challenges. This talk explores two facets of the problem. First, when a graph grows too large for efficient querying, can query processing algorithms exhibit strongly local properties, making the search independent of the graph’s overall size? Second, in approximate nearest neighbor search using indexes of Hierarchical Navigable Small World (HNSW) graphs, how can we compress the index while maintaining equivalent query performance, when queries include attribute-based filters? To address the first question, we examine cases where dense subgraph search admits strongly local algorithms and where it does not. For the second, we present a novel compression method that transforms the n^2 HNSW graphs into a more compact structure called the 2D segment graph, enabling lossless compression while preserving query efficiency. Theory plays a central role in both solutions, shaping the performance and feasibility of scalable graph-based querying. |
Bio: | Dr. Qiao is a Senior Lecturer in Computer Science at the University of Auckland, New Zealand, a role equivalent to Associate Professor in tenure-track systems. Her research centers on big data management, with a focus on query optimization, indexing, joins, sampling, graph analysis, and graph-based nearest neighbor search. She has advanced indexing techniques for query processing in graph databases, including shortest distance and subgraph matching queries. Her recent work on range-filtering nearest neighbor search, along with an ongoing submission on its dynamic variant, has potential applications in modern vector databases, particularly for unstructured queries. |
13 December 2024; 11:00 (Note the unusual day and time)
Title: | DDS: DPU-optimized Disaggregated Storage |
Speaker: | Philip Bernstein, Microsoft Research
Abstract: |
A DPU is a network interface card (NIC) with programmable compute and memory resources. It sits on the system bus, PCIe, which is the fastest path to access SSDs, and it directly connects to the network. It therefore can process storage requests as soon as they arrive at the NIC, rather than passing them through to the host. DPUs are widely deployed in public clouds and will soon be ubiquitous. In this talk, we’ll describe DPU-Optimized Disaggregated Storage (DDS), our software platform for offloading storage operations from a host storage server to a DPU. It reduces the cost and improves the performance of supporting a database service. DDS heavily uses DMA, zero-copy, and userspace I/O to minimize overhead and thereby improve throughput. It also introduces an offload engine that can directly execute storage requests on the DPU. For example, it can offload GetPage@LSN to the DPU of an Azure SQL Hyperscale page server. This removes all host CPU consumption (saving up to 17 cores), reduces latency by 70%, and increases throughput by 75%. This is joint work with Qizhen Zhang, Badrish Chandramouli, Jason Hu, and Yiming Zheng. |
Bio: | Philip A. Bernstein is a Distinguished Scientist in the Data Systems Group in Microsoft Research. He has published over 200 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, and has contributed to many database products. He is a Fellow of the ACM and AAAS, a winner of the E.F. Codd SIGMOD Innovations Award, and a member of the Washington State Academy of Sciences and the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto. |
29 January 2025; 12:00 (Note the unusual day and time)
Title: | Efficient Simple Keyword Private Information Retrieval |
Speaker: | Weiran Liu, Alibaba Group
Abstract: | Keyword Private Information Retrieval (Keyword PIR) enables private queries on public key-value databases. Unlike standard index-based PIR, keyword PIR presents greater challenges, since the query’s position within the database is unknown and the domain of keywords is vast. The key insight for obtaining efficient keyword PIR is to construct an efficient and compact key-to-index mapping, thereby reducing the keyword PIR problem to standard PIR. In this talk, I will introduce the basic concept of (Keyword) PIR, the state-of-the-art (SOTA) index/keyword PIR constructions based on the Learning With Errors (LWE) assumption, and our new constructions for more efficient Keyword PIR. Notably, our construction includes several advanced data structures from the database field, namely the binary fuse filter and the learned index, demonstrating that new data structures can have potential applications in cryptographic primitives. |
Bio: | Weiran Liu received his B.S. degree in Electronic Information and Engineering from Beihang University, China, in 2012 and his Ph.D. degree in Information and Communication Engineering from Beihang University, China, in 2017. He is currently a staff security engineer in the Department of Data Technology and Products, Alibaba Group, China. His main areas of interest include applied cryptography, fully homomorphic encryption, secure multi-party computation, and differential privacy. He has published several works at top-tier conferences such as USENIX Security, ACM CCS, SIGMOD, VLDB, ICDE, and PKC. He has also contributed to several books in the data security field, including “The Great Crypto,” “A Pragmatic Introduction to Secure Multi-Party Computation (Chinese version),” and “Programming Differential Privacy (Chinese version).” He has served as a reviewer/extended reviewer for top-tier international conferences across different fields of study, including ICML, NeurIPS, ICLR, and ASIACRYPT. |
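The key-to-index reduction described in the abstract can be illustrated with a deliberately simplified sketch. This is not the speaker's construction: the mapping below is a plain linear-probing hash table rather than a binary fuse filter or learned index, and the index-PIR step is stubbed as a direct array read instead of a cryptographic protocol; all names are invented for this example.

```python
# Toy sketch of reducing keyword PIR to index PIR (illustrative only): the
# server lays the key-value pairs out in an array using a public mapping; the
# client recomputes the candidate slots for its keyword and fetches each one
# with index PIR, then filters locally.

def build_mapping(keys, table_size):
    """Server side: place each key at a deterministic slot via linear probing."""
    slots = [None] * table_size
    for key in keys:
        i = hash(key) % table_size
        while slots[i] is not None:
            i = (i + 1) % table_size
        slots[i] = key
    return slots

def candidate_indices(key, table_size, max_probes):
    """Client side: the slots a key could occupy, derived without the data."""
    start = hash(key) % table_size
    return [(start + j) % table_size for j in range(max_probes)]

db = {"alice": 10, "bob": 20, "carol": 30}   # public key-value database
size = 8
slots = build_mapping(db.keys(), size)
records = [(k, db[k]) if k is not None else None for k in slots]

# Client: query "bob" by retrieving each candidate slot and filtering locally.
value = None
for i in candidate_indices("bob", size, max_probes=3):
    rec = records[i]            # in real keyword PIR this is an index-PIR query
    if rec is not None and rec[0] == "bob":
        value = rec[1]          # the server never learns which slot matched
```

The compactness of the real key-to-index structure is what the talk's constructions improve; this sketch only shows why a good mapping turns a keyword query into a handful of positional queries.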
31 March 2025; 1:30 in DC 1304 (Note the unusual time and location)
Title: | Database Systems for LLMs: Vector Databases and Beyond |
Speaker: | Jianguo Wang, Purdue University
Abstract: |
Vector databases have recently emerged as a hot topic due to the widespread interest in LLMs, where vector databases provide the relevant context that enables LLMs to generate more accurate responses. Current vector databases can be broadly categorized into two types: specialized and integrated. Specialized vector databases are explicitly designed for managing vector data, while integrated vector databases support vector search within an existing database system. While specialized vector databases are interesting, there is a significant customer base interested in integrated vector databases for various reasons, such as reluctance to move data out, the desire to link vector embeddings with their source data, and the need for advanced vector search capabilities. However, integrated vector databases face challenges in performance and interoperability. In this talk, I will share our recent experience in building integrated vector databases within two important classes of databases: Relational Databases and Graph Databases. I will show how we address the performance and interoperability challenges, resulting in much more powerful database systems that support advanced RAGs. Next, I will present other challenges in vector databases along with our ongoing work. Finally, I will discuss the broader role of database systems in the era of LLMs and explore how to build future databases that extend beyond vector databases to better support LLMs. |
Bio: |
Jianguo Wang is an Assistant Professor of Computer Science at Purdue University. He obtained his Ph.D. from the University of California, San Diego. He has worked or interned at Zilliz, Amazon AWS, Microsoft Research, Oracle, and Samsung on various database systems. His current research interests include database systems for the cloud and LLMs, especially Disaggregated Databases and Vector Databases. He regularly publishes and serves as a program committee member at premier database conferences such as SIGMOD, VLDB, and ICDE. He also served as a panel moderator for the VLDB'24 panel on vector databases. His research has won multiple awards, including the ACM SIGMOD Research Highlight Award and the NSF CAREER Award. More information can be found at |
12 May 2025; 10:30
Title: | Where the Database Management System Comes From, and Why it Matters |
Speaker: | Thomas Haigh, University of Wisconsin-Milwaukee
Abstract: | For more than fifty years the database management system (DBMS) has been the essential foundation of information systems of all kinds, from enterprise software to personal websites. Developed to support the integration of different applications and data types on corporate mainframes, the DBMS had technological roots in Cold War defense systems. Thomas Haigh, a leading historian of computing, looks back to the 1960s and 70s for the origins of the DBMS and at related concepts such as the database administrator, the management information system, and the data warehouse. Today data science is a hot field, and the potential of “data history” is exciting historians of science. Haigh argues that we can’t understand either of those things without recognizing the DBMS as vital infrastructure that mediates and structures interactions between users, applications, and data. |
Bio: | Thomas Haigh is a professor and chair of the history department at the University of Wisconsin-Milwaukee. After studying computer science at Manchester University, he won a Fulbright award for a Ph.D. in the history and sociology of science from the University of Pennsylvania. He has researched many topics in the history of computing, from database management systems to internet technologies. Haigh is the lead author of A New History of Modern Computing (2021) and ENIAC in Action (2016), both published by MIT Press. At UWM he runs a retrocomputing lab with working systems from the 1980s and 1990s. His current book project is Artificial Intelligence: The History of a Brand.
2 June 2025; 10:30 in DC 1304 (Note the unusual room)
Title: | The Limitations of Data, Machine Learning & Us |
Speaker: | Ricardo Baeza-Yates, Northeastern University, Universitat Pompeu Fabra and Universidad de Chile
Abstract: | Machine learning (ML), particularly deep learning, is being used everywhere. However, it is not always used well, ethically, and scientifically. In this talk we first take a deep dive into the limitations of supervised ML and of data, its key component. We cover small data, datafication, bias, predictive optimization issues, evaluating success instead of harm, and pseudoscience, among other problems. The second part addresses our own limitations in using ML, including different types of human incompetence: cognitive biases, unethical applications, lack of administrative competence, misinformation, and the impact on mental health. In the final part we discuss regulation of the use of AI and responsible AI principles that can mitigate the problems outlined above. |
Bio: | Ricardo Baeza-Yates is a Visiting Professor in the Khoury College of Computer Sciences at the Silicon Valley campus of Northeastern University, as well as a part-time professor in the Department of Engineering at Universitat Pompeu Fabra and the Department of Computer Science at the University of Chile. Previously, he was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from 2006 to 2016. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 1999 and 2011 (2nd ed.), which won the ASIST 2012 Book of the Year award. In 2009 he was named an ACM Fellow and in 2011 an IEEE Fellow. He has won national scientific awards in Chile and Spain, among other accolades and distinctions. He obtained a Ph.D. in CS from the University of ݮƵ, Canada, and his areas of expertise are responsible AI, web search and data mining, plus data science and algorithms in general.
14 July 2025; 10:30 in DC 1304 (Note the unusual room)
Title: | Invisible Yet Powerful: Watermarking to Protect Datasets and Models in Machine Learning |
Speaker: | Lingyang Chu, McMaster University
Abstract: |
The rapid advancement of AI has transformed both datasets and models into valuable assets, yet they remain vulnerable to unauthorized use, theft, and replication. Watermarking provides a promising solution by embedding verifiable ownership signals directly into these assets. Traditional database watermarking techniques assume that attackers seek to preserve query utility, which inherently restricts the extent of modifications they can apply to the data. However, this assumption does not hold for machine learning, where models can maintain predictive performance even when trained on significantly altered datasets. As a result, adversaries can heavily modify a dataset or distill a model while preserving its learning utility, which enables much stronger watermark removal attacks than those in traditional database watermarking. How can we design watermarking methods that safeguard AI-related assets against these threats while maintaining their usability? This talk presents our recent research on addressing the novel challenges of watermarking tabular datasets and deep learning models in the context of machine learning. First, I will introduce TabularMark, a non-blind watermarking framework that embeds verifiable ownership signals into tabular datasets while ensuring that models trained on watermarked data retain high predictive performance. Second, I will discuss blind watermarking for numerical tabular datasets, which enables watermark verification without requiring access to the original data, making it more practical for real-world data-sharing scenarios. Third, I will introduce a robust model watermarking approach that embeds ownership signals into deep neural networks to withstand ensemble distillation attacks. Finally, I will conclude with open challenges and future directions. |
Bio: |
Lingyang Chu is an Assistant Professor in the Department of Computing and Software at McMaster University. He received his Ph.D. in Computer Science from the University of Chinese Academy of Sciences. Before joining McMaster University, he was a postdoctoral fellow at Simon Fraser University and a Principal Researcher at Huawei Technologies Canada. His research focuses on data mining, explainable machine learning, and trustworthy computing, with a growing focus on data security in database systems. Some of his recent works explore AI-related data watermarking techniques to ensure data integrity and provenance in large-scale systems and data markets. He is an Associate Editor of ACM Transactions on Knowledge Discovery from Data (TKDD) and he also served as a program committee member and reviewer for conferences and journals including SIGMOD, VLDB, KDD, ICDE, ICDM, CIKM, CVPR, NeurIPS, ICML, ICLR, ACM Multimedia, TKDE, TMM, etc. |
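As a rough illustration of the tabular-watermarking idea (this is not TabularMark itself, whose design is considerably more involved), the toy sketch below embeds a keyed parity watermark in a numeric column and verifies it by counting matches; all function names and parameters are invented for this example.

```python
# Toy parity watermark for a numeric column (illustrative only): a secret key
# selects ~25% of cells, whose values are nudged to even parity; verification
# counts how many keyed cells match. Real schemes bound the utility loss and
# use proper hypothesis tests to resist removal attacks.
import hashlib
import random

def keyed(cell_id, secret):
    """Deterministically decide whether a cell carries the mark."""
    digest = hashlib.sha256(f"{secret}:{cell_id}".encode()).digest()
    return digest[0] < 64                      # marks roughly 1 in 4 cells

def embed(column, secret):
    """Return a watermarked copy: keyed cells are forced to even values."""
    return [v + 1 if keyed(i, secret) and v % 2 == 1 else v
            for i, v in enumerate(column)]

def verify(column, secret):
    """Fraction of keyed cells that are even: ~1.0 if marked, ~0.5 otherwise."""
    cells = [v for i, v in enumerate(column) if keyed(i, secret)]
    return sum(v % 2 == 0 for v in cells) / max(len(cells), 1)

random.seed(0)
data = [random.randint(0, 999) for _ in range(2000)]
marked = embed(data, secret="owner-key")
```

The abstract's point is visible even in this toy: an adversary who is free to perturb values heavily (because a downstream model tolerates it) can destroy such parity marks, which is why ML-era schemes need stronger embedding and verification.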
18 July 2025; 10:30 a.m. in DC 1304 (Note the unusual room)
Title: | Unlocking True Elasticity in the Cloud-Native Era |
Speaker: | Ana Klimovic, ETH Zürich
Abstract: | Resource elasticity is fundamental to cloud computing. The more quickly a cloud platform can allocate resources to match the demand of each user request as it arrives, the fewer resources need to be pre-provisioned to meet performance requirements. However, even serverless platforms — which can boot sandboxes in 10s to 100s of milliseconds — are not sufficiently elastic to avoid over-provisioning expensive resources (e.g., warm sandboxes to avoid cold starts). A key obstacle to true elasticity is that today's cloud platforms are stuck retrofitting system software designed for a more traditional execution model of cloud computing based on long-running virtual machines that provide each user application with a POSIX-like interface. While providing a POSIX interface was important in the early days of cloud computing to ease migration from on-premises clusters, today's developers design cloud-native applications, in which user-defined computations interact with a variety of cloud services (e.g., storage, AI inference, data analytics engines) over REST APIs. In this talk, I will propose a declarative programming model catered to cloud-native applications that enables co-designing a much more efficient and elastic underlying execution system. I will present Dandelion, a new elastic cloud platform that implements this declarative programming model. Dandelion applications are expressed as DAGs of pure compute functions and HTTP-based communication functions. This enables Dandelion to securely execute user-defined compute functions in lightweight sandboxes that cold start in hundreds of microseconds, since executing pure functions does not require initializing a POSIX environment. Dandelion makes it practical to boot a sandbox on demand for every compute function invocation, decreasing performance variability by two to three orders of magnitude compared to Firecracker and reducing committed memory by 96% on average when running the Azure Functions trace.
I will discuss the implications of true elasticity for cloud applications like interactive data analytics and emerging agentic AI workflows. |
Bio: | Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University. |
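The "DAG of pure functions" model from the abstract can be sketched in a few lines. This is an invented illustration, not Dandelion's actual API: each node is a pure function of its predecessors' outputs, which is the property that lets a platform sandbox every invocation independently.

```python
# Invented sketch of a declarative DAG-of-pure-functions application model:
# nodes are evaluated in topological order, each as a pure function of its
# predecessors' results, so any node could run in its own ephemeral sandbox.
from graphlib import TopologicalSorter

def run_dag(dag, funcs, inputs):
    """dag maps node -> list of predecessors; funcs maps node -> pure function."""
    results = dict(inputs)
    for node in TopologicalSorter(dag).static_order():
        if node not in results:                  # inputs are already resolved
            args = [results[p] for p in dag[node]]
            results[node] = funcs[node](*args)
    return results

# A three-stage toy pipeline: fetch (provided as input) -> parse -> count.
dag = {"fetch": [], "parse": ["fetch"], "count": ["parse"]}
funcs = {"parse": str.split, "count": len}
out = run_dag(dag, funcs, inputs={"fetch": "a b c"})
```

Because no node touches a filesystem, environment, or ambient state, the runtime is free to start a fresh sandbox per invocation, which is the efficiency argument the talk makes.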
28 July 2025; 10:30 a.m. in DC 1304 (Note the unusual room)
Title: | Graph-Based Vector Search: Recent Advances and Future Directions |
Speaker: | Karima Echihabi, Mohammed VI Polytechnic University
Abstract: | High-dimensional vector similarity search has been recognized for half a century as a fundamental and challenging problem in computer science. It has recently gained increased prominence, both in academia and industry, due to the proliferation of deep network embeddings, which are typically dense vectors representing complex objects (e.g., images, text, tables), and the growing number of AI tasks that require dense vector search as a key subroutine (e.g., cleaning, discovery, retrieval). Nowadays, graph-based approximate vector search, despite lacking theoretical guarantees, is considered the method of choice for many applications thanks to its excellent empirical performance. In this talk, we will give a brief overview of the vector similarity search problem, highlight some promising research directions covering both the exact and approximate flavors of the problem, and dive into the key findings of a recent extensive experimental evaluation of graph-based approximate vector search methods. |
Bio: | Karima Echihabi is an Assistant Professor of Computer Science at Mohammed VI Polytechnic University in Morocco (UM6P). Her research interests lie in responsible and scalable data science. This spans topics in data management, machine learning, high-performance computing, and socio-legal studies. One of the fundamental problems she focuses on is similarity search because it is a key operation in many critical data science tasks such as data cleaning, data integration, and information retrieval. She is the founding faculty advisor for the UM6P ACM Student Chapter, a senior member of the IEEE, and an Associate Editor for the ACM SIGMOD and PVLDB. Her research work has led to publications in top venues such as the ACM SIGMOD, Communications of the ACM, ICDE and PVLDB. Before joining UM6P, she worked as a software engineer at Microsoft, Redmond, and the IBM Toronto Lab, and as an entrepreneur running her own consultancy business. She is passionate about computer science education and has coached students in various competitions, leading to several awards including a Gold Medal at the 2024 UNESCO/IRCAI AI in Africa Competition, and a Silver Medal at the 2023 Moroccan National Programming Competition. She earned a BSc. in Software Engineering from Al Akhawayn University, an MSc in Computer Science from the University of Toronto, and a PhD in Computer Science from Mohammed V University and Université de Paris. |
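The core routine behind the graph-based methods the talk surveys can be sketched as a greedy walk on a proximity graph, a simplified, single-layer version of what HNSW-style indexes do. The toy graph and entry point below are assumptions for illustration; production systems use beam search over carefully built navigable graphs.

```python
# Greedy descent on a proximity graph: from an entry point, repeatedly move to
# the neighbor closest to the query; stop at a local minimum, which serves as
# the approximate nearest neighbor. This is why graph-based search scales with
# path length rather than dataset size in practice.
import math

def greedy_search(graph, points, entry, query):
    """Follow edges toward the query until no neighbor is closer."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current       # local minimum = approximate nearest neighbor
        current = best

points = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}       # a path graph, for clarity
nearest = greedy_search(graph, points, entry=0, query=(2.2, 0))
```

The lack of theoretical guarantees mentioned in the abstract shows up directly here: greedy descent can stop at a local minimum that is not the true nearest neighbor, which is what empirical evaluations of these methods measure.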
6 August 2025; 10:30 a.m. in DC 1304 (Note the unusual date and room)
Title: | Efficient Algorithms on Relational Databases through the Lens of Geometry |
Speaker: | Stavros Sintos, University of Illinois at Chicago
Abstract: | Exploring and analyzing relational data typically involves two costly steps: data preparation and data processing. During data preparation, tuples from multiple tables are joined to form a comprehensive dataset, while in the subsequent processing step, algorithms are applied to the join results for analysis. This two-step approach is often prohibitively expensive because join outputs can be polynomially larger than the total size of the input tables. To address this challenge, we develop efficient approximation algorithms for various NP-complete optimization problems on join results, without explicitly materializing the join. Our work focuses on clustering problems and demonstrates how computational geometry enables the fastest known approximation algorithms for relational k-center, k-median, k-means, and related variants. Specifically, for a database instance D of size N and an acyclic join query Q, we introduce novel techniques that combine ideas from computational geometry and database theory to achieve constant-factor approximations in roughly O(kN) time, even when the number of join results |Q(D)| is orders of magnitude larger than N. |
Bio: | Stavros Sintos is an Assistant Professor in the Department of Computer Science at the University of Illinois at Chicago (UIC). Before joining UIC, he was a Postdoctoral Scholar on Data Management in the Department of Computer Science at the University of Chicago. He obtained his Ph.D. in the Department of Computer Science at Duke University under the supervision of Prof. Pankaj K. Agarwal. He is a recipient of the James B. Duke Fellowship, and he was nominated for the 2019-2020 outstanding Ph.D. dissertation award for his thesis titled “Efficient Algorithms for Querying Large and Uncertain Data”. Recently, he got the best paper award at ICDT 2024. His main research interest is in the design of efficient algorithms with theoretical guarantees for problems and queries in databases and data management. An important aspect of his research has been on combining geometric optimization with query processing. His work has been published in top-tier conferences and journals such as PODS, SIGMOD, VLDB, ICDT, and KDD. His research is supported by NSF and Google-CAHSI. For more details, please visit . |
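For background on the clustering problems the abstract names, here is the classic greedy (Gonzalez) 2-approximation for k-center on an explicit point set. The talk's contribution is precisely to achieve comparable constant-factor guarantees without materializing such a point set from the join; the points below are a made-up example.

```python
# Gonzalez's greedy 2-approximation for k-center: repeatedly add the point
# farthest from its nearest chosen center. On an explicit point set this runs
# in O(nk) time; the relational setting must avoid enumerating the n = |Q(D)|
# join results at all.
import math

def k_center_greedy(points, k):
    """Return k centers whose max point-to-center distance is within 2x optimal."""
    centers = [points[0]]                        # arbitrary first center
    while len(centers) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 5)]
centers = k_center_greedy(pts, k=2)              # picks two far-apart centers
```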
7 August 2025; 12:00 p.m. in DC 1302 (Note the unusual date and time)
Title: | Towards Efficient Algorithms on Compressed Graph Databases |
Speaker: | Stefan Böttcher, Paderborn University
Abstract: | The speed of algorithms on massive graphs depends on the size of the given data. Grammar-based compression is a technique for compressing a graph while still allowing it to be read or modified with little time overhead. When the data access methods for compressed data are chosen carefully, the speed-up gained from the reduced data size significantly outweighs the time overhead of partially decompressing the data. The talk gives an overview of the key ideas behind grammar-based compression for large graphs and shows how to apply graph compression to graph databases. Furthermore, it introduces recompression as a fast technique for keeping compressed graphs small when they are frequently modified. |
Bio: | Stefan Böttcher is a professor of computer science at Paderborn University. His background research areas are query processing, parallel transactions, and security in database systems. His current main research topics are compressed graph databases and genome index construction for bioinformatics applications. He has published more than 100 papers and has cooperated with more than 20 companies across various industry branches. He is also active in computer science education in companies and works on explanation systems for e-learning. |
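The grammar-based compression idea can be illustrated on a sequence (the talk applies it to graphs, where rules capture repeated subgraphs rather than repeated symbol pairs): repeatedly replace the most frequent adjacent pair with a fresh nonterminal. This RePair-style sketch is illustrative only and not the specific technique presented in the talk.

```python
# RePair-style grammar compression: each round finds the most frequent adjacent
# pair of symbols, records a rule, and substitutes a fresh nonterminal. The
# grammar (rules) plus the shortened sequence together losslessly represent
# the original data and can be partially expanded on access.
from collections import Counter

def repair(text):
    seq, rules, fresh = list(text), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                                # no pair worth a rule
        sym = f"R{fresh}"; fresh += 1
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):                      # left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules

def expand(seq, rules):
    """Decompress by recursively replacing nonterminals with their pairs."""
    out = []
    for s in seq:
        if s in rules:
            out.extend(expand(list(rules[s]), rules))
        else:
            out.append(s)
    return out

compressed, rules = repair("abababab")           # two symbols plus two rules
```

Recompression, as described in the abstract, corresponds to rerunning this kind of rule discovery incrementally after updates so the grammar stays small.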