BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Events - ECPv6.15.20//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-ORIGINAL-URL:https://events.ucsc.edu
X-WR-CALDESC:Events for Events
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20250309T100000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20251102T090000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20260308T100000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20261101T090000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20270314T100000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20271107T090000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20260209T130000
DTEND;TZID=America/Los_Angeles:20260209T143000
DTSTAMP:20260427T072818Z
CREATED:20260127T195054Z
LAST-MODIFIED:20260127T195054Z
UID:10009120-1770642000-1770647400@events.ucsc.edu
SUMMARY:Li\, X. (CSE) - Compute-Efficient Scaling of Fully-Open Visual Encoders
DESCRIPTION:Vision encoders have demonstrated significant performance gains in visual generation and multimodal reasoning. These improvements are primarily attributed to the scaling of data\, model capacity\, and compute. However\, this progress is becoming less accessible due to a lack of transparency in data curation and training recipes. In combination with the high compute requirements of foundation-scale pre-training\, these factors hinder independent reproducibility. \nIn this dissertation\, we democratize large-scale visual encoder training by developing compute-efficient\, reproducible training recipes for video encoders\, vision-language models (VLMs)\, and multimodal large language models (MLLMs). First\, we challenge the common belief that scaling necessarily requires proportionally more resources. Specifically\, we show that decoupled pre-training separates key factors such as space/time and token length\, and learns strong priors first. This design yields dramatic efficiency gains across image\, video\, and generative modeling. Next\, we address the challenge of undisclosed or inaccessible training data by releasing and systematically studying the curation of high-quality\, large-scale datasets. We demonstrate that high-quality synthetic captions at scale enable vision-language models to learn stronger visual representations\, especially when paired with training frameworks that unify contrastive and generative objectives. Lastly\, building on these findings\, we develop fully open vision encoders with complete training data\, recipes\, and checkpoints\, and show that transparency can enable rather than hinder state-of-the-art performance as an MLLM’s visual backbone. \nTogether\, these contributions establish that openness and efficiency are mutually reinforcing\, providing a reproducible foundation for the next generation of visual intelligence. \nEvent Host: Xianhang Li\, Ph.D. 
 Candidate\, Computer Science and Engineering \nAdvisor: Cihang Xie \nZoom- https://ucsc.zoom.us/j/95801462664?pwd=koENnyV65jyPnkJYTbiYr1jaNsV5BE.1 \nPasscode- 782017
URL:https://events.ucsc.edu/event/li-x-cse-compute-efficient-scaling-of-fully-open-visual-encoders/
LOCATION:
CATEGORIES:Ph.D. Presentations
ATTACH;FMTTYPE=image/jpeg:https://events.ucsc.edu/wp-content/uploads/2026/01/ph.d.-presentation-graphic-option2-1.jpg
END:VEVENT
END:VCALENDAR