BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230124T171523Z
LOCATION:C144-145
DTSTART;TZID=America/Chicago:20221118T083000
DTEND;TZID=America/Chicago:20221118T091000
UID:submissions.supercomputing.org_SC22_sess445_misc255@linklings.com
SUMMARY:IA^3 - Invited Talk 1: Efficient Processing of Large Graph Applica
 tions Using Asynchronous Architectures
DESCRIPTION:Workshop\n\nIA^3 - Invited Talk 1: Efficient Processing of Lar
 ge Graph Applications Using Asynchronous Architectures\n\nKinsy\n\nGraph a
 lgorithms and techniques are increasingly being used in scientific and com
 mercial applications to express relations and explore large data sets. Alt
 hough conventional or commodity computer architectures, such as CPUs and G
 PUs, handle dense graph algorithms fairly well, they are often inadequat
 e for processing large sparse graph applications. Memory access patter
 ns, memo
 ry bandwidth requirements and on-chip network communications in these appl
 ications do not fit in the conventional program execution flow. In this wo
 rk, we propose and design a new architecture for fast processing of large 
 graph applications. To address the lack of spatial and temporal locality i
 n these applications and to support scalable computational models, we desi
 gn the architecture around two key concepts. (1) The architecture i
 s a multicore processor of independently clocked processing elements. Thes
 e elements communicate in a self-timed manner and use handshaking to perfo
 rm synchronization, communication, and sequencing of operations. Because t
 he design is asynchronous, the operating speed of each processing elem
 ent is determined by actual local latencies rather than global worst-cas
 e latencies. We cre
 ate a specialized ISA to support these operations. (2) The application com
 pilation and mapping process uses a graph clustering algorithm to optimize
  parallel computing of graph operations and load balancing. Through the cl
 ustering process, we make scalability an inherent property of the architec
 ture where task-to-element mapping can be done at the graph node level or 
 at the node cluster level. A prototype of the architecture outperforms a c
 omparable CPU by 10x to 20x across all benchmarks and provides 2x to 5x be
 tter power efficiency than a GPU.\n\nSession Format: Recorded\n\n
 Tag: Accelerator-based Architectures, Algorithms, Architectures, Big Data,
  Data Analytics, Parallel Programming Languages and Models, Productivity T
 ools\n\nRegistration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR