Cuda Trending

anns building-blocks clustering cuda distance

nvbench

CUDA Kernel Benchmarking Library

899 · 112 · +1

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,491 · 444 · +1

attention cuda efficient-attention inference-acceleration llm

cugraph

cuGraph - RAPIDS Graph Analytics Library

2,207 · 362

HigherOrderCO /

HVM2

A massively parallel, optimal functional runtime in Rust

11,324 · 438

causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

919 · 199

cuCollections

cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

1,840 · 462

warp-ctc

Fast parallel CTC.

4,067 · 1,028

Fri, July 17, 2026

llm.c

LLM training in simple, raw C/CUDA

30,565 · 3,700 · +7

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,490 · 444 · +4

attention cuda efficient-attention inference-acceleration llm

mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

2,375 · 231 · +3

instant-ngp

#19

Instant neural graphics primitives: lightning fast NeRF and more

17,492 · 2,066 · +2

3d-reconstruction computer-graphics computer-vision cuda function-approximation

rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

1,277 · 238 · +2

gpt inference llama llm llm-serving

DeepGEMM

DeepGEMM: clean and efficient BLAS kernel library on GPU

7,524 · 1,116 · +2

cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

1,840 · 462 · +1

nvbench

CUDA Kernel Benchmarking Library

898 · 111 · +1

ThunderKittens

Tile primitives for speedy kernels

3,548 · 311 · +1

nccl-tests

NCCL Tests

1,593 · 392 · +1

cugraph-gnn

cuopt

#18

GPU accelerated decision optimization

974 · 209

Fast parallel CTC.

causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

919 · 198

siboehm /

SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

1,254 · 203

cuvs

cuVS - a library for vector search and clustering on the GPU

cugraph

cuGraph - RAPIDS Graph Analytics Library

2,207 · 362

DeepEP

DeepEP: an efficient expert-parallel communication library

9,856 · 1,324

Thu, July 16, 2026

ThunderKittens

Tile primitives for speedy kernels

3,547 · 311 · +9

DeepEP

DeepEP: an efficient expert-parallel communication library

9,858 · 1,323 · +7

llm.c

#18

LLM training in simple, raw C/CUDA

30,562 · 3,698 · +6

DeepGEMM

DeepGEMM: clean and efficient BLAS kernel library on GPU

7,523 · 1,116 · +5

siboehm /

SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

1,254 · 203 · +4

nccl-tests

NCCL Tests

1,592 · 392 · +3

rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

1,275 · 235 · +3

gpt inference llama llm llm-serving

cuopt

GPU accelerated decision optimization

974 · 209 · +2

nvbench

#19

CUDA Kernel Benchmarking Library

897 · 111 · +1

instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

17,491 · 2,066 · +1

3d-reconstruction computer-graphics computer-vision cuda function-approximation

HigherOrderCO /

HVM2

A massively parallel, optimal functional runtime in Rust

11,322 · 438 · +1

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,486 · 444 · +1

attention cuda efficient-attention inference-acceleration llm

causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

919 · 198 · +1

brucefan1983 /

GPUMD

Graphics Processing Units Molecular Dynamics

811 · 198

cuda gpu gpumd heat-transport high-performance-computing

cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

1,839 · 462

cudf-spark-jni

RAPIDS Accelerator JNI For Apache Spark

6187

cuCollections

#10

warp-ctc

Fast parallel CTC.

4,067 · 1,028

cuvs

cuVS - a library for vector search and clustering on the GPU

Wed, July 15, 2026

llm.c

LLM training in simple, raw C/CUDA

30,557 · 3,699 · +8

DeepEP

DeepEP: an efficient expert-parallel communication library

9,852 · 1,320 · +6

DeepGEMM

DeepGEMM: clean and efficient BLAS kernel library on GPU

7,518 · 1,112 · +6

ThunderKittens

Tile primitives for speedy kernels

3,538 · 311 · +3

nccl-tests

NCCL Tests

1,589 · 390 · +3

cugraph

cuGraph - RAPIDS Graph Analytics Library

2,207 · 362 · +1

instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

17,491 · 2,067 · +1

3d-reconstruction computer-graphics computer-vision cuda function-approximation

BBuf /

how-to-optim-algorithm-in-cuda

how to optimize some algorithm in cuda.

cuopt

GPU accelerated decision optimization

972 · 209

brucefan1983 /

GPUMD

Graphics Processing Units Molecular Dynamics

811 · 197

cuda gpu gpumd heat-transport high-performance-computing

princeton-vl /

lietorch

cudf-spark-jni

RAPIDS Accelerator JNI For Apache Spark

6187

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,485 · 444

attention cuda efficient-attention inference-acceleration llm

cuCollections

#10

warp-ctc

Fast parallel CTC.

4,067 · 1,028

causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

918 · 198

cuvs

cuVS - a library for vector search and clustering on the GPU

Tue, July 14, 2026

llm.c

LLM training in simple, raw C/CUDA

30,551 · 3,699 · +10

DeepGEMM

DeepGEMM: clean and efficient BLAS kernel library on GPU

7,511 · 1,110 · +7

causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

918 · 198 · +3

ThunderKittens

Tile primitives for speedy kernels

3,535 · 310 · +3

nccl-tests

NCCL Tests

1,586 · 390 · +3

rtp-llm

#18

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

1,269 · 233 · +2

gpt inference llama llm llm-serving

siboehm /

SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

1,248 · 202 · +2

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,485 · 443 · +2

attention cuda efficient-attention inference-acceleration llm

DeepEP

DeepEP: an efficient expert-parallel communication library

9,845 · 1,320 · +2

cub

#10

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

1,839 · 462 · +1

raft

#20

1,028 · 240

anns building-blocks clustering cuda distance

mirage

#19

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

2,368 · 229

cugraph

cuGraph - RAPIDS Graph Analytics Library

2,206 · 362

cuvs

cuVS - a library for vector search and clustering on the GPU

cuCollections

warp-ctc

Fast parallel CTC.

4,067 · 1,029

nvbench

CUDA Kernel Benchmarking Library

896 · 111

rahul-goel /

fused-ssim

Lightning fast differentiable SSIM.

23382

cuopt

GPU accelerated decision optimization

974 · 209

instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

17,490 · 2,067

3d-reconstruction computer-graphics computer-vision cuda function-approximation

Mon, July 13, 2026

llm.c

LLM training in simple, raw C/CUDA

30,542 · 3,696 · +6

instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

17,490 · 2,067 · +5

3d-reconstruction computer-graphics computer-vision cuda function-approximation

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,483 · 443 · +5

attention cuda efficient-attention inference-acceleration llm

cuvs

cuVS - a library for vector search and clustering on the GPU

815 · 210 · +4

DeepGEMM

DeepGEMM: clean and efficient BLAS kernel library on GPU

7,504 · 1,106 · +3

DeepEP

DeepEP: an efficient expert-parallel communication library

9,843 · 1,318 · +2

mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

2,368 · 229 · +2

ThunderKittens

Tile primitives for speedy kernels

3,532 · 309 · +2

cuopt

GPU accelerated decision optimization

974 · 209 · +2

cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

1,838 · 462 · +2

cugraph

cuGraph - RAPIDS Graph Analytics Library

2,206 · 362 · +2

causal-conv1d

Causal depthwise conv1d in CUDA, with a PyTorch interface

915 · 198 · +1

rahul-goel /

fused-ssim

Lightning fast differentiable SSIM.

23382+1

supranational /

sppark

Zero-knowledge template library

22098+1

bls12-377 bls12-381 cuda ntt pasta-curves

warp-ctc

Fast parallel CTC.

4,067 · 1,029 · +1

nccl-tests

NCCL Tests

cuCollections

Sun, July 12, 2026

llm.c

LLM training in simple, raw C/CUDA

30,538 · 3,694 · +6

mirage

Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

2,367 · 229 · +2

instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more

17,486 · 2,067 · +2

3d-reconstruction computer-graphics computer-vision cuda function-approximation

ThunderKittens

Tile primitives for speedy kernels

3,530 · 308 · +2

BBuf /

how-to-optim-algorithm-in-cuda

how to optimize some algorithm in cuda.

raft

1,027 · 240 · +1

anns building-blocks clustering cuda distance

SageAttention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

3,478 · 442 · +1

attention cuda efficient-attention inference-acceleration llm