Cuda Trending
Trending Cuda repos on GitHub · last 7 days
llm.c
LLM training in simple, raw C/CUDA
rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
cuopt
GPU accelerated decision optimization
SageAttention
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.
DeepEP
DeepEP: an efficient expert-parallel communication library
mirage
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
GPUMD
Graphics Processing Units Molecular Dynamics
instant-ngp
Instant neural graphics primitives: lightning fast NeRF and more
DiffPhysDrone
Published in Nature Machine Intelligence! The first real robot (quadrotor) based on differentiable physics training.
causal-conv1d
Causal depthwise conv1d in CUDA, with a PyTorch interface
CUDALibrarySamples
CUDA Library Samples
ThunderKittens
Tile primitives for speedy kernels
cugraph
cuGraph - RAPIDS Graph Analytics Library
nccl-tests
NCCL Tests
cuvs
cuVS - a library for vector search and clustering on the GPU
CUDA-Practice
CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl