SnapKV: Cache compression technique for faster LLM generation with less compute and memory
Feature request
https://github.com/FasterDecoding/SnapKV
Motivation
In a recent paper, the authors introduced SnapKV, a novel technique that efficiently compresses the key-value (KV) cache in large language models (LLMs), yielding faster generation with lower compute overhead and a smaller memory footprint. It compresses the KV cache by selecting clustered important KV positions for each attention head.
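For reference, here is a minimal PyTorch sketch of the idea, not the authors' implementation: prefix positions are scored by the attention they receive from a recent observation window of queries, the scores are pooled so that clustered neighbouring positions are kept together, and the top positions are gathered per head. The function name `snapkv_compress` and the `window` / `capacity` / `kernel_size` parameters are illustrative assumptions; the linked repository is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def snapkv_compress(keys, values, queries, window=32, capacity=256, kernel_size=7):
    """Hypothetical SnapKV-style KV-cache compression sketch (not the official API).

    keys/values: [batch, heads, seq, dim] cached states for the prompt.
    queries:     [batch, heads, window, dim] query states of the last `window` tokens.
    Returns compressed keys/values of length at most capacity + window.
    """
    b, h, seq, d = keys.shape
    if seq <= capacity + window:
        return keys, values  # nothing to compress

    prefix_len = seq - window
    # Attention of the observation-window queries over the prefix keys.
    attn = torch.einsum("bhwd,bhsd->bhws", queries, keys[:, :, :prefix_len]) / d ** 0.5
    attn = attn.softmax(dim=-1)
    # Aggregate how much attention each prefix position receives from the window.
    scores = attn.sum(dim=2)  # [batch, heads, prefix_len]
    # Pooling clusters neighbouring important positions so selections stay contiguous.
    scores = F.avg_pool1d(scores, kernel_size, stride=1, padding=kernel_size // 2)
    # Keep the top-`capacity` prefix positions per head (sorted to preserve order).
    idx = scores.topk(capacity, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, d)  # [batch, heads, capacity, dim]
    k_sel = torch.gather(keys[:, :, :prefix_len], 2, idx)
    v_sel = torch.gather(values[:, :, :prefix_len], 2, idx)
    # Always keep the recent observation window uncompressed.
    k_out = torch.cat([k_sel, keys[:, :, prefix_len:]], dim=2)
    v_out = torch.cat([v_sel, values[:, :, prefix_len:]], dim=2)
    return k_out, v_out

# Example: 1k-token prompt, 2 heads, 64-dim heads -> compressed to 256 + 32 positions
k = torch.randn(1, 2, 1024, 64); v = torch.randn_like(k)
q_win = torch.randn(1, 2, 32, 64)
k_c, v_c = snapkv_compress(k, v, q_win)
```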
Your contribution
I'm not sure how much work this would involve or whether it is feasible at all (notably enabling sharding with adapters). I'd gladly read any insights on the complexity and the relevance of adding this feature.