
SnapKV support #1881

Open
icyxp opened this issue May 13, 2024 · 0 comments

Comments

icyxp commented May 13, 2024

Feature request

https://github.com/FasterDecoding/SnapKV

Motivation

SnapKV: Cache compression technique for faster LLM generation with less compute and memory

In a recent paper, the authors introduced SnapKV, a technique that efficiently compresses the key-value (KV) cache in large language models (LLMs), yielding faster generation with lower compute overhead and a smaller memory footprint. It compresses the KV cache by selecting clustered important KV positions for each attention head.
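
For reference, here is a rough sketch of the selection step as I understand it from the paper: the queries of a small observation window at the end of the prompt attend over the earlier prefix, the per-head scores are pooled so that the selected positions stay clustered, and only the top-scoring prefix positions plus the window itself are kept. The function name, tensor shapes, pooling choice, and hyperparameters (`window`, `kept`, `kernel`) below are illustrative assumptions, not the reference implementation or any existing API in this repo.

```python
import torch
import torch.nn.functional as F

def snapkv_compress(keys, values, queries, window=32, kept=1024, kernel=7):
    """Sketch of SnapKV-style KV cache compression at the end of prefill.

    keys, values: [num_heads, seq_len, head_dim]  -- prompt KV cache
    queries:      [num_heads, seq_len, head_dim]  -- prompt queries
    window:       observation window at the end of the prompt
    kept:         number of prefix positions to keep per head
    kernel:       pooling size used to keep selected positions clustered
    """
    num_heads, seq_len, head_dim = keys.shape
    if seq_len <= window + kept:
        return keys, values  # prompt is short enough, nothing to compress

    # 1) Attention of the last `window` queries over the earlier prefix.
    obs_q = queries[:, -window:, :]                      # [H, W, D]
    prefix_k = keys[:, : seq_len - window, :]            # [H, P, D]
    attn = torch.softmax(
        obs_q @ prefix_k.transpose(-1, -2) / head_dim ** 0.5, dim=-1
    )                                                    # [H, W, P]

    # 2) Aggregate over the window, then pool so neighbours are kept together.
    scores = attn.sum(dim=1)                             # [H, P]
    scores = F.max_pool1d(
        scores.unsqueeze(1), kernel, stride=1, padding=kernel // 2
    ).squeeze(1)

    # 3) Keep the top-`kept` prefix positions per head, plus the window itself.
    idx = scores.topk(kept, dim=-1).indices.sort(dim=-1).values  # [H, kept]
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    prefix_v = values[:, : seq_len - window, :]
    new_k = torch.cat([prefix_k.gather(1, idx), keys[:, -window:, :]], dim=1)
    new_v = torch.cat([prefix_v.gather(1, idx), values[:, -window:, :]], dim=1)
    return new_k, new_v
```

The selection happens once after prefill; decoding then proceeds against the smaller per-head cache, which is where the compute and memory savings come from.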

Your contribution

I'm not sure how much work this would involve or whether it is feasible at all (notably whether it stays compatible with sharding and adapters). I'd gladly read any insights on the complexity and the relevance of adding this feature.
