Unused vector name costs storage and (maybe) memory #4112

Open
hugh2slowmo opened this issue Apr 25, 2024 · 4 comments
Labels
bug Something isn't working


@hugh2slowmo

Hello, I have a big collection whose info is shown below:

{
  "result": {
    "status": "yellow",
    "optimizer_status": "ok",
    "vectors_count": 29225918,
    "indexed_vectors_count": 19271275,
    "points_count": 23517920,
    "segments_count": 10,
    "config": {
      "params": {
        "vectors": {
          "gemini:models/embedding-001": {
            "size": 768,
            "distance": "Cosine"
          },
          "openai:text-embedding-ada-002": {
            "size": 1536,
            "distance": "Cosine",
            "on_disk": true
          }
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": 120000,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 16
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": {
        "binary": {
          "always_ram": true
        }
      }
    },
    "payload_schema": {
      "content": {
        "data_type": "text",
        "points": 23513232
      },
      "created_at": {
        "data_type": "integer",
        "points": 23513232
      }
    }
  },
  "status": "ok",
  "time": 0.000123933
}

I have two named vectors configured: "openai:text-embedding-ada-002", which is currently in use, and "gemini:models/embedding-001", which is intended for future use; I have not put any vectors into it.
What confuses me is that even with the current config, which is optimized for low memory usage (vectors and payload on_disk), this collection still consumes over 180GB of memory. Does the payload index cost memory? Or do empty vectors have a memory cost? I'd appreciate any guidance, since I couldn't find docs about this myself.
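For reference, this is roughly how the collection was created with the Python qdrant-client (the URL and collection name here are placeholders; the parameters mirror the config dump above):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Both named vectors are declared at creation time, even though only one is populated.
client.create_collection(
    collection_name="my_collection",  # placeholder name
    vectors_config={
        "gemini:models/embedding-001": models.VectorParams(
            size=768,
            distance=models.Distance.COSINE,
        ),
        "openai:text-embedding-ada-002": models.VectorParams(
            size=1536,
            distance=models.Distance.COSINE,
            on_disk=True,
        ),
    },
    on_disk_payload=True,
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)
```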
Another thing I found out is that the unused vector name costs disk storage:

find .  -name '*gemini*'|xargs du -sh -c
1.5G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_index-gemini:models
8.6G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_storage-gemini:models
332M	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_index-gemini:models
5.6G	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_storage-gemini:models
2.3G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_index-gemini:models
5.7G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_storage-gemini:models
2.1G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_index-gemini:models
3.0G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_storage-gemini:models
627M	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_index-gemini:models
5.2G	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_storage-gemini:models
217M	./a3d5f780-faed-439e-946d-f0adcd049423/vector_index-gemini:models
5.6G	./a3d5f780-faed-439e-946d-f0adcd049423/vector_storage-gemini:models
170M	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_index-gemini:models
4.7G	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_storage-gemini:models
46G	total
find .  -name '*ada-002*'|xargs du -sh -c
3.2G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_index-openai:text-embedding-ada-002
18G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_storage-openai:text-embedding-ada-002
3.0G	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_index-openai:text-embedding-ada-002
12G	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_storage-openai:text-embedding-ada-002
2.7G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_index-openai:text-embedding-ada-002
12G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_storage-openai:text-embedding-ada-002
1.4G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_index-openai:text-embedding-ada-002
5.9G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_storage-openai:text-embedding-ada-002
2.0G	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_index-openai:text-embedding-ada-002
11G	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_storage-openai:text-embedding-ada-002
3.6G	./a3d5f780-faed-439e-946d-f0adcd049423/vector_index-openai:text-embedding-ada-002
12G	./a3d5f780-faed-439e-946d-f0adcd049423/vector_storage-openai:text-embedding-ada-002
6.2M	./68edd3d7-01a6-4835-a406-af7e3e38689d/vector_storage-openai:text-embedding-ada-002
57G	./f49fb9ac-e287-4e02-b37a-0684f45a9796/vector_storage-openai:text-embedding-ada-002
3.4G	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_index-openai:text-embedding-ada-002
9.3G	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_storage-openai:text-embedding-ada-002
34M	./8ceadfdd-f6ed-4710-a39d-3c98e601e579/vector_storage-openai:text-embedding-ada-002
8.2G	./1fe48489-32da-49b2-9230-a931d378d374/vector_storage-openai:text-embedding-ada-002
161G	total
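As a back-of-the-envelope check (my own rough math, assuming plain float32 storage; it won't match exactly because of deleted vectors, index overhead, and filesystem accounting):

```python
points = 23_517_920               # points_count from the collection info above
print(points * 1536 * 4 / 2**30)  # ada-002: ~134.5 GiB, same order as the ~145G of vector_storage seen above
print(points * 768 * 4 / 2**30)   # gemini: up to ~67 GiB can be reserved even though it holds no real data
```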

Is this the normal case, or did something go wrong with the collection to cause it?
PS: I'm currently using Qdrant version 1.8.3.
Thanks!

@hugh2slowmo hugh2slowmo added the bug Something isn't working label Apr 25, 2024
@hugh2slowmo
Author

hugh2slowmo commented Apr 25, 2024

BTW, I want to know if there is any safe and efficient way to remove "gemini:models/embedding-001" from the vector schema and delete its files from storage.
Also, I scrolled through some points that do have "gemini:models/embedding-001" vector data: all of them hold a vector filled with the value 1.0. I think those might come from some crashes.

@generall
Member

Hey @hugh2slowmo, you are right: the current implementation of multiple vectors per point reserves space for all declared vectors, even if some of them are never populated.

It is possible that we will make some optimizations for corner cases like yours, but overall that is the design. I recommend using the on_disk option, so that only disk space is consumed.
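For example, with the Python client on_disk can be switched per named vector (a sketch; VectorParamsDiff needs a reasonably recent client version, so please verify against yours):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Move the named vector's storage to disk instead of RAM.
client.update_collection(
    collection_name="my_collection",  # placeholder name
    vectors_config={
        "gemini:models/embedding-001": models.VectorParamsDiff(on_disk=True),
    },
)
```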

@hugh2slowmo
Author

Thanks for your reply @generall. So in the current state, the better way to reduce memory and disk usage for a large collection is to not declare vector configs that are not actually in use when creating the collection. Am I getting that right?

@generall
Member

From the disk perspective, it doesn't matter if you have one collection or many, but not creating empty vectors will definitely help.
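If the empty vector is already declared, a possible workaround (a sketch only; collection names and batch size are placeholders, and it assumes every point actually carries the ada-002 vector) is to migrate the points into a new collection that declares only the vector in use:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

SRC, DST = "old_collection", "new_collection"  # placeholder names
USED = "openai:text-embedding-ada-002"

# The new collection declares only the vector that is actually used.
client.create_collection(
    collection_name=DST,
    vectors_config={
        USED: models.VectorParams(size=1536, distance=models.Distance.COSINE, on_disk=True),
    },
    on_disk_payload=True,
)

offset = None
while True:
    # Scroll through the source collection page by page,
    # fetching only the payload and the vector we want to keep.
    points, offset = client.scroll(
        collection_name=SRC,
        limit=256,
        offset=offset,
        with_payload=True,
        with_vectors=[USED],
    )
    if points:
        client.upsert(
            collection_name=DST,
            points=[
                models.PointStruct(
                    id=p.id,
                    payload=p.payload,
                    vector={USED: p.vector[USED]},
                )
                for p in points
            ],
        )
    if offset is None:  # no more pages left
        break
```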
