Unused vector name costs storage and (maybe) memory #4112

Open
hugh2slowmo opened this issue Apr 25, 2024 · 4 comments
Labels
bug Something isn't working


@hugh2slowmo

Hello, I have a big collection whose info is shown below:

{
  "result": {
    "status": "yellow",
    "optimizer_status": "ok",
    "vectors_count": 29225918,
    "indexed_vectors_count": 19271275,
    "points_count": 23517920,
    "segments_count": 10,
    "config": {
      "params": {
        "vectors": {
          "gemini:models/embedding-001": {
            "size": 768,
            "distance": "Cosine"
          },
          "openai:text-embedding-ada-002": {
            "size": 1536,
            "distance": "Cosine",
            "on_disk": true
          }
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": 120000,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 16
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": {
        "binary": {
          "always_ram": true
        }
      }
    },
    "payload_schema": {
      "content": {
        "data_type": "text",
        "points": 23513232
      },
      "created_at": {
        "data_type": "integer",
        "points": 23513232
      }
    }
  },
  "status": "ok",
  "time": 0.000123933
}

I have two named vectors configured: "openai:text-embedding-ada-002", which is currently in use, and "gemini:models/embedding-001", which is intended for future use; I have not put any vectors into it.
What confuses me is that even with the current config, which is optimized for low memory usage (vectors and payload on_disk), this collection still consumes over 180GB of memory. Does the payload index cost memory? Or do empty vectors have a memory cost? I'd appreciate any guidance, since I couldn't find docs about this myself.
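For reference, this is roughly how the collection was created with the Python qdrant-client (the URL and collection name here are placeholders; the parameters mirror the config dump above):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Both named vectors are declared at creation time, even though only one is populated.
client.create_collection(
    collection_name="my_collection",  # placeholder name
    vectors_config={
        "gemini:models/embedding-001": models.VectorParams(
            size=768,
            distance=models.Distance.COSINE,
        ),
        "openai:text-embedding-ada-002": models.VectorParams(
            size=1536,
            distance=models.Distance.COSINE,
            on_disk=True,
        ),
    },
    on_disk_payload=True,
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)
```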
Another thing I found out is that the unused vector name costs disk storage:

find .  -name '*gemini*'|xargs du -sh -c
1.5G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_index-gemini:models
8.6G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_storage-gemini:models
332M	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_index-gemini:models
5.6G	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_storage-gemini:models
2.3G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_index-gemini:models
5.7G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_storage-gemini:models
2.1G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_index-gemini:models
3.0G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_storage-gemini:models
627M	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_index-gemini:models
5.2G	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_storage-gemini:models
217M	./a3d5f780-faed-439e-946d-f0adcd049423/vector_index-gemini:models
5.6G	./a3d5f780-faed-439e-946d-f0adcd049423/vector_storage-gemini:models
170M	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_index-gemini:models
4.7G	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_storage-gemini:models
46G	total
find .  -name '*ada-002*'|xargs du -sh -c
3.2G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_index-openai:text-embedding-ada-002
18G	./e5dd2e49-63ba-453e-a8c5-c1660205a05e/vector_storage-openai:text-embedding-ada-002
3.0G	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_index-openai:text-embedding-ada-002
12G	./00f91f9c-82e1-42e9-b27e-6b3f85121984/vector_storage-openai:text-embedding-ada-002
2.7G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_index-openai:text-embedding-ada-002
12G	./0d0d5bac-68f7-4f44-8067-b5293b6d89f0/vector_storage-openai:text-embedding-ada-002
1.4G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_index-openai:text-embedding-ada-002
5.9G	./7fce8419-baf2-4a4d-8b74-9ddf16243ec5/vector_storage-openai:text-embedding-ada-002
2.0G	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_index-openai:text-embedding-ada-002
11G	./2783ed36-f8cf-4f81-8d4f-3b9ee58c091d/vector_storage-openai:text-embedding-ada-002
3.6G	./a3d5f780-faed-439e-946d-f0adcd049423/vector_index-openai:text-embedding-ada-002
12G	./a3d5f780-faed-439e-946d-f0adcd049423/vector_storage-openai:text-embedding-ada-002
6.2M	./68edd3d7-01a6-4835-a406-af7e3e38689d/vector_storage-openai:text-embedding-ada-002
57G	./f49fb9ac-e287-4e02-b37a-0684f45a9796/vector_storage-openai:text-embedding-ada-002
3.4G	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_index-openai:text-embedding-ada-002
9.3G	./0536f81d-c847-47bf-86fc-4d145dbdda67/vector_storage-openai:text-embedding-ada-002
34M	./8ceadfdd-f6ed-4710-a39d-3c98e601e579/vector_storage-openai:text-embedding-ada-002
8.2G	./1fe48489-32da-49b2-9230-a931d378d374/vector_storage-openai:text-embedding-ada-002
161G	total
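As a back-of-the-envelope check (my own rough math, assuming plain float32 storage; it won't match exactly because of deleted vectors, index overhead, and filesystem accounting):

```python
points = 23_517_920               # points_count from the collection info above
print(points * 1536 * 4 / 2**30)  # ada-002: ~134.5 GiB, same order as the ~145G of vector_storage seen above
print(points * 768 * 4 / 2**30)   # gemini: up to ~67 GiB can be reserved even though it holds no real data
```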

Is this the normal case, or did something go wrong with the collection to cause it?
PS: I'm currently using Qdrant version 1.8.3.
Thanks!

@hugh2slowmo hugh2slowmo added the bug Something isn't working label Apr 25, 2024
@hugh2slowmo
Author

hugh2slowmo commented Apr 25, 2024

BTW, I want to know if there is any safe and efficient way to remove "gemini:models/embedding-001" from the vector schema and delete its files from storage.
Also, I scrolled through some points that do have "gemini:models/embedding-001" vector data: all of them hold a vector filled with the value 1.0. I think those might come from some crashes.

@generall
Member

Hey @hugh2slowmo, you are right: the current implementation of multiple vectors per point reserves space for all declared vectors, even if some of them are never populated.

It is possible that we will make some optimizations for corner cases like yours, but overall that is the design. I recommend using the on_disk option, so that only disk space is consumed.
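For example, with the Python client on_disk can be switched per named vector (a sketch; VectorParamsDiff needs a reasonably recent client version, so please verify against yours):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Move the named vector's storage to disk instead of RAM.
client.update_collection(
    collection_name="my_collection",  # placeholder name
    vectors_config={
        "gemini:models/embedding-001": models.VectorParamsDiff(on_disk=True),
    },
)
```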

@hugh2slowmo
Author

Thanks for your reply @generall. So in the current state, the better way to reduce memory and disk usage for a large collection is to not declare vector configs that are not actually in use when creating the collection. Am I getting that right?

@generall
Member

From the disk perspective, it doesn't matter if you have one collection or many, but not creating empty vectors will definitely help.
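If the empty vector is already declared, a possible workaround (a sketch only; collection names and batch size are placeholders, and it assumes every point actually carries the ada-002 vector) is to migrate the points into a new collection that declares only the vector in use:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

SRC, DST = "old_collection", "new_collection"  # placeholder names
USED = "openai:text-embedding-ada-002"

# The new collection declares only the vector that is actually used.
client.create_collection(
    collection_name=DST,
    vectors_config={
        USED: models.VectorParams(size=1536, distance=models.Distance.COSINE, on_disk=True),
    },
    on_disk_payload=True,
)

offset = None
while True:
    # Scroll through the source collection page by page,
    # fetching only the payload and the vector we want to keep.
    points, offset = client.scroll(
        collection_name=SRC,
        limit=256,
        offset=offset,
        with_payload=True,
        with_vectors=[USED],
    )
    if points:
        client.upsert(
            collection_name=DST,
            points=[
                models.PointStruct(
                    id=p.id,
                    payload=p.payload,
                    vector={USED: p.vector[USED]},
                )
                for p in points
            ],
        )
    if offset is None:  # no more pages left
        break
```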
