
CI Failure (key symptom) in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy #18454

Closed
vbotbuildovich opened this issue May 13, 2024 · 6 comments

vbotbuildovich commented May 13, 2024

https://buildkite.com/redpanda/vtools/builds/13666

Module: rptest.tests.e2e_shadow_indexing_test
Class: ShadowIndexingWhileBusyTest
Method: test_create_or_delete_topics_while_busy
Arguments: {
    "short_retention": true,
    "cloud_storage_type": 1
}
test_id:    ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy
status:     FAIL
run time:   754.352 seconds

<BadLogLines nodes=ip-172-31-7-99(4) example="ERROR 2024-05-12 22:27:05,266 [shard 1:fetc] cloud_storage - [fiber7677 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {3954}">
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 188, in wrapped
    redpanda.raise_on_bad_logs(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1478, in raise_on_bad_logs
    lsearcher.search_logs(_searchable_nodes)
  File "/home/ubuntu/redpanda/tests/rptest/services/utils.py", line 197, in search_logs
    raise BadLogLines(bad_loglines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-7-99(4) example="ERROR 2024-05-12 22:27:05,266 [shard 1:fetc] cloud_storage - [fiber7677 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {3954}">

JIRA Link: CORE-2942

vbotbuildovich added the auto-triaged and ci-failure labels on May 13, 2024

abhijat commented May 14, 2024

The errors appear quite close to a truncation triggered by GC:

ip-172-31-7-99/redpanda.log
12309637:DEBUG 2024-05-12 22:27:05,063 [shard 1:main] cluster - ntp: {kafka/topic-mgkcebtxjr/33} - archival_metadata_stm.cc:1451 - Updating start offset, current value 0, update 5143
12320158:ERROR 2024-05-12 22:27:05,266 [shard 1:fetc] cloud_storage - [fiber7677 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {3954}
12320611:ERROR 2024-05-12 22:27:05,272 [shard 1:fetc] cloud_storage - [fiber7684 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {761}
12321547:ERROR 2024-05-12 22:27:05,287 [shard 1:fetc] cloud_storage - [fiber7663 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {2773}
12321695:ERROR 2024-05-12 22:27:05,290 [shard 1:fetc] cloud_storage - [fiber7683 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {958}


abhijat commented May 14, 2024

The condition at https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/remote_partition.cc#L581-L588 should prevent the error, because the searched-for offset is now below the log start, but it does not. Perhaps the offset returned by full_log_kafka_offset does not match.
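
For illustration, a minimal, self-contained sketch of the kind of guard being referred to; the names (query_is_below_log_start, kafka_offset) are hypothetical and simplified, not the actual remote_partition.cc code. The point is that the short-circuit only helps when the log start offset is actually known.

// Hypothetical guard, simplified: only short-circuits when the start offset is known.
#include <cstdint>
#include <iostream>
#include <optional>

using kafka_offset = int64_t;

bool query_is_below_log_start(
  std::optional<kafka_offset> log_start, kafka_offset query) {
    // A nullopt start offset silently disables the guard (see the later comments).
    return log_start.has_value() && query < *log_start;
}

int main() {
    std::cout << query_is_below_log_start(kafka_offset{5143}, 3954) << '\n'; // 1: guard fires
    std::cout << query_is_below_log_start(std::nullopt, 3954) << '\n';       // 0: guard skipped
}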


abhijat commented May 14, 2024

Debugging this in #18465; it seems to only happen on CDT and arm64.

abhijat self-assigned this on May 14, 2024

abhijat commented May 16, 2024

The sequence of events appears to be:

The start offset is advanced ahead of all segments in the manifest to satisfy retention (5204 appears to be greater than the offsets of all segments in the manifest):

INFO  2024-05-15 18:46:56,124 [shard 3:au  ] archival - [fiber105 kafka/topic-qajbludzpc/9] - ntp_archiver_service.cc:2660 - {[size_based_retention]} Advancing start offset to 5204 satisfy retention policy
DEBUG 2024-05-15 18:46:56,330 [shard 3:main] cluster - ntp: {kafka/topic-qajbludzpc/9} - archival_metadata_stm.cc:1451 - Updating start offset, current value 0, update 5204
DEBUG 2024-05-15 18:46:56,330 [shard 3:main] cluster - ntp: {kafka/topic-qajbludzpc/9} - archival_metadata_stm.cc:1458 - Start offset updated to 5204

During the query, the log start offset is now computed as nullopt; we cannot compute the kafka offset for a log offset outside the manifest's range:

DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:375 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:384 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: compute start offset
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:432 - {kafka/topic-qajbludzpc/9}: compute_start_kafka_offset_local: manifest front offset 0, front delta 0, start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:392 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: computed start offset {nullopt}
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - [fiber22 kafka/topic-qajbludzpc/9] - async_manifest_view.cc:575 - query [kafka offset: 979] is out of valid range
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:375 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:384 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: compute start offset
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:432 - {kafka/topic-qajbludzpc/9}: compute_start_kafka_offset_local: manifest front offset 0, front delta 0, start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:392 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: computed start offset {nullopt}
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:375 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:384 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: compute start offset
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:432 - {kafka/topic-qajbludzpc/9}: compute_start_kafka_offset_local: manifest front offset 0, front delta 0, start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:392 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: computed start offset {nullopt}
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - [fiber7749 kafka/topic-qajbludzpc/9] - remote_partition.cc:592 - log start offset {nullopt}, query offset 979, archive start offset -9223372036854775808, delta: -9223372036854775808, start kafka offset {nullopt}

Because the log start offset is nullopt, the condition that would avoid the error is skipped:

https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/remote_partition.cc#L581-L589
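
A simplified, hypothetical sketch of why the computation comes out as nullopt; the segment layout and helper names here are illustrative, not the real partition_manifest.cc code. Once retention advances the start offset past every segment remaining in the manifest, there is no segment to translate it against.

#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct segment {
    int64_t base_offset;      // raw log offset of the first record in the segment
    int64_t committed_offset; // raw log offset of the last record in the segment
    int64_t delta_offset;     // count of non-data batches preceding the segment
};

// Illustrative stand-in for compute_start_kafka_offset_local: translate the raw
// start offset into kafka offset space via the segment that contains it.
std::optional<int64_t> compute_start_kafka_offset(
  const std::vector<segment>& manifest, int64_t start_offset) {
    for (const auto& s : manifest) {
        if (s.base_offset <= start_offset && start_offset <= s.committed_offset) {
            return start_offset - s.delta_offset;
        }
    }
    // Retention moved the start offset past every remaining segment, so there
    // is no delta to subtract and the result is nullopt.
    return std::nullopt;
}

int main() {
    // Illustrative values only: a manifest starting at offset 0 whose data ends
    // before the advanced start offset 5204 (as in the DEBUG lines above).
    std::vector<segment> manifest{{0, 5203, 0}};
    auto start = compute_start_kafka_offset(manifest, 5204);
    std::cout << (start ? std::to_string(*start) : "nullopt") << '\n'; // prints "nullopt"
    // With a nullopt start offset the "query < start" guard never fires, so the
    // read for kafka offset 979 falls through to the spillover lookup and fails.
}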


abhijat commented May 16, 2024

We should possibly check for a full log truncation coinciding with a null kafka start offset; if the query then falls within the log segments that have been truncated, we should not report an error.
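
A rough sketch of what such a check could look like, assuming hypothetical names (classify_fetch, log_fully_truncated) rather than the real remote_partition.cc interfaces:

#include <cstdint>
#include <optional>

using kafka_offset = int64_t;

enum class read_outcome { out_of_retention, proceed_to_spillover };

// Hypothetical combined check: the existing guard plus the proposed addition.
read_outcome classify_fetch(
  std::optional<kafka_offset> start_kafka_offset,
  bool log_fully_truncated, // e.g. retention removed every segment in the manifest
  kafka_offset query) {
    if (start_kafka_offset.has_value() && query < *start_kafka_offset) {
        return read_outcome::out_of_retention; // existing guard
    }
    if (!start_kafka_offset.has_value() && log_fully_truncated) {
        // Proposed addition: the whole log was truncated, so a query into the
        // removed range is expected and should not be reported as an ERROR.
        return read_outcome::out_of_retention;
    }
    return read_outcome::proceed_to_spillover;
}

int main() {
    // A fetch for kafka offset 979 after full truncation is classified as
    // out-of-retention instead of falling through to the failing lookup.
    return classify_fetch(std::nullopt, true, 979) == read_outcome::out_of_retention ? 0 : 1;
}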

abhijat added the area/cloud-storage label on May 16, 2024
piyushredpanda commented

@nvartolomei fixed this in #18097
