
CI Failure (key symptom) in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy #18454

Closed
vbotbuildovich opened this issue May 13, 2024 · 6 comments

vbotbuildovich commented May 13, 2024

https://buildkite.com/redpanda/vtools/builds/13666

Module: rptest.tests.e2e_shadow_indexing_test
Class: ShadowIndexingWhileBusyTest
Method: test_create_or_delete_topics_while_busy
Arguments: {
    "short_retention": true,
    "cloud_storage_type": 1
}
test_id:    ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy
status:     FAIL
run time:   754.352 seconds

<BadLogLines nodes=ip-172-31-7-99(4) example="ERROR 2024-05-12 22:27:05,266 [shard 1:fetc] cloud_storage - [fiber7677 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {3954}">
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 188, in wrapped
    redpanda.raise_on_bad_logs(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1478, in raise_on_bad_logs
    lsearcher.search_logs(_searchable_nodes)
  File "/home/ubuntu/redpanda/tests/rptest/services/utils.py", line 197, in search_logs
    raise BadLogLines(bad_loglines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-7-99(4) example="ERROR 2024-05-12 22:27:05,266 [shard 1:fetc] cloud_storage - [fiber7677 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {3954}">

JIRA Link: CORE-2942

vbotbuildovich added the auto-triaged and ci-failure labels on May 13, 2024

abhijat commented May 14, 2024

The errors appear quite close to a truncation triggered by GC:

ip-172-31-7-99/redpanda.log
12309637:DEBUG 2024-05-12 22:27:05,063 [shard 1:main] cluster - ntp: {kafka/topic-mgkcebtxjr/33} - archival_metadata_stm.cc:1451 - Updating start offset, current value 0, update 5143
12320158:ERROR 2024-05-12 22:27:05,266 [shard 1:fetc] cloud_storage - [fiber7677 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {3954}
12320611:ERROR 2024-05-12 22:27:05,272 [shard 1:fetc] cloud_storage - [fiber7684 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {761}
12321547:ERROR 2024-05-12 22:27:05,287 [shard 1:fetc] cloud_storage - [fiber7663 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {2773}
12321695:ERROR 2024-05-12 22:27:05,290 [shard 1:fetc] cloud_storage - [fiber7683 kafka/topic-mgkcebtxjr/33] - remote_partition.cc:626 - Failed to query spillover manifests: cloud_storage::error_outcome:10, query: {958}


abhijat commented May 14, 2024

The condition at https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/remote_partition.cc#L581-L588 should prevent the error, because the searched-for offset is now below the log start, but it does not. Perhaps the offset returned by full_log_kafka_offset does not match.
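
For illustration, a minimal, self-contained sketch of the kind of guard being referred to; the names (query_is_below_log_start, kafka_offset) are hypothetical and simplified, not the actual remote_partition.cc code. The point is that the short-circuit only helps when the log start offset is actually known.

// Hypothetical guard, simplified: only short-circuits when the start offset is known.
#include <cstdint>
#include <iostream>
#include <optional>

using kafka_offset = int64_t;

bool query_is_below_log_start(
  std::optional<kafka_offset> log_start, kafka_offset query) {
    // A nullopt start offset silently disables the guard (see the later comments).
    return log_start.has_value() && query < *log_start;
}

int main() {
    std::cout << query_is_below_log_start(kafka_offset{5143}, 3954) << '\n'; // 1: guard fires
    std::cout << query_is_below_log_start(std::nullopt, 3954) << '\n';       // 0: guard skipped
}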


abhijat commented May 14, 2024

Debugging this in #18465; it seems to only happen on CDT and arm64.

abhijat self-assigned this on May 14, 2024

abhijat commented May 16, 2024

The sequence of events appears to be:

The start offset is advanced ahead of all segments in the manifest to satisfy retention (5204 appears to be greater than the offsets of all segments in the manifest):

INFO  2024-05-15 18:46:56,124 [shard 3:au  ] archival - [fiber105 kafka/topic-qajbludzpc/9] - ntp_archiver_service.cc:2660 - {[size_based_retention]} Advancing start offset to 5204 satisfy retention policy
DEBUG 2024-05-15 18:46:56,330 [shard 3:main] cluster - ntp: {kafka/topic-qajbludzpc/9} - archival_metadata_stm.cc:1451 - Updating start offset, current value 0, update 5204
DEBUG 2024-05-15 18:46:56,330 [shard 3:main] cluster - ntp: {kafka/topic-qajbludzpc/9} - archival_metadata_stm.cc:1458 - Start offset updated to 5204

During the query, the log start offset is now computed as nullopt; we cannot compute the kafka offset for a log offset outside the manifest's range:

DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:375 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:384 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: compute start offset
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:432 - {kafka/topic-qajbludzpc/9}: compute_start_kafka_offset_local: manifest front offset 0, front delta 0, start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:392 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: computed start offset {nullopt}
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - [fiber22 kafka/topic-qajbludzpc/9] - async_manifest_view.cc:575 - query [kafka offset: 979] is out of valid range
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:375 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:384 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: compute start offset
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:432 - {kafka/topic-qajbludzpc/9}: compute_start_kafka_offset_local: manifest front offset 0, front delta 0, start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:392 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: computed start offset {nullopt}
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:375 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:384 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: compute start offset
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:432 - {kafka/topic-qajbludzpc/9}: compute_start_kafka_offset_local: manifest front offset 0, front delta 0, start offset 5204
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - partition_manifest.cc:392 - {kafka/topic-qajbludzpc/9}: get_start_kafka_offset: computed start offset {nullopt}
DEBUG 2024-05-15 18:46:56,333 [shard 3:fetc] cloud_storage - [fiber7749 kafka/topic-qajbludzpc/9] - remote_partition.cc:592 - log start offset {nullopt}, query offset 979, archive start offset -9223372036854775808, delta: -9223372036854775808, start kafka offset {nullopt}

Because the log start offset is nullopt, the condition that would avoid the error is skipped:

https://github.com/redpanda-data/redpanda/blob/dev/src/v/cloud_storage/remote_partition.cc#L581-L589
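
A simplified, hypothetical sketch of why the computation comes out as nullopt; the segment layout and helper names here are illustrative, not the real partition_manifest.cc code. Once retention advances the start offset past every segment remaining in the manifest, there is no segment to translate it against.

#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct segment {
    int64_t base_offset;      // raw log offset of the first record in the segment
    int64_t committed_offset; // raw log offset of the last record in the segment
    int64_t delta_offset;     // count of non-data batches preceding the segment
};

// Illustrative stand-in for compute_start_kafka_offset_local: translate the raw
// start offset into kafka offset space via the segment that contains it.
std::optional<int64_t> compute_start_kafka_offset(
  const std::vector<segment>& manifest, int64_t start_offset) {
    for (const auto& s : manifest) {
        if (s.base_offset <= start_offset && start_offset <= s.committed_offset) {
            return start_offset - s.delta_offset;
        }
    }
    // Retention moved the start offset past every remaining segment, so there
    // is no delta to subtract and the result is nullopt.
    return std::nullopt;
}

int main() {
    // Illustrative values only: a manifest starting at offset 0 whose data ends
    // before the advanced start offset 5204 (as in the DEBUG lines above).
    std::vector<segment> manifest{{0, 5203, 0}};
    auto start = compute_start_kafka_offset(manifest, 5204);
    std::cout << (start ? std::to_string(*start) : "nullopt") << '\n'; // prints "nullopt"
    // With a nullopt start offset the "query < start" guard never fires, so the
    // read for kafka offset 979 falls through to the spillover lookup and fails.
}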


abhijat commented May 16, 2024

We should possibly check for a full log truncation coinciding with a null kafka start offset; if the query then falls within the log segments that have been truncated, we should not report an error.
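
A rough sketch of what such a check could look like, assuming hypothetical names (classify_fetch, log_fully_truncated) rather than the real remote_partition.cc interfaces:

#include <cstdint>
#include <optional>

using kafka_offset = int64_t;

enum class read_outcome { out_of_retention, proceed_to_spillover };

// Hypothetical combined check: the existing guard plus the proposed addition.
read_outcome classify_fetch(
  std::optional<kafka_offset> start_kafka_offset,
  bool log_fully_truncated, // e.g. retention removed every segment in the manifest
  kafka_offset query) {
    if (start_kafka_offset.has_value() && query < *start_kafka_offset) {
        return read_outcome::out_of_retention; // existing guard
    }
    if (!start_kafka_offset.has_value() && log_fully_truncated) {
        // Proposed addition: the whole log was truncated, so a query into the
        // removed range is expected and should not be reported as an ERROR.
        return read_outcome::out_of_retention;
    }
    return read_outcome::proceed_to_spillover;
}

int main() {
    // A fetch for kafka offset 979 after full truncation is classified as
    // out-of-retention instead of falling through to the failing lookup.
    return classify_fetch(std::nullopt, true, 979) == read_outcome::out_of_retention ? 0 : 1;
}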

abhijat added the area/cloud-storage label on May 16, 2024
piyushredpanda commented

@nvartolomei fixed this in #18097
