Webdataset: KeyError: 'png' on some datasets when streaming #6880
Comments
The error is caused by malformed basenames of the files within the TARs:
To get the expected behavior, the basenames of the files within the TARs should be fixed so that they contain only a single dot, the one separating the file extension.
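To illustrate why extra dots lead to a missing `png` key, here is a minimal sketch. It assumes the loader splits each basename at its first dot into a sample key and a field name, which is the usual WebDataset convention; the exact code in `datasets` may differ. The helper names `split_key_and_field` and `fix_basename` are hypothetical, and replacing the extra dots with underscores is only one possible renaming scheme.

```python
import posixpath


def split_key_and_field(member_name):
    """Split a TAR member name at the FIRST dot of its basename.

    Assumption: this mirrors the WebDataset convention, where everything
    after the first dot is treated as the field name (the "extension").
    """
    dirname, basename = posixpath.split(member_name)
    key, _, field = basename.partition(".")
    return posixpath.join(dirname, key), field.lower()


def fix_basename(member_name, replacement="_"):
    """Keep only the last dot of the basename (hypothetical renaming scheme)."""
    dirname, basename = posixpath.split(member_name)
    stem, sep, ext = basename.rpartition(".")
    if not sep:
        # No dot at all: nothing to fix here.
        return member_name
    return posixpath.join(dirname, stem.replace(".", replacement) + "." + ext)


# With an extra dot, the field name is "v2.png", so a lookup of "png" fails.
print(split_key_and_field("0001.v2.png"))                # ('0001', 'v2.png')
# After renaming, the field name is the plain extension.
print(split_key_and_field(fix_basename("0001.v2.png")))  # ('0001_v2', 'png')
```

Note that this renaming changes the sample key, so every file belonging to the same sample would need to be renamed consistently.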
I'm reopening this because I think we should try to give a clearer error message with a specific error code. For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the webdataset filename structure). We can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there.
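As a rough idea of what a clearer error could look like, here is a sketch that scans a TAR for basenames with more than one dot and raises a descriptive error. It is not how `datasets` or the dataset viewer currently behave: `MalformedBasenameError`, `find_multi_dot_members` and `check_tar` are hypothetical names, and treating every multi-dot basename as an error is an assumption taken from the comment above.

```python
import io
import tarfile


class MalformedBasenameError(ValueError):
    """Hypothetical error type; not part of the `datasets` library."""


def find_multi_dot_members(fileobj):
    """Return names of regular files whose basename has more than one dot."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.rsplit("/", 1)[-1].count(".") > 1
        ]


def check_tar(fileobj):
    """Raise a descriptive error if any basename contains extra dots."""
    bad = find_multi_dot_members(fileobj)
    if bad:
        raise MalformedBasenameError(
            f"{len(bad)} file(s) have more than one dot in their basename, "
            f"e.g. {bad[0]!r}. Only the dot separating the file extension "
            "is expected."
        )


def make_demo_tar(names):
    """Build a small in-memory TAR, for demonstration only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            payload = b"x"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer
```

For example, `check_tar(make_demo_tar(["0001.png", "0002.v2.png"]))` would raise `MalformedBasenameError` and name `0002.v2.png` in the message.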
same with .jpg -> https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)
OK, the issue is different in the latter case: some files are suffixed as Is it a limitation of the webdataset format, or of the datasets library @lhoestq? And could we give a clearer error?
reported at https://huggingface.co/datasets/tbone5563/tar_images/discussions/1