Webdataset: KeyError: 'png' on some datasets when streaming #6880

Open
lhoestq opened this issue May 7, 2024 · 5 comments

lhoestq (Member) commented May 7, 2024

Reported at https://huggingface.co/datasets/tbone5563/tar_images/discussions/1:

>>> from datasets import load_dataset
>>> ds = load_dataset("tbone5563/tar_images")
Downloading data: 100% 1.41G/1.41G [00:48<00:00, 17.2MB/s]
Downloading data: 100% 619M/619M [00:11<00:00, 57.4MB/s]
Generating train split: 970/0 [00:02<00:00, 534.94 examples/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1747                 _time = time.time()
-> 1748                 for key, record in generator:
   1749                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

7 frames
/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/webdataset/webdataset.py in _generate_examples(self, tar_paths, tar_iterators)
    108                 for field_name in image_field_names + audio_field_names:
--> 109                     example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
    110                 yield f"{tar_idx}_{example_idx}", example

KeyError: 'png'

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
<ipython-input-2-8e0fbb7badc9> in <cell line: 3>()
      1 from datasets import load_dataset
      2 
----> 3 ds = load_dataset("tbone5563/tar_images")

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2607 
   2608     # Download and prepare data
-> 2609     builder_instance.download_and_prepare(
   2610         download_config=download_config,
   2611         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1025                         if num_proc is not None:
   1026                             prepare_split_kwargs["num_proc"] = num_proc
-> 1027                         self._download_and_prepare(
   1028                             dl_manager=dl_manager,
   1029                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1787 
   1788     def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1789         super()._download_and_prepare(
   1790             dl_manager,
   1791             verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1120             try:
   1121                 # Prepare split will record examples associated to the split
-> 1122                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1123             except OSError as e:
   1124                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
   1625             job_id = 0
   1626             with pbar:
-> 1627                 for job_id, done, content in self._prepare_split_single(
   1628                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1629                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1782             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1783                 e = e.__context__
-> 1784             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1785 
   1786         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset
albertvillanova (Member) commented:
The error is caused by malformed basenames of the files within the TARs:

  • 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png becomes 15_Cohen_1-s2 as the grouping __key__, and 0-S0929664620300449-gr3_lrg-b.png as the additional key to be added to the example
  • whereas the intended behavior was to use 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b as the grouping __key__, and png as the additional key to be added to the example

To get the expected behavior, the basenames of the files within the TARs should be fixed so that they only contain a single dot, the one separating the file extension.
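
For illustration, here is a minimal sketch of the first-dot splitting rule described above (my own helper, not the actual code in datasets/packaged_modules/webdataset):

import os

def split_webdataset_name(member_path):
    # Split a TAR member path into (__key__, field_name): the key is
    # everything up to the FIRST dot in the basename (directories kept),
    # and the field name is everything after that first dot.
    directory, basename = os.path.split(member_path)
    key, _, field_name = basename.partition(".")
    return (os.path.join(directory, key) if directory else key), field_name

# The malformed basename from this issue: the extra dot in "1-s2.0"
# makes the split happen too early.
print(split_webdataset_name("15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png"))
# ('15_Cohen_1-s2', '0-S0929664620300449-gr3_lrg-b.png') -- no 'png' field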

albertvillanova closed this as not planned May 10, 2024
severo (Contributor) commented May 14, 2024

I'm reopening this because I think we should give a clearer error message, with a specific error code.

For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the webdataset filename structure).

(we can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there)
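
As a rough illustration of the suggestion (not a concrete proposal; the message text is mine), the loop at webdataset.py line 109 in the traceback above could check for the missing field and raise something more actionable than the bare KeyError:

# Hypothetical sketch; variable names follow the traceback above.
for field_name in image_field_names + audio_field_names:
    if field_name not in example:
        raise ValueError(
            f"Example {example['__key__']!r} has no {field_name!r} field. "
            "WebDataset groups files by the part of the filename before the "
            "FIRST dot, so extra dots in a basename (or mixing .jpg and "
            ".jpeg) can split files into unexpected groups. Rename the files "
            "inside the TAR archives so each basename has a single dot."
        )
    example[field_name] = {
        "path": example["__key__"] + "." + field_name,
        "bytes": example[field_name],
    }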

severo reopened this May 14, 2024
severo (Contributor) commented May 14, 2024

Same with .jpg: https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions

Error code:   DatasetGenerationError
Exception:    DatasetGenerationError
Message:      An error occurred while generating the dataset
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1748, in _prepare_split_single
                  for key, record in generator:
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 818, in wrapped
                  for item in generator(*args, **kwargs):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 109, in _generate_examples
                  example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
              KeyError: 'jpg'
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1316, in compute_config_parquet_and_info_response
                  parquet_operations, partial = stream_convert_to_parquet(
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 909, in stream_convert_to_parquet
                  builder._prepare_split(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1627, in _prepare_split
                  for job_id, done, content in self._prepare_split_single(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1784, in _prepare_split_single
                  raise DatasetGenerationError("An error occurred while generating the dataset") from e
              datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

severo (Contributor) commented May 14, 2024

More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)

The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name.
The last extension (i.e., the portion after the last “.”) in a file name determines the file type.

Example:
images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.json
images17/image12.right.jpg
images3/image1459.left.jpg

When reading this with a WebDataset library, you would get the following two dictionaries back in sequence:

    {"__key__": "images17/image194", "left.jpg": b"...", "right.jpg": b"...", "json": b"..."}
    {"__key__": "images17/image12", "left.jpg": b"...", "right.jpg": b"...", "json": b"..."}
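
A small sketch of the spec's grouping rule applied to that listing (my own toy code; real WebDataset readers stream TAR members in order and group consecutive files the same way):

import itertools
import os

def group_by_prefix(paths):
    # The spec's prefix rule: directories plus the basename up to the
    # FIRST dot form the grouping key; the rest is the field name.
    def key_of(path):
        directory, basename = os.path.split(path)
        return os.path.join(directory, basename.split(".", 1)[0])
    for key, members in itertools.groupby(paths, key=key_of):
        example = {"__key__": key}
        for member in members:
            example[os.path.basename(member).split(".", 1)[1]] = b"..."
        yield example

paths = [
    "images17/image194.left.jpg", "images17/image194.right.jpg",
    "images17/image194.json", "images17/image12.left.jpg",
    "images17/image12.json", "images17/image12.right.jpg",
    "images3/image1459.left.jpg",
]
for example in group_by_prefix(paths):
    print(example)
# Prints the two dictionaries above, then a third partial group for
# images3/image1459 (only "left.jpg"), since grouping is consecutive.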

severo (Contributor) commented May 14, 2024

OK, the issue is different in the latter case: some files are suffixed with .jpeg, and others with .jpg :)

Is this a limitation of the webdataset format, or of the datasets library @lhoestq? And could we give a clearer error?
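
To make the mismatch concrete, here is a small diagnostic sketch (the function name and tar path are hypothetical) that lists, per grouping key, which field names a TAR actually contains, so mixed .jpg/.jpeg suffixes show up immediately:

import tarfile
from collections import defaultdict

def fields_per_key(tar_path):
    # Map each WebDataset grouping key (path up to the first dot in the
    # basename) to the set of field names present in the archive.
    fields = defaultdict(set)
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            directory, _, basename = member.name.rpartition("/")
            key, _, field = basename.partition(".")
            fields[(directory + "/" + key) if directory else key].add(field)
    return fields

# Hypothetical usage: keys with {'jpeg', 'txt'} next to keys with
# {'jpg', 'txt'} reveal the inconsistent extensions reported here.
for key, fields in fields_per_key("train-0000.tar").items():
    print(key, sorted(fields))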
