
VectorDataset sampling triples between 2 identical inputs #1649

Open
roybenhayun opened this issue Oct 12, 2023 · 5 comments
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation

Comments


roybenhayun commented Oct 12, 2023

Description


VectorDataset seems to produce triple the number of samples when it is given a folder rather than a single shapefile.
in the example code below, the same shapefile is loaded into a VectorDataset in two ways, and the number of samples is always tripled when pointing to the folder.

the expected behavior is that VectorDataset produces identical results whether it is given a folder containing the file or the file itself.

NOTE: when a RasterDataset was used with tif or sdat layers, the results were as expected in size and identical across tif, sdat, folders, and files. however, the number of samples produced by the GridGeoSampler differed from the VectorDataset case. not opening an issue on that, but it is still worth comparing VectorDataset against RasterDataset as well.

example code

    eo_ds = Sentinel2(paths=r'S2B_MSIL1C_20230912T095609_N0509_R122_T32TQM_20230912T123044.SAFE')
    print(f"eo_ds: {eo_ds}")
    patch_size = 256

    bug_folder = r'C:\work_climate_eyes\resources\qgis\vector-dataset-bug'
    vector_folder = bug_folder + r'\shp'
    vector_shp = bug_folder + r'\shp\test_bug.shp'
    tif = bug_folder + r'\tif\test_bug_tif.tif'
    sdat = bug_folder + r'\sdat\test_bug_sdat.sdat'
    # all_mask_ds = RasterDataset(tif, transforms=preprocess_mask)  # NOTE: worth double-checking vector vs. raster as well
    all_mask_ds = VectorDataset(vector_folder, transforms=preprocess_mask)  # test with either vector_folder or vector_shp
    all_mask_ds.is_image = False
    print(f"all_mask_ds: {all_mask_ds}")

    union_ds_2 = eo_ds & all_mask_ds
    print(f"union ds 2: {union_ds_2}")
    print(f"tile_to_chips: {tile_to_chips(union_ds_2.bounds, (patch_size, patch_size))}")

    grid_sampler = GridGeoSampler(union_ds_2, size=patch_size, stride=patch_size - 5)
    dataloader = DataLoader(union_ds_2, sampler=grid_sampler, collate_fn=stack_samples)
    print(f"dataloader: {type(dataloader)}, {dataloader.sampler is None} | {dataloader.batch_sampler is None}")
    c = 0
    for sample in dataloader:
        c += 1  # count the samples produced by the sampler
    print(f"total images sampled: {c} images")

example output

folder with Shapefile

    all_mask_ds = VectorDataset(vector_folder, transforms=preprocess_mask)  # all
eo_ds: Sentinel2 Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=699960.0, maxx=809760.0, miny=4590240.0, maxy=4700040.0, mint=1694501769.0, maxt=1694501769.999999)
    size: 1
all_mask_ds: VectorDataset Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=0.0, maxt=9.223372036854776e+18)
    **size: 3**
Converting VectorDataset res from 0.0001 to 10
union ds 2: IntersectionDataset Dataset
    type: IntersectionDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=1694501769.0, maxt=1694501769.999999)
    **size: 3**
**tile_to_chips: (30, 31)**
dataloader: <class 'torch.utils.data.dataloader.DataLoader'>, False | False
**total images sampled: 48 images**

actual Shapefile

    all_mask_ds = VectorDataset(vector_shp, transforms=preprocess_mask)  # all
eo_ds: Sentinel2 Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=699960.0, maxx=809760.0, miny=4590240.0, maxy=4700040.0, mint=1694501769.0, maxt=1694501769.999999)
    size: 1
all_mask_ds: VectorDataset Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=0.0, maxt=9.223372036854776e+18)
    **size: 1**
Converting VectorDataset res from 0.0001 to 10
union ds 2: IntersectionDataset Dataset
    type: IntersectionDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=1694501769.0, maxt=1694501769.999999)
    **size: 1**
**tile_to_chips: (30, 31)**
**total images sampled: 16 images**

Steps to reproduce

  1. create a shapefile, e.g. with QGIS
  2. create a GridGeoSampler
  3. create a VectorDataset mask pointing to the folder containing the shapefile
  4. check the dataset size
  5. check the number of samples produced by the sampler

repeat the steps, but in step 3 point the mask at the shapefile itself.
compare the results of steps 4 and 5.

see example code in the description.

Version

torchgeo 0.5.0, QGIS version 3.18.3-Zürich

@adamjstewart
Collaborator

Can you upload an example shapefile and image so this is easier to reproduce?

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Oct 12, 2023
@roybenhayun
Author

roybenhayun commented Oct 12, 2023

> Can you upload an example shapefile and image so this is easier to reproduce?

  1. download Sentinel2 product from ESA Hub: https://scihub.copernicus.eu/dhus/odata/v1/Products('480ea1ba-007e-41ee-8867-045b75e3b21c')/$value
    then unzip the product

  2. use the example layers attached
    vector-dataset-bug.zip

then modify the two relevant lines in the example code to point to the right location:

    eo_ds = Sentinel2(paths=r'S2B_MSIL1C_20230912T095609_N0509_R122_T32TQM_20230912T123044.SAFE')
    bug_folder = r'C:\work_climate_eyes\resources\qgis\vector-dataset-bug'

@roybenhayun
Author

to also check the raster dataset, uncomment and move the relevant line:

    # all_mask_ds = RasterDataset(tif, transforms=preprocess_mask)  # NOTE: worth double-checking vector vs. raster as well

then compare the results across tif, sdat, and shp.

@adamjstewart
Collaborator

Thanks for the data, I was able to reproduce your issue with the following MRE:

    from torchgeo.datasets import VectorDataset

    ds = VectorDataset("path/to/shp")
    print(len(ds))

The rest of the above code isn't needed to reproduce the issue.

The cause of the problem: your shp directory contains multiple files that fiona is able to read, so VectorDataset adds all of them. RasterDataset/VectorDataset were never intended to be instantiated directly by users, and I don't think there's anything in our docs that suggests using them this way.

The solution to your problem: you should really subclass VectorDataset and create a new dataset that sets filename_glob appropriately. Something like:

class MyDataset(VectorDataset):
    filename_glob = "*.shp"

Hopefully that helps, and let me know if our docs aren't clear about this.

@roybenhayun
Author

roybenhayun commented Oct 13, 2023

thanks Adam, now I can clearly see this is a user mistake.
technically, perhaps we can close this issue and open a new issue on documentation.

here are some thoughts about documentation:
foremost, the answer hits the nail on the head: it shows why documentation and tutorials should be a focus for upcoming releases. reading the docs from a user's perspective shows the documentation has to be much clearer and more proactive.

  • clearer means: not only a short technical note, but mostly what should be done. a brief mention is not sufficient; users need more elaborate, descriptive documentation.
  • proactive means: explanation and guidance on the right thing to do and/or what should be done. and above all, many more tutorials, with cross references from the relevant sections to those tutorials.

as a specific example, the following could help prevent wrong usage of these classes:

  1. the top of the datasets documentation page should be clearer (see 'clearer' above): more elaborate, more descriptive.
  2. the base classes section should also be more elaborate. at a minimum it should point to the tutorial on custom raster datasets. it's a good tutorial, and it needs to be referenced right where a reader may need it.
  3. the specific section, e.g. on VectorDataset, should be more elaborate and proactively say "users should extend this class", again with a cross reference to a tutorial. this section must also describe which inputs can be handled, e.g. a folder containing multiple files vs. a single file, and which GIS file types (ESRI shapefiles, sdat, tif, geojson, etc.), perhaps in a table. some comments explaining this exist, but it has to be much clearer.
  4. as a general note on clarity and emphasis: consider using terms like "extend" and "inherit" more often, rather than "define" or "write". the message will be clearer.

overall, the documentation needs to be looked at from the reader's point of view.
developers mostly need code examples.
the docs should allow easy navigation with cross references.

HTH.

@adamjstewart adamjstewart added the documentation Improvements or additions to documentation label Oct 14, 2023