
VectorDataset sampling triples between 2 identical inputs #1649

Open
roybenhayun opened this issue Oct 12, 2023 · 5 comments
Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation

Comments


roybenhayun commented Oct 12, 2023

Description


VectorDataset seems to produce triple the number of samples when it is given a folder rather than a single shapefile.
in the example code below, the same shapefile is loaded into a VectorDataset in two ways, and the number of samples is always tripled when pointing to the folder.

the expected behavior is that VectorDataset produces identical results whether it is given a folder containing the file or the file itself.

NOTE: when a RasterDataset was used with tif or sdat layers, the results were as expected in size and identical across tif, sdat, folders, and files. however, the number of samples produced by the GridGeoSampler differed from the VectorDataset case. not opening an issue on that, but it is still worth comparing VectorDataset against RasterDataset as well.

example code

    eo_ds = Sentinel2(paths=r'S2B_MSIL1C_20230912T095609_N0509_R122_T32TQM_20230912T123044.SAFE')
    print(f"eo_ds: {eo_ds}")
    patch_size = 256

    bug_folder = r'C:\work_climate_eyes\resources\qgis\vector-dataset-bug'
    vector_folder = bug_folder + r'\shp'
    vector_shp = bug_folder + r'\shp\test_bug.shp'
    tif = bug_folder + r'\tif\test_bug_tif.tif'
    sdat = bug_folder + r'\sdat\test_bug_sdat.sdat'
    # all_mask_ds = RasterDataset(tif, transforms=preprocess_mask)  # NOTE: worth double-checking vector vs. raster as well
    all_mask_ds = VectorDataset(vector_folder, transforms=preprocess_mask)  # test with either vector_folder or vector_shp
    all_mask_ds.is_image = False
    print(f"all_mask_ds: {all_mask_ds}")

    union_ds_2 = eo_ds & all_mask_ds
    print(f"union ds 2: {union_ds_2}")
    print(f"tile_to_chips: {tile_to_chips(union_ds_2.bounds, (patch_size, patch_size))}")

    grid_sampler = GridGeoSampler(union_ds_2, size=patch_size, stride=patch_size - 5)
    dataloader = DataLoader(union_ds_2, sampler=grid_sampler, collate_fn=stack_samples)
    print(f"dataloader: {type(dataloader)}, {dataloader.sampler is None} | {dataloader.batch_sampler is None}")
    c = 0
    for sample in dataloader:
        c += 1  # count the samples produced by the sampler
    print(f"total images sampled: {c} images")

example output

folder with Shapefile

    all_mask_ds = VectorDataset(vector_folder, transforms=preprocess_mask)  # all
eo_ds: Sentinel2 Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=699960.0, maxx=809760.0, miny=4590240.0, maxy=4700040.0, mint=1694501769.0, maxt=1694501769.999999)
    size: 1
all_mask_ds: VectorDataset Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=0.0, maxt=9.223372036854776e+18)
    **size: 3**
Converting VectorDataset res from 0.0001 to 10
union ds 2: IntersectionDataset Dataset
    type: IntersectionDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=1694501769.0, maxt=1694501769.999999)
    **size: 3**
**tile_to_chips: (30, 31)**
dataloader: <class 'torch.utils.data.dataloader.DataLoader'>, False | False
**total images sampled: 48 images**

actual Shapefile

    all_mask_ds = VectorDataset(vector_shp, transforms=preprocess_mask)  # all
eo_ds: Sentinel2 Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=699960.0, maxx=809760.0, miny=4590240.0, maxy=4700040.0, mint=1694501769.0, maxt=1694501769.999999)
    size: 1
all_mask_ds: VectorDataset Dataset
    type: GeoDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=0.0, maxt=9.223372036854776e+18)
    **size: 1**
Converting VectorDataset res from 0.0001 to 10
union ds 2: IntersectionDataset Dataset
    type: IntersectionDataset
    bbox: BoundingBox(minx=715632.8043691834, maxx=723452.1472648822, miny=4686555.833013168, maxy=4694146.540151682, mint=1694501769.0, maxt=1694501769.999999)
    **size: 1**
**tile_to_chips: (30, 31)**
**total images sampled: 16 images**

Steps to reproduce

  1. create a shapefile, e.g. with QGIS
  2. create a GridGeoSampler
  3. create a VectorDataset mask pointing to the folder containing the shapefile
  4. check the dataset size
  5. check the number of samples produced by the sampler

repeat the steps, but in step 3 point the mask at the shapefile itself.
compare the results of steps 4 and 5.

see example code in the description.

Version

torchgeo 0.5.0, QGIS version 3.18.3-Zürich

@adamjstewart
Collaborator

Can you upload an example shapefile and image so this is easier to reproduce?

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Oct 12, 2023
@roybenhayun
Author

roybenhayun commented Oct 12, 2023

> Can you upload an example shapefile and image so this is easier to reproduce?

  1. download Sentinel2 product from ESA Hub: https://scihub.copernicus.eu/dhus/odata/v1/Products('480ea1ba-007e-41ee-8867-045b75e3b21c')/$value
    then unzip the product

  2. use the example layers attached
    vector-dataset-bug.zip

then modify the two relevant lines in the example code to point to the right location:

    eo_ds = Sentinel2(paths=r'S2B_MSIL1C_20230912T095609_N0509_R122_T32TQM_20230912T123044.SAFE')
    bug_folder = r'C:\work_climate_eyes\resources\qgis\vector-dataset-bug'

@roybenhayun
Author

to also check the raster dataset, uncomment and move the relevant line:

    # all_mask_ds = RasterDataset(tif, transforms=preprocess_mask)  # NOTE: worth double-checking vector vs. raster as well

then compare the results across tif, sdat, and shp.

@adamjstewart
Collaborator

Thanks for the data, I was able to reproduce your issue with the following MRE:

    from torchgeo.datasets import VectorDataset

    ds = VectorDataset("path/to/shp")
    print(len(ds))

The rest of the above code isn't needed to reproduce the issue.

The cause of the problem: your shp directory contains multiple files that fiona is able to read, so VectorDataset adds all of them. RasterDataset/VectorDataset were never intended to be instantiated directly by users, and I don't think there's anything in our docs that suggests using them this way.

The solution to your problem: you should really subclass VectorDataset and create a new dataset that sets filename_glob appropriately. Something like:

class MyDataset(VectorDataset):
    filename_glob = "*.shp"

Hopefully that helps, and let me know if our docs aren't clear about this.

@roybenhayun
Author

roybenhayun commented Oct 13, 2023

thanks Adam, now I can clearly see this is a user mistake.
technically, perhaps we can close this issue and open a new issue on documentation.

here are some thoughts about documentation:
foremost, the answer hits the nail on the head: it shows why documentation and tutorials should be a focus for upcoming releases. reading the docs from a user's perspective shows the documentation has to be much clearer and more proactive.

  • clearer means: not only a short technical note, but mostly what should be done. a brief mention is not sufficient; users need more elaborate, descriptive documentation.
  • proactive means: explanation and guidance on the right thing to do and/or what should be done. and above all, many more tutorials, with cross references from the relevant sections to those tutorials.

as a specific example, the following could help prevent wrong usage of these classes:

  1. the top of the datasets documentation page should be clearer (see 'clearer' above): more elaborate, more descriptive.
  2. the base classes section should also be more elaborate. at a minimum it should point to the tutorial on custom raster datasets. it's a good tutorial, and it needs to be referenced right where a reader may need it.
  3. the specific section, e.g. on VectorDataset, should be more elaborate and proactively say "users should extend this class", again with a cross reference to a tutorial. this section must also describe which inputs can be handled, e.g. a folder containing multiple files vs. a single file, and which GIS file types (ESRI shapefiles, sdat, tif, geojson, etc.), perhaps in a table. some comments explaining this exist, but it has to be much clearer.
  4. as a general note on clarity and emphasis: consider using terms like "extend" and "inherit" more often, rather than "define" or "write". the message will be clearer.

overall, the documentation needs to be looked at from the reader's point of view.
developers mostly need code examples.
the docs should allow easy navigation with cross references.

HTH.

@adamjstewart adamjstewart added the documentation Improvements or additions to documentation label Oct 14, 2023