Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample/frame created_at & Dataset/Sample/Frame last_updated_at #3805

Draft
wants to merge 13 commits into
base: develop
Choose a base branch
from

Conversation

swheaton
Copy link
Contributor

@swheaton swheaton commented Nov 14, 2023

What changes are proposed in this pull request?

Add created_at field to all samples and frames. It's set to "now" when added to the dataset, and reset back to None if removed from the dataset.

Add last_updated_at field to all samples and frames. It's set to now when added to the dataset, , and reset back to None if removed from the dataset.

Add last_updated_at to Dataset document which is updated whenever the document is saved.

Add Dataset.last_updated_at property which is equal to this pseudocode:

max(DatasetDocument.last_updated_at, max(sample.last_updated_at for sample in dataset.samples))

TODO

  • Ensure generated views syncing back to base dataset updates last_updated_at. Resolve in-code TODO surrounding video dataset merge.
    • Patches view
    • Frames view
    • Clips view
    • Frame patches
    • Clip frames
    • Clip frame patches
  • When a frame is updated independently of a sample, make sure sample update time is separately updated.
  • Migration plan
  • Check SaveContext
  • Test for merge_python
  • Test for view and sampleview
  • Complete audit of public mutator functions to ensure all base mutators are covered
  • Fix existing broken unit tests

CreatedAt Notes

  • Locations to add created_at were identified using as a template adding dataset id to sample and frames docs
  • Samples get a creation time only when they're added to a dataset. We could set created_at if the sample gets created outside of the dataset first, but it's easier to do it this way and there's not likely to be a meaningful distinction between when a sample was created and when added to dataset.
  • In a long add_samples() call, each batch of samples will get the same created_at time.
  • Merge for existing samples was a little tricky but if we first addField but then add it to delete_fields when we create the whenMatched pipeline, we'll make sure that it will be added on insert but ignored on merge.

Frames Notes

  • All frames are saved when you call frames.save(). This means all of their last_updated_at times are set to now, even if there was no edits made to that frame.
  • Previously no update would be made ever to the sample when an update was made to sample.frames. Now we are calling self._dataset._update_last_updated_at([self._sample_id]). This will be another call to mongo so we should definitely not be updating frames in a loop (shouldn't be doing that anyways!).
  • A side effect of this is that deferred doesn't exactly work (used for SaveContext I think?) because the ops that are gathered are for the frame collection but there's currently no way to return ops for the sample collection.

Readonly Notes

  • Writes to these fields through set_field, __setattr__, set_values(), clear_sample_field, etc., are prohibited.
  • I attempted to achieve this by creating a readonly flag for the Field, akin to the is_virtual property in virtual fields PR.

Performance Notes

  • We will add a new default index to the sample collection on the last_updated_at field.
    • Makes getting maximum sample last_updated_at time O(1).
    • A concern is on making additional overhead for each value write due to this default index (others would be by users' choice, as the other default filepath index would be rarely or never updated).
    • There is a marginal increase in time taken. For example with cifar10 dataset (10k samples), I set all values 10 times and repeated 5 times, taking the minimum. Without the index, the writes take 5.39s. With the index, the writes take 5.74s. So about a 6% increase in write time for this specific case.
      timeit.Timer('ds.set_values("foo", [random.randint(0, 100000)] * 10000)', setup=setup).repeat(5, 10)
  • Adding the last_updated_at time has a performance impact on writes. Using the above test outside this branch, the minimal value is 4.41s. Compared to the 5.74s indexed form above, this is a 28% increase in write time.
  • There will also be an additional write to the sample every time a frame is updated, but this cost will be amortized if many frames are updated at once.

How is this patch tested? If it is not, please explain why.

  • Added tests for each case in both image and video, and ran manually. Created_at is set on creation but not again. Distinct cases below:
    • add_samples()
    • clone()
    • add_collection(new_ids=True)
    • merge_samples() # both existing and new samples
    • fo.Dataset.from_dir()
  • Then tested that any manner of modifying the field throws an error.
  • Updating a sample in any of myriad of ways updates its last_updated_at time, and dataset.last_updated_at is equal.
  • Updating Dataset metadata updates its last_updated_at time.
  • If you want to test manually, following along with the test cases is best rather than pasting full here.

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

Add created_at default DateTimeField to samples and frames, which contains the time the sample was added to the dataset.

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

@swheaton swheaton mentioned this pull request Nov 14, 2023
7 tasks
Copy link

codecov bot commented Nov 14, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (223cfab) 16.16% compared to head (8063516) 15.52%.
Report is 112 commits behind head on develop.

Files Patch % Lines
...ackages/core/src/components/Actions/ActionsRow.tsx 0.00% 1 Missing ⚠️
app/packages/core/src/components/Grid/Grid.tsx 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3805      +/-   ##
===========================================
- Coverage    16.16%   15.52%   -0.65%     
===========================================
  Files          641      653      +12     
  Lines        74140    74834     +694     
  Branches       982      996      +14     
===========================================
- Hits         11988    11620     -368     
- Misses       62152    63214    +1062     
Flag Coverage Δ
app 15.52% <0.00%> (-0.65%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@swheaton swheaton marked this pull request as draft November 17, 2023 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant