coredump/segfault when training depending on db status #748

Open
7 tasks
dgtlmoon opened this issue Jun 29, 2020 · 10 comments

@dgtlmoon
Contributor

dgtlmoon commented Jun 29, 2020

Configuration

  • Version of DeepDetect:
    • Locally compiled on:
      • Ubuntu 14.04 LTS
      • Mac OSX
      • Other:
    • [x] Docker: Docker version 19.03.1, build 74b1e89
    • Amazon AMI
  • Commit (shown by the server when starting): 88e93254ead67a8032166e98af3e46837fbba039

16 GB GPU, 32 GB RAM; nvidia-smi seems to work fine.

Your question / the problem you're facing:

DeepDetect [ commit 88e9325 ]

I can segfault the server when training, depending on whether db is absent or set to false.

My end goal is to fine-tune a VGG16 and use those weights in simsearch to see if it performs better in my project. I found this bug while trying to solve a "solver creation exception" error (still haven't solved it, though!).

Error message (if any) / steps to reproduce the problem:

  • list of API calls:
nvidia-docker  run --name dd_tags -d -p 8080:8080 -v "`pwd`":/tags_dataset jolibrain/deepdetect_gpu


rm models/vgg16/*txt models/vgg16/*json models/vgg16/model* models/vgg16/*proto*
rm -rf models/vgg16/*lmdb
sleep 1

curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
  "mllib": "caffe",
  "description": "tag detector vgg16",
  "type": "supervised",
  "parameters": {
    "input": {
      "connector": "image",
       "width":224,
       "height":224
    },
    "mllib": {
      "finetuning": true,
       "nclasses":3,
      "weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel",
      "template": "vgg_16"
    }
  },
  "model": {
    "templates": "../templates/caffe/",
    "repository": "/tags_dataset/models/vgg16",
    "create_repository": true
  }
}'


sleep 2

curl -X POST "http://localhost:8080/train" -d '{
  "service": "tag_detect_vgg16",
  "async": true,
  "parameters": {
    "input": {
      "connector": "image",
      "test_split": 0.1,
      "shuffle": true,
      "width": 224,
      "height": 224
    },
    "mllib": {
      "gpu": true,
      "resume": false,
      "net": {
        "batch_size": 2
      },
      "solver": {
        "iterations": 80000,
        "test_interval": 500,
        "snapshot": 1000,
        "solver_type": "RMSPROP",
        "base_lr": 0.001
      },
      "noise": {"all_effects": true, "prob": 0.001},
      "distort": {"all_effects": true, "prob": 0.01},
      "bbox": true
    },
    "output": {
      "measure": ["map", "map_1", "map_2", "map_3"]
    }
  },
  "data": [
    "/tags_dataset/train.txt",
    "/tags_dataset/test.txt"
  ]
}'

Here's the segfault - would be really nice to get a better error here

[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault

If I include db: true, for example:

curl -X POST "http://localhost:8080/train" -d '{
    "service": "tag_detect_vgg16",
    "async": true,
    "parameters": {
	"input": {
	    "shuffle": true,
	    "db": true,

then I get a different error:

[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=223
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] input db = true
[2020-06-29 19:49:06.031] [caffe] [info] Initializing solver from parameters: 
[2020-06-29 19:49:06.032] [caffe] [info] Creating training net specified in net_param.
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-29 19:49:06.032] [caffe] [info] Initializing net from parameters: 
[2020-06-29 19:49:06.032] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-29 19:49:06.033] [caffe] [info] Creating Layer data
[2020-06-29 19:49:06.033] [caffe] [info] data -> data
[2020-06-29 19:49:06.033] [caffe] [info] data -> label
[2020-06-29 19:49:06.033] [caffe] [info] Opening file 

(end of dd logs)

    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}

(It would be nice to get a better message here too, but this comes from the Caffe layer, right? Any clues?)

  • Server log output:
[2020-06-29 19:43:31.612] [api] [info] Running DeepDetect HTTP server on 0.0.0.0:8080
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] instantiating model template vgg_16
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] source=../templates/caffe//vgg_16/
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] dest=/tags_dataset/models/vgg16/vgg_16.prototxt
[2020-06-29 19:43:40.209] [api] [info] 172.17.0.1 "PUT /services/tag_detect_vgg16" 201 758
[2020-06-29 19:43:52.394] [caffe] [info] Opening file /tags_dataset/test.txt
[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault
@dgtlmoon
Contributor Author

dgtlmoon commented Jun 29, 2020

I believe this is caused by

      "db_width": 224,
      "db_height": 224

missing in the "parameters": {.. "input": { part of the /train call
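
To make the hypothesis above easier to test, here is a minimal Python sketch that assembles a /train body which always carries db, db_width and db_height together. The helper name build_train_request is hypothetical and not part of DeepDetect; the field names are taken from the calls in this thread. Whether setting these fields actually avoids the crash is unconfirmed.

```python
import json


def build_train_request(service, data, db=True, width=224, height=224):
    """Assemble a /train body that sets db, db_width and db_height together.

    Hypothetical helper for illustration; only the JSON field names come
    from the API calls shown in this issue.
    """
    return {
        "service": service,
        "async": True,
        "parameters": {
            "input": {
                "connector": "image",
                "db": db,
                "db_width": width,
                "db_height": height,
            },
        },
        "data": list(data),
    }


request_body = build_train_request(
    "tag_detect_vgg16",
    ["/tags_dataset/train.txt", "/tags_dataset/test.txt"],
)

# The printed JSON can be passed to curl with -d, as in the calls above.
print(json.dumps(request_body, indent=2))
```

Calling it with db=False gives the other variant, so both cases can be reproduced from one place.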

@dgtlmoon
Contributor Author

dgtlmoon commented Jun 29, 2020

note, I still get "dd_msg": "solver creation exception" when I create the service with

  "parameters": {
    "input": {
      "connector": "image",
       "width":224,
       "height":224

and I have in the /train call...

    "parameters": {
        "input": {
          "db_width": 224,
           "db_height": 224,
           "db": true

And the following will cause a segfault

    "parameters": {
        "input": {
          "db_width": 224,
           "db_height": 224,
           "db": false

@dgtlmoon
Contributor Author

#742 maybe related?

dgtlmoon changed the title from "coredump/segfault when training" to "coredump/segfault when training depending on db status" on Jun 29, 2020
@dgtlmoon
Contributor Author

dgtlmoon commented Jun 29, 2020

Note: https://www.deepdetect.com/server/docs/train-image-classifier/ is missing the db: true part as well.
The solver creation exception was first mentioned at https://gitter.im/beniz/deepdetect?at=5ec942692c49c45f5a994000

@dgtlmoon
Contributor Author

Same segfault with both the jolibrain/deepdetect_cpu and jolibrain/deepdetect_gpu Docker images.

@beniz
Collaborator

beniz commented Jun 30, 2020

Hi, thanks for the thorough report, since we try to eliminate any possible crash. We'll investigate the crash; however, I believe your issue with training the model lies elsewhere, see below:

  • "weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel": you should set "weights": "VGG_ILSVRC_16_layers.caffemodel" and copy the file VGG_ILSVRC_16_layers.caffemodel into the model directory beforehand

  • Training an image classifier requires one directory per class, with images in each directory, and you want to pass the directory to the data field:

"data": [
	"/path/to/directories/"
    ]

see the full documentation here: https://www.deepdetect.com/server/docs/train-image-classifier/

  • Training an image classifier requires db to be set to true
  • Your measure field is wrong, see the doc pointer above, it should have everything you need.

Let us know how this goes.
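
The one-directory-per-class layout described above can be checked before calling /train. The following is a minimal Python sketch, not a DeepDetect utility: class_counts is a hypothetical helper, and the class names cats and dogs are made up for the demo, which runs on a throwaway temporary tree.

```python
import tempfile
from pathlib import Path


def class_counts(root):
    """Map each class sub-directory of `root` to the number of files in it."""
    root = Path(root)
    return {
        entry.name: sum(1 for item in entry.iterdir() if item.is_file())
        for entry in sorted(root.iterdir())
        if entry.is_dir()
    }


# Demo on a temporary tree with two made-up classes.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_images in (("cats", 2), ("dogs", 3)):
        class_dir = Path(tmp) / name
        class_dir.mkdir()
        for i in range(n_images):
            (class_dir / f"img_{i}.jpg").touch()
    counts = class_counts(tmp)

print(counts)
```

An empty result, or a class with zero files, suggests the path passed in the data field does not match the expected layout.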

beniz self-assigned this on Jun 30, 2020
@dgtlmoon
Contributor Author

dgtlmoon commented Jun 30, 2020

@beniz

  1. I'm training (fine-tuning) an object detector, not a classifier, but the important part here is to have a fine-tuned model for simsearch. (This follows your recommendation to use VGG16 for simsearch; I found the results could be a lot better, I assume with fine-tuning. I also need to extract image objects from the scene.)
  2. "Training an image classifier requires db to be set to true": that's what this bug is about, I know that :)
  3. I believe my measure field is correct because I'm training for object detection (fine-tuning using object detection, is that right?), but I also tried the example values and nothing changed; I still get the exception.
  4. https://www.deepdetect.com/server/docs/train-image-classifier/ is still missing db: true
  5. I have the VGG16 caffemodel in the right place:
dd@44b35786ec21:/opt/deepdetect/build/main$ ls -al /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
-rwxrwxrwx 1 dd dd 553432081 Mar 22  2019 /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel

Using the following still does not fix my solver creation exception

curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
  "mllib": "caffe",
  "description": "tag detector vgg16",
  "type": "supervised",
  "parameters": {
    "input": {
      "connector": "image",
      "width": 224,
      "height": 224
    },
    "mllib": {
      "finetuning": true,
       "nclasses": 3,
      "template": "vgg_16",
      "weights" : "VGG_ILSVRC_16_layers.caffemodel"
    }
  },
  "model": {
    "templates": "../templates/caffe/",
    "repository": "/tags_dataset/models/vgg16"
  }
}'
docker logs dd_tags


sleep 3

curl -X POST "http://localhost:8080/train" -d '{
  "service": "tag_detect_vgg16",
  "async": true,
  "parameters": {
    "input": {
      "db": true,
      "connector": "image",
      "db_width": 224,
      "db_height": 224
    },
    "mllib": {
      "gpu": true,
      "mirror": true,
      "net": {
        "batch_size": 2
      },
      "solver": {
        "iterations": 80000,
        "test_interval": 500,
        "snapshot": 1000,
        "solver_type": "RMSPROP",
        "base_lr": 0.0001
      },
      "noise": {"all_effects": true, "prob": 0.001},
      "distort": {"all_effects": true, "prob": 0.01},
      "bbox": true
    },
    "output": {
      "measure": ["acc", "mcll", "f1"]
    }
  },
  "data": [
    "/tags_dataset/train.txt",
    "/tags_dataset/test.txt"
  ]
}'

[2020-06-30 08:48:03.860] [tag_detect_vgg16] [info] Using pre-trained weights from /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 553432081
[2020-06-30 08:48:04.493] [caffe] [info] Attempting to upgrade input file specified using deprecated V1LayerParameter: /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[2020-06-30 08:48:05.335] [caffe] [info] Successfully upgraded file specified using deprecated V1LayerParameter
[2020-06-30 08:48:05.470] [caffe] [info] Ignoring source layer fc8
[2020-06-30 08:48:05.499] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=2229
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] input db = true
[2020-06-30 08:48:05.500] [caffe] [info] Initializing solver from parameters: 
[2020-06-30 08:48:05.500] [caffe] [info] Creating training net specified in net_param.
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-30 08:48:05.500] [caffe] [info] Initializing net from parameters: 
[2020-06-30 08:48:05.500] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-30 08:48:05.500] [caffe] [info] Creating Layer data
[2020-06-30 08:48:05.501] [caffe] [info] data -> data
[2020-06-30 08:48:05.501] [caffe] [info] data -> label
[2020-06-30 08:48:05.501] [caffe] [info] Opening file 

(end)

{
    "status": {
        "code": 200,
        "msg": "OK"
    },
    "head": {
        "method": "/train",
        "job": 1,
        "status": "error"
    },
    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}
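
Note that the response above has an outer status code of 200 while the job itself failed, so a script polling the job has to look at head.status and body.Error rather than the outer code. A minimal Python sketch of that check follows; train_error is a hypothetical helper, and the only job status it relies on is the "error" value shown in this response.

```python
import json


def train_error(response_text):
    """Return dd_msg if the response reports a failed job, else None."""
    response = json.loads(response_text)
    if response.get("head", {}).get("status") != "error":
        return None
    return response.get("body", {}).get("Error", {}).get("dd_msg")


# The response body quoted above, embedded so the sketch is self-contained.
sample = """
{
    "status": {"code": 200, "msg": "OK"},
    "head": {"method": "/train", "job": 1, "status": "error"},
    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}
"""

message = train_error(sample)
print(message)
```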

@beniz
Collaborator

beniz commented Jun 30, 2020

You can't train a detector with vgg16. Simsearch by default uses a classification model. Look at https://www.deepdetect.com/applications/img_simsearch/

@dgtlmoon
Contributor Author

@beniz OK, great, so it sounds like I was confused by your recommendation to use VGG16 with my object detector :). In that case, I have about 8 classes of images to train on, with ~100,000 images in each class.

So then I'll use the fine-tuned output of the trained VGG16 image classifier as my simsearch model? Does that sound right? Thanks!

@dgtlmoon
Contributor Author

And use squeezenet as the object detector, chained with my vgg16 finetuned model for imgsearch
