coredump/segfault when training depending on db status #748

dgtlmoon · 2020-06-29T19:49:57Z

Configuration

Version of DeepDetect:
- Locally compiled on:
  - Ubuntu 14.04 LTS
  - Mac OSX
  - Other:
- [x] Docker: Docker version 19.03.1, build 74b1e89
- Amazon AMI
Commit (shown by the server when starting): 88e93254ead67a8032166e98af3e46837fbba039

16 GB GPU, 32 GB RAM; nvidia-smi seems to work fine.

Your question / the problem you're facing:

DeepDetect [ commit 88e9325 ]

I'm able to segfault the server when training, depending on the db setting: it crashes when db is absent or set to false.

My end goal is to finetune a VGG16 and use those weights in simsearch to see if it performs better in my project. I found this bug while trying to fix a "solver creation exception" error (which I still haven't solved).

Error message (if any) / steps to reproduce the problem:

list of API calls:

nvidia-docker  run --name dd_tags -d -p 8080:8080 -v "`pwd`":/tags_dataset jolibrain/deepdetect_gpu


rm models/vgg16/*txt models/vgg16/*json models/vgg16/model* models/vgg16/*proto*
rm -rf models/vgg16/*lmdb
sleep 1

curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
  "mllib": "caffe",
  "description": "tag detector vgg16",
  "type": "supervised",
  "parameters": {
    "input": {
      "connector": "image",
      "width": 224,
      "height": 224
    },
    "mllib": {
      "finetuning": true,
      "nclasses": 3,
      "weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel",
      "template": "vgg_16"
    }
  },
  "model": {
    "templates": "../templates/caffe/",
    "repository": "/tags_dataset/models/vgg16",
    "create_repository": true
  }
}'


sleep 2

curl -X POST "http://localhost:8080/train" -d '{
  "service": "tag_detect_vgg16",
  "async": true,
  "parameters": {
    "input": {
      "connector": "image",
      "test_split": 0.1,
      "shuffle": true,
      "width": 224,
      "height": 224
    },
    "mllib": {
      "gpu": true,
      "resume": false,
      "net": {
        "batch_size": 2
      },
      "solver": {
        "iterations": 80000,
        "test_interval": 500,
        "snapshot": 1000,
        "solver_type": "RMSPROP",
        "base_lr": 0.001
      },
      "noise": {"all_effects": true, "prob": 0.001},
      "distort": {"all_effects": true, "prob": 0.01},
      "bbox": true
    },
    "output": {
      "measure": [
        "map", "map_1", "map_2", "map_3"
      ]
    }
  },
  "data": [
    "/tags_dataset/train.txt",
    "/tags_dataset/test.txt"
  ]
}'

Here's the segfault (it would be really nice to get a better error here):

[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault

If I include "db": true, for example:

curl -X POST "http://localhost:8080/train" -d '{
    "service": "tag_detect_vgg16",
    "async": true,
    "parameters": {
	"input": {
	    "shuffle": true,
	    "db": true,

then I get a different error:

[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=223
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] input db = true
[2020-06-29 19:49:06.031] [caffe] [info] Initializing solver from parameters: 
[2020-06-29 19:49:06.032] [caffe] [info] Creating training net specified in net_param.
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-29 19:49:06.032] [caffe] [info] Initializing net from parameters: 
[2020-06-29 19:49:06.032] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-29 19:49:06.033] [caffe] [info] Creating Layer data
[2020-06-29 19:49:06.033] [caffe] [info] data -> data
[2020-06-29 19:49:06.033] [caffe] [info] data -> label
[2020-06-29 19:49:06.033] [caffe] [info] Opening file 

(end of dd logs)

    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}

(It would be nice to get a better message here too, but this is at the Caffe layer, right? Any clues?)

Server log output:

[2020-06-29 19:43:31.612] [api] [info] Running DeepDetect HTTP server on 0.0.0.0:8080
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] instantiating model template vgg_16
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] source=../templates/caffe//vgg_16/
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] dest=/tags_dataset/models/vgg16/vgg_16.prototxt
[2020-06-29 19:43:40.209] [api] [info] 172.17.0.1 "PUT /services/tag_detect_vgg16" 201 758
[2020-06-29 19:43:52.394] [caffe] [info] Opening file /tags_dataset/test.txt
[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault


dgtlmoon · 2020-06-29T19:54:37Z

I believe this is caused by

      "db_width": 224,
      "db_height": 224

being missing from the "parameters": { "input": { ... } } part of the /train call.
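To make the placement concrete, here is a minimal Python sketch that assembles a /train body with db, db_width and db_height inside parameters.input. The helper names `build_train_input` and `build_train_body` are hypothetical and only illustrate where the fields go; whether these fields are actually required is the poster's guess, and this sketch does not send anything to the server.

```python
import json


def build_train_input(width=224, height=224, use_db=True):
    """Return the "input" section of a /train body (hypothetical helper)."""
    return {
        "connector": "image",
        "db": use_db,
        "db_width": width,
        "db_height": height,
    }


def build_train_body(service, data, width=224, height=224, use_db=True):
    """Assemble a /train body; only the fields discussed in this thread are set."""
    return {
        "service": service,
        "async": True,
        "parameters": {"input": build_train_input(width, height, use_db)},
        "data": list(data),
    }


body = build_train_body(
    "tag_detect_vgg16",
    ["/tags_dataset/train.txt", "/tags_dataset/test.txt"],
)
# The resulting JSON string can be passed to curl with -d.
print(json.dumps(body, indent=2))
```

Building the body in one place makes it harder to forget one of the three db-related fields when switching between db true and db false.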

dgtlmoon · 2020-06-29T19:59:54Z

Note: I still get "dd_msg": "solver creation exception" when I create the service with

    "parameters": {
        "input": {
            "connector": "image",
            "width": 224,
            "height": 224

and the /train call contains

    "parameters": {
        "input": {
            "db_width": 224,
            "db_height": 224,
            "db": true

The following, however, causes a segfault:

    "parameters": {
        "input": {
            "db_width": 224,
            "db_height": 224,
            "db": false

dgtlmoon · 2020-06-29T20:01:19Z

#742 maybe related?

dgtlmoon · 2020-06-29T20:19:19Z

Note: https://www.deepdetect.com/server/docs/train-image-classifier/ is missing the "db": true part as well.
The "solver creation exception" was first mentioned at https://gitter.im/beniz/deepdetect?at=5ec942692c49c45f5a994000

dgtlmoon · 2020-06-29T23:10:48Z

Same segfault with both the jolibrain/deepdetect_cpu and jolibrain/deepdetect_gpu Docker images.

beniz · 2020-06-30T08:27:42Z

Hi, thanks for the thorough report, as we try to get around any possible crash. We'll investigate the crash; however, I believe your issue with training the model lies elsewhere, see below:

- "weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel": you should set "weights": "VGG_ILSVRC_16_layers.caffemodel" and copy the file VGG_ILSVRC_16_layers.caffemodel into the model directory beforehand.
- Training an image classifier requires one directory per class, with images in each directory, and you want to pass the directory to the data field:

  "data": [
      "/path/to/directories/"
  ]

  See the full documentation here: https://www.deepdetect.com/server/docs/train-image-classifier/
- Training an image classifier requires db to be set to true.
- Your measure field is wrong; see the doc pointer above, it should have everything you need.
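The "one directory per class" requirement can be checked before calling /train. The following is a rough Python sketch, not part of DeepDetect: the function names, the set of accepted image extensions, and the idea of comparing the directory count against nclasses are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumption: which file extensions count as images is our choice, not DeepDetect's.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def class_counts(root):
    """Map each class subdirectory name under root to its image-file count."""
    counts = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(
                1
                for f in sub.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def layout_problems(root, nclasses):
    """Return human-readable problems; an empty list means the layout looks fine."""
    counts = class_counts(root)
    problems = []
    if len(counts) != nclasses:
        problems.append(
            f"expected {nclasses} class directories, found {len(counts)}"
        )
    for name, n in counts.items():
        if n == 0:
            problems.append(f"class directory {name!r} has no images")
    return problems


# Demo on a throwaway directory: two class folders, one of them empty.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cat").mkdir()
    (Path(tmp) / "dog").mkdir()
    (Path(tmp) / "cat" / "a.jpg").touch()
    (Path(tmp) / "cat" / "b.PNG").touch()
    demo_counts = class_counts(tmp)
    demo_problems = layout_problems(tmp, nclasses=3)

print(demo_counts)
print(demo_problems)
```

Running such a check on the directory passed in the data field catches an empty or missing class folder before the server starts building its database.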

Let us know how this goes.

dgtlmoon · 2020-06-30T08:44:21Z

@beniz

- I'm training (finetuning) an object detector, not a classifier, but the important thing here is to have a finetuned model for simsearch. (This follows your recommendation to use VGG16 for simsearch; I found the results could be a lot better, I assume with finetuning. I also need to extract image objects from the scene.)
- "Training an image classifier requires db to be set to true": that's what this bug is about, I know that :)
- I believe my measure field is correct because I'm training for object detection (finetuning using object detection, is that right?), but I also tried with the example values and there is no change; I still get the exception.
- https://www.deepdetect.com/server/docs/train-image-classifier/ is still missing db: true.
- I have the VGG16 caffemodel in the right place:

dd@44b35786ec21:/opt/deepdetect/build/main$ ls -al /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
-rwxrwxrwx 1 dd dd 553432081 Mar 22  2019 /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel

Using the following still does not fix my "solver creation exception":

curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
  "mllib": "caffe",
  "description": "tag detector vgg16",
  "type": "supervised",
  "parameters": {
    "input": {
      "connector": "image",
      "width": 224,
      "height": 224
    },
    "mllib": {
      "finetuning": true,
      "nclasses": 3,
      "template": "vgg_16",
      "weights": "VGG_ILSVRC_16_layers.caffemodel"
    }
  },
  "model": {
    "templates": "../templates/caffe/",
    "repository": "/tags_dataset/models/vgg16"
  }
}'
docker logs dd_tags


sleep 3

curl -X POST "http://localhost:8080/train" -d '{
  "service": "tag_detect_vgg16",
  "async": true,
  "parameters": {
    "input": {
      "db": true,
      "connector": "image",
      "db_width": 224,
      "db_height": 224
    },
    "mllib": {
      "gpu": true,
      "mirror": true,
      "net": {
        "batch_size": 2
      },
      "solver": {
        "iterations": 80000,
        "test_interval": 500,
        "snapshot": 1000,
        "solver_type": "RMSPROP",
        "base_lr": 0.0001
      },
      "noise": {"all_effects": true, "prob": 0.001},
      "distort": {"all_effects": true, "prob": 0.01},
      "bbox": true
    },
    "output": {
      "measure": ["acc", "mcll", "f1"]
    }
  },
  "data": [
    "/tags_dataset/train.txt",
    "/tags_dataset/test.txt"
  ]
}'

[2020-06-30 08:48:03.860] [tag_detect_vgg16] [info] Using pre-trained weights from /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 553432081
[2020-06-30 08:48:04.493] [caffe] [info] Attempting to upgrade input file specified using deprecated V1LayerParameter: /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[2020-06-30 08:48:05.335] [caffe] [info] Successfully upgraded file specified using deprecated V1LayerParameter
[2020-06-30 08:48:05.470] [caffe] [info] Ignoring source layer fc8
[2020-06-30 08:48:05.499] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=2229
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] input db = true
[2020-06-30 08:48:05.500] [caffe] [info] Initializing solver from parameters: 
[2020-06-30 08:48:05.500] [caffe] [info] Creating training net specified in net_param.
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-30 08:48:05.500] [caffe] [info] Initializing net from parameters: 
[2020-06-30 08:48:05.500] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-30 08:48:05.500] [caffe] [info] Creating Layer data
[2020-06-30 08:48:05.501] [caffe] [info] data -> data
[2020-06-30 08:48:05.501] [caffe] [info] data -> label
[2020-06-30 08:48:05.501] [caffe] [info] Opening file

(end)

{
    "status": {
        "code": 200,
        "msg": "OK"
    },
    "head": {
        "method": "/train",
        "job": 1,
        "status": "error"
    },
    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}
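Note that the response above reports "status": {"code": 200} even though the job failed; the failure only shows up in head.status and body.Error. A small Python sketch for detecting this in a client script follows. The function name `train_job_error` is hypothetical, and the sketch assumes only the response shape shown in this thread.

```python
import json


def train_job_error(response):
    """Return the error message if the training job failed, else None."""
    head = response.get("head", {})
    error = response.get("body", {}).get("Error")
    if head.get("status") == "error" or error:
        error = error or {}
        # Prefer the DeepDetect-specific message when it is present.
        return error.get("dd_msg") or error.get("msg") or "unknown error"
    return None


# The response pasted above, reproduced as a JSON string.
sample = json.loads("""
{
    "status": {"code": 200, "msg": "OK"},
    "head": {"method": "/train", "job": 1, "status": "error"},
    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}
""")

print(train_job_error(sample))
```

Checking head.status rather than the top-level status code avoids treating a failed job as a success.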

beniz · 2020-06-30T09:27:00Z

You can't train a detector with vgg16. Simsearch by default uses a classification model. Look at https://www.deepdetect.com/applications/img_simsearch/

dgtlmoon · 2020-06-30T09:33:06Z

@beniz OK great, so it sounds like I was confused by your recommendation to use vgg16 with my object detector :) In that case, I have about 8 classes of images to train on, with ~100,000 images in each class.

So then I'll use the finetuned output of the trained vgg16 image classifier as my simsearch model? Does that sound right? Thanks!

dgtlmoon · 2020-06-30T10:07:48Z

And use squeezenet as the object detector, chained with my vgg16 finetuned model for imgsearch

dgtlmoon changed the title ~~coredump/segfault when training~~ coredump/segfault when training depending on db status Jun 29, 2020

beniz self-assigned this Jun 30, 2020

dgtlmoon mentioned this issue Jun 30, 2020

Docs missing db:true please fix! #750

Closed
