SVM connector training fails #764

Open
8 of 12 tasks
YaYaB opened this issue Jul 28, 2020 · 2 comments


@YaYaB
Contributor

YaYaB commented Jul 28, 2020

Configuration

  • Version of DeepDetect:
    • Locally compiled on:
      • Ubuntu 14.04 LTS
      • Mac OSX
      • Other:
    • Docker
    • Amazon AMI
  • Commit (shown by the server when starting):
    073e9a1

Your question / the problem you're facing:

This issue is related to #761, and more precisely to its fix, #762.
The fix resolved the inference issue, but it is now impossible for me to train a model using an svm connector.
I have attached some random data in case you want to reproduce the problem: just download the archive and extract it.
In the following sections, PATH_MODEL denotes the path where the model is stored.
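
For readers without the attachment, comparable random data can be generated locally. The following is a minimal sketch, assuming the svm connector reads the usual libsvm text layout (`label index:value ...` with 1-based indices). The function name, feature count, and row counts are hypothetical and are not taken from the attached archive.

```python
import random


def make_svm_lines(n_rows, n_features, n_classes=2, seed=0):
    """Return n_rows lines in libsvm layout: '<label> <index>:<value> ...'."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_rows):
        label = rng.randrange(n_classes)
        # 1-based feature indices, as is conventional for libsvm files
        feats = " ".join(
            f"{i}:{rng.random():.4f}" for i in range(1, n_features + 1)
        )
        lines.append(f"{label} {feats}")
    return lines


def write_svm_file(path, lines):
    """Write the generated lines to path, one sample per line."""
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
```

For instance, `write_svm_file("train.svm", make_svm_lines(100, 10, seed=1))` followed by a second call with a different seed for `test.svm` would give two small files to point the train call at.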

Error message (if any) / steps to reproduce the problem:

Let us first create the service we will use to train:

  • list of API calls:
    curl -X PUT "http://localhost:8082/services/svm_test" -d '{
                "sname": "svm_test",
                "description": "classification model",
                "mllib": "caffe",
                "type": "supervised",
                "parameters": {
                        "input": {
                                "connector": "svm"
                        },
                        "mllib": {
                                "gpu": true,
                                "gpuid": 1,
                                "template": "mlp",
                                "nclasses": 2,
                                "ntargets": null,
                                "layers": [128,64,32],
                                "activation": "relu",
                                "dropout": 0.5,
                                "regression": false,
                                "finetuning": false,
                                "db": true
                        },
                        "output":{}
                },
                "model": {
                        "repository": "PATH_MODEL/bug_svm_prediction",
                        "templates": "../templates/caffe",
                        "weights": null
                }
        }'
  • Server log output:
DeepDetect [ commit 073e9a1f5cf5a565ee91c5a9b46a6b8b3afc19f3 ]
[2020-07-29 00:15:30.030] [api] [info] Running DeepDetect HTTP server on localhost:8082
[2020-07-29 00:16:02.615] [svm_test] [info] Using GPU 1
ETC.
ETC.
[2020-07-29 00:16:03.168] [svm_test] [info] instantiating model template mlp
[2020-07-29 00:16:03.168] [svm_test] [info] source=../templates/caffe/mlp/
[2020-07-29 00:16:03.168] [svm_test] [info] dest=PATH_MODEL/mlp.prototxt
[2020-07-29 00:16:03.170] [api] [info] 127.0.0.1 "PUT /services/svm_test" 201 556

Now we can try launching a training with an older version of DD (commit caaeb78).
With this version, the training starts as expected.

  • list of API calls:
curl -X POST "http://127.0.0.1:8082/train" -d '{
                "service": "svm_test",
                "async": false,
                "data": [
                        "PATH_MODEL/data/train.svm",
                        "PATH_MODEL/data/test.svm"
                ],
                "parameters":{
                        "input": {
                                "db": true
                        },
                        "mllib": {
                                "gpu": true,
                                "resume": false,
                                "ignore_label": null,
                                "solver": {
                                        "iterations": 1000,
                                        "snapshot": 500,
                                        "snapshot_prefix": null,
                                        "solver_type": "ADAM",
                                        "test_interval": 100,
                                        "test_initialization": false,
                                        "lr_policy": "step",
                                        "base_lr": 0.001,
                                        "gamma": 0.1,
                                        "stepsize": 100,
                                        "momentum": 0.9,
                                        "weight_decay": 0.00001,
                                        "power": null,
                                        "iter_size": 1
                                },
                                "net": {
                                        "batch_size": 1,
                                        "test_batch_size": 1
                                }
                        },
                        "output": {
                                "best": 2,
                                "measure": ["accp", "mcll", "f1", "mcc"]
                        }
                }
        }'
  • Server log output:
[2020-07-29 00:25:17.238] [svm_test] [info] Net total flops=10560 / total params=10560
[2020-07-29 00:25:17.238] [svm_test] [info] detected network type is classification
[2020-07-29 00:25:17.238] [caffe] [info] Opened lmdb PATH_MODEL/bug_svm_prediction/test.lmdb
[2020-07-29 00:25:17.244] [api] [info] 127.0.0.1 "POST /train" 201 1297
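
Since the same request is sent to two different builds of the server, it can help to script it. The sketch below rebuilds a trimmed version of the train payload from this report and posts it with the standard library. `build_train_payload` and `post_train` are hypothetical helper names, and the solver block is shortened for readability, so this is not a byte-for-byte copy of the curl call.

```python
import json
import urllib.error
import urllib.request


def build_train_payload(service, train_path, test_path):
    """Build a trimmed /train request body modelled on the one in this report."""
    return {
        "service": service,
        "async": False,
        "data": [train_path, test_path],
        "parameters": {
            "input": {"db": True},
            "mllib": {
                "gpu": True,
                "solver": {
                    "iterations": 1000,
                    "test_interval": 100,
                    "solver_type": "ADAM",
                    "base_lr": 0.001,
                },
                "net": {"batch_size": 1, "test_batch_size": 1},
            },
            "output": {"measure": ["accp", "mcll", "f1", "mcc"]},
        },
    }


def post_train(base_url, payload):
    """POST the payload to <base_url>/train and return (status, body)."""
    req = urllib.request.Request(
        base_url + "/train",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Non-2xx answers still carry a body worth returning to the caller
        return err.code, err.read().decode("utf-8")
```

Calling `post_train("http://127.0.0.1:8082", build_train_payload("svm_test", "PATH_MODEL/data/train.svm", "PATH_MODEL/data/test.svm"))` against each build would then return the status and body to compare.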

However, with the new version (commit 073e9a1), the same training call fails almost immediately.

  • list of API calls:
curl -X POST "http://127.0.0.1:8082/train" -d '{
                "service": "svm_test",
                "async": false,
                "data": [
                        "PATH_MODEL/data/train.svm",
                        "PATH_MODEL/data/test.svm"
                ],
                "parameters":{
                        "input": {
                                "db": true
                        },
                        "mllib": {
                                "gpu": true,
                                "resume": false,
                                "ignore_label": null,
                                "solver": {
                                        "iterations": 1000,
                                        "snapshot": 500,
                                        "snapshot_prefix": null,
                                        "solver_type": "ADAM",
                                        "test_interval": 100,
                                        "test_initialization": false,
                                        "lr_policy": "step",
                                        "base_lr": 0.001,
                                        "gamma": 0.1,
                                        "stepsize": 100,
                                        "momentum": 0.9,
                                        "weight_decay": 0.00001,
                                        "power": null,
                                        "iter_size": 1
                                },
                                "net": {
                                        "batch_size": 1,
                                        "test_batch_size": 1
                                }
                        },
                        "output": {
                                "best": 2,
                                "measure": ["accp", "mcll", "f1", "mcc"]
                        }
                }
        }'
  • Server log output:
{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"./include/caffe/util/db_lmdb.hpp:15 / Check failed (custom): (mdb_status) == (0)"}}

[2020-07-29 00:20:37.494] [svm_test] [info] detected network type is classification
[2020-07-29 00:20:37.505] [svm_test] [info] Iteration 0, lr = 0.001, smoothed_loss=0.523027
[2020-07-29 00:20:37.562] [caffe] [info] Ignoring source layer prob
[2020-07-29 00:20:37.562] [svm_test] [error] Error while filling up network for testing
[2020-07-29 00:20:37.565] [svm_test] [error] training call failed
[2020-07-29 00:20:37.565] [api] [error] 127.0.0.1 "POST /train" 500 814
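
When scripting the comparison, the interesting fields of the error body can be pulled out directly. This is a small sketch that parses the response copied from the failing call. `summarize_dd_error` is a hypothetical helper, and the fallback from `dd_msg` to `msg` is an assumption about bodies that carry no DeepDetect-specific message.

```python
import json


def summarize_dd_error(body):
    """Extract the HTTP-style code, DeepDetect code and message from a status body."""
    status = json.loads(body).get("status", {})
    return {
        "code": status.get("code"),
        "dd_code": status.get("dd_code"),
        # Assumption: fall back to the generic msg when dd_msg is missing
        "message": status.get("dd_msg") or status.get("msg"),
    }


# Response body copied from the failing /train call in this report
body = (
    '{"status":{"code":500,"msg":"InternalError","dd_code":1007,'
    '"dd_msg":"./include/caffe/util/db_lmdb.hpp:15 / Check failed (custom): '
    '(mdb_status) == (0)"}}'
)
summary = summarize_dd_error(body)
print(summary["code"], summary["dd_code"], summary["message"])
```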

bug_svm_training.zip

@beniz
Collaborator

beniz commented Jul 29, 2020

Ah, see #765. I had tested in-memory training only. To add svm + db training to unit tests I had to add test_split support to SVM + db training first, so this is what #765 does now. I've left a test for in-memory svm training.

@YaYaB
Contributor Author

YaYaB commented Jul 29, 2020

It seems to work on my side.
Thanks!
