Imbalanced dataset binary classification


I am new to ML and data science, and I have a dataset with a 9:1 class imbalance for a binary classification assignment. Could you please guide me on how to handle it? Also, which classifier is best for imbalanced binary classification?

Regards.

asked Apr 8 at 10:31

Sid_Mirza

112

New contributor

Related: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
– Stephan Kolassa
Apr 8 at 19:10

machine-learning classification binary-data unbalanced-classes


1 Answer
You got off on the wrong foot by conceptualizing this as a classification problem. The fact that $Y$ is binary has nothing to do with trying to make classifications. And when the balance of $Y$ is far from 1:1 you need to think about modeling tendencies for $Y$, not modeling $Y$. In other words, the appropriate task is to estimate $P(Y=1 | X)$ using a model such as the binary logistic regression model. The logistic model is a direct probability estimator. Details may be found here and here.
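
As a minimal sketch of what estimating $P(Y=1 | X)$ can look like, assuming synthetic data with roughly 10% positives and two made-up predictors (none of this comes from the question's dataset), the following fits a binary logistic model by Newton's method in NumPy. The names `simulate`, `fit_logistic`, and `predict_proba` are hypothetical helpers written for this illustration; in real work a tested library routine would normally replace the hand-written fitter.

```python
import numpy as np


def simulate(n=5000, seed=0):
    """Synthetic data with roughly 10% positives (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    true_logit = -2.8 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
    p = 1.0 / (1.0 + np.exp(-true_logit))
    y = rng.binomial(1, p)
    return X, y


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton's method."""
    A = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        w = p * (1.0 - p)                      # Bernoulli variances
        grad = A.T @ (y - p)                   # score vector
        hess = A.T @ (A * w[:, None])          # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def predict_proba(beta, X):
    """Estimated P(Y=1 | X) for each row of X."""
    A = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-A @ beta))


X, y = simulate()
beta = fit_logistic(X, y)
p_hat = predict_proba(beta, X)
brier = np.mean((p_hat - y) ** 2)  # a proper scoring rule for probabilities

print("prevalence:", y.mean())
print("mean predicted probability:", p_hat.mean())
print("Brier score:", brier)
```

The output of the model is a probability for each observation, not a class label. With an intercept in the model, the mean predicted probability at the maximum-likelihood fit matches the observed prevalence, which serves as a quick sanity check; no rebalancing of the 9:1 data is involved.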

Once you have a validated probability model and a utility/cost/loss function you can generate optimum decisions. The probabilities help to trade off the consequences of wrong decisions.
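
To illustrate how probability estimates and a loss function combine into a decision, here is a small sketch under simplifying assumptions that are not part of the answer above: correct decisions cost nothing, acting on a case with $Y=0$ costs `cost_fp`, and failing to act on a case with $Y=1$ costs `cost_fn`. The 1:9 cost ratio and the function names are invented for illustration.

```python
def optimal_threshold(cost_fp, cost_fn):
    """Probability above which acting has the lower expected cost.

    Acting costs (1 - p) * cost_fp in expectation and not acting costs
    p * cost_fn, so acting is better when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(p, cost_fp, cost_fn):
    """Choose the action with the lower expected cost for one case."""
    expected_cost_act = (1.0 - p) * cost_fp   # wrong only if Y = 0
    expected_cost_wait = p * cost_fn          # wrong only if Y = 1
    return "act" if expected_cost_act < expected_cost_wait else "do not act"


cost_fp, cost_fn = 1.0, 9.0  # hypothetical costs
threshold = optimal_threshold(cost_fp, cost_fn)
print("threshold:", threshold)
for p in (0.05, 0.15, 0.60):
    print(p, decide(p, cost_fp, cost_fn))
```

Under these assumed costs the cutoff is 0.1, not 0.5: the threshold comes from the consequences of wrong decisions, not from the class balance, and it can change without refitting the probability model.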

answered Apr 8 at 11:59

Frank Harrell

56k3110245

Thanks, Sir Frank Harrell. The dataset consists of floating-point values, but the target is binary, as you said ('Y'). I applied linear regression, random forests, decision trees, and some ensemble methods; linear regression gave an AUC of 78.2%, whereas random forests and LightGBM performed better. Now I want to increase the AUC. Here is the list of parameters I used for LightGBM:
– Sid_Mirza
Apr 8 at 17:18

params = {"objective": "binary", "metric": "auc", "boosting": "gbdt", "max_depth": -1, "num_leaves": 13, "learning_rate": 0.01, "bagging_freq": 5, "bagging_fraction": 0.4, "feature_fraction": 0.05, "min_data_in_leaf": 80, "min_sum_hessian_in_leaf": 10, "tree_learner": "serial", "boost_from_average": "false", "bagging_seed": random_state, "verbosity": 1, "seed": random_state}
– Sid_Mirza
Apr 8 at 17:21

Is the binary target a derivation from a floating-point continuous outcome variable? If so you will need to go back to that variable and not use an information-losing dichotomization.
– Frank Harrell
Apr 9 at 4:11

Yes, all the 100+ attributes have continuous values, on the basis of which we have to classify the target as binary, either yes or no.
– Sid_Mirza
Apr 9 at 19:29

I assume by that you mean that the target originated as binary in its rawest form. You are still trying to cast the problem inappropriately as classification. You cannot do anything but estimate tendencies, nor should you. Once you have probability estimates you can make optimum decisions given the loss function.
– Frank Harrell
Apr 10 at 11:31
