Subject: Marine Geospatial Ecology Tools (MGET) help
From: "Jason Roberts" <>
To: "'Mara Schmiing'" <>
Cc: <>
Subject: RE: [mget-help] GAM and GLM
Date: Wed, 29 Apr 2009 11:47:39 -0400
Mara,
In MGET 0.7a14, I implemented two new options in the Fit GLM tool: the first
lets you specify how many points you want to label in the diagnostic plots
and the second allows you to specify the field that should be used to label
the points. If your data has a specific field that uniquely identifies the
records, you would use it for the second option. Otherwise the OBJECTID or
FID of the record will be used, as is done now.
In your message below, you raise some interesting alternative possibilities.
I'm not sure what would provide the best utility in the end. If you find
something that is particularly useful, let me know and we can see how hard
it would be to implement. For now, I hope that the simple labels will be
sufficient; these will at least allow the user to look up the records by
hand using the label.
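To make the options discussed in this thread concrete, here is a minimal sketch in plain Python. It is not MGET's implementation (MGET delegates model fitting and plotting to R), and every function name in it is hypothetical. For simplicity it computes Cook's distance for an ordinary least-squares line rather than a GLM, then shows three ways of choosing which records to report: the n largest distances, every distance above a threshold (Mara's criterion of > 1), or a fixed fraction of the sample. Labels come from an ID list when one is supplied and fall back to 1-based row numbers otherwise.

```python
def cooks_distances(x, y):
    """Cook's distance of each point for the least-squares line y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point in a one-predictor regression.
    leverages = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * h / (1.0 - h) ** 2
            for e, h in zip(residuals, leverages)]


def rank_by_distance(distances, ids=None):
    """Return (label, distance) pairs, largest distance first.

    Labels default to 1-based row numbers when no ID list is given.
    """
    if ids is None:
        ids = range(1, len(distances) + 1)
    return sorted(zip(ids, distances), key=lambda pair: pair[1], reverse=True)


def top_n(distances, n, ids=None):
    """The n records with the largest Cook's distance."""
    return rank_by_distance(distances, ids)[:n]


def above_threshold(distances, threshold=1.0, ids=None):
    """Every record whose Cook's distance exceeds the threshold."""
    return [(label, d) for label, d in rank_by_distance(distances, ids)
            if d > threshold]


def top_fraction(distances, fraction, ids=None):
    """The given fraction of records with the largest distances (at least one)."""
    count = max(1, int(fraction * len(distances)))
    return top_n(distances, count, ids)


# Made-up data: nine points on a line plus one far-off, high-leverage point.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 10]
sample_ids = ["S%d" % i for i in range(1, 11)]  # hypothetical unique ID field

distances = cooks_distances(x, y)
labelled = top_n(distances, 3, sample_ids)            # labels for a plot
flagged = above_threshold(distances, 1.0, sample_ids)  # table of candidates

for label, distance in flagged:
    print(label, round(distance, 2))
```

Because the labels travel with the records, removing the flagged samples and refitting does not change how the remaining samples are identified, which is the problem with plain row numbers described in the original question.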
Best,
Jason
-----Original Message-----
From: Mara Schmiing
[mailto:]
Sent: Monday, April 27, 2009 1:51 PM
To: Jason Roberts
Subject: RE: [mget-help] GAM and GLM
Dear Jason,
Thank you so much for your answers!
Regarding the Cook's distances, I think it might be difficult to define
how many should be labeled. With my data sometimes three were
sufficient, sometimes 6-10 would have been, but I can imagine that this
depends on each data set, especially the sample size. I used
distance >1 as the criterion for exclusion; how about the idea to identify
those samples?! Maybe not with the help of a plot and labels but via a
table?! Another idea might be to label a certain percentage of the sample
size?! Of course, I have no idea if this is possible, and again the
question would be whether e.g. 1% or even 5% is statistically justifiable?!
I am still working on the other problems I encountered and will come
back to you as soon as I have more and reliable information about this.
Thanks again for everything,
Mara
-----Original Message-----
From: Jason Roberts
[mailto:]
Sent: 24 April 2009 15:24
To: Mara Schmiing
Subject: RE: [mget-help] GAM and GLM
Hi Mara,
Just following up with you on the Cook's Distance plot.
There is a way to specify the number of outliers to label on this plot.
I could expose this parameter from the MGET Fit GAM tool. Would you like
me to do this?
Jason
-----Original Message-----
From: Jason Roberts
[mailto:]
Sent: Thursday, April 16, 2009 3:38 PM
To: 'Mara';
''
Subject: RE: [mget-help] GAM and GLM
Hi Mara,
I am the software engineer on the team here, not a statistics expert,
but I will try to answer your questions. If this doesn't help, we may be
able to pull someone else in.
I have not seen the errors you mention when fitting GAMs with MGCV using
splines with shrinkage, but I have only just started to use the
shrinkage smoothers. For more information on those, I recommend the MGCV
documentation: http://cran.r-project.org/web/packages/mgcv/mgcv.pdf, see
gam.selection discussion starting on page 39, and check out the
references at the end of the discussion. If you can reproduce the error
reliably, we can enable some additional logging output to obtain more
details. If that does not provide any clues, we can try contacting the
MGCV author, Simon Wood, directly.
My colleague, a statistics postdoc, worked with the shrinkage smoothers
last summer. She found they worked OK in some situations but exhibited
strange behavior in certain circumstances. Because she was not able to
fully understand what was going on, she ended up implementing a more
traditional model selection strategy (stepwise model selection that
minimized the UBRE score).
The gam package predates MGCV and was written by the inventors of GAMs,
principally Trevor Hastie I believe. It does not support shrinkage
smoothers. If you try to use those you should get an error, or at least
some unexpected results. My stats colleague seemed more excited about
the MGCV package, and said that Simon Wood is more actively working on
improving it, while the gam package does not seem to be under active
development.
If you are seeing Df=1 for your terms in the GAM package, I believe it
means the fitting algorithm determined that linear fits were appropriate
for your model terms. This will occur, of course, if you fit the model
without using a smoothing function. But if you do use a smoothing
function and the Df approaches 1 for all terms, you could just as well
use a GLM for that model and achieve similar results. (But remember, I
am not the stats expert here!)
Regarding the Cook's distance plot, I will see if there is a way to get
it to label more than three samples. Ideally, how many would you like it
to label?
Jason
-----Original Message-----
From: Mara
[mailto:]
Sent: Thursday, April 16, 2009 1:23 PM
To:
Subject: [mget-help] GAM and GLM
Hello!
I use MGET to predict the occurrence of species and Jason already helped
me with initial problems. I have to admit that I am a beginner in
statistical modeling and also just started using R. Please forgive me if I
thus address minor problems here (and that I wrote half a novel).
When I want to fit a GAM using the mgcv package I sometimes have
problems when adding splines with "shrinkage". (Models with exactly the
same input but without shrinking run perfectly.) I used variables
separately to see if I can track down the problem but here the models
will always run. Depending on the combination of variables I get one of
the following error messages:
RPy_RException: Error: NA/NaN/Inf in foreign function call (arg 3)
or
RPy_RException: Error: no valid set of coefficients has been found: please supply starting values
As far as I understand this normally
means there are missing values or maybe typing errors but like I said
before the input is always the same (and it works without shrinkage).
On the other hand, using the same predictors but a different response
will work perfectly even with shrinkage...
Using the gam package I don't get results at all but Df=1 for all
predictor variables:
            Df
(Intercept)  1
Depth        1
...
What does this mean?
Last but not least I have a comment/question regarding GLMs. I used
Cook's distance to identify outliers. Row numbers are used to identify
samples, and the numbers of the three samples with the highest values are
plotted. As far as I can see there is no way to identify the other
numbers/samples. Only when I remove the first three outliers (using the
"where clause") can I see the row numbers of the next three samples with
the highest values. Unfortunately, the row numbers are then no longer
identical to those in the input table, as three samples were not
considered. Identification of the proper row can be very time consuming.
Is there a way to improve this (other than changing the input table)?!
Thanks already for reading all of this! Am looking forward to help,
thanks!