Also important are the additive functions to which we will return in Chapter ??.

Definition 4.10

An additive function is a sequence \(f\) such that \(\gcd(a, b) = 1\) implies \(f(ab) = f(a) + f(b)\). A completely additive function is one where the condition \(\gcd(a, b) = 1\) is not needed.

Definition 4.11

Let \(\omega(n)\) denote the number of distinct prime divisors of \(n\) and let \(\Omega(n)\) denote the number of prime powers that are divisors of \(n\). These functions are called the prime omega functions.

So if \(n = \prod_{i=1}^{s} p_{i}^{l_{i}}\), then

\[\omega(n) = s \quad \text{and} \quad \Omega(n) = \sum_{i=1}^{s} l_{i}.\]

The additivity of \(\omega\) and the complete additivity of \(\Omega\) should be clear.
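To make the two definitions concrete, here is a minimal Python sketch that computes both functions by trial division. The helper names `prime_factorization`, `small_omega` and `big_omega` are illustrative choices, not notation from the text.

```python
def prime_factorization(n):
    """Return {prime: exponent} for an integer n >= 1, using trial division."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever remains is a single prime factor larger than sqrt(original n).
        factors[n] = factors.get(n, 0) + 1
    return factors


def small_omega(n):
    """omega(n): the number of distinct prime divisors of n."""
    return len(prime_factorization(n))


def big_omega(n):
    """Omega(n): the sum of the exponents in the prime factorization of n."""
    return sum(prime_factorization(n).values())
```

For example, `small_omega(4 * 9)` equals `small_omega(4) + small_omega(9)` because 4 and 9 are coprime, while `big_omega(12 * 18)` equals `big_omega(12) + big_omega(18)` even though 12 and 18 share factors.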

If you pick $g$ first, you can take your condition as a recursive definition of $f$. i.e.

If you disallow the null string, the only requirement on $g$ is that it is associative:

If you allow the null string, then you also require there be an element $z$ such that $g(a,z) = a = g(z, a)$, and you define $f() = z$.
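As a rough sketch of this idea, assume the condition in question is $f(s \mathbin{+\!\!+} t) = g(f(s), f(t))$ for string concatenation. Then $f$ can be built by folding $g$ over the per-character values, starting from $z$. Everything named below (`make_f`, `g_hash`, `BASE`, `MOD`, and the polynomial hash itself) is a hypothetical illustration, not the hash from the original answer.

```python
from functools import reduce


def make_f(g, z, f_char):
    """Build f from an associative g with identity z and a per-character value."""
    def f(s):
        # f("") == z, and f(s + t) == g(f(s), f(t)) because g is associative.
        return reduce(g, (f_char(ch) for ch in s), z)
    return f


# Example 1: string length, with g = addition and z = 0.
length = make_f(lambda a, b: a + b, 0, lambda ch: 1)

# Example 2 (assumed polynomial hash): pair the length with the hash so that
# the hash of a concatenation can be computed from the two halves.
BASE, MOD = 31, 1_000_003


def g_hash(a, b):
    (len_a, hash_a), (len_b, hash_b) = a, b
    return (len_a + len_b, (hash_a * pow(BASE, len_b, MOD) + hash_b) % MOD)


poly_hash = make_f(g_hash, (0, 0), lambda ch: (1, ord(ch) % MOD))
```

Here `(0, 0)` plays the role of $z$: combining it with any pair on either side returns that pair unchanged.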

Some simple examples include

The intent of the second example is for a hash function like

where computing the hash of a concatenation requires knowledge of the length of the string. So I've paired the length with the hash. Or, if you prefer, it's for the hash function

In the last section we learned rules to symbolically differentiate some elementary functions. To summarize, we have established 4 rules.

If \(f(x)=x^n\), then \(f'(x)=n x^{n-1}\), for any real number \(n\).

However, we do not yet have a rule for taking the derivative of a function as simple as \(f(x)=x+2\). Rather than producing rules for each kind of function, we wish to discover how to differentiate functions obtained by arithmetic on functions we already know how to differentiate. This would let us differentiate functions like \(f(x)=5x^3+3x^2+1\), or \(g(x)=(x+2)\,1.03^x\), or \(F(x)=\ln(x)/(x+3)\), which are built up from our elementary functions. We want rules for multiplying a known function by a constant, for adding or subtracting two known functions, and for multiplying or dividing two known functions.

### Subsection 4.2.1 Derivatives of scalar products

We start by differentiating a constant times a function.

###### Claim 4.2.1 . Scalar multiple rule.

The derivative of \(c\cdot f(x)\) is \(c\cdot f'(x)\). In other words,

\[\frac{d}{dx}\bigl[c\,f(x)\bigr] = c\,f'(x).\]

###### Example 4.2.2 . Derivatives of constants times standard functions.

Find the derivatives of the following functions:

1. \(\displaystyle f'(x)=2[e^x]'=2e^x.\)
2. \(\displaystyle g'(x)=500[(1.05)^x]'=500(1.05)^x \ln(1.05).\)
3. \(\displaystyle h'(x)=[\ln(x^7)]'=[7\ln(x)]'=7[\ln(x)]'=7/x.\)

Next we want to look at the sum or difference of two functions.

###### Claim 4.2.3 . Sum and difference rule.

The derivative of \(f(x)\pm g(x)\) is \(f'(x)\pm g'(x)\). In other words,

\[\frac{d}{dx}\bigl[f(x)\pm g(x)\bigr] = f'(x)\pm g'(x).\]

###### Example 4.2.4 . Derivatives of sums and differences of standard functions.

Find the derivatives of the following functions:

1. \(\displaystyle f(x)=5x^3+3x^2-7\)
2. \(\displaystyle g(x)=100e^x-1000(1.03)^x\)
3. \(\displaystyle h(x)=5\sqrt{x}+2/\sqrt{x}-7x^{-3}\)
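A possible worked solution, applying the scalar multiple rule and the sum and difference rule term by term. It assumes that the argument of each square root in the third function is \(x\), which the garbled source does not show explicitly.

\[
\begin{aligned}
f'(x) &= 5\,(x^3)' + 3\,(x^2)' - (7)' = 15x^2 + 6x,\\
g'(x) &= 100\,(e^x)' - 1000\,\bigl((1.03)^x\bigr)' = 100e^x - 1000\,(1.03)^x\ln(1.03),\\
h'(x) &= 5\,(x^{1/2})' + 2\,(x^{-1/2})' - 7\,(x^{-3})' = \tfrac{5}{2}x^{-1/2} - x^{-3/2} + 21x^{-4}.
\end{aligned}
\]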

The basic argument for all of our rules starts with local linearity. Recall that if \(f(x)\) is differentiable at \(x_0\), then in a region around \(x_0\) we can approximate \(f(x)\) by a linear function, \(f(x)\approx f'(x_0)(x-x_0)+f(x_0)\). To find the derivative of a scalar product, sum, difference, product, or quotient of known functions, we perform the appropriate actions on the linear approximations of those functions. We then take the coefficient of the linear term of the result.

For our first rule we are differentiating a constant times a function. Following the general method we look at how we multiply a constant times the linear approximation.
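The display that belongs here is missing from the source. A reconstruction consistent with the linear approximation \(f(x)\approx f'(x_0)(x-x_0)+f(x_0)\) would be:

\[
c\,f(x) \approx c\,\bigl[f'(x_0)(x-x_0)+f(x_0)\bigr] = c\,f'(x_0)\,(x-x_0) + c\,f(x_0).
\]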

Taking the coefficient of the linear term gives the scalar multiple rule: the derivative of a constant times a function is the constant times the derivative of the function.

Next we want to look at the sum or difference of two functions. Following the general method we look at the sum or difference of the linear approximations.
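The display is missing here as well. Under the same linear approximations for \(f\) and \(g\) at \(x_0\), the step can be sketched as:

\[
f(x)\pm g(x) \approx \bigl[f'(x_0)\pm g'(x_0)\bigr](x-x_0) + \bigl[f(x_0)\pm g(x_0)\bigr].
\]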

Taking the coefficient of the linear term gives the sum or difference rule: the derivative of a sum or difference of two functions is the sum or difference of the derivatives of the functions.

### Subsection 4.2.2 Derivatives of products

We turn our attention to the product of two functions.

###### Claim 4.2.5 . Product rule.

The derivative of \(f(x)\cdot g(x)\) is \(f'(x)g(x)+f(x)g'(x)\). In other words,

\[\frac{d}{dx}\bigl[f(x)\,g(x)\bigr] = f'(x)\,g(x) + f(x)\,g'(x).\]

Warning: Note that the derivative of a product is not the product of the derivatives!

We start with an example that we can do by multiplying before taking the derivative. This gives us a way to check that we have the rule correct.

###### Example 4.2.6 . Simple derivative of a product.

Let \(f(x)=x\) and \(g(x)=x^2\). Find the derivative of \(f(x)\cdot g(x)\).

Solution: Note that \(f(x)\cdot g(x)=x^3\). Using our rule for monomials, \((f(x)\cdot g(x))'=(x^3)'=3x^2\). Using the same rule we see \(f'(x)=1\) and \(g'(x)=2x\). We can now evaluate using the product rule:
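The evaluation itself is missing from the source. With \(f(x)=x\), \(g(x)=x^2\), \(f'(x)=1\) and \(g'(x)=2x\), it works out as:

\[
\bigl(f(x)\,g(x)\bigr)' = f'(x)\,g(x) + f(x)\,g'(x) = 1\cdot x^2 + x\cdot 2x = 3x^2 .
\]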

Both methods give the same answer. Note that the product of the derivatives is \(2x\), which is NOT the derivative of the product.

###### Example 4.2.7 . General derivatives of products.

Find the derivatives of the following functions:

Following the general rule we look at the linear term of the product of the linear approximations. Consider the product of two linear expressions.

The coefficient of the linear term is \((ad+bc)\). Thus, when we take the product

the coefficient of the linear term is
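The displays for this argument are missing from the source. As a sketch, assume the two linear expressions are written \(a+bx\) and \(c+dx\); this labelling is an assumption, chosen to agree with the coefficient \((ad+bc)\) quoted in the text.

\[
(a+bx)(c+dx) = ac + (ad+bc)\,x + bd\,x^2 .
\]

Taking \(a=f(x_0)\), \(b=f'(x_0)\), \(c=g(x_0)\), \(d=g'(x_0)\), and \((x-x_0)\) in place of \(x\), the product of the linear approximations is

\[
\bigl[f(x_0)+f'(x_0)(x-x_0)\bigr]\bigl[g(x_0)+g'(x_0)(x-x_0)\bigr],
\]

whose linear coefficient is

\[
f(x_0)\,g'(x_0) + f'(x_0)\,g(x_0),
\]

which is the product rule evaluated at \(x_0\).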

### Subsection 4.2.3 Derivatives of quotients

Finally, we turn our attention to the quotient of two functions.

###### Claim 4.2.8 . Quotient rule.

Warning: Once again, note that the derivative of a quotient is NOT the quotient of the derivatives!

###### Example 4.2.9 . Simple derivative of a quotient.

Let \(f(x)=x^2\) and \(g(x)=x\). Find the derivative of \(f(x)/g(x)\).

Solution: Note that \(f(x)/g(x)=x\). Using our rule for monomials, \((f(x)/g(x))'=(x)'=1\). Using the same rule we see \(f'(x)=2x\) and \(g'(x)=1\). We can now evaluate using the quotient rule:
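Neither the statement of the quotient rule nor this evaluation survives in the source. Assuming the standard form of the rule, \(\bigl(f/g\bigr)' = (f'g - fg')/g^2\), the computation for \(f(x)=x^2\) and \(g(x)=x\) is:

\[
\left(\frac{f(x)}{g(x)}\right)' = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2} = \frac{2x\cdot x - x^2\cdot 1}{x^2} = \frac{x^2}{x^2} = 1 \qquad (x\neq 0).
\]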

Both methods give the same answer. Note that the quotient of the derivatives is \(2x\), which is NOT the derivative of the quotient.

###### Example 4.2.10 . General derivatives of quotients.

Find the derivatives of the following functions:

Following the general method we look at the linear term of the quotient of the linear approximations. However, we need to do an algebraic trick before we can find the linear term. Consider the quotient of two linear expressions.

When \(x\) is small enough, we get a good approximation by ignoring the \(x^2\) term. In that approximation, the coefficient of the linear term is \(\frac{(cb-ad)}{c^2}\). Thus, when we take the quotient

the coefficient of the linear term is
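The displays for this argument are missing from the source. As a sketch, assume the linear expressions are written \(a+bx\) and \(c+dx\) with \(c\neq 0\); this labelling is an assumption, chosen to agree with the numerator \((cb-ad)\) quoted in the text. The algebraic trick is to multiply numerator and denominator by \(c-dx\):

\[
\frac{a+bx}{c+dx} = \frac{(a+bx)(c-dx)}{(c+dx)(c-dx)} = \frac{ac + (cb-ad)\,x - bd\,x^2}{c^2 - d^2x^2} \approx \frac{ac + (cb-ad)\,x}{c^2}.
\]

Taking \(a=f(x_0)\), \(b=f'(x_0)\), \(c=g(x_0)\), \(d=g'(x_0)\), and \((x-x_0)\) in place of \(x\), the linear coefficient of the quotient of the linear approximations is

\[
\frac{g(x_0)\,f'(x_0) - f(x_0)\,g'(x_0)}{g(x_0)^2},
\]

which is the quotient rule evaluated at \(x_0\).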

### Exercises 4.2.4 Exercises: Derivative Rules for Combinations of Functions Problems

Use the rules from the last two sections to find the derivatives of the following functions.

## Interpretable Machine Learning

Logistic regression models the probabilities for classification problems with two possible outcomes. It's an extension of the linear regression model for classification problems.

### 4.2.1 What is Wrong with Linear Regression for Classification?

The linear regression model can work well for regression, but fails for classification. Why is that? In case of two classes, you could label one of the classes with 0 and the other with 1 and use linear regression. Technically it works and most linear model programs will spit out weights for you. But there are a few problems with this approach:

A linear model does not output probabilities, but it treats the classes as numbers (0 and 1) and fits the best hyperplane (for a single feature, it is a line) that minimizes the distances between the points and the hyperplane. So it simply interpolates between the points, and you cannot interpret it as probabilities.

A linear model also extrapolates and gives you values below zero and above one. This is a good sign that there might be a smarter approach to classification.

Since the predicted outcome is not a probability, but a linear interpolation between points, there is no meaningful threshold at which you can distinguish one class from the other. A good illustration of this issue has been given on Stackoverflow.

Linear models do not extend to classification problems with multiple classes. You would have to start labeling the next class with 2, then 3, and so on. The classes might not have any meaningful order, but the linear model would force a weird structure on the relationship between the features and your class predictions. The higher the value of a feature with a positive weight, the more it contributes to the prediction of a class with a higher number, even if classes that happen to get a similar number are not closer than other classes.

FIGURE 4.5: A linear model classifies tumors as malignant (1) or benign (0) given their size. The lines show the prediction of the linear model. For the data on the left, we can use 0.5 as classification threshold. After introducing a few more malignant tumor cases, the regression line shifts and a threshold of 0.5 no longer separates the classes. Points are slightly jittered to reduce over-plotting.

### 4.2.2 Theory

A solution for classification is logistic regression. Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as:

FIGURE 4.6: The logistic function. It outputs numbers between 0 and 1. At input 0, it outputs 0.5.
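The defining equation is missing from this copy. The standard logistic function, which matches the caption above (output between 0 and 1, value 0.5 at input 0), is \(\text{logistic}(\eta) = 1/(1+\exp(-\eta))\). A minimal Python sketch follows; the branch on the sign of `eta` is an implementation choice to avoid overflow, not something taken from the text.

```python
import math


def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta)), evaluated without overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # For negative eta, use the equivalent form exp(eta) / (1 + exp(eta)).
    z = math.exp(eta)
    return z / (1.0 + z)
```

Because \(\text{logistic}(-\eta) = 1 - \text{logistic}(\eta)\), the curve is symmetric about the point \((0, 0.5)\).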

The step from linear regression to logistic regression is kind of straightforward. In the linear regression model, we have modelled the relationship between outcome and features with a linear equation:

For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the equation into the logistic function. This forces the output to assume only values between 0 and 1.
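The two equations referred to here did not survive extraction. In the usual notation, with weights \(\beta_0,\ldots,\beta_p\) and features \(x_1,\ldots,x_p\) (symbols assumed here, consistent with the \(\beta_j\) and \(x_j\) used later in this section), they would read:

\[
\hat{y} = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p
\]

\[
P(y=1) = \frac{1}{1+\exp\bigl(-(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p)\bigr)}
\]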

Let us revisit the tumor size example again. But instead of the linear regression model, we use the logistic regression model:

FIGURE 4.7: The logistic regression model finds the correct decision boundary between malignant and benign depending on tumor size. The line is the logistic function shifted and squeezed to fit the data.

Classification works better with logistic regression and we can use 0.5 as a threshold in both cases. The inclusion of additional points does not really affect the estimated curve.

### 4.2.3 Interpretation

The interpretation of the weights in logistic regression differs from the interpretation of the weights in linear regression, since the outcome in logistic regression is a probability between 0 and 1. The weights do not influence the probability linearly any longer. The weighted sum is transformed by the logistic function to a probability. Therefore we need to reformulate the equation for the interpretation so that only the linear term is on the right side of the formula.

We call the term in the log() function "odds" (probability of event divided by probability of no event) and wrapped in the logarithm it is called log odds.

This formula shows that the logistic regression model is a linear model for the log odds. Great! That does not sound helpful! With a little shuffling of the terms, you can figure out how the prediction changes when one of the features (x_j) is changed by 1 unit. To do this, we can first apply the exp() function to both sides of the equation:

Then we compare what happens when we increase one of the feature values by 1. But instead of looking at the difference, we look at the ratio of the two predictions:

We apply the following rule:
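The equations for this derivation are missing from this copy. A reconstruction of the chain of steps in standard notation, offered as a sketch rather than the author's exact formulas:

\[
\log\left(\frac{P(y=1)}{1-P(y=1)}\right) = \log\left(\frac{P(y=1)}{P(y=0)}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p
\]

\[
\text{odds} = \frac{P(y=1)}{1-P(y=1)} = \exp\bigl(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p\bigr)
\]

\[
\frac{\text{odds}_{x_j+1}}{\text{odds}} = \frac{\exp\bigl(\beta_0 + \ldots + \beta_j (x_j+1) + \ldots + \beta_p x_p\bigr)}{\exp\bigl(\beta_0 + \ldots + \beta_j x_j + \ldots + \beta_p x_p\bigr)}
\]

Using \(\exp(a)/\exp(b) = \exp(a-b)\), all terms cancel except the one involving \(\beta_j\):

\[
\frac{\text{odds}_{x_j+1}}{\text{odds}} = \exp\bigl(\beta_j (x_j+1) - \beta_j x_j\bigr) = \exp(\beta_j)
\]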

In the end, we have something as simple as exp() of a feature weight. A change in a feature by one unit changes the odds multiplicatively by a factor of \(\exp(\beta_j)\); this factor is the odds ratio. We could also interpret it this way: A change in \(x_j\) by one unit increases the log odds by the value of the corresponding weight. Most people interpret the odds ratio because thinking about the log() of something is known to be hard on the brain. Interpreting the odds ratio already requires some getting used to. For example, if you have odds of 2, it means that the probability for y=1 is twice as high as for y=0. If you have a weight (= log odds ratio) of 0.7, then increasing the respective feature by one unit multiplies the odds by exp(0.7) (approximately 2) and the odds change to about 4. But usually you do not deal with the odds and interpret the weights only as the odds ratios, because for actually calculating the odds you would need to set a value for each feature, which only makes sense if you want to look at one specific instance of your dataset.
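The numbers in this example can be checked with a few lines of Python. The helper names below are illustrative only.

```python
import math


def odds_from_probability(p):
    """Odds = probability of the event divided by probability of no event."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Inverse of odds_from_probability."""
    return odds / (1.0 + odds)


weight = 0.7                       # log odds ratio of the feature
odds_before = 2.0                  # y=1 is twice as likely as y=0
odds_ratio = math.exp(weight)      # roughly 2
odds_after = odds_before * odds_ratio   # roughly 4

p_before = probability_from_odds(odds_before)   # 2/3
p_after = probability_from_odds(odds_after)     # roughly 0.80
```

Note that the odds double while the probability moves only from about 0.67 to about 0.80, which is why the weights act multiplicatively on the odds rather than additively on the probability.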

These are the interpretations for the logistic regression model with different feature types:

• Numerical feature: If you increase the value of feature \(x_j\) by one unit, the estimated odds change by a factor of \(\exp(\beta_j)\).
• Binary categorical feature: One of the two values of the feature is the reference category (in some languages, the one encoded as 0). Changing the feature \(x_j\) from the reference category to the other category changes the estimated odds by a factor of \(\exp(\beta_j)\).
• Categorical feature with more than two categories: One solution to deal with multiple categories is one-hot-encoding, meaning that each category has its own column. You only need L-1 columns for a categorical feature with L categories, otherwise it is over-parameterized. The L-th category is then the reference category. You can use any other encoding that can be used in linear regression. The interpretation for each category then is equivalent to the interpretation of binary features.
• Intercept \(\beta_0\): When all numerical features are zero and the categorical features are at the reference category, the estimated odds are \(\exp(\beta_0)\). The interpretation of the intercept weight is usually not relevant.

### 4.2.4 Example

We use the logistic regression model to predict cervical cancer based on some risk factors. The following table shows the estimated weights, the associated odds ratios, and the standard errors of the estimates.

TABLE 4.1: The results of fitting a logistic regression model on the cervical cancer dataset. Shown are the features used in the model, their estimated weights and corresponding odds ratios, and the standard errors of the estimated weights.

| Feature | Weight | Odds ratio | Std. Error |
|---|---|---|---|
| Intercept | -2.91 | 0.05 | 0.32 |
| Hormonal contraceptives y/n | -0.12 | 0.89 | 0.30 |
| Smokes y/n | 0.26 | 1.30 | 0.37 |
| Num. of pregnancies | 0.04 | 1.04 | 0.10 |
| Num. of diagnosed STDs | 0.82 | 2.27 | 0.33 |
| Intrauterine device y/n | 0.62 | 1.86 | 0.40 |

Interpretation of a numerical feature ("Num. of diagnosed STDs"): An increase of one in the number of diagnosed STDs (sexually transmitted diseases) increases the odds of cancer vs. no cancer by a factor of 2.27, when all other features remain the same. Keep in mind that correlation does not imply causation.
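The odds ratios in Table 4.1 are simply the exponentiated weights. The short sketch below reproduces that column from the weights printed in the table; the dictionary is typed in by hand from the table, not loaded from the original dataset.

```python
import math

# Estimated weights copied from Table 4.1.
weights = {
    "Intercept": -2.91,
    "Hormonal contraceptives y/n": -0.12,
    "Smokes y/n": 0.26,
    "Num. of pregnancies": 0.04,
    "Num. of diagnosed STDs": 0.82,
    "Intrauterine device y/n": 0.62,
}

# Odds ratio = exp(weight), rounded to two decimals as in the table.
odds_ratios = {name: round(math.exp(w), 2) for name, w in weights.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:30s} {ratio:.2f}")
```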

Interpretation of a categorical feature ("Hormonal contraceptives y/n"): For women using hormonal contraceptives, the odds for cancer vs. no cancer are lower by a factor of 0.89, compared to women without hormonal contraceptives, given all other features stay the same.

Like in the linear model, the interpretations always come with the clause that 'all other features stay the same'.

Many of the pros and cons of the linear regression model also apply to the logistic regression model. Logistic regression has been widely used by many different people, but it struggles with its restrictive expressiveness (e.g. interactions must be added manually) and other models may have better predictive performance.

Another disadvantage of the logistic regression model is that the interpretation is more difficult because the interpretation of the weights is multiplicative and not additive.

Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained. This is because the weight for that feature would not converge, because the optimal weight would be infinite. This is really a bit unfortunate, because such a feature is really useful. But you do not need machine learning if you have a simple rule that separates both classes. The problem of complete separation can be solved by introducing penalization of the weights or defining a prior probability distribution of weights.

On the good side, the logistic regression model is not only a classification model, but also gives you probabilities. This is a big advantage over models that can only provide the final classification. Knowing that an instance has a 99% probability for a class compared to 51% makes a big difference.

Logistic regression can also be extended from binary classification to multi-class classification. Then it is called Multinomial Regression.

### 4.2.6 Software

I used the glm function in R for all examples. You can find logistic regression in any programming language that can be used for performing data analysis, such as Python, Java, Stata, or Matlab.

## Applied Time Series Analysis for Fisheries and Environmental Sciences

Plotting time series data is an important first step in analyzing their various components. Beyond that, however, we need a more formal means for identifying and removing characteristics such as a trend or seasonal variation. As discussed in lecture, the decomposition model reduces a time series into 3 components: trend, seasonal effects, and random errors. In turn, we aim to model the random errors as some form of stationary process.

Let’s begin with a simple, additive decomposition model for a time series \(x_t\)

\[x_t = m_t + s_t + e_t, \tag{4.1}\]

where, at time \(t\), \(m_t\) is the trend, \(s_t\) is the seasonal effect, and \(e_t\) is a random error that we generally assume to have zero-mean and to be correlated over time. Thus, by estimating and subtracting both \(m_t\) and \(s_t\) from \(x_t\), we hope to have a time series of stationary residuals \(e_t\).

### 4.2.1 Estimating trends

In lecture we discussed how linear filters are a common way to estimate trends in time series. One of the most common linear filters is the moving average, which for time lags from \(-a\) to \(a\) is defined as

\[\hat{m}_t = \sum_{k=-a}^{a} \left(\frac{1}{1+2a}\right) x_{t+k}.\]

This model works well for moving windows of odd-numbered lengths, but should be adjusted for even-numbered lengths by adding only \(\frac{1}{2}\) of the 2 most extreme lags so that the filtered value at time \(t\) lines up with the original observation at time \(t\). So, for example, in a case with monthly data such as the atmospheric CO\(_2\) concentration where a 12-point moving average would be an obvious choice, the linear filter would be

\[\hat{m}_t = \frac{\frac{1}{2}x_{t-6} + x_{t-5} + \dots + x_{t-1} + x_t + x_{t+1} + \dots + x_{t+5} + \frac{1}{2}x_{t+6}}{12}.\]

It is important to note here that our time series of the estimated trend \(\hat{m}_t\) is actually shorter than the observed time series by \(2a\) units.
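As an illustration outside of R, the centered moving average with half weights on the two extreme lags can be sketched in plain Python. The function names are hypothetical, and `None` stands in for the positions that R would report as `NA`.

```python
def moving_average_weights(period):
    """Weights of a centered moving average covering one full period."""
    if period % 2 == 1:
        return [1.0 / period] * period
    # Even period: use period + 1 points, with half weight on the two extreme lags.
    return [0.5 / period] + [1.0 / period] * (period - 1) + [0.5 / period]


def centered_filter(x, weights):
    """Apply a centered linear filter; ends without a full window are None."""
    a = len(weights) // 2
    out = [None] * len(x)
    for t in range(a, len(x) - a):
        out[t] = sum(w * x[t - a + k] for k, w in enumerate(weights))
    return out
```

For monthly data, `moving_average_weights(12)` returns 13 weights that sum to 1, so `a = 6` and the filtered series loses `2a = 12` values, six at each end.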

Conveniently, R has the built-in function filter() in the stats package for estimating moving-average (and other) linear filters. In addition to specifying the time series to be filtered, we need to pass in the filter weights (and 2 other arguments we won’t worry about here–type ?filter to get more information). The easiest way to create the filter is with the rep() function:

Now let’s get our estimate of the trend \(\hat{m}_t\) with filter() and plot it:

The trend is a more-or-less smoothly increasing function over time, the average slope of which does indeed appear to be increasing over time as well (Figure 4.3).

Figure 4.3: Time series of the estimated trend \(\hat{m}_t\) for the atmospheric CO\(_2\) concentration at Mauna Loa, Hawai’i.

### 4.2.2 Estimating seasonal effects

Once we have an estimate of the trend for time \(t\) (\(\hat{m}_t\)), we can easily obtain an estimate of the seasonal effect at time \(t\) (\(\hat{s}_t\)) by subtraction

\[\hat{s}_t = x_t - \hat{m}_t, \tag{4.4}\]

which is really easy to do in R:

This estimate of the seasonal effect for each time (t) also contains the random error (e_t) , however, which can be seen by plotting the time series and careful comparison of Equations (4.1) and (4.4).

Figure 4.4: Time series of seasonal effects plus random errors for the atmospheric CO\(_2\) concentration at Mauna Loa, Hawai’i, measured monthly from March 1958 to present.

We can obtain the overall seasonal effect by averaging the estimates of \(\hat{s}_t\) for each month and repeating this sequence over all years.

Before we create the entire time series of seasonal effects, let’s plot them for each month to see what is happening within a year:

It looks like, on average, the CO\(_2\) concentration is highest in spring (March) and lowest in summer (August) (Figure 4.5). (Aside: Do you know why this is?)

Figure 4.5: Estimated monthly seasonal effects for the atmospheric CO\(_2\) concentration at Mauna Loa, Hawai’i.

Finally, let’s create the entire time series of seasonal effects \(\hat{s}_t\):

### 4.2.3 Completing the model

The last step in completing our full decomposition model is obtaining the random errors \(\hat{e}_t\), which we can get via simple subtraction

\[\hat{e}_t = x_t - \hat{m}_t - \hat{s}_t.\]

Again, this is really easy in R:

Now that we have all 3 of our model components, let’s plot them together with the observed data \(x_t\). The results are shown in Figure 4.6.
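The three steps (trend, seasonal effect, random error) can be put together in one self-contained Python sketch. This is an illustration of the classical additive decomposition, not the R code used in the lab; the function name is hypothetical, it assumes the series covers at least two full periods, and the centering of the seasonal figure to mean zero is an assumption that mirrors what R's decompose() does.

```python
def decompose_additive(x, period=12):
    """Classical additive decomposition x_t = m_t + s_t + e_t on a list of numbers."""
    n, a = len(x), period // 2
    # Step 1: trend via a centered moving average (half weights at the
    # two extreme lags when the period is even).
    if period % 2 == 0:
        w = [0.5 / period] + [1.0 / period] * (period - 1) + [0.5 / period]
    else:
        w = [1.0 / period] * period
    trend = [None] * n
    for t in range(a, n - a):
        trend[t] = sum(wk * x[t - a + k] for k, wk in enumerate(w))
    # Step 2: seasonal effect plus error, by subtracting the trend.
    detrended = [None if m is None else xt - m for xt, m in zip(x, trend)]
    # Average the detrended values by position within the period.
    figure = []
    for pos in range(period):
        vals = [d for t, d in enumerate(detrended)
                if d is not None and t % period == pos]
        figure.append(sum(vals) / len(vals))
    mean_figure = sum(figure) / period
    figure = [f - mean_figure for f in figure]   # center to mean zero (assumption)
    seasonal = [figure[t % period] for t in range(n)]
    # Step 3: random errors, by subtracting trend and seasonal effect.
    random_part = [None if m is None else xt - m - s
                   for xt, m, s in zip(x, trend, seasonal)]
    return trend, seasonal, random_part, figure
```

On a synthetic series made of a straight line plus a repeating 12-month pattern that sums to zero, the function recovers the line as the trend, the pattern as the seasonal figure, and random errors that are numerically zero.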

Figure 4.6: Time series of the observed atmospheric CO\(_2\) concentration at Mauna Loa, Hawai’i (top) along with the estimated trend, seasonal effects, and random errors.

### 4.2.4 Using decompose() for decomposition

Now that we have seen how to estimate and plot the various components of a classical decomposition model in a piecewise manner, let’s see how to do this in one step in R with the function decompose() , which accepts a ts object as input and returns an object of class decomposed.ts.

co2_decomp is a list with the following elements, which should be familiar by now:

• x : the observed time series \(x_t\)
• seasonal : time series of estimated seasonal component \(\hat{s}_t\)
• figure : mean seasonal effect ( length(figure) == frequency(x) )
• trend : time series of estimated trend \(\hat{m}_t\)
• random : time series of random errors \(\hat{e}_t\)
• type : type of error ( "additive" or "multiplicative" )

We can easily make plots of the output and compare them to those in Figure 4.6:

Figure 4.7: Time series of the observed atmospheric CO\(_2\) concentration at Mauna Loa, Hawai’i (top) along with the estimated trend, seasonal effects, and random errors obtained with the function decompose().

The results obtained with decompose() (Figure 4.7) are identical to those we estimated previously.

## Complex functions

Sum up all elements of a list. An empty list yields zero.

This function is inappropriate for number types like Peano. Maybe we should make sum a method of Additive. This would also make lengthLeft and lengthRight superfluous.

Sum up all elements of a non-empty list. This avoids including a zero which is useful for types where no universal zero is available.

Sum the operands in an order such that the dependencies are minimized. Does this have a measurable effect on speed?
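One way to read "an order such that the dependencies are minimized" is a balanced, pairwise summation, where the chain of dependent additions has logarithmic rather than linear depth. The Python sketch below illustrates that reading only; it is not the library's implementation, and `sum_nested` is a hypothetical name.

```python
def sum_nested(xs):
    """Sum a non-empty sequence by repeatedly adding adjacent pairs."""
    xs = list(xs)
    if not xs:
        raise ValueError("sum_nested needs at least one element")
    while len(xs) > 1:
        # Add neighbours pairwise; the additions within one pass are independent.
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2 == 1:
            paired.append(xs[-1])   # carry an unpaired last element forward
        xs = paired
    return xs[0]
```

Like the non-empty sum described above, this never needs a zero element, so it also works for types that only support `+`, and it preserves the left-to-right order of the operands.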

It follows $f$ is continuous at $a.$

Let's examine your situation. You have that $\lim_{x\to a} f(x) = f(a)$ for some $a\in \mathbb{R}$ and that for any $x,y\in \mathbb{R}$, $f(x)+f(y)=f(x+y)$. You want to prove that for any $c\in \mathbb{R}$, $\lim_{x\to c} f(x) = f(c)$. The key step here is to realize that $\lim_{x\to c} f(x) = \lim_{x\to a} f(x-a+c)$ because $|(x-a+c)-c| = |x-a|$, so in plain English $x$ is close to $a$ if and only if $x-a+c$ is close to $c$. We can then complete the proof as follows: $\lim_{x\to a} f(x-a+c) = \lim_{x\to a} (f(x) + f(c-a)) = f(a) + f(c-a) = f(c)$.

Suppose $x=x_0$. Then $f(x_0) + f(y) = f(x_0 + y)$. Taking the limit as $y$ tends to $0$, we get $\lim_{y\to 0} (f(x_0) + f(y)) = \lim_{y\to 0} f(x_0 + y)$. From the continuity at $x_0$, we know that the RHS of the above equation is $f(x_0)$, which means that $\lim_{y\to 0} f(y) = 0$. The next bit is simple. For any $x,y$ in the domain, $f(x+y)=f(x)+f(y)$, so continuity can be established by checking whether $\lim_{y\to 0} f(x+y) = f(x)$, which is true since $f(x+y)=f(x)+f(y)$ and $\lim_{y\to 0} f(y)=0$. Hence $f(x)$ is continuous everywhere in the domain.

I'm new here so I don't know how to input equations using latex. Pls bear with it.

## §4.2 Definitions

where the integration path does not intersect the origin. This is a multivalued function of z with branch point at z = 0.

The principal value, or principal branch, is defined by

where the path does not intersect (−∞, 0]; see Figure 4.2.1. ln z is a single-valued analytic function on ℂ ∖ (−∞, 0] and real-valued when z ranges over the positive real numbers.

The real and imaginary parts of ln z are given by

The only zero of ln z is at z = 1.

Most texts extend the definition of the principal value to include the branch cut

With this definition the general logarithm is given by

where k is the excess of the number of times the path in (4.2.1) crosses the negative real axis in the positive sense over the number of times in the negative sense.

In the DLMF we allow a further extension by regarding the cut as representing two sets of points, one set corresponding to the “upper side” and denoted by z = x + i0, the other set corresponding to the “lower side” and denoted by z = x − i0. Again see Figure 4.2.1. Then

with either upper signs or lower signs taken throughout. Consequently ln z is two-valued on the cut, and discontinuous across the cut. We regard this as the closed definition of the principal value.
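The two-sided treatment of the cut can be observed numerically. Python's standard `cmath.log` returns the principal value, and on platforms with signed floating-point zeros it distinguishes the two sides of the cut by the sign of a zero imaginary part. The sketch below uses that behaviour as an illustration of the idea; it is not part of the DLMF, and `general_log` is a hypothetical helper for the general logarithm with winding number `k`.

```python
import cmath
import math

# A point on the cut, approached from the upper side (x + i0)
# and from the lower side (x - i0), encoded with signed zeros.
upper = complex(-2.0, 0.0)
lower = complex(-2.0, -0.0)

log_upper = cmath.log(upper)   # real part ln 2, imaginary part +pi
log_lower = cmath.log(lower)   # real part ln 2, imaginary part -pi


def general_log(z, k):
    """General logarithm: principal value plus 2*k*pi*i for an integer k."""
    return cmath.log(z) + 2j * math.pi * k
```

Note that `complex(-2.0, -0.0)` is built from two floats on purpose: writing the literal `-2.0 - 0.0j` would lose the negative zero.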

In contrast to (4.2.5) the closed definition is symmetric. As a consequence, it has the advantage of extending regions of validity of properties of principal values. For example, with the definition (4.2.5) the identity (4.8.7) is valid only when |ph z| < π, but with the closed definition the identity (4.8.7) is valid when |ph z| ≤ π. For another example see (4.2.37).

In the DLMF it is usually clear from the context which definition of principal value is being used. However, in the absence of any indication to the contrary it is assumed that the definition is the closed one. For other examples in this chapter see §§ 4.23, 4.24, 4.37, and 4.38.

If you think you may have a food additive sensitivity, it’s important to seek professional help since all of the symptoms you may be experiencing can also be caused by other disorders.

It may help to keep a food diary and note carefully any adverse reactions. In the case of a sensitivity being identified, the usual practice is to eliminate all suspect foods from the diet and then reintroduce them one by one to see which additive (or additives) causes the reaction. This should only be done under medical supervision, since some of the reactions – such as asthma – can be serious.

Additives are substances used for a variety of reasons, such as preservation, colouring, and sweetening, during the preparation of food. The European Union legislation defines them as "any substance not normally consumed as a food in itself and not normally used as a characteristic ingredient of food, whether or not it has nutritive value".

Added to food for technological purposes in its manufacture, processing, preparation, treatment, packaging, transport or storage, food additives become a component of the food.

Additives can be used for various purposes. EU legislation defines 26 "technological purposes". Additives are used, among other things, as:

Colours – they are used to add or restore colour in a food

Preservatives – these are added to prolong the shelf-life of foods by protecting them against micro-organisms

Antioxidants – substances which prolong the shelf-life of foods by protecting them against oxidation (i.e. fat rancidity and colour changes)

Flour treatment agents – added to flour or to dough to improve its baking quality