The US Environmental Protection Agency, or EPA, measures the “cleanliness” of the air across the country. Thanks to these measurements, we can make informed plans regarding our outdoor activities.

The EPA uses the Air Quality Index (AQI) to determine how benign the local air is at a given time. The AQI is a number between 0 and 500; the lower the number, the cleaner the air. According to the EPA booklet “Air Quality Index – A Guide to Air Quality and Your Health”, the AQI is calculated for four major air pollutants regulated by the Clean Air Act: ground-level ozone, airborne particles, carbon monoxide, and sulfur dioxide. The first two of these pose the greatest hazard to human health. Ozone can cause coughing, throat irritation, or a burning sensation in the airways. Particle pollutants, which are microscopic solids or liquid droplets that can reach the lungs, may cause irritation of the nose, eyes, and throat. Those at risk from ground-level ozone and particulate matter include people with asthma, children, and the elderly.

The AQI associated with a specific pollutant is obtained from measurements of the pollutant concentration *C*. The EPA uses a table of concentration ranges to which different AQI values are pre-assigned (see the booklet link above). The desired AQI is computed using a piece-wise linear function of *C*:

AQI = [(*I*_{high} − *I*_{low}) / (*C*_{high} − *C*_{low})] × (*C* − *C*_{low}) + *I*_{low},

where *C*_{low} and *C*_{high} are the lower and upper ends, respectively, of the tabulated pollutant concentration range such that *C*_{low} ≤ *C* ≤ *C*_{high}, and *I*_{low} and *I*_{high} are the endpoints of the AQI range assigned to the interval [*C*_{low}, *C*_{high}]. Again, the lower the AQI, the better the air quality.
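The interpolation above is straightforward to express in code. Here is a minimal Python sketch (the original analysis is in R); the breakpoint numbers in the example are hypothetical stand-ins, not values from the official EPA tables:

```python
# Piecewise-linear AQI interpolation within one breakpoint range.
# Breakpoint values used below are illustrative, not the EPA's tables.
def aqi_from_concentration(c, c_low, c_high, i_low, i_high):
    """Map a pollutant concentration c in [c_low, c_high] to its AQI."""
    return (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low

# Example: a concentration of 40 in a hypothetical range [0, 54]
# that maps to the AQI interval [0, 50].
print(round(aqi_from_concentration(40, 0, 54, 0, 50)))  # 37

# When several pollutants are measured at a site, the reported AQI
# is the largest of the individual values:
reported_aqi = max([37, 22, 15])
```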

The AQI is formally computed by the above equation for each pollutant. Each measuring site might measure the concentration of all pollutants, of some of them, or of only one. When the concentration of more than one pollutant is measured, the final reported AQI corresponds to the largest individual value.

Which areas in the United States have the cleanest air, on average? More specifically, what are the time averages of the AQI in different counties? To answer this question, one can obtain county-level data from this EPA webpage (scroll to the middle of the page to the table called “Daily AQI”; the csv files containing those data, as well as the R code to analyze the data, can be found in this GitHub repository). The reported daily AQI values correspond to only 1,316 US counties (less than half of all US counties), because reports for areas with a population of less than 350,000 are not required by law. Note also that AQIs for each sampled county are not reported strictly on a daily basis (for example, many measurements are reported every three days).

I used data from the year 2000 up to April 2020. The map in Figure 1 shows the AQI averaged over this period:

As described by the figure legend, the counties in green have the best air quality (remember that the lower the AQI, the cleaner the air). In EPA terminology, the green counties have “Good” air quality. The counties in yellow have “Moderate” air quality. The air in the two regions in orange is “Unhealthy for Sensitive Groups”. The distribution of these mean AQIs is shown in Figure 2:

Figure 2 shows that most (84.1%) of the US counties for which data exist have had clean air (i.e., “Good” air quality) since 2000. The air quality was moderate, on average, for 15.7% of counties. Only two counties (0.15% of the total measured) had an average air quality that was unhealthy for certain people. These two counties are Kings, in California, and Ada, in Idaho.

During the 2000 – 2020 period examined here, what were the best and worst recorded air qualities in each of the studied counties? In the case of the best air quality, all 1,316 examined counties reported minimum AQIs that were below 51. In other words, at different points in that period, each of those counties enjoyed “Good” air quality. For that reason, the minimum-AQI map shows all of those counties in green, and it is therefore not included here.

For the case of the worst air quality reached in each county in the same period, Fig. 3 shows a US map with the maximum measured AQI:

Remarkably, there are a few counties in which the air quality remained “Good” throughout. A list of those counties is shown in Fig. 4:

The distribution of maximum AQIs is shown in Fig. 5. Note that there were several values (25 of them, corresponding to 2% of the total number of reported counties) well above 500, extending the tail of the distribution to the right into the thousands. Those extreme values were not included in the distribution of Fig. 5 for visual clarity. Nevertheless, it would be worthwhile to ask EPA experts about the meaning of such outliers.

From Fig. 5, 42% of the studied counties reported “Unhealthy” air conditions at some point during the 2000 – 2020 period. “Very Unhealthy” conditions occurred in 29% of the counties, and “Hazardous” conditions in 4% of the counties.

Finally, calculating the standard deviation of each county’s AQI (AQI_{RMS}, where RMS stands for “root-mean-square”) gives an idea of the variation in air quality over the same period. This variation is shown in Fig. 6. In this figure, the numbers shown correspond to those counties for which the AQI_{RMS} is less than 100. Counties in blue have a low standard deviation, while counties in magenta have a high standard deviation. There were two counties, both in California, in which AQI_{RMS} exceeded 100 (neither of which is colored): Kings and Napa.
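The per-county summaries used throughout this post boil down to a mean and a standard deviation of each county's daily readings. A Python sketch of that aggregation (the actual analysis was done in R, and the county names and AQI values below are made up for illustration):

```python
import statistics

# Hypothetical daily AQI readings keyed by county (illustrative values only).
daily_aqi = {
    "County A": [35, 42, 28, 51, 33],
    "County B": [88, 120, 95, 150, 101],
}

# Mean AQI and its RMS variation (sample standard deviation) per county.
summary = {
    county: {
        "mean": statistics.mean(values),
        "rms_variation": statistics.stdev(values),
    }
    for county, values in daily_aqi.items()
}
```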

In conclusion, the EPA data on daily county AQI values show that

- on average, most of the counties for which values are available enjoyed clean air between January 2000 and April 2020;
- most of those counties experienced at least one instance of unhealthy air conditions in the same period; and
- all but two counties had a root-mean-square AQI variation (AQI_{RMS}) of less than 100, with a median of 20.

To answer this question, one can look for the “Keywords” section that appears in most “recent” (i.e., from 1970 onwards) astronomy papers, and count the number of actual keywords listed. Once again, the NASA ADS Astronomy Abstract Service proves invaluable for this task. So without further ado, here are the results:

These keywords correspond to refereed articles only, and they have all been converted to lower case. The search gave 128,394 different keywords. The three least-frequent keywords were “zz ceti”, “zz psc” and “zz uma”, with only one instance each. The keyword “dark energy” was number 311 in the list, “mars” was number 881, and “extrasolar planets” was number 1,410.
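The case-normalized counting described above can be sketched with Python's `collections.Counter` (the original work used R; the keyword lists below are toy stand-ins for the ADS data):

```python
from collections import Counter

# Toy keyword lists from three hypothetical papers; real data would come
# from the "Keywords" fields of refereed articles in ADS.
papers = [
    ["stars: atmospheres", "brown dwarfs"],
    ["Brown Dwarfs", "astrochemistry"],
    ["gamma rays: observations", "brown dwarfs"],
]

# Convert everything to lower case, as in the text, then tally.
counts = Counter(
    keyword.lower()
    for paper in papers
    for keyword in paper
)
print(counts.most_common(1))  # [('brown dwarfs', 3)]
```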

It is also interesting to look at the frequency of keywords over time. I picked four keywords (“stars: atmospheres”, “brown dwarfs”, “gamma rays: observations”, and “astrochemistry”), and plotted their available yearly counts since 1990. I did this for each of three journals: *The Astrophysical Journal* (ApJ, shown in red below), *Monthly Notices of the Royal Astronomical Society* (MNRAS, blue), and *Astronomy & Astrophysics* (A&A, green). I also included the yearly counts for all the journals combined (shown in black):

Can you explain the two peaks in the black curve for “gamma rays: observations”?

Check out this cool interactive website built by Stefano Meschiari, in which you can choose a keyword and see its behavior in time.

I wanted to learn a bit more about the types of industries that have a presence in Austin, as well as how much they pay (useful to know if you are in the job market), and I turned to the website Glassdoor.com, which contains first-hand information on companies, salaries associated with different job positions, and even employee reviews.

I did a search of all the companies listed on the website, and there were a little over 2,400. Not all of the listed companies included information about their positions or salaries (especially true for small companies). The first thing I scraped off the website was the distribution of companies over different industries. The results are summarized in this graph:

Austin’s character as a tech hub is obvious from this bar chart, where IT companies stand out. Keep in mind that this information is changing constantly.

How are salaries distributed across industries? Gathering the available salary figures for all positions in each company, and for every industry, gave this result:

Note that the salaries taken to produce this plot are averages for each job title. The black dot associated with each industry represents the median of the corresponding *average* salaries. The left end of each dashed bar corresponds to the minimum salary for that industry, while the right end is the maximum value. The left and right edges of the rectangular boxes denote the 1st and 3rd quartiles, respectively. There are several outliers indicated by the empty circles. Two industries in this plot (Non-Profit and Insurance) do not appear in the first plot, because not enough companies were listed in those industries on the website, which of course doesn’t mean that there aren’t many such companies in Austin.
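For readers curious about the arithmetic behind those boxes, the quartiles can be reproduced with Python's standard library (the salary figures below are made up for illustration, not Glassdoor data):

```python
import statistics

# Toy list of average salaries for one hypothetical industry.
salaries = [52000, 58000, 61000, 64000, 70000, 75000, 83000, 90000, 120000]

# The three cut points of a boxplot: 1st quartile, median, 3rd quartile.
q1, median, q3 = statistics.quantiles(salaries, n=4)

# The ends of the dashed bars: minimum and maximum salary.
low, high = min(salaries), max(salaries)
```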

Next, I counted the total number of different job titles. Unsurprisingly, a tech-related title stood out. I include the 15 most numerous job titles below:

The final count of different job titles turned out to be 13,525. This number includes cases in which several variations of a job title exist. For example, alongside Software Engineer there are titles such as Software Development Engineer I, Software Development Senior Engineer, Software Applications Engineer, etc.

Salaries for some job titles are shown in this figure:

It’s also interesting to look at how salaries for one job title vary across different companies. Take a peek at this graph for Electrical Engineer:

The highest salary is approximately twice the amount of the smallest one. Now take a look at the case of Account Manager:

The abrupt changes shown in the bars are due to my selection of companies to include in the plot; since there are many more companies than can fit in one graph (over 100 of them), I picked three groups of companies from the ones available. The highest-paying company for Account Manager offers more than five times the salary of the one that pays the least.

In the two preceding plots, it’s worth bearing in mind that the same job title may involve different duties, depending on the company, so one has to be cautious about interpreting this raw information.

I was curious about the trends in the number of authors per refereed astronomy paper, so I set out to write an R script that would read in data from the NASA Astrophysics Data System, an online database of both refereed and non-refereed academic papers in astronomy and physics. The script counts the monthly number of refereed astronomy and physics papers between January 1967 and September 2013, as well as the number of authors in each of those papers, and plots the results. It was a good exercise in manipulation of regular expressions. It is possible that the script can be made more efficient by using different R structures than the ones I used, but I wanted to look at the results first.
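As an illustration of the kind of regular-expression work involved, here is a small Python sketch that counts authors in a mock semicolon-separated author field (the real ADS records are richer than this, and the original script is in R):

```python
import re

# A mock ADS-style author field: "Last, F." entries separated by semicolons.
# Treat this as a toy parser, not the actual script's logic.
record = "Smith, J.; García, M.; Lee, K.-W."

# Split on semicolons (with optional trailing whitespace) and
# count the non-empty entries.
authors = [a for a in re.split(r";\s*", record) if a.strip()]
print(len(authors))  # 3
```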

Here’s a plot of the monthly mean and maximum number of authors:

The upward trend in both sets of data is readily apparent, though the maximum is much more variable. The record maximum number of authors for a single paper is 3,062, in May 2011, for a study of proton-proton collisions performed at the Large Hadron Collider. The variability of the mean number has also increased in the last three years or so.

Let’s decompose both series into their constituent parts, namely trend, seasonal and random:

In both cases, there is very little seasonality, and the random component is more stable for the series of mean number of authors.

As it turns out, a nice study by Schulman et al. (1997) reported that the *annual* mean number of authors per *astronomy* paper, published in different journals, increased between the years 1975 and 1995 by a factor of between 1.5 and 2. We see something similar here. In their article, those authors showed that the fraction of papers written by a single author decreased steadily in the same period by a factor of about 3. To check that out, I looked at the monthly data of the fraction of papers written by *N* authors, with *N*=1, 2, 3, 4, 5, and also *N*>5. This is what I found:

Single-author papers have decreased by a factor of almost 4 since 1967 (black line). The period covered by Schulman et al. shows a similar trend here, even though I’m also including physics articles. Papers written by two authors (red line) grew in quantity until the mid-1980s, and then slowly decayed to their present fraction of 20%.

Perhaps the most noteworthy behavior is that of papers written by more than five authors (orange line). From making up only 1% of all papers in 1967, they are now the most numerous, at 28%. There was an abrupt rise in the trend of these *N*>5 publications in 1990, which coincides with the launch of the Hubble Space Telescope, and the trend hasn’t changed significantly since. Performing a Holt-Winters fit, a method that uses exponentially weighted moving averages to update estimates of the seasonally adjusted mean, the slope, and the seasonals, we can forecast the fraction of papers that will be written by more than five authors in the next 24 months. R has functions called **HoltWinters** and **forecast.HoltWinters** that do this. The result can be seen in this figure:
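For readers without R at hand, here is a bare-bones Python sketch of Holt's double exponential smoothing (level and trend only, with no seasonal component, which is reasonable given the weak seasonality noted earlier). The smoothing parameters and the input fractions are arbitrary illustrative choices; R's **HoltWinters** estimates the parameters from the data:

```python
# Double exponential (Holt) smoothing: exponentially weighted updates
# of a level and a trend, extrapolated to produce h-step forecasts.
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=24):
    level = series[0]
    trend = series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Forecasts extrapolate the final level along the final trend.
    return [level + (h + 1) * trend for h in range(horizon)]

# Toy upward-trending fractions of N>5 papers (made-up numbers):
fractions = [0.20, 0.21, 0.22, 0.24, 0.25, 0.27]
future = holt_forecast(fractions, horizon=3)
```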

The blue line represents the forecast of relative number of papers until December 2014. The dark and light shaded areas are the 85% and 95% confidence regions, respectively.

Schulman et al. (1997) have pointed out possible explanations for the decrease in the number of single-author papers and increasing numbers of multi-author articles. One is the “growth of multiwavelength astrophysics… which requires astronomers to be proficient in multiple wavebands or to collaborate with experts in other wavebands”. Another possible cause is that the increasing competition for jobs and grants forces applicants to look for collaborations that will result in a longer publication list for their CV. Also, the development and availability of rapid communication through the Internet has boosted the interactions among colleagues.

It would be interesting to look at economic indicators side by side with the data shown above, to try to elucidate whether there’s a connection between the number of authors and economic trends.

**Reference:** Schulman, E., French, J. C., Powell, A. L., Eichhorn, G., Kurtz, M. J., and Murray, S. S. 1997, Publications of the Astronomical Society of the Pacific, 109, 1278

One way in which astrophysicists can understand these systems better is through numerical simulations of small disk patches. One of the simulation outputs is a very important quantity, known as “α” (alpha) in the planet-formation field, which is essentially a dimensionless measure of the intensity of the turbulence. If scientists can decipher how disk turbulence varies during the lifetime of a protoplanetary disk (which can be on the order of 1 to 10 million years), we could potentially explain many fascinating features of our own solar system and of other planetary systems.

Up until now, astronomical observations of disks have provided only indirect, order-of-magnitude estimates of α. In spite of this, simulations that include the relevant physics allow us to glean the random variability of a disk’s α, and this provides theorists with a vital piece of information that can be used as input for their disk models.

However, thorough numerical descriptions of disks require large amounts of computational resources, particularly computing time, as well as knowledge of the complex codes that are used. What if it were possible to obtain α-values without having to carry out full 3-D simulations of disks every time someone needs them?

One way to do this is by performing a time series analysis of α-data from previously-run disk simulations, and obtaining a stochastic model that can be used in subsequent disk calculations. The statistical language R contains many packages and functions ideally suited for this task, and the analysis below uses those. Excellent introductions to the use of R in time series analysis can be found in “Introductory Time Series with R”, by P. S. P. Cowpertwait and A. V. Metcalfe (Springer), and in “Time Series Analysis With Applications in R”, by J. D. Cryer and K.-S. Chan (also published by Springer). What follows is not intended to be a rigorous or comprehensive study, and I would welcome comments and suggestions to improve these results.

Figure 2 shows a realization of α from a typical protoplanetary disk calculation, using an astrophysical-fluid-dynamics code. The unit of time is one orbit, which is one revolution of the disk patch being simulated around the central star (this can easily be converted to years, or any other desired unit). Data points are spaced by 0.5 orbits. Here we are discarding the first 70 orbits, because it takes the turbulence some time to set in after the beginning of the simulation, and we want to make sure that the data are truly capturing the turbulent state.

The correlations between α-values at time *t* and values at times *t−k*, with *k*=1, 10 and 200 representing the number of time steps or “lags” for each data point, are shown in Figure 3. Values that are separated by one lag (half an orbit) are highly correlated, as evidenced by the black symbols. On the other hand, α-values that are 200 lags (100 orbits) apart show almost no correlation, as can be seen from the blue symbols. The corresponding correlation coefficients are indicated in the figure.

Another way of examining correlations between pairs of data points is to plot the *sample autocorrelation function* (or sample ACF). A *correlogram* is a helpful visual tool that can be used for this purpose. Figure 4 presents the sample ACF of the data in Figure 2; for each *k*, the correlogram shows the correlation between α_{1} and α_{1+k}, α_{2} and α_{2+k}, α_{3} and α_{3+k}, etc.

The blue horizontal lines in Figure 4 are drawn at ±2/√*N*, where *N* is the number of data points. That is, the lines represent two approximate standard errors above and below zero. The sample ACF has a sampling distribution from which we can infer whether the data come from a completely random series. If the sample ACF falls outside the lines (as it does for most lags in Figure 4), we have evidence against the null hypothesis that the autocorrelation function equals zero, at the 5% level. In other words, the correlations are statistically significant. If the sample ACF remained within the 95% confidence band, it would be reasonable to infer that no correlations exist.
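A sample ACF like the one in Figure 4 can be computed from scratch. The following Python sketch mirrors what R's **acf** function does, with a toy series standing in for the α-data:

```python
import math

# Sample autocorrelation function: for each lag k, the covariance between
# the series and its k-shifted copy, normalized by the lag-0 variance.
def sample_acf(x, max_lag):
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n / c0
        for k in range(max_lag + 1)
    ]

# Toy series; values well inside the +/- 2/sqrt(N) band would be
# consistent with a completely random (white noise) series.
series = [0.3, -0.1, 0.4, 0.0, -0.2, 0.1, 0.5, -0.3, 0.2, -0.1]
acf = sample_acf(series, 3)
band = 2 / math.sqrt(len(series))
```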

The text annotation in Figure 4 reports the result of an augmented Dickey-Fuller (ADF) test on the α-data of Figure 2. I won’t go into the details of the test (an exposition can be found in the books mentioned above), but generally speaking, it can be used to quantify the certainty about whether a time series is *non-stationary* (strictly speaking, whether it contains a unit root, which is the null hypothesis). A non-stationary series has a non-constant mean and/or variance, and shows significant correlations in its correlogram. Additionally, in that case the Dickey-Fuller statistic will be a negative number with a relatively small absolute value. That seems to be the case for our data, and Figure 4 supports the notion that the disk turbulence intensity from the simulation is non-stationary.

Now, to proceed further we need to transform the original time series into a stationary one, that is, into a process which is, in a sense, in statistical equilibrium. We can achieve that by taking the natural logarithm of, and then differencing, the series. This is shown in Figure 5, where the left panel corresponds to the transformed series, and the right panel to its sample ACF.

The ACF shows a slow decay, but after some 350 lags no correlations remain. At this point, I would be tempted to say that even after this transformation (and some others that I’m not showing here) the series still remains non-stationary, given what we see in the correlogram. However, the augmented Dickey-Fuller test gives us reason to reject the null hypothesis of a unit root, or in other words, non-stationarity. Moreover, the transformed series shows a relatively stable mean compared to the original series in Figure 2, and the variance exhibits less change, except at a few times around 250 orbits. These larger changes would need to be incorporated into a more elaborate model, but for the time being I’m going with the result of the ADF test and taking the transformed series to be stationary.

So, knowing that it was necessary to take the differences of the logarithms of α, we can use the “**arima**” function in R to obtain a model of that particular realization of α. “ARIMA” stands for Auto Regressive Integrated Moving Average, which describes the structure of a time series model consisting of recursive terms (the AR part) and terms of independent and identically distributed random variables (the MA part). The I in ARIMA refers to the fact that we had to difference the series to make it stationary. Exactly how many AR and MA terms are needed, and how many times we need to take differences, lies at the heart of the time series analysis problem. In this case, by trial and error, 5 AR terms and 5 MA terms are found to most accurately describe our stochastic process.

Using **arima** on the natural logarithm of the α-series, and specifying that we differenced it once, R computes the coefficients of the ARMA components and their associated standard errors:

| | AR1 | AR2 | AR3 | AR4 | AR5 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Coefficient | 0.295 | -0.058 | -0.058 | 0.942 | -0.352 | -0.555 | 0.017 | -0.019 | -0.982 | 0.570 |
| S.E. | 0.237 | 0.067 | 0.062 | 0.067 | 0.179 | 0.232 | 0.023 | 0.023 | 0.025 | 0.210 |

We can see that the coefficients for terms AR2, AR3, MA2 and MA3 are not significant. The resulting model is then

Δln α_{t} = 0.295 Δln α_{t−1} − 0.058 Δln α_{t−2} − 0.058 Δln α_{t−3} + 0.942 Δln α_{t−4} − 0.352 Δln α_{t−5} + *w*_{t} − 0.555 *w*_{t−1} + 0.017 *w*_{t−2} − 0.019 *w*_{t−3} − 0.982 *w*_{t−4} + 0.570 *w*_{t−5},

where Δ denotes the first difference and the *w*_{t}’s are white noise variables.
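To illustrate how such a fitted model can serve as a cheap proxy for a full simulation, here is a Python sketch that generates a synthetic α-series from the tabulated coefficients. The noise level, starting value, series length, and seed are arbitrary illustrative choices, not quantities taken from the simulation:

```python
import math
import random

# ARMA(5,5) coefficients from the fitted model above; the recursion runs
# on the differenced log-alpha series, which is then integrated back
# (undoing the "I" in ARIMA) and exponentiated (undoing the log).
AR = [0.295, -0.058, -0.058, 0.942, -0.352]
MA = [-0.555, 0.017, -0.019, -0.982, 0.570]

def simulate_alpha(n, sigma=0.05, alpha0=1e-3, seed=42):
    rng = random.Random(seed)
    x = [0.0] * 5          # past differenced log-alpha values
    w = [0.0] * 5          # past white-noise shocks
    log_alpha = [math.log(alpha0)]
    for _ in range(n):
        shock = rng.gauss(0.0, sigma)
        # AR terms on recent x's plus MA terms on recent shocks.
        x_new = (sum(a * v for a, v in zip(AR, reversed(x)))
                 + shock
                 + sum(b * v for b, v in zip(MA, reversed(w))))
        x.append(x_new)
        w.append(shock)
        log_alpha.append(log_alpha[-1] + x_new)   # undo the differencing
    return [math.exp(v) for v in log_alpha]       # undo the logarithm

alpha_series = simulate_alpha(200)
```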

It is necessary to perform some diagnostics before choosing the model as a statistical descriptor of our data. To that end, we examine the standardized residuals obtained from our model fit. If the model is adequately specified and the estimates of the parameters are reasonably close to the true values, then the residuals should behave roughly like independent, identically distributed normal variables. If we plot their behavior over time, they should scatter approximately equally above and below zero, and not show any trends. This appears to be the case, as shown in Figure 6:

We can assess the normality of these residuals by means of a quantile-quantile, or Q-Q, plot. This graphical tool can be used to display the quantiles of the data versus the quantiles of a normal distribution. If the data are normally distributed, they will form an approximately straight line. They are close to doing that in Figure 7, except at the edges:

A Shapiro-Wilk test for normality can be performed to calculate the correlation between the residuals and the normal quantiles. A p-value of 0.69 is obtained, and we therefore cannot reject the null hypothesis that the residuals are normally distributed.

As a final diagnostic for our model in this exercise, we look at the ACF of the residuals to establish whether any autocorrelations remain:

Figure 8 shows that this is not the case, and our model could be a good candidate for use as a proxy for the turbulent activity in a protoplanetary disk, provided its physical features are similar to those of the 3-D numerical simulation.
