Online resources useful for learning or teaching data science, biostats and bioinformatics including 15 online courses. Topics include:
R
tidyverse
Dataviz & ggplot2
Linux, GitHub
Probability
Inference & Modeling
Regression
Machine Learning
Bioconductor
Literally all I do as a statistician:
No.
No.
That's not the definition of a p-value.
No.
Trending towards significance is not a thing.
No.
No pie charts!
That only works if data is normal.
No.
That's logistic regression not AI.
No.
Your "novel" method was invented in 1918.
No.
All materials for the 4 hour workshop Data Science for Statisticians are available on GitHub.
We covered tidyverse, dataviz, wrangling and machine learning.
Includes
6 lectures in R markdown and html
5 labs
solutions in R markdown and markdown
A free PDF version of Introduction to Data Science: Data Analysis and Prediction Algorithms with R is now available on
Thanks to all the readers who improved the first gitbook version through GitHub pull requests and issues, especially @biochemnerd!
If they want to improve the quality of scientific publications, rather than banning p-values or changing the 0.05 threshold, journals should make us show the data.
Academics that focus on theory often underestimate how difficult applied work is. If your publications contain only toy examples or simulated data, if you haven't built tools that others use, consider the possibility that working on real-world problems is harder than you imagine.
The good statistical collaborator paradox:
If you collaborate with a good statistician, you will appear less productive.
The paradox comes from the fact that the statistician will often catch false discoveries before you publish them. The benefits will come in the long run.
.@HarvardBiostats 260 Introduction to Data Science starts next week.
Course notes and exercises, updated weekly, are publicly available here:
GitHub repo with Quarto code is here:
Update: Leads in PA and GA continue to decrease. The linear trend is consistent with a difference between votes counted last night and the mail-in votes being counted now. Note that the expected total vote is an estimate. If this changes, the entire plot shifts to right or left.
After a four-year pause, I am teaching Introduction to Data Science again and the free online textbook is being updated frequently. Good time to make requests. Changes already made:
-Added data.table chapter
-Caught up with dplyr 1.0.0
-Changed %>% to |>
Vaccines work in a gif update: COVID19 cases versus vaccination rates in US states through time. The Delta variant effect can be seen clearly starting in July. States with lower vaccination rates are affected much worse.
Vaccines work in a gif: COVID19 cases versus vaccination rates in US states through time.
Cases start decreasing after 30% are vaccinated. Today states with higher vaccination rates have fewer cases. Voted Trump = red, Biden = blue. Increase in July probably due to Delta variant.
Open letter to journal editors: dynamite plots must die. Dynamite plots, also known as bar and line graphs, hide important information. Editors should require authors to show readers the data and avoid these plots.
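A quick way to see what a dynamite plot hides: the sketch below builds two made-up samples (hypothetical numbers, not from any paper) and rescales them so the bar height (mean) and the error bar (standard error) are identical, even though one sample is evenly spread and the other is driven by a single outlier. The `rescale` and `bar_and_error` helpers are names invented for this illustration.

```python
import statistics

def rescale(values, mean, sd):
    """Shift and scale values so they have exactly the requested mean and SD."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [mean + sd * (v - m) / s for v in values]

# Two hypothetical samples with very different shapes.
evenly_spread = rescale([1, 2, 3, 4, 5, 6, 7, 8], mean=10, sd=2)
one_outlier = rescale([1, 1, 1, 1, 1, 1, 1, 9], mean=10, sd=2)

def bar_and_error(values):
    """The only two numbers a dynamite plot shows: mean and standard error."""
    se = statistics.stdev(values) / len(values) ** 0.5
    return round(statistics.mean(values), 6), round(se, 6)

print(bar_and_error(evenly_spread))
print(bar_and_error(one_outlier))
print(sorted(round(v, 2) for v in evenly_spread))
print(sorted(round(v, 2) for v in one_outlier))
```

The two summaries match, so the bars would look the same; only showing the points reveals the outlier.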
Clustering algorithms report clusters even when none exist. In single-cell RNA-Seq pipelines, novel cell types are often identified by clustering algorithms. Expanding on Kimes et al.'s work, we introduce significance analysis for single-cell RNA-Seq data:
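To illustrate the first sentence, here is a minimal sketch (not the method from the paper) that runs a plain k-means on uniform random noise. The data, the `kmeans` helper, and the choice of k = 4 are all assumptions made for this example. The algorithm still hands back a label for every point, with nothing to indicate that there is no real structure.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2,
            )
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels

# Structureless data: 300 points drawn uniformly from the unit square.
rng = random.Random(1)
noise = [(rng.random(), rng.random()) for _ in range(300)]

labels = kmeans(noise, k=4)
cluster_sizes = [labels.count(j) for j in range(4)]
print(cluster_sizes)
```

Every point gets assigned to a "cluster" even though the data were generated with none, which is why some form of significance analysis is needed on top.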
Collaborator: I know you are super busy finishing the analysis for our current project, but when you are done I need help with this new cool dataset ...
Applied statistician:
New version of Introduction to Data Science: Data Analysis and Prediction Algorithms with R is available. Many improvements, mostly suggested by readers, have been incorporated.
We are posting materials for this year's Introduction to Data Science course here:
Includes slides, exercises, and labs.
Textbook:
To convert the Rmd book chapters to Rmd slides, I use this R function:
Preliminary data from Puerto Rico suggest Omicron is about 40% as severe as Delta in terms of sending infected individuals to hospital. For children under 12 it appears just as severe.
The graph compares the % hospitalized during the surge dominated by Delta to the current Omicron one.
Dear everybody,
If you have to choose one nice thing to do for the computer geek helping you, don't use spaces in your filenames.
Instead of "My Document", use "my-document", "my_document" or "myDocument"
Spaces indicate the end of the filename in some of the tools we use.
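A small sketch of the problem, using Python's standard `shlex` module to mimic how a shell-like tool splits a command into arguments. The command and filenames are hypothetical.

```python
import shlex

# Hypothetical command lines, tokenized the way a POSIX-style shell would.
unquoted = shlex.split("wc -l My Document.txt")
quoted = shlex.split('wc -l "My Document.txt"')
safe = shlex.split("wc -l my-document.txt")

print(unquoted)  # the filename is split into two separate arguments
print(quoted)    # quoting keeps it together, but everyone must remember to quote
print(safe)      # no spaces, no problem
```

With the unquoted version the tool goes looking for two files, `My` and `Document.txt`, neither of which exists.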
The vaccination rate started rising again in PR. Almost 90% of adults have received at least one dose. It is normal that there were concerns, but this shows that those who believe in conspiracies are a minority. Thanks to the press and public health professionals for their efforts in educating.
First draft of #DataScience book is online, just in time for the start of the semester. Focus is neither math nor coding but answering questions through data analysis.
R Basics
DataViz
Probability
Inference
Wrangling
Regression
Machine Learning
Productivity Tools
Advice for undergrads interested in data analysis
Courses:
Probability
Stat inference
Linear models with matrix algebra
Machine learning
Scientific computing
Skills:
Code analyses in R or Python
EDA
SQL, html, git, Unix
Google-fu
Get real-world experience; homework is not enough
A paper version of Introduction to Data Science: Data Analysis and Prediction Algorithms with R is now available on Amazon:
We are working on a solution manual for the 502 exercises it includes, for those interested in using it as a course textbook.
Teacher salaries in Puerto Rico are by far the lowest in the United States. But the Department of Education's budget is around $13,500 per student, close to the US average. Where does all that money go if not to teacher salaries?
📣 A second edition of our Introduction to #DataScience is on its way, now split into two books.
After teaching the course this semester, we've made significant improvements. Current drafts are online:
📘Intro:
📙Advanced:
#rstats
📣 Thanks to your feedback we've made many updates to Introduction to #DataScience including:
✅ High Dimensional Data part
✅ Treatment effect models chapter
✅ Code in Quarto
✅ Split into two parts
📘 Intro:
📙 Advanced:
#RStats
I am considering translating the free electronic version of the book Introduction to Data Science into Spanish.
Is there interest, or is the English version enough?
Making a spreadsheet look good to the human eye often makes it very hard for data analysts to extract what they need to help you with the analysis.
@kwbroman and @kara_woo's paper should be required reading for anybody creating spreadsheets by hand:
A few weeks ago Puerto Rico was in the middle of a surge in COVID19 cases. The governor imposed restrictions and strict mandates making it the US jurisdiction that most incentivizes vaccination. Today PR has a higher vaccination rate and lower case rate than all 50 US states.
Recently, Puerto Rico became the US jurisdiction that most incentivizes vaccination. Students and public employees need to be vaccinated. Several venues including restaurants require vaccination cards. This has resulted in an increased vaccination rate that could soon make PR #1.
Data from PR show that COVID19 vaccines work. We compared infections, hospitalizations, and deaths between the vaccinated and the unvaccinated. The evidence is clear. The graph compares mortality rate by age group. More examples in the full report
Science has to be communicated as it is, even when it is inconvenient or uncomfortable. In Puerto Rico, we have had data for months indicating that vaccine effectiveness wanes. Not prioritizing the communication of this important information at the time is now causing harm and confusion.
A second version of The Data Science @HarvardOnline series is now up and running. The series is composed of eight courses and a capstone:
R Basics
Data visualization
Probability
Inference and modeling
Linear Regression
Data wrangling
Machine Learning
Current #SingleCell pipelines overcluster, leading to over-reporting novel cell types. We present a statistical method to help determine which clusters are real. Can be applied to raw counts or existing clusters.
Code:
Preprint:
Unpopular opinion: The GRE can be useful for quantitative PhDs. I did my undergrad in a university that doesn't even appear on the rankings. The GRE gave me a chance to demonstrate that I could compete with students from top ranked universities. To study, I borrowed prep books.
Several updates have been made to the Introduction to Data Science online book. The main one being the addition of dozens of exercises to the wrangling, regression and machine learning sections.
A PDF version is coming soon.
Using vital statistics from Puerto Rico, Louisiana, New Jersey and Florida we compared the effects of María to other recent hurricanes
We estimate about 3,000 excess deaths after María, a higher toll than Katrina. Only other comparable tragedy was after Georges, also in PR.
1/4
The Role of Academia in Data Science Education is now published. We argue that data science is not a discipline but an umbrella term for a complex process involving a team with complementary skills. We then provide recs for designing academic programs.
Ideal mentees are enthusiastic, energetic, organized, and focused. They embrace feedback while remaining honest and responsive. And they learn to underpromise and overdeliver.
We found almost 150,000 errors in Puerto Rico's vaccine database.
Incorrectly entered names result in records of the same person not being merged.
Booster dose records are the most affected. This explains why the booster does not appear in VacuID for many 🧵
Daughter: Why are academy award best picture winners so boring?
Me: I don't think it was always like this.
Daughter: Really?
A few weeks later... figure from her high school stats class project:
New manuscript with recs on how to normalize scRNA-Seq data. Main message: don't use the log(CPM+1) transformation, it magnifies unwanted sources of variability. For example, see tSNE plots of technical replicates below.
For more see and thread by @sandakano.
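One way to see how log(CPM+1) can inject technical variability, sketched with made-up numbers (this is an illustration, not the analysis in the manuscript): a raw count of 0 always maps to 0, but a raw count of 1 maps to a value that depends on the cell's total count. The depths below and the `log_cpm` helper are hypothetical.

```python
import math

def log_cpm(count, depth):
    """log2(CPM + 1) for a raw count in a cell with `depth` total counts."""
    return math.log2(count / depth * 1e6 + 1)

# Hypothetical total counts per cell for two technical replicates.
depths = {"shallow": 1_000, "deep": 10_000}

for name, depth in depths.items():
    zero = log_cpm(0, depth)
    one = log_cpm(1, depth)
    print(f"{name}: count 0 -> {zero:.2f}, count 1 -> {one:.2f}, gap = {one - zero:.2f}")
```

The gap between observing 0 and observing 1 molecule is large and changes with sequencing depth, so differences in depth and in the fraction of zeros can show up as apparent structure after the transformation.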
Right now in Puerto Rico COVID19 cases are surging like never seen before.
The positivity rate jumped from 2% to 5% in one week. Among those aged 20-29 it is over 10%.
731 cases were detected yesterday, Tuesday, and data are still coming in.
We keep updating here.
Exponential growth appears to have stopped in PR.
5,000 cases a day are being detected and hospitalizations are growing, but the daily positivity rate (not the weekly average) dropped 3 days in a row. The sacrifice of minimizing gatherings appears to be working.
The polls did NOT fail. Plot below shows @FiveThirtyEight's forecast plotted against the actual result. We do see an overall bias of about 3%. But this is not unusual and was accounted for. 92% of the confidence intervals covered and only GA, NC, & FL were in the wrong quadrant.
If you have taken off your mask in an enclosed space such as a restaurant, bar, church, or gym, please get tested, especially if you don't have the booster.
The data show that these are the riskiest activities for contracting COVID19.
This animation helps explain why it is so hard to predict when/if the COVID-19 surge will come, and when it will peak, in places like Puerto Rico where very few cases have been reported.
I spent a humiliating amount of time learning how to make animated graphs, just to illustrate a fairly obvious point.
“Forecasting s-curves is hard”
My views on why carefully following daily figures is unlikely to provide insight.
Real world data from Puerto Rico shows the importance of boosters.
After 7 months Pfizer effectiveness drops substantially, but booster brings it back to ~85%
J&J effectiveness drops after 2 months, but with Moderna or Pfizer booster it increases to higher level than original.
Aesthetics are important, but the main point of a figure is NOT to make the paper look pretty. When adding a figure to a paper, think hard about how the visual cues help the reader understand a result. Published network hairballs, for example, rarely convey anything useful to me.
In Puerto Rico over 10,000 cases a day are being detected. The positivity rate indicates that 1 out of every 3 molecular tests comes back positive. With antigen tests, 1 in 4. Not enough tests are being done, so many cases go undetected. This makes it hard to curb the surge.
This week the omicron wave finally comes to an end in Puerto Rico. Cases per day have dropped to levels not seen since early December. During the 80 days of this wave, almost 300,000 cases were detected, over 4,000 of them were hospitalized, and over 800 died.
The positivity rate in Puerto Rico started rising this week. Important to note:
- With the arrival of the delta variant, substantial outbreaks have been observed in jurisdictions with vaccination rates similar to PR's
- Over 99% of deaths and hospitalizations are among the unvaccinated
New preprint: Data from Puerto Rico shows importance of Moderna/Pfizer boosters:
After 6 months Pfizer effectiveness against infection wanes to ~36%, booster brings it back to ~85%.
After 2 months J&J wanes to ~36%, booster brings it up to ~88%, higher than the original 65%.
Preliminary data from the 9,000 cases recorded in PR during the surge that began on 12/8:
Among those with the booster:
- 0 hospitalizations/deaths recorded
- Infection rate 2X lower than those without boosters
- Almost 4X lower than the unvaccinated
Had Fisher suggested 0.005, instead of 0.05, as the arbitrary p-value cutoff to reject a null hypothesis, back in 1925, how would the world be different today?
What is the probability of a randomly selected person having a disease given a positive test? If the test accuracy is 99% but the prevalence is 1 in 4,000, Bayes' Theorem tells us it is about 2.5%. Some students find this counterintuitive. Monte Carlo simulations sometimes help clarify:
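A minimal sketch of such a simulation, assuming "99% accuracy" means the test gives the correct answer 99% of the time for both diseased and healthy people (that reading is an assumption, as are the sample size and the `simulate` helper). Under that assumption the exact Bayes calculation gives roughly 2.4%, and the simulated proportion should land close to it.

```python
import random

prevalence = 1 / 4000
accuracy = 0.99  # assumed to apply to both diseased and healthy people

# Exact answer from Bayes' Theorem: P(disease | positive test).
exact = accuracy * prevalence / (
    accuracy * prevalence + (1 - accuracy) * (1 - prevalence)
)

def simulate(n, seed=0):
    """Test n random people; return the share of positives who are diseased."""
    rng = random.Random(seed)
    true_pos = false_pos = 0
    for _ in range(n):
        diseased = rng.random() < prevalence
        correct = rng.random() < accuracy
        # A correct test reports the true status, an incorrect one flips it.
        positive = diseased if correct else not diseased
        if positive:
            if diseased:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos / (true_pos + false_pos)

estimate = simulate(500_000)
print(f"exact: {exact:.4f}, simulated: {estimate:.4f}")
```

The simulation makes the intuition concrete: healthy people are so much more common that their 1% false positives swamp the true positives.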
After 2 years, and several rejections, @stephaniehicks's scRNA-seq paper is finally published. Thanks @biorxivpreprint for letting us share it before pub (and get 25 citations).
Why learn stats? Data analysis has been around for decades. Through the years, ideas that generalize across applications have been developed and common ways to get fooled by apparent patterns identified. Learning stats saves you from reinventing the wheel and repeating mistakes.
📢 Introducing the Data Science Postdoctoral Fellows Program at Harvard/DFCI!
🔹 Join a research group in our department
🔹 Co-mentoring opportunities with 2+ faculty
🔹 Collaborate with DFCI investigators beyond our department
🔹 Salary starts at $75K
@PRicansInSTEM
My name is Rafael Irizarry. I am the chair of the Department of Data Science at Dana-Farber Cancer Institute. Also a Biostatistics professor at Harvard
If you are interested in a career in biostatistics or data science in general don't hesitate to reach out
#PuertoRicansInSTEM
I am offering a 5-week paid course on data wrangling, visualization, and machine learning. Includes graded assessments and problem sets based on real-world challenges.
Space limited to a small cohort so apply soon if you are interested. Details here:
My pitch if I ever interview for university president:
Under my leadership, I will not email you.
So to summarize my promises:
1 - AV will work
2 - You will have one password
3 - No spam
Today Puerto Rico surpassed 70% of the population vaccinated, ahead of all 50 US states. All other trends look good.
But many remain unvaccinated, including over 200,000 people over 60, and over 100 cases a day are being detected. We keep monitoring.
We ran a survey to better understand mortality in Puerto Rico after hurricane María. The official death count of 64 is likely a substantial underestimate. Lack of access to medical care was a major problem.
Code and data are here:
The number of #COVID19 related deaths seems to be trending down in most places with large totals.
Nowhere does the growth appear to be exponential for more than 2 weeks.
Possible good news.
Free PDF version of the book "Introducción a la ciencia de datos" now available on @leanpub
To get the free version, slide the price bar to $0.00 and click "Añadir libro al carrito" (add book to cart)
Thanks to everyone who helped with the translation!
The positivity rate based on molecular tests today reached the 3% threshold in Puerto Rico for the first time since July 8, 2020. And it is going down.
Now let's see whether the trends we see today, which predict fewer than 1 death a day in 2-3 weeks, continue. We keep monitoring.
Principal Component Analysis in a gif.
The first principal component of a matrix is the first dimension of the orthogonal transformation that maximizes the variability of that first dimension. These transformations can be visualized as rotations of the points in the matrix rows.
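The rotation picture can be sketched in two dimensions without any libraries. The data below are simulated (hypothetical), and the helper names are made up for this example. For a 2x2 covariance matrix the angle of the first principal axis has a closed form, and a brute-force search over rotation angles confirms it is the rotation that maximizes the variance of the first coordinate.

```python
import math
import random

# Hypothetical correlated 2-D data; each tuple is one row of the matrix.
rng = random.Random(0)
points = []
for _ in range(500):
    x = rng.gauss(0, 1)
    points.append((x, 0.8 * x + rng.gauss(0, 0.5)))

def center(pts):
    """Subtract the column means so the rotation is about the centroid."""
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    return [(p[0] - mx, p[1] - my) for p in pts]

def variance_of_first_coordinate(pts, theta):
    """Variance of the first coordinate after rotating centered points by -theta."""
    c, s = math.cos(theta), math.sin(theta)
    first = [c * x + s * y for x, y in pts]
    return sum(v * v for v in first) / (len(first) - 1)

centered = center(points)
n = len(centered)
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Closed-form angle of the first principal axis for a 2x2 covariance matrix.
theta_pc1 = 0.5 * math.atan2(2 * sxy, sxx - syy)

# Brute-force check: try rotations from 0 to 180 degrees in 0.1 degree steps.
grid = [math.pi * k / 1800 for k in range(1800)]
theta_best = max(grid, key=lambda t: variance_of_first_coordinate(centered, t))

print(f"closed form: {math.degrees(theta_pc1):.1f} degrees")
print(f"grid search: {math.degrees(theta_best):.1f} degrees")
```

The first coordinate of the rotated points is the first principal component; the second coordinate, orthogonal to it, is the second.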
In our data visualization lectures, we go over dataviz principles and show examples of charts that violate these. I've decided this is my favorite bad plot.
Source:
We have finished the Probability chapters of the book Introducción a la Ciencia de Datos
Suggestions are welcome through GitHub.
Now working on the Inference and Statistical Models chapters
HS students: Don't feel defeated if you don't get into your "dream school". There are many paths to success that don't involve going to a famous private college. Sometimes having an extra 4 years to catch up to kids with access to better preparation is a good thing.
Offering a 5-week machine learning course. It covers algorithm development and fundamental concepts. Focus is on genomics datasets. Lectures are in real-time, with discussion board, feedback on homework, and help showcasing your work on GitHub. Apply here: