Searching for a first name for a child

Using INSEE open data to find a first name that matches a set of chosen criteria

Marie Vaugoyeau

Aim

The purpose of this article is to show how to find information that meets a number of criteria in an open data file published by INSEE, the French National Institute of Statistics and Economic Studies.

Here, we will look for a first name:

  • for a boy
  • that is not a hyphenated name
  • that does not start with S, because there are better options when your last name starts with M ^^
  • that is not already used in the family or among close friends, which rules out about sixty names
  • that is common but not among the most popular either
  • that is not a word in the French language, for example, no Pierre (Stone in French), Colin (Coley), Iris,…

Open source data retrieval from INSEE

INSEE has published a file of first names from 1900 to 2018, available on the French government’s open data site.

The first column gives the sex of the children born: 1 for boys and 2 for girls. I quickly keep only the first names given to boys, so this column no longer appears in the working dataset.
preusuel is the usual first name given to the children.
annais is the year of birth.
nombre is the number of births for a given sex, first name and year. When this number is less than 3, the births are pooled: either summed over all years, with the year replaced by “XXXX”, or summed per year under the name _PRENOMS_RARES (rare first names). Given our criteria, these lines are removed from the dataset.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
list_first_name_insee <- read.csv2(
  "nat2018.csv",
  encoding = "UTF-8"
) %>% 
  filter(
    annais != "XXXX", # drop the rows without a year
    preusuel != "_PRENOMS_RARES" # drop the pooled rare first names, i.e. those given fewer than 3 times in a year
  ) %>% 
  # the year of birth is converted to an integer so that the evolution of first names can be plotted
  mutate(annais = as.integer(annais))

# dataset format  
list_first_name_insee %>% 
  glimpse()
## Rows: 601,221
## Columns: 4
## $ X.U.FEFF.sexe <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ preusuel      <chr> "A", "A", "AADAM", "AADAM", "AADAM", "AADAM", "AADAM"...
## $ annais        <int> 1980, 1998, 2009, 2014, 2016, 2017, 2018, 1976, 1978,...
## $ nombre        <int> 3, 3, 4, 3, 4, 4, 3, 5, 3, 3, 5, 4, 3, 5, 4, 6, 6, 6,...
# number of different first names
list_first_name_insee %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 31707
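
The odd column name X.U.FEFF.sexe in the output above is the trace of a byte-order mark (U+FEFF) at the very start of the file. Here is a minimal sketch of one way to get a clean `sexe` column, assuming the file is UTF-8 with a byte-order mark. The temporary file and its two rows are made up for illustration; the rest of this article keeps the original column name.

```r
# Write a tiny semicolon-separated file that starts with a UTF-8 byte-order mark.
# The two data rows are invented for this sketch.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(
  c(bom, charToRaw("sexe;preusuel;annais;nombre\n1;ADAM;2018;3\n2;EVE;2018;4\n")),
  tmp
)

# fileEncoding = "UTF-8-BOM" asks R to drop the byte-order mark while reading
toy <- read.csv2(tmp, fileEncoding = "UTF-8-BOM")
names(toy)
```

With the real file, the same `fileEncoding` argument could be passed to `read.csv2()`, and the later steps would then refer to `sexe` instead of `X.U.FEFF.sexe`.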

Applying the selected filters

Selecting male first names only

list_first_name_m <- list_first_name_insee %>% 
  filter(X.U.FEFF.sexe == 1) %>% 
  select(- X.U.FEFF.sexe)

list_first_name_m %>% 
  glimpse()
## Rows: 273,864
## Columns: 3
## $ preusuel <chr> "A", "A", "AADAM", "AADAM", "AADAM", "AADAM", "AADAM", "AA...
## $ annais   <int> 1980, 1998, 2009, 2014, 2016, 2017, 2018, 1976, 1978, 1980...
## $ nombre   <int> 3, 3, 4, 3, 4, 4, 3, 5, 3, 3, 5, 4, 3, 5, 4, 6, 6, 6, 8, 9...
list_first_name_m %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 15304
# as expected, the number of first names is roughly halved

Removing hyphenated first names

Compound first names contain a “-”, so detecting this character lets us exclude them.

list_first_name_simple <- list_first_name_m %>% 
  filter(
    !str_detect(preusuel, "-") # keep only the names without a hyphen
  )

list_first_name_simple %>% 
  glimpse()
## Rows: 255,370
## Columns: 3
## $ preusuel <chr> "A", "A", "AADAM", "AADAM", "AADAM", "AADAM", "AADAM", "AA...
## $ annais   <int> 1980, 1998, 2009, 2014, 2016, 2017, 2018, 1976, 1978, 1980...
## $ nombre   <int> 3, 3, 4, 3, 4, 4, 3, 5, 3, 3, 5, 4, 3, 5, 4, 6, 6, 6, 8, 9...
list_first_name_simple %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 14142
# a bit more than 1,000 first names removed
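
To see what the hyphen filter does, here is a small sketch on a made-up vector of names. It uses `fixed("-")` to match the literal character and the `negate` argument of `str_detect()`, which assumes stringr 1.4.0 or later.

```r
library(stringr)

# Made-up sample, upper-cased like the INSEE file
first_names <- c("JEAN-LUC", "PIERRE", "MARIE-ANGE", "LOUIS")

# fixed("-") matches the literal hyphen; negate = TRUE keeps the names without it
simple_names <- first_names[str_detect(first_names, fixed("-"), negate = TRUE)]
simple_names
```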

Not starting with S

list_first_name_without_s <- list_first_name_simple %>% 
  mutate(
    initiale = 
      preusuel %>% 
      str_sub(
        start = 1,
        end = 1
      )
  ) %>% 
  filter(
    initiale != "S"
  )

list_first_name_without_s %>% 
  glimpse()
## Rows: 236,141
## Columns: 4
## $ preusuel <chr> "A", "A", "AADAM", "AADAM", "AADAM", "AADAM", "AADAM", "AA...
## $ annais   <int> 1980, 1998, 2009, 2014, 2016, 2017, 2018, 1976, 1978, 1980...
## $ nombre   <int> 3, 3, 4, 3, 4, 4, 3, 5, 3, 3, 5, 4, 3, 5, 4, 6, 6, 6, 8, 9...
## $ initiale <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"...
list_first_name_without_s %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 13011
# another 1,000 or so removed

# check that the first names beginning with S have been removed
list_first_name_without_s %>% 
  distinct(preusuel, initiale) %>%
  count(initiale)
##    initiale    n
## 1         A 1572
## 2         Â    1
## 3         B  514
## 4         C  586
## 5         Ç    1
## 6         D  676
## 7         E  721
## 8         É   55
## 9         F  396
## 10        G  505
## 11        H  560
## 12        I  384
## 13        Î    2
## 14        Ï    1
## 15        J  701
## 16        K  736
## 17        L  773
## 18        M 1341
## 19        N  602
## 20        O  258
## 21        Ö    3
## 22        P  221
## 23        Q   25
## 24        R  617
## 25        T  704
## 26        U   45
## 27        V  184
## 28        W  209
## 29        X   11
## 30        Y  416
## 31        Z  191

Not already used by family and friends

first_name_in_family <- tibble(
  prenom = c(
    "Pierre",
    "Pierre-Yves",
    "Alain",
    "Philippe",
    "Christophe",
    "Éric",
    "Stéphane",
    "David",
    "Étienne",
    "Antoine",
    "Clovis",
    "François",
    "Quentin",
    "Jean-Baptiste",
    "Rafael",
    "Zacharrie",
    "Anatole",
    "Auguste",
    "Françis",
    "Christian",
    "Jean-Luc",
    "Thierry",
    "Eric",
    "Jérôme",
    "Sylvain",
    "Grégoire",
    "Greg",
    "Benoît",
    "Alexis",
    "Julien",
    "Florian",
    "Mael",
    "Maël",
    "Gabriel",
    "Edouard",
    "Tom",
    "Amaury",
    "Mathias",
    "Yves",
    "Pierre-Yves",
    "Jean-François",
    "Sebastien",
    "Quentin",
    "Stephane",
    "Thierry",
    "Christian",
    "Charles",
    "Thomas",
    "Alexis",
    "Robin",
    "Arthur",
    "Mathis",
    "Marius",
    "Robin",
    "Sacha",
    "Clément",
    "Medhi",
    "Mehdi",
    "Pierre",
    "Jean-Marie",
    "Jeannot",
    "Julien",
    "Louis"
  )
) %>% 
  distinct() %>% 
  mutate(
    prenom = prenom %>% str_to_upper()
  )


friends_first_name <- tibble(
  prenom = c(
    "Thomas",
    "Benoit",
    "Joao",
    "Florian",
    "Bertrand",
    "Aubin",
    "Sébastien",
    "Arthur",
    "Clément",
    "Goulven",
    "Brieuc",
    "Jerome",
    "Jérome",
    "Laurent",
    "Joan",
    "Romain",
    "Armand",
    "Olivier",
    "Christophe",
    "Adrien",
    "Alexis",
    "Grégoire",
    "Eiffel",
    "Gabriel"
    )
) %>% 
  mutate(
    prenom = prenom %>% str_to_upper()
  )

list_first_name_without_close <- 
  list_first_name_without_s %>% 
  anti_join(
    first_name_in_family,
    by = c("preusuel" = "prenom")
  ) %>% 
  anti_join(
    friends_first_name,
    by = c("preusuel" = "prenom")
  )

list_first_name_without_close %>% 
  glimpse()
## Rows: 230,465
## Columns: 4
## $ preusuel <chr> "A", "A", "AADAM", "AADAM", "AADAM", "AADAM", "AADAM", "AA...
## $ annais   <int> 1980, 1998, 2009, 2014, 2016, 2017, 2018, 1976, 1978, 1980...
## $ nombre   <int> 3, 3, 4, 3, 4, 4, 3, 5, 3, 3, 5, 4, 3, 5, 4, 6, 6, 6, 8, 9...
## $ initiale <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"...
list_first_name_without_close %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 12953
# only about fifty names removed, because the lists contain duplicates as well as hyphenated names and names starting with S that were already excluded
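
The lists above spell out several accent variants by hand (Jerome, Jérome, Jérôme). An alternative is to compare the names without their accents. This is only a sketch on made-up data: `fold_name()` is a hypothetical helper, not something used elsewhere in this article, and it relies on the stringi package. It is also stricter than the exact match above, since every accent variant of an excluded name is removed.

```r
library(dplyr)
library(stringi)

# Hypothetical helper: upper-case a name and strip its accents
fold_name <- function(x) {
  stri_trans_general(stri_trans_toupper(x), "Latin-ASCII")
}

# Made-up samples; \u escapes are used for the accented letters
# (BENOÎT, BENOIT, JÉRÔME, ARMEL on one side, Benoit and Jérome on the other)
insee_toy <- tibble(preusuel = c("BENO\u00ceT", "BENOIT", "J\u00c9R\u00d4ME", "ARMEL"))
close_toy <- tibble(prenom = c("Benoit", "J\u00e9rome"))

kept <- insee_toy %>%
  mutate(key = fold_name(preusuel)) %>%
  anti_join(
    close_toy %>% mutate(key = fold_name(prenom)),
    by = "key"
  ) %>%
  select(-key)

kept$preusuel
```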

Common first names, but not trendy ones

My husband and I both have very common first names. At school, we were almost never the only person with our name, so adjectives or numbers were added to tell us apart, which is not very pleasant…

Even today, our acquaintances know other couples with the same first names, which can lead to more or less funny misunderstandings…

In short, we want to avoid this, so we chose to remove the 20 most popular names of each of the last 5 years.

On the other hand, we don’t want a name that nobody knows, so we require that at least 20 children a year were given the name in every year of roughly the last 50 years (1970 to 2018).

Warning: the INSEE data stops in 2018, so for 2019 we take the list from the Parents magazine website.
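
Scraping a list from a web page is fragile, because the selector stops matching as soon as the page layout changes. Here is a minimal sketch of the idea with a check that something was actually found. The inline HTML is made up and is not the structure of the Parents page, and the CSS selector `"ol li"` is an assumption for this toy page only.

```r
library(rvest)

# Inline HTML standing in for a downloaded page (invented for this sketch)
page <- read_html("<html><body><ol><li> Gabriel </li><li>Jules</li></ol></body></html>")

top_names <- page %>%
  html_nodes("ol li") %>%      # one node per list item
  html_text(trim = TRUE) %>%   # text content without surrounding blanks
  toupper()

# Stop early if the selector matched nothing, e.g. after a redesign of the page
stopifnot(length(top_names) > 0)
top_names
```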

# Rare first name
list_rare_first_name <- list_first_name_m %>% 
  right_join(
    list_first_name_m %>% 
      filter(annais >= 1970) %>% 
      distinct(preusuel) %>% 
      merge(
        tibble(annais = c(1970:2018))
      )
  ) %>% 
  filter(
    nombre < 20 | is.na(nombre)
  ) %>% 
  distinct(preusuel)
## Joining, by = c("preusuel", "annais")
list_rare_first_name %>% 
  glimpse()
## Rows: 13,936
## Columns: 1
## $ preusuel <chr> "A", "AADAM", "AADEL", "AADIL", "AAHIL", "AAKASH", "AARON"...
# The number of first names affected is impressive!

# Trendy first name
list_trendy_firs_name_2015_2018 <- list_first_name_m %>% 
  filter(
    annais >= 2015
  ) %>% 
  group_by(annais) %>% 
  arrange(desc(nombre)) %>% 
  slice(1:20) %>% 
  ungroup() %>% 
  distinct(preusuel)

list_trendy_firs_name_2015_2018 %>% 
  glimpse()
## Rows: 24
## Columns: 1
## $ preusuel <chr> "GABRIEL", "JULES", "LUCAS", "LOUIS", "ADAM", "HUGO", "LÉO...
library(rvest)
## Warning: package 'rvest' was built under R version 4.0.3
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
list_trendy_firs_name_2019 <-
  tibble(
    preusuel = read_html("https://www.parents.fr/prenoms/top-100-des-prenoms-de-garcons-100988") %>% # website url
      html_node(xpath = '//*[@id="main"]/article/div/div/div[1]/ol') %>% # ordered list selected; the XPath was copied from the browser's "inspect" tool
      html_text() %>% 
      str_split("\n\t\t", simplify = TRUE) %>% 
      str_remove_all("[:blank:]")
  ) %>% 
  filter(preusuel != "") %>% 
  slice(1:20) %>% 
  mutate(
    preusuel = preusuel %>% str_to_upper()
  )

list_trendy_firs_name_2019 %>% 
  glimpse()
## Rows: 0
## Columns: 1
## $ preusuel <chr>
list_trendy_first_name <- 
  bind_rows(
    list_trendy_firs_name_2015_2018,
    list_trendy_firs_name_2019
  ) %>% 
  distinct()
# Surprisingly, the whole 2019 list is already in the 2015-2018 list, so this information does not seem reliable...
# Having checked several sites, the list does not change between 2018, 2019 and 2020,
# so I guess the authors of those articles did not notice that the INSEE data stops in 2018...
# (in the run shown here the scrape returned 0 rows, so the 2019 list adds nothing either way)

# Now we remove the rare and trendy names from our already reduced list
list_common_first_name <- list_first_name_without_close %>% 
  filter(annais >= 1970) %>% 
  anti_join(list_rare_first_name) %>% 
  anti_join(list_trendy_first_name)
## Joining, by = "preusuel"
## Joining, by = "preusuel"
list_common_first_name %>% 
  glimpse()
## Rows: 8,673
## Columns: 4
## $ preusuel <chr> "ABDALLAH", "ABDALLAH", "ABDALLAH", "ABDALLAH", "ABDALLAH"...
## $ annais   <int> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979...
## $ nombre   <int> 68, 57, 52, 68, 51, 64, 52, 48, 34, 38, 49, 49, 67, 43, 55...
## $ initiale <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"...
list_common_first_name %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 177
# Phew, we've gone under 200 possible names! 
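
To make the two rules of this section easier to follow, here is a sketch of the same logic on a made-up table of four names over three years. It is not the code used above: it counts the years in which a name reaches the threshold instead of building every name and year combination, and it uses `slice_max()`, which assumes dplyr 1.0.0 or later. The names, counts and the reduced period are all invented.

```r
library(dplyr)

# Made-up counts (not INSEE figures): one row per name and year
toy <- tibble::tribble(
  ~preusuel, ~annais, ~nombre,
  "ALPHA",   2016L,   900L,
  "ALPHA",   2017L,   850L,
  "ALPHA",   2018L,   800L,
  "BETA",    2016L,    40L,
  "BETA",    2017L,    35L,
  "BETA",    2018L,    30L,
  "GAMMA",   2016L,    50L,
  "GAMMA",   2017L,    12L,  # below the threshold this year
  "GAMMA",   2018L,    45L,
  "DELTA",   2016L,    25L,
  "DELTA",   2018L,    28L   # no row for 2017, as for a name given fewer than 3 times
)

years      <- 2016:2018  # stands in for 1970:2018
min_births <- 20         # same threshold as in the article
n_top      <- 1          # stands in for the top 20

# Names given at least `min_births` times in every year of the period
common <- toy %>%
  filter(annais %in% years, nombre >= min_births) %>%
  count(preusuel, name = "n_years_ok") %>%
  filter(n_years_ok == length(years)) %>%
  pull(preusuel)

# Names that were in the yearly top `n_top` at least once
trendy <- toy %>%
  group_by(annais) %>%
  slice_max(nombre, n = n_top, with_ties = FALSE) %>%
  pull(preusuel) %>%
  unique()

# Common but not trendy
candidates <- setdiff(common, trendy)
candidates
```

Here GAMMA and DELTA fail the "every year" rule, ALPHA is the top name of each year, and only BETA remains.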

A first name that is not also a French word

For this part, I used the list of French words from FREELANG. The problem is that this list also contains first names, but they always start with a capital letter, so the plan is to remove every word beginning with a capital letter from the list. Thank you, regular expressions! The code below first extracts those capitalised words to see what would be removed.

list_french_words <- read.table(
  "liste_francais.txt"
) %>% 
  filter(
    str_detect(V1, "^[:upper:]") # the capitalised words, i.e. those that would be removed
  )

list_french_words %>% 
  glimpse()
## Rows: 1,437
## Columns: 1
## $ V1 <chr> "Aaron", "Abdel", "Abidjan", "Abyssin", "Abyssine", "Abyssinie",...

This idea does not work well, because it would also remove the place names from the word list, and I don’t want that…
After a quick look at the list, I realized that only Ange (Angel) and Martial (Martial) would be affected, so I will leave this part out and remove only those two.
Admittedly, this criterion affects the names given to girls more than those given to boys…

list_final_first_name <- list_common_first_name %>% 
  anti_join(
    tibble(
      preusuel = c("ANGE", "MARTIAL")
    )
  )
## Joining, by = "preusuel"
list_final_first_name %>% 
  glimpse()
## Rows: 8,575
## Columns: 4
## $ preusuel <chr> "ABDALLAH", "ABDALLAH", "ABDALLAH", "ABDALLAH", "ABDALLAH"...
## $ annais   <int> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979...
## $ nombre   <int> 68, 57, 52, 68, 51, 64, 52, 48, 34, 38, 49, 49, 67, 43, 55...
## $ initiale <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"...
list_final_first_name %>% 
  distinct(preusuel) %>% 
  nrow()
## [1] 175
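
For the record, here is a sketch of what the word-list filter could have looked like. It assumes that ordinary words are written in lower case and proper nouns with a capital letter, as described above; the words and candidate names are made up and do not come from the FREELANG file.

```r
library(stringr)

# Made-up extract of a word list: ordinary words in lower case, proper nouns capitalised
words <- c("ange", "martial", "pierre", "Pierre", "Paris", "Aaron", "table")

# Made-up candidate first names, upper-cased as in the INSEE file
candidates <- c("ANGE", "AARON", "MARTIAL", "ARMEL")

# Keep only the entries that start with a lower-case letter (the ordinary words)
common_words <- words[str_detect(words, "^[[:lower:]]")]

# A candidate is kept when its lower-case form is not an ordinary word
kept <- candidates[!str_to_lower(candidates) %in% common_words]
kept
```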

Conclusion

I end up with a list of 175 first names, which is far fewer than the 15,304 first names given to boys since 1900 ^^
Will it help us decide? Not sure, but at least I will have tried!

One last idea: why not choose by initial?

For that, a small function plus Plotly, which I love, and hop, we just have to choose!

library(plotly)
## Warning: package 'plotly' was built under R version 4.0.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# function to visualize the evolution of the first names concerned since the 70s
graph_first_name_by_initial_letter <- function(initial_letters){
  return(
    list_final_first_name %>% 
      filter(initiale %in% initial_letters) %>% 
      ggplot() +
      aes(x = annais, y = nombre, colour = preusuel) +
      geom_line() +
      theme_classic() +
      ggtitle(
        paste(initial_letters, collapse = ", ")
        )
  )
} 
  
# summary by initial letter
list_final_first_name %>% 
  distinct(preusuel, initiale) %>% 
  count(initiale)
##    initiale  n
## 1         A 25
## 2         B  6
## 3         C  6
## 4         D 10
## 5         E  8
## 6         F  6
## 7         G 11
## 8         H  7
## 9         I  2
## 10        J 15
## 11        K  4
## 12        L  9
## 13        M 24
## 14        N  5
## 15        O  1
## 16        P  4
## 17        R  7
## 18        T  7
## 19        V  6
## 20        W  2
## 21        X  1
## 22        Y  9
graph <- list(
  "A",
  "B",
  "C",
  "D",
  c("E", "É"),
  "F",
  "G",
  c("H", "I"),
  "J",
  "K",
  "L",
  "M",
  c("N", "O"),
  c("P", "R"),
  "T",
  c("V", "W", "X"),
  "Y"
) %>% 
  map(graph_first_name_by_initial_letter)


graph[[1]] %>% # A
  ggplotly()
graph[[2]] %>% # B
  ggplotly()
graph[[12]] %>% # M
  ggplotly()
graph[[13]] %>% # N and O
  ggplotly()
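
Picking a plot by its position in the list is easy to get wrong when the list changes. A possible alternative is to name each element after its initials. The sketch below is self-contained: `describe_group()` is a hypothetical stand-in for graph_first_name_by_initial_letter() so that it runs without the data, and the list of groups is shortened.

```r
library(purrr)

# Shortened version of the groups above; each element is one plot's initials
groups <- list("A", c("H", "I"), c("N", "O"))

# Stand-in for graph_first_name_by_initial_letter(), so the sketch runs on its own
describe_group <- function(initial_letters) {
  paste("plot for", paste(initial_letters, collapse = ", "))
}

# Name each element after its initials, e.g. "NO" for c("N", "O")
named_graphs <- groups %>%
  set_names(map_chr(groups, paste, collapse = "")) %>%
  map(describe_group)

named_graphs[["NO"]]
```

With the real function, `named_graphs[["NO"]] %>% ggplotly()` would then show the plot for N and O whatever its position in the list.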