R - Nested list to (wide) dataframe

I currently have the following problem: I extracted some data via the Crunchbase API, which gave me a large nested list with the following structure (it contains many more nested lists at several levels; I show only the part that is currently relevant to me):
> str(x[[1]])
$ uuid : chr "5f9957b0841251e6e439d757XXXXXX"
$ relationships: List of 27
..$ websites: List of 3
.. ..$ cardinality: chr "OneToMany"
.. ..$ items :'data.frame': 4 obs. of 7 variables:
.. .. ..$ properties.website_type: chr [1:4] "homepage" "facebook" "twitter" "linkedin"
.. .. ..$ properties.url : chr [1:4] "http://www.example.com" "https://www.facebook.com/example" "http://twitter.com/example" "http://www.linkedin.com/company/example"
Consider the following minimal example:
x <- list()
x[[1]] <- list(uuid = "123",
relationships = list(websites = list(items = list(
properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) ) ) )
x[[2]] <- list(uuid = "987",
relationships = list(websites = list(items = list(
properties.website_type = c("homepage", "facebook", "twitter" ),
properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) ) ) )
Now, I would like to create a dataframe with the following column structure:
> x.df
  uuid          web.url  web.facebook        web.twitter        web.linkedin
1  123 www.example1.com www.fbex1.com www.twitterex1.com www.linkedinex1.com
2  987 www.example2.com www.fbex2.com www.twitterex2.com                <NA>
Meaning: I would like to have every uuid (a unique firm identifier) in a single column, followed by the urls of the different platforms (fb, twitter, ...). I tried many different combinations of lapply(), spread(), and bind_rows(), yet didn't manage to make anything work. Any help with this would be appreciated.
Done. I added a downloadable link for a few datapoints.
– Daniel S. Hain
Jun 25 at 15:27
Please make a minimal example instead of linking to a 1000-line file; the link may break at any time. See how to make a reproducible example.
– Calum You
Jun 25 at 21:40
Done. Hope now it is clear.
– Daniel S. Hain
Jun 26 at 7:02
1 Answer
A dplyr approach could be:
library(dplyr)
library(tidyr)

# convert the list to a data frame in long format
df <- do.call(rbind, lapply(x, data.frame, stringsAsFactors = FALSE))

# final result: one column per website type
df1 <- df %>%
  spread(relationships.websites.items.properties.website_type,
         relationships.websites.items.properties.url)
which gives
  uuid      facebook         homepage            linkedin            twitter
1  123 www.fbex1.com www.example1.com www.linkedinex1.com www.twitterex1.com
2  987 www.fbex2.com www.example2.com                <NA> www.twitterex2.com
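In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same reshaping with pivot_wider(), assuming tidyr >= 1.0.0 is installed and using the question's minimal example. The column names type and url and the "web." prefix are illustrative choices, not part of the original answer.

```r
library(tidyr)

# Minimal example from the question
x <- list(
  list(uuid = "123",
       relationships = list(websites = list(items = list(
         properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
         properties.url = c("www.example1.com", "www.fbex1.com",
                            "www.twitterex1.com", "www.linkedinex1.com"))))),
  list(uuid = "987",
       relationships = list(websites = list(items = list(
         properties.website_type = c("homepage", "facebook", "twitter"),
         properties.url = c("www.example2.com", "www.fbex2.com",
                            "www.twitterex2.com")))))
)

# Long format: one row per (uuid, website type); uuid is recycled per element
long <- do.call(rbind, lapply(x, function(el) {
  items <- el$relationships$websites$items
  data.frame(uuid = el$uuid,
             type = items$properties.website_type,
             url  = items$properties.url,
             stringsAsFactors = FALSE)
}))

# Wide format: one column per website type, prefixed with "web."
wide <- pivot_wider(long, names_from = type, values_from = url,
                    names_prefix = "web.")
print(wide)
```

Website types that a firm does not have (here linkedin for uuid 987) come out as NA, as in the spread() result.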
Sample data:
x <- list(structure(list(uuid = "123", relationships = structure(list(
websites = structure(list(items = structure(list(properties.website_type = c("homepage",
"facebook", "twitter", "linkedin"), properties.url = c("www.example1.com",
"www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
)), .Names = c("properties.website_type", "properties.url"
))), .Names = "items")), .Names = "websites")), .Names = c("uuid",
"relationships")), structure(list(uuid = "987", relationships = structure(list(
websites = structure(list(items = structure(list(properties.website_type = c("homepage",
"facebook", "twitter"), properties.url = c("www.example2.com",
"www.fbex2.com", "www.twitterex2.com")), .Names = c("properties.website_type",
"properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid",
"relationships")))
Update: To fix the error below
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0
you need to remove the corrupted elements from the input data, i.e. those where website_type has one value but properties.url is NULL. Run this chunk of code as a pre-processing step before executing the main solution:
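# flag elements whose properties.url is NULL
bad <- sapply(x, function(k) is.null(k$relationships$websites$items$properties.url))
# logical indexing stays correct when nothing is flagged
# (x[-which(bad)] would return an empty list in that case)
x <- x[!bad]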
Sample data to test this pre-processing step:
x <- list(structure(list(uuid = "123", relationships = structure(list(
websites = structure(list(items = structure(list(properties.website_type = c("homepage",
"facebook", "twitter", "linkedin"), properties.url = c("www.example1.com",
"www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
)), .Names = c("properties.website_type", "properties.url"
))), .Names = "items")), .Names = "websites")), .Names = c("uuid",
"relationships")), structure(list(uuid = "987", relationships = structure(list(
websites = structure(list(items = structure(list(properties.website_type = "homepage",
properties.url = NULL), .Names = c("properties.website_type",
"properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid",
"relationships")), structure(list(uuid = "345", relationships = structure(list(
websites = structure(list(items = structure(list(properties.website_type = "homepage",
properties.url = NULL), .Names = c("properties.website_type",
"properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid",
"relationships")))
Great, that generally seems to be what I need. Runs perfectly with the example. However, when I try it with my full dataset, I always get an error message: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0" Any idea what the problem could be?
– Daniel S. Hain
Jun 26 at 15:06
You probably have an element in your data where the number of values in $relationships$websites$items$properties.website_type and $relationships$websites$items$properties.url does not match, which is why data.frame throws this error. So first you need to decide how you want to handle such cases, i.e. website_type is present but url is missing.
– Prem
Jun 26 at 17:34
Indeed, you are probably right about that! I didn't consider that case. If the url is missing, it should ideally be NA.
– Daniel S. Hain
Jun 26 at 20:54
I think you are missing a point here. Consider this example and let me know the desired output -
x <- structure(list(uuid = "123", relationships = structure(list(websites = structure(list( items = structure(list(properties.website_type = c("homepage", "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", "www.fbex1.com", "www.linkedinex1.com")), .Names = c("properties.website_type", "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", "relationships"))
Here twitter has no url in this example and gives the same error.
– Prem
Jun 27 at 7:06
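The case discussed in these comments, where a url is missing and should become NA, could be sketched as follows. This is only an illustration under a strong assumption: properties.url is either NULL or missing values at the end, so that padding with NA keeps types and urls aligned. In the example just above, where the twitter url is missing from the middle, padding would silently pair the linkedin url with twitter, so such elements still need manual handling. The helper name pad_urls is hypothetical.

```r
library(tidyr)

# Test data: the second element has a website type but a NULL url
x <- list(
  list(uuid = "123",
       relationships = list(websites = list(items = list(
         properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
         properties.url = c("www.example1.com", "www.fbex1.com",
                            "www.twitterex1.com", "www.linkedinex1.com"))))),
  list(uuid = "987",
       relationships = list(websites = list(items = list(
         properties.website_type = "homepage",
         properties.url = NULL))))
)

# Hypothetical helper: build one long-format data frame per element,
# padding the urls with NA up to the number of website types.
# ASSUMPTION: any missing urls are trailing; see the caveat above.
pad_urls <- function(el) {
  items <- el$relationships$websites$items
  types <- items$properties.website_type
  urls  <- items$properties.url
  if (is.null(urls)) urls <- character(0)
  length(urls) <- length(types)  # extends with NA (or truncates if longer)
  data.frame(uuid = el$uuid, type = types, url = urls,
             stringsAsFactors = FALSE)
}

long <- do.call(rbind, lapply(x, pad_urls))
wide <- spread(long, type, url)
print(wide)
```

With this approach uuid 987 is kept in the result with NA in every website column instead of being dropped.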
Alright, it works! Thank you so much!
– Daniel S. Hain
Jul 9 at 16:48
Please provide a sample of your data using dput.
– docendo discimus
Jun 25 at 10:11