We gathered data on nearly 7,000 companies from the Crunchbase API, restricting the sample to companies founded before 2011. We then cleaned the data to build indicator features for location, sector, and relevant information about each company's founders. We also had access to data on each company's funding rounds, acquisitions, and IPOs.
Our goal was to define what constitutes a successful company based on some of these metrics, and then use the remaining features to predict success accurately. Because the analysis was meant to inform whether to fund a startup, we also looked at the problem through the lens of profit and ROC curves: funding a company that ultimately fails carries a very different cost from passing on a company that ends up succeeding.
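To make the profit-curve idea concrete, here is a minimal sketch of how such a curve could be computed from predicted scores. The function name `profit_curve`, the payoff values, and the toy data are all assumptions for illustration; they are not the figures or the code used in our analysis.

```python
def profit_curve(scores, labels, benefit_tp, cost_fp, cost_fn=0.0):
    """Return (threshold, total_profit) pairs, one per distinct score.

    A company is "funded" when its predicted score is >= threshold.
    benefit_tp: assumed payoff from funding a company that succeeds.
    cost_fp:    assumed loss from funding a company that fails.
    cost_fn:    assumed opportunity cost of passing on a company that succeeds.
    """
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = fp = fn = 0
        for score, label in zip(scores, labels):
            funded = score >= threshold
            if funded and label == 1:
                tp += 1
            elif funded and label == 0:
                fp += 1
            elif not funded and label == 1:
                fn += 1
        profit = tp * benefit_tp - fp * cost_fp - fn * cost_fn
        points.append((threshold, profit))
    return points


# Toy data (hypothetical): predicted success scores and true outcomes.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]

curve = profit_curve(scores, labels, benefit_tp=10.0, cost_fp=3.0, cost_fn=2.0)
best_threshold, best_profit = max(curve, key=lambda point: point[1])
print(best_threshold, best_profit)
```

Because the two kinds of error are priced differently, the threshold that maximizes profit need not be the one that maximizes accuracy, which is why we looked at profit curves alongside ROC curves.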
Finally, we wanted to approach the data from an entirely different direction: seeing whether we could cluster startups together using a similarity measure. We used this to see which companies were connected, and tried to glean useful information from those connections.
Crunchbase has both a REST API and an Excel spreadsheet database. We used the spreadsheet to inform our REST API calls. When we first looked at the data through the REST API, it was clear that many companies had very sparse data. The data was also highly fragmented: gathering a single organization required many separate API calls. The spreadsheet, on the other hand, covered only a subset of the companies but gave a much better overall picture of them. We scraped the spreadsheet, producing a list of 25,171 eligible companies that were funded before 2011.
The Crunchbase API is organized into entities, each of which has connections to other entities. After looking through the API, we created a map from each entity to the attributes we wanted to parse. This was not as smooth a process as we had hoped. We were concerned with 8 entities: Person, Address, Market, Product, Organization, FundingRound, Acquisition, and IPO. Each entity has attributes we care about (biographies, specifics on funding, etc.) as well as relationships to other entities (competitors, categories, investors, etc.). A single organization could require as many as 15 API calls. We also noticed that after our automated queries had been running for a while, Crunchbase would temporarily disable our API key. We had to set up our scripts to catch those lockouts and sleep for a specified amount of time before continuing. Altogether, we were able to gather 6,852 companies over the course of three days.
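The catch-and-sleep behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual script: `RateLimitError` is a hypothetical exception that the caller's own `fetch` function would raise when the key is throttled, and the wait time and attempt count are placeholder values.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised by `fetch` when the API key is throttled."""


def fetch_with_retry(fetch, url, max_attempts=5, wait_seconds=60.0, sleep=time.sleep):
    """Call fetch(url); on a rate-limit error, sleep and try again.

    `fetch` is any callable supplied by the caller (an assumption here).
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except RateLimitError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(wait_seconds)
```

A fixed wait is the simplest option; increasing the wait after each consecutive failure would be a natural refinement if lockouts last longer than expected.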
Using dimensionality reduction, we were able to measure the similarity between two companies based on the features they have in common. We identified connected components, or isolated clusters, of companies that all shared a certain level of similarity. Below, we assembled a node graph from the largest connected component (360 companies). Each blue node represents a company, and the edge values are the distances between companies: the closer two points are, the more similar the companies. This tool could be used to help identify a startup's competitors. Check it out: you could try to find Yahoo and Baidu, which are linked together, or Salesforce and SurveyMonkey, which share a common neighbor!
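For readers curious how connected components fall out of a similarity measure, here is a minimal sketch. It assumes each company has already been reduced to a short numeric vector, uses Euclidean distance, and links two companies when their distance is at or below a chosen cutoff; the distance metric, the cutoff, and the company names are all placeholders rather than what we actually used.

```python
import math
from collections import deque


def connected_components(points, max_distance):
    """Group items into connected components of a thresholded distance graph.

    points: dict mapping a company name to its reduced feature vector.
    Two companies are linked when their Euclidean distance <= max_distance.
    Returns a list of components (sorted name lists), largest first.
    """
    names = list(points)
    neighbors = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(points[a], points[b]) <= max_distance:
                neighbors[a].append(b)
                neighbors[b].append(a)

    seen = set()
    components = []
    for start in names:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Hypothetical companies in a 2-D reduced feature space.
points = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (2.0, 0.0),
    "D": (10.0, 10.0),
    "E": (10.0, 11.0),
}
components = connected_components(points, max_distance=1.5)
print(components)
```

In this toy example A and C are not directly linked, but they land in the same component because they share the neighbor B, the same kind of indirect connection as the Salesforce and SurveyMonkey case mentioned above.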