Perceptron Ventures

Using Crunchbase data to predict the success of start-ups.

The project

We gathered data for nearly 7000 companies from the Crunchbase API, limiting our data gathering to companies founded prior to 2011. We then cleaned the data to form indicators based on location, sector, and relevant information about each company's founders. We also had access to data on each company's funding rounds, acquisitions, and IPOs.

Our goal was to come up with a definition of what constitutes a successful company based on some of these metrics, and to use the remaining features to predict success accurately. Since we were analyzing these companies as if deciding whether to fund them, we also looked at the problem through the lens of profit and ROC curves: funding a company that is ultimately unsuccessful has a very different cost from missing a company that winds up successful.

Finally, we wanted to approach the data again from an entirely different direction: seeing whether we could cluster certain startups together using a similarity measure. We used this to see which companies were connected, and tried to glean useful information from those connections.

The data

Excel Database

Crunchbase has both a REST API and an Excel spreadsheet database. We used the Excel spreadsheet to inform our REST API calls. When we first looked at the data through the REST API, it was clear that many companies had very sparse data. The data was also extremely decentralized: gathering a single organization required many different API calls. The Excel spreadsheet, on the other hand, while covering only a subset of the companies, provided a much better overall picture of each company. We parsed the Excel spreadsheet and came up with a list of 25171 eligible companies that were funded before 2011.
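
As a rough sketch of this filtering step, the selection looked something like the snippet below. The file name, sheet name, and column names are placeholders for illustration, not the actual schema of the Crunchbase export.

```python
import pandas as pd

# Load the Crunchbase spreadsheet export.
# NOTE: the file name, sheet name, and column names ("founded_at", "permalink")
# are illustrative assumptions; the real export schema may differ.
companies = pd.read_excel("crunchbase_export.xlsx", sheet_name="Companies")

# Parse the dates and keep only companies dating from before 2011.
companies["founded_at"] = pd.to_datetime(companies["founded_at"], errors="coerce")
eligible = companies[companies["founded_at"] < "2011-01-01"]

# The permalinks are what we later feed into the REST API calls.
permalinks = eligible["permalink"].dropna().tolist()
print(f"{len(permalinks)} eligible companies")
```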

REST API

The Crunchbase API is organized into entities, and each entity has connections to other entities. After looking through the API, we created a map from entities to the relevant attributes we wanted to parse. This was not as smooth a process as we had hoped. We were concerned with 8 different entities: Person, Address, Market, Product, Organization, FundingRound, Acquisition, and IPO. Each entity has attributes we care about (biographies, specifics on funding, etc.) as well as relationships with other entities (competitors, categories, investors, etc.). A single organization could require as many as 15 API calls. We also noticed that after we had run automated queries for a certain period of time, Crunchbase would temporarily disable our API key, so we set up our scripts to catch those failures and sleep for a specified amount of time before retrying. Altogether, we were able to gather 6852 companies over the course of three days.
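
Below is a minimal sketch of the kind of throttling-aware fetch loop we mean. The base URL, endpoint path, and user_key parameter reflect the v2 Crunchbase API as we remember it and may differ from the current API; the key and the company permalink are placeholders.

```python
import time
import requests

API_KEY = "YOUR_CRUNCHBASE_KEY"                 # placeholder
BASE_URL = "https://api.crunchbase.com/v/2"     # assumed v2 base URL

def fetch(path, params=None, max_retries=5, backoff=60):
    """Fetch one endpoint, sleeping and retrying when the key is throttled."""
    params = dict(params or {}, user_key=API_KEY)
    for attempt in range(1, max_retries + 1):
        resp = requests.get(f"{BASE_URL}/{path}", params=params)
        if resp.status_code == 200:
            return resp.json()
        # Key temporarily disabled or rate-limited: back off and try again.
        time.sleep(backoff * attempt)
    raise RuntimeError(f"Giving up on {path} after {max_retries} attempts")

# Example: one organization, followed by separate calls for its funding
# rounds, acquisitions, people, etc. (up to ~15 calls per company).
org = fetch("organization/example-company")   # placeholder permalink
```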

Chart: Companies that have raised at least a chosen amount, by sector

Chart: Company Success Rate by Industry

Chart: Company Success Rate by City

  • What's a successful company?

    Criteria for a successful company: acquired (1072 companies), went through an IPO (502), or raised at least 4 rounds of funding (1001). Counting each company only once when it meets more than one criterion, 2153 of our 6852 companies qualify, a 31.4% success rate.

    Crunchbase is already a selective group of startups, so a couple of funding rounds is not uncommon.

    Chart: Funding Rounds
  • Baselines

    After splitting our data set into training, test, and validation groups, our all-fails baseline for the test set has a 66.47% accuracy.

    Accuracy isn't the only thing we care about; we also want profit. We assume we invest $50K for a 2.5% equity stake in a company. Given the information in Crunchbase, we take the value of a company to be the total funding it receives, and since total funding is highly skewed to the right, we use the median total funding as our "average successful company": this turns out to be $12 million! A 2.5% stake in such a company is worth $300K, so a correct investment nets $300K - $50K = $250K, missing a success costs us the $300K stake we could have had, and funding a failure loses our $50K investment. Our utility matrix is:

                Pred Neg    Pred Pos
    Obs Neg        0         -50K
    Obs Pos      -300K        250K

    Using this matrix, our baseline in terms of profit is actually the all-success baseline: $50,583.09. A sketch of this profit computation appears after this list.

  • Accuracies

    We tried 5 different models, cross-validating over hyperparameters for each, and built an ensemble over the best results from them (a sketch of the ensembling appears after this list). Here are our accuracies:

    • K-Nearest Neighbors (n=30): 72.28%
    • Logistic Regression (5-fold CV, L2 regularization): 72.89%
    • SVM (RBF kernel): 71.14%
    • Naive Bayes (50 top words, then drop top 40): 63.75%
    • Random Forest (20 trees): 74.8%
    • Ensemble: 74.93%
  • Not So Fast

    Charts: ROC and Profit curves

    Our ROC and profit curves aren't great: our predictors don't start accepting successful companies until the probability threshold drops essentially to 0, which is not the mark of a good model. Predicting startups is not easy!
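
To make the profit baseline and profit curves above concrete, here is a minimal sketch of computing total profit from predicted probabilities and the utility matrix. The data here is synthetic and the function is an illustration of the idea, not our exact pipeline.

```python
import numpy as np

# Utility matrix from the Baselines section (rows: observed, cols: predicted).
UTILITY = np.array([[       0,  -50_000],    # observed failure
                    [-300_000,  250_000]])   # observed success

def total_profit(y_true, y_prob, threshold):
    """Profit from investing in every company scored at or above `threshold`."""
    y_pred = (y_prob >= threshold).astype(int)
    profit = 0
    for obs in (0, 1):
        for pred in (0, 1):
            count = np.sum((y_true == obs) & (y_pred == pred))
            profit += count * UTILITY[obs, pred]
    return profit

# Sweeping the threshold from 1 down to 0 traces out a profit curve.
rng = np.random.default_rng(0)                 # synthetic stand-in data
y_true = rng.integers(0, 2, size=1000)
y_prob = rng.random(1000)
curve = [(t, total_profit(y_true, y_prob, t)) for t in np.linspace(0, 1, 11)]
```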

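The accuracies above came from cross-validated models combined into an ensemble; a rough scikit-learn sketch of that setup is below. The synthetic dataset and the exact estimator settings are placeholders, and the Naive Bayes text model is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the Crunchbase feature matrix and success labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = [
    ("knn", KNeighborsClassifier(n_neighbors=30)),
    ("logreg", LogisticRegressionCV(cv=5, penalty="l2")),
    ("svm", SVC(kernel="rbf", probability=True)),
    ("rf", RandomForestClassifier(n_estimators=20)),
]

# Soft-voting ensemble over the individual models.
ensemble = VotingClassifier(estimators=models, voting="soft")

for name, model in models + [("ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f}")
```
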
The Top 100 Companies worldwide by funding raised

Similarity Measures


Using dimensionality reduction, we measured the similarity between two companies based on the commonality of their features. We then identified connected components, or isolated clusters, of companies that all share a certain level of similarity. Below, we assembled a node graph from the largest connected component (360 companies). Each blue node represents a company, and the edge values are the distances between companies: the closer two points are, the more similar they are. This tool could be used to help identify a start-up's competitors. Check it out -- you could try to find Yahoo and Baidu, which are linked together, or Salesforce and SurveyMonkey, which share a common neighbor!
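
As an illustration of this approach (the feature construction, distance metric, and similarity threshold below are assumptions), pairwise distances after dimensionality reduction can be turned into a graph whose connected components form the clusters:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the company feature matrix (location/sector/founder indicators).
X, _ = make_blobs(n_samples=200, n_features=30, centers=5, random_state=0)

# Reduce dimensionality, then compute pairwise Euclidean distances.
X_reduced = PCA(n_components=10).fit_transform(X)
dist = squareform(pdist(X_reduced))

# Connect two companies when their distance falls below a chosen threshold.
THRESHOLD = np.percentile(dist, 5)   # assumption: keep only the closest 5% of pairs
adjacency = (dist < THRESHOLD) & ~np.eye(len(dist), dtype=bool)

# Connected components of this graph are the clusters of similar companies.
n_components, labels = connected_components(adjacency.astype(int), directed=False)
print(f"{n_components} connected components")
```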

Team

  • Neil Chainani

    Come to Neil for advice on optimizing your ping pong swing.

    Github: nchainan

  • Nicolas Drizard

    A modern-day Cupid, Nico can make anyone fall in love with his code.

    Github: nicodri

  • Avery Faller

    Avery has the unique ability to create charts out of thin air.

    Github: averyfaller

  • Charles Liu

    Master of the Jedi arts, and somehow, data science.

    Github: chuckyouliu

Thanks to Crunchbase and the instructors of AC209 for helping us out.