Forming intermediate- and high-mass stars have a large impact on the interstellar medium and on nearby star-forming regions. Historically, the study of the general properties of intermediate- to high-mass pre-main sequence stars has been hampered by the lack of a well-defined, homogeneous sample, and by the fact that few sources were known, most of them discovered serendipitously. As a consequence, work on many open problems in high-mass star formation suffers from biases and incompleteness, and we know much less about these sources than about their lower-mass T Tauri counterparts. By applying machine learning techniques to Gaia data, we have constructed a large and homogeneous catalogue of 2226 new intermediate- to high-mass pre-main sequence stars, increasing the number of known objects of this class by an order of magnitude. This unique list of new massive forming stars is an excellent dataset for research on several open problems in high-mass star formation.
To exemplify this, I present the results of a spectroscopic survey that targeted 145 stars from this catalogue. These observations allowed us to derive accretion rates and to study which accretion mechanism predominates in different stellar mass ranges. I provide further evidence that the transition from magnetospheric accretion to boundary layer accretion occurs at around 4 solar masses. I conclude the talk with a Gaia-based analysis of the clustering properties of massive pre-main sequence stars, using the largest sample ever considered, and discuss the fraction of massive pre-main sequence stars found in clusters as a function of stellar mass. One long-standing problem in high-mass star formation is whether massive stars can form in isolation. As the catalogue sources were selected in a location-independent fashion, they provide a unique perspective on this problem.