Yabei Xu 1, Qingzhao Chu 1, Dongping Chen 1* and Andrés Fuentes 2

1 State Key Lab of Explosion Science and Technology, Beijing Institute of Technology, Beijing, China
2 Departamento de Industrias, Universidad Técnica Federico Santa María, Valparaíso, Chile

A large number of PAH molecules were collected from recent literature. Their HOMO-LUMO gaps were computed at the B3LYP/6-311+G(d,p) level of theory. The gap values lie in the range of 0.64–6.59 eV. The gaps of all PAH molecules exhibit a size dependency to some extent; however, the values may vary widely even at the same size because of the complexity of the molecular structure. All collected PAHs are further classified into seven groups according to structural features, including the types of functional groups and the molecular planarity. The impact of functional groups, including –OH, –CHO, –COOH, =O, –O– and –CnHm, on the bandgap is discussed in detail. Substitution with a ketone group produces the greatest reduction in the HOMO-LUMO gap of PAH molecules. Besides functional groups, a detailed analysis of featured PAHs with unexpectedly low and high gap values shows that both the local structure and the position of five-membered rings critically affect the bandgap. Among all these factors, the five-membered rings that form nonplanar PAHs affect the gap most. Furthermore, we developed a machine learning model to predict the HOMO-LUMO gaps of PAHs; its average absolute error is only 0.19 eV compared with the DFT calculations. The excellent performance of the machine learning model provides an accurate and efficient way to explore the band information of PAHs in soot formation.

Polycyclic aromatic hydrocarbons (PAHs) generated from the incomplete combustion of hydrocarbon fuels are accepted as the precursors of soot. Identifying PAHs and their structures is critical to interpreting their growth mechanism, which is the basis for reducing soot emissions (Wang and Chung, 2019). With the application of novel measurement methods, researchers have recently made important progress in investigating the key processes of soot formation by identifying potential intermediates. Johansson and coworkers proposed a radical-assisted PAH growth mechanism supported by aerosol mass spectrometry measurements (Johansson et al., 2018). Schulz et al. (2019) used state-of-the-art atomic force microscopy (AFM) to identify the detailed configurations of large PAHs (>300 amu). Although the abundance of these identified PAHs and their relevance to soot formation are unknown, this was the first time that the configurations of large PAHs were confirmed in measurements. Commodo et al. (2019) further studied the early formation stage of different soot samples by AFM/SEM, providing direct evidence of the formation of cross-linked structures. Adamson et al. (2018) detected aliphatically bridged multi-core PAHs by atmospheric-sampling high-resolution tandem mass spectrometry, revealing the presence of alkylated aromatic compounds. These studies show that the main components of soot, e.g., PAHs, are more complicated than expected and that they are affected by many factors, such as size, functional groups, cross-linking, and aliphatic chains (Commodo et al., 2019; Schulz et al., 2019; Gentile et al., 2020; Wang et al., 2021).
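
The two quantities quoted in the abstract, the HOMO-LUMO gap and the average absolute error of a model against DFT, can be made concrete with a minimal Python sketch. It is not the workflow used in this work. It assumes a closed-shell molecule whose orbital energies are given in Hartree, and the function names `homo_lumo_gap` and `mean_absolute_error` are hypothetical helpers introduced here for illustration only.

```python
# Illustrative sketch only: helper names and the closed-shell assumption
# are ours, not taken from the paper.

HARTREE_TO_EV = 27.211386  # energy conversion factor, Hartree to eV


def homo_lumo_gap(orbital_energies_hartree, n_electrons):
    """Return the HOMO-LUMO gap in eV for a closed-shell molecule.

    Each spatial orbital is assumed to hold two electrons, so the HOMO is
    orbital number n_electrons / 2 when orbitals are sorted by energy.
    """
    if n_electrons <= 0 or n_electrons % 2 != 0:
        raise ValueError("closed-shell sketch: need a positive, even electron count")
    energies = sorted(orbital_energies_hartree)
    homo_index = n_electrons // 2 - 1
    lumo_index = homo_index + 1
    if lumo_index >= len(energies):
        raise ValueError("no unoccupied orbital available")
    return (energies[lumo_index] - energies[homo_index]) * HARTREE_TO_EV


def mean_absolute_error(predicted, reference):
    """Average of |predicted - reference| over paired gap values (same units)."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

In this picture, the 0.19 eV figure corresponds to `mean_absolute_error` evaluated with the model predictions as `predicted` and the DFT gaps as `reference`, both in eV.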