Introduction
Geographic atrophy (GA) is an advanced, progressive lesion of non-exudative age-related macular degeneration, also referred to as complete retinal pigment epithelium and outer retinal atrophy1. It is estimated that approximately five million people worldwide suffer from GA, with its prevalence increasing exponentially with age2. GA is typically bilateral3, and the development and enlargement of the lesions result in irreversible loss of visual function. Therefore, accurate segmentation of the lesion area is of paramount importance for preventing progression and guiding subsequent treatment4. Optical coherence tomography (OCT) is a non-invasive and rapid biomedical imaging technology5. It can image biological tissue at the micron level and generate high-resolution three-dimensional cross-sectional images, and it is widely used for quantitative analysis in clinical ophthalmology6,7. Its high resolution enables clear observation of various retinal diseases8, such as macular degeneration and macular holes9, pigment epithelial detachment10, choroidal neovascularization11 and GA1. OCT plays a vital role in the diagnosis and monitoring of retinal diseases12,13,14,15,16. Figure 1 shows an OCT projection image and a B-scan with GA lesions, together with the corresponding B-scan ground truth (GT).
Fig. 1 Example of retinal OCT GA. (a) GA lesion in the projection image, (b) GA lesion in a B-scan, (c) GT corresponding to the lesion in (b).
Previous work has made great efforts to explore traditional methods for GA segmentation, including geometric active contours17 and level set methods18. Niu et al.19 proposed a Chan-Vese model based on local similarity factors to reduce the computational effort. However, such methods cannot fully exploit the high-level features and semantic information in the image, which limits segmentation performance. Most recent GA segmentation methods are deep learning networks20,21,22. Wu et al.20 proposed a method that creates OCT projection images by applying constrained sub-volume projection to 3D OCT data. Patil et al.21 used U-Net to automatically segment geographic atrophy lesions. Spaide et al.22 used two multimodal deep learning networks (U-Net and Y-Net) to automatically segment GA lesions on FAF. However, all of the above methods use only the features of the en-face image and ignore the spatial information in the volumetric data. To alleviate this problem, our method uses a 2D network framework while incorporating ConvLSTM to capture the neighboring information between slices of the volumetric data. In addition, these methods may mis-segment GA edges, where edge pixels have low contrast, and it is difficult for a network to classify such hard samples. To alleviate this problem, we propose a contrastive learning enhancement (CLE) module and select an appropriate sampling strategy to improve the network's ability to classify difficult samples.
Compared with projection images, OCT volume data can provide detailed information on retinal structure. Li et al.23 proposed an image projection network (IPN) to achieve three-dimensional to two-dimensional retinal foveal avascular zone (FAZ) segmentation through unidirectional pooling along the volume projection direction. In addition, IPN-V224 was proposed to enhance horizontal perception capabilities. Morano et al.25 proposed a convolutional neural network (CNN) and self-supervised learning (SSL) method for 3D-to-2D segmentation.
However, the above methods are only suitable for segmenting certain lesions and ignore the fact that clinics generally acquire B-scan data with single line scans and radial scans, which results in little 3D data. In the GA segmentation task, if the network relies only on a small amount of volume data, it may generalize poorly. Meanwhile, these methods use unidirectional pooling to project the image, ignoring multi-scale features and channel information, so the network cannot capture spatial relationships at different scales, which limits the model's ability to understand the overall structure of the image. To address these issues, we propose a novel two-stage image projection segmentation method. While still using volumetric data, we use a large number of B-scan images for pre-training in the first stage, which alleviates the overfitting and poor generalization caused by over-reliance on a small amount of volume data. To address the limitation that a 2D network framework overlooks neighboring information between slices of volumetric data, we introduce ConvLSTM in the second stage; this ensures that the spatial information within the volumetric data is effectively leveraged during segmentation. Meanwhile, we propose an Adaptive Pooling Module (APM) for capturing multi-scale features and channel information while adaptively reducing the feature dimensions. Furthermore, to enhance the network's focus on features along the projection direction during dimensionality reduction, we propose a Projection Attention Module (PAM) that calculates the affinity between pixels in the projection direction, thereby establishing long-range dependencies.
Specifically, we propose a multi-stage Dual-branch Image Projection Network (DIPN) that obtains pre-training weights from a large number of B-scan images during the pre-training stage. In addition, inspired by Liu et al.26, we propose a Projection Attention Module (PAM) to integrate long-range dependencies by calculating the affinity between pairs of pixels on each projection column of the B-scan. An Adaptive Pooling Module (APM) is also proposed, which attends to the channels while extracting and fusing multi-scale features, thus effectively improving feature utilization. Finally, to ensure that the spatial information in the volumetric data is fully utilized during segmentation, we incorporate ConvLSTM to capture the neighborhood information between images in the fine-tuning stage, and we utilize a contrastive learning module to enhance the network's ability to distinguish boundary features.
Methods
Framework of the proposed method
Figure 2 shows how our proposed DIPN segments retinal GA through three stages: pre-training, fine-tuning and inference. We first define the retinal GA segmentation task. In the pre-training phase, training is conducted using the dataset \({D}_{train}^{1}={\left\{\left({X}_{n}^{1},{S}_{n}^{1}\right)\right\}}_{n=1}^{N}\), where \({S}_{n}^{1}\) is the label corresponding to \({X}_{n}^{1}\). The sizes of \({X}_{n}^{1}\in {R}^{C\times H\times W}\) and \({S}_{n}^{1}\in {\left\{\text{0,1}\right\}}^{1\times 1\times W}\) differ because our method reduces dimensionality along the projection direction and finally obtains a line segment. In the fine-tuning phase, the training dataset is \({D}_{train}^{2}={\left\{\left({X}_{n}^{2},{S}_{n}^{2}\right)\right\}}_{n=1}^{N}\), where \({X}_{n}^{2}\in {R}^{C\times L\times H\times W}\) denotes the set of 3D images and \({S}_{n}^{2}\in {\left\{\text{0,1}\right\}}^{1\times L\times 1\times W}\) is the label corresponding to \({X}_{n}^{2}\). We use the test dataset \({D}_{test}={\left\{\left({X}_{m},{S}_{m}\right)\right\}}_{m=1}^{N}\) (\({X}_{m}\in {R}^{C\times L\times H\times W},{S}_{m}\in {\left\{\text{0,1}\right\}}^{1\times L\times 1\times W}\)) for testing.
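To make these input and label shapes concrete, the minimal sketch below (ours, not part of the paper) instantiates dummy tensors with the dimensions defined above; the numeric sizes are assumptions taken from the RGA dataset described later.

import torch

C, H, W, L = 1, 512, 512, 512          # channels, height, width, number of slices (assumed RGA sizes)

# Pre-training stage: a single B-scan and its projected 1-D label.
x_pre = torch.rand(C, H, W)            # X_n^1 in R^{C x H x W}
s_pre = torch.randint(0, 2, (1, 1, W)) # S_n^1 in {0,1}^{1 x 1 x W}

# Fine-tuning / test stage: a full OCT volume and its per-slice projected labels.
x_fine = torch.rand(C, L, H, W)               # X_n^2 in R^{C x L x H x W}
s_fine = torch.randint(0, 2, (1, L, 1, W))    # S_n^2 in {0,1}^{1 x L x 1 x W}

print(x_pre.shape, s_pre.shape, x_fine.shape, s_fine.shape)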
Fig. 2 Flowchart of our proposed method. The data used at each stage are shown in the figure. The first stage uses individual B-scan images for pre-training so that the network learns the feature representation. The second stage uses the complete volume data for training so that the network fully learns and exploits the adjacency information between slices. Finally, testing is performed.
The pre-training stage model is shown in Fig. 3. It consists of an image projection branch (IPB) and a feature complementary branch (FCB), each containing five stages. Corresponding stages are connected: the feature complementary branch passes the feature \({f}_{i}^{APM}(i\in \{1,2,3,4,5\})\) extracted at each stage to the image projection branch, where it is combined with the stage feature \({f}_{i}^{IPB}(i\in \{1,2,3,4,5\})\). The two branch features \({f}^{FB}\) and \({f}^{SB}\) are concatenated, and the contrastive loss \({L}^{CLE}\) and the segmentation loss \({L}^{SEG}\) are then computed by the projection and segmentation heads.
Fig. 3 Pre-training network architecture. The first branch is the image projection branch (IPB), which incorporates the Projection Attention Module (PAM) to focus attention along the projection direction while reducing dimensions. The second branch is our proposed feature complementary branch (FCB), which contains the proposed Adaptive Pooling Module (APM) to ensure feature retention and feature fusion during projection. Where multiple convolutions are applied to a feature, we use a residual structure.
Projection attention module (PAM)
For retinal OCT B-scan images containing GA lesions, the lesions are usually located near the RPE and exhibit significant translucency (as shown in Fig. 1b), so many previous GA segmentation methods segment projection maps or 3D volume data rather than 2D slices. In addition, we observe that in deep learning-based approaches the contribution of pixels varies from region to region when determining pixel labels. Therefore, it is crucial to distinguish these feature representations. In order to build rich contextual relationships on local features, we propose the PAM, inspired by26,27,28. The PAM encodes broader contextual information into local features, thus enhancing their representation. Next, we describe in detail how the PAM works.
As shown in Fig. 4 (PAM), our proposed PAM can process features of different scales at different stages. ⨂ denotes batch matrix multiplication with batch size W. For ease of understanding, we take the first PAM in the image projection branch as an example. Given a feature in \({\mathbb{R}}^{64\times H\times W}\), we first feed it into a convolutional layer to obtain a new feature \(F\in {\mathbb{R}}^{64\times H\times W}\). Then, we feed the feature F into a 1 × 1 convolutional layer to obtain two new feature maps Q and K, where \(\{Q,K\}\in {\mathbb{R}}^{32\times H\times W}\), and permute them to \({\mathbb{R}}^{W\times H\times 32}\) and \({\mathbb{R}}^{W\times 32\times H}\), i.e., batches of W matrices of sizes \(H\times 32\) and \(32\times H\). We then perform a batched matrix multiplication of Q and K and apply a sigmoid to compute the spatial attention map \(Atts\in {\mathbb{R}}^{W\times H\times H}\).
$$\begin{array}{c}{Atts}_{kij}=1-\frac{\text{exp}\left(-{\Sigma }_{c=1}^{32}{Q}_{kic}\cdot {K}_{kcj}\right)}{1 +\text{exp}\left(-{\Sigma }_{c=1}^{32}{Q}_{kic}\cdot {K}_{kcj}\right)}\end{array}$$ (1)
where \(\{k,i,j,c|1\le k\le W,1\le i,j\le H,1\le c\le 32\}\) and \({Atts}_{kij}\) denotes the influence of position \(i\) on position \(j\): the closer the feature representations of two positions, the greater their correlation. \({Q}_{kic}\) denotes the \(i\)-th pixel of the \(k\)-th column in the \(c\)-th feature map, and \(\cdot\) represents element-wise multiplication. Meanwhile, we feed the feature \(F\) into a convolutional layer to generate a new feature map \(V\in {\mathbb{R}}^{32\times H\times W}\) and permute it to \({\mathbb{R}}^{W\times H\times 32}\). We then perform a batched matrix multiplication between \(Atts\) and \(V\), permute the result, and feed it into a convolutional layer to recover the number of channels, giving a feature in \({\mathbb{R}}^{64\times H\times W}\). Finally, we perform an element-wise summation with the feature \(F\) to obtain the final output \(E\in {\mathbb{R}}^{64\times H\times W}\) as follows.
$$\begin{array}{c}E=pReLU\left(Conv\left(Perm\left(Atts\otimes V\right)\right)+F\right)\end{array}$$ (2)
where \(pReLU\) represents the activation function, \(Conv\) and \(Perm\) represent the 1 × 1 convolution and permutation operations respectively, and \(\otimes\) represents matrix multiplication. \(F\) represents the input feature. From Eq. (2), it can be seen that the output feature E at each location is a weighted sum of the features at all locations along its projection column plus the original feature. It therefore has a global context view along the projection direction and selectively aggregates context based on spatial attention. Similar semantic features reinforce each other, improving intra-class compactness and semantic consistency.
Fig. 4 Key components of the framework. The Projection Attention Module (PAM) makes the network focus attention along the projection direction to model dependencies in that direction. The Adaptive Pooling Module (APM) aims to capture multi-scale features and channel information while adaptively reducing the feature dimensions. The Contrastive Learning Enhancement (CLE) module aims to improve feature differentiation between lesions and their context.
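For concreteness, the following PyTorch sketch shows one way to implement the projection-direction attention of Eqs. (1)-(2). It reflects our reading of the text rather than the authors' released code; the 3 × 3 input convolution, the class name PAM and the 64 → 32 channel sizes are assumptions based on the first-stage example above (note that the expression in Eq. (1) is algebraically the sigmoid of the dot product).

import torch
import torch.nn as nn

class PAM(nn.Module):
    def __init__(self, channels=64, reduced=32):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, 3, padding=1)  # produces F (kernel size assumed)
        self.q = nn.Conv2d(channels, reduced, 1)
        self.k = nn.Conv2d(channels, reduced, 1)
        self.v = nn.Conv2d(channels, reduced, 1)
        self.conv_out = nn.Conv2d(reduced, channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):                        # x: (B, C, H, W)
        f = self.conv_in(x)                      # F in Eq. (2)
        q = self.q(f).permute(0, 3, 2, 1)        # (B, W, H, 32): each column becomes a sequence
        k = self.k(f).permute(0, 3, 1, 2)        # (B, W, 32, H)
        atts = torch.sigmoid(q @ k)              # (B, W, H, H), Eq. (1)
        v = self.v(f).permute(0, 3, 2, 1)        # (B, W, H, 32)
        out = (atts @ v).permute(0, 3, 2, 1)     # back to (B, 32, H, W)
        return self.act(self.conv_out(out) + f)  # Eq. (2)

# pam = PAM(); y = pam(torch.rand(2, 64, 128, 64))  # y: (2, 64, 128, 64)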
Adaptive Pooling Module (APM)
Because GA lesions are translucent, their upper and lower boundaries cannot be accurately defined in OCT B-scan images, so most current GA segmentation methods use en-face images and 3D volume data instead of volume slices. However, segmentation using en-face images ignores a large amount of spatial information and also requires a large amount of volume data. In image projection networks, on the other hand, the feature dimension decreases with increasing depth, and many important features may be lost during dimensionality reduction. The loss of this information may prevent the model from effectively capturing subtle features in the image, thus reducing segmentation performance. It is therefore very important to reduce the feature dimensions effectively while retaining important information to the greatest extent. To this end, we propose the Adaptive Pooling Module (APM), whose structure is shown in Fig. 4. APM reduces the dimensionality of the features in the feature complementary branch and feeds them into the image projection branch to complement the features lost during dimensionality reduction. As shown in Fig. 4, the input feature \(I\in {\mathbb{R}}^{C\times H\times W}\) is fed to APM, which outputs the feature \(O\in {\mathbb{R}}^{C\times P\times W}\), where \(P\) is the output feature height (the P of the first APM is 256). First, in order to extract multi-scale features, we split the input feature \(I\) into two parts along the channel dimension and process them through different branches, where each branch receives \({I}_{i}\in {\mathbb{R}}^{\frac{C}{2}\times H\times W}\), \(i=1,2\), and \({C}_{1}\) and \({C}_{2}\) denote 3 × 3 and 5 × 5 convolutions, respectively. The multi-scale features obtained from the two branches are then concatenated along the channel dimension to obtain the overall multi-scale feature map M. The process is shown in the following equation.
$$\begin{array}{c}M= Concat\left({C}_{1}\left({I}_{1}\right),{C}_{2}\left({I}_{2}\right)\right)\end{array}$$ (3)
where \(M\in {\mathbb{R}}^{C\times H\times W}\) contains rich deep semantic information. To attend to channel information as well as spatial information, channel descriptors are obtained by global average pooling over the spatial dimensions of the multi-scale feature M, and the correlations between channels are then captured by convolution. The channel attention \(S\) is defined as follows.
$$\begin{array}{c}S= \sigma \left({Conv}_{2}\left(\delta \left({Conv}_{1}\left(AvgPool\left(M\right)\right)\right)\right)\right)\end{array}$$ (4)
where \(S\in {\mathbb{R}}^{C\times 1\times 1}\), \(\sigma\) represents the sigmoid function, δ represents the ReLU function, and \({Conv}_{1}\) and \({Conv}_{2}\) represent 1 × 1 convolutions. To facilitate the subsequent feature fusion, we reconstruct the multi-scale feature \(M\) as \({M}{\prime}\in {\mathbb{R}}^{C\times (H/P)\times W}\). In addition, some features may be lost during multi-scale feature extraction, so a reconstruction operation is also performed on the input \(I\) to obtain \({I}{\prime}\in {\mathbb{R}}^{C\times (H/P)\times W}\). \({I}{\prime}\) is then reweighted and summed along the projection direction of the features to achieve feature fusion. The final output of the APM can be written in the following form.
$$\begin{array}{c}O= Concat(U\left({I}{\prime}\cdot \left(Softmax\left({M}{\prime}\right)\odot Softmax\left(S\right)\right)\right))\end{array}$$ (5)
where \(Softmax\) is used to obtain attention weights in the projection direction and channel dimension, \(\odot\) denotes broadcast element-wise multiplication and \(\cdot\) represents element-wise multiplication. \(U\) denotes unidirectional pooling of size \(H/P\), after which the P features of size \(C\times 1\times W\) are concatenated along the projection direction. The output O obtained through the above steps fuses multi-scale and channel information, which alleviates to some extent the information loss caused by simple unidirectional pooling. The resulting feature representation is richer and more comprehensive and better reflects the complex structure and semantic information of the input data.
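The following heavily simplified sketch illustrates the spirit of Eqs. (3)-(5) under our own reading of the module: because the exact reshaping of \(M\) and \(I\) to \(C\times (H/P)\times W\) is not fully specified in the text, this version applies the projection-direction and channel weights at full height and then uses adaptive average pooling to reach the target height P; the channel-reduction ratio is also an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class APM(nn.Module):
    def __init__(self, channels, out_height):
        super().__init__()
        half = channels // 2
        self.c1 = nn.Conv2d(half, half, 3, padding=1)      # C1: 3x3 branch, Eq. (3)
        self.c2 = nn.Conv2d(half, half, 5, padding=2)      # C2: 5x5 branch, Eq. (3)
        self.fc1 = nn.Conv2d(channels, channels // 4, 1)   # channel attention, Eq. (4) (reduction assumed)
        self.fc2 = nn.Conv2d(channels // 4, channels, 1)
        self.out_height = out_height                       # P

    def forward(self, x):                                  # x: (B, C, H, W)
        i1, i2 = torch.chunk(x, 2, dim=1)
        m = torch.cat([self.c1(i1), self.c2(i2)], dim=1)   # multi-scale map M
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(F.adaptive_avg_pool2d(m, 1)))))  # S
        w = F.softmax(m, dim=2) * F.softmax(s, dim=1)      # projection-direction x channel weights
        return F.adaptive_avg_pool2d(x * w, (self.out_height, x.shape[-1]))  # O: (B, C, P, W)

# apm = APM(channels=64, out_height=256); o = apm(torch.rand(1, 64, 512, 512))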
Contrastive learning enhancement (CLE)
As can be seen in Fig. 1b, the contrast between the GA and the surrounding noise, and between the borders on both sides of the GA, is low. When dealing with regions whose features are similar to those of GA and insufficiently distinctive, the network may have difficulty distinguishing them, leading to incorrect segmentation. To increase the differentiation between GA features and their context, we propose a contrastive learning strategy29.
The CLE module is shown in Fig. 4. We apply the projection head for contrastive learning after concatenating the two branch features. The final output feature of the network is \({f}^{out}\in {\mathbb{R}}^{C\times 12\times W}\) (\(C=32\)). To calculate the contrastive loss, the projection head maps each pixel in the feature map to 128 dimensions; it consists of a multi-layer perceptron with two 1 × 1 convolutional layers. In supervised contrastive learning, computing the contrastive loss on a single image would lack category diversity, because each of our OCT images contains only one category. To solve this problem, inspired by30,31, we propose a pathology sample library consisting of two parts: a pixel library and a region library. In the pixel library, we maintain a queue for every class. Based on the GT labels, we randomly select \(V=24\) pixels from each category in each image; these pixels are then arranged into a queue of size \(Q=V\times N\) (\(N\): number of images in a batch). This produces a library of samples of size \(2\times Q\times D\) (2 is the number of categories and \(D\) is the 128-dimensional feature embedding). We also incorporate a region library to capture global semantic information: during training, we compute the mean of the pixel embeddings of each category in the image to obtain a D-dimensional global feature vector. The size of the region library is \(2\times N\times D\). Therefore, our sample library size is \(2\times (Q+N)\times D\). Note that the pathology sample library only takes effect during training.
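As an illustration of the library sizes given above, the sketch below (ours; the helper name build_sample_library is hypothetical) gathers V random pixel embeddings per class and one class-mean region embedding per image. It assumes binary labels and that both classes appear in every image.

import torch

def build_sample_library(embeddings, labels, V=24):
    """embeddings: (N, D, H', W') projected pixel features; labels: (N, H', W') in {0, 1}."""
    N = embeddings.shape[0]
    pixel_lib, region_lib = [], []
    for cls in (0, 1):                                    # one queue per class
        pix_cls, reg_cls = [], []
        for n in range(N):
            feats = embeddings[n].flatten(1)              # (D, H'*W')
            mask = (labels[n].flatten() == cls).nonzero().squeeze(1)
            idx = mask[torch.randperm(mask.numel())[:V]]  # V random pixels of this class
            pix_cls.append(feats[:, idx].t())             # (V, D)
            reg_cls.append(feats[:, mask].mean(dim=1))    # class-mean region embedding (D,)
        pixel_lib.append(torch.cat(pix_cls))              # (Q, D) with Q = V * N
        region_lib.append(torch.stack(reg_cls))           # (N, D)
    return torch.stack(pixel_lib), torch.stack(region_lib)  # (2, Q, D) and (2, N, D)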
After establishing the pathology sample library, we need to design a sampling strategy to select more reliable anchors \(\mathcal{p}\). Previous work30 found that when both positive and negative samples are close to the anchor \(\mathcal{p}\) (the reference point that defines the relative relationship between positive and negative samples), it is difficult to distinguish the negative samples, especially those similar to the anchor. Similarly, when both positive and negative samples are far from the anchor \(\mathcal{p}\), it is difficult to distinguish the positive samples. The specific matching probability formula is as follows.
$$\begin{array}{c}{\rho }_{{\mathcal{p}}^{+/-}}=\frac{\text{exp}\left(\mathcal{p}\cdot {\mathcal{p}}^{+/-}/\uptau \right)}{{\Sigma }_{{\mathcal{p}}{\prime}\in {Z}_{\mathcal{p}}\cup {N}_{\mathcal{p}}}\text{exp}\left(\mathcal{p}\cdot {\mathcal{p}}{\prime}/\uptau \right)}\end{array}$$ (6)
where \(\mathcal{p}\) denotes an anchor in the sample library and \({\mathcal{p}}^{+/-}\) represents a positive sample (for a pixel \(\mathcal{p}\) with its GT class label, the positive samples are the other pixels of the same class) similar to the category of \(\mathcal{p}\), or a dissimilar negative sample, in the sample library. \(\rho \in (\text{0,1})\) is the matching probability, and \(\uptau (\uptau >0)\) is the temperature hyperparameter. \({Z}_{\mathcal{p}}\) represents the set of pixel embeddings of positive samples and \({N}_{\mathcal{p}}\) the set of pixel embeddings of negative samples. As can be seen from Eq. (6), the anchors \(\mathcal{p}\) obtained from different sampling strategies affect the discriminative power of the training samples; a reasonable sampling strategy improves the distinction between positive and negative samples and helps train a more accurate model. We design a mixed sampling strategy in which we sample a total of 240 anchor pixels (120 per class) from the maintained sample library. When collecting positive samples, the hardest 30% of samples are collected first (negative samples whose product with the anchor \(\mathcal{p}\) is close to 1 and, conversely, positive samples whose product with the anchor \(\mathcal{p}\) is close to -1). Then 40% are randomly sampled from projection boundary points, and the remaining 30% are randomly collected from the entire sample library. Negative samples are sampled in the same way as positive samples. After the above is done, we calculate the supervised contrastive loss.
$$\begin{array}{c}{L}_{\mathcal{p}}^{CLE}=\frac{1}{\left|{Z}_{\mathcal{p}}\right|}{\Sigma }_{{\mathcal{p}}^{+}\in {Z}_{\mathcal{p}}}-\text{log}\frac{\text{exp}\left(\mathcal{p}\cdot {\mathcal{p}}^{+}/\uptau \right)}{\text{exp}\left(\mathcal{p}\cdot {\mathcal{p}}^{+}/\uptau \right)+{\Sigma }_{{\mathcal{p}}^{-}\in {N}_{\mathcal{p}}}\text{exp}\left(\mathcal{p}\cdot {\mathcal{p}}^{-}/\uptau \right)}\end{array}$$ (7)
$$\begin{array}{c}{L}^{CLE}=\frac{1}{N}{\Sigma }_{i=1}^{N}{L}_{i}^{CLE}\end{array}$$ (8)
where N is the number of anchors in the training dataset, \({L}_{\mathcal{p}}^{CLE}\) is the contrastive loss for an individual anchor, and \({L}^{CLE}\) is the total contrastive loss.
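A compact sketch of the per-anchor loss in Eq. (7) and its average in Eq. (8) is given below. It is illustrative only: embeddings are assumed to be L2-normalised, the temperature value is an assumption, and the hard/boundary/random anchor sampling described above is omitted.

import torch
import torch.nn.functional as F

def cle_loss(anchors, positives, negatives, tau=0.1):
    """anchors: (A, D); positives: (A, P, D); negatives: (A, M, D); tau > 0."""
    pos_sim = torch.einsum('ad,apd->ap', anchors, positives) / tau   # p . p+ / tau
    neg_sim = torch.einsum('ad,amd->am', anchors, negatives) / tau   # p . p- / tau
    neg_term = torch.logsumexp(neg_sim, dim=1, keepdim=True)         # log sum_m exp(p . p- / tau)
    # -log( exp(pos) / (exp(pos) + sum exp(neg)) ), averaged over positives and anchors
    loss = -pos_sim + torch.logaddexp(pos_sim, neg_term.expand_as(pos_sim))
    return loss.mean()

# a = F.normalize(torch.rand(240, 128), dim=1)
# loss = cle_loss(a, F.normalize(torch.rand(240, 8, 128), dim=-1),
#                    F.normalize(torch.rand(240, 32, 128), dim=-1))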
ConvLSTM-based fine-tuning stage
In the pre-training stage, the network (Fig. 2) learns GA lesion features from a large number of B-scan images. After obtaining the pre-training weights, we introduce the fine-tuning stage in order to exploit the large amount of spatial information contained in the volumetric data.
As shown in Fig. 5, the network structure before the ConvLSTM32 in the fine-tuning stage is the same as the pre-training network structure. After loading the pre-training weights obtained in the pre-training stage, 3D volume data are input for training. Unlike a conventional LSTM, the convolutional LSTM uses the convolution operator * instead of matrix multiplication to preserve the spatial information of long sequences. The entire definition is as follows.
$$\begin{array}{c}{i}_{t}=\sigma \left({X}_{t}*{W}_{xi}+{h}_{t-1}*{W}_{hi}+{c}_{t-1}\circ {W}_{ci}+{b}_{i}\right),\end{array}$$ (9)
$$\begin{array}{c}{f}_{t}=\sigma \left({X}_{t}*{W}_{xf}+{h}_{t-1}*{W}_{hf}+{c}_{t-1}\circ {W}_{cf}+{b}_{f}\right),\end{array}$$ (10)
$$\begin{array}{c}{c}_{t}={c}_{t-1}\circ {f}_{t}+{i}_{t}\circ \text{tanh}\left({X}_{t}*{W}_{xc}+{h}_{t-1}*{W}_{hc}+{b}_{c}\right),\end{array}$$ (11)
$$\begin{array}{c}{o}_{t}=\sigma \left({X}_{t}*{W}_{xo}+{h}_{t-1}*{W}_{ho}+{b}_{o}\right),\end{array}$$ (12)
$$\begin{array}{c}{h}_{t}={o}_{t}\circ \text{tanh}\left({c}_{t}\right).\end{array}$$ (13)
where \(\sigma\) is the sigmoid function, \(*\) denotes the convolution operation, and \(\text{tanh}\) is the hyperbolic tangent function. The input gate \({i}_{t}\), forget gate \({f}_{t}\) and output gate \({o}_{t}\) are the three gates of the cell. \({b}_{i}\), \({b}_{f}\), \({b}_{c}\) and \({b}_{o}\) are the bias terms, while \({X}_{t}\), \({c}_{t}\), and \({h}_{t}\) are the input, cell, and hidden states at time \(t\). W represents a weight matrix; for example, \({W}_{hi}\) controls how the input gate draws on the hidden state. \(\circ\) denotes the Hadamard product.
Fig. 5 Flowchart of the fine-tuning phase. Volumetric data are input, the data slices are segmented, and ConvLSTM is used to capture the adjacency between slices.
Loss function
Because GA segmentation is a pixel-level binary segmentation task, we use the Dice loss \({L}^{Dice}\) and the binary cross-entropy loss \({L}^{BCE}\) to guide model training. Our segmentation loss is calculated as follows.
$$\begin{array}{c}{L}^{Dice}=1-\frac{2\left|P*S\right|}{\left|P\right|+\left|S\right|}\end{array}$$ (14)
$$\begin{array}{c}{L}^{BCE}=-{\Sigma }_{1,w}\left[\left(1-S\right)\text{log}\left(1-P\right)+S\,\text{log}\left(P\right)\right]\end{array}$$ (15)
$$\begin{array}{c}{L}^{SEG}={{\lambda }_{Dice}L}^{Dice}+{{\lambda }_{BCE}L}^{BCE}\end{array}$$ (16)
where \({\lambda }_{Dice}\) and \({\lambda }_{BCE}\) are both set to 0.5, \(P\) and \(S\) represent the predicted segmentation result and the corresponding ground truth, and \((1,w)\) are the coordinates of a pixel on \(P\) and \(S\). As shown in Eq. (16), the Dice loss evaluates the spatial overlap between the ground truth and the predicted GA area, while the binary cross-entropy loss optimizes the model at the pixel level. Finally, the total loss of the proposed DIPN is defined as
$$\begin{array}{c}{L}^{total}={L}^{SEG}+{{\lambda }_{CLE}L}^{CLE}\end{array}$$ (17)
where \({\lambda }_{CLE}\) is set to 1 and \({L}^{CLE}\) is the contrastive learning enhancement loss (excluding the L2 norm), which is not applied in the inference stage.
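The segmentation and total losses of Eqs. (14)-(17) can be sketched as follows; the smoothing constant eps is our addition, P is assumed to already contain sigmoid probabilities, and the weights follow the values stated above (0.5 / 0.5 / 1).

import torch
import torch.nn.functional as F

def seg_loss(p, s, eps=1e-6, w_dice=0.5, w_bce=0.5):
    dice = 1 - (2 * (p * s).sum() + eps) / (p.sum() + s.sum() + eps)  # Eq. (14)
    bce = F.binary_cross_entropy(p, s.float())                        # Eq. (15)
    return w_dice * dice + w_bce * bce                                # Eq. (16)

def total_loss(p, s, l_cle, w_cle=1.0):
    return seg_loss(p, s) + w_cle * l_cle                             # Eq. (17)

# p = torch.sigmoid(torch.randn(4, 1, 1, 512)); s = (torch.rand(4, 1, 1, 512) > 0.5).float()
# loss = total_loss(p, s, l_cle=torch.tensor(0.3))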
Experiments and results
Data sets and processing
The proposed method is evaluated on two datasets. The first is the retinal geographic atrophy dataset (RGA dataset for short) obtained at Wuhan Aier Eye Hospital. It contains 44 OCT volumes and 2823 individual GA B-scan images. Each OCT volume contains 512 B-scans with a resolution of 1024 × 512. All lesions were manually labeled. In our experiments, we used all 2823 individual GA B-scan images for the pre-training phase and performed fivefold cross-validation on 34 OCT volumes in the fine-tuning stage (of the five folds, one contains 6 volumes and the remaining four contain 7 volumes each). We resized the individual B-scan images to 512 \(\times\) 512, and the volumes in the dataset to 512 \(\times\) 512 \(\times\) 512.
To explore the cross-domain generalizability of our method33, the second dataset is the public OCTA500 dataset23,24. It is a multimodal dataset containing two modalities (OCT and OCTA), with subsets of two different fields of view, namely OCTA_6M (No. 10001-10300) and OCTA_3M (No. 10301-10500). The OCTA500 dataset includes 3D FAZ segmentation labels and retinal vessel (RV) segmentation labels. In our experiments, we segment the FAZ region and the RV in OCTA-500. As in previous work34, we select the FAZ and RV data in OCTA_6M to evaluate our method. This subset contains 300 subjects and has a volume size of 400 × 400 × 640. For a fair comparison, following previous work34, the dataset is divided into a training set (No. 10001-10180), a validation set (No. 10181-10200), and a test set (No. 10201-10300). For more details on the OCTA500 dataset, see24. We resized the data to a uniform size of 400 \(\times\) 512 \(\times\) 512. Because separate B-scan images are not available, we skip the pre-training phase and start training from the fine-tuning phase.
Implementation details
The proposed method is implemented in the PyTorch framework, and all experiments are conducted on a single NVIDIA 3090 GPU. The model is trained and tested independently on each dataset. We train the model using the Adam optimizer with an initial learning rate of 1e-5 and momentum parameters \({\beta }_{1}=0.9\) and \({\beta }_{2}=0.999\). In the pre-training phase, 250 epochs are trained with a batch size of 4. The best weights are transferred to the fine-tuning phase, in which 100 epochs are trained. The test set and the training set are kept independent throughout the experiments.
Evaluation metrics
To evaluate the segmentation performance of the different methods, we quantitatively analyze the experimental results using five metrics, as in recent work23: the Jaccard index (Jac), the Dice similarity coefficient (DSC), balanced accuracy (BACC), precision (PRE) and recall (REC), where Jac and DSC are widely used to evaluate segmentation performance35,36,37. Using plain accuracy may lead to overestimation or loss of significance when positive and negative samples are imbalanced, so we use balanced accuracy instead of general accuracy to evaluate the results. The evaluation metrics are defined as follows.
$$\begin{array}{c}Jac=\frac{TP}{TP+FN+FP}\end{array}$$ (18)
$$\begin{array}{c}DSC=\frac{2TP}{2TP+FN+FP}\end{array}$$ (19)
$$\begin{array}{c}BACC=\frac{TPR+TNR}{2}\end{array}$$ (20)
$$\begin{array}{c}PRE=\frac{TP}{TP+FP}\end{array}$$ (21)
$$\begin{array}{c}REC=\frac{TP}{TP+FN}\end{array}$$ (22)
where TP is true positive, TN is true negative, FP is false positive, FN is false negative, TPR is the true positive rate, and TNR is the true negative rate.
Results and analysis
Performance comparison and analysis
In this section, to evaluate the performance of our proposed method for GA segmentation, we compare it with several other state-of-the-art methods on the RGA dataset. In addition, to verify the statistical significance of the differences between our method and the others, we performed a two-tailed paired t-test on the results of the above networks, which is significant at P